Deep multi-channel acoustic modeling

    Publication number: US10726830B1

    Publication date: 2020-07-28

    Application number: US16143910

    Filing date: 2018-09-27

    Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-channel DNN) that takes in raw signals and produces a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. These three models may be jointly optimized for speech processing (as opposed to individually optimized for signal enhancement), enabling improved performance despite a reduction in microphones and a reduction in bandwidth consumption during real-time processing.
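The three-stage front-end described above (multi-channel DNN → feature-extraction DNN → classification DNN) can be sketched as three chained dense layers. All dimensions, weights, and names below are hypothetical stand-ins for illustration, not the patented models:

```python
import random

def linear(x, w):
    """One dense layer: y[i] = sum_j w[i][j] * x[j] (bias omitted for brevity)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 2 raw microphone channels x 4 samples -> 8 inputs,
# a beamforming-like feature vector of size 6, a compressed vector of
# size 3, and scores for 4 acoustic units.
random.seed(0)
W_multichannel = rand_matrix(6, 8)   # first model: raw multi-channel -> features
W_extract      = rand_matrix(3, 6)   # second model: lower-dimensional features
W_classify     = rand_matrix(4, 3)   # third model: acoustic-unit scores

def front_end(raw_channels):
    x = [s for ch in raw_channels for s in ch]   # stack the raw channels
    f1 = linear(x, W_multichannel)               # beamformed-like first features
    f2 = linear(f1, W_extract)                   # reduced second feature vector
    return linear(f2, W_classify)                # acoustic-unit classification

scores = front_end([[0.1, 0.2, -0.1, 0.0], [0.05, 0.15, -0.05, 0.02]])
unit = max(range(len(scores)), key=lambda i: scores[i])
```

In the patent the three stages are trained jointly for the speech-recognition objective; this sketch only shows the data flow, with untrained random weights.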

    ANCHORED SPEECH DETECTION AND SPEECH RECOGNITION

    Publication number: US20200035231A1

    Publication date: 2020-01-30

    Application number: US16437763

    Filing date: 2019-06-11

    Abstract: A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.
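The frame-by-frame labeling described above can be sketched with a mean-pooling encoder and a cosine-similarity test. Both are hypothetical stand-ins: the patent uses an RNN encoder and a trained neural classifier, not pooling and cosine similarity:

```python
import math

def encode(frames):
    """Stand-in for the RNN encoder: mean-pool frames into one reference vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def label_frames(reference_frames, incoming_frames, threshold=0.9):
    """Label each incoming frame as desired speech when it resembles the
    reference speaker (stand-in for the trained classifier)."""
    ref = encode(reference_frames)
    return ['desired' if cosine(f, ref) >= threshold else 'other'
            for f in incoming_frames]

# Wakeword portion serves as reference; one similar frame, one dissimilar.
ref_frames = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4]]
labels = label_frames(ref_frames, [[0.95, 0.05, 0.45], [0.0, 1.0, -0.2]])
```

The resulting per-frame labels are what a downstream ASR component could use to focus on the desired speech.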

    Anchored speech detection and speech recognition

    Publication number: US10373612B2

    Publication date: 2019-08-06

    Application number: US15196228

    Filing date: 2016-06-29

    Abstract: A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.

    Low latency and memory efficient keyword spotting (granted, in force)

    Publication number: US09390708B1

    Publication date: 2016-07-12

    Application number: US13903814

    Filing date: 2013-05-28

    Abstract: Features are disclosed for spotting keywords in utterance audio data without requiring the entire utterance to first be processed. Likelihoods that a portion of the utterance audio data corresponds to the keyword may be compared to likelihoods that the portion corresponds to background audio (e.g., general speech and/or non-speech sounds). The difference in the likelihoods may be determined, and the keyword may be triggered when the difference exceeds a threshold, or shortly thereafter. Traceback information and other data may be stored during the process so that a second speech processing pass may be performed. For efficient management of system memory, traceback information may only be stored for those frames that may encompass a keyword; the traceback information for older frames may be overwritten by traceback information for newer frames.
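The likelihood-difference trigger and the bounded traceback store described above can be sketched with a ring buffer, which overwrites the oldest frames exactly as the abstract describes. The scores, threshold, and buffer size here are hypothetical:

```python
from collections import deque

def spot_keyword(frame_scores, threshold=2.0, traceback_frames=5):
    """frame_scores: iterable of (keyword_loglik, background_loglik) per frame.
    Triggers as soon as the likelihood difference exceeds the threshold,
    keeping traceback info only for the most recent frames."""
    traceback = deque(maxlen=traceback_frames)   # old entries are overwritten
    for t, (kw, bg) in enumerate(frame_scores):
        traceback.append((t, kw, bg))            # store info for a second pass
        if kw - bg > threshold:
            return t, list(traceback)            # trigger frame + stored frames
    return None, list(traceback)

# Keyword likelihood gradually overtakes the background likelihood:
scores = [(0.1, 0.5), (0.8, 0.6), (1.5, 0.7), (3.5, 0.9), (4.0, 1.0)]
trigger, tb = spot_keyword(scores)
```

Because the deque is bounded, memory stays constant no matter how long the utterance runs, while the stored frames remain available for a second speech-processing pass.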

    Acoustic echo cancellation and automatic speech recognition with random noise (granted, in force)

    Publication number: US09286883B1

    Publication date: 2016-03-15

    Application number: US14038319

    Filing date: 2013-09-26

    CPC classification number: G10L21/0208 G10L15/20 G10L2021/02082 H04M9/082

    Abstract: Features are disclosed for performing acoustic echo cancellation using random noise. The output may be used to perform speech recognition. Random noise may be introduced into a reference signal path and into a microphone signal path. The random noise introduced into the microphone signal path may be transformed based on an estimated echo path and then combined with microphone output. The random noise introduced into the reference signal path may be combined with a reference signal and then transformed. In some embodiments, the random noise in the reference signal path may be used in the absence of another reference signal, allowing the acoustic echo canceler to be continuously trained.
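The idea of keeping the canceller trained on injected random noise can be sketched with a plain LMS adaptive filter. The filter length, step size, and echo path below are hypothetical, and LMS is used as a generic stand-in for the patent's echo canceller:

```python
import random

def lms_echo_canceller(reference, mic, taps=4, mu=0.05):
    """Adapt an FIR estimate of the echo path with the LMS rule and return
    the echo-cancelled microphone signal plus the learned taps."""
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est                        # residual after cancellation
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
        out.append(e)
    return out, w

# With no real reference playing, injected random noise keeps the canceller
# adapting; the microphone picks the noise up through a 2-tap "room" echo path.
random.seed(1)
noise = [random.uniform(-1, 1) for _ in range(400)]
mic = [0.5 * noise[n] + (0.25 * noise[n - 1] if n else 0.0)
       for n in range(len(noise))]
residual, w = lms_echo_canceller(noise, mic)
```

After adaptation the learned taps approximate the simulated echo path (0.5, 0.25), so the residual passed on to speech recognition is nearly echo-free.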

    Local speech recognition of frequent utterances (granted, in force)

    Publication number: US09070367B1

    Publication date: 2015-06-30

    Application number: US13684969

    Filing date: 2012-11-26

    CPC classification number: G10L15/187 G10L15/063 G10L15/30

    Abstract: In a distributed automated speech recognition (ASR) system, speech models may be employed on a local device to allow the local device to process frequently spoken utterances while passing other utterances to a remote device for processing. Upon receiving an audio signal, the local device compares the audio signal to the speech models of the frequently spoken utterances to determine whether the audio signal matches one of the speech models. When the audio signal matches one of the speech models, the local device processes the utterance, for example by executing a command. When the audio signal does not match one of the speech models, the local device transmits the audio signal to a second device for ASR processing. This reduces latency and the amount of audio signals that are sent to the second device for ASR processing.
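The local-first dispatch described above can be sketched as a simple lookup with a remote fallback. The matcher and model format here are toy placeholders; the patent matches audio against trained speech models, not exact signal comparison:

```python
def local_or_remote_asr(audio_signal, local_models, match_fn, send_remote):
    """Try the on-device models for frequent utterances first; only fall
    back to the remote ASR service when nothing matches."""
    for utterance, model in local_models.items():
        if match_fn(audio_signal, model):
            return ('local', utterance)          # handled on-device, low latency
    return ('remote', send_remote(audio_signal)) # ship audio to the second device

# Toy stand-ins: a "model" is just the exact expected signal.
local_models = {'turn on the light': [1, 2, 3], 'stop': [4, 5]}
match = lambda sig, model: sig == model
remote = lambda sig: 'remote-transcript'

result = local_or_remote_asr([1, 2, 3], local_models, match, remote)
```

Only unmatched audio is transmitted, which is what reduces both latency and the volume of audio sent to the second device.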

    Deep multi-channel acoustic modeling using multiple microphone array geometries

    Publication number: US11574628B1

    Publication date: 2023-02-07

    Application number: US16368331

    Filing date: 2019-03-28

    Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
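The multi-geometry selection step described above (run branches for several microphone-array geometries, then keep the best output) can be sketched as follows. The branch scoring and geometry names are hypothetical, not the trained multi-geometry DNN:

```python
def select_geometry_output(channels, geometry_branches):
    """Run each geometry-specific branch on the channels it expects and
    keep the output with the highest score."""
    best = None
    for name, (n_channels, branch) in geometry_branches.items():
        if len(channels) < n_channels:
            continue                         # branch needs more mics than we have
        score, features = branch(channels[:n_channels])
        if best is None or score > best[0]:
            best = (score, name, features)
    return best                              # (score, geometry name, features)

# Toy branch: score = summed energy of the channels it consumes.
def make_branch():
    def branch(chs):
        energy = sum(s * s for ch in chs for s in ch)
        return energy, [energy]              # (score, feature vector)
    return branch

branches = {'2-mic': (2, make_branch()), '4-mic': (4, make_branch())}
chosen = select_geometry_output([[1.0], [2.0], [0.5]], branches)
```

With only three input channels the 4-mic branch is skipped, so the 2-mic branch wins; this mirrors how the model can accept a variable number of microphone channels.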

    Device-directed utterance detection

    Publication number: US11551685B2

    Publication date: 2023-01-10

    Application number: US16822744

    Filing date: 2020-03-18

    Abstract: A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device directed classifier, the device may reject the interrupt event and increase a volume of the output audio or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
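The two-stage decision described above (fast interrupt detector ducks the volume; slower device-directed classifier accepts or rejects) can be sketched as a small control flow. The volume factors and return states are hypothetical:

```python
def handle_interrupt(detected, device_directed, volume):
    """Stage 1: the low-latency interrupt detector fired (detected=True),
    so the output volume is lowered immediately. Stage 2: the high-accuracy
    device-directed classifier then accepts or rejects the event."""
    if not detected:
        return volume, 'idle'
    volume *= 0.5                       # fast reaction: duck the output audio
    if device_directed():               # whole-utterance classifier decision
        return 0.0, 'accepted'          # end output audio, process the speech
    return volume * 2, 'rejected'       # false alarm: restore original volume

vol, state = handle_interrupt(True, lambda: True, volume=0.8)
```

Splitting the decision this way gives the low latency of the detector without paying its false-alarm cost, since the classifier can still restore the volume.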

    Deep multi-channel acoustic modeling

    Publication number: US11475881B2

    Publication date: 2022-10-18

    Application number: US16932049

    Filing date: 2020-07-17

    Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-channel DNN) that takes in raw signals and produces a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. These three models may be jointly optimized for speech processing (as opposed to individually optimized for signal enhancement), enabling improved performance despite a reduction in microphones and a reduction in bandwidth consumption during real-time processing.
