Abstract:
In a speech-based system, a wake word or other trigger expression is used to preface user speech that is intended as a command. The system receives multiple directional audio signals, each of which emphasizes sound from a different direction. The signals are monitored and analyzed to detect the directions of interfering audio sources such as televisions or other types of electronic audio players. The directional signal having the strongest presence of speech is selected to be monitored for the trigger expression. If that signal corresponds to the direction of an interfering audio source, a stricter standard is used to detect the trigger expression. In addition, the directional audio signal having the second strongest presence of speech may also be monitored for the trigger expression.
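A minimal Python sketch of the selection logic this abstract describes. The inputs are hypothetical: a per-beam speech-presence score and a set of beam indices previously flagged as pointing at interfering sources; the threshold values are illustrative and not taken from the patent.

    BASE_THRESHOLD = 0.5      # normal wake-word confidence threshold
    STRICT_THRESHOLD = 0.8    # stricter threshold for beams aimed at interferers

    def beams_to_monitor(speech_scores, interferer_beams):
        """Return (beam_index, wake_threshold) pairs to monitor.

        speech_scores    -- speech-presence score per directional signal
        interferer_beams -- indices of beams aimed at interfering sources
        """
        ranked = sorted(range(len(speech_scores)),
                        key=lambda i: speech_scores[i], reverse=True)
        primary = ranked[0]
        monitored = [(primary,
                      STRICT_THRESHOLD if primary in interferer_beams
                      else BASE_THRESHOLD)]
        # Also monitor the beam with the second strongest speech presence
        # when the primary beam points at an interfering source.
        if primary in interferer_beams and len(ranked) > 1:
            secondary = ranked[1]
            monitored.append((secondary,
                              STRICT_THRESHOLD if secondary in interferer_beams
                              else BASE_THRESHOLD))
        return monitored

    # Example: beam 2 has the strongest speech but points at a television.
    print(beams_to_monitor([0.2, 0.4, 0.9, 0.6], interferer_beams={2}))
    # -> [(2, 0.8), (3, 0.5)]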
Abstract:
Features are disclosed for filtering portions of an output audio signal in order to improve automatic speech recognition on an input signal that may include a representation of the output signal. A signal that includes audio content can be received, and a frequency or band of frequencies can be selected to be filtered from that signal. The selected band may correspond to a frequency band that is desirable for speech recognition, so that the presented output does not mask the user's speech within that band. An input signal can then be obtained comprising audio data corresponding to a user utterance and presentation of the filtered output signal. Automatic speech recognition can be performed on the input signal. In some cases, an acoustic model trained for use with such frequency band filtering may be used to perform the speech recognition.
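As a rough illustration of the output-side filtering step, the sketch below removes a band from playback audio with a Butterworth band-stop filter (SciPy). The 300-3000 Hz band edges are an assumption chosen because that range carries much of the speech energy; the abstract does not name specific frequencies.

    import numpy as np
    from scipy.signal import butter, lfilter

    def notch_output_for_asr(output_audio, sample_rate, band=(300.0, 3000.0)):
        """Remove a frequency band from playback audio so that, in the
        microphone signal, that band is dominated by the user's utterance."""
        nyquist = sample_rate / 2.0
        b, a = butter(4, [band[0] / nyquist, band[1] / nyquist],
                      btype="bandstop")
        return lfilter(b, a, output_audio)

    # Example: filter one second of white-noise "playback" at 16 kHz.
    fs = 16000
    filtered = notch_output_for_asr(np.random.randn(fs), fs)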
Abstract:
Embodiments of systems and methods are described for determining weighting coefficients based at least in part on using convex optimization, subject to one or more constraints, to approximate a three-dimensional beampattern. In some implementations, the approximated three-dimensional beampattern comprises a main lobe that includes a look direction for which waveforms detected by a sensor array, such as a microphone array, are not suppressed, and a side lobe that includes other directions for which detected waveforms are suppressed. The one or more constraints can include a constraint that suppression of waveforms received by the sensor array from the side lobe is greater than a threshold. In some implementations, the threshold can depend on at least one of an angular direction of the waveform and a frequency of the waveform.
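A small CVXPY sketch in the spirit of this abstract: beamformer weights are found by convex optimization subject to a distortionless constraint in the look direction and a suppression constraint elsewhere. To stay short it uses a uniform linear array at a single frequency rather than a full three-dimensional beampattern, and every geometry and threshold value is an illustrative assumption.

    import numpy as np
    import cvxpy as cp

    M, d = 8, 0.042                   # mics and spacing (m), ~half wave at 4 kHz
    k = 2 * np.pi * 4000.0 / 343.0    # wavenumber at the design frequency

    def steering(theta):
        """Plane-wave steering vector for arrival angle theta (radians)."""
        return np.exp(1j * k * d * np.arange(M) * np.cos(theta))

    look = np.pi / 2                                      # broadside look direction
    sidelobes = np.deg2rad(np.r_[0:60:13j, 120:180:13j])  # directions to suppress

    w = cp.Variable(M, complex=True)
    constraints = [cp.conj(w) @ steering(look) == 1]      # no suppression at look
    # Require at least ~20 dB suppression (|response| <= 0.1) in the side lobes.
    constraints += [cp.abs(cp.conj(w) @ steering(t)) <= 0.1 for t in sidelobes]
    prob = cp.Problem(cp.Minimize(cp.norm(w, 2)), constraints)  # convex program
    prob.solve()
    print(prob.status, np.round(prob.value, 3))

An angle- or frequency-dependent threshold, as the abstract mentions, would simply replace the constant 0.1 with a function of the direction t and the design frequency.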
Abstract:
An echo path change detector may be used to control the rate of adaptation in an acoustic echo canceller. When an echo path change is declared, the rate of adaptation may be increased. However, an echo path change should not be declared in the presence of double talk, because rapid adaptation during double talk is undesirable. Accordingly, various features are disclosed for detecting echo path changes while avoiding the declaration of such changes in the presence of double talk.
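One common way to build such a detector is with a shadow (background) filter that adapts quickly alongside the foreground canceller. The sketch below shows only the control decision, with a hypothetical interface: ERLE estimates for both filters and a double-talk flag (e.g., from a Geigel or coherence-based detector) are assumed to be computed elsewhere, and all values are illustrative.

    FAST_STEP, SLOW_STEP = 0.5, 0.05   # hypothetical NLMS step sizes

    def choose_step_size(foreground_erle_db, shadow_erle_db, double_talk):
        """Raise the adaptation rate only on a declared echo path change,
        and never declare a change while double talk is detected."""
        path_change = (not double_talk and
                       shadow_erle_db > foreground_erle_db + 6.0)
        return FAST_STEP if path_change else SLOW_STEP

    print(choose_step_size(20.0, 5.0, double_talk=False))  # 0.05: no change
    print(choose_step_size(5.0, 15.0, double_talk=False))  # 0.5: change declared
    print(choose_step_size(5.0, 15.0, double_talk=True))   # 0.05: double talk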
Abstract:
Embodiments of systems and methods are described for determining which of a plurality of beamformed audio signals to select for signal processing. In some embodiments, a plurality of audio input signals are received from a microphone array comprising a plurality of microphones. A plurality of beamformed audio signals are determined based on the plurality of audio input signals, each beamformed audio signal corresponding to a different direction. A plurality of signal features may be determined for each beamformed audio signal. A smoothed feature may be determined for each beamformed audio signal based on at least a portion of the plurality of signal features. The beamformed audio signal corresponding to the maximum smoothed feature may be selected for further processing.
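A minimal sketch of the selection step, assuming frame energy as the per-beam signal feature (a stand-in for whatever features an implementation actually computes) and exponential averaging as the smoothing; the smoothing constant is illustrative.

    import numpy as np

    ALPHA = 0.9  # exponential smoothing factor

    def select_beam(beam_frames, smoothed):
        """beam_frames: (n_beams, frame_len) beamformed samples for one frame.
        smoothed: (n_beams,) running smoothed features, updated in place."""
        features = np.mean(beam_frames ** 2, axis=1)   # per-beam frame energy
        smoothed[:] = ALPHA * smoothed + (1 - ALPHA) * features
        return int(np.argmax(smoothed))                # beam with max feature

    n_beams, frame_len = 6, 160
    smoothed = np.zeros(n_beams)
    for _ in range(10):                     # simulate ten frames
        frames = np.random.randn(n_beams, frame_len)
        frames[2] *= 3.0                    # beam 2 carries the loudest source
        best = select_beam(frames, smoothed)
    print("selected beam:", best)           # -> 2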
Abstract:
A system is described that improves audio processing by performing dereverberation and noise reduction during a communication session. In some examples, the system may include a deep neural network (DNN) configured to perform speech enhancement, located after an Acoustic Echo Cancellation (AEC) component. For example, the DNN may process isolated audio data output by the AEC component to jointly mitigate additive noise and reverberation. In other examples, the system may include a DNN configured to perform acoustic interference cancellation, which may jointly mitigate additive noise, reverberation, and residual echo, removing the need for separate residual echo suppression processing. The DNN is configured to process complex-valued spectrograms corresponding to the isolated audio data and/or to estimated echo data generated by the AEC component.
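The PyTorch sketch below illustrates one plausible shape for such a stage: the complex spectrograms of the AEC output and the estimated echo enter as real/imaginary channels, and the network predicts a complex ratio mask applied to the AEC output. The layer sizes and mask formulation are assumptions for illustration, not the architecture from the abstract.

    import torch
    import torch.nn as nn

    class EnhancementDNN(nn.Module):
        def __init__(self, n_inputs=2, n_hidden=32):
            super().__init__()
            # n_inputs complex spectrograms -> 2 * n_inputs real channels
            self.net = nn.Sequential(
                nn.Conv2d(2 * n_inputs, n_hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(n_hidden, 2, kernel_size=3, padding=1),  # mask re/im
            )

        def forward(self, specs):
            """specs: complex tensor (batch, n_inputs, freq, time)."""
            x = torch.cat([specs.real, specs.imag], dim=1)
            m = self.net(x)
            mask = torch.complex(m[:, 0], m[:, 1])
            return mask * specs[:, 0]       # mask the AEC output spectrogram

    # Example: AEC output plus estimated echo, 257 bins x 100 frames.
    specs = torch.randn(1, 2, 257, 100, dtype=torch.cfloat)
    print(EnhancementDNN()(specs).shape)    # torch.Size([1, 257, 100])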
Abstract:
The systems, devices, and processes described herein may identify a beam of a voice-controlled device that is directed toward a reflective surface, such as a wall. The beams may be created by a beamformer. An acoustic echo canceller (AEC) may create filter coefficients for a reference sound. The filter coefficients may be analyzed to identify beams that include multiple peaks, which may indicate the presence of one or more reflective surfaces. Using the amplitudes of the peaks and the time delay between them, the device may determine that it is close to a reflective surface in the direction of the beam.
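A minimal sketch of the peak analysis, using SciPy's peak finder on one beam's AEC filter taps (an estimate of that beam's echo path impulse response). The relative-height cutoff and the factor of two in the distance estimate (out-and-back travel of the reflection) are illustrative assumptions.

    import numpy as np
    from scipy.signal import find_peaks

    SPEED_OF_SOUND = 343.0  # m/s

    def reflection_distance(aec_taps, sample_rate, min_ratio=0.3):
        """Estimated distance (m) to a reflective surface for this beam,
        or None if the filter shows only a single dominant peak."""
        mags = np.abs(aec_taps)
        peaks, _ = find_peaks(mags, height=min_ratio * mags.max())
        if len(peaks) < 2:
            return None
        delay = peaks[1] - peaks[0]        # direct path -> first reflection
        return SPEED_OF_SOUND * delay / (2 * sample_rate)

    # Example: synthetic filter with a direct path and one reflection.
    taps = np.zeros(512)
    taps[40], taps[133] = 1.0, 0.5         # reflection 93 samples later
    print(reflection_distance(taps, 16000))  # ~1.0 m to the wall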
Abstract:
A wearable computer is configured to use beamforming techniques to isolate a user's speech from extraneous audio signals occurring within a physical environment. A microphone array of the wearable computer may generate audio signal data from an utterance spoken by the user. One or more motion sensors of the wearable computer may generate motion data from movement of the wearable computer. This motion data may be used to determine a direction vector pointing from the wearable computer to the user's mouth, and a beampattern may be defined with a beampattern direction in substantial alignment with the determined direction vector, focusing the microphone array on the user's mouth for speech isolation.
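Once a device-to-mouth direction vector is available, steering the beampattern can be as simple as computing per-microphone delays for a delay-and-sum beamformer, as in the sketch below. The vector is assumed to have already been estimated from the motion data; deriving it from raw IMU samples is outside this sketch, and the geometry is illustrative.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def steering_delays(mic_positions, direction):
        """Per-microphone delays (s) aligning a plane wave arriving from
        `direction` (unit vector from the device toward the mouth)."""
        direction = np.asarray(direction, dtype=float)
        direction /= np.linalg.norm(direction)
        delays = mic_positions @ direction / SPEED_OF_SOUND
        return delays - delays.min()       # shift so all delays are >= 0

    # Example: four mics on a 4 cm square; mouth up and to the left.
    mics = np.array([[0.0, 0.0, 0.0], [0.04, 0.0, 0.0],
                     [0.0, 0.04, 0.0], [0.04, 0.04, 0.0]])
    print(steering_delays(mics, direction=[-0.5, 0.7, 0.5]))

Summing the microphone channels after applying these delays (with fractional-sample interpolation in practice) yields a beampattern aimed at the mouth.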
Abstract:
Sound is banked laterally over an array of microphones arranged on a rear surface of a device. Sound enters a duct behind the device from different directions via inlets along the sides of the device, and the duct directs the sound waves across the microphone array. An effective direction from which the banked sounds originated, relative to a front of the device, is determined. Based on the determined effective direction, the device applies spatial filtering to isolate received sound waves, selectively increasing the signal-to-noise ratio of sound from a selected source and at least partially occluding sounds from other sources.
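Estimating the effective direction typically starts from inter-microphone time differences. The sketch below computes the time difference of arrival between two duct microphones with GCC-PHAT; mapping that delay to an arrival angle depends on the duct and inlet geometry and is omitted. The example signals are synthetic.

    import numpy as np

    def gcc_phat_delay(sig_a, sig_b, sample_rate):
        """Delay (s) by which sig_b lags sig_a, estimated via GCC-PHAT."""
        n = len(sig_a) + len(sig_b)
        A = np.fft.rfft(sig_a, n)
        B = np.fft.rfft(sig_b, n)
        cross = np.conj(A) * B
        cross /= np.abs(cross) + 1e-12     # phase transform weighting
        cc = np.fft.irfft(cross, n)
        shift = int(np.argmax(np.abs(cc)))
        if shift > n // 2:                 # wrap negative lags
            shift -= n
        return shift / sample_rate

    # Example: the second mic receives the same burst 5 samples later.
    fs, burst = 16000, np.random.randn(1024)
    a = np.concatenate([burst, np.zeros(5)])
    b = np.concatenate([np.zeros(5), burst])
    print(gcc_phat_delay(a, b, fs) * fs)   # ~5 samples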