Abstract:
A speech conversion system is described that includes a hierarchical encoder and a decoder. The system may comprise a processor and memory storing instructions executable by the processor. The instructions may be executable to: using a second recurrent neural network (RNN) (GRU1) and a first set of encoder vectors derived from a spectrogram as input to the second RNN, determine a second concatenated sequence; determine a second set of encoder vectors by doubling a stack height and halving a length of the second concatenated sequence; using the second set of encoder vectors, determine a third set of encoder vectors; and decode the third set of encoder vectors using an attention block.
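The "doubling a stack height and halving a length" step resembles the time-reduction operation used in pyramidal encoders: adjacent pairs of time steps are stacked along the feature axis. A minimal sketch, assuming a sequence stored as a (time, features) array (the function name and frame handling are illustrative, not from the source):

```python
import numpy as np

def halve_and_stack(seq):
    """Stack adjacent time steps: (T, D) -> (T//2, 2*D).

    Halves the sequence length while doubling the per-step
    feature ("stack") size, as in a pyramidal encoder layer.
    """
    T, D = seq.shape
    if T % 2:                      # drop a trailing frame if T is odd
        seq = seq[:T - 1]
    return seq.reshape(-1, 2 * D)

x = np.zeros((8, 16))
y = halve_and_stack(x)
# y has 4 time steps of 32 features each
```

Applying the step repeatedly in successive encoder layers yields progressively shorter, wider sequences for the attention-based decoder to consume.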
Abstract:
A controller may be programmed to create a speech utterance set for speech recognition training. In response to receiving data representing a neutral utterance and parameter values defining signal noise, the controller generates data representing a Lombard-effect version of the neutral utterance using a transfer function that is associated with the parameter values and that defines the distortion, due to the signal noise, between neutral and Lombard-effect versions of the same utterance.
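As an illustration of applying such a transfer function, the sketch below maps a neutral-speech magnitude spectrum to a Lombard-style version by boosting mid-band energy as noise rises. The band limits, gain slope, and function names are assumptions for illustration; the source does not specify the transfer function's form:

```python
import numpy as np

def lombard_transform(neutral_spectrum, freqs, noise_level_db):
    """Toy transfer function: boost 1-3 kHz energy with noise level,
    mimicking the spectral tilt of Lombard speech."""
    in_band = (freqs > 1000.0) & (freqs < 3000.0)
    gain_db = noise_level_db * 0.3 * in_band   # illustrative gain law
    h = 10.0 ** (gain_db / 20.0)               # dB -> linear gain
    return neutral_spectrum * h

freqs = np.linspace(0.0, 8000.0, 257)
neutral = np.ones_like(freqs)
lombard = lombard_transform(neutral, freqs, noise_level_db=20.0)
# mid-band bins are amplified; out-of-band bins are unchanged
```

A training set could then pair each neutral utterance with transformed versions generated under several noise parameter settings.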
Abstract:
This disclosure generally relates to a system, apparatus, and method for achieving a vehicle state-based hands-free noise reduction feature. A noise reduction tool is provided that applies a noise reduction strategy to a sound input and uses machine learning to develop future noise reduction strategies, where the noise reduction strategies include analyzing vehicle operational state information and external information that are predicted to contribute to cabin noise and selecting noise-reducing pre-filter options based on the analysis. The machine learning may further be supplemented by off-line training to generate a speech quality performance measure for the sound input that may be referenced by the noise reduction tool for further noise reduction strategies.
Abstract:
Example natural speech data generation systems and methods are described. In one implementation, a natural speech data generator initiates a game between a first player and a second player and determines a scenario associated with the game. A first role is assigned to the first player and a second role is assigned to the second player. The natural speech data generator receives multiple natural speech utterances by the first player and the second player during the game.
Abstract:
An end-to-end deep-learning-based system that can solve both ASR and TTS problems jointly using unpaired text and audio samples is disclosed herein. An adversarially-trained approach is used to generate a more robust independent TTS neural network and an ASR neural network that can be deployed individually or simultaneously. The process for training the neural networks includes generating an audio sample from a text sample using the TTS neural network, then feeding the generated audio sample into the ASR neural network to regenerate the text. The difference between the regenerated text and the original text is used as a first loss for training the neural networks. A similar process is used for an audio sample. The difference between the regenerated audio and the original audio is used as a second loss. Text and audio discriminators are similarly used on the output of the neural network to generate additional losses for training.
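The two cycle losses described above (text → TTS → ASR → text, and audio → ASR → TTS → audio) can be sketched numerically. The linear maps below are stand-ins for the trained TTS and ASR networks, and the mean-squared-error losses are illustrative; the source does not specify the loss form:

```python
import numpy as np

rng = np.random.default_rng(0)
W_tts = rng.normal(size=(16, 32))   # stand-in "TTS network" weights
W_asr = rng.normal(size=(32, 16))   # stand-in "ASR network" weights

def tts(text_vec):    # text embedding -> audio features
    return text_vec @ W_tts

def asr(audio_vec):   # audio features -> text embedding
    return audio_vec @ W_asr

text = rng.normal(size=(4, 16))
audio = rng.normal(size=(4, 32))

# First loss: regenerate text from TTS-generated audio, compare to original.
text_cycle_loss = np.mean((asr(tts(text)) - text) ** 2)
# Second loss: regenerate audio from ASR-recognized text, compare to original.
audio_cycle_loss = np.mean((tts(asr(audio)) - audio) ** 2)
```

In the described system these cycle losses would be combined with the discriminator losses to train both networks jointly on unpaired samples.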
Abstract:
Systems, methods, and devices for speech transformation and generating synthetic speech using deep generative models are disclosed. A method of the disclosure includes receiving input audio data comprising a plurality of iterations of a speech utterance from a plurality of speakers. The method includes generating an input spectrogram based on the input audio data and transmitting the input spectrogram to a neural network configured to generate an output spectrogram. The method includes receiving the output spectrogram from the neural network and, based on the output spectrogram, generating synthetic audio data comprising the speech utterance.
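Generating the input spectrogram from raw audio typically uses a short-time Fourier transform. A minimal sketch, where the frame length, hop size, and window choice are assumptions rather than values from the source:

```python
import numpy as np

def spectrogram(audio, frame=256, hop=128):
    """Magnitude spectrogram via a windowed short-time Fourier transform.

    Returns an array of shape (n_frames, frame // 2 + 1).
    """
    window = np.hanning(frame)
    n_frames = 1 + (len(audio) - frame) // hop
    frames = np.stack([audio[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone at 16 kHz
audio = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
spec = spectrogram(audio)
```

The resulting spectrogram would be the neural network's input; synthetic audio is recovered from the output spectrogram with an inverse transform such as phase reconstruction.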
Abstract:
An automatic speech recognition system for a vehicle includes a controller configured to select an acoustic model from a library of acoustic models based on ambient noise in a cabin of the vehicle and operating parameters of the vehicle. The controller is further configured to apply the selected acoustic model to noisy speech to improve recognition of the speech.
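Model selection from the library can be pictured as a lookup keyed on the measured noise and operating state. The thresholds, key names, and model identifiers below are hypothetical placeholders; the source does not enumerate the library:

```python
# Hypothetical acoustic-model library keyed by noise and driving regime
MODEL_LIBRARY = {
    ("low_noise", "city"):     "am_city_quiet",
    ("low_noise", "highway"):  "am_highway_quiet",
    ("high_noise", "city"):    "am_city_noisy",
    ("high_noise", "highway"): "am_highway_noisy",
}

def select_acoustic_model(noise_db, speed_kph):
    """Pick a model from the library using cabin noise level and
    a vehicle operating parameter (speed)."""
    noise = "high_noise" if noise_db > 65 else "low_noise"
    regime = "highway" if speed_kph > 80 else "city"
    return MODEL_LIBRARY[(noise, regime)]
```

The selected model is then applied to the noisy speech by the recognizer; richer versions might key on HVAC state, window position, or road surface as additional operating parameters.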
Abstract:
A vehicle is disclosed that includes systems for adjusting the transmittance of one or more windows of the vehicle. The vehicle may include a camera outputting images taken of an occupant within the vehicle. The vehicle may also include an artificial neural network running on computer hardware carried on-board the vehicle. The artificial neural network may be trained to classify the occupant of the vehicle using the images captured by the camera as input. The vehicle may further include a controller controlling transmittance of the one or more windows based on classifications made by the artificial neural network. For example, if the artificial neural network classifies the occupant as squinting or shading his or her eyes with a hand, the controller may reduce the transmittance of a windshield, side window, or some combination thereof.
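The controller logic in the example above can be sketched as a simple mapping from the network's classification to a transmittance adjustment. The class labels, step size, and floor value are assumptions for illustration:

```python
# Classifications (from the neural network) that indicate glare discomfort
GLARE_CLASSES = {"squinting", "shading_eyes"}

def update_transmittance(classification, current=1.0, step=0.2, floor=0.3):
    """Reduce window transmittance when the occupant appears to be
    squinting or shading their eyes; otherwise leave it unchanged."""
    if classification in GLARE_CLASSES:
        return max(floor, current - step)
    return current
```

A real controller would apply this per window (windshield, side windows, or a combination) and likely smooth the adjustment over successive camera frames.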
Abstract:
A system includes a head and torso simulation (HATS) system configured to play back pre-recorded audio commands while simulating a driver head location as an output location. The system also includes a vehicle speaker system and a processor configured to engage a vehicle heating, ventilation and air-conditioning (HVAC) system. The processor is also configured to play back audio commands through the HATS system while playing back pre-recorded vehicle environment noises through the speaker system. The processor is further configured to determine if the audio command, recorded by a vehicle microphone, is recognizable in the presence of the environment noises and HVAC noises. Also, the processor is configured to repeat the engagement, playback of commands and noises, and determination, recording the results of the determination for each command in a set of commands.
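The repeated engage/playback/determine cycle described above can be sketched as a test loop over the command set. The callables stand in for the HVAC control, the HATS-plus-speaker playback with microphone capture, and the recognizer; all names are illustrative:

```python
def run_recognition_tests(commands, recognize, play_and_capture, engage_hvac):
    """For each pre-recorded command: engage the HVAC, play the command
    over environment noise, capture it at the vehicle microphone, and
    record whether it was still recognized correctly."""
    results = {}
    for cmd in commands:
        engage_hvac()                    # introduce HVAC noise
        captured = play_and_capture(cmd) # HATS playback + mic recording
        results[cmd] = (recognize(captured) == cmd)
    return results

# Usage with trivial stubs (a perfect channel and recognizer)
cmds = ["call home", "play radio"]
res = run_recognition_tests(cmds,
                            recognize=lambda audio: audio,
                            play_and_capture=lambda cmd: cmd,
                            engage_hvac=lambda: None)
```

The recorded results per command give the pass/fail tally the processor accumulates across the full command set.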