Abstract:
Systems and processes for generating complementary acoustic models for performing automatic speech recognition system combination are provided. In one example process, a deep neural network can be trained using a set of training data. The trained deep neural network can be a deep neural network acoustic model. A Gaussian-mixture model can be linked to a hidden layer of the trained deep neural network such that any feature vector outputted from the hidden layer is received by the Gaussian-mixture model. The Gaussian-mixture model can be trained via a first portion of the trained deep neural network and using the set of training data. The first portion of the trained deep neural network can include an input layer of the deep neural network and the hidden layer. The first portion of the trained deep neural network and the trained Gaussian-mixture model can be a Deep Neural Network-Gaussian-Mixture Model (DNN-GMM) acoustic model.
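The DNN-GMM arrangement described above can be illustrated with a minimal sketch, assuming a toy network whose first portion (input layer to hidden layer) has a single hidden unit, so each feature is one-dimensional. The weights, the two-component mixture, the variance floor, and all function names are hypothetical choices made for illustration; training of the deep neural network itself is not shown.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical weights standing in for the first portion of an already
# trained DNN (input layer -> hidden layer). Not taken from the source.
HIDDEN_WEIGHTS = [0.8, -0.5]
HIDDEN_BIAS = 0.1
VARIANCE_FLOOR = 1e-2  # assumed floor that keeps the densities numerically stable

Gmm = Tuple[List[float], List[float], List[float]]  # weights, means, variances


def hidden_features(x: Sequence[float]) -> float:
    """Forward an input vector through the first portion of the network."""
    activation = sum(w * v for w, v in zip(HIDDEN_WEIGHTS, x)) + HIDDEN_BIAS
    return math.tanh(activation)


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def train_gmm(features: Sequence[float], iterations: int = 25) -> Gmm:
    """Fit a two-component 1-D GMM to hidden-layer features with EM."""
    weights = [0.5, 0.5]
    means = [min(features), max(features)]
    variances = [1.0, 1.0]
    for _ in range(iterations):
        # E-step: responsibility of each component for each feature.
        resp = []
        for f in features:
            likes = [w * gaussian_pdf(f, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            denom = max(nk, 1e-12)  # guard against an empty component
            weights[k] = nk / len(features)
            means[k] = sum(r[k] * f for r, f in zip(resp, features)) / denom
            var = sum(r[k] * (f - means[k]) ** 2
                      for r, f in zip(resp, features)) / denom
            variances[k] = max(var, VARIANCE_FLOOR)
    return weights, means, variances


def log_likelihood(x: Sequence[float], gmm: Gmm) -> float:
    """Score an input with the combined model: first DNN portion, then GMM."""
    weights, means, variances = gmm
    f = hidden_features(x)
    return math.log(sum(w * gaussian_pdf(f, m, v)
                        for w, m, v in zip(weights, means, variances)))


# Toy training data: the same set is passed through the hidden layer
# and then used to train the GMM.
TRAINING_DATA = [(1.0, 0.0), (1.2, 0.1), (0.9, -0.1),
                 (-1.0, 0.5), (-1.1, 0.4), (-0.9, 0.6)]
gmm = train_gmm([hidden_features(x) for x in TRAINING_DATA])
```

In this sketch the Gaussian-mixture model never sees the raw inputs; it only receives what the hidden layer emits, which is the linkage the abstract describes.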
Abstract:
Systems and processes for speech recognition for multiple users are provided. For example, in response to receiving speech input from a user, a combined speech profile is obtained from a plurality of speech profiles. The speech input is interpreted based on the combined speech profile to obtain a plurality of speech recognition results. The plurality of speech recognition results includes a first speech recognition result corresponding to a first speech profile of the plurality of speech profiles, wherein the first speech profile corresponds to a first user, and a second speech recognition result corresponding to a second speech profile of the plurality of speech profiles, wherein the second speech profile corresponds to a second user different from the first user. A respective speech recognition result based on an identified voice profile is then selected from the plurality of speech recognition results.
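A minimal sketch of the multi-user flow above, assuming each speech profile can be modeled as a per-user vocabulary that maps a recognized token to that user's preferred word. The profile representation, the function names, and the sample data are hypothetical; speaker identification is treated as an external step whose outcome is passed in.

```python
from typing import Dict, List

# Hypothetical representation: a speech profile maps a recognized token
# to the word that the profile's user most likely means.
SpeechProfile = Dict[str, str]


def combine_profiles(profiles: Dict[str, SpeechProfile]) -> Dict[str, SpeechProfile]:
    """Build a combined profile that keeps every user's profile addressable."""
    return dict(profiles)


def interpret(tokens: List[str], combined: Dict[str, SpeechProfile]) -> Dict[str, str]:
    """Produce one recognition result per speech profile in the combined profile."""
    results = {}
    for user, vocabulary in combined.items():
        results[user] = " ".join(vocabulary.get(token, token) for token in tokens)
    return results


def select_result(results: Dict[str, str], identified_user: str) -> str:
    """Pick the result that corresponds to the identified voice profile."""
    return results[identified_user]


profiles = {
    "alice": {"ann": "Anne"},  # first user's profile
    "bob": {"ann": "Ann"},     # second user's profile
}
results = interpret(["call", "ann"], combine_profiles(profiles))
chosen = select_result(results, identified_user="bob")
```

Interpreting against every profile first and selecting afterwards means the speaker does not need to be identified before recognition begins.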
Abstract:
The present disclosure generally relates to context-based endpoint detection in user speech input. A method for identifying an endpoint of a spoken request by a user may include receiving user input of natural language speech including one or more words; identifying at least one context associated with the user input; generating a probability, based on the at least one context associated with the user input, that a location in the user input is an endpoint; determining whether the probability is greater than a threshold; and in accordance with a determination that the probability is greater than the threshold, identifying the location in the user input as the endpoint.
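The thresholding steps of the method can be sketched as follows. This is a minimal illustration that assumes each candidate location already has a base pause score and that context contributes a simple additive bias; the `CONTEXT_BIAS` table, its values, and the function names are hypothetical and are not taken from the disclosure.

```python
from typing import Optional, Sequence

# Hypothetical context adjustments: a dictation context makes an endpoint
# less likely, a short-command context makes it more likely.
CONTEXT_BIAS = {"dictation": -0.2, "command": 0.1}


def endpoint_probability(pause_score: float, context: str) -> float:
    """Combine a base pause score with a context bias, clamped to [0, 1]."""
    biased = pause_score + CONTEXT_BIAS.get(context, 0.0)
    return max(0.0, min(1.0, biased))


def find_endpoint(pause_scores: Sequence[float],
                  context: str,
                  threshold: float) -> Optional[int]:
    """Return the first location whose probability exceeds the threshold."""
    for location, score in enumerate(pause_scores):
        if endpoint_probability(score, context) > threshold:
            return location
    return None  # no location qualified as an endpoint


scores = [0.1, 0.65, 0.8]
command_endpoint = find_endpoint(scores, "command", threshold=0.7)
dictation_endpoint = find_endpoint(scores, "dictation", threshold=0.7)
```

With the same pause scores, the command context yields an endpoint at location 1, while the dictation context yields none, which shows how context alone can change the decision.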
Abstract:
Systems and processes for processing speech in a digital assistant are provided. In one example process, a first speech input can be received from a user. The first speech input can be processed using a first automatic speech recognition system to produce a first recognition result. An input indicative of a potential error in the first recognition result can be received. The input can be used to improve the first recognition result. For example, the input can include a second speech input that is a repetition of the first speech input. The second speech input can be processed using a second automatic speech recognition system to produce a second recognition result.
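The two-recognizer flow above can be sketched as below, assuming that a repeated utterance is the input indicating a potential error and that each automatic speech recognition system can be modeled as a callable from audio to text. The stub recognizers, the audio labels, and the function names are hypothetical; how the second result is used to improve the first is left open, as in the abstract.

```python
from typing import Callable, Optional, Tuple

Recognizer = Callable[[str], str]  # hypothetical: audio label -> transcript


def process_speech(first_input: str,
                   first_asr: Recognizer,
                   second_asr: Recognizer,
                   repeated_input: Optional[str] = None
                   ) -> Tuple[str, Optional[str]]:
    """Run the first ASR system; if the user repeats, run the second one."""
    first_result = first_asr(first_input)
    if repeated_input is None:
        # No input indicating a potential error was received.
        return first_result, None
    # The repetition is processed by a different recognition system.
    second_result = second_asr(repeated_input)
    return first_result, second_result


# Stub recognizers standing in for two distinct ASR systems.
def fast_asr(audio: str) -> str:
    return {"utterance-1": "call mum"}.get(audio, "")


def careful_asr(audio: str) -> str:
    return {"utterance-1-repeat": "call Tom"}.get(audio, "")


first, second = process_speech("utterance-1", fast_asr, careful_asr,
                               repeated_input="utterance-1-repeat")
```

Routing the repetition to a second system avoids reproducing the same mistake that a single recognizer might make on both attempts.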