Abstract:
Systems and processes for providing user-specific acoustic models are provided. In accordance with one example, a method includes, at an electronic device having one or more processors, receiving a plurality of speech inputs, each of the speech inputs associated with a same user of the electronic device; providing each of the plurality of speech inputs to a user-independent acoustic model, the user-independent acoustic model providing a plurality of speech results based on the plurality of speech inputs; initiating a user-specific acoustic model on the electronic device; and adjusting the user-specific acoustic model based on the plurality of speech inputs and the plurality of speech results.
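For illustration only, the following Python sketch outlines the kind of adaptation loop such a process could follow. The class, method, and parameter names (UserSpecificAcousticModel, recognize, adjust, base_weights) are hypothetical placeholders and are not drawn from the disclosure.

    # Illustrative sketch, not the patented implementation: adapting a
    # user-specific acoustic model from outputs of a user-independent model.
    class UserSpecificAcousticModel:
        def __init__(self, base_weights):
            self.weights = dict(base_weights)   # start from shared weights

        def adjust(self, speech_input, speech_result):
            # Adapt to the (audio, transcript) pair produced by the
            # user-independent model; a real system would update weights here.
            pass

    def build_user_model(user_inputs, user_independent_model, base_weights):
        user_model = UserSpecificAcousticModel(base_weights)   # initiate on device
        for speech_input in user_inputs:
            result = user_independent_model.recognize(speech_input)
            user_model.adjust(speech_input, result)            # adjust per input/result
        return user_model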
Abstract:
Speech recognition is performed on a received utterance to determine a plurality of candidate text representations of the utterance, including a primary text representation and one or more alternative text representations. Natural language processing is performed on the primary text representation to determine a plurality of candidate actionable intents, including a primary actionable intent and one or more alternative actionable intents. A result is determined based on the primary actionable intent. The result is provided to the user. A recognition correction trigger is detected. In response to detecting the recognition correction trigger, a set of alternative intent affordances and a set of alternative text affordances are concurrently displayed.
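A minimal sketch of such a correction flow is given below, assuming hypothetical recognizer, NLU, and UI objects with placeholder methods (candidates, execute, show_result, correction_trigger_detected, display_concurrently); it is not the actual implementation.

    def handle_utterance(audio, recognizer, nlu, ui):
        # Speech recognition yields a ranked list of candidate texts.
        texts = recognizer.candidates(audio)          # [primary, *alternatives]
        primary_text, alt_texts = texts[0], texts[1:]

        # Natural language processing on the primary text yields ranked intents.
        intents = nlu.candidates(primary_text)        # [primary, *alternatives]
        primary_intent, alt_intents = intents[0], intents[1:]

        # Determine a result from the primary intent and provide it to the user.
        ui.show_result(primary_intent.execute())

        if ui.correction_trigger_detected():
            # Concurrently display alternative interpretations and transcriptions.
            ui.display_concurrently(
                intent_affordances=alt_intents,
                text_affordances=alt_texts,
            )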
Abstract:
Systems and processes for evaluating embedded personalized systems are provided. In one example process, instructions that define an experiment associated with a personalized speech recognition system can be received. The instructions can define one or more experimental parameters. In accordance with the received instructions, a second personalized speech recognition system can be generated based on the personalized speech recognition system and the one or more experimental parameters. Additionally, a plurality of user speech samples can be processed using the second personalized speech recognition system to generate a plurality of speech recognition results and a plurality of accuracy scores corresponding to the plurality of speech recognition results. Second instructions can be received based on the plurality of accuracy scores. In accordance with the second instructions, the second personalized speech recognition system can be activated.
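The evaluation loop could be sketched as follows; the clone_with, recognize, activate, accuracy, and should_activate callables are assumed stand-ins for real components, and reference transcripts are assumed to be available for scoring.

    def run_experiment(base_system, experiment_params, speech_samples,
                       references, accuracy, should_activate):
        # Derive a candidate system from the existing personalized system
        # according to the experimental parameters.
        candidate = base_system.clone_with(experiment_params)

        # Process the user speech samples with the candidate system and score them.
        results = [candidate.recognize(sample) for sample in speech_samples]
        scores = [accuracy(result, ref) for result, ref in zip(results, references)]

        # "Second instructions" arrive based on the reported accuracy scores.
        if should_activate(scores):
            candidate.activate()
        return results, scores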
Abstract:
Systems and processes for automatic accent detection are provided. In accordance with one example, a method includes, at an electronic device with one or more processors and memory, receiving a user input, determining a first similarity between a representation of the user input and a first acoustic model of a plurality of acoustic models, and determining a second similarity between the representation of the user input and a second acoustic model of the plurality of acoustic models. The method further includes determining whether the first similarity is greater than the second similarity. In accordance with a determination that the first similarity is greater than the second similarity, the first acoustic model may be selected; and in accordance with a determination that the first similarity is not greater than the second similarity, the second acoustic model may be selected.
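The selection step reduces to a similarity comparison, sketched below with an assumed similarity function and placeholder model objects; this is illustrative only.

    def select_acoustic_model(user_input, acoustic_models, similarity):
        first, second = acoustic_models[0], acoustic_models[1]
        first_score = similarity(user_input, first)
        second_score = similarity(user_input, second)
        # Select whichever acoustic model the input matches more closely.
        return first if first_score > second_score else second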
Abstract:
Systems and processes are disclosed for discovering trending terms in automatic speech recognition. Candidate terms (e.g., words, phrases, etc.) not yet found in a speech recognizer vocabulary or having low language model probability can be identified based on trending usage in a variety of electronic data sources (e.g., social network feeds, news sources, search queries, etc.). When candidate terms are identified, archives of live or recent speech traffic can be searched to determine whether users are uttering the candidate terms in dictation or speech requests. Such searching can be done using open vocabulary spoken term detection to find phonetic matches in the audio archives. As the candidate terms are found in the speech traffic, notifications can be generated that identify the candidate terms, provide relevant usage statistics, identify the context in which the terms are used, and the like.
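A rough sketch of such a pipeline appears below; the spoken_term_detect and notify callables, the lm_probability function, and the threshold value are hypothetical placeholders rather than components named in the disclosure.

    def find_trending_terms(candidate_terms, vocabulary, lm_probability,
                            audio_archive, spoken_term_detect, notify,
                            lm_threshold=1e-6):
        # Keep candidate terms that are out of vocabulary or have low
        # language model probability.
        trending = [term for term in candidate_terms
                    if term not in vocabulary or lm_probability(term) < lm_threshold]

        for term in trending:
            # Open vocabulary spoken term detection over archived speech traffic.
            hits = spoken_term_detect(audio_archive, term)
            if hits:
                # Notify with the term, usage statistics, and usage contexts.
                notify(term,
                       usage_count=len(hits),
                       contexts=[hit.context for hit in hits])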
Abstract:
Systems and processes for generating complementary acoustic models for performing automatic speech recognition system combination are provided. In one example process, a deep neural network can be trained using a set of training data. The trained deep neural network can be a deep neural network acoustic model. A Gaussian-mixture model can be linked to a hidden layer of the trained deep neural network such that any feature vector outputted from the hidden layer is received by the Gaussian-mixture model. The Gaussian-mixture model can be trained via a first portion of the trained deep neural network and using the set of training data. The first portion of the trained deep neural network can include an input layer of the deep neural network and the hidden layer. The first portion of the trained deep neural network and the trained Gaussian-mixture model can together form a Deep Neural Network-Gaussian-Mixture Model (DNN-GMM) acoustic model.
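A toy sketch of linking a Gaussian-mixture model to a hidden layer is shown below using NumPy and scikit-learn's GaussianMixture; the layer structure, weights, and training features are placeholders, and a real acoustic model would be substantially larger.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def hidden_activations(dnn_weights, features):
        # Forward pass from the input layer up to the chosen hidden layer
        # (the first portion of the trained DNN), assuming ReLU layers.
        h = features
        for w, b in dnn_weights:
            h = np.maximum(0.0, h @ w + b)
        return h

    def train_dnn_gmm(dnn_weights, training_features, n_components=8):
        # Every feature vector emitted by the hidden layer feeds the GMM.
        hidden_features = hidden_activations(dnn_weights, training_features)
        gmm = GaussianMixture(n_components=n_components).fit(hidden_features)
        # Together, the first portion of the DNN and the trained GMM form
        # the DNN-GMM acoustic model.
        return dnn_weights, gmm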
Abstract:
Systems and processes are disclosed for recognizing speech using a weighted finite state transducer (WFST) approach. Dynamic grammars can be supported by constructing the final recognition cascade during runtime using difference grammars. In a first grammar, non-terminals can be replaced with a weighted phone loop that produces sequences of mono-phone words. In a second grammar, at runtime, non-terminals can be replaced with sub-grammars derived from user-specific usage data including contact, media, and application lists. Interaction frequencies associated with these entities can be used to weight certain words over others. With all non-terminals replaced, a static recognition cascade with the first grammar can be composed with the personalized second grammar to produce a user-specific WFST. User speech can then be processed to generate candidate words having associated probabilities, and the likeliest result can be output.
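A simplified sketch of building the personalized sub-grammar weights is given below, using negative-log relative frequencies as a stand-in for WFST arc weights; the compose callable, entity counts, and function names are illustrative assumptions rather than the disclosed implementation.

    import math

    def build_subgrammar(entity_counts):
        # entity_counts maps entity names (contacts, media, apps) to
        # interaction frequencies, e.g. {"Alice": 12, "Podcasts": 3}.
        total = sum(entity_counts.values())
        # More frequently used entities receive lower (better) arc weights.
        return {name: -math.log(count / total)
                for name, count in entity_counts.items()}

    def personalize(static_cascade, subgrammar_data, compose):
        # Replace non-terminals with user-specific sub-grammars at runtime,
        # then compose with the static cascade to obtain the user's WFST.
        dynamic = {non_terminal: build_subgrammar(counts)
                   for non_terminal, counts in subgrammar_data.items()}
        return compose(static_cascade, dynamic)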