Abstract:
Features are disclosed for generating intent-specific results in an automatic speech recognition system. The results can be generated by utilizing a decoding graph containing tags that identify portions of the graph corresponding to a given intent. The tags can also identify high-information content slots and low-information carrier phrases for a given intent. The automatic speech recognition system may utilize these tags to provide a semantic representation that preserves the plurality of different tokens recognized in the content slot portions while collapsing the low-information carrier portions. A user can be presented with a user interface containing the top intent results and the corresponding intent-specific top content slot values.
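A minimal Python sketch of how such tags might be consumed; the Token structure, the intent name, and the slot names are illustrative assumptions rather than details from the disclosure:

    from collections import namedtuple

    # Hypothetical decoded token: a word plus the intent and slot tags attached
    # to the matching portion of the decoding graph.
    Token = namedtuple("Token", ["word", "intent", "slot"])

    def semantic_representation(tokens):
        """Keep high-information content slots; collapse low-information carrier words."""
        intent, slots = None, {}
        for tok in tokens:
            intent = intent or tok.intent
            if tok.slot != "carrier":
                slots.setdefault(tok.slot, []).append(tok.word)
        return {"intent": intent,
                "slots": {name: " ".join(words) for name, words in slots.items()}}

    decoded = [Token("play", "PlayMusic", "carrier"),
               Token("some", "PlayMusic", "carrier"),
               Token("jazz", "PlayMusic", "genre"),
               Token("music", "PlayMusic", "carrier")]
    print(semantic_representation(decoded))  # {'intent': 'PlayMusic', 'slots': {'genre': 'jazz'}}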
Abstract:
Features are disclosed for managing the use of speech recognition models and data in automated speech recognition systems. Models and data may be retrieved asynchronously and used as they are received, or after an utterance is initially processed with more general or different models. Once received, the models and data can be cached. Statistics needed to update the models and data may also be retrieved asynchronously, so that they can be used to perform the updates as they become available. The updated models and data may be used immediately to re-process an utterance, or saved for use in processing subsequently received utterances. User interactions with the automated speech recognition system may be tracked in order to predict when a user is likely to utilize the system. Models and data may be pre-cached based on such predictions.
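One way to realize the asynchronous retrieval and caching, sketched in Python; fetch_model and the cache layout are assumptions standing in for whatever retrieval mechanism the system actually uses:

    from concurrent.futures import ThreadPoolExecutor

    _executor = ThreadPoolExecutor(max_workers=2)
    _cache = {}  # user id -> Future resolving to a personalized model

    def fetch_model(user_id):
        # Stand-in for asynchronous retrieval of personalized models/statistics.
        return {"user": user_id, "kind": "personalized"}

    def get_model(user_id, general_model):
        """Use the general model now; switch to the personalized model once the
        asynchronous fetch completes, caching it for subsequent utterances."""
        future = _cache.get(user_id)
        if future is None:
            _cache[user_id] = _executor.submit(fetch_model, user_id)
        elif future.done():
            return future.result()
        return general_model

    general = {"kind": "general"}
    print(get_model("u1", general)["kind"])  # general (fetch just kicked off)
    _cache["u1"].result()                    # retrieval finishes in the background
    print(get_model("u1", general)["kind"])  # personalized (now cached)

Pre-caching based on predicted usage would amount to submitting the fetch ahead of the predicted time rather than on first contact.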
Abstract:
Incremental speech recognition results are generated and used to determine a user's intent from an utterance. Utterance audio data may be partitioned into multiple portions, and incremental speech recognition results may be generated from one or more of the portions. A natural language understanding module or some other language processing module can generate semantic representations of the utterance from the incremental speech recognition results. The stability of the determined intent may be assessed over time, and actions may be taken once certain stability thresholds are met.
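A sketch of one possible stability test, in Python; the consecutive-agreement rule and the threshold value are assumptions, since the disclosure leaves the stability measure open:

    def stable_intent(incremental_results, nlu, threshold=3):
        """Return an intent once the NLU module yields the same intent for
        `threshold` consecutive incremental recognition results, else None."""
        streak, last = 0, None
        for partial_text in incremental_results:
            intent = nlu(partial_text)  # semantic representation of the partial result
            streak = streak + 1 if intent == last else 1
            last = intent
            if streak >= threshold:
                return intent           # stable enough to act on
        return None

    # Toy NLU: the hypothesized intent settles as more audio is decoded.
    fake_nlu = lambda text: "SetAlarm" if "alarm" in text else "Unknown"
    partials = ["set", "set an", "set an alarm", "set an alarm for", "set an alarm for six"]
    print(stable_intent(partials, fake_nlu))  # SetAlarm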
Abstract:
In an automatic speech recognition (ASR) processing system, ASR processing may be configured to reduce the latency of returning speech results to a user. The latency may be determined by comparing a time stamp of an utterance in process to the current time. Latency may also be estimated based on an endpoint of the utterance or other considerations, such as how difficult the utterance may be to process. To improve latency, the ASR system may be configured to adjust various processing parameters, such as graph pruning factors, path weights, ASR models, etc. Latency checks and corrections may occur dynamically for a particular utterance while it is being processed, thus allowing the ASR system to adapt to rapidly changing latency conditions.
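The dynamic adjustment can be illustrated with a beam-width controller in Python; the latency target, bounds, and step size below are invented values, and a real decoder could adjust path weights and models the same way:

    import time

    def adjust_beam(utterance_start, beam, target_latency=0.5,
                    min_beam=6.0, max_beam=16.0, step=1.0):
        """Tighten graph pruning when the utterance has been in process longer
        than the latency target; relax it again when there is headroom."""
        latency = time.monotonic() - utterance_start  # time stamp vs. current time
        if latency > target_latency:
            return max(min_beam, beam - step)         # prune harder, decode faster
        return min(max_beam, beam + step)             # afford a wider search

    # Called periodically while a single utterance is being decoded:
    start, beam = time.monotonic(), 12.0
    beam = adjust_beam(start, beam)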
Abstract:
Features are disclosed for selecting and using multiple transforms associated with a particular remote device for use in automatic speech recognition (“ASR”). Each transform may be based on statistics that have been generated from processing utterances that share some characteristic (e.g., acoustic characteristics, the time frame within which the utterances were processed, etc.). When an utterance is received from the remote device, a particular transform or set of transforms may be selected for use in speech processing based on data obtained from the remote device, speech processing of a portion of the utterance, speech processing of prior utterances, etc. The transform or transforms used in processing the utterance may then be updated based on the results of the speech processing.
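A rough Python sketch of the selection and update steps; the two named transforms, the 13-dimensional features, and the nearest-mean selection rule are all assumptions for illustration:

    import numpy as np

    # Hypothetical per-device store: each transform carries the mean of the
    # features it was estimated from (e.g., utterances from different times of day).
    transforms = {
        "morning": {"A": np.eye(13),       "mean": np.zeros(13)},
        "evening": {"A": np.eye(13) * 1.1, "mean": np.ones(13)},
    }

    def select_transform(first_pass_features):
        """Pick the transform whose statistics lie closest to the features
        of the utterance portion processed so far."""
        mu = first_pass_features.mean(axis=0)
        return min(transforms.values(), key=lambda t: np.linalg.norm(mu - t["mean"]))

    def update_transform(t, features, rate=0.05):
        """Fold the newly processed utterance back into the transform's statistics."""
        t["mean"] = (1 - rate) * t["mean"] + rate * features.mean(axis=0)

    feats = np.random.default_rng(1).normal(0.9, 0.1, size=(30, 13))
    chosen = select_transform(feats)   # the "evening" statistics are nearest
    update_transform(chosen, feats)    # refine the chosen transform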
Abstract:
Power consumption for a computing device may be managed by one or more keywords. For example, if an audio input obtained by the computing device includes a keyword, a network interface module and/or an application processing module of the computing device may be activated. The audio input may then be transmitted via the network interface module to a remote computing device, such as a speech recognition server. Alternatively, the computing device may be provided with a speech recognition engine configured to process the audio input for on-device speech recognition.
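A toy Python sketch of the activation flow; the module classes and the keyword test are placeholders for the device's actual low-power keyword spotter:

    class Module:
        def __init__(self, name):
            self.name, self.active = name, False

        def activate(self):
            if not self.active:          # leave already-active modules alone
                self.active = True
                print(self.name, "powered up")

    class Device:
        def __init__(self):
            self.network = Module("network interface")
            self.app = Module("application processor")

    def on_audio(frames, contains_keyword, device):
        """Stay in a low-power state until the keyword is detected, then wake
        the network/application modules and hand off the audio."""
        for frame in frames:
            if contains_keyword(frame):
                device.network.activate()
                device.app.activate()
                return frame             # would be streamed to the speech server

    dev = Device()
    on_audio(["music", "computer"], lambda f: f == "computer", dev)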
Abstract:
Features are disclosed for automatically identifying a speaker. Artifacts of automatic speech recognition (“ASR”) and/or other automatically determined information may be processed against individual user profiles or models. Scores may be determined reflecting the likelihood that individual users made an utterance. The scores can be based on, e.g., individual components of Gaussian mixture models (“GMMs”) that score best for frames of audio data of an utterance. A user associated with the highest likelihood score for a particular utterance can be identified as the speaker of the utterance. Information regarding the identified user can be provided to components of a spoken language processing system, separate applications, etc.
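A compact Python sketch of the scoring idea; the diagonal-covariance Gaussians and the single-component user models below are simplifications of the GMMs the disclosure refers to:

    import numpy as np

    def log_gauss(x, mean, var):
        """Log density of a diagonal-covariance Gaussian component."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def score_user(frames, gmm):
        """Sum, over all frames, the log likelihood of the best-scoring component."""
        return sum(max(log_gauss(f, c["mean"], c["var"]) for c in gmm) for f in frames)

    def identify_speaker(frames, user_gmms):
        """Return the user whose model assigns the utterance the highest score."""
        return max(user_gmms, key=lambda u: score_user(frames, user_gmms[u]))

    gmms = {"alice": [{"mean": np.zeros(13),    "var": np.ones(13)}],
            "bob":   [{"mean": np.ones(13) * 3, "var": np.ones(13)}]}
    frames = np.random.default_rng(0).normal(0, 1, size=(50, 13))
    print(identify_speaker(frames, gmms))  # alice (the frames match her model)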
Abstract:
Features are disclosed for transferring speech recognition workloads between pooled execution resources. For example, various parts of an automatic speech recognition engine may be implemented by various pools of servers. Servers in a speech recognition pool may explore a plurality of paths in a graph to find the path that best matches an utterance. A set of active nodes comprising the last node explored in each path may be transferred between servers in the pool depending on resource availability at each server. A history of nodes or arcs traversed in each path may be maintained by a separate pool of history servers, and used to generate text corresponding to the path identified as the best match by the speech recognition servers.
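A skeletal Python sketch of the transfer step; the server records, the load threshold, and the one-node-at-a-time policy are invented for illustration, and path histories are assumed to live with the separate history pool:

    # Each decoding server holds a frontier of active nodes (the last node
    # explored on each path, with its path score) plus a load figure.
    servers = [{"id": 0, "active": {("n42", -13.7), ("n88", -15.2)}, "load": 0.9},
               {"id": 1, "active": set(), "load": 0.2}]

    def rebalance(servers, high=0.8):
        """Move active nodes off overloaded servers; only the frontier travels,
        since the traversal history is kept by the history-server pool."""
        idle = min(servers, key=lambda s: s["load"])
        for s in servers:
            if s["load"] > high and s is not idle and s["active"]:
                node = s["active"].pop()  # a path's last-explored node
                idle["active"].add(node)  # the destination resumes that path

    rebalance(servers)
    print([(s["id"], len(s["active"])) for s in servers])  # [(0, 1), (1, 1)]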