Abstract:
An electronic device with one or more processors and memory trains an acoustic model with an International Phonetic Alphabet (IPA) phoneme mapping collection and audio samples in different languages, where the acoustic model includes: a foreground model; and a background model. The device generates a phone decoder based on the trained acoustic model. The device collects keyword audio samples, decodes the keyword audio samples with the phone decoder to generate phoneme sequence candidates, and selects a keyword phoneme sequence from the phoneme sequence candidates. After obtaining the keyword phoneme sequence, the device detects one or more keywords in an input audio signal with the trained acoustic model, including: matching phonemic keyword portions of the input audio signal with phonemes in the keyword phoneme sequence with the foreground model; and filtering out phonemic non-keyword portions of the input audio signal with the background model.
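A minimal sketch of the foreground/background split described above, assuming random phoneme posteriors in place of real acoustic-model outputs; the keyword_score function, the bg_floor filler probability, and the detection threshold are illustrative stand-ins, not the patented method:

```python
import numpy as np

# Hypothetical per-frame phoneme posteriors; in practice these would come
# from the acoustic model trained on IPA-mapped multilingual audio.
rng = np.random.default_rng(0)
N_FRAMES, N_PHONES = 50, 40
posteriors = rng.dirichlet(np.ones(N_PHONES), size=N_FRAMES)

keyword_phonemes = [3, 17, 22, 9]  # selected keyword phoneme sequence (indices)

def keyword_score(post, keyword, bg_floor=1.0 / N_PHONES):
    """Greedy left-to-right match of the keyword phoneme sequence.

    Frames that advance the keyword are scored by the foreground model
    (the phoneme posterior); all other frames fall back to the background
    model, approximated here by a uniform 'filler' probability.
    """
    idx, log_ratio = 0, 0.0
    for frame in post:
        if idx < len(keyword) and frame[keyword[idx]] > bg_floor:
            log_ratio += np.log(frame[keyword[idx]] / bg_floor)  # foreground
            idx += 1
        # non-keyword frames are absorbed by the background model (ratio 1)
    return log_ratio if idx == len(keyword) else -np.inf

score = keyword_score(posteriors, keyword_phonemes)
print("keyword detected" if score > 5.0 else "no keyword", f"(score={score:.2f})")
```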
Abstract:
This application discloses a method of recognizing a keyword in speech that includes a sequence of audio frames, the sequence further including a current frame and a subsequent frame. A candidate keyword is determined for the current frame using a decoding network that includes keywords and filler words of multiple languages, and is used to determine a confidence score for the audio frame sequence. A word option is also determined for the subsequent frame based on the decoding network, and when the candidate keyword and the word option are associated with two distinct types of languages, the confidence score of the audio frame sequence is updated at least based on a penalty factor associated with the two distinct types of languages. The audio frame sequence is then determined to include both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
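A toy sketch of the cross-language penalty step, assuming made-up decoding-network tokens, per-token scores, and penalty factors; only the scoring mechanics are illustrated:

```python
# When consecutive decoded tokens come from different language types, the
# sequence confidence is scaled by a penalty factor for that language pair.
LANG = {"hello": "en", "shijie": "zh", "world": "en"}  # decoding-network tokens
PENALTY = {("en", "zh"): 0.7, ("zh", "en"): 0.7}       # cross-language factors

def sequence_confidence(tokens, token_scores):
    conf = 1.0
    prev_lang = None
    for tok, score in zip(tokens, token_scores):
        conf *= score
        lang = LANG[tok]
        if prev_lang is not None and lang != prev_lang:
            conf *= PENALTY[(prev_lang, lang)]  # apply cross-language penalty
        prev_lang = lang
    return conf

print(sequence_confidence(["hello", "shijie"], [0.9, 0.8]))  # 0.9 * 0.8 * 0.7
```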
Abstract:
A method of processing information content based on a Chinese language model is performed at a computer, the method including: identifying a plurality of expressions in the information content, the information content extracted from a speech input through speech recognition and queued to be processed; dividing the expressions into a plurality of characteristic units according to semantic features and predetermined characteristics associated with each characteristic unit, each characteristic unit including a subset of the expressions, and the predetermined characteristics at least including a respective integer number of expressions included in the characteristic unit; extracting, from the Chinese language model, a plurality of probabilities for punctuation marks associated with each characteristic unit; and in accordance with the probabilities, associating a respective punctuation mark with each characteristic unit included in the information content. The method further comprises adding punctuation marks based on a weight determined for each punctuation mark.
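A minimal sketch of the final selection step, assuming hypothetical per-unit probability tables and per-mark weights in place of a real Chinese language model:

```python
# Pick a punctuation mark for each characteristic unit by combining
# language-model probabilities with a weight determined for each mark.
UNIT_PUNCT_PROBS = {
    "ni hao": {"，": 0.2, "。": 0.1, "？": 0.7},
    "qing wen xian zai ji dian": {"，": 0.1, "。": 0.2, "？": 0.7},
}
MARK_WEIGHTS = {"，": 1.0, "。": 1.0, "？": 1.2}  # illustrative weights

def punctuate(units):
    out = []
    for unit in units:
        probs = UNIT_PUNCT_PROBS[unit]
        mark = max(probs, key=lambda m: probs[m] * MARK_WEIGHTS[m])
        out.append(unit + mark)
    return "".join(out)

print(punctuate(["ni hao", "qing wen xian zai ji dian"]))
```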
Abstract:
The present invention, pertaining to the field of speech recognition, discloses a reminder setting method and apparatus. The method includes: acquiring speech signals; acquiring time information in the speech signals by using keyword recognition, and determining a reminder time for reminder setting according to the time information; acquiring a text sequence corresponding to the speech signals by using continuous speech recognition, and determining reminder content for reminder setting according to the time information and the text sequence; and setting a reminder according to the reminder time and the reminder content. According to the present invention, acquiring time information in the speech signals by using keyword recognition ensures correct extraction of the time information, so that a correct reminder time can still be acquired by keyword recognition even when the recognized text sequence is incorrect due to poor precision in whole-text recognition.
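A crude sketch of the two-path idea, assuming a regex stands in for keyword recognition of time expressions and a plain transcript stands in for the continuous-recognition output; the pattern and content extraction are illustrative only:

```python
import re
from datetime import datetime

# Toy keyword recognizer for time expressions such as "3 pm".
TIME_KEYWORD = re.compile(r"(\d{1,2})\s*(am|pm)", re.IGNORECASE)

def build_reminder(transcript):
    m = TIME_KEYWORD.search(transcript)
    if not m:
        return None
    hour = int(m.group(1)) % 12 + (12 if m.group(2).lower() == "pm" else 0)
    reminder_time = datetime.now().replace(
        hour=hour, minute=0, second=0, microsecond=0)
    # Reminder content: crudely, the transcript minus the time expression.
    content = TIME_KEYWORD.sub("", transcript).strip()
    return reminder_time, content

print(build_reminder("call the dentist at 3 pm"))
```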
Abstract:
An automatic speech recognition method includes, at a computer having one or more processors and memory for storing one or more programs to be executed by the processors: obtaining a plurality of speech corpus categories by classifying a raw speech corpus; obtaining a plurality of classified language models that respectively correspond to the plurality of speech corpus categories through language model training applied to each speech corpus category; obtaining an interpolation language model by applying a weighted interpolation to each classified language model and merging the interpolated classified language models; constructing a decoding resource in accordance with an acoustic model and the interpolation language model; and decoding input speech using the decoding resource, and outputting the character string with the highest probability as the recognition result of the input speech.
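A minimal sketch of the weighted interpolation step, assuming toy unigram tables and weights; real classified language models would give P(word | history) per category:

```python
# The merged model is a weighted sum of the classified language models.
news_lm   = {"stock": 0.02,  "game": 0.001}
sports_lm = {"stock": 0.001, "game": 0.03}
WEIGHTS = [("news", news_lm, 0.6), ("sports", sports_lm, 0.4)]  # illustrative

def interpolated_prob(word):
    # Unseen words get a small floor probability in this toy version.
    return sum(w * lm.get(word, 1e-6) for _, lm, w in WEIGHTS)

for word in ("stock", "game"):
    print(word, interpolated_prob(word))
```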
Abstract:
The present application discloses a method, an electronic system and a non-transitory computer readable storage medium for recognizing audio commands in an electronic device. The electronic device obtains audio data based on an audio signal provided by a user and extracts characteristic audio fingerprint features from the audio data. The electronic device further determines whether the corresponding audio signal is generated by an authorized user by comparing the characteristic audio fingerprint features with an audio fingerprint model for the authorized user and with a universal background model that represents user-independent audio fingerprint features, respectively. When the corresponding audio signal is generated by the authorized user of the electronic device, an audio command is extracted from the audio data, and an operation is performed according to the audio command.
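A minimal sketch of the accept/reject comparison, assuming two Gaussians stand in for the trained audio fingerprint model and the universal background model (UBM); the features, model parameters, and threshold are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Compare the feature likelihood under the authorized user's model against
# a user-independent universal background model (UBM).
rng = np.random.default_rng(1)
features = rng.normal(0.5, 0.1, size=(20, 4))  # extracted fingerprint features

user_model = multivariate_normal(mean=np.full(4, 0.5), cov=0.02 * np.eye(4))
ubm        = multivariate_normal(mean=np.zeros(4),     cov=np.eye(4))

llr = user_model.logpdf(features).sum() - ubm.logpdf(features).sum()
authorized = llr > 0.0  # threshold is illustrative
print(f"log-likelihood ratio = {llr:.1f}, authorized = {authorized}")
```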
Abstract:
A computer-implemented method is performed at a server having one or more processors and memory storing programs executed by the one or more processors for authenticating a user from video and audio data. The method includes: receiving a login request from a mobile device, the login request including video data and audio data; extracting a group of facial features from the video data; extracting a group of audio features from the audio data and recognizing a sequence of words in the audio data; and identifying a first user account whose respective facial features match the group of facial features and a second user account whose respective audio features match the group of audio features. If the first user account is the same as the second user account, the method retrieves a sequence of words associated with the user account and compares it with the recognized sequence of words for authentication purposes.
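A toy sketch of the account-matching step, assuming nearest-neighbor lookups over made-up enrolled feature vectors; real systems would use trained face and speaker models rather than Euclidean distance:

```python
import numpy as np

# Find the account nearest the login's facial features, do the same for
# audio features, and proceed only when both lookups agree.
accounts = {
    "alice": {"face": np.array([0.9, 0.1]), "voice": np.array([0.2, 0.8]),
              "passphrase": "open sesame"},
    "bob":   {"face": np.array([0.1, 0.9]), "voice": np.array([0.7, 0.3]),
              "passphrase": "hello world"},
}

def nearest(query, key):
    return min(accounts, key=lambda a: np.linalg.norm(accounts[a][key] - query))

face_q, voice_q = np.array([0.85, 0.15]), np.array([0.25, 0.75])
spoken = "open sesame"  # words recognized from the audio data
first, second = nearest(face_q, "face"), nearest(voice_q, "voice")
if first == second and accounts[first]["passphrase"] == spoken:
    print(f"authenticated as {first}")
else:
    print("authentication failed")
```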
Abstract:
Systems and methods are provided for speech recognition. For example, audio characteristics are extracted from acquired voice signals; a syllable confusion network is identified based on at least information associated with the audio characteristics; a word lattice is generated based on at least information associated with the syllable confusion network and a predetermined phonetic dictionary; and an optimal character sequence is calculated from the word lattice as the speech recognition result.
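A toy sketch of the pipeline, assuming a two-slot syllable confusion network, a tiny phonetic dictionary, and exhaustive path enumeration in place of a real lattice search; all entries are illustrative:

```python
# Each slot of the confusion network holds scored syllable alternatives;
# the dictionary maps syllable sequences to words; the best-scoring path
# through the resulting word lattice is the recognition result.
confusion_net = [
    [("ni", 0.9), ("li", 0.1)],
    [("hao", 0.8), ("gao", 0.2)],
]
dictionary = {("ni", "hao"): "你好", ("li", "gao"): "李高"}

def best_sequence(net, lexicon):
    best, best_score = None, 0.0
    # Enumerate lattice paths (fine for a toy two-slot network).
    for s1, p1 in net[0]:
        for s2, p2 in net[1]:
            word = lexicon.get((s1, s2))
            if word and p1 * p2 > best_score:
                best, best_score = word, p1 * p2
    return best, best_score

print(best_sequence(confusion_net, dictionary))  # ('你好', 0.72)
```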
Abstract:
A method and a device for training a DNN model are disclosed. The method includes, at a device including one or more processors and memory: establishing an initial DNN model; dividing a training data corpus into a plurality of disjoint data subsets; for each of the plurality of disjoint data subsets, providing the data subset to a respective training processing unit of a plurality of training processing units operating in parallel, wherein the respective training processing unit applies a Stochastic Gradient Descent (SGD) process to update the initial DNN model and generate a respective DNN sub-model based on the data subset; and merging the respective DNN sub-models generated by the plurality of training processing units to obtain an intermediate DNN model, wherein the intermediate DNN model is established as either the initial DNN model for a next training iteration or a final DNN model, in accordance with a preset convergence condition.
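A minimal sketch of one such iteration, assuming a linear model stands in for the DNN, the parallel workers are simulated sequentially, and the sub-models are merged by simple averaging; the data, learning rate, and fixed iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = rng.normal(size=8)
X = rng.normal(size=(400, 8))
y = X @ w_true + 0.1 * rng.normal(size=400)
model = np.zeros(8)  # initial "DNN" parameters

def sgd(w, xs, ys, lr=0.01, epochs=3):
    """Run SGD from the shared initial model on one disjoint data subset."""
    w = w.copy()
    for _ in range(epochs):
        for x, t in zip(xs, ys):
            w -= lr * (w @ x - t) * x  # gradient of squared error
    return w

for it in range(5):  # iterate until a convergence condition would be met
    subsets = np.array_split(np.arange(len(X)), 4)  # disjoint data subsets
    sub_models = [sgd(model, X[idx], y[idx]) for idx in subsets]
    model = np.mean(sub_models, axis=0)  # merge into the intermediate model
    loss = np.mean((X @ model - y) ** 2)
    print(f"iter {it}: loss={loss:.4f}")
```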