Abstract:
A method of recognizing speech commands includes generating a background acoustic model for a sound using a first sound sample, the background acoustic model characterized by a first precision metric. A foreground acoustic model is generated for the sound using a second sound sample, the foreground acoustic model characterized by a second precision metric. A third sound sample is received and decoded by assigning a weight to the third sound sample corresponding to a probability that the sound sample originated in a foreground using the foreground acoustic model and the background acoustic model. The method further includes determining if the weight meets predefined criteria for assigning the third sound sample to the foreground and, when the weight meets the predefined criteria, interpreting the third sound sample as a portion of a speech command. Otherwise, recognition of the third sound sample as a portion of a speech command is forgone.
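To illustrate the decoding step, a minimal sketch follows, assuming each acoustic model is a single one-dimensional Gaussian; the abstract does not specify the model family, so the model shapes, the equal-prior softmax, and the 0.5 threshold are all illustrative:

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of x under a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def foreground_weight(sample, fg_model, bg_model):
    """Posterior probability that the sample originated in the foreground."""
    log_fg = gaussian_log_likelihood(sample, *fg_model)
    log_bg = gaussian_log_likelihood(sample, *bg_model)
    return 1.0 / (1.0 + math.exp(log_bg - log_fg))   # softmax over the two models

FOREGROUND_THRESHOLD = 0.5   # hypothetical predefined criterion

def decode(sample, fg_model, bg_model):
    if foreground_weight(sample, fg_model, bg_model) >= FOREGROUND_THRESHOLD:
        return "interpret as a portion of a speech command"
    return "forgo recognition"

fg, bg = (0.0, 1.0), (3.0, 4.0)   # (mean, variance) pairs, illustrative values
print(decode(0.2, fg, bg))        # -> interpret as a portion of a speech command
```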
Abstract:
A method and device for voiceprint recognition include: establishing a first-level Deep Neural Network (DNN) model based on unlabeled speech data, the unlabeled speech data containing no speaker labels and the first-level DNN model specifying a plurality of basic voiceprint features for the unlabeled speech data; obtaining a plurality of high-level voiceprint features by tuning the first-level DNN model based on labeled speech data, the labeled speech data containing speech samples with respective speaker labels, and the tuning producing a second-level DNN model specifying the plurality of high-level voiceprint features; based on the second-level DNN model, registering a respective high-level voiceprint feature sequence for a user based on a registration speech sample received from the user; and performing speaker verification for the user based on the respective high-level voiceprint feature sequence registered for the user.
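A minimal sketch of the registration and verification stages follows, assuming the second-level DNN is available as a simple feature-extractor function; the toy one-layer "DNN", the mean-pooled voiceprint, and the 0.8 cosine threshold are illustrative stand-ins, not the method's actual model:

```python
import numpy as np

def extract_high_level_features(frames, dnn):
    """Stand-in for the second-level DNN forward pass: one feature vector per frame."""
    return np.tanh(frames @ dnn["W"] + dnn["b"])

def register(frames, dnn):
    """Register a user's voiceprint as the mean high-level feature vector."""
    return extract_high_level_features(frames, dnn).mean(axis=0)

def verify(frames, registered, dnn, threshold=0.8):
    """Accept the speaker if cosine similarity to the registered voiceprint is high."""
    probe = extract_high_level_features(frames, dnn).mean(axis=0)
    cos = probe @ registered / (np.linalg.norm(probe) * np.linalg.norm(registered))
    return cos >= threshold

rng = np.random.default_rng(0)
dnn = {"W": rng.normal(size=(13, 32)), "b": np.zeros(32)}  # toy "second-level" model
enroll = rng.normal(size=(50, 13))     # registration speech sample (MFCC-like frames)
print(verify(enroll + 0.01, register(enroll, dnn), dnn))   # -> True
```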
Abstract:
A parallel data processing method based on multiple graphics processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, each worker group including one or more GPUs; binding each worker thread to a corresponding GPU; loading a plurality of batches of training data from a nonvolatile memory to GPU video memories in the plurality of worker groups; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads. The method can enhance the efficiency of multi-GPU parallel data processing. In addition, a parallel data processing apparatus is further provided.
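A minimal sketch of the CPU-side control flow follows, with Python threads standing in for the worker threads and per-thread queues standing in for worker-group dispatch; real GPU binding (e.g., selecting a CUDA device per thread) and video-memory transfers are outside this illustration:

```python
import threading, queue

NUM_GPUS = 4

def worker(gpu_id, batches):
    """One worker thread, 'bound' to gpu_id, processes batches from its queue."""
    while True:
        batch = batches.get()
        if batch is None:              # sentinel: no more training data
            break
        _ = [x * x for x in batch]     # stand-in for processing on this GPU
        batches.task_done()

queues = [queue.Queue() for _ in range(NUM_GPUS)]
threads = [threading.Thread(target=worker, args=(g, queues[g]))
           for g in range(NUM_GPUS)]
for t in threads:
    t.start()

# Load batches of training data and dispatch them round-robin to worker groups.
for i, batch in enumerate([list(range(8))] * 16):
    queues[i % NUM_GPUS].put(batch)
for q in queues:
    q.put(None)                        # tell each worker to stop
for t in threads:
    t.join()
```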
Abstract:
A method, system, and computer storage medium for visual searching based on a cloud service are disclosed. The method includes: receiving, from a client, an image recognition request for the cloud service, the request containing image data; forwarding, according to a set classified forwarding rule, the image data to a corresponding classified visual search service; recognizing, by each corresponding classified visual search service, the corresponding classified type information in the image data, determining a corresponding name of the image data in accordance with the respective classified type information, and obtaining a classified visual search result; and summarizing and sending, to the client, the classified visual search results of the corresponding classified visual search services. By detecting and recognizing the classified type information of the image data, comprehensive feature information of a picture is obtained, on which further applications can be built, thus improving the user experience.
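A minimal sketch of the forwarding-and-summarizing flow follows; the service names, the forwarding rule, and the recognizer bodies are hypothetical placeholders for the classified visual search services:

```python
def recognize_face(image_data):
    return {"type": "face", "name": "person_123"}   # stand-in recognition result

def recognize_logo(image_data):
    return {"type": "logo", "name": "brand_abc"}

FORWARDING_RULE = {          # classified forwarding rule: category -> service
    "face": recognize_face,
    "logo": recognize_logo,
}

def handle_request(image_data, categories):
    """Forward the image to each classified visual search service, then summarize."""
    results = [FORWARDING_RULE[c](image_data)
               for c in categories if c in FORWARDING_RULE]
    return {"image": image_data, "results": results}   # summary sent to the client

print(handle_request("img_bytes", ["face", "logo"]))
```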
Abstract:
A method and a device for training an acoustic language model include: conducting word segmentation for training samples in a training corpus using an initial language model containing no word class labels, to obtain initial word segmentation data containing no word class labels; performing word class replacement for the initial word segmentation data containing no word class labels, to obtain first word segmentation data containing word class labels; using the first word segmentation data containing word class labels to train a first language model containing word class labels; using the first language model containing word class labels to conduct word segmentation for the training samples in the training corpus, to obtain second word segmentation data containing word class labels; and in accordance with the second word segmentation data meeting one or more predetermined criteria, using the second word segmentation data containing word class labels to train the acoustic language model.
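A minimal sketch of the bootstrapping loop follows, with toy stand-ins for every step (a "model" here is just a token set and segmentation is a whitespace split); only the order of the steps comes from the abstract, none of the function bodies do:

```python
WORD_CLASSES = {"beijing": "<CITY>", "shanghai": "<CITY>", "monday": "<DAY>"}

def segment(corpus, model):
    """Toy segmentation: split on whitespace (a real system consults the model)."""
    return [line.split() for line in corpus]

def replace_with_classes(segmented):
    """Word class replacement: swap known words for their class labels."""
    return [[WORD_CLASSES.get(w, w) for w in line] for line in segmented]

def train_language_model(segmented):
    """Toy 'training': the resulting model is just the set of observed tokens."""
    return {w for line in segmented for w in line}

def meets_criteria(segmented):
    return all(line for line in segmented)   # hypothetical predetermined criterion

corpus = ["fly to beijing on monday", "train to shanghai"]
initial_model = set()                                    # contains no class labels
first_data = replace_with_classes(segment(corpus, initial_model))
first_model = train_language_model(first_data)           # contains class labels
second_data = replace_with_classes(segment(corpus, first_model))
acoustic_lm = (train_language_model(second_data)
               if meets_criteria(second_data) else None)
print(sorted(acoustic_lm))
```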
Abstract:
A server system with one or more processors and memory obtains, from a client device, a card image which includes an image of a card, and identifies a card configuration type corresponding to the card in the card image based on a database of stored card configuration types. Each stored card configuration type in the database is associated with layout information regarding respective features and information regions for the stored card configuration type. In accordance with the identified card configuration type, the server system determines one or more information regions of the card image containing respective card information of the card. The server system extracts at least a portion of the card information of the card from the one or more information regions of the card image and transmits, to the client device, at least the extracted portion of the card information.
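A minimal sketch of the layout-driven extraction follows, assuming the card image is a plain 2-D array and the database maps each configuration type to rectangular information regions; the matcher and the region coordinates are illustrative:

```python
CARD_DB = {   # stored card configuration types with layout information
    "bank_card_v1": {"regions": {"number": (10, 40, 5, 25), "name": (50, 70, 5, 25)}},
    "membership_v2": {"regions": {"number": (60, 80, 5, 25)}},
}

def identify_config_type(card_image):
    """Stand-in matcher: a real system would compare visual features and layout."""
    return "bank_card_v1"

def extract_card_info(card_image):
    config = CARD_DB[identify_config_type(card_image)]
    info = {}
    for field, (top, bottom, left, right) in config["regions"].items():
        crop = [row[left:right] for row in card_image[top:bottom]]
        info[field] = crop             # a real system would run OCR on this region
    return info                        # transmitted back to the client device

card_image = [[0] * 100 for _ in range(100)]   # placeholder 100x100 image
print(sorted(extract_card_info(card_image)))   # -> ['name', 'number']
```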
Abstract:
This application discloses a method of recognizing a keyword in speech that includes a sequence of audio frames, the sequence further including a current frame and a subsequent frame. A candidate keyword is determined for the current frame using a decoding network that includes keywords and filler words of multiple languages, and the candidate keyword is used to determine a confidence score for the audio frame sequence. A word option is also determined for the subsequent frame based on the decoding network, and when the candidate keyword and the word option are associated with two distinct types of languages, the confidence score of the audio frame sequence is updated at least based on a penalty factor associated with the two distinct types of languages. The audio frame sequence is then determined to include both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
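A minimal sketch of the cross-language penalty update follows; the language pair, the 0.7 penalty factor, and the 0.5 determination threshold are illustrative values, not taken from the application:

```python
PENALTY_FACTORS = {("mandarin", "english"): 0.7, ("english", "mandarin"): 0.7}

def update_confidence(confidence, keyword_lang, option_lang):
    """Penalize the sequence score when keyword and word option differ in language."""
    return confidence * PENALTY_FACTORS.get((keyword_lang, option_lang), 1.0)

confidence = 0.9                         # score after decoding the current frame
confidence = update_confidence(confidence, "mandarin", "english")
KEYWORD_CRITERION = 0.5                  # hypothetical determination criterion
print(confidence >= KEYWORD_CRITERION)   # ~0.63 -> True: sequence contains both
```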
Abstract:
A biometric-based authentication method, apparatus, and system are described. The method includes: receiving a biometric image to be authenticated sent from a client; performing feature extraction on the biometric image to be authenticated to obtain a biometric template to be authenticated; comparing the biometric template to be authenticated with a locally stored biometric template; and returning an authentication result. Because the feature extraction process may be implemented at the cloud server side, the complexity of the client may be reduced, the extensibility of the client may be increased, the limitation that biometric recognition can only be implemented on the client may be eliminated, and diversified uses may be supported.
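A minimal sketch of the server-side comparison follows, with a hash standing in for real feature extraction and an exact match standing in for similarity scoring; real biometric templates are compared fuzzily, so both simplifications are mine:

```python
import hashlib

STORED_TEMPLATES = {}                 # user_id -> locally stored biometric template

def extract_template(biometric_image: bytes) -> str:
    """Stand-in feature extraction, performed at the cloud server side."""
    return hashlib.sha256(biometric_image).hexdigest()

def enroll(user_id, biometric_image):
    STORED_TEMPLATES[user_id] = extract_template(biometric_image)

def authenticate(user_id, biometric_image) -> bool:
    """Compare the template to be authenticated with the stored template."""
    probe = extract_template(biometric_image)
    return STORED_TEMPLATES.get(user_id) == probe   # the authentication result

enroll("alice", b"fingerprint-image-bytes")
print(authenticate("alice", b"fingerprint-image-bytes"))   # -> True
print(authenticate("alice", b"someone-else"))              # -> False
```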
Abstract:
A method and device for communicating a video with a simulation image are provided. The method includes: acquiring, by a sender, video data, transforming the acquired video data into vector data using an image recognition algorithm, and sending the vector data to a receiver; and calling, by the receiver, a cartoon rendering model and rendering the received vector data in the video as a corresponding cartoon simulation image according to the cartoon rendering model. By using the present invention, the amount of data transmitted over the network may be reduced, and network bandwidth resources are saved.
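A minimal sketch of the sender/receiver exchange follows, assuming the "vector data" is a handful of recognition parameters and the cartoon rendering model a lookup from expression to a pre-drawn sprite; both choices are illustrative:

```python
def extract_vector_data(frame):
    """Sender side: reduce a video frame to a few recognition parameters."""
    return {"mouth_open": 0.8, "eyes_open": 1.0}   # stand-in landmark values

def render_cartoon(vector_data, model):
    """Receiver side: map the vector data to a cartoon simulation image."""
    return model["smile"] if vector_data["mouth_open"] > 0.5 else model["neutral"]

CARTOON_MODEL = {"smile": "smile.png", "neutral": "neutral.png"}
sent = extract_vector_data(frame="raw-frame-bytes")   # far smaller than the frame
print(render_cartoon(sent, CARTOON_MODEL))            # -> smile.png
```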
Abstract:
A method of presenting interactive content at a client device is disclosed. The client device records, in real-time, an audio stream of a piece of multimedia content broadcast by a content display device and sends an audio fingerprint of the piece of the multimedia content to a server. The server then determines, based on the audio fingerprint, an identifier of the piece of multimedia content, retrieves, based on the identifier of the piece of multimedia content, interactive content associated with the piece of multimedia content, and returns the interactive content associated with the piece of multimedia content to the client device. After receiving, from the server, the interactive content associated with the piece of multimedia content, the client device renders the interactive content to the user of the client device.
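A minimal sketch of the server-side lookup follows, with a hash standing in for a robust audio fingerprint and in-memory dictionaries standing in for the server's databases; all identifiers here are hypothetical:

```python
import hashlib

FINGERPRINT_TO_ID = {}      # audio fingerprint -> multimedia content identifier
INTERACTIVE_CONTENT = {"show_42": {"poll": "Who wins tonight?"}}

def fingerprint(audio: bytes) -> str:
    """Stand-in for a real, noise-robust audio fingerprint."""
    return hashlib.md5(audio).hexdigest()

def register_broadcast(audio, content_id):
    FINGERPRINT_TO_ID[fingerprint(audio)] = content_id

def handle_client_request(audio_clip):
    """Resolve the recorded clip to its content, then return the interactive content."""
    content_id = FINGERPRINT_TO_ID.get(fingerprint(audio_clip))
    return INTERACTIVE_CONTENT.get(content_id)

register_broadcast(b"broadcast-audio", "show_42")
print(handle_client_request(b"broadcast-audio"))   # rendered at the client device
```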