Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for obtaining, for each of multiple words or sub-words, audio data corresponding to multiple users speaking the word or sub-word; training, for each of the multiple words or sub-words, a pre-computed hotword model for the word or sub-word based on the audio data for the word or sub-word; receiving a candidate hotword from a computing device; identifying one or more pre-computed hotword models that correspond to the candidate hotword; and providing the identified, pre-computed hotword models to the computing device.
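The training-and-serving flow described above can be sketched as follows. This is an illustrative stand-in, not the patented implementation: "training" is reduced to averaging per-user feature vectors, and all names (`train_models`, `models_for_candidate`) are invented.

```python
def train_models(audio_by_word):
    """Train a pre-computed model per word/sub-word from multi-user audio.

    Averaging per-user feature vectors stands in for training here; a real
    system would fit an acoustic model per word or sub-word instead.
    """
    models = {}
    for word, utterances in audio_by_word.items():
        # utterances: one feature vector (list of floats) per user
        n = len(utterances)
        dim = len(utterances[0])
        models[word] = [sum(u[i] for u in utterances) / n for i in range(dim)]
    return models

def models_for_candidate(models, candidate):
    """Identify pre-computed models corresponding to a candidate hotword.

    Matches the whole word first, else falls back to known sub-words
    contained in the candidate, and returns the matching models.
    """
    if candidate in models:
        return {candidate: models[candidate]}
    return {w: m for w, m in models.items() if w in candidate}
```

A device requesting the candidate "okcomputer" would then receive the pre-computed "ok" and "computer" models rather than training a new model locally.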
Abstract:
The technology described herein can be embodied in a method that includes receiving a first signal representing an output of a speaker device, and a second signal comprising the output of the speaker device, and an audio signal corresponding to an utterance of a speaker. The method includes aligning one or more segments of the first signal with one or more segments of the second signal. Acoustic features of the one or more segments of the first and second signals are classified to obtain a first set of vectors and a second set of vectors, respectively, the vectors being associated with speech units. The second set is modified using the first set, such that the modified second set represents a suppression of the output of the speaker device in the second signal. A transcription of the utterance of the speaker can be generated from the modified second set of vectors.
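The alignment and suppression steps can be sketched as below. This is a simplified assumption-laden sketch: frames are plain number lists, alignment is a brute-force lag search, and the "modify the second set using the first set" step is rendered as a clamped, renormalized subtraction of per-frame posterior vectors, which the abstract does not specify.

```python
def best_lag(ref, sig, max_lag):
    """Align a segment of sig to ref by choosing the non-negative lag
    that minimises the mean squared difference between them."""
    def mse(lag):
        pairs = [(ref[i], sig[i + lag]) for i in range(len(ref) - lag)]
        return sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return min(range(max_lag + 1), key=mse)

def suppress_playback(playback_vecs, mic_vecs, alpha=1.0):
    """Modify the microphone posteriors using the playback posteriors,
    suppressing the speaker-device output: subtract, clamp at zero, and
    renormalize each frame's vector over the speech units."""
    out = []
    for p, m in zip(playback_vecs, mic_vecs):
        v = [max(mi - alpha * pi, 0.0) for pi, mi in zip(p, m)]
        s = sum(v)
        out.append([x / s for x in v] if s > 0 else list(m))
    return out
```

The modified vectors would then feed an ordinary decoder to transcribe the user's utterance.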
Abstract:
Systems and methods are provided herein relating to audio matching. Descriptors can be generated based on anchor points and interest points that characterize the local neighborhood surrounding the anchor point. Characterizing the local spectrogram neighborhood surrounding anchor points can be more robust to pitch shift distortions and time stretch distortions. Anchor points whose neighborhoods show either a lack of spectral activity or uniformly even spectral activity can be filtered from further examination. Using these pitch shift and time stretch resistant audio features within descriptors can provide for more accurate and efficient audio matching.
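The two ideas above, relative-offset descriptors and anchor filtering, can be sketched as follows. The descriptor layout and the energy-based filter are invented for illustration; the patent's actual format is not reproduced here.

```python
def local_descriptor(anchor, interest_points, radius):
    """Describe an anchor's local neighborhood as the (time, frequency)
    offsets of nearby interest points, expressed relative to the anchor
    so the descriptor is unchanged when the whole pattern shifts."""
    t0, f0 = anchor
    nearby = [(t - t0, f - f0) for t, f in interest_points
              if abs(t - t0) <= radius and abs(f - f0) <= radius]
    return tuple(sorted(nearby))

def keep_anchor(neighborhood_energy, min_total, min_var):
    """Filter out anchors whose neighborhood has too little spectral
    activity, or activity so uniformly even that it carries no
    distinctive structure (approximated by total energy and variance)."""
    total = sum(neighborhood_energy)
    mean = total / len(neighborhood_energy)
    var = sum((e - mean) ** 2 for e in neighborhood_energy) / len(neighborhood_energy)
    return total >= min_total and var >= min_var
```

Because the descriptor stores only offsets, two occurrences of the same local pattern at different absolute times and frequencies produce identical descriptors, which is what makes matching tolerant of global shifts.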
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for designating certain voice commands as hotwords. The methods, systems, and apparatus include actions of receiving a hotword followed by a voice command. Additional actions include determining that the voice command satisfies one or more predetermined criteria associated with designating the voice command as a hotword, where a voice command that is designated as a hotword is treated as a voice input regardless of whether the voice command is preceded by another hotword. Further actions include, in response to determining that the voice command satisfies one or more predetermined criteria associated with designating the voice command as a hotword, designating the voice command as a hotword.
Abstract:
Systems and methods are provided herein relating to interactive gaming within a media sharing service. Game data, such as sets of notes extracted from the audio track of user generated videos or from audio samples, can be generated based on videos containing musical content or from audio content. A device can use the game data to facilitate an interactive game during playback of the user generated videos or audio samples. Players can press buttons, for example, corresponding to notes as the video with musical content is played within the game interface. Players can be scored for accuracy, and can play with other players in a multiplayer environment. In this sense, user generated video content or audio content can be transformed and used within a gaming interface to increase interaction and engagement between users in a media sharing service.
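The note-extraction and scoring steps can be sketched as below. The timing window, point values, and minimum note gap are invented parameters, and onset detection is assumed to have already happened upstream.

```python
def extract_note_times(onsets, min_gap=0.1):
    """Derive game-note timestamps from detected audio onsets, dropping
    onsets that fall too close together to be playable."""
    notes = []
    for t in sorted(onsets):
        if not notes or t - notes[-1] >= min_gap:
            notes.append(t)
    return notes

def score_presses(note_times, press_times, window=0.15, points=100):
    """Award points for each note hit by a button press within the
    timing window; each press may claim at most one note."""
    score = 0
    presses = sorted(press_times)
    for note in note_times:
        for i, p in enumerate(presses):
            if abs(p - note) <= window:
                score += points
                presses.pop(i)
                break
    return score
```

In a multiplayer setting, each player's press stream would be scored independently against the same extracted note track.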
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on a voice profile. In one aspect, a method includes the actions of receiving audio data corresponding to an utterance spoken by a particular user. The actions further include generating a voice profile for the particular user using at least a portion of the audio data. The actions further include determining in the audio data a beginning point or an ending point of the utterance based at least in part on the voice profile for the particular user. The actions further include based on the beginning point, the ending point, or both the beginning point and the ending point, outputting data indicating the utterance.
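The profile-based endpointing above can be sketched as follows, under heavy simplifying assumptions: frames are plain embedding vectors, the voice profile is their average over an initial portion of the audio, and cosine similarity against the profile stands in for a real speaker model.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def voice_profile(frames):
    """Build a voice profile by averaging frame embeddings taken from a
    portion of the received audio data."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / n for i in range(dim)]

def endpoints(frames, profile, threshold=0.8):
    """Determine the beginning and ending points of the utterance as the
    first and last frame indices that match the voice profile."""
    matches = [i for i, f in enumerate(frames)
               if cosine(f, profile) >= threshold]
    if not matches:
        return None
    return matches[0], matches[-1]
```

The returned index pair would then delimit the audio that is output as the particular user's utterance, ignoring frames from other speakers or background noise.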
Abstract:
A method for displaying an aggregate count of endorsements is provided, including the following method operations: processing a request for an online resource from a mobile device, the online resource being associated with an object, the online resource including an endorsement mechanism; sending the online resource to the mobile device; processing an input from a user triggering the endorsement mechanism, to define an endorsement of the object by the user; updating an aggregate count of endorsements of the object to include the endorsement of the object by the user; and sending the updated aggregate count of endorsements to a social display device for display thereon.
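The server-side flow can be sketched as below. The class, its method names, and the once-per-user rule are invented for illustration; the abstract does not specify deduplication or storage.

```python
class EndorsementService:
    def __init__(self):
        self.counts = {}     # object_id -> aggregate endorsement count
        self.endorsers = {}  # object_id -> set of endorsing user ids

    def serve_resource(self, object_id):
        """Return the online resource for an object, including its
        endorsement mechanism and current aggregate count."""
        return {"object": object_id,
                "endorse_url": f"/endorse/{object_id}",
                "count": self.counts.get(object_id, 0)}

    def endorse(self, object_id, user_id):
        """Record the user's endorsement (at most once per user) and
        return the updated aggregate count for display."""
        users = self.endorsers.setdefault(object_id, set())
        if user_id not in users:
            users.add(user_id)
            self.counts[object_id] = self.counts.get(object_id, 0) + 1
        return self.counts[object_id]
```

The value returned by `endorse` is what would be pushed to the display device to refresh the shown aggregate count.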
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving (i) audio data that encodes a spoken natural language query, and (ii) environmental audio data, obtaining a transcription of the spoken natural language query, determining a particular content type associated with one or more keywords in the transcription, providing at least a portion of the environmental audio data to a content recognition engine, and identifying a content item that has been output by the content recognition engine, and that matches the particular content type.
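The type-constrained identification step can be sketched as follows. The keyword-to-type mapping and the shape of the recognition results are assumptions; transcription and content recognition are treated as upstream black boxes.

```python
# Illustrative keyword -> content-type mapping; not the patent's.
KEYWORD_TYPES = {"song": "music", "singing": "music",
                 "movie": "film", "show": "tv"}

def content_type_for(transcription):
    """Determine a content type from keywords in the transcribed
    spoken natural language query."""
    for word in transcription.lower().split():
        if word in KEYWORD_TYPES:
            return KEYWORD_TYPES[word]
    return None

def identify(transcription, recognition_results):
    """Identify a content item output by the recognition engine that
    matches the query's content type.

    recognition_results: list of (name, content_type) pairs, as if
    produced by running a recognition engine on the environmental audio.
    """
    wanted = content_type_for(transcription)
    for name, ctype in recognition_results:
        if wanted is None or ctype == wanted:
            return name
    return None
```

So the query "what song is this" filters the engine's candidates down to music, even if a film matched the environmental audio more strongly.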
Abstract:
Systems and methods are provided for suggesting actions for selected text based on content displayed on a mobile device. An example method can include converting a selection made via a display device into a query, providing the query to an action suggestion model that is trained to predict an action given a query, each action being associated with a mobile application, receiving one or more predicted actions, and initiating display of the one or more predicted actions on the display device. Another example method can include identifying, from search records, queries where a website is highly ranked, the website being one of a plurality of websites in a mapping of websites to mobile applications. The method can also include generating positive training examples for an action suggestion model from the identified queries, and training the action suggestion model using the positive training examples.
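The training-example generation step in the second example method can be sketched as below. The record format, the website-to-app mapping shape, and the top-k cutoff used to decide "highly ranked" are all assumptions.

```python
def positive_examples(search_records, site_to_app, top_k=3):
    """Generate positive (query, app) training examples.

    search_records: list of (query, ranked_urls) pairs from search logs.
    site_to_app: mapping of website domains to mobile applications.
    A query becomes a positive example for an app whenever one of the
    app's mapped websites appears among the query's top-k results.
    """
    examples = []
    for query, ranked_urls in search_records:
        for url in ranked_urls[:top_k]:
            for site, app in site_to_app.items():
                if site in url:
                    examples.append((query, app))
    return examples
```

The resulting pairs would then train the action suggestion model so that, at prediction time, a text selection converted into a similar query surfaces the mapped application as a suggested action.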