-
Publication No.: US20240354053A1
Publication Date: 2024-10-24
Application No.: US18760866
Filing Date: 2024-07-01
Applicant: Gracenote, Inc.
Abstract: Methods, apparatus, systems, and articles of manufacture are disclosed for dynamic volume adjustment via audio classification. An example apparatus includes at least one memory; instructions; and at least one processor to execute the instructions to: analyze, with a neural network, a parameter of an audio signal associated with a first volume level to determine a classification group associated with the audio signal; determine an input volume of the audio signal; determine a classification gain value based on the classification group; determine an intermediate gain value as an intermediate between the input volume and the classification gain value by applying a first weight to the input volume and a second weight to the classification gain value; apply the intermediate gain value to the audio signal, the intermediate gain value to modify the first volume level to a second volume level; and apply a compression value to the audio signal, the compression value to modify the second volume level to a third volume level that satisfies a target volume threshold.
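The gain-blending step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the class-to-gain table, the blend weights, and the compressor threshold/ratio here are all hypothetical values.

```python
# Hypothetical per-class target gains in dB (illustrative values only).
CLASS_GAIN_DB = {"speech": 6.0, "music": 0.0, "effects": -3.0}

def blended_gain_db(input_volume_db, classification, w_input=0.4, w_class=0.6):
    """Blend the measured input volume with the class gain using two weights."""
    class_gain = CLASS_GAIN_DB[classification]
    return w_input * input_volume_db + w_class * class_gain

def compress(level_db, threshold_db=-10.0, ratio=4.0):
    """Simple downward compressor: attenuate the portion above the threshold."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

level = -20.0                              # first volume level, dB
gain = blended_gain_db(level, "speech")    # intermediate gain value
second = level + gain                      # second volume level
third = compress(second)                   # third volume level
```

The two weights let the system trade off trusting the measured loudness against trusting the classifier's per-class target; the final compression stage caps the result against the target threshold.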
-
Publication No.: US12119022B2
Publication Date: 2024-10-15
Application No.: US17536673
Filing Date: 2021-11-29
Applicant: Rishi Amit Sinha , Ria Sinha
Inventor(s): Rishi Amit Sinha , Ria Sinha
IPC Classes: G10L25/63 , G10L21/0208 , G10L25/30 , H04L67/55
CPC Classes: G10L25/63 , G10L21/0208 , G10L25/30 , H04L67/55
Abstract: Systems and methods used in a cognitive assistant for detecting human emotions from speech audio signals are described. The system obtains audio signals from an audio receiver and extracts human speech samples. Subsequently, it runs a machine-learning-based classifier to analyze the human speech signal and classify the emotion observed in it. The user is then notified, based on their preferences, with a summary of the emotion detected. Notifications can also be sent to other systems that have been configured to receive them. Optionally, the system may include the ability to store the speech sample and detected emotion classification for future analysis. The system's machine learning classifier is periodically re-trained on labelled audio speech data and updated.
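The per-sample flow (classify, optionally store for retraining, notify per preferences) can be sketched as below. The rule-based classifier is a toy stand-in for the trained model, and the feature tuple and preference fields are assumptions, not the patent's design.

```python
from dataclasses import dataclass

def classify_emotion(features):
    """Toy stand-in for the trained classifier: high energy plus high
    pitch maps to 'excited', everything else to 'calm'."""
    energy, pitch = features
    return "excited" if energy > 0.5 and pitch > 200 else "calm"

@dataclass
class Preferences:
    notify: bool = True

def process_sample(features, prefs, store=None):
    """Classify one speech sample, optionally retain it, and build a summary."""
    emotion = classify_emotion(features)
    if store is not None:          # optional retention for periodic retraining
        store.append((features, emotion))
    summary = f"Detected emotion: {emotion}" if prefs.notify else None
    return emotion, summary
```

The optional `store` list models the abstract's retained samples that later feed the periodic re-training on labelled data.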
-
Publication No.: US12119012B2
Publication Date: 2024-10-15
Application No.: US17353636
Filing Date: 2021-06-21
Inventor(s): Na Xu , Yongtao Jia , Linzhang Wang
IPC Classes: G10L21/007 , G10L21/0272 , G10L25/30 , G10L25/51
CPC Classes: G10L21/007 , G10L21/0272 , G10L25/30 , G10L25/51
Abstract: The present disclosure relates to a method and an apparatus for audio processing, and a storage medium. The method includes: obtaining an audio mixing feature of a target object, the audio mixing feature including at least a voiceprint feature and a pitch feature of the target object; and determining, according to the audio mixing feature, a target audio in the mixed audio that matches the target object.
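One simple way to match a target speaker's feature against candidate streams from a mixture is nearest-neighbor search under cosine similarity. This is a generic sketch of that matching idea, not the patent's method; the embedding vectors are illustrative.

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def match_target(candidate_embeddings, target_feature):
    """Return the index of the candidate stream whose embedding is closest
    to the target object's combined voiceprint/pitch feature."""
    sims = [cosine(e, target_feature) for e in candidate_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```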
-
Publication No.: US20240339113A1
Publication Date: 2024-10-10
Application No.: US18294177
Filing Date: 2021-08-05
Inventor(s): Takafumi MORIYA , Takanori ASHIHARA
Abstract: A speech recognition device includes a label estimation unit, a trigger-firing label estimation unit, and an RNN-T trigger estimation unit. The label estimation unit predicts a symbol sequence of the speech data based on an intermediate acoustic feature amount sequence and an intermediate symbol feature amount sequence of the speech data, using a model learned by the RNN-T. The trigger-firing label estimation unit predicts a next symbol of the speech data using an attention mechanism based on the intermediate acoustic feature amount sequence of the speech data. The RNN-T trigger estimation unit calculates a timing at which the probability of occurrence of symbols other than a blank in the speech data becomes a maximum, based on the symbol sequence of the speech data predicted by the label estimation unit. Then, the RNN-T trigger estimation unit outputs the calculated timing as a trigger for operating the trigger-firing label estimation unit.
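The trigger computation amounts to finding the frame where the RNN-T's non-blank probability mass peaks. A minimal sketch of that step, with made-up frame posteriors (the real system derives these from the model's outputs):

```python
# Frame-level posteriors over {blank, symbols...}; values are illustrative.
posteriors = [
    {"<blank>": 0.9, "a": 0.1},
    {"<blank>": 0.2, "a": 0.7, "b": 0.1},
    {"<blank>": 0.6, "a": 0.3, "b": 0.1},
]

def trigger_frame(posteriors, blank="<blank>"):
    """Return the frame index where non-blank probability is maximal,
    i.e. the timing used to fire the attention-based label estimator."""
    non_blank = [1.0 - p[blank] for p in posteriors]
    return max(range(len(non_blank)), key=non_blank.__getitem__)
```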
-
Publication No.: US12106749B2
Publication Date: 2024-10-01
Application No.: US17448119
Filing Date: 2021-09-20
Applicant: Google LLC
Inventor(s): Rohit Prakash Prabhavalkar , Zhifeng Chen , Bo Li , Chung-cheng Chiu , Kanury Kanishka Rao , Yonghui Wu , Ron J. Weiss , Navdeep Jaitly , Michiel A. U. Bacchiani , Tara N. Sainath , Jan Kazimierz Chorowski , Anjuli Patricia Kannan , Ekaterina Gonina , Patrick An Phu Nguyen
IPC Classes: G10L15/00 , G06N3/08 , G10L15/02 , G10L15/06 , G10L15/16 , G10L15/22 , G10L25/30 , G10L15/26
CPC Classes: G10L15/16 , G06N3/08 , G10L15/02 , G10L15/063 , G10L15/22 , G10L25/30 , G10L2015/025 , G10L15/26
Abstract: A method for performing speech recognition using sequence-to-sequence models includes receiving audio data for an utterance and providing features indicative of acoustic characteristics of the utterance as input to an encoder. The method also includes processing an output of the encoder using an attender to generate a context vector, generating speech recognition scores using the context vector and a decoder trained using a training process, and generating a transcription for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
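The attender's job is to collapse the encoder's frame sequence into a single context vector per decoding step. A bare-bones dot-product attention sketch follows; the real attender in the patent is a learned neural module, and the vectors here are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(encoder_outputs, query):
    """Dot-product attention: score each encoder frame against the decoder
    query, normalize, and return the weighted sum as the context vector."""
    scores = [sum(q * h for q, h in zip(query, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    return [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
            for d in range(dim)]
```

A decoder would consume this context vector together with its own state to produce the speech recognition scores over word elements.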
-
Publication No.: US12100416B2
Publication Date: 2024-09-24
Application No.: US17370138
Filing Date: 2021-07-08
Inventor(s): Kiran Charantimath , Karan Parikh
IPC Classes: G10L25/57 , G06F3/16 , G06F16/64 , G06F18/21 , G06F40/30 , G06T7/20 , G06T7/70 , G06V20/40 , G10L25/30 , G10L25/54
CPC Classes: G10L25/57 , G06F3/165 , G06F16/64 , G06F18/21 , G06F40/30 , G06T7/20 , G06T7/70 , G06V20/41 , G10L25/30 , G10L25/54 , G06T2207/10016
Abstract: An electronic device and method for recommendation of audio based on video analysis is provided. The electronic device receives one or more frames of a first scene of a plurality of scenes of a video. The first scene includes a set of objects. The electronic device applies a trained neural network model on the received one or more frames to detect the set of objects. The electronic device determines an impact score of each object of the detected set of objects of the first scene based on the application of the trained neural network model on the set of objects. The electronic device further selects at least one first object from the set of objects based on the impact score of each object, and recommends one or more first audio tracks as a sound effect for the first scene based on the selected at least one first object.
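The select-by-impact-score step reduces to ranking detected objects and mapping the top ones to candidate tracks. A hypothetical sketch (the scores, object names, and audio library are invented for illustration, not from the patent):

```python
# Hypothetical impact scores produced by the detection model for one scene.
detections = {"dog": 0.91, "car": 0.45, "tree": 0.12}

# Illustrative object-to-audio mapping; a real system would query a library.
AUDIO_LIBRARY = {"dog": ["barking.wav"], "car": ["engine.wav"]}

def recommend_tracks(detections, top_k=1):
    """Pick the highest-impact objects and collect their candidate tracks."""
    ranked = sorted(detections, key=detections.get, reverse=True)
    tracks = []
    for obj in ranked[:top_k]:
        tracks.extend(AUDIO_LIBRARY.get(obj, []))
    return tracks
```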
-
Publication No.: US12100391B2
Publication Date: 2024-09-24
Application No.: US17450235
Filing Date: 2021-10-07
Applicant: Google LLC
Inventor(s): William Chan , Navdeep Jaitly , Quoc V. Le , Oriol Vinyals , Noam M. Shazeer
IPC Classes: G10L15/16 , G06F40/12 , G06F40/197 , G06N3/044 , G06N3/045 , G10L15/183 , G10L15/26 , G10L25/30
CPC Classes: G10L15/16 , G06F40/12 , G06F40/197 , G06N3/044 , G06N3/045 , G10L15/183 , G10L15/26 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance and comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert it into an alternative representation; processing the alternative representation using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represents a transcription of the utterance.
-
Publication No.: US12100383B1
Publication Date: 2024-09-24
Application No.: US17707203
Filing Date: 2022-03-29
Inventor(s): Abdelhamid Ezzerg , Piotr Tadeusz Bilinski , Thomas Edward Merritt , Roberto Barra Chicote , Daniel Korzekwa , Kamil Pokora
IPC Classes: G10L13/047 , G06N3/045 , G10L25/30
CPC Classes: G10L13/047 , G06N3/045 , G10L25/30
Abstract: Voice customization is an application of voice synthesis that involves synthesizing speech having certain voice characteristics, and/or modifying the voice characteristics of human speech. Certain techniques for voice customization may be used in conjunction with compressing speech for storage and/or transmission. For example, speech may be received at a first device and transformed into a latent representation and/or compressed for storage and/or transmission to a second device. The system may use normalizing flows to transform the source audio into a latent representation having a desired variable distribution, and to transform the latent representation back into audio data. A flow model may be conditioned using first speech attributes when transforming the source audio, and an inverse flow model may use second speech attributes when transforming the latent representation back into audio data. The first and/or second speech attributes may be modified to alter voice characteristics of the transmitted speech.
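The core idea, an invertible transform conditioned on speech attributes going in and (possibly different) attributes coming out, can be sketched with a toy conditional affine flow. Real normalizing flows are stacks of learned invertible layers; the scalar transform and attribute tuples below are stand-ins.

```python
def flow(x, attrs):
    """Toy conditional affine flow: maps an audio feature x to a latent z,
    conditioned on (scale, shift) speech attributes."""
    scale, shift = attrs
    return (x - shift) / scale

def inverse_flow(z, attrs):
    """Exact inverse: maps latent z back to an audio feature."""
    scale, shift = attrs
    return z * scale + shift

src_attrs = (2.0, 1.0)     # first speech attributes (source voice)
tgt_attrs = (0.5, -1.0)    # second speech attributes (modified target voice)

z = flow(4.0, src_attrs)           # encode with source attributes
y = inverse_flow(z, tgt_attrs)     # decode with modified attributes
```

Encoding with one attribute set and decoding with another is what changes the voice characteristics; using the same attributes on both sides reconstructs the input exactly, which is the invertibility property the flow provides.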
-
Publication No.: US12087270B1
Publication Date: 2024-09-10
Application No.: US17955961
Filing Date: 2022-09-29
Inventor(s): Sebastian Dariusz Cygert , Daniel Korzekwa , Kamil Pokora , Piotr Tadeusz Bilinski , Kayoko Yanagisawa , Abdelhamid Ezzerg , Thomas Edward Merritt , Raghu Ram Sreepada Srinivas , Nikhil Sharma
IPC Classes: G10L15/16 , G10L13/033 , G10L13/047 , G10L13/10 , G10L15/06 , G10L25/30
CPC Classes: G10L13/033 , G10L13/047 , G10L13/10
Abstract: Techniques for generating customized synthetic voices personalized to a user, based on user-provided feedback, are described. A system may determine embedding data representing a user-provided description of a desired synthetic voice and profile data associated with the user, and generate synthetic voice embedding data using synthetic voice embedding data corresponding to a profile associated with a user determined to be similar to the current user. Based on user-provided feedback with respect to a customized synthetic voice, generated using synthetic voice characteristics corresponding to the synthetic voice embedding data and presented to the user, and the synthetic voice embedding data, the system may generate new synthetic voice embedding data, corresponding to a new customized synthetic voice. The system may be configured to assign the customized synthetic voice to the user, such that a subsequent user may not be presented with the same customized synthetic voice.
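Two of the steps above, seeding from a similar user's profile and refining the voice embedding from feedback, can be sketched as nearest-profile lookup plus a simple embedding update. Both functions and the interpolation step size are illustrative assumptions, not the patent's mechanism.

```python
def nearest_profile(profiles, user_embedding):
    """Find the stored profile whose embedding is closest (squared L2)
    to the current user's embedding; its voice embedding seeds the search."""
    def dist(name):
        return sum((a - b) ** 2 for a, b in zip(profiles[name], user_embedding))
    return min(profiles, key=dist)

def refine(voice_embedding, feedback_target, step=0.5):
    """Nudge the synthetic-voice embedding toward the direction implied
    by the user's feedback, yielding a new customized voice embedding."""
    return [v + step * (t - v) for v, t in zip(voice_embedding, feedback_target)]
```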
-
Publication No.: US12080319B2
Publication Date: 2024-09-03
Application No.: US18035934
Filing Date: 2022-06-27
Applicant: Jiangsu University
Inventor(s): Qirong Mao , Lijian Gao , Yaxin Shen , Qinghua Ren , Yongzhao Zhan , Keyang Cheng
Abstract: The present disclosure provides a weakly-supervised sound event detection method and system based on adaptive hierarchical pooling. The system includes an acoustic model and an adaptive hierarchical pooling algorithm (AHPA) module. The acoustic model takes a pre-processed, feature-extracted audio signal as input and predicts frame-level prediction probabilities, which the AHPA module aggregates into a sentence-level prediction probability. The acoustic model and a relaxation parameter are jointly optimized to obtain an optimal model weight and an optimal relaxation parameter for each category of sound event. A pre-processed, feature-extracted unknown audio signal is input to obtain frame-level prediction probabilities of all target sound events (TSEs), and sentence-level prediction probabilities of all categories of TSEs are obtained based on the optimal pooling strategy of each category of TSE. The disclosure is versatile, being applicable to audio classification, complex acoustic scenes, and sound event localization in weakly-supervised sound event detection.
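The frame-to-sentence aggregation can be illustrated with a common softened-pooling family from weakly-supervised sound event detection, where a relaxation parameter interpolates between mean and max pooling. This is one member of that family for illustration, not the patent's adaptive algorithm.

```python
def hierarchical_pool(frame_probs, r):
    """Softened pooling: weight each frame probability by p**r, so the
    relaxation parameter r interpolates between mean pooling (r = 0)
    and max pooling (large r)."""
    weights = [p ** r for p in frame_probs]
    total = sum(weights)
    return sum(p * w for p, w in zip(frame_probs, weights)) / total

probs = [0.1, 0.9, 0.2]                      # frame-level probabilities
mean_like = hierarchical_pool(probs, 0.0)    # behaves like the mean
max_like = hierarchical_pool(probs, 50.0)    # behaves like the max
```

Learning r per event category, as the abstract describes, lets short transient events use max-like pooling while sustained events use mean-like pooling.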
-