-
Publication No.: US20240354053A1
Publication Date: 2024-10-24
Application No.: US18760866
Filing Date: 2024-07-01
Applicant: Gracenote, Inc.
Abstract: Methods, apparatus, systems, and articles of manufacture are disclosed for dynamic volume adjustment via audio classification. An example apparatus includes at least one memory; instructions; and at least one processor to execute the instructions to: analyze, with a neural network, a parameter of an audio signal associated with a first volume level to determine a classification group associated with the audio signal; determine an input volume of the audio signal; determine a classification gain value based on the classification group; determine an intermediate gain value as an intermediate between the input volume and the classification gain value by applying a first weight to the input volume and a second weight to the classification gain value; apply the intermediate gain value to the audio signal, the intermediate gain value to modify the first volume level to a second volume level; and apply a compression value to the audio signal, the compression value to modify the second volume level to a third volume level that satisfies a target volume threshold.
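The gain-blending step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the class-to-gain table, the blend weights, and the compressor threshold/ratio here are all hypothetical values.

```python
# Hypothetical per-class target gains in dB (illustrative values only).
CLASS_GAIN_DB = {"speech": 6.0, "music": 0.0, "effects": -3.0}

def blended_gain_db(input_volume_db, classification, w_input=0.4, w_class=0.6):
    """Blend the measured input volume with the class gain using two weights."""
    class_gain = CLASS_GAIN_DB[classification]
    return w_input * input_volume_db + w_class * class_gain

def compress(level_db, threshold_db=-10.0, ratio=4.0):
    """Simple downward compressor: attenuate the portion above the threshold."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

level = -20.0                              # first volume level, dB
gain = blended_gain_db(level, "speech")    # intermediate gain value
second = level + gain                      # second volume level
third = compress(second)                   # third volume level
```

The two weights let the system trade off trusting the measured loudness against trusting the classifier's per-class target; the final compression stage caps the result against the target threshold.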
-
Publication No.: US12119022B2
Publication Date: 2024-10-15
Application No.: US17536673
Filing Date: 2021-11-29
Applicant: Rishi Amit Sinha , Ria Sinha
Inventor(s): Rishi Amit Sinha , Ria Sinha
IPC Classes: G10L25/63 , G10L21/0208 , G10L25/30 , H04L67/55
CPC Classes: G10L25/63 , G10L21/0208 , G10L25/30 , H04L67/55
Abstract: Systems and methods used in a cognitive assistant for detecting human emotions from speech audio signals are described. The system obtains audio signals from an audio receiver and extracts human speech samples. Subsequently, it runs a machine-learning-based classifier to analyze the human speech signal and classify the emotion observed in it. The user is then notified, based on their preferences, with a summary of the emotion detected. Notifications can also be sent to other systems that have been configured to receive them. Optionally, the system may include the ability to store the speech sample and detected emotion classification for future analysis. The system's machine learning classifier is periodically re-trained on labelled audio speech data and updated.
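The per-sample flow (classify, optionally store for retraining, notify per preferences) can be sketched as below. The rule-based classifier is a toy stand-in for the trained model, and the feature tuple and preference fields are assumptions, not the patent's design.

```python
from dataclasses import dataclass

def classify_emotion(features):
    """Toy stand-in for the trained classifier: high energy plus high
    pitch maps to 'excited', everything else to 'calm'."""
    energy, pitch = features
    return "excited" if energy > 0.5 and pitch > 200 else "calm"

@dataclass
class Preferences:
    notify: bool = True

def process_sample(features, prefs, store=None):
    """Classify one speech sample, optionally retain it, and build a summary."""
    emotion = classify_emotion(features)
    if store is not None:          # optional retention for periodic retraining
        store.append((features, emotion))
    summary = f"Detected emotion: {emotion}" if prefs.notify else None
    return emotion, summary
```

The optional `store` list models the abstract's retained samples that later feed the periodic re-training on labelled data.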
-
Publication No.: US12119012B2
Publication Date: 2024-10-15
Application No.: US17353636
Filing Date: 2021-06-21
Inventor(s): Na Xu , Yongtao Jia , Linzhang Wang
IPC Classes: G10L21/007 , G10L21/0272 , G10L25/30 , G10L25/51
CPC Classes: G10L21/007 , G10L21/0272 , G10L25/30 , G10L25/51
Abstract: The present disclosure relates to a method and an apparatus for audio processing, and a storage medium. The method includes: obtaining an audio mixing feature of a target object, the audio mixing feature including at least a voiceprint feature and a pitch feature of the target object; and determining, according to the audio mixing feature, a target audio in the mixed audio that matches the target object.
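One simple way to match a target speaker's feature against candidate streams from a mixture is nearest-neighbor search under cosine similarity. This is a generic sketch of that matching idea, not the patent's method; the embedding vectors are illustrative.

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def match_target(candidate_embeddings, target_feature):
    """Return the index of the candidate stream whose embedding is closest
    to the target object's combined voiceprint/pitch feature."""
    sims = [cosine(e, target_feature) for e in candidate_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```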
-
Publication No.: US20240339113A1
Publication Date: 2024-10-10
Application No.: US18294177
Filing Date: 2021-08-05
Inventor(s): Takafumi MORIYA , Takanori ASHIHARA
Abstract: A speech recognition device includes a label estimation unit, a trigger-firing label estimation unit, and an RNN-T trigger estimation unit. The label estimation unit predicts a symbol sequence of the speech data based on an intermediate acoustic feature amount sequence and an intermediate symbol feature amount sequence of the speech data, using a model learned by the RNN-T. The trigger-firing label estimation unit predicts a next symbol of the speech data using an attention mechanism based on the intermediate acoustic feature amount sequence of the speech data. The RNN-T trigger estimation unit calculates a timing at which the probability of occurrence of symbols other than a blank in the speech data becomes a maximum, based on the symbol sequence of the speech data predicted by the label estimation unit. Then, the RNN-T trigger estimation unit outputs the calculated timing as a trigger for operating the trigger-firing label estimation unit.
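The trigger computation amounts to finding the frame where the RNN-T's non-blank probability mass peaks. A minimal sketch of that step, with made-up frame posteriors (the real system derives these from the model's outputs):

```python
# Frame-level posteriors over {blank, symbols...}; values are illustrative.
posteriors = [
    {"<blank>": 0.9, "a": 0.1},
    {"<blank>": 0.2, "a": 0.7, "b": 0.1},
    {"<blank>": 0.6, "a": 0.3, "b": 0.1},
]

def trigger_frame(posteriors, blank="<blank>"):
    """Return the frame index where non-blank probability is maximal,
    i.e. the timing used to fire the attention-based label estimator."""
    non_blank = [1.0 - p[blank] for p in posteriors]
    return max(range(len(non_blank)), key=non_blank.__getitem__)
```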
-
Publication No.: US12106749B2
Publication Date: 2024-10-01
Application No.: US17448119
Filing Date: 2021-09-20
Applicant: Google LLC
Inventor(s): Rohit Prakash Prabhavalkar , Zhifeng Chen , Bo Li , Chung-cheng Chiu , Kanury Kanishka Rao , Yonghui Wu , Ron J. Weiss , Navdeep Jaitly , Michiel A. U. Bacchiani , Tara N. Sainath , Jan Kazimierz Chorowski , Anjuli Patricia Kannan , Ekaterina Gonina , Patrick An Phu Nguyen
IPC Classes: G10L15/00 , G06N3/08 , G10L15/02 , G10L15/06 , G10L15/16 , G10L15/22 , G10L25/30 , G10L15/26
CPC Classes: G10L15/16 , G06N3/08 , G10L15/02 , G10L15/063 , G10L15/22 , G10L25/30 , G10L2015/025 , G10L15/26
Abstract: A method for performing speech recognition using sequence-to-sequence models includes receiving audio data for an utterance and providing features indicative of acoustic characteristics of the utterance as input to an encoder. The method also includes processing an output of the encoder using an attender to generate a context vector, generating speech recognition scores using the context vector and a decoder trained using a training process, and generating a transcription for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
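The attender's job is to collapse the encoder's frame sequence into a single context vector per decoding step. A bare-bones dot-product attention sketch follows; the real attender in the patent is a learned neural module, and the vectors here are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(encoder_outputs, query):
    """Dot-product attention: score each encoder frame against the decoder
    query, normalize, and return the weighted sum as the context vector."""
    scores = [sum(q * h for q, h in zip(query, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    return [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
            for d in range(dim)]
```

A decoder would consume this context vector together with its own state to produce the speech recognition scores over word elements.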
-
Publication No.: US12100416B2
Publication Date: 2024-09-24
Application No.: US17370138
Filing Date: 2021-07-08
Inventor(s): Kiran Charantimath , Karan Parikh
IPC Classes: G10L25/57 , G06F3/16 , G06F16/64 , G06F18/21 , G06F40/30 , G06T7/20 , G06T7/70 , G06V20/40 , G10L25/30 , G10L25/54
CPC Classes: G10L25/57 , G06F3/165 , G06F16/64 , G06F18/21 , G06F40/30 , G06T7/20 , G06T7/70 , G06V20/41 , G10L25/30 , G10L25/54 , G06T2207/10016
Abstract: An electronic device and method for recommendation of audio based on video analysis is provided. The electronic device receives one or more frames of a first scene of a plurality of scenes of a video. The first scene includes a set of objects. The electronic device applies a trained neural network model on the received one or more frames to detect the set of objects. The electronic device determines an impact score of each object of the detected set of objects of the first scene based on the application of the trained neural network model on the set of objects. The electronic device further selects at least one first object from the set of objects based on the impact score of each object, and recommends one or more first audio tracks as a sound effect for the first scene based on the selected at least one first object.
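The select-by-impact-score step reduces to ranking detected objects and mapping the top ones to candidate tracks. A hypothetical sketch (the scores, object names, and audio library are invented for illustration, not from the patent):

```python
# Hypothetical impact scores produced by the detection model for one scene.
detections = {"dog": 0.91, "car": 0.45, "tree": 0.12}

# Illustrative object-to-audio mapping; a real system would query a library.
AUDIO_LIBRARY = {"dog": ["barking.wav"], "car": ["engine.wav"]}

def recommend_tracks(detections, top_k=1):
    """Pick the highest-impact objects and collect their candidate tracks."""
    ranked = sorted(detections, key=detections.get, reverse=True)
    tracks = []
    for obj in ranked[:top_k]:
        tracks.extend(AUDIO_LIBRARY.get(obj, []))
    return tracks
```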
-
Publication No.: US12100391B2
Publication Date: 2024-09-24
Application No.: US17450235
Filing Date: 2021-10-07
Applicant: Google LLC
Inventor(s): William Chan , Navdeep Jaitly , Quoc V. Le , Oriol Vinyals , Noam M. Shazeer
IPC Classes: G10L15/16 , G06F40/12 , G06F40/197 , G06N3/044 , G06N3/045 , G10L15/183 , G10L15/26 , G10L25/30
CPC Classes: G10L15/16 , G06F40/12 , G06F40/197 , G06N3/044 , G06N3/045 , G10L15/183 , G10L15/26 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance and comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert it into an alternative representation; processing the alternative representation using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represents a transcription of the utterance.
-
Publication No.: US12100383B1
Publication Date: 2024-09-24
Application No.: US17707203
Filing Date: 2022-03-29
Inventor(s): Abdelhamid Ezzerg , Piotr Tadeusz Bilinski , Thomas Edward Merritt , Roberto Barra Chicote , Daniel Korzekwa , Kamil Pokora
IPC Classes: G10L13/047 , G06N3/045 , G10L25/30
CPC Classes: G10L13/047 , G06N3/045 , G10L25/30
Abstract: Voice customization is an application of voice synthesis that involves synthesizing speech having certain voice characteristics, and/or modifying the voice characteristics of human speech. Certain techniques for voice customization may be used in conjunction with compressing speech for storage and/or transmission. For example, speech may be received at a first device and transformed into a latent representation and/or compressed for storage and/or transmission to a second device. The system may use normalizing flows to transform the source audio into a latent representation having a desired variable distribution, and to transform the latent representation back into audio data. A flow model may be conditioned using first speech attributes when transforming the source audio, and an inverse flow model may use second speech attributes when transforming the latent representation back into audio data. The first and/or second speech attributes may be modified to alter voice characteristics of the transmitted speech.
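The core idea, an invertible transform conditioned on speech attributes going in and (possibly different) attributes coming out, can be sketched with a toy conditional affine flow. Real normalizing flows are stacks of learned invertible layers; the scalar transform and attribute tuples below are stand-ins.

```python
def flow(x, attrs):
    """Toy conditional affine flow: maps an audio feature x to a latent z,
    conditioned on (scale, shift) speech attributes."""
    scale, shift = attrs
    return (x - shift) / scale

def inverse_flow(z, attrs):
    """Exact inverse: maps latent z back to an audio feature."""
    scale, shift = attrs
    return z * scale + shift

src_attrs = (2.0, 1.0)     # first speech attributes (source voice)
tgt_attrs = (0.5, -1.0)    # second speech attributes (modified target voice)

z = flow(4.0, src_attrs)           # encode with source attributes
y = inverse_flow(z, tgt_attrs)     # decode with modified attributes
```

Encoding with one attribute set and decoding with another is what changes the voice characteristics; using the same attributes on both sides reconstructs the input exactly, which is the invertibility property the flow provides.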
-
Publication No.: US12087270B1
Publication Date: 2024-09-10
Application No.: US17955961
Filing Date: 2022-09-29
Inventor(s): Sebastian Dariusz Cygert , Daniel Korzekwa , Kamil Pokora , Piotr Tadeusz Bilinski , Kayoko Yanagisawa , Abdelhamid Ezzerg , Thomas Edward Merritt , Raghu Ram Sreepada Srinivas , Nikhil Sharma
IPC Classes: G10L15/16 , G10L13/033 , G10L13/047 , G10L13/10 , G10L15/06 , G10L25/30
CPC Classes: G10L13/033 , G10L13/047 , G10L13/10
Abstract: Techniques for generating customized synthetic voices personalized to a user, based on user-provided feedback, are described. A system may determine embedding data representing a user-provided description of a desired synthetic voice and profile data associated with the user, and generate synthetic voice embedding data using synthetic voice embedding data corresponding to a profile associated with a user determined to be similar to the current user. Based on user-provided feedback with respect to a customized synthetic voice, generated using synthetic voice characteristics corresponding to the synthetic voice embedding data and presented to the user, and the synthetic voice embedding data, the system may generate new synthetic voice embedding data, corresponding to a new customized synthetic voice. The system may be configured to assign the customized synthetic voice to the user, such that a subsequent user may not be presented with the same customized synthetic voice.
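Two of the steps above, seeding from a similar user's profile and refining the voice embedding from feedback, can be sketched as nearest-profile lookup plus a simple embedding update. Both functions and the interpolation step size are illustrative assumptions, not the patent's mechanism.

```python
def nearest_profile(profiles, user_embedding):
    """Find the stored profile whose embedding is closest (squared L2)
    to the current user's embedding; its voice embedding seeds the search."""
    def dist(name):
        return sum((a - b) ** 2 for a, b in zip(profiles[name], user_embedding))
    return min(profiles, key=dist)

def refine(voice_embedding, feedback_target, step=0.5):
    """Nudge the synthetic-voice embedding toward the direction implied
    by the user's feedback, yielding a new customized voice embedding."""
    return [v + step * (t - v) for v, t in zip(voice_embedding, feedback_target)]
```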
-
Publication No.: US12080319B2
Publication Date: 2024-09-03
Application No.: US18035934
Filing Date: 2022-06-27
Applicant: Jiangsu University
Inventor(s): Qirong Mao , Lijian Gao , Yaxin Shen , Qinghua Ren , Yongzhao Zhan , Keyang Cheng
Abstract: The present disclosure provides a weakly-supervised sound event detection method and system based on adaptive hierarchical pooling. The system includes an acoustic model and an adaptive hierarchical pooling algorithm (AHPA) module. The acoustic model takes a pre-processed, feature-extracted audio signal as input and predicts frame-level prediction probabilities, which the AHPA module aggregates into a sentence-level prediction probability. The acoustic model and a relaxation parameter are jointly optimized to obtain an optimal model weight and an optimal relaxation parameter for each category of sound event. A pre-processed, feature-extracted unknown audio signal is input to obtain frame-level prediction probabilities of all target sound events (TSEs), and sentence-level prediction probabilities of all categories of TSEs are obtained based on the optimal pooling strategy of each category of TSE. The disclosure is versatile, being applicable to audio classification, complex acoustic scenes, and sound event localization in weakly-supervised sound event detection.
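The frame-to-sentence aggregation can be illustrated with a common softened-pooling family from weakly-supervised sound event detection, where a relaxation parameter interpolates between mean and max pooling. This is one member of that family for illustration, not the patent's adaptive algorithm.

```python
def hierarchical_pool(frame_probs, r):
    """Softened pooling: weight each frame probability by p**r, so the
    relaxation parameter r interpolates between mean pooling (r = 0)
    and max pooling (large r)."""
    weights = [p ** r for p in frame_probs]
    total = sum(weights)
    return sum(p * w for p, w in zip(frame_probs, weights)) / total

probs = [0.1, 0.9, 0.2]                      # frame-level probabilities
mean_like = hierarchical_pool(probs, 0.0)    # behaves like the mean
max_like = hierarchical_pool(probs, 50.0)    # behaves like the max
```

Learning r per event category, as the abstract describes, lets short transient events use max-like pooling while sustained events use mean-like pooling.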
-