Methods and Apparatus for Dynamic Volume Adjustment Via Audio Classification

    Publication No.: US20240354053A1

    Publication Date: 2024-10-24

    Application No.: US18760866

    Filing Date: 2024-07-01

    Applicant: Gracenote, Inc.

    IPC Classification: G06F3/16 G10L25/30 G10L25/51

    CPC Classification: G06F3/165 G10L25/51 G10L25/30

    Abstract: Methods, apparatus, systems and articles of manufacture are disclosed for dynamic volume adjustment via audio classification. Example apparatus include at least one memory; instructions; and at least one processor to execute the instructions to: analyze, with a neural network, a parameter of an audio signal associated with a first volume level to determine a classification group associated with the audio signal; determine an input volume of the audio signal; determine a classification gain value based on the classification group; determine an intermediate gain value as an intermediate between the input volume and the classification gain value by applying a first weight to the input volume and a second weight to the classification gain value; apply the intermediate gain value to the audio signal, the intermediate gain value to modify the first volume level to a second volume level; and apply a compression value to the audio signal, the compression value to modify the second volume level to a third volume level that satisfies a target volume threshold.
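The gain chain in this abstract can be illustrated with a minimal Python sketch. Several things here are assumptions, not details from the patent: levels are measured as RMS in dBFS, the input volume enters the weighted sum as the gain that would bring the signal to a hypothetical `reference_db`, the class names and values in `CLASS_GAIN_DB` are invented, and the compression stage is reduced to a single static attenuation. The neural-network classifier is omitted; the classification group is passed in directly.

```python
import math

# Hypothetical per-class gains in dB; the abstract lists neither classes nor values.
CLASS_GAIN_DB = {"speech": 6.0, "music": 0.0, "noise": -6.0}

def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_sq, 1e-12))  # floor avoids log(0) on silence

def apply_gain_db(samples, gain_db):
    """Scale samples by a gain expressed in dB."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]

def adjust_volume(samples, group, reference_db=-20.0,
                  w_input=0.5, w_class=0.5, target_db=-18.0):
    """Return (output samples, intermediate gain in dB, compression in dB)."""
    input_db = rms_dbfs(samples)             # first volume level
    # Assumption: express the input volume as the gain needed to reach reference_db.
    input_term_db = reference_db - input_db
    class_gain_db = CLASS_GAIN_DB[group]
    # Weighted combination of the input-volume term and the classification gain.
    intermediate_db = w_input * input_term_db + w_class * class_gain_db
    second = apply_gain_db(samples, intermediate_db)   # second volume level
    # Static "compression": attenuate only if the level exceeds the target.
    compression_db = min(0.0, target_db - rms_dbfs(second))
    third = apply_gain_db(second, compression_db)      # third volume level
    return third, intermediate_db, compression_db
```

With these assumed defaults, a loud input is pulled down to the target level, while a quiet input is boosted by the intermediate gain and receives no compression because it stays below the target.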

    Cognitive assistant for real-time emotion detection from human speech

    Publication No.: US12119022B2

    Publication Date: 2024-10-15

    Application No.: US17536673

    Filing Date: 2021-11-29

    Abstract: Systems and methods used in a cognitive assistant for detecting human emotions from speech audio signals are described. The system obtains audio signals from an audio receiver and extracts human speech samples. Subsequently, it runs a machine-learning-based classifier to analyze the human speech signal and classify the emotion observed in it. The user is then notified, based on their preferences, with a summary of the emotion detected. Notifications can also be sent to other systems that have been configured to receive them. Optionally, the system may include the ability to store the speech sample and the detected emotion classification for future analysis. The system's machine learning classifier is periodically re-trained on labelled audio speech data and updated.

    SPEECH RECOGNITION DEVICE, SPEECH RECOGNITION METHOD, AND SPEECH RECOGNITION PROGRAM

    Publication No.: US20240339113A1

    Publication Date: 2024-10-10

    Application No.: US18294177

    Filing Date: 2021-08-05

    IPC Classification: G10L15/22 G10L25/30

    CPC Classification: G10L15/22 G10L25/30

    Abstract: A speech recognition device includes a label estimation unit, a trigger-firing label estimation unit, and an RNN-T trigger estimation unit. The label estimation unit predicts a symbol sequence of the speech data based on an intermediate acoustic feature amount sequence and an intermediate symbol feature amount sequence of the speech data using a model learned by the RNN-T. The trigger-firing label estimation unit predicts a next symbol of the speech data using the attention mechanism based on the intermediate acoustic feature amount sequence of the speech data. The RNN-T trigger estimation unit calculates a timing at which a probability of occurrence of symbols other than a block in the speech data becomes a maximum based on a symbol sequence of the speech data predicted by the label estimation unit. Then, the RNN-T trigger estimation unit outputs the calculated timing as a trigger for operating the trigger-firing label estimation unit.
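The trigger computation described above might be sketched as follows. This is only an illustration under stated assumptions: "block" is read here as the RNN-T blank symbol, the `blank_prob[u][t]` layout (blank probability at frame `t` after `u` emitted symbols) is hypothetical, and the function name `trigger_frames` does not come from the patent.

```python
def trigger_frames(blank_prob):
    """For each symbol position u, return the frame index t at which the
    probability of emitting a non-blank symbol, 1 - blank_prob[u][t], peaks.

    blank_prob is a list of rows, one per symbol position (hypothetical layout).
    """
    triggers = []
    for row in blank_prob:
        non_blank = [1.0 - p for p in row]
        # Index of the maximum; ties resolve to the earliest frame.
        best_t = max(range(len(non_blank)), key=non_blank.__getitem__)
        triggers.append(best_t)
    return triggers
```

Each returned frame index would then serve as the moment at which the attention-based label estimator is run for the corresponding symbol.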

    Voice customization for synthetic speech generation

    Publication No.: US12100383B1

    Publication Date: 2024-09-24

    Application No.: US17707203

    Filing Date: 2022-03-29

    Abstract: Voice customization is an application of voice synthesis that involves synthesizing speech having certain voice characteristics, and/or modifying the voice characteristics of human speech. Certain techniques for voice customization may be used in conjunction with compressing speech for storage and/or transmission. For example, speech may be received at a first device and transformed into a latent representation and/or compressed for storage and/or transmission to a second device. The system may use normalizing flows to transform the source audio to a latent representation having a desired variable distribution, and to transform the latent representation back into audio data. A flow model may be conditioned using first speech attributes when transforming the source audio, and an inverse flow model may use second speech attributes when transforming the latent representation back into audio data. The first and/or second speech attributes may be modified to alter voice characteristics of the transmitted speech.
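The idea of encoding with one set of speech attributes and decoding with another can be shown with a toy conditional affine flow in Python. It is a deliberately minimal stand-in, not the patent's model: a real normalizing flow stacks many learned invertible layers, whereas the `CONDITIONING` table, its attribute names, and its shift and scale values are all hypothetical.

```python
import math

# Hypothetical attribute -> (shift, log_scale) conditioning.
# In a real flow these would come from a learned network, not a lookup table.
CONDITIONING = {
    "speaker_a": (0.2, math.log(2.0)),
    "speaker_b": (-0.1, math.log(0.5)),
}

def flow_forward(x, attrs):
    """Audio features -> latent, conditioned on the first speech attributes."""
    shift, log_scale = CONDITIONING[attrs]
    return [(v - shift) * math.exp(-log_scale) for v in x]

def flow_inverse(z, attrs):
    """Latent -> audio features, conditioned on the second speech attributes."""
    shift, log_scale = CONDITIONING[attrs]
    return [v * math.exp(log_scale) + shift for v in z]
```

Inverting with the same attributes used in the forward pass reconstructs the input, because the affine map is exactly invertible. Inverting with different attributes yields features shifted and scaled toward the second attribute set, which is the mechanism the abstract relies on to alter voice characteristics.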

    Weakly-supervised sound event detection method and system based on adaptive hierarchical pooling

    Publication No.: US12080319B2

    Publication Date: 2024-09-03

    Application No.: US18035934

    Filing Date: 2022-06-27

    IPC Classification: G10L25/78 G10L25/18 G10L25/30

    CPC Classification: G10L25/78 G10L25/18 G10L25/30

    Abstract: The present disclosure provides a weakly-supervised sound event detection method and system based on adaptive hierarchical pooling. The system includes an acoustic model and an adaptive hierarchical pooling algorithm module (AHPA-module), where the acoustic model takes a pre-processed and feature-extracted audio signal as input and predicts frame-level prediction probabilities, which the AHPA-module aggregates to obtain a sentence-level prediction probability. The acoustic model and a relaxation parameter are jointly optimized to obtain an optimal model weight and an optimal relaxation parameter, which are used to formulate an optimal pooling strategy for each category of sound event. A pre-processed and feature-extracted unknown audio signal is input to obtain frame-level prediction probabilities of all target sound events (TSEs), and sentence-level prediction probabilities of all categories of TSEs are obtained based on the optimal pooling strategy of each category of TSE. The disclosure has good versatility, being applicable to audio classification, complex acoustic scenes, and localization in weakly-supervised sound event detection.
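The abstract does not spell out the pooling function, so the sketch below substitutes a well-known alternative, softmax-weighted pooling with a per-class parameter, purely to illustrate how a tunable parameter can choose a pooling strategy for each event category. It is not the patent's AHPA algorithm. The parameter `alpha`, the function names, and the class names in the usage note are all assumptions.

```python
import math

def softmax_pool(frame_probs, alpha):
    """Aggregate frame-level probabilities into one clip-level probability.

    alpha = 0 gives mean pooling; large alpha approaches max pooling.
    """
    weights = [math.exp(alpha * p) for p in frame_probs]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, frame_probs)) / total

def clip_probabilities(frame_probs_by_class, alpha_by_class):
    """Pool each class with its own parameter (fixed here; learned jointly
    with the acoustic model in a trained system)."""
    return {
        name: softmax_pool(probs, alpha_by_class[name])
        for name, probs in frame_probs_by_class.items()
    }
```

For example, a short event such as a hypothetical `dog_bark` class would favor a large `alpha` (close to max pooling), while a long, steady event such as `rain` would favor a small `alpha` (close to mean pooling).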