ONLINE SPEAKER DIARIZATION USING LOCAL AND GLOBAL CLUSTERING

    Publication Number: US20230419979A1

    Publication Date: 2023-12-28

    Application Number: US18046041

    Application Date: 2022-10-12

    CPC classification number: G10L21/028 G10L17/06 G10L17/02

    Abstract: A method includes obtaining at least a portion of an audio stream containing speech activity. At least the portion of the audio stream includes multiple segments. The method also includes, for each of the multiple segments, generating an embedding vector that represents the segment. The method further includes, within each of multiple local windows, clustering the embedding vectors into one or more clusters to perform speaker identification. Different clusters correspond to different speakers. The method also includes presenting at least one first sequence of speaker identities based on the speaker identification performed for the local windows. The method further includes, within each of multiple global windows, clustering the embedding vectors into one or more clusters to perform speaker identification. Each global window includes two or more local windows. In addition, the method includes presenting at least one second sequence of speaker identities based on the speaker identification performed for the global windows.
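The two-pass scheme in this abstract (fast per-local-window clustering, then a refining pass over larger global windows that each span several local windows) can be sketched as follows. This is a minimal illustration, not the patented algorithm: the greedy cosine-similarity clustering, the similarity threshold, and the window sizes are all assumptions standing in for the unspecified clustering and windowing details.

```python
import numpy as np

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy cosine-similarity clustering: join the nearest existing
    cluster if its centroid similarity clears the threshold, otherwise
    start a new cluster (a stand-in for the patent's clustering step)."""
    centroids, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(np.dot(e, c)) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            centroids.append(e)
            labels.append(len(centroids) - 1)
    return labels

def diarize(embeddings, local_size=4, locals_per_global=2):
    """Two-pass speaker identification: per-local-window clustering for a
    quick first sequence of speaker identities, then clustering over
    global windows (each spanning several local windows) for a second,
    refined sequence."""
    local_ids = []
    for start in range(0, len(embeddings), local_size):
        local_ids.extend(cluster_embeddings(embeddings[start:start + local_size]))
    global_size = local_size * locals_per_global
    global_ids = []
    for start in range(0, len(embeddings), global_size):
        global_ids.extend(cluster_embeddings(embeddings[start:start + global_size]))
    return local_ids, global_ids
```

In this toy version the global pass simply re-clusters with a wider context; in practice the global windows let identities that were split across local windows be merged consistently.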

    SYSTEM AND METHOD FOR IMPROVING NAMED ENTITY RECOGNITION

    Publication Number: US12170079B2

    Publication Date: 2024-12-17

    Application Number: US17444367

    Application Date: 2021-08-03

    Abstract: A method includes training a set of teacher models. Training the set of teacher models includes, for each individual teacher model of the set of teacher models, training the individual teacher model to transcribe unlabeled audio samples and predict a pseudo labeled dataset having multiple labels. At least some of the unlabeled audio samples contain named entity (NE) audio data. At least some of the labels include transcribed NE labels corresponding to the NE audio data. The method also includes correcting at least some of the transcribed NE labels using user-specific NE textual data. The method further includes retraining the set of teacher models based on the pseudo labeled dataset from a selected one of the teacher models, where the selected one of the teacher models predicts the pseudo labeled dataset more accurately than other teacher models of the set of teacher models.
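Two steps of this pipeline (correcting transcribed NE labels against user-specific NE text, and selecting the most accurate teacher for retraining) can be sketched as below. This is an illustrative simplification under stated assumptions: the naive case-insensitive token lookup and the scoring callback are hypothetical stand-ins for the patent's correction and evaluation procedures.

```python
def correct_ne_labels(pseudo_labels, user_ne_terms):
    """Repair transcribed named-entity tokens using user-specific NE text.
    Here a simple case-insensitive lookup stands in for the correction step."""
    lookup = {term.lower(): term for term in user_ne_terms}
    return [lookup.get(token.lower(), token) for token in pseudo_labels]

def select_best_teacher(teachers, accuracy_fn):
    """Pick the teacher whose pseudo-labeled dataset scores highest under
    accuracy_fn; its pseudo labels then drive retraining of the full set."""
    return max(teachers, key=accuracy_fn)
```

The selected teacher's pseudo-labeled dataset, after NE correction, would then be fed back to retrain every model in the set.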

    SYSTEM AND METHOD FOR IMPROVING NAMED ENTITY RECOGNITION

    Publication Number: US20230040181A1

    Publication Date: 2023-02-09

    Application Number: US17444367

    Application Date: 2021-08-03

    Abstract: A method includes training a set of teacher models. Training the set of teacher models includes, for each individual teacher model of the set of teacher models, training the individual teacher model to transcribe unlabeled audio samples and predict a pseudo labeled dataset having multiple labels. At least some of the unlabeled audio samples contain named entity (NE) audio data. At least some of the labels include transcribed NE labels corresponding to the NE audio data. The method also includes correcting at least some of the transcribed NE labels using user-specific NE textual data. The method further includes retraining the set of teacher models based on the pseudo labeled dataset from a selected one of the teacher models, where the selected one of the teacher models predicts the pseudo labeled dataset more accurately than other teacher models of the set of teacher models.

    JOINT END-TO-END SPOKEN LANGUAGE UNDERSTANDING AND AUTOMATIC SPEECH RECOGNITION

    Publication Number: US20250078824A1

    Publication Date: 2025-03-06

    Application Number: US18814275

    Application Date: 2024-08-23

    Abstract: A method includes receiving an utterance from an audio input device. The method also includes determining a context associated with the utterance. The method also includes providing the utterance as an input to a joint model for automatic speech recognition (ASR) and spoken language understanding (SLU), wherein the joint model operates in a single mode to perform both ASR and SLU or a dual mode to perform one of ASR or SLU depending on the context. The method also includes using an output of the joint model to perform an action requested in the utterance. The joint model is trained by training a shared encoder and a shared decoder using a text-to-text task and, after training the shared encoder and the shared decoder, training a speech encoder and the shared encoder using a speech self-supervised learning (SSL) task and a text-to-text task with a masked prediction loss.
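The context-dependent mode switching described here (single mode producing both ASR and SLU outputs, dual mode producing only one) can be sketched as a routing layer. This is a toy stand-in: the `asr_fn`/`slu_fn` callables and the context strings are assumptions for illustration, not the patent's model interfaces.

```python
class JointModel:
    """Toy stand-in for the joint ASR/SLU model: routes an utterance
    through ASR, SLU, or both depending on the determined context."""

    def __init__(self, asr_fn, slu_fn):
        self.asr_fn = asr_fn  # audio -> transcript (hypothetical)
        self.slu_fn = slu_fn  # transcript -> intent (hypothetical)

    def run(self, utterance, context):
        if context == "single":
            # Single mode: perform both ASR and SLU jointly.
            text = self.asr_fn(utterance)
            return {"transcript": text, "intent": self.slu_fn(text)}
        if context == "asr_only":
            # Dual mode, ASR branch only.
            return {"transcript": self.asr_fn(utterance)}
        # Dual mode, SLU branch only.
        return {"intent": self.slu_fn(self.asr_fn(utterance))}
```

The returned output would then drive the requested action (e.g., executing the recognized intent).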

    PERSONALIZED MULTI-MODAL SPOKEN LANGUAGE IDENTIFICATION

    Publication Number: US20230419958A1

    Publication Date: 2023-12-28

    Application Number: US17937692

    Application Date: 2022-10-03

    CPC classification number: G10L15/197 G10L15/005 G10L15/22

    Abstract: A method includes obtaining an audio input of a person speaking, where the audio input is captured by an electronic device. The method also includes, for each of multiple language types, (i) determining a first probability that the person is speaking in the language type by applying a trained spoken language identification model to the audio input, (ii) determining at least one second probability that the person is speaking in the language type based on at least one characteristic of the person or the electronic device, and (iii) determining a score for the language type based on a weighted sum of the first and second probabilities. The method further includes identifying the language type associated with a highest score as a spoken language of the person in the audio input.
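The scoring rule in this abstract (a weighted sum of a model-derived probability and a probability from user/device characteristics, then an argmax over language types) can be sketched directly. The weight values and probability tables below are illustrative assumptions, not values from the patent.

```python
def identify_language(model_probs, prior_probs, weights=(0.7, 0.3)):
    """Fuse spoken-language-ID model probabilities with priors from user or
    device characteristics via a weighted sum; return the language type
    with the highest score. Weights are illustrative, not the patent's."""
    w_model, w_prior = weights
    scores = {
        lang: w_model * p + w_prior * prior_probs.get(lang, 0.0)
        for lang, p in model_probs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

For example, a prior reflecting the device's configured language can override a weak acoustic preference for another language.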
