Emitting Word Timings with End-to-End Models

    Publication No.: US20240321263A1

    Publication Date: 2024-09-26

    Application No.: US18680797

    Filing Date: 2024-05-31

    Applicant: Google LLC

    IPC Classes: G10L15/06 G10L25/30 G10L25/78

    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word, and the second constrained alignment is aligned with the ground truth alignment for the end of the respective word. The method also includes constraining an attention head of a second-pass decoder by applying the first and second constrained alignments.
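The preprocessing the abstract describes can be illustrated with a minimal sketch. The placeholder symbol, the toy word-piece split, and the frame-index representation below are assumptions for illustration, not the patent's actual implementation:

```python
PLACEHOLDER = "<w>"  # assumed placeholder symbol marking a word boundary

def word_pieces(word):
    # Toy word-piece split: the first character is the beginning word piece,
    # the remainder is the ending word piece. A real system would use a
    # trained word-piece model.
    return [word[0], word[1:]] if len(word) > 1 else [word]

def build_constrained_alignments(transcript, word_times):
    """For each word, insert the placeholder symbol before it, then pin the
    beginning word piece to the word's start frame (first constrained
    alignment) and the ending word piece to its end frame (second
    constrained alignment)."""
    tokens, constraints = [], []
    for word, (start, end) in zip(transcript.split(), word_times):
        tokens.append(PLACEHOLDER)
        pieces = word_pieces(word)
        for i, piece in enumerate(pieces):
            tokens.append(piece)
            if i == 0:
                constraints.append((piece, start))  # first constrained alignment
            if i == len(pieces) - 1:
                constraints.append((piece, end))    # second constrained alignment
    return tokens, constraints

tokens, constraints = build_constrained_alignments(
    "hello world", [(0, 12), (13, 30)])
```

The resulting `constraints` pairs are the kind of targets that could be applied to an attention head of a second-pass decoder so that attention peaks at the ground-truth word boundaries.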

    Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models

    Publication No.: US20230237993A1

    Publication Date: 2023-07-27

    Application No.: US18011571

    Filing Date: 2021-10-01

    Applicant: Google LLC

    IPC Classes: G10L15/16 G10L15/32 G10L15/22

    CPC Classes: G10L15/16 G10L15/32 G10L15/22

    Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, and a difference between the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
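The training objective the abstract outlines combines three evaluated differences: each mode against the ground truth, plus a term comparing the streaming predictions to the contextual ones (an in-place distillation pattern). The sketch below is a hedged illustration; the mean-squared-error stand-in, the loss weighting, and the function names are assumptions, not the patent's actual losses:

```python
def mse(a, b):
    # Toy stand-in for a sequence loss; a real ASR system would use a
    # transducer or cross-entropy loss over token distributions.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_mode_losses(contextual_pred, streaming_pred, ground_truth,
                     distill_weight=0.5):
    """Combine the three differences the abstract describes into one
    training objective."""
    contextual_loss = mse(contextual_pred, ground_truth)  # contextual vs. label
    streaming_loss = mse(streaming_pred, ground_truth)    # streaming vs. label
    distill_loss = mse(streaming_pred, contextual_pred)   # streaming vs. contextual
    return contextual_loss + streaming_loss + distill_weight * distill_loss

total = dual_mode_losses([1.0, 2.0], [1.0, 1.0], [1.0, 2.0])
```

Minimizing the combined total adjusts the shared parameters so the streaming mode benefits from the stronger contextual mode while both fit the ground truth.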

    Emitting Word Timings with End-to-End Models

    Publication No.: US20230206907A1

    Publication Date: 2023-06-29

    Application No.: US18167050

    Filing Date: 2023-02-09

    Applicant: Google LLC

    IPC Classes: G10L15/06 G10L25/30 G10L25/78

    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word, and the second constrained alignment is aligned with the ground truth alignment for the end of the respective word. The method also includes constraining an attention head of a second-pass decoder by applying the first and second constrained alignments.

    Neural Architecture Search with Factorized Hierarchical Search Space

    Publication No.: US20220101090A1

    Publication Date: 2022-03-31

    Application No.: US17495398

    Filing Date: 2021-10-06

    Applicant: Google LLC

    Abstract: The present disclosure is directed to an automated neural architecture search approach for designing new neural network architectures such as, for example, resource-constrained mobile CNN models. In particular, the present disclosure provides systems and methods to perform neural architecture search using a novel factorized hierarchical search space that permits layer diversity throughout the network, thereby striking the right balance between flexibility and search space size. The resulting neural architectures are able to run relatively faster and use relatively fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.), all while remaining competitive with or even exceeding the performance (e.g., accuracy) of current state-of-the-art mobile-optimized models.
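A factorized hierarchical search space of the kind described can be pictured as a network divided into blocks, where each block independently samples its own layer configuration. The sketch below is illustrative only; the specific option values and block count are assumptions, not the patent's search space:

```python
import random

# Assumed per-block choices; each block picks its own combination,
# which is what permits layer diversity throughout the network.
BLOCK_CHOICES = {
    "conv_op":     ["conv3x3", "conv5x5", "mbconv3", "mbconv6"],
    "kernel_size": [3, 5],
    "num_layers":  [1, 2, 3, 4],
}

def sample_architecture(num_blocks, rng):
    """Sample one candidate architecture: each block chooses independently,
    so the total search space size factorizes as
    (choices per block) ** num_blocks."""
    return [
        {name: rng.choice(options) for name, options in BLOCK_CHOICES.items()}
        for _ in range(num_blocks)
    ]

arch = sample_architecture(7, random.Random(0))
```

Because the choices factorize across blocks, the space stays searchable while still allowing, say, an early block with large kernels and a late block with cheap depthwise layers, which is the flexibility-versus-size balance the abstract highlights.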