Inverse text normalization for automatic speech recognition

    公开(公告)号:US10592604B2

    公开(公告)日:2020-03-17

    申请号:US16024425

    申请日:2018-06-29

    申请人: Apple Inc.

    摘要: Techniques for inverse text normalization are provided. In some examples, speech input is received and a spoken-form text representation of the speech input is generated. The spoken-form text representation includes a token sequence. A feature representation is determined for the spoken-form text representation and a sequence of labels is determined based on the feature representation. The sequence of labels is assigned to the token sequence and specifies a plurality of edit operations to perform on the token sequence. Each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations. A written-form text representation of the speech input is generated by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels. A task responsive to the speech input is performed using the generated written-form text representation.

    Low-latency multi-speaker speech recognition

    公开(公告)号:US11475898B2

    公开(公告)日:2022-10-18

    申请号:US16534902

    申请日:2019-08-07

    申请人: Apple Inc.

    摘要: Systems and processes for operating an intelligent automated assistant are provided. In one example, a method includes receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources. The method further includes obtaining a target speaker representation, which represents speech characteristics of the target speaker; and determining, using a learning network, probability distributions of phonetic elements directly from the mixed speech data. The inputs of the learning network include the mixed speech data and the target speaker representation. An output of the learning network includes the probability distributions of phonetic elements. The method further includes generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and providing a response to the target speaker based on the text corresponding to the utterances of the target speaker.