MIXTURE-OF-EXPERT CONFORMER FOR STREAMING MULTILINGUAL ASR
Abstract:
A method of a multilingual ASR model includes receiving a sequence of acoustic frames characterizing an utterance of speech. At a plurality of output steps, the method further includes generating a first higher order feature representation for an acoustic frame by a first encoder that includes a first plurality of multi-head attention layers; generating a second higher order feature representation for a corresponding first higher order feature representation by a second encoder that includes a second plurality of multi-head attention layers; and generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on the second higher order feature representation and a sequence of N previous non-blank symbols. A gating layer of each respective MoE layer configured to dynamically route an output from a previous multi-head attention layer at each of the plurality of output steps to a respective pair of feed-forward expert networks.
Information query
Patent Agency Ranking
0/0