DISTINGUISHING USER SPEECH FROM BACKGROUND SPEECH IN SPEECH-DENSE ENVIRONMENTS

    Publication No.: US20240062775A1

    Publication Date: 2024-02-22

    Application No.: US18452351

    Filing Date: 2023-08-18

    Applicant: Vocollect, Inc.

    Inventor: David D. HARDEK

    Abstract: A device, system, and method by which a speech-driven system can distinguish speech from users of the system from speech spoken by background persons and from background speech from public address (PA) systems. In one aspect, the present system and method prepare, in advance of field use, a voice-data file created in a training environment. The training environment contains both desired user speech and unwanted background speech, including speech from persons other than the user and speech from a PA system. The speech recognition system is trained or otherwise programmed to identify wanted user speech, which may be spoken concurrently with the background sounds. In an embodiment, during the pre-field-use phase, the training or programming may be accomplished by having human training listeners audit the pre-recorded sounds to identify the desired user speech. A processor-based learning system is then trained to duplicate the assessments made by the human listeners.
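
The last step of this abstract, training a learning system to reproduce the human listeners' judgments, can be illustrated with a small sketch. The abstract does not say what kind of learning system or which acoustic features are used, so everything here is an assumption made for illustration: a plain logistic-regression classifier, two unnamed per-segment feature values, and the function names `train_listener_mimic` and `predict_user_speech`. The feature values in the usage note are synthetic placeholders, not data from the patent.

```python
import numpy as np


def train_listener_mimic(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression model to human listener labels.

    features: (n_segments, n_features) array of per-segment feature values
              (hypothetical; the abstract does not name any features).
    labels:   1 where the listeners marked the segment as user speech,
              0 where they marked it as background speech.
    Returns the learned weight vector and bias.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(user speech)
        grad = p - y                            # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_user_speech(features, w, b):
    """Return a boolean array: True where the model predicts user speech."""
    X = np.asarray(features, dtype=float)
    return (X @ w + b) > 0
```

As a usage example, fitting on four synthetic segments `[[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]` with listener labels `[1, 1, 0, 0]` and then calling `predict_user_speech` on the same segments reproduces the listeners' labels. A fielded system would presumably use a far richer model and feature set.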

    REAL-TIME VOICE RECOGNITION METHOD, MODEL TRAINING METHOD, APPARATUSES, DEVICE, AND STORAGE MEDIUM

    Publication No.: US20240062744A1

    Publication Date: 2024-02-22

    Application No.: US18384009

    Filing Date: 2023-10-26

    Abstract: A real-time voice recognition method and a real-time voice recognition model training method are provided. The model training method includes: obtaining an audio feature sequence of sample voice data, the audio feature sequence comprising audio features of a plurality of audio frames of the sample voice data; inputting the audio feature sequence to an encoder of the real-time voice recognition model; chunking the audio feature sequence into a plurality of chunks by the encoder according to a mask matrix; encoding each of the chunks to obtain a hidden layer feature sequence of the sample voice data; decoding the hidden layer feature sequence by a decoder of the real-time voice recognition model to obtain a predicted recognition result for the sample voice data; and training the real-time voice recognition model based on the predicted recognition result and a real recognition result of the sample voice data.
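
The chunking step can be made concrete with a sketch of one common way a mask matrix divides a frame sequence into chunks for a streaming encoder. The abstract does not specify the form of the mask, so this is an assumption: a boolean matrix in which frame `i` may use frame `j` only if `j` lies in the same chunk or an earlier one. The function name `chunk_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical.

```python
import numpy as np


def chunk_mask(num_frames: int, chunk_size: int, left_chunks: int = -1) -> np.ndarray:
    """Build a (num_frames, num_frames) boolean chunk mask.

    mask[i, j] is True when frame i is allowed to use frame j, i.e. when
    frame j is in the same chunk as frame i or in an earlier chunk.
    left_chunks limits how many earlier chunks stay visible; a negative
    value means all earlier chunks are visible.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    chunk_ids = np.arange(num_frames) // chunk_size  # chunk index of every frame
    query = chunk_ids[:, None]                       # chunk of the frame being encoded
    key = chunk_ids[None, :]                         # chunk of the frame being looked at
    mask = key <= query                              # no look-ahead past the current chunk
    if left_chunks >= 0:
        mask &= key >= query - left_chunks           # bound the history window
    return mask
```

For example, `chunk_mask(5, 2)` assigns the five frames to chunks 0, 0, 1, 1, 2, so the first frame can use only frames 0 and 1 while the last frame can use all five. Passing `left_chunks=0` restricts every frame to its own chunk. In an attention-based encoder, a mask like this would typically be applied to the attention scores before the softmax.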