Abstract:
An apparatus and method for performing asynchronous speech recognition using multiple microphones are disclosed. The apparatus includes a microphone selection unit, a signal-to-noise ratio measurement unit, a speech recognition and verification unit, and a final recognition result output unit. The microphone selection unit selects two or more microphones responsive to a user's voice from among a plurality of microphones distributed around the user. The signal-to-noise ratio measurement unit measures the signal to noise ratios of inputs of the selected two or more microphones. The speech recognition and verification unit performs speech recognition using the input of the microphone having a highest signal to noise ratio, and verifies the speech recognition using the inputs of the remaining microphones. The final recognition result output unit outputs the final recognition results of the user's voice based on the results of the speech recognition and verification unit.
Abstract:
Disclosed herein are an apparatus and method for self-supervised training of an end-to-end speech recognition model. The apparatus includes memory in which at least one program is recorded and a processor for executing the program. The program trains an end-to-end speech recognition model, including an encoder and a decoder, using untranscribed speech data. The program may add predetermined noise to the input signal of the end-to-end speech recognition model, and may calculate loss by reflecting a predetermined constraint based on the output of the encoder of the end-to-end speech recognition model.