-
Publication number: US11152006B2
Publication date: 2021-10-19
Application number: US16020911
Filing date: 2018-06-27
Applicant: Microsoft Technology Licensing, LLC
Inventor: Eyal Krupka, Shixiong Zhang, Xiong Xiao
Abstract: Examples are disclosed that relate to voice identification enrollment. One example provides a method of voice identification enrollment comprising, during a meeting in which two or more human speakers speak at different times, determining whether one or more conditions of a protocol for sampling meeting audio used to establish human speaker voiceprints are satisfied, and in response to determining that the one or more conditions are satisfied, selecting a sample of meeting audio according to the protocol, the sample representing an utterance made by one of the human speakers. The method further comprises establishing, based at least on the sample, a voiceprint of the human speaker.
-
Publication number: US10354656B2
Publication date: 2019-07-16
Application number: US15631995
Filing date: 2017-06-23
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are join-optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.
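The weighting-and-combining step described in this abstract can be illustrated with a minimal sketch. It assumes the frame-level features are already available as plain lists of floats, and it scores each frame with a dot product against a hypothetical parameter vector `score_weights`; the patent abstract does not specify the scoring function, so that choice is an assumption made only for illustration.

```python
import math


def attention_pool(frames, score_weights):
    """Combine frame-level feature vectors into one utterance-level vector.

    frames: list of equal-length feature vectors, one per frame.
    score_weights: hypothetical scoring parameters (an assumption, not
    taken from the patent) used to assign one scalar score per frame.
    """
    # One scalar relevance score per frame (assumed dot-product scoring).
    scores = [
        sum(f * w for f, w in zip(frame, score_weights)) for frame in frames
    ]
    # Softmax over frames, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frames gives the utterance-level vector.
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights
```

With all-zero `score_weights` every frame receives the same weight, so the result reduces to a plain average of the frames; non-zero parameters let some frames contribute more than others.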
-
Publication number: US10580414B2
Publication date: 2020-03-03
Application number: US16006405
Filing date: 2018-06-12
Applicant: Microsoft Technology Licensing, LLC
Inventor: Shixiong Zhang, Xiong Xiao
Abstract: Computing devices and methods utilizing a joint speaker location/speaker identification neural network are provided. In one example a computing device receives a multi-channel audio signal of an utterance spoken by a user. Magnitude and phase information features are extracted from the signal and inputted into a joint speaker location/speaker identification neural network that is trained via utterances from a plurality of persons. A user embedding comprising speaker identification characteristics and location characteristics is received from the neural network and compared to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person. Based at least on the comparisons, the user is matched to an identity of one of the persons, and the identity of the person is outputted.
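The comparison of a user embedding against enrollment embeddings can be sketched as follows. The abstract does not name a similarity measure or a decision rule, so cosine similarity, the `threshold` parameter, and the function names `cosine_similarity` and `identify` are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def identify(user_embedding, enrollment, threshold=0.5):
    """Return (identity, score) for the closest enrolled person.

    enrollment: dict mapping an identity to its enrollment embedding.
    threshold: assumed minimum similarity for accepting a match; the
    identity is None when no enrolled person reaches it.
    """
    best_identity, best_score = None, -1.0
    for identity, embedding in enrollment.items():
        score = cosine_similarity(user_embedding, embedding)
        if score > best_score:
            best_identity, best_score = identity, score
    if best_score < threshold:
        return None, best_score
    return best_identity, best_score
```

For example, with `enrollment = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}`, calling `identify([0.9, 0.1], enrollment)` selects `"alice"`, while an embedding that is not close to any enrolled vector yields `None` as the identity.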
-
Publication number: US11688399B2
Publication date: 2023-06-27
Application number: US17115293
Filing date: 2020-12-08
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Adi Diamant, Karen Master Ben-Dor, Eyal Krupka, Raz Halaly, Yoni Smolin, Ilya Gurvich, Aviv Hurvitz, Lijuan Qin, Wei Xiong, Shixiong Zhang, Lingfeng Wu, Xiong Xiao, Ido Leichter, Moshe David, Xuedong Huang, Amit Kumar Agarwal
CPC classification number: G10L15/26, G06V40/172, G10L17/00, H04N7/15
Abstract: A method for facilitating a remote conference includes receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. An attribution machine attributes the text to the first conference participant. A second computer-readable audio signal is processed similarly, to obtain a second text attributed to a second conference participant. A transcription machine automatically creates a transcript including the first text attributed to the first conference participant and the second text attributed to the second conference participant.
-
Publication number: US11222640B2
Publication date: 2022-01-11
Application number: US16802993
Filing date: 2020-02-27
Applicant: Microsoft Technology Licensing, LLC
Inventor: Shixiong Zhang, Xiong Xiao
Abstract: Computing devices and methods utilizing a joint speaker location/speaker identification neural network are provided. In one example a computing device receives an audio signal of utterances spoken by multiple persons. Magnitude and phase information features are extracted from the signal and inputted into a joint speaker location and speaker identification neural network. The neural network utilizes both the magnitude and phase information features to determine a change in the person speaking. Output comprising the determination of the change is received from the neural network. The output is then used to perform a speaker recognition function, speaker location function, or both.
-
Publication number: US20180374486A1
Publication date: 2018-12-27
Application number: US15631995
Filing date: 2017-06-23
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
CPC classification number: G10L17/18, G10L15/16, G10L17/005, G10L17/02, G10L17/04, G10L17/22, G10L2015/025
Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are join-optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.