-
Publication No.: US20250140242A1
Publication Date: 2025-05-01
Application No.: US18385749
Application Date: 2023-10-31
Applicant: Lemon Inc.
Inventor: Zongyu Yin, Qingqing Huang, Janne Jayne Harm Renee Spijkervet
Abstract: The present disclosure describes techniques for generating audio representations using a machine learning model. A machine learning model is pre-trained using unlabeled audio data. The pre-training enables the machine learning model to recognize audio patterns and generate initial audio representations. The machine learning model is refined by a task-specific fine-tuning process using labeled data. The task-specific fine-tuning process incorporates multi-task learning heads to optimize the machine learning model. The task-specific fine-tuning process enables the machine learning model to specialize in specific audio tasks and generate continuous audio representations. The continuous audio representations retain acoustic nuances and subtleties of audio signals. The machine learning model is configured and enabled to generate quantized audio representations by incorporating vector quantization into the task-specific fine-tuning process.
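Note: The quantization step named in the abstract can be pictured with a generic nearest-neighbor codebook lookup. The sketch below is an illustration only, not the method claimed in the application: the codebook values, the two-dimensional frames, and the function names `squared_distance` and `quantize` are all assumptions made for the example. In a trained model the codebook would be learned rather than fixed by hand.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous frame to its nearest codebook entry.

    Returns the chosen code indices and the corresponding code vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the code whose vector is closest to this frame.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical 3-entry codebook and three 2-D "continuous representations".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, quantized = quantize(frames, codebook)
```

Here each continuous frame is replaced by a discrete index into the codebook, which is the sense in which a quantized representation is coarser than the continuous one it is derived from.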
-
Publication No.: US20250078814A1
Publication Date: 2025-03-06
Application No.: US18819280
Application Date: 2024-08-29
Applicant: Lemon Inc., Beijing Zitiao Network Technology Co., Ltd.
Inventor: Dong Guo, Zihao He, Weituo Hao, Xuchen Song, Zongyu Yin, Jingsong Gao, Wei Tsung Lu, Junyu Dai
IPC: G10L15/06, G06F40/126, G10L25/30
Abstract: The present disclosure provides a multi-modal encoder processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring a pair of mask samples to be processed, the pair including a text sample and an audio sample associated with each other, at least one of which is masked; generating, based on a multi-modal encoder, a text encoding feature of the text sample and an audio encoding feature of the audio sample, where a linear spectrum feature of the audio sample is fused into the text encoding feature and a linear word feature of the text sample is fused into the audio encoding feature; and predicting the masked information according to the text encoding feature and the audio encoding feature, and correcting the multi-modal encoder based on the accuracy of the predicted mask information.
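Note: The masking and accuracy steps described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the claimed method: the `[MASK]` placeholder, the mask ratio, and the function names `mask_sequence` and `masked_accuracy` are hypothetical, and the multi-modal encoder with its cross-modal feature fusion is omitted entirely. The same masking routine could be applied to a sequence of audio frames in place of text tokens.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def mask_sequence(
    tokens: Sequence[Hashable],
    mask_ratio: float,
    rng: random.Random,
) -> Tuple[List[Hashable], Dict[int, Hashable]]:
    """Hide a random subset of positions and remember their original values."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets: Dict[int, Hashable] = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets


def masked_accuracy(
    predictions: Dict[int, Hashable],
    targets: Dict[int, Hashable],
) -> float:
    """Fraction of masked positions whose prediction matches the original."""
    if not targets:
        return 0.0
    correct = sum(1 for p, t in targets.items() if predictions.get(p) == t)
    return correct / len(targets)


# Hypothetical text sample from a text/audio pair.
rng = random.Random(0)
text = ["the", "cat", "sat", "down"]
masked_text, text_targets = mask_sequence(text, 0.5, rng)

# A perfect predictor recovers every masked token; an empty one recovers none.
perfect = masked_accuracy(dict(text_targets), text_targets)
empty = masked_accuracy({}, text_targets)
```

In a training loop, an accuracy or loss computed this way over the masked positions would be the signal used to update the encoder.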
-