-
Publication No.: US20240282294A1
Publication Date: 2024-08-22
Application No.: US18651296
Filing Date: 2024-04-30
Applicant: Google LLC
Inventor: Qingqing Huang , Daniel Sung-Joon Park , Aren Jansen , Timo Immanuel Denk , Yue Li , Ravi Ganti , Dan Ellis , Tao Wang , Wei Han , Joonseok Lee
CPC classification number: G10L15/063 , G10L15/16
Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
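The retrieval step described above pairs each audio recording with the sentence(s) closest to it in the joint audio-text embedding space. A minimal sketch of that nearest-neighbor lookup, using cosine similarity, is shown below; the 4-dimensional toy embeddings and the function name `nearest_sentences` are illustrative assumptions, not the patent's actual model or dimensions.

```python
import numpy as np

def nearest_sentences(audio_emb, text_embs, k=1):
    """Return indices of the k text embeddings closest to the audio
    embedding under cosine similarity in a shared embedding space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a                      # cosine similarity per sentence
    return np.argsort(-sims)[:k]     # indices of the k most similar

# Toy 4-dim embeddings standing in for the joint audio-text space.
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a dog barking"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "rain falling on a roof"
    [0.7, 0.7, 0.0, 0.0],   # e.g. "a dog barking in the rain"
])
audio_emb = np.array([0.9, 0.1, 0.0, 0.0])
print(nearest_sentences(audio_emb, text_embs, k=2))  # → [0 2]
```

In the patent's pipeline, the retrieved sentence indices would select captions to pair with the recording as training data; here the selection rule itself is all that is sketched.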
-
Publication No.: US20230419989A1
Publication Date: 2023-12-28
Application No.: US17808653
Filing Date: 2022-06-24
Applicant: Google LLC
Inventor: Beat Gfeller , Kevin Ian Kilgour , Marco Tagliasacchi , Aren Jansen , Scott Thomas Wisdom , Qingqing Huang
CPC classification number: G10L25/84 , G10L15/16 , G10L15/063 , G06N3/0454
Abstract: Example methods include receiving training data comprising a plurality of audio clips and a plurality of textual descriptions of audio. The methods include generating a shared representation comprising a joint embedding. An audio embedding of a given audio clip is within a threshold distance of a text embedding of a textual description of the given audio clip. The methods include generating, based on the joint embedding, a conditioning vector and training, based on the conditioning vector, a neural network to: receive (i) an input audio waveform, and (ii) an input comprising one or more of an input textual description of a target audio source in the input audio waveform, or an audio sample of the target audio source, separate audio corresponding to the target audio source from the input audio waveform, and output the separated audio corresponding to the target audio source in response to the receiving of the input.
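Because the text embedding and the audio embedding of the same source lie within a threshold distance of each other in the joint space, either modality can drive the conditioning vector that steers separation. The sketch below illustrates that interchangeability with a toy stand-in for the trained network: the projection `proj`, the gate parameters `w`, and the dimensions are all hypothetical assumptions, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: joint embedding dim D, conditioning dim M.
D, M = 8, 4
proj = rng.standard_normal((M, D))   # projection into conditioning space
w = rng.standard_normal(M)           # toy separation parameters

def conditioning_vector(query_emb):
    """Map a joint-space embedding (from either a textual description or
    an audio sample of the target source) to a conditioning vector."""
    return np.tanh(proj @ query_emb)

def separate(mixture, cond):
    """Stand-in for the trained network: a conditioning-dependent
    scalar soft mask applied to the input waveform."""
    gate = 1.0 / (1.0 + np.exp(-float(w @ cond)))  # sigmoid in (0, 1)
    return gate * mixture

text_emb = rng.standard_normal(D)                      # text query
audio_emb = text_emb + 0.01 * rng.standard_normal(D)   # nearby audio query
mixture = rng.standard_normal(100)                     # input waveform

out_text = separate(mixture, conditioning_vector(text_emb))
out_audio = separate(mixture, conditioning_vector(audio_emb))
```

Since the two query embeddings are close in the joint space, the two conditioning modes produce nearly the same output, which is the property the shared representation is trained to provide.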
-