-
Publication number: US20230308823A1
Publication date: 2023-09-28
Application number: US18042258
Filing date: 2020-08-26
Applicant: Manoj PLAKAL , Dan ELLIS , Shawn HERSHEY , Richard Channing MOORE, III , Aren JANSEN , Google LLC
Inventor: Aren Jansen , Manoj Plakal , Dan Ellis , Shawn Hershey , Richard Channing Moore, III
IPC: H04S7/00
CPC classification number: H04S7/301 , H04S2400/01
Abstract: A computer-implemented method for upmixing audiovisual data can include obtaining audiovisual data including input audio data and video data accompanying the input audio data. Each frame of the video data can depict only a portion of a larger scene. The input audio data can have a first number of audio channels. The computer-implemented method can include providing the audiovisual data as input to a machine-learned audiovisual upmixing model. The audiovisual upmixing model can include a sequence-to-sequence model configured to model a respective location of one or more audio sources within the larger scene over multiple frames of the video data. The computer-implemented method can include receiving upmixed audio data from the audiovisual upmixing model. The upmixed audio data can have a second number of audio channels. The second number of audio channels can be greater than the first number of audio channels.
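The core idea of the claim, producing output audio with more channels than the input by tracking where sources sit in the scene, can be illustrated with a minimal sketch. This is not the patented model: the sequence-to-sequence audiovisual upmixing model is replaced here by a hypothetical per-frame azimuth track (a stand-in for the locations such a model would predict), and `upmix_with_locations` and constant-power panning are illustrative assumptions, not elements of the claim.

```python
import numpy as np

def upmix_with_locations(input_audio, azimuths):
    """Hypothetical sketch: pan a 1-channel track into 2 channels using
    a per-video-frame azimuth track in [-1, 1] (left to right)."""
    num_samples = input_audio.shape[0]
    # Interpolate the per-frame azimuths to a per-sample azimuth signal.
    frame_positions = np.linspace(0, num_samples - 1, num=len(azimuths))
    az = np.interp(np.arange(num_samples), frame_positions, azimuths)
    # Constant-power panning: azimuth maps to left/right channel gains.
    theta = (az + 1.0) * np.pi / 4.0
    left = np.cos(theta) * input_audio
    right = np.sin(theta) * input_audio
    return np.stack([left, right], axis=1)

mono = np.random.randn(48000).astype(np.float32)  # first number of channels: 1
stereo = upmix_with_locations(mono, azimuths=[-1.0, 0.0, 1.0])
print(stereo.shape)  # second number of channels (2) exceeds the first (1)
```

The upmixed output has shape `(48000, 2)`: the second channel count is greater than the first, mirroring the mono-to-stereo case the abstract generalizes.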
-
Publication number: US20240282294A1
Publication date: 2024-08-22
Application number: US18651296
Filing date: 2024-04-30
Applicant: Google LLC
Inventor: Qingqing Huang , Daniel Sung-Joon Park , Aren Jansen , Timo Immanuel Denk , Yue Li , Ravi Ganti , Dan Ellis , Tao Wang , Wei Han , Joonseok Lee
CPC classification number: G10L15/063 , G10L15/16
Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
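The training-data pairing step, selecting the sentences closest to an audio recording within a joint audio-text embedding space, can be sketched as a nearest-neighbor search over embeddings. This is an illustrative assumption, not the patented system: `nearest_sentences` is a hypothetical helper, the embeddings are random stand-ins for the classification model's outputs, and cosine similarity is assumed since the abstract does not name the distance metric.

```python
import numpy as np

def nearest_sentences(audio_emb, sentence_embs, k=1):
    """Hypothetical sketch: return indices of the k sentence embeddings
    closest to an audio embedding in a shared space (cosine similarity)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    sims = s @ a  # cosine similarity of each sentence to the audio clip
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
sentence_embs = rng.normal(size=(5, 8))           # 5 generated sentences
audio_emb = sentence_embs[3] + 0.01 * rng.normal(size=8)  # clip near sentence 3
print(nearest_sentences(audio_emb, sentence_embs, k=1))
```

The selected sentence index is 3, the embedding the audio clip was placed closest to; each (recording, sentence) pair formed this way becomes a training example for the downstream generation and cascaded diffusion models.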
-