-
Publication No.: US20250119624A1
Publication date: 2025-04-10
Application No.: US18894443
Filing date: 2024-09-24
Applicant: ADOBE INC.
Inventor: Seoung Wug Oh, Mingi Kwon, Joon-Young Lee, Yang Zhou, Difan Liu, Haoran Cai, Baqiao Liu, Feng Liu
IPC: H04N21/81
Abstract: A method, apparatus, non-transitory computer-readable medium, and system for generating synthetic videos include obtaining an input prompt describing a video scene. Embodiments then generate a plurality of frame-wise token embeddings corresponding to a sequence of video frames, respectively, based on the input prompt. Subsequently, embodiments generate, using a video generation model, a synthesized video depicting the video scene. The synthesized video includes a plurality of images corresponding to the sequence of video frames.
-
Publication No.: US12119028B2
Publication date: 2024-10-15
Application No.: US17967364
Filing date: 2022-10-17
Applicant: Adobe Inc.
Inventor: Xue Bai, Justin Jonathan Salamon, Aseem Omprakash Agarwala, Hijung Shin, Haoran Cai, Joel Richard Brandt, Lubomira Assenova Dontcheva, Cristin Ailidh Fraser
IPC: G11B27/036, G06F40/166, G10L15/26, G10L25/57, G11B27/34, G06F3/0482, G06F3/04845, G06F3/0485
CPC classification number: G11B27/036, G06F40/166, G10L15/26, G10L25/57, G11B27/34, G06F3/0482, G06F3/04845, G06F3/0485
Abstract: Embodiments of the present invention provide systems, methods, and computer storage media for identifying candidate boundaries for video segments, video segment selection using those boundaries, and text-based video editing of video segments selected via transcript interactions. In an example implementation, boundaries of detected sentences and words are extracted from a transcript, the boundaries are retimed into an adjacent speech gap to a location where voice or audio activity is a minimum, and the resulting boundaries are stored as candidate boundaries for video segments. As such, a transcript interface presents the transcript, interprets input selecting transcript text as an instruction to select a video segment with corresponding boundaries selected from the candidate boundaries, and interprets commands that are traditionally thought of as text-based operations (e.g., cut, copy, paste) as an instruction to perform a corresponding video editing operation using the selected video segment.
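The retiming step described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function name `retime_boundary`, the per-frame `activity` scores, and the fixed hop size are all assumptions made for illustration. The sketch moves a boundary to the point of lowest audio activity inside an adjacent speech gap.

```python
from typing import Sequence


def retime_boundary(
    activity: Sequence[float],
    hop_s: float,
    gap_start_s: float,
    gap_end_s: float,
) -> float:
    """Return the time inside the gap where audio activity is lowest.

    `activity` holds one score per analysis frame (hypothetical input,
    e.g. voice-activity or energy values), spaced `hop_s` seconds apart.
    """
    # Convert the gap limits from seconds to frame indices, clamped to the signal.
    first = max(0, int(round(gap_start_s / hop_s)))
    last = min(len(activity) - 1, int(round(gap_end_s / hop_s)))
    if first > last:
        raise ValueError("gap lies outside the activity signal")
    # Pick the frame with minimum activity; ties resolve to the earliest frame.
    best = min(range(first, last + 1), key=lambda i: activity[i])
    return best * hop_s


# Example: a word boundary next to a gap spanning 0.2 s to 0.4 s.
scores = [0.9, 0.8, 0.3, 0.05, 0.2, 0.7]
candidate = retime_boundary(scores, hop_s=0.1, gap_start_s=0.2, gap_end_s=0.4)
```

The returned time would then be stored as a candidate boundary, so that selecting transcript text snaps the video segment to the quietest point in the gap.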
-
Publication No.: US12299401B2
Publication date: 2025-05-13
Application No.: US17967562
Filing date: 2022-10-17
Applicant: Adobe Inc.
Inventor: Hanieh Deilamsalehy, Aseem Omprakash Agarwala, Haoran Cai, Hijung Shin, Joel Richard Brandt, Lubomira Assenova Dontcheva
Abstract: Embodiments of the present invention provide systems, methods, and computer storage media for segmenting a transcript into paragraphs. In an example embodiment, a transcript is segmented to start a new paragraph whenever there is a change in speaker and/or a long pause in speech. If any remaining paragraphs are longer than a designated length or duration (e.g., 50 or 100 words), each of those paragraphs is segmented using dynamic programming to minimize a cost function that penalizes candidate paragraphs based on divergence from a target paragraph length and/or that rewards candidate paragraphs that group semantically similar sentences. As such, the transcript is visualized, segmented at the identified paragraphs.
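The dynamic-programming step in this abstract can be sketched as follows. This is an illustrative sketch only: the squared-length penalty, the additive similarity reward, and the names `segment_paragraph`, `target_len`, and `sim_weight` are assumptions, since the abstract does not specify the exact cost function.

```python
from typing import List, Optional, Sequence


def segment_paragraph(
    word_counts: Sequence[int],
    target_len: int,
    similarity: Optional[Sequence[float]] = None,
    sim_weight: float = 0.0,
) -> List[int]:
    """Split sentences into paragraphs; return each paragraph's first sentence index.

    word_counts[k] is the length of sentence k. similarity[k] (optional) scores
    how related sentences k and k+1 are. Both the cost terms are assumptions.
    """
    n = len(word_counts)
    if n == 0:
        return []
    sim = list(similarity) if similarity is not None else [0.0] * (n - 1)
    if len(sim) != n - 1:
        raise ValueError("similarity needs one value per adjacent sentence pair")

    # Prefix sums so any candidate paragraph can be costed in O(1).
    words = [0]
    for count in word_counts:
        words.append(words[-1] + count)
    sims = [0.0]
    for value in sim:
        sims.append(sims[-1] + value)

    def cost(i: int, j: int) -> float:
        """Cost of one paragraph covering sentences i .. j-1."""
        length_penalty = (words[j] - words[i] - target_len) ** 2
        cohesion = sims[j - 1] - sims[i]  # similarity of pairs kept together
        return length_penalty - sim_weight * cohesion

    # best[j] is the minimum total cost of segmenting the first j sentences.
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            total = best[i] + cost(i, j)
            if total < best[j]:
                best[j], back[j] = total, i

    # Walk the back-pointers to recover the paragraph start indices.
    starts = []
    j = n
    while j > 0:
        starts.append(back[j])
        j = back[j]
    return starts[::-1]


# Example: five sentences, aiming for roughly 50 words per paragraph.
paragraph_starts = segment_paragraph([20, 30, 25, 25, 50], target_len=50)
```

In this example the paragraphs start at sentences 0, 2, and 4, because each resulting paragraph hits the 50-word target exactly. Raising `sim_weight` would bias the split toward keeping similar adjacent sentences together.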
-
Publication No.: US12125501B2
Publication date: 2024-10-22
Application No.: US17967399
Filing date: 2022-10-17
Applicant: Adobe Inc.
Inventor: Fabian David Caba Heilbron, Xue Bai, Aseem Omprakash Agarwala, Haoran Cai, Lubomira Assenova Dontcheva
IPC: G11B27/031, G06V20/40
CPC classification number: G11B27/031, G06V20/41
Abstract: Embodiments of the present invention provide systems, methods, and computer storage media for face-aware speaker diarization. In an example embodiment, an audio-only speaker diarization technique is applied to generate an audio-only speaker diarization of a video, an audio-visual speaker diarization technique is applied to generate a face-aware speaker diarization of the video, and the audio-only speaker diarization is refined using the face-aware speaker diarization to generate a hybrid speaker diarization that links detected faces to detected voices. In some embodiments, to accommodate videos with small faces that appear pixelated, a cropped image of any given face is extracted from each frame of the video, and the size of the cropped image is used to select a corresponding active speaker detection model to predict an active speaker score for the face in the cropped image.
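The size-based model selection mentioned at the end of this abstract might look like the sketch below. The pixel thresholds, the model names (`asd_tiny`, `asd_small`, `asd_standard`), and the use of the shorter crop side as the size measure are all hypothetical; the abstract states only that the cropped image's size selects a corresponding active speaker detection model.

```python
from typing import List, Tuple

# Hypothetical tiers: (exclusive upper bound on the shorter side in pixels, model name).
SIZE_TIERS: List[Tuple[int, str]] = [
    (32, "asd_tiny"),
    (64, "asd_small"),
]
DEFAULT_MODEL = "asd_standard"  # hypothetical model for larger faces


def select_asd_model(crop_width: int, crop_height: int) -> str:
    """Pick an active speaker detection model name from the face crop size."""
    shorter_side = min(crop_width, crop_height)
    for limit, model_name in SIZE_TIERS:
        if shorter_side < limit:
            return model_name
    return DEFAULT_MODEL


# Example: a small, likely pixelated face crop is routed to the smallest tier.
chosen = select_asd_model(20, 24)
```

The chosen model would then predict the active speaker score for the face in that crop, so that small faces are scored by a model suited to low-resolution input.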
-