-
Publication No.: US20240371164A1
Publication Date: 2024-11-07
Application No.: US18652703
Filing Date: 2024-05-01
Applicant: Google LLC
Inventor: Shen Yan , Xuehan Xiong , Arsha Nagrani , Anurag Arnab , David Alexander Ross , Cordelia Schmid
IPC: G06V20/40 , G06V10/774 , G06V10/80
Abstract: Methods and systems for video localization using artificial intelligence are provided herein. A set of video embeddings representing features of one or more video frames of a media item and a set of textual embeddings corresponding to an event associated with the media item are obtained. Fused video-textual data is generated based on the set of video embeddings and the set of textual embeddings. The fused video-textual data indicates features of the video frames of the media item and textual data pertaining to the media item. The fused video-textual data is provided as an input to an artificial intelligence (AI) model trained to perform multiple video localization tasks with respect to media items of a platform. One or more outputs of the AI model are obtained. A segment of the media item that depicts the event is determined based on the one or more outputs of the AI model.
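As an illustration only (not taken from the patent), a minimal sketch of a fuse-then-localize pipeline of this shape. All module names, dimensions, and the concatenation-based fusion are assumptions, since the abstract does not specify the fusion mechanism:

    import torch
    import torch.nn as nn

    class FusedLocalizer(nn.Module):
        # Fuse per-frame video embeddings with text embeddings, then score
        # each frame for relevance to the described event.
        def __init__(self, dim=256, heads=4, layers=2):
            super().__init__()
            enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
            self.head = nn.Linear(dim, 1)

        def forward(self, video_emb, text_emb):
            # video_emb: (B, n_frames, dim); text_emb: (B, n_tokens, dim)
            fused = torch.cat([video_emb, text_emb], dim=1)  # assumed: simple concat fusion
            fused = self.encoder(fused)
            n_frames = video_emb.shape[1]
            return self.head(fused[:, :n_frames]).squeeze(-1)  # (B, n_frames) scores

    model = FusedLocalizer()
    scores = model(torch.randn(1, 32, 256), torch.randn(1, 8, 256))
    segment_frames = (scores.sigmoid() > 0.5).nonzero()  # frames forming the event segment

Frames whose scores clear the threshold would define the segment of the media item that depicts the event.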
-
Publication No.: US20250061917A1
Publication Date: 2025-02-20
Application No.: US18235372
Filing Date: 2023-08-18
Applicant: Google LLC
Inventor: Josh Belanich , Taesik Gong , Krishna Somandepalli , Brian Eoff , Brendan Wesley Jou , Arsha Nagrani
Abstract: The technology relates to enhancing speech emotion recognition models with methods that enable the use of unlabeled data by inferring weak emotion labels. This is done using pre-trained large language models through weakly-supervised learning. To infer weak labels constrained to a taxonomy, a textual entailment approach selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. The system may employ a method that generates, by one or more processors, a text transcript for a snippet of input speech, and then applies the text transcript to a pre-trained language model. The system can generate, using the pre-trained language model according to an engineered prompt and a predetermined taxonomy, a textual entailment from the text transcript. Based on this, the system may generate, by the one or more processors using the textual entailment, a predicted emotion corresponding to the input speech.
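For illustration (not from the patent), entailment-based weak labeling of this kind can be sketched with an off-the-shelf NLI model; the specific model, taxonomy, and prompt template below are assumptions:

    from transformers import pipeline

    # NLI-based zero-shot classification scores entailment of each taxonomy
    # label against the ASR transcript; the top label becomes the weak label.
    nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    transcript = "I can't believe we actually won the finals!"  # stand-in ASR output
    taxonomy = ["joy", "anger", "sadness", "fear", "surprise", "neutral"]  # assumed taxonomy

    result = nli(transcript, candidate_labels=taxonomy,
                 hypothesis_template="The speaker feels {}.")
    weak_label = result["labels"][0]  # label with the highest entailment score

The weak label could then supervise a speech emotion model without any human annotation of the audio.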
-
Publication No.: US20240127794A1
Publication Date: 2024-04-18
Application No.: US17957291
Filing Date: 2022-09-30
Applicant: Google LLC
Inventor: Hongsuck Seo , Arsha Nagrani , Anurag Arnab , Cordelia Luise Schmid
CPC classification number: G10L15/063 , G10L15/24 , G10L15/26
Abstract: Systems and methods for performing captioning for image or video data are described herein. The method can include receiving unlabeled multimedia data, and outputting, from a machine learning model, one or more captions for the multimedia data. Training the machine learning model to create these outputs can include inputting a subset of video frames and a first utterance into the machine learning model, using the machine learning model to predict a predicted utterance based on the subset of video frames and the first utterance, and updating one or more parameters of the machine learning model based on a loss function that compares the predicted utterance with a second utterance.
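A minimal sketch of one such training step, for illustration only; the model signature and the token-level cross-entropy loss are assumptions, as the abstract does not name the loss:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, frames, first_utt_ids, second_utt_ids):
        # assumed: model(frames, first_utt_ids) -> (B, T, vocab) logits
        # for the predicted utterance
        logits = model(frames, first_utt_ids)
        # comparing the predicted utterance with the ground-truth second
        # utterance plays the role of the abstract's loss function
        loss = F.cross_entropy(logits.transpose(1, 2), second_utt_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()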
-
Publication No.: US20230177384A1
Publication Date: 2023-06-08
Application No.: US17545526
Filing Date: 2021-12-08
Applicant: Google LLC
Inventor: Arsha Nagrani , Shan Yang , Anurag Arnab , Chen Sun , Cordelia Luise Schmid
Abstract: Example embodiments according to aspects of the present disclosure provide an example computer-implemented method for multimodal data processing with improved cross-modal attention. The example method includes inputting a multimodal sequence to an example machine-learned model. The example model includes a first modal processing stream receiving a first modal portion of the multimodal sequence and a second modal processing stream receiving a second modal portion of the multimodal sequence. The example method includes fusing the first modal processing stream and the second modal processing stream across one or more fusion layers of the machine-learned model through a plurality of cross-modal context encodings. The example method includes outputting an inference based at least in part on the plurality of cross-modal context encodings.
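One plausible reading of a fusion layer of this kind, sketched for illustration only: each stream attends over its own tokens plus a small set of shared context tokens that carry information between modalities. The use of shared context tokens, the averaging rule, and all names are assumptions, not the patent's specified mechanism:

    import torch
    import torch.nn as nn

    class CrossModalFusionLayer(nn.Module):
        # Each modality stream self-attends over its own tokens plus shared
        # context tokens; the updated context tokens pass information
        # between the two streams at each fusion layer.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, tok_a, tok_b, context):
            seq_a = torch.cat([tok_a, context], dim=1)
            seq_b = torch.cat([tok_b, context], dim=1)
            out_a, _ = self.attn_a(seq_a, seq_a, seq_a)
            out_b, _ = self.attn_b(seq_b, seq_b, seq_b)
            n_a, n_b = tok_a.shape[1], tok_b.shape[1]
            # average the two streams' views of the shared context tokens
            new_context = 0.5 * (out_a[:, n_a:] + out_b[:, n_b:])
            return out_a[:, :n_a], out_b[:, :n_b], new_context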