-
Publication No.: US20230154188A1
Publication date: 2023-05-18
Application No.: US17566173
Filing date: 2021-12-30
Applicant: salesforce.com, inc.
Inventor: Dongxu Li, Junnan Li, Chu Hong Hoi
IPC: G06V20/40, G06V10/74, G06V10/26, G06V10/80, G06F40/284
CPC classification number: G06V20/41, G06V10/761, G06V20/47, G06V10/26, G06V10/806, G06F40/284
Abstract: Embodiments describe a method of video-text pre-learning that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework for video and language pre-training encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
-
Publication No.: US20230154146A1
Publication date: 2023-05-18
Application No.: US17566061
Filing date: 2021-12-30
Applicant: salesforce.com, inc.
Inventor: Dongxu Li, Junnan Li, Chu Hong Hoi
IPC: G06V10/74, G06V10/774, G06F40/279, G06V20/40, G06V10/776
CPC classification number: G06V10/761, G06V10/774, G06F40/279, G06V20/47, G06V20/41, G06V10/776, G06V20/46
Abstract: Embodiments describe a method of video-text pre-learning that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework for video and language pre-training encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
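Illustrative note: the data flow named in the abstract (independent encoding of frames and text, followed by a cross-modal step) can be sketched roughly as below. This is a toy sketch under stated assumptions, not the claimed method: the functions `encode_frames`, `encode_text`, and `cross_modal_attention` are hypothetical stand-ins, the "embeddings" are hand-made two-dimensional features rather than transformer outputs, and the single unscaled attention step only gestures at what a multi-modal encoder does. Prompting entity modeling is not represented.

```python
import math
from typing import List

Vector = List[float]


def encode_frames(frames: List[List[float]]) -> List[Vector]:
    """Hypothetical stand-in for a video encoder.

    Maps each sparsely sampled frame (a flat list of pixel values) to a
    2-d feature [mean, max]. A real encoder would be a transformer.
    """
    return [[sum(f) / len(f), max(f)] for f in frames]


def encode_text(tokens: List[str]) -> List[Vector]:
    """Hypothetical stand-in for a text encoder.

    Maps each token to a 2-d feature, independently of the video.
    """
    return [[float(len(t)), float(sum(map(ord, t)) % 7)] for t in tokens]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_emb: List[Vector],
                          video_emb: List[Vector]) -> List[Vector]:
    """Each text token attends over all frame embeddings.

    Returns one fused vector per token: a softmax-weighted average of
    the frame embeddings (single head, no learned projections).
    """
    fused = []
    for query in text_emb:
        weights = softmax([dot(query, key) for key in video_emb])
        dim = len(video_emb[0])
        fused.append([
            sum(w * key[i] for w, key in zip(weights, video_emb))
            for i in range(dim)
        ])
    return fused


if __name__ == "__main__":
    video = encode_frames([[0.0, 1.0], [2.0, 4.0]])  # two sparse frames
    text = encode_text(["a", "dog"])                 # two text tokens
    print(cross_modal_attention(text, video))
```

The two encoders never see each other's input; interaction between the modalities happens only in the final step, which mirrors the encode-independently-then-fuse ordering described in the abstract.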
-