-
Publication number: US20210295091A1
Publication date: 2021-09-23
Application number: US16870621
Application date: 2020-05-08
Applicant: salesforce.com, inc.
Inventor: Junnan Li , Chu Hong Hoi
Abstract: The system and method are directed to prototypical contrastive learning (PCL). PCL explicitly encodes the hierarchical semantic structure of the dataset into the learned embedding space and prevents the network from exploiting low-level cues to solve the unsupervised learning task. PCL introduces prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an expectation-maximization framework. PCL iteratively performs an E-step, which finds prototypes by clustering, and an M-step, which optimizes the network on a contrastive loss.
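The E-step/M-step alternation described in the abstract can be sketched as follows. This is an illustrative outline only, not the patented implementation: the k-means E-step, the toy ProtoNCE-style loss, and all parameter choices (cluster count, temperature) are assumptions made for the sketch.

```python
import numpy as np

def e_step(embeddings, k, iters=10, seed=0):
    """E-step: find k prototypes by k-means clustering of the embeddings."""
    rng = np.random.default_rng(seed)
    protos = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest prototype
        d = np.linalg.norm(embeddings[:, None] - protos[None], axis=-1)
        assign = d.argmin(axis=1)
        # move each prototype to the mean of its assigned embeddings
        for c in range(k):
            if (assign == c).any():
                protos[c] = embeddings[assign == c].mean(axis=0)
    return protos, assign

def proto_nce_loss(embeddings, protos, assign, temp=0.1):
    """M-step objective: a contrastive loss pulling each embedding toward
    its own prototype and away from the other prototypes."""
    logits = embeddings @ protos.T / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(assign)), assign].mean()
```

In the full method the two steps alternate: the network embeddings feed `e_step`, and the resulting loss is minimized over the network parameters before re-clustering.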
-
Publication number: US12210976B2
Publication date: 2025-01-28
Application number: US17219339
Application date: 2021-03-31
Applicant: Salesforce.com, Inc.
Inventor: Hualin Liu , Chu Hong Hoi , Junnan Li
IPC: G06N3/084 , G06F18/214 , G06F18/22 , G06N3/088 , G06V10/75
Abstract: Embodiments described herein provide systems and methods for learning representations from unlabeled videos. Specifically, a method may comprise generating a set of strongly-augmented samples and a set of weakly-augmented samples from the unlabeled video samples; generating a set of predictive logits by inputting the set of strongly-augmented samples into a student model and a first teacher model; generating a set of artificial labels by inputting the set of weakly-augmented samples into a second teacher model that operates in parallel to the first teacher model, wherein the second teacher model shares one or more model parameters with the first teacher model; computing a loss objective based on the set of predictive logits and the set of artificial labels; updating student model parameters based on the loss objective via backpropagation; and updating the shared parameters for the first teacher model and the second teacher model based on the updated student model parameters.
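One training step of the student/teacher scheme described above can be sketched as follows. A linear classifier stands in for each network, Gaussian noise stands in for the augmentations, and the exponential-moving-average update for the shared teacher parameters is an assumption; none of this is the patented implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def training_step(x, W_student, W_teacher, lr=0.1, ema=0.99, rng=None):
    """One student/teacher step: strong view -> student logits,
    weak view -> artificial labels, cross-entropy loss, EMA teacher."""
    if rng is None:
        rng = np.random.default_rng(0)
    strong = x + rng.normal(0, 0.5, x.shape)   # strongly-augmented view
    weak = x + rng.normal(0, 0.05, x.shape)    # weakly-augmented view
    preds = softmax(strong @ W_student)        # student predictions
    labels = softmax(weak @ W_teacher)         # artificial labels (teacher)
    # cross-entropy between artificial labels and student predictions
    loss = -(labels * np.log(preds + 1e-9)).sum(axis=1).mean()
    # softmax-cross-entropy gradient w.r.t. the student weights
    grad = strong.T @ (preds - labels) / len(x)
    W_student = W_student - lr * grad
    # shared teacher parameters track the student via a moving average
    W_teacher = ema * W_teacher + (1 - ema) * W_student
    return W_student, W_teacher, loss
```

Repeating this step over batches of unlabeled clips mirrors the described loop: the student is trained by backpropagation, and the teacher's shared parameters are refreshed from the updated student.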
-
Publication number: US11776236B2
Publication date: 2023-10-03
Application number: US17591121
Application date: 2022-02-02
Applicant: salesforce.com, inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06K9/62 , G06V10/44 , G06T7/73 , G06F18/23 , G06F18/214 , G06V10/762 , G06V10/774 , G06V10/776 , G06V10/82
CPC classification number: G06V10/454 , G06F18/2155 , G06F18/23 , G06T7/73 , G06V10/763 , G06V10/776 , G06V10/7753 , G06V10/82 , G06T2207/20084
Abstract: The system and method are directed to prototypical contrastive learning (PCL). PCL explicitly encodes the hierarchical semantic structure of the dataset into the learned embedding space and prevents the network from exploiting low-level cues to solve the unsupervised learning task. PCL introduces prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an expectation-maximization framework. PCL iteratively performs an E-step, which finds prototypes by clustering, and an M-step, which optimizes the network on a contrastive loss.
-
Publication number: US20230154188A1
Publication date: 2023-05-18
Application number: US17566173
Application date: 2021-12-30
Applicant: salesforce.com, inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V20/40 , G06V10/74 , G06V10/26 , G06V10/80 , G06F40/284
CPC classification number: G06V20/41 , G06V10/761 , G06V20/47 , G06V10/26 , G06V10/806 , G06F40/284
Abstract: Embodiments described herein provide a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video-and-language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
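The independent-encoding step described above can be sketched as follows: frames and tokens are projected separately and pooled, and a video-text similarity matrix is formed whose diagonal holds matched pairs. The linear projections and mean-pooling are stand-ins for the transformer encoders (an assumption for the sketch); the cross-modal encoder and prompting entity modeling are not shown.

```python
import numpy as np

def encode_video(frames, W_v):
    """Stand-in video encoder: project each frame, then mean-pool."""
    return (frames @ W_v).mean(axis=0)

def encode_text(tokens, W_t):
    """Stand-in text encoder: project each token embedding, then mean-pool."""
    return (tokens @ W_t).mean(axis=0)

def video_text_alignment(videos, texts, W_v, W_t, temp=0.07):
    """Similarity matrix between a batch of videos and texts; the
    diagonal holds matched pairs, as in a contrastive alignment loss."""
    v = np.stack([encode_video(f, W_v) for f in videos])
    t = np.stack([encode_text(s, W_t) for s in texts])
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return v @ t.T / temp
```

In the described framework, this alignment between unimodal features precedes the multi-modal encoder that models finer cross-modal interaction.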
-
Publication number: US20230162490A1
Publication date: 2023-05-25
Application number: US17589725
Application date: 2022-01-31
Applicant: salesforce.com, inc.
Inventor: Shu Zhang , Junnan Li , Ran Xu , Caiming Xiong , Chetan Ramaiah
IPC: G06V10/776 , G06V10/74 , G06F40/284 , G06F40/166 , G06F40/126 , G06V10/80 , G06F16/583 , G06F16/56
CPC classification number: G06V10/776 , G06V10/761 , G06F40/284 , G06F40/166 , G06F40/126 , G06V10/806 , G06F16/5846 , G06F16/56
Abstract: Embodiments described herein provide a CROss-Modal Distribution Alignment (CROMDA) model for vision-language pretraining, which can be used for retrieval downstream tasks. In the CROMDA model, global cross-modal representations are aligned on each unimodality. Specifically, a uni-modal global similarity between an image/text and the image/text feature queue is computed. A softmax-normalized distribution is then generated based on the computed similarity. The distribution thus takes advantage of the global structure of the queue. CROMDA then aligns the two distributions and learns a modal-invariant global representation. In this way, CROMDA obtains an invariance property in each modality, where images with similar text representations should be similar and vice versa.
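The distribution-alignment step described above can be sketched as follows. The sketch assumes the image and text queues hold paired features of equal length, and measures distribution disagreement with a symmetric KL term; the choice of divergence and the temperature are assumptions, not details taken from the patent.

```python
import numpy as np

def softmax(z, temp=0.1):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_distribution_alignment(img_feat, txt_feat, img_queue, txt_queue):
    """Uni-modal similarity distributions over each feature queue,
    aligned with a symmetric KL divergence."""
    p_img = softmax(img_feat @ img_queue.T)  # image vs. image queue
    p_txt = softmax(txt_feat @ txt_queue.T)  # text vs. text queue
    def kl(p, q):
        # epsilon-smoothed KL divergence, summed over queue entries
        return (p * np.log((p + 1e-9) / (q + 1e-9))).sum(axis=-1)
    return 0.5 * (kl(p_img, p_txt) + kl(p_txt, p_img)).mean()
```

Minimizing this term drives the image-side and text-side distributions over the queue toward each other, which is the modal-invariance property the abstract describes.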
-
Publication number: US20230154146A1
Publication date: 2023-05-18
Application number: US17566061
Application date: 2021-12-30
Applicant: salesforce.com, inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V10/74 , G06V10/774 , G06F40/279 , G06V20/40 , G06V10/776
CPC classification number: G06V10/761 , G06V10/774 , G06F40/279 , G06V20/47 , G06V20/41 , G06V10/776 , G06V20/46
Abstract: Embodiments described herein provide a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video-and-language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
-
Publication number: US11263476B2
Publication date: 2022-03-01
Application number: US16870621
Application date: 2020-05-08
Applicant: salesforce.com, inc.
Inventor: Junnan Li , Chu Hong Hoi
Abstract: The system and method are directed to prototypical contrastive learning (PCL). PCL explicitly encodes the hierarchical semantic structure of the dataset into the learned embedding space and prevents the network from exploiting low-level cues to solve the unsupervised learning task. PCL introduces prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an expectation-maximization framework. PCL iteratively performs an E-step, which finds prototypes by clustering, and an M-step, which optimizes the network on a contrastive loss.
-
Publication number: US20210374553A1
Publication date: 2021-12-02
Application number: US17015858
Application date: 2020-09-09
Applicant: salesforce.com, inc.
Inventor: Junnan Li , Chu Hong Hoi
Abstract: Embodiments described herein provide systems and methods for noise-robust contrastive learning. In view of the need for a noise-robust learning system, embodiments described herein provide a contrastive learning mechanism that combats noise by learning robust representations of the noisy data samples. Specifically, the training images are projected into a low-dimensional subspace, and the geometric structure of the subspace is regularized with: (1) a consistency contrastive loss that enforces images with perturbations to have similar embeddings; and (2) a prototypical contrastive loss augmented with a predetermined learning principle, which encourages the embedding for a linearly-interpolated input to have the same linear relationship with respect to the class prototypes. The low-dimensional embeddings are also trained to reconstruct the high-dimensional features, which preserves the learned information and regularizes the classifier.
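The two regularizers described above can be sketched as follows. The mean-squared consistency term, the softmax prototype scores, and the identity stand-in for the encoder are assumptions made for illustration; the reconstruction term is omitted for brevity.

```python
import numpy as np

def consistency_loss(z1, z2):
    """Consistency term: perturbed views of the same image should have
    similar low-dimensional embeddings (here, mean squared distance)."""
    return ((z1 - z2) ** 2).sum(axis=1).mean()

def interpolation_loss(embed, x_a, x_b, protos, lam=0.3, temp=0.1):
    """Linear-interpolation term: the interpolated input's prototype
    scores should be the same linear mix of the endpoints' scores."""
    def scores(z):
        logits = z @ protos.T / temp
        logits = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=-1, keepdims=True)
    mixed = embed(lam * x_a + (1 - lam) * x_b)
    target = lam * scores(embed(x_a)) + (1 - lam) * scores(embed(x_b))
    return ((scores(mixed) - target) ** 2).sum(axis=1).mean()
```

Summing the two terms (plus the omitted reconstruction loss) over batches regularizes the geometry of the low-dimensional subspace as the abstract describes.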