-
Publication No.: US20240420447A1
Publication Date: 2024-12-19
Application No.: US18336423
Filing Date: 2023-06-16
Applicant: Adobe Inc.
Inventor: Aishwarya Agarwal, Srikrishna Karanam, Balaji Vasan Srinivasan
Abstract: The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing difference attention to evaluate and/or train machine learning models. In particular, in some embodiments, the disclosed systems generate, utilizing a machine learning model, a first feature vector from a digital image. In one or more implementations, the disclosed systems generate a masked digital image by masking a region from the digital image. Additionally, in some embodiments, the disclosed systems generate, utilizing the machine learning model, a second feature vector from the masked digital image. Moreover, in some implementations, the disclosed systems determine a difference feature vector between the first feature vector and the second feature vector. Furthermore, in some embodiments, the disclosed systems generate, from the difference feature vector, a difference attention map reflecting a visual grounding of the machine learning model relative to the region.
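The steps named in this abstract (encode, mask, encode again, subtract, map) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the disclosure: the "model" is a hypothetical one-layer linear encoder over a flattened image, masking fills the region with zeros, and the rule that turns the difference feature vector into a per-pixel map is a simple projection chosen for readability.

```python
from typing import List, Sequence

Vector = List[float]


def encode(pixels: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical stand-in for the machine learning model: one linear layer, no bias."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def mask_region(pixels: Sequence[float], region: Sequence[int], fill: float = 0.0) -> Vector:
    """Return a copy of the image with the given pixel indices replaced by `fill`."""
    hidden = set(region)
    return [fill if i in hidden else p for i, p in enumerate(pixels)]


def difference_vector(first: Sequence[float], second: Sequence[float]) -> Vector:
    """Element-wise difference between the two feature vectors."""
    return [a - b for a, b in zip(first, second)]


def difference_attention_map(
    pixels: Sequence[float],
    weights: Sequence[Sequence[float]],
    diff: Sequence[float],
) -> Vector:
    """Assumed attribution rule (not from the source): score pixel i by
    |x_i * sum_k diff_k * W[k][i]| and scale the scores to [0, 1]."""
    raw = [
        abs(p * sum(d * row[i] for d, row in zip(diff, weights)))
        for i, p in enumerate(pixels)
    ]
    peak = max(raw) or 1.0  # avoid dividing by zero when nothing changed
    return [r / peak for r in raw]


# Toy 4-pixel "image" and a 2-feature encoder (illustrative values only).
pixels = [1.0, 2.0, 0.5, 3.0]
weights = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]]

first = encode(pixels, weights)
second = encode(mask_region(pixels, region=[3]), weights)
diff = difference_vector(first, second)
attention = difference_attention_map(pixels, weights, diff)
```

In this toy setup the masked pixel receives the highest score, and pixels that feed only the unaffected feature score zero, which is the kind of region-relative grounding signal the abstract describes.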
-
Publication No.: US20220230061A1
Publication Date: 2022-07-21
Application No.: US17153130
Filing Date: 2021-01-20
Applicant: Adobe Inc.
Inventor: Hrituraj Singh, Jatin Lamba, Denil Pareshbhai Mehta, Balaji Vasan Srinivasan, Anshul Nasery, Aishwarya Agarwal
Abstract: In some embodiments, a multimodal computing system receives a query and identifies, from source documents, text passages and images that are relevant to the query. The multimodal computing system accesses a multimodal question-answering model that includes a textual stream of language models and a visual stream of language models. Each of the textual stream and the visual stream contains a set of transformer-based models and each transformer-based model includes a cross-attention layer using data generated by both the textual stream and visual stream of language models as an input. The multimodal computing system identifies text relevant to the query by applying the textual stream to the text passages and computes, using the visual stream, relevance scores of the images to the query, respectively. The multimodal computing system further generates a response to the query by including the text and/or an image according to the relevance scores.
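Two ideas in this abstract lend themselves to a small sketch: a cross-attention layer that consumes data from both streams, and a response that includes an image according to its relevance score. The code below is an illustration under stated assumptions, not the claimed model: it uses plain single-head scaled dot-product attention with text features as queries and image features as keys and values, and the `threshold` parameter and the `build_response` helper are hypothetical names introduced here.

```python
import math
from typing import Dict, List, Optional, Sequence

Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Matrix:
    """Scaled dot-product attention where queries come from one stream
    and keys/values come from the other stream."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    fused = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        fused.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return fused


def build_response(
    text: str, image_scores: Dict[str, float], threshold: float = 0.5
) -> Dict[str, Optional[str]]:
    """Include the best-scoring image only if it clears the (assumed) threshold."""
    best = max(image_scores, key=image_scores.get, default=None)
    keep = best is not None and image_scores[best] >= threshold
    return {"text": text, "image": best if keep else None}


# Toy features: one text token attends over two image patches.
text_tokens = [[1.0, 0.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(text_tokens, image_patches, image_patches)

with_image = build_response("answer text", {"img_a": 0.8, "img_b": 0.3})
text_only = build_response("answer text", {"img_b": 0.3})
```

The text token aligns with the first patch, so the fused vector weights that patch more heavily; the response helper then shows how a relevance score can decide whether an image accompanies the text.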
-
Publication No.: US12198048B2
Publication Date: 2025-01-14
Application No.: US17153130
Filing Date: 2021-01-20
Applicant: Adobe Inc.
Inventor: Hrituraj Singh, Jatin Lamba, Denil Pareshbhai Mehta, Balaji Vasan Srinivasan, Anshul Nasery, Aishwarya Agarwal
IPC: G06F16/24, G06F16/242, G06F18/214, G06F18/22, G06F40/20, G06N3/045, G06N3/08, G06V30/40
Abstract: In some embodiments, a multimodal computing system receives a query and identifies, from source documents, text passages and images that are relevant to the query. The multimodal computing system accesses a multimodal question-answering model that includes a textual stream of language models and a visual stream of language models. Each of the textual stream and the visual stream contains a set of transformer-based models and each transformer-based model includes a cross-attention layer using data generated by both the textual stream and visual stream of language models as an input. The multimodal computing system identifies text relevant to the query by applying the textual stream to the text passages and computes, using the visual stream, relevance scores of the images to the query, respectively. The multimodal computing system further generates a response to the query by including the text and/or an image according to the relevance scores.
-
Publication No.: US20240428468A1
Publication Date: 2024-12-26
Application No.: US18337634
Filing Date: 2023-06-20
Applicant: Adobe Inc.
Inventor: Aishwarya Agarwal, Srikrishna Karanam, Joseph Koonthanam Jose, Apoorv Umang Saxena, Koustava Goswami, Balaji Vasan Srinivasan
IPC: G06T11/00, G06N3/0455
Abstract: The present disclosure relates to systems, methods, and non-transitory computer-readable media that utilize attention segregation loss and/or attention retention loss at inference time of a diffusion neural network to generate a text-conditioned image. In particular, in some embodiments, the disclosed systems utilize the attention segregation loss to reduce overlap between concepts by comparing attention maps for multiple concepts of a text query corresponding to a denoising step. Further, in some embodiments, the disclosed systems utilize the attention retention loss to improve information retention for concepts across denoising steps by comparing attention maps between different denoising steps. Accordingly, in some embodiments, by utilizing the attention segregation loss and the attention retention loss, the disclosed systems accurately maintain multiple concepts from a text query when generating a text-conditioned image.
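The two losses described in this abstract can be sketched on flattened attention maps. The abstract says only that maps are compared, so the comparison used here is an assumption: a soft intersection-over-union, penalised when it is high between different concepts at one denoising step (segregation) and when it is low for the same concept across steps (retention). The function names are hypothetical, and the step that would use these losses to adjust the diffusion latents at inference time is omitted.

```python
from itertools import combinations
from typing import Dict, List, Sequence

AttentionMaps = Dict[str, List[float]]  # concept token -> flattened attention map


def soft_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Soft intersection-over-union of two non-negative attention maps."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / union if union > 0 else 0.0


def segregation_loss(maps: AttentionMaps) -> float:
    """Mean overlap between every pair of concepts at one denoising step.
    Lower means the concepts attend to more distinct regions."""
    pairs = list(combinations(maps.values(), 2))
    if not pairs:
        return 0.0
    return sum(soft_iou(a, b) for a, b in pairs) / len(pairs)


def retention_loss(current: AttentionMaps, reference: AttentionMaps) -> float:
    """Mean (1 - overlap) between each concept's map at the current step and
    its map at a reference step. Lower means the concept kept its region."""
    return sum(1.0 - soft_iou(current[c], reference[c]) for c in current) / len(current)


# Toy 4-cell attention maps for a two-concept prompt (illustrative values only).
separated = {"cat": [1.0, 0.0, 0.0, 0.0], "dog": [0.0, 0.0, 1.0, 0.0]}
overlapping = {"cat": [1.0, 1.0, 0.0, 0.0], "dog": [0.0, 1.0, 1.0, 0.0]}
drifted = {"cat": [0.0, 1.0, 0.0, 0.0], "dog": [0.0, 0.0, 1.0, 0.0]}

loss_separated = segregation_loss(separated)
loss_overlapping = segregation_loss(overlapping)
loss_retained = retention_loss(separated, separated)
loss_drifted = retention_loss(drifted, separated)
```

Here the overlapping maps incur a segregation penalty while the separated ones do not, and the retention term rises only when a concept's attention moves away from where it was at the reference step.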