VISUAL GROUNDING OF SELF-SUPERVISED REPRESENTATIONS FOR MACHINE LEARNING MODELS UTILIZING DIFFERENCE ATTENTION

    Publication No.: US20240420447A1

    Publication Date: 2024-12-19

    Application No.: US18336423

    Filing Date: 2023-06-16

    Applicant: Adobe Inc.

    Abstract: The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing difference attention to evaluate and/or train machine learning models. In particular, in some embodiments, the disclosed systems generate, utilizing a machine learning model, a first feature vector from a digital image. In one or more implementations, the disclosed systems generate a masked digital image by masking a region from the digital image. Additionally, in some embodiments, the disclosed systems generate, utilizing the machine learning model, a second feature vector from the masked digital image. Moreover, in some implementations, the disclosed systems determine a difference feature vector between the first feature vector and the second feature vector. Furthermore, in some embodiments, the disclosed systems generate, from the difference feature vector, a difference attention map reflecting a visual grounding of the machine learning model relative to the region.
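
The steps in this abstract (encode the image, mask a region, encode again, take the difference, derive a map) can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy encoder `channel_maps`/`encode`, the zero-fill masking, and the CAM-style weighting used to turn the difference vector into a map are all assumptions, since the abstract does not say how the map is derived from the difference feature vector.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), half-open


def channel_maps(img: Image) -> List[Image]:
    """Hypothetical stand-in for an encoder's spatial feature maps."""
    h, w = len(img), len(img[0])
    intensity = [row[:] for row in img]
    # Absolute horizontal gradient; the last column has no right neighbour.
    edges = [
        [abs(img[y][x + 1] - img[y][x]) if x + 1 < w else 0.0 for x in range(w)]
        for y in range(h)
    ]
    return [intensity, edges]


def encode(img: Image) -> List[float]:
    """Feature vector = global average pool of each channel map."""
    n = len(img) * len(img[0])
    return [sum(sum(row) for row in m) / n for m in channel_maps(img)]


def mask_region(img: Image, box: Box) -> Image:
    """Return a copy of the image with the boxed region set to zero."""
    top, left, bottom, right = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 0.0
    return out


def difference_attention(img: Image, box: Box):
    """Return (difference feature vector, difference attention map)."""
    first = encode(img)                      # features of the full image
    second = encode(mask_region(img, box))   # features of the masked image
    diff = [a - b for a, b in zip(first, second)]
    # Assumed CAM-style step: weight each channel map by its difference
    # component, sum over channels, and clamp negatives to zero.
    maps = channel_maps(img)
    h, w = len(img), len(img[0])
    att = [
        [max(0.0, sum(d * m[y][x] for d, m in zip(diff, maps))) for x in range(w)]
        for y in range(h)
    ]
    return diff, att


if __name__ == "__main__":
    image = [[1.0, 1.0, 0.0, 0.0],
             [1.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
    diff, att = difference_attention(image, (0, 0, 2, 2))
    print(diff)
    for row in att:
        print(row)
```

With this toy encoder, masking a region the features depend on yields a non-zero difference vector and a map concentrated on that region, while masking an empty region yields an all-zero map.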

    MODALITY ADAPTIVE INFORMATION RETRIEVAL

    Publication No.: US20220230061A1

    Publication Date: 2022-07-21

    Application No.: US17153130

    Filing Date: 2021-01-20

    Applicant: Adobe Inc.

    Abstract: In some embodiments, a multimodal computing system receives a query and identifies, from source documents, text passages and images that are relevant to the query. The multimodal computing system accesses a multimodal question-answering model that includes a textual stream of language models and a visual stream of language models. Each of the textual stream and the visual stream contains a set of transformer-based models and each transformer-based model includes a cross-attention layer using data generated by both the textual stream and visual stream of language models as an input. The multimodal computing system identifies text relevant to the query by applying the textual stream to the text passages and computes, using the visual stream, relevance scores of the images to the query, respectively. The multimodal computing system further generates a response to the query by including the text and/or an image according to the relevance scores.
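
The final step of this abstract, composing a response from the text and/or an image according to relevance scores, might look like the sketch below. The `Response` type, the `build_response` helper, and the fixed `threshold` are hypothetical; the abstract does not state how the scores decide whether an image is included.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Response:
    text: str
    image_id: Optional[str]  # None when no image is relevant enough


def build_response(answer_text: str,
                   image_scores: Dict[str, float],
                   threshold: float = 0.5) -> Response:
    """Attach the highest-scoring image only if it clears the threshold.

    `threshold` is an illustrative assumption, not a value from the source.
    """
    if not image_scores:
        return Response(answer_text, None)
    best_id = max(image_scores, key=image_scores.get)
    if image_scores[best_id] >= threshold:
        return Response(answer_text, best_id)
    return Response(answer_text, None)


if __name__ == "__main__":
    scores = {"fig1.png": 0.82, "fig2.png": 0.31}
    print(build_response("Select the layer, then apply the mask.", scores))
```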

    Modality adaptive information retrieval

    Publication No.: US12198048B2

    Publication Date: 2025-01-14

    Application No.: US17153130

    Filing Date: 2021-01-20

    Applicant: Adobe Inc.

    Abstract: In some embodiments, a multimodal computing system receives a query and identifies, from source documents, text passages and images that are relevant to the query. The multimodal computing system accesses a multimodal question-answering model that includes a textual stream of language models and a visual stream of language models. Each of the textual stream and the visual stream contains a set of transformer-based models and each transformer-based model includes a cross-attention layer using data generated by both the textual stream and visual stream of language models as an input. The multimodal computing system identifies text relevant to the query by applying the textual stream to the text passages and computes, using the visual stream, relevance scores of the images to the query, respectively. The multimodal computing system further generates a response to the query by including the text and/or an image according to the relevance scores.

    TEXT-TO-IMAGE SYNTHESIS UTILIZING DIFFUSION MODELS WITH TEST-TIME ATTENTION SEGREGATION AND RETENTION OPTIMIZATION

    Publication No.: US20240428468A1

    Publication Date: 2024-12-26

    Application No.: US18337634

    Filing Date: 2023-06-20

    Applicant: Adobe Inc.

    Abstract: The present disclosure relates to systems, methods, and non-transitory computer-readable media that utilize attention segregation loss and/or attention retention loss at inference time of a diffusion neural network to generate a text-conditioned image. In particular, in some embodiments, the disclosed systems utilize the attention segregation loss to reduce overlap between concepts by comparing attention maps for multiple concepts of a text query corresponding to a denoising step. Further, in some embodiments, the disclosed systems utilize the attention retention loss to improve information retention for concepts across denoising steps by comparing attention maps between different denoising steps. Accordingly, in some embodiments, by utilizing the attention segregation loss and the attention retention loss, the disclosed systems accurately maintain multiple concepts from a text query when generating a text-conditioned image.
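
The two losses named in this abstract can be illustrated with a small sketch. The abstract only says that the segregation loss compares attention maps of different concepts within one denoising step and that the retention loss compares maps of the same concept across steps; the soft intersection-over-union measure and the function names below are assumptions chosen for illustration. Attention maps are flattened to lists of non-negative floats, and the gradient step that would update the latent at inference time is omitted.

```python
from itertools import combinations
from typing import List, Sequence

AttentionMap = Sequence[float]  # flattened, non-negative attention values


def soft_iou(a: AttentionMap, b: AttentionMap) -> float:
    """Soft intersection-over-union of two attention maps (assumed measure)."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / union if union > 0 else 0.0


def segregation_loss(maps: List[AttentionMap]) -> float:
    """Mean pairwise overlap between concept maps at one denoising step.

    Lower values mean the concepts attend to more distinct regions.
    """
    pairs = list(combinations(maps, 2))
    if not pairs:
        return 0.0
    return sum(soft_iou(a, b) for a, b in pairs) / len(pairs)


def retention_loss(prev_maps: List[AttentionMap],
                   curr_maps: List[AttentionMap]) -> float:
    """Mean (1 - overlap) between each concept's maps at two denoising steps.

    Lower values mean each concept keeps attending to the same region.
    """
    if not curr_maps:
        return 0.0
    total = sum(1.0 - soft_iou(p, c) for p, c in zip(prev_maps, curr_maps))
    return total / len(curr_maps)


if __name__ == "__main__":
    cat = [1.0, 1.0, 0.0, 0.0]
    dog = [0.0, 1.0, 1.0, 0.0]
    print(segregation_loss([cat, dog]))        # partial overlap between concepts
    print(retention_loss([cat, dog], [cat, dog]))  # unchanged maps across steps
```

In this formulation, minimizing the first loss pushes concept maps apart within a step, and minimizing the second keeps each concept's map stable from one step to the next.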
