-
Publication No.: US20250140006A1
Publication Date: 2025-05-01
Application No.: US18620136
Filing Date: 2024-03-28
Applicant: Google LLC
Inventor: Harshit Kharbanda , Boris Bluntschli , Vibhuti Mahajan , Louis Wang
IPC: G06V20/70 , G06V10/764 , G06V20/40
Abstract: Systems and methods for image understanding can include one or more object recognition systems and one or more vision language models to generate an augmented language output that can be both scene-aware and object-aware. The systems and methods can process an input image with an object recognition model to generate an object recognition output descriptive of identification details for an object depicted in the input image. The systems and methods can include processing the input image with a vision language model to generate a language output descriptive of a predicted scene description. The object recognition output can then be utilized to augment the language output to generate an augmented language output that includes the scene understanding of the language output with the specificity of the object recognition output.
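The augmentation step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the object recognition output is merged into the language output, so the string-substitution approach below is an assumption made purely for illustration. The names `Recognition` and `augment_caption` are hypothetical, and the two model outputs are supplied as plain values rather than produced by real models.

```python
import re
from dataclasses import dataclass


@dataclass
class Recognition:
    """Hypothetical object recognition output for one detected object."""
    generic_term: str    # coarse category the caption is likely to use, e.g. "dog"
    specific_label: str  # fine-grained identification, e.g. "Shiba Inu"


def augment_caption(caption: str, recognitions: list[Recognition]) -> str:
    """Swap the first whole-word match of each generic term for its specific label.

    Assumption: augmentation is done by text substitution. The patent abstract
    only states that the recognition output is used to augment the language output.
    """
    for rec in recognitions:
        pattern = re.compile(rf"\b{re.escape(rec.generic_term)}\b", re.IGNORECASE)
        caption = pattern.sub(rec.specific_label, caption, count=1)
    return caption


# Stand-ins for the vision language model output and the recognition output.
scene_caption = "A dog sits on a bench in a park."
detections = [Recognition(generic_term="dog", specific_label="Shiba Inu")]

augmented = augment_caption(scene_caption, detections)
print(augmented)
```

The result keeps the scene description from the caption while carrying the specificity of the recognition label, which is the combination the abstract describes.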
-
Publication No.: US20250061146A1
Publication Date: 2025-02-20
Application No.: US18802734
Filing Date: 2024-08-13
Applicant: Google LLC
Inventor: Olivier Siegenthaler , Ágoston Weisz , Boris Bluntschli , Dan Banica , Kaan Ege Özgün , Daniel Mogoreanu , Filip Sladek
IPC: G06F16/532 , G06F40/40 , G06V10/80 , G06V20/50
Abstract: Implementations utilize an LLM to respond to queries comprising image data, such as multimodal queries that include both text and image data. A natural language processing system is extended such that when an image is provided, the natural language processing system invokes one or more auxiliary image processing models (e.g., visual query) and/or image search engines. The results of invoking such model(s) and/or search engine(s) are collected into structured data signals related to the image. These signals form part of the conversation context and are used to extend the text prompt that is sent to the LLM. This allows the LLM to take the context into account when processing the user query, thereby enabling generation of an LLM reply that addresses relevant feature(s) of the image.
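The prompt-extension flow in this abstract can be sketched as follows. This is an illustrative assumption, not the patented implementation: the `ImageSignal` type, the `build_prompt` function, the signal source names, and the prompt layout are all hypothetical, and the auxiliary model results are hard-coded rather than obtained from real models or search engines.

```python
from dataclasses import dataclass


@dataclass
class ImageSignal:
    """Hypothetical structured signal derived from the image."""
    source: str   # which auxiliary model or search engine produced it (assumed names)
    content: str  # textual result of that model or engine


def build_prompt(user_query: str, signals: list[ImageSignal]) -> str:
    """Extend the text prompt with image-derived signals before calling the LLM.

    Assumption: signals are serialized as a bulleted context block placed
    ahead of the user query. The abstract does not specify a layout.
    """
    lines = ["Image context:"]
    for signal in signals:
        lines.append(f"- [{signal.source}] {signal.content}")
    lines.append("")
    lines.append(f"User query: {user_query}")
    return "\n".join(lines)


# Hard-coded stand-ins for auxiliary model and image search results.
signals = [
    ImageSignal(source="object_detection", content="red bicycle"),
    ImageSignal(source="image_search", content="similar listing: vintage road bike"),
]

prompt = build_prompt("How much is this worth?", signals)
print(prompt)
```

Because the signals travel inside the prompt text, the LLM itself needs no image input in this sketch; it only sees the structured context alongside the user query.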
-
Publication No.: US11978271B1
Publication Date: 2024-05-07
Application No.: US18496402
Filing Date: 2023-10-27
Applicant: Google LLC
Inventor: Harshit Kharbanda , Boris Bluntschli , Vibhuti Mahajan , Louis Wang
IPC: G06V20/70 , G06V10/764 , G06V20/40
CPC classification number: G06V20/70 , G06V10/764 , G06V20/41
Abstract: Systems and methods for image understanding can include one or more object recognition systems and one or more vision language models to generate an augmented language output that can be both scene-aware and object-aware. The systems and methods can process an input image with an object recognition model to generate an object recognition output descriptive of identification details for an object depicted in the input image. The systems and methods can include processing the input image with a vision language model to generate a language output descriptive of a predicted scene description. The object recognition output can then be utilized to augment the language output to generate an augmented language output that includes the scene understanding of the language output with the specificity of the object recognition output.
-