Patent search ap:("Google LLC") AND inv:"Vibhuti Mahajan" Page 1

1.

Patent application
Instance Level Scene Recognition with a Vision Language Model (Active)

Publication number: US20250140006A1

Publication date: 2025-05-01

Application number: US18620136

Filing date: 2024-03-28

Applicant: Google LLC

Inventor: Harshit Kharbanda, Boris Bluntschli, Vibhuti Mahajan, Louis Wang

IPC: G06V20/70, G06V10/764, G06V20/40

Abstract: Systems and methods for image understanding can include one or more object recognition systems and one or more vision language models to generate an augmented language output that can be both scene-aware and object-aware. The systems and methods can process an input image with an object recognition model to generate an object recognition output descriptive of identification details for an object depicted in the input image. The systems and methods can include processing the input image with a vision language model to generate a language output descriptive of a predicted scene description. The object recognition output can then be utilized to augment the language output to generate an augmented language output that includes the scene understanding of the language output with the specificity of the object recognition output.

2.

Granted patent
Instance level scene recognition with a vision language model (Active)

Publication number: US11978271B1

Publication date: 2024-05-07

Application number: US18496402

Filing date: 2023-10-27

Applicant: Google LLC

Inventor: Harshit Kharbanda, Boris Bluntschli, Vibhuti Mahajan, Louis Wang

IPC: G06V20/70, G06V10/764, G06V20/40

CPC classification number: G06V20/70, G06V10/764, G06V20/41

Abstract: Systems and methods for image understanding can include one or more object recognition systems and one or more vision language models to generate an augmented language output that can be both scene-aware and object-aware. The systems and methods can process an input image with an object recognition model to generate an object recognition output descriptive of identification details for an object depicted in the input image. The systems and methods can include processing the input image with a vision language model to generate a language output descriptive of a predicted scene description. The object recognition output can then be utilized to augment the language output to generate an augmented language output that includes the scene understanding of the language output with the specificity of the object recognition output.
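
Illustrative sketch: the abstract shared by both records describes a three-step flow (object recognition output, vision language model output, then augmentation of the second with the first). The Python below is only a rough illustration of that flow under stated assumptions, not the method claimed in the patents. The abstract does not say how the augmentation is performed, so the string substitution used here is a stand-in. Every name (`ObjectRecognition`, `recognize_objects`, `describe_scene`, `augment`, `understand_image`) is hypothetical, and both models are replaced by stubs that return canned values.

```python
import re
from dataclasses import dataclass


@dataclass
class ObjectRecognition:
    """Hypothetical object recognition output for one depicted object."""
    generic_label: str   # coarse class a scene description might use, e.g. "tower"
    instance_name: str   # specific identification, e.g. "the Eiffel Tower"


def recognize_objects(image: bytes) -> list[ObjectRecognition]:
    """Stub for an object recognition model; returns a canned result."""
    return [ObjectRecognition("tower", "the Eiffel Tower")]


def describe_scene(image: bytes) -> str:
    """Stub for a vision language model; returns a canned scene description."""
    return "People are walking near a tower at sunset."


def augment(language_output: str, recognitions: list[ObjectRecognition]) -> str:
    """Swap the first generic mention of each object for its specific name.

    Assumption: plain text substitution stands in for whatever augmentation
    the real system performs.
    """
    augmented = language_output
    for rec in recognitions:
        # Match an article plus the generic label, e.g. "a tower".
        pattern = rf"\b(?:a|an|the)\s+{re.escape(rec.generic_label)}\b"
        augmented, hits = re.subn(
            pattern,
            lambda _m, name=rec.instance_name: name,  # literal replacement
            augmented,
            count=1,
            flags=re.IGNORECASE,
        )
        if hits == 0:
            # The description never mentioned the object; append the detail.
            augmented += f" Identified: {rec.instance_name}."
    return augmented


def understand_image(image: bytes) -> str:
    """Run both stubs on the same image and merge their outputs."""
    return augment(describe_scene(image), recognize_objects(image))
```

Calling `understand_image(image_bytes)` returns a description that keeps the scene context from the language output while naming the specific object from the recognition output.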
