OPEN-VOCABULARY OBJECT DETECTION IN IMAGES

    Publication No.: US20250148759A1

    Publication Date: 2025-05-08

    Application No.: US19014029

    Application Date: 2025-01-08

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for object detection. In one aspect, a method comprises: obtaining: (i) an image, and (ii) a set of one or more query embeddings, wherein each query embedding represents a respective category of object; processing the image and the set of query embeddings using an object detection neural network to generate object detection data for the image, comprising: processing the image using an image encoding subnetwork of the object detection neural network to generate a set of object embeddings; processing each object embedding using a localization subnetwork to generate localization data defining a corresponding region of the image; and processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork to generate, for each object embedding, a respective classification score distribution over the set of query embeddings.
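The abstract above describes three stages: an image encoder that yields object embeddings, a localization subnetwork that maps each embedding to a region, and a classification subnetwork that scores each object embedding against the query embeddings. The following is a minimal sketch of the last two stages only, not the patented implementation. The image encoder is stubbed out (object embeddings are passed in directly), and the cosine-similarity scoring, the `temperature` value, the linear-plus-sigmoid box head, and all function names are assumptions made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def localize(obj_emb, weights, bias):
    # Hypothetical localization head: one linear layer followed by a sigmoid,
    # producing four values in (0, 1), read here as (cx, cy, w, h).
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, obj_emb)) + b)))
        for row, b in zip(weights, bias)
    ]


def classify(object_embeddings, query_embeddings, temperature=0.1):
    # Assumed scoring rule: cosine similarity between each object embedding
    # and each query embedding, then a softmax over the set of queries.
    queries = [normalize(q) for q in query_embeddings]
    distributions = []
    for obj in object_embeddings:
        o = normalize(obj)
        logits = [sum(a * b for a, b in zip(o, q)) / temperature for q in queries]
        distributions.append(softmax(logits))
    return distributions


def detect(object_embeddings, query_embeddings, box_weights, box_bias):
    # Pair each object's box with its score distribution over the queries.
    boxes = [localize(o, box_weights, box_bias) for o in object_embeddings]
    scores = classify(object_embeddings, query_embeddings)
    return list(zip(boxes, scores))
```

Because the categories enter only as query embeddings, new categories can be added at inference time by supplying additional queries, without changing the box head or the encoder in this sketch.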

    GENERATING VIDEOS USING DIFFUSION MODELS
    Invention Publication

    Publication No.: US20240338936A1

    Publication Date: 2024-10-10

    Application No.: US18296938

    Application Date: 2023-04-06

    Applicant: Google LLC

    CPC classification number: G06V10/82 G06V10/771 H04N7/0117 H04N7/013

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an output video conditioned on an input. In one aspect, a method comprises receiving the input; initializing a current intermediate representation; generating an output video by updating the current intermediate representation at each of a plurality of iterations, wherein the updating comprises, at each iteration: processing an intermediate input for the iteration comprising the current intermediate representation using a diffusion model that is configured to process the intermediate input to generate a noise output; and updating the current intermediate representation using the noise output for the iteration.
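The iterative procedure in the abstract above (initialize an intermediate representation, query a diffusion model for a noise output, update the representation, repeat) can be sketched as a sampling loop. The abstract does not specify the update rule, so this sketch swaps in a deterministic DDIM-style update as an assumption. The linear `alpha_bar` schedule, the nested-list video layout (frames of pixel values), and `toy_noise_model` are hypothetical stand-ins; a trained network would replace the toy model.

```python
import math
import random


def alpha_bar(t, num_steps):
    # Assumed linear schedule: 1.0 at t=0 (clean) down to 0.01 at t=num_steps.
    return 1.0 - 0.99 * t / num_steps


def toy_noise_model(x, t, num_steps, conditioning):
    # Hypothetical stand-in for a trained diffusion model: returns the noise
    # that would explain x if the clean video were equal to `conditioning`.
    a = alpha_bar(t, num_steps)
    return [
        [(xi - math.sqrt(a) * ci) / math.sqrt(1.0 - a) for xi, ci in zip(frame, cond)]
        for frame, cond in zip(x, conditioning)
    ]


def generate_video(conditioning, noise_model, num_steps=10, seed=0):
    rng = random.Random(seed)
    # Initialize the current intermediate representation with Gaussian noise,
    # using the same frames-by-pixels shape as the conditioning input.
    x = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in conditioning]
    for t in range(num_steps, 0, -1):
        # Noise output of the model for this iteration.
        eps = noise_model(x, t, num_steps, conditioning)
        a_t = alpha_bar(t, num_steps)
        a_prev = alpha_bar(t - 1, num_steps)
        updated = []
        for frame, eps_frame in zip(x, eps):
            new_frame = []
            for xi, ei in zip(frame, eps_frame):
                # Estimate the clean value, then step to the previous noise level.
                x0 = (xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                new_frame.append(math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * ei)
            updated.append(new_frame)
        x = updated
    return x
```

With the toy model, the loop recovers the conditioning frames (up to floating-point error), which makes the update arithmetic easy to check; it says nothing about the quality of a real model.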

    Neural Networks based Multimodal Transformer for Multi-Task User Interface Modeling

    Publication No.: US20230031702A1

    Publication Date: 2023-02-02

    Application No.: US17812208

    Application Date: 2022-07-13

    Applicant: Google LLC

    Abstract: A method includes receiving, via a computing device, a screenshot of a display provided by a graphical user interface of the computing device. The method also includes generating, by an image-structure transformer of a neural network, a representation by fusing a first embedding based on the screenshot and a second embedding based on a layout of virtual objects in the screenshot. The method additionally includes predicting, by the neural network and based on the generated representation, a modeling task output associated with the graphical user interface. The method further includes providing, by the computing device, the predicted modeling task output.
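The fusion step in the abstract above (combining a screenshot-based embedding with a layout-based embedding, then predicting a task output) can be illustrated with a toy example. This is a rough sketch under strong assumptions, not the patented architecture: fusion is modeled as one single-head self-attention pass over the concatenated token sequence with identity query/key/value projections, followed by mean pooling and a linear task head. All names (`fuse`, `predict_task`, `head_weights`) are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head attention with identity Q/K/V projections (an assumption
    # that keeps the sketch short; a real transformer learns these).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def fuse(image_tokens, layout_tokens):
    # Attending over the joint sequence lets image tokens and layout tokens
    # exchange information; the result is mean-pooled into one vector.
    fused = self_attention(image_tokens + layout_tokens)
    dim = len(fused[0])
    return [sum(tok[j] for tok in fused) / len(fused) for j in range(dim)]


def predict_task(representation, head_weights):
    # Hypothetical task head: linear layer plus softmax over task outputs.
    logits = [sum(w * x for w, x in zip(row, representation)) for row in head_weights]
    return softmax(logits)
```

In a multi-task setting, one would keep `fuse` shared and supply a different `head_weights` matrix per task; that split is a design choice of this sketch rather than something the abstract states.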
