Multi-Axis Vision Transformer
    Invention Application

    Publication No.: US20250022269A1

    Publication Date: 2025-01-16

    Application No.: US18902546

    Filing Date: 2024-09-30

    Applicant: Google LLC

    Abstract: Provided is an efficient and scalable attention model that can be referred to as multi-axis attention. Example implementations can include two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. The present disclosure also presents a new architectural element by effectively blending the proposed multi-axis attention model with convolutions. In addition, the present disclosure proposes a simple hierarchical vision backbone, example implementations of which can be referred to as MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages.
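The blocked-local and dilated-global axes described above come down to two tensor partitions; attention itself is omitted here. A minimal numpy sketch, assuming a square feature map whose sides divide evenly by the window size `p` and grid size `g` (both hypothetical parameters for illustration):

```python
import numpy as np

def block_partition(x, p):
    """Split (H, W, C) features into non-overlapping p x p windows.

    Attention within each window gives the 'blocked local' axis.
    """
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    # -> (num_windows, p*p, C): each slice along axis 0 is one local window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Split (H, W, C) features into a g x g dilated grid.

    Each group gathers one pixel from every tile at stride H//g, so
    attention within a group spans the whole map ('dilated global' axis).
    """
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    # -> (num_groups, g*g, C): axis 1 covers the full image sparsely
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)
```

Because the window and grid sizes are fixed while the number of windows/groups grows with the input, attention cost stays linear in the number of pixels, which is the resolution-scalability claim in the abstract.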

    Compression-Informed Video Super-Resolution
    Invention Publication

    Publication No.: US20240022760A1

    Publication Date: 2024-01-18

    Application No.: US18256837

    Filing Date: 2021-08-05

    Applicant: Google LLC

    Abstract: Example aspects of the present disclosure are directed to systems and methods which feature a machine-learned video super-resolution (VSR) model which has been trained using a bi-directional training approach. In particular, the present disclosure provides a compression-informed (e.g., compression-aware) super-resolution model that can perform well on real-world videos with different levels of compression. Specifically, example models described herein can include three modules to robustly restore the missing information caused by video compression. First, a bi-directional recurrent module can be used to reduce the accumulated warping error from the random locations of the intra-frame from compressed video frames. Second, a detail-aware flow estimation module can be added to enable recovery of high resolution (HR) flow from compressed low resolution (LR) frames. Finally, a Laplacian enhancement module can add high-frequency information to the warped HR frames washed out by video encoding.
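The Laplacian enhancement step admits a simple illustration: subtract a blurred copy of the frame to isolate its high-frequency band, then add that band back with a gain. A hedged numpy sketch using a plain box blur in place of the patent's learned module (`box_blur` and `alpha` are illustrative, not from the disclosure):

```python
import numpy as np

def box_blur(img, k=3):
    """Simple k x k box filter used to split an image into low/high bands."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def laplacian_enhance(warped_hr, alpha=1.0):
    """Re-inject high-frequency detail washed out by video encoding.

    The Laplacian band (image minus its blur) approximates the detail
    lost to compression; scaling it by alpha and adding it back
    sharpens the warped HR frame.
    """
    high_freq = warped_hr - box_blur(warped_hr)
    return warped_hr + alpha * high_freq
```

On flat regions the band is zero and the frame passes through unchanged; around edges the band is signed, so edges are pushed apart (sharpened).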

    Machine Learning Models Featuring Resolution-Flexible Multi-Axis Attention Blocks

    Publication No.: US20250069382A1

    Publication Date: 2025-02-27

    Application No.: US18726881

    Filing Date: 2023-01-05

    Applicant: Google LLC

    Abstract: Provided are machine learning systems and models featuring resolution-flexible multi-axis attention blocks. In particular, the present disclosure provides example multi-axis MLP based architectures (example implementations of which can be generally referred to as MAXIM) that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. In some implementations, MAXIM can use a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, some example implementations of MAXIM can contain two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature mutual conditioning.
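The spatially-gated MLP idea can be sketched in isolation: split the channels in half, mix one half across the token axis, and use the result as a multiplicative gate on the other half. A minimal numpy sketch assuming a flat (tokens, channels) layout; `w_spatial` stands in for a learned token-mixing matrix and is purely illustrative:

```python
import numpy as np

def spatial_gating(x, w_spatial, b_spatial=1.0):
    """Spatial gating unit in the style of a gated MLP.

    x: (tokens, channels) with an even channel count. Half the channels
    (u) pass through; the other half (v) are mixed across the token
    axis and used as a multiplicative gate, giving long-range spatial
    interaction without attention.
    """
    u, v = np.split(x, 2, axis=-1)
    # Token mixing along the spatial axis; initializing w_spatial near
    # zero with bias 1 makes the unit start close to identity.
    v = w_spatial @ v + b_spatial
    return u * v
```

With `w_spatial = 0` the gate is the constant bias, so the unit reduces to a plain channel split — the near-identity initialization commonly used for gated MLPs.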

    Memory-Guided Video Object Detection

    Publication No.: US20220189170A1

    Publication Date: 2022-06-16

    Application No.: US17432221

    Filing Date: 2019-02-22

    Applicant: Google LLC

Abstract: Systems and methods for detecting objects in a video are provided. A method can include inputting a video comprising a plurality of frames into an interleaved object detection model comprising a plurality of feature extractor networks and a shared memory layer. For each of one or more frames, the method can include selecting one of the plurality of feature extractor networks to analyze the one or more frames, analyzing the one or more frames by the selected feature extractor network to determine one or more features of the one or more frames, determining an updated set of features based at least in part on the one or more features and one or more previously extracted features extracted from a previous frame stored in the shared memory layer, and detecting an object in the one or more frames based at least in part on the updated set of features.
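The interleaving-plus-shared-memory control flow can be sketched independently of any particular networks. A minimal Python sketch in which `heavy`, `light`, `period`, and the EMA-style fusion are illustrative stand-ins for the patent's feature extractor networks and memory layer:

```python
def run_interleaved_detector(frames, heavy, light, period=3, mix=0.5):
    """Interleave two feature extractors over a frame stream.

    A heavy (accurate) extractor runs every `period` frames; a light
    (fast) extractor covers the rest. A shared memory slot carries the
    last fused features forward, and each step fuses new features with
    memory before detection would run on them.
    """
    memory = None
    fused_per_frame = []
    for i, frame in enumerate(frames):
        extractor = heavy if i % period == 0 else light
        feats = extractor(frame)
        if memory is None:
            fused = feats
        else:
            # Exponential blend of fresh features with remembered ones.
            fused = [mix * f + (1 - mix) * m for f, m in zip(feats, memory)]
        memory = fused
        fused_per_frame.append(fused)
    return fused_per_frame
```

The memory blend lets the cheap extractor's weaker features lean on the last strong ones, which is the intuition behind running the accurate network only intermittently.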

    Pose empowered RGB-flow net
    Invention Grant

    Publication No.: US11776156B2

    Publication Date: 2023-10-03

    Application No.: US17303969

    Filing Date: 2021-06-11

    Applicant: Google LLC

    Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.
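One simple way to combine the three streams, not necessarily the one claimed, is late fusion: average the per-stream class probabilities and take the top class. A minimal numpy sketch with hypothetical per-stream logits; the networks that would produce them are not shown:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify_activity(spatial_logits, temporal_logits, pose_logits):
    """Late-fuse the spatial, temporal, and pose streams.

    Each stream votes with a probability distribution over activity
    classes; the average distribution decides the final label.
    """
    probs = (softmax(spatial_logits)
             + softmax(temporal_logits)
             + softmax(pose_logits)) / 3.0
    return int(np.argmax(probs)), probs
```

Averaging probabilities rather than raw logits keeps a single overconfident stream from dominating the vote.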

    Pose Empowered RGB-Flow Net
    Invention Application

    Publication No.: US20210390733A1

    Publication Date: 2021-12-16

    Application No.: US17303969

    Filing Date: 2021-06-11

    Applicant: Google LLC

    Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.

    Machine-Learned Models for Imperceptible Message Watermarking in Videos

    Publication No.: US20240020788A1

    Publication Date: 2024-01-18

    Application No.: US18256783

    Filing Date: 2021-03-24

    Applicant: Google LLC

    CPC classification number: G06T1/0085 G06T2201/0083

    Abstract: Systems and methods of the present disclosure are directed to a computing system. The computing system can obtain a message vector and video data comprising a plurality of video frames. The computing system can process the input video with a transformation portion of a machine-learned watermark encoding model to obtain a three-dimensional feature encoding of the input video. The computing system can process the three-dimensional feature encoding of the input video and the message vector with an embedding portion of the machine-learned watermark encoding model to obtain spatial-temporal watermark encoding data descriptive of the message vector. The computing system can generate encoded video data comprising a plurality of encoded video frames, wherein at least one of the plurality of encoded video frames includes the spatial-temporal watermark encoding data.
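As a classical stand-in for the learned encoder/decoder pair, a spread-spectrum scheme exhibits the same contract: a message vector becomes a low-amplitude spatial-temporal residual added to the frames, and the bits are recovered by correlation. A hedged numpy sketch; every name and parameter here is illustrative, not from the disclosure:

```python
import numpy as np

def embed_message(frames, message_bits, strength=0.01, seed=0):
    """Embed bits as a faint pseudo-random residual across all frames.

    Each bit owns a pseudo-random spatial-temporal pattern; the signed
    sum of patterns is added at low amplitude, so the video is visually
    unchanged but the bits remain statistically detectable.
    """
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((len(message_bits),) + frames.shape)
    signs = np.where(np.asarray(message_bits) > 0, 1.0, -1.0)
    residual = np.tensordot(signs, patterns, axes=1)
    return frames + strength * residual

def decode_message(encoded, original, n_bits, seed=0):
    """Recover bits by correlating the residual with each pattern."""
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((n_bits,) + encoded.shape)
    residual = encoded - original
    corr = np.tensordot(patterns, residual, axes=residual.ndim)
    return (corr > 0).astype(int).tolist()
```

This decoder needs the original video (non-blind), whereas the patent's learned decoder would not; the sketch only illustrates the embed-then-recover round trip.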

    Pose Empowered RGB-Flow Net
    Invention Publication

    Publication No.: US20230419538A1

    Publication Date: 2023-12-28

    Application No.: US18464912

    Filing Date: 2023-09-11

    Applicant: Google LLC

    Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.
