-
Publication No.: US20250022269A1
Publication Date: 2025-01-16
Application No.: US18902546
Filing Date: 2024-09-30
Applicant: Google LLC
Inventor: Yinxiao Li, Feng Yang, Peyman Milanfar, Han Zhang, Zhengzhong Tu, Hossein Talebi
Abstract: Provided is an efficient and scalable attention model that can be referred to as multi-axis attention. Example implementations can include two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. The present disclosure also presents a new architectural element by effectively blending the proposed multi-axis attention model with convolutions. In addition, the present disclosure proposes a simple hierarchical vision backbone, example implementations of which can be referred to as MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages.
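The abstract above rests on two partitions of the feature map: non-overlapping local windows (blocked local attention) and a dilated global grid (dilated global attention), each attended to with only linear cost in the number of pixels. The NumPy sketch below illustrates just those two partitions plus plain single-head self-attention; the window/grid size `p` and the attention details are illustrative assumptions, not the patented MaxViT implementation.

```python
# Minimal sketch of the two partitions behind multi-axis attention:
# "block" attention mixes tokens inside each local window, while "grid"
# attention mixes tokens sampled on a dilated global grid spanning the
# whole image. All sizes and the attention form are illustrative.
import numpy as np

def block_partition(x, p):
    """(H, W, C) -> (num_windows, p*p, C): non-overlapping local windows."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def grid_partition(x, p):
    """(H, W, C) -> (num_groups, p*p, C): each group is a p x p set of
    tokens strided across the full image, giving sparse global mixing."""
    h, w, c = x.shape
    x = x.reshape(p, h // p, p, w // p, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, p * p, c)

def self_attention(tokens):
    """Plain single-head self-attention within each group of tokens."""
    q = k = v = tokens                                    # (groups, n, c)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

x = np.random.rand(16, 16, 8).astype(np.float32)
local_out = self_attention(block_partition(x, 4))   # local interaction
global_out = self_attention(grid_partition(x, 4))   # dilated global interaction
print(local_out.shape, global_out.shape)            # 16 groups of 16 tokens each
```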
-
Publication No.: US20240022760A1
Publication Date: 2024-01-18
Application No.: US18256837
Filing Date: 2021-08-05
Applicant: Google LLC
Inventor: Yinxiao Li, Peyman Milanfar, Feng Yang, Ce Liu, Ming-Hsuan Yang, Pengchong Jin
IPC: H04N19/59, G06T3/00, H04N19/117, G06V10/74, H04N19/503, H04N19/70, H04N19/80
CPC classification number: H04N19/59, G06T3/0093, H04N19/117, G06V10/761, H04N19/503, H04N19/70, H04N19/80
Abstract: Example aspects of the present disclosure are directed to systems and methods which feature a machine-learned video super-resolution (VSR) model which has been trained using a bi-directional training approach. In particular, the present disclosure provides a compression-informed (e.g., compression-aware) super-resolution model that can perform well on real-world videos with different levels of compression. Specifically, example models described herein can include three modules to robustly restore the missing information caused by video compression. First, a bi-directional recurrent module can be used to reduce the accumulated warping error from the random locations of the intra-frame from compressed video frames. Second, a detail-aware flow estimation module can be added to enable recovery of high resolution (HR) flow from compressed low resolution (LR) frames. Finally, a Laplacian enhancement module can add high-frequency information to the warped HR frames washed out by video encoding.
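Of the three modules listed above, the Laplacian enhancement module is the simplest to picture: it re-injects high-frequency detail into warped HR frames that video encoding has washed out. The sketch below is a minimal stand-in using a fixed Laplacian kernel and a hand-picked gain `alpha`; the actual module is learned and sits inside the bi-directional recurrent VSR network.

```python
# Minimal sketch of Laplacian-style enhancement: compute an edge
# (high-frequency) residual of a smoothed frame and add it back to
# sharpen. Kernel and gain are illustrative, not the learned module.
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float32)

def conv2d_same(img, kernel):
    """Naive 'same' 2-D convolution for a single-channel image."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def laplacian_enhance(warped_hr, alpha=0.5):
    """Add back an edge residual to a frame washed out by compression."""
    residual = conv2d_same(warped_hr, LAPLACIAN)
    return warped_hr - alpha * residual   # unsharp-style sharpening

frame = np.random.rand(64, 64).astype(np.float32)
enhanced = laplacian_enhance(frame)
print(enhanced.shape)  # (64, 64)
```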
-
Publication No.: US20250069382A1
Publication Date: 2025-02-27
Application No.: US18726881
Filing Date: 2023-01-05
Applicant: Google LLC
Inventor: Yinxiao Li, Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar
IPC: G06V10/82, G06V10/764, G06V10/77
Abstract: Provided are machine learning systems and models featuring resolution-flexible multi-axis attention blocks. In particular, the present disclosure provides example multi-axis MLP based architectures (example implementations of which can be generally referred to as MAXIM) that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. In some implementations, MAXIM can use a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, some example implementations of MAXIM can contain two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature mutual conditioning.
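The central MLP ingredient named in the abstract is a spatially-gated MLP applied along a local (block) axis and a global (grid) axis. The sketch below shows only the gating mechanism on the block axis, with random placeholder weights; real MAXIM blocks add normalization, dense channel projections, the grid-axis branch, and the cross-gating block, none of which are reproduced here.

```python
# Minimal sketch of a spatially-gated MLP: split channels in half, mix
# one half along the token (spatial) axis, and use it to gate the other
# half. Weights here are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)

def block_tokens(x, p):
    """(H, W, C) -> (windows, p*p, C): local windows for the 'block' axis."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def spatial_gating(tokens, w_spatial):
    """tokens: (groups, n, c). Gate half the channels with a spatial mix
    of the other half, in the spirit of a gated MLP block."""
    u, v = np.split(tokens, 2, axis=-1)           # (groups, n, c/2) each
    v = np.einsum("mn,gnc->gmc", w_spatial, v)    # mix along the token axis
    return u * v                                  # element-wise gate

x = rng.standard_normal((16, 16, 8), dtype=np.float32)
tokens = block_tokens(x, 4)                        # (16, 16, 8)
w = rng.standard_normal((16, 16), dtype=np.float32) / 4.0
gated = spatial_gating(tokens, w)                  # (16, 16, 4)
print(gated.shape)
```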
-
Publication No.: US20220189170A1
Publication Date: 2022-06-16
Application No.: US17432221
Filing Date: 2019-02-22
Applicant: Google LLC
Inventor: Menglong Zhu, Mason Liu, Marie Charisse White, Dmitry Kalenichenko, Yinxiao Li
IPC: G06V20/40, G06V10/70, G06V10/80, G06V10/82, G06V10/94, G06V10/776, G06V10/774
Abstract: Systems and methods for detecting objects in a video are provided. A method can include inputting a video comprising a plurality of frames into an interleaved object detection model comprising a plurality of feature extractor networks and a shared memory layer. For each of one or more frames, the operations can include selecting one of the plurality of feature extractor networks to analyze the one or more frames, analyzing the one or more frames by the selected feature extractor network to determine one or more features of the one or more frames, determining an updated set of features based at least in part on the one or more features and one or more previously extracted features extracted from a previous frame stored in the shared memory layer, and detecting an object in the one or more frames based at least in part on the updated set of features.
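The abstract describes an interleaving policy over several feature extractors plus a shared memory that fuses each frame's features with features retained from earlier frames. The sketch below uses a fixed every-N-frames schedule, trivial stand-in extractors and detector, and a convex-combination memory update; all of these are illustrative assumptions, not the claimed model.

```python
# Minimal sketch of interleaved feature extraction with a shared memory:
# a heavy extractor runs periodically, a light one on the remaining
# frames, and each frame's features are blended with stored features
# before detection. Extractors, schedule, and fusion rule are stand-ins.
import numpy as np

def heavy_extractor(frame):
    return frame.mean(axis=(0, 1))             # stand-in for a large network

def light_extractor(frame):
    return frame[::4, ::4].mean(axis=(0, 1))   # stand-in for a small network

def detect(features):
    return float(features.sum()) > 0.0         # stand-in for a detection head

def run_interleaved(frames, period=4, mix=0.5):
    memory = None
    detections = []
    for t, frame in enumerate(frames):
        extractor = heavy_extractor if t % period == 0 else light_extractor
        feats = extractor(frame)
        # Shared-memory update: blend current features with stored ones.
        memory = feats if memory is None else mix * feats + (1 - mix) * memory
        detections.append(detect(memory))
    return detections

video = np.random.rand(8, 32, 32, 3).astype(np.float32)
print(run_interleaved(video))
```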
-
Publication No.: US11776156B2
Publication Date: 2023-10-03
Application No.: US17303969
Filing Date: 2021-06-11
Applicant: Google LLC
Inventor: Yinxiao Li, Zhichao Lu, Xuehan Xiong, Jonathan Huang
IPC: G06T7/73
CPC classification number: G06T7/73, G06T2207/10016, G06T2207/20081, G06T2207/20084, G06T2207/30196
Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.
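A rough way to picture the three input streams is as three parallel tensors scored separately and fused late. In the sketch below, frame differences stand in for the temporal (motion) stream, a keypoint array stands in for the pose stream, and each per-stream "network" is just pooling plus a random linear layer; the fusion-by-averaging rule is likewise an assumption rather than the claimed method.

```python
# Minimal sketch of three-stream activity classification: spatial frames,
# frame differences as a crude motion stream, and a keypoint-based pose
# stream, each scored by a placeholder network and fused by averaging.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5

def stream_logits(stream, w):
    """Stand-in per-stream network: average over time, then linear."""
    pooled = stream.reshape(stream.shape[0], -1).mean(axis=0)   # (features,)
    return pooled @ w                                           # (classes,)

frames = rng.random((16, 32, 32), dtype=np.float32)   # spatial input stream
motion = np.diff(frames, axis=0)                       # temporal input stream
pose = rng.random((16, 17, 2), dtype=np.float32)       # pose input stream

w_spatial = rng.standard_normal((32 * 32, NUM_CLASSES)).astype(np.float32)
w_motion = rng.standard_normal((32 * 32, NUM_CLASSES)).astype(np.float32)
w_pose = rng.standard_normal((17 * 2, NUM_CLASSES)).astype(np.float32)

logits = (stream_logits(frames, w_spatial)
          + stream_logits(motion, w_motion)
          + stream_logits(pose, w_pose)) / 3.0
print("predicted class:", int(np.argmax(logits)))
```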
-
Publication No.: US20210390733A1
Publication Date: 2021-12-16
Application No.: US17303969
Filing Date: 2021-06-11
Applicant: Google LLC
Inventor: Yinxiao Li, Zhichao Lu, Xuehan Xiong, Jonathan Huang
IPC: G06T7/73
Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.
-
Publication No.: US20240212347A1
Publication Date: 2024-06-27
Application No.: US18603946
Filing Date: 2024-03-13
Applicant: Google LLC
Inventor: Dmitry Kalenichenko, Menglong Zhu, Marie Charisse White, Mason Liu, Yinxiao Li
IPC: G06V20/40, G06V10/70, G06V10/774, G06V10/776, G06V10/80, G06V10/82, G06V10/94
CPC classification number: G06V20/40, G06V10/774, G06V10/776, G06V10/806, G06V10/82, G06V10/87, G06V10/955, G06V20/46
Abstract: Systems and methods for detecting objects in a video are provided. A method can include inputting a video comprising a plurality of frames into an interleaved object detection model comprising a plurality of feature extractor networks and a shared memory layer. For each of one or more frames, the operations can include selecting one of the plurality of feature extractor networks to analyze the one or more frames, analyzing the one or more frames by the selected feature extractor network to determine one or more features of the one or more frames, determining an updated set of features based at least in part on the one or more features and one or more previously extracted features extracted from a previous frame stored in the shared memory layer, and detecting an object in the one or more frames based at least in part on the updated set of features.
-
Publication No.: US11961298B2
Publication Date: 2024-04-16
Application No.: US17432221
Filing Date: 2019-02-22
Applicant: Google LLC
Inventor: Menglong Zhu, Mason Liu, Marie Charisse White, Dmitry Kalenichenko, Yinxiao Li
IPC: G06V10/00, G06V10/70, G06V10/774, G06V10/776, G06V10/80, G06V10/82, G06V10/94, G06V20/40
CPC classification number: G06V20/40, G06V10/774, G06V10/776, G06V10/806, G06V10/82, G06V10/87, G06V10/955, G06V20/46
Abstract: Systems and methods for detecting objects in a video are provided. A method can include inputting a video comprising a plurality of frames into an interleaved object detection model comprising a plurality of feature extractor networks and a shared memory layer. For each of one or more frames, the operations can include selecting one of the plurality of feature extractor networks to analyze the one or more frames, analyzing the one or more frames by the selected feature extractor network to determine one or more features of the one or more frames, determining an updated set of features based at least in part on the one or more features and one or more previously extracted features extracted from a previous frame stored in the shared memory layer, and detecting an object in the one or more frames based at least in part on the updated set of features.
-
Publication No.: US20240020788A1
Publication Date: 2024-01-18
Application No.: US18256783
Filing Date: 2021-03-24
Applicant: Google LLC
Inventor: Xiyang Luo, Feng Yang, Ce Liu, Huiwen Chang, Peyman Milanfar, Yinxiao Li
IPC: G06T1/00
CPC classification number: G06T1/0085, G06T2201/0083
Abstract: Systems and methods of the present disclosure are directed to a computing system. The computing system can obtain a message vector and video data comprising a plurality of video frames. The computing system can process the input video with a transformation portion of a machine-learned watermark encoding model to obtain a three-dimensional feature encoding of the input video. The computing system can process the three-dimensional feature encoding of the input video and the message vector with an embedding portion of the machine-learned watermark encoding model to obtain spatial-temporal watermark encoding data descriptive of the message vector. The computing system can generate encoded video data comprising a plurality of encoded video frames, wherein at least one of the plurality of encoded video frames includes the spatial-temporal watermark encoding data.
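The encoder described above maps a message vector to spatial-temporal watermark data embedded in the frames. As a classical stand-in for that learned embedding, the sketch below spreads each message bit over a pseudo-random carrier pattern spanning all frames and adds the weighted sum at low amplitude; decoding correlates the residual against each carrier. The carriers, the embedding strength, and the non-blind decoding step are illustrative assumptions, not the claimed model.

```python
# Minimal spread-spectrum sketch of embedding a message vector as a
# low-amplitude spatial-temporal pattern across video frames, then
# recovering the bits by correlating against the carrier patterns.
import numpy as np

rng = np.random.default_rng(0)

def embed_message(video, message, strength=0.01):
    """video: (T, H, W); message: (B,) of +/-1 bits."""
    carriers = rng.standard_normal((len(message),) + video.shape).astype(np.float32)
    pattern = np.tensordot(message.astype(np.float32), carriers, axes=1)
    return video + strength * pattern, carriers

def decode_message(encoded, original, carriers):
    """Correlate the residual against each carrier to recover the bits."""
    residual = encoded - original
    scores = np.tensordot(carriers, residual, axes=([1, 2, 3], [0, 1, 2]))
    return np.sign(scores)

video = rng.random((8, 32, 32), dtype=np.float32)
message = np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=np.float32)
encoded, carriers = embed_message(video, message)
print(decode_message(encoded, original=video, carriers=carriers))
```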
-
Publication No.: US20230419538A1
Publication Date: 2023-12-28
Application No.: US18464912
Filing Date: 2023-09-11
Applicant: Google LLC
Inventor: Yinxiao Li, Zhichao Lu, Xuehan Xiong, Jonathan Huang
IPC: G06T7/73
CPC classification number: G06T7/73, G06T2207/20081, G06T2207/30196, G06T2207/20084, G06T2207/10016
Abstract: A method includes receiving video data that includes a series of frames of image data. Here, the video data is representative of an actor performing an activity. The method also includes processing the video data to generate a spatial input stream including a series of spatial images representative of spatial features of the actor performing the activity, a temporal input stream representative of motion of the actor performing the activity, and a pose input stream including a series of images representative of a pose of the actor performing the activity. Using at least one neural network, the method also includes processing the temporal input stream, the spatial input stream, and the pose input stream. The method also includes classifying, by the at least one neural network, the activity based on the temporal input stream, the spatial input stream, and the pose input stream.