-
Publication No.: US20240404238A1
Publication Date: 2024-12-05
Application No.: US18698997
Filing Date: 2022-10-05
Applicant: Google LLC
Inventor: Jiahui Yu , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Yonghui Wu , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Han Zhang , Xin Li
IPC: G06V10/28 , G06F40/284 , G06V10/764 , G06V10/766 , G06V10/82
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pre-training a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
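The core encode-to-discrete-tokens step described above can be illustrated with a minimal nearest-neighbor codebook lookup. This is an illustrative sketch, not the patented implementation; the function name and shapes are assumptions:

```python
import numpy as np

def quantize_to_tokens(features, codebook):
    """Map each patch feature vector to the index of its nearest codebook entry.

    features: (num_patches, dim) array of encoder outputs (raster order).
    codebook: (vocab_size, dim) array of learned code vectors.
    Returns a 1-D array of discrete token ids that an autoregressive
    Transformer could then be trained to predict.
    """
    # Squared L2 distance from every feature to every codebook vector.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: 4 patch features quantized against a 3-entry codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 8))
features = np.stack([codebook[2], codebook[0], codebook[0], codebook[1]])
tokens = quantize_to_tokens(features, codebook)
```

Because each toy feature equals one codebook vector exactly, the lookup recovers the matching indices; in the real system the codebook and encoder are learned jointly.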
-
Publication No.: US12067646B2
Publication Date: 2024-08-20
Application No.: US17467628
Filing Date: 2021-09-07
Applicant: Google LLC
Inventor: Han Zhang , Jing Yu Koh , Jason Michael Baldridge , Yinfei Yang , Honglak Lee
IPC: G06T11/00 , G06F18/214 , G06F18/22 , G06N3/08 , G10L15/26
CPC classification number: G06T11/00 , G06F18/2148 , G06F18/22 , G06N3/08 , G10L15/26
Abstract: A computer-implemented method includes receiving, by a computing device, a particular textual description of a scene. The method also includes applying a neural network for text-to-image generation to generate an output image rendition of the scene, the neural network having been trained to cause two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other based on mutual information between a plurality of corresponding pairs, wherein the plurality of corresponding pairs comprise an image-to-image pair and a text-to-image pair. The method further includes predicting the output image rendition of the scene.
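The attract/repel training objective described above is the shape of a contrastive (InfoNCE-style) loss over batched pairs. The sketch below is a hypothetical stand-in, not the claimed method; names and the temperature value are assumptions:

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: each anchor row is attracted to its matching
    positive row and repelled from every other row in the batch."""
    # Normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

# Aligned pairs (same textual description) give low loss;
# shuffled pairs (different descriptions) give high loss.
x = np.eye(4)                                    # 4 mutually orthogonal embeddings
low = contrastive_loss(x, x)
high = contrastive_loss(x, np.roll(x, 1, axis=0))
```

The same loss shape can be applied to both image-to-image and text-to-image pairs, matching the "plurality of corresponding pairs" in the abstract.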
-
Publication No.: US20240112088A1
Publication Date: 2024-04-04
Application No.: US18520083
Filing Date: 2023-11-27
Applicant: Google LLC
Inventor: Jiahui Yu , Xin Li , Han Zhang , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Yonghui Wu
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
-
Publication No.: US20230274143A1
Publication Date: 2023-08-31
Application No.: US18173985
Filing Date: 2023-02-24
Applicant: Google LLC
Inventor: Zizhao Zhang , Zifeng Wang , Chen-Yu Lee , Ruoxi Sun , Sayna Ebrahimi , Xiaoqi Ren , Guolong Su , Vincent Perot , Tomas Pfister , Han Zhang
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: A method for rehearsal-free continual learning includes obtaining a set of training samples where each training sample in the set of training samples is associated with a respective task of a plurality of different tasks. The method includes obtaining a task-invariant prompt representative of learned knowledge common to each respective task of the plurality of different tasks. The method includes, for each respective task of the plurality of different tasks, obtaining a respective task-specific prompt representative of learned knowledge specific to the respective task. The method includes, during each of one or more training iterations, for each respective training sample in the set of training samples, selecting the respective task-specific prompt representative of the respective task of the respective training sample and training a model using the task-invariant prompt and the selected respective task-specific prompt.
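The prompt-selection step above can be sketched as prepending a shared prompt plus the selected per-task prompt to the sample before it reaches the (typically frozen) backbone. All names are illustrative, and the 1-D "token vectors" are a readability simplification:

```python
def build_prompted_input(sample_tokens, task_id, task_invariant_prompt, task_specific_prompts):
    """Prepend the task-invariant prompt and the prompt selected for this
    sample's task; the result is what would be fed to the model.

    All arguments are lists of token vectors (lists of floats).
    """
    selected = task_specific_prompts[task_id]  # select by the sample's task
    return task_invariant_prompt + selected + sample_tokens

# Toy usage: one shared prompt token, one prompt token per task.
g_prompt = [[0.0]]                                 # task-invariant prompt
e_prompts = {"taskA": [[1.0]], "taskB": [[2.0]]}   # task-specific prompts
x = [[9.0]]                                        # sample embedding
out = build_prompted_input(x, "taskB", g_prompt, e_prompts)
```

Only the small prompt parameters change per task, which is what makes the approach rehearsal-free: no stored examples from earlier tasks are replayed.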
-
Publication No.: US20240337839A1
Publication Date: 2024-10-10
Application No.: US18130601
Filing Date: 2023-04-04
Applicant: Google LLC
Inventor: Ozan Cakmakci , Oscar Alberto Martinez , Eliezer Glik , Han Zhang
IPC: G02B27/01
CPC classification number: G02B27/0172 , G02B2027/0178
Abstract: A head mounted display includes an eyeglasses frame, a lens framed in the eyeglasses frame, and a light engine disposed in the eyeglasses frame. The lens includes an optical shell comprising a world-facing spherical surface and an opposing eye-facing surface and a curved lightguide disposed in the optical shell. The curved lightguide includes an incoupler surface, a first freeform surface facing the world-facing spherical surface, and a second freeform surface facing the eye-facing surface. The lens further includes a first low refractive index region disposed between the first freeform surface and a first conformal freeform surface of the optical shell and a second low refractive index region disposed between the second freeform surface and a second conformal freeform surface of the optical shell.
-
Publication No.: US20230351192A1
Publication Date: 2023-11-02
Application No.: US18348587
Filing Date: 2023-07-07
Applicant: Google LLC
Inventor: Zizhao Zhang , Sercan Omer Arik , Tomas Jon Pfister , Han Zhang
IPC: G06N3/084 , G06N20/00 , G06N5/04 , G06V10/762 , G06V10/771 , G06V10/774 , G06V10/776 , G06V10/82
CPC classification number: G06N3/084 , G06N20/00 , G06N5/04 , G06V10/763 , G06V10/771 , G06V10/774 , G06V10/776 , G06V10/82
Abstract: A method for training a model comprises obtaining a set of labeled training samples each associated with a given label. For each labeled training sample, the method includes generating a pseudo label and estimating a weight of the labeled training sample indicative of an accuracy of the given label. The method also includes determining whether the weight of the labeled training sample satisfies a weight threshold. When the weight of the labeled training sample satisfies the weight threshold, the method includes adding the labeled training sample to a set of cleanly labeled training samples. Otherwise, the method includes adding the labeled training sample to a set of mislabeled training samples. The method includes training the model with the set of cleanly labeled training samples using corresponding given labels and the set of mislabeled training samples using corresponding pseudo labels.
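The weight-threshold partition at the center of the method above reduces to a simple split; this is an illustrative sketch with assumed names, not the patented weight-estimation procedure (which the abstract leaves unspecified):

```python
def split_by_weight(samples, weights, threshold=0.5):
    """Partition labeled samples into 'clean' and 'mislabeled' sets.

    A weight estimates how trustworthy a sample's given label is. Samples
    whose weight satisfies the threshold keep their given label; the rest
    are trained with pseudo labels instead.
    """
    clean, mislabeled = [], []
    for sample, w in zip(samples, weights):
        (clean if w >= threshold else mislabeled).append(sample)
    return clean, mislabeled

clean, noisy = split_by_weight(["a", "b", "c"], [0.9, 0.2, 0.7], threshold=0.5)
```

Training then uses the given labels for `clean` and the generated pseudo labels for `noisy`, so no sample is discarded outright.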
-
Publication No.: US20220375205A1
Publication Date: 2022-11-24
Application No.: US17664402
Filing Date: 2022-05-20
Applicant: Google LLC
Inventor: Zizhao Zhang , Han Zhang , Long Zhao , Tomas Pfister
IPC: G06V10/77 , G06V10/764 , G06V10/22 , G06V10/44
Abstract: A method includes receiving image data including a series of image patches of an image. The method includes generating, using a first set of transformers of a vision transformer (V-T) model, a first set of higher order feature representations based on the series of image patches and aggregating the first set of higher order feature representations into a second set of higher order feature representations that is smaller than the first set. The method includes generating, using a second set of transformers of the V-T model, a third set of higher order feature representations based on the second set of higher order feature representations and aggregating the third set of higher order feature representations into a fourth set of higher order feature representations that is smaller than the third set. The method includes generating, using the V-T model, an image classification of the image based on the fourth set.
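The repeated shrink-the-token-set pattern above can be illustrated with a stand-in aggregation step. Here the learned aggregation module is replaced by simple group averaging, purely to show how each stage produces a smaller set of higher-order representations:

```python
import numpy as np

def aggregate(features, out_len):
    """Aggregate a set of feature vectors into a smaller set by averaging
    consecutive groups (a stand-in for the learned aggregation module)."""
    groups = np.array_split(features, out_len, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

# 16 patch features -> 8 -> 4, mirroring the two aggregation stages
# between transformer sets (feature dim kept at 1 for readability).
feats = np.arange(16, dtype=float).reshape(16, 1)
stage2 = aggregate(feats, 8)   # shape (8, 1): smaller second set
stage3 = aggregate(stage2, 4)  # shape (4, 1): smaller fourth set
```

Each aggregation reduces sequence length before the next set of transformers runs, which is what makes the hierarchy cheaper than processing all patches at full length throughout.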
-
Publication No.: US20250069382A1
Publication Date: 2025-02-27
Application No.: US18726881
Filing Date: 2023-01-05
Applicant: Google LLC
Inventor: Yinxiao Li , Zhengzhong Tu , Hossein Talebi , Han Zhang , Feng Yang , Peyman Milanfar
IPC: G06V10/82 , G06V10/764 , G06V10/77
Abstract: Provided are machine learning systems and models featuring resolution-flexible multi-axis attention blocks. In particular, the present disclosure provides example multi-axis MLP based architectures (example implementations of which can be generally referred to as MAXIM) that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. In some implementations, MAXIM can use a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, some example implementations of MAXIM can contain two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature mutual conditioning.
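The spatially-gated MLP mentioned above can be sketched in miniature: the channels are split in half and one half gates the other after a spatial mix. The learned spatial projection is replaced here by a mean over positions; this is an assumption-laden illustration, not the MAXIM block:

```python
import numpy as np

def spatial_gating(x):
    """Split channels in half and gate one half with a spatial mix of the other.

    x: (positions, channels) with an even channel count.
    Returns (positions, channels // 2). The mean over positions stands in
    for a learned spatial projection (illustrative only).
    """
    u, v = np.split(x, 2, axis=-1)
    v = v.mean(axis=0, keepdims=True)  # spatial mixing stand-in
    return u * v                       # elementwise gate

out = spatial_gating(np.ones((4, 6)))  # 4 positions, 6 channels -> (4, 3)
```

The gating path is what lets the block modulate features by global spatial context at MLP-level cost rather than attention-level cost.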
-
Publication No.: US20240362830A1
Publication Date: 2024-10-31
Application No.: US18770154
Filing Date: 2024-07-11
Applicant: Google LLC
Inventor: Han Zhang , Jing Yu Koh , Jason Michael Baldridge , Yinfei Yang , Honglak Lee
IPC: G06T11/00 , G06F18/214 , G06F18/22 , G06N3/08 , G10L15/26
CPC classification number: G06T11/00 , G06F18/2148 , G06F18/22 , G06N3/08 , G10L15/26
Abstract: A computer-implemented method includes receiving, by a computing device, a particular textual description of a scene. The method also includes applying a neural network for text-to-image generation to generate an output image rendition of the scene, the neural network having been trained to cause two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other based on mutual information between a plurality of corresponding pairs, wherein the plurality of corresponding pairs comprise an image-to-image pair and a text-to-image pair. The method further includes predicting the output image rendition of the scene.
-
Publication No.: US20250022269A1
Publication Date: 2025-01-16
Application No.: US18902546
Filing Date: 2024-09-30
Applicant: Google LLC
Inventor: Yinxiao Li , Feng Yang , Peyman Milanfar , Han Zhang , Zhengzhong Tu , Hossein Talebi
Abstract: Provided is an efficient and scalable attention model that can be referred to as multi-axis attention. Example implementations can include two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. The present disclosure also presents a new architectural element by effectively blending the proposed multi-axis attention model with convolutions. In addition, the present disclosure proposes a simple hierarchical vision backbone, example implementations of which can be referred to as MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages.
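The "blocked local" half of the multi-axis attention above amounts to partitioning the spatial grid into non-overlapping windows so attention runs within each window; the sketch below shows that partition (the dilated "grid" pattern comes from transposing which axes are split). Names and shapes are illustrative assumptions:

```python
import numpy as np

def block_partition(x, b):
    """Partition an (H, W) grid into non-overlapping b x b blocks.

    Each output row holds one block's elements, i.e. one local attention
    window. Swapping which reshaped axes are grouped instead yields the
    dilated 'grid' windows used for the global-attention axis.
    """
    h, w = x.shape
    return x.reshape(h // b, b, w // b, b).transpose(0, 2, 1, 3).reshape(-1, b * b)

x = np.arange(16).reshape(4, 4)   # a 4x4 grid of token indices
blocks = block_partition(x, 2)    # four 2x2 local windows
```

Attention inside each window costs only window-size-squared, which is how the combined local and dilated-global pattern stays linear in input resolution.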
-