Systems and Methods for Improved Video Understanding

    Publication Number: US20240428586A1

    Publication Date: 2024-12-26

    Application Number: US18827088

    Application Date: 2024-09-06

    Applicant: Google LLC

    Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of spatiotemporal representations from the video data, the plurality of spatiotemporal representations comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of spatiotemporal representations as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
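
    As a rough illustration of the kind of pipeline this abstract describes, the sketch below extracts spatiotemporal "tubelet" tokens from a short clip and classifies them with a transformer encoder. It is a minimal sketch under stated assumptions, not the patented method; the module names, tubelet size, and model dimensions are illustrative.

```python
# Minimal sketch (not the patented implementation): spatiotemporal "tubelet"
# extraction from video frames followed by a transformer encoder classifier.
# All module names, sizes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class VideoClassifier(nn.Module):
    def __init__(self, num_classes=400, d_model=256, tubelet=(2, 16, 16),
                 frames=8, height=224, width=224, channels=3):
        super().__init__()
        # Conv3d with stride == kernel size splits the clip into non-overlapping
        # spatiotemporal patches (tubelets) and projects each to a d_model vector.
        self.to_tokens = nn.Conv3d(channels, d_model,
                                   kernel_size=tubelet, stride=tubelet)
        num_tokens = (frames // tubelet[0]) * (height // tubelet[1]) * (width // tubelet[2])
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                       # video: (B, C, T, H, W)
        tokens = self.to_tokens(video)              # (B, d_model, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_tokens, d_model)
        tokens = tokens + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))       # classification logits


clip = torch.randn(1, 3, 8, 224, 224)               # one 8-frame RGB clip
logits = VideoClassifier()(clip)
print(logits.shape)                                 # torch.Size([1, 400])
```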

    TRAINING ULTRA-LARGE-SCALE VISION TRANSFORMER NEURAL NETWORKS

    Publication Number: US20240256835A1

    Publication Date: 2024-08-01

    Application Number: US18424420

    Application Date: 2024-01-26

    Applicant: Google LLC

    CPC classification number: G06N3/0455 G06N3/088

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing an input through each of a plurality of layers of a neural network to generate an output using a plurality of hardware accelerators. The plurality of layers comprise a fully connected layer having a plurality of parameters arranged in a row dimension and a column dimension. One of the methods comprises: generating a plurality of parameter blocks by partitioning the plurality of parameters along the row dimension and the column dimension; determining a ratio of the number of parameters along the row dimension to the number of parameters along the column dimension; determining, based on the ratio, whether to use row sharding or column sharding with the plurality of hardware accelerators to calculate an output for the fully connected layer; and calculating the output for the fully connected layer using the determined sharding.
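
    The core decision in this abstract, choosing between row sharding and column sharding for a fully connected layer's parameter matrix, can be sketched in a few lines. The example below simulates both strategies with NumPy and picks one from the row-to-column parameter ratio; the threshold, shard count, and function names are assumptions for illustration, not details from the patent.

```python
# Minimal sketch (assumptions, not the patented method): partition a fully
# connected layer's weight matrix across simulated accelerators, choosing row
# or column sharding from the ratio of row to column parameters.
import numpy as np


def fc_output_sharded(x, weight, num_devices=4):
    """x: (batch, d_in); weight: (d_in, d_out). Returns x @ weight."""
    rows, cols = weight.shape
    ratio = rows / cols  # parameters along the row dimension vs. the column dimension

    if ratio >= 1.0:
        # Row sharding: each device holds a slice of the input dimension,
        # computes a partial product, and the partials are summed (all-reduce).
        row_shards = np.array_split(weight, num_devices, axis=0)
        x_shards = np.array_split(x, num_devices, axis=1)
        partials = [xs @ ws for xs, ws in zip(x_shards, row_shards)]
        return np.sum(partials, axis=0)
    else:
        # Column sharding: each device holds a slice of the output dimension
        # and the per-device outputs are concatenated (all-gather).
        col_shards = np.array_split(weight, num_devices, axis=1)
        outputs = [x @ ws for ws in col_shards]
        return np.concatenate(outputs, axis=1)


x = np.random.randn(2, 512)
w = np.random.randn(512, 2048)   # more columns than rows -> column sharding
np.testing.assert_allclose(fc_output_sharded(x, w), x @ w)
```

    Row sharding finishes with a sum of partial products (an all-reduce on real accelerators), while column sharding finishes with a concatenation (an all-gather), which is one reason the shape of the weight matrix matters when deciding how to shard it.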

    Systems and Methods for Pretraining Models for Diverse Downstream Tasks

    Publication Number: US20250156756A1

    Publication Date: 2025-05-15

    Application Number: US18835666

    Application Date: 2022-12-30

    Applicant: Google LLC

    Abstract: An example method for pretraining a machine-learned model is provided. The example method includes obtaining a plurality of different combinations of configuration parameters of a pretraining objective framework. The example method includes generating, using the pretraining objective framework, a plurality of corrupted training examples from one or more training examples, wherein the plurality of corrupted training examples are respectively generated according to the plurality of different combinations. The example method includes inputting the plurality of corrupted training examples into the machine-learned model, wherein the machine-learned model is configured to generate uncorrupted subportions corresponding to corrupted subportions of the corrupted training examples. The example method includes obtaining, from the machine-learned model, a plurality of outputs respectively generated by the machine-learned model based on the plurality of corrupted training examples. The example method includes updating one or more parameters of the machine-learned model based on an evaluation of the plurality of outputs.
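
    A hedged sketch of what such a pretraining objective framework might look like: one clean example is corrupted under several different combinations of configuration parameters (here, mean span length and corruption rate), and the dropped spans become the targets the model must reconstruct. The specific configurations, the sentinel token format, and the span-sampling scheme below are assumptions for illustration only.

```python
# Minimal sketch (assumptions, not the patented framework): generate corrupted
# training examples from a single token sequence under several combinations of
# corruption configuration parameters. The targets are the corrupted subportions
# the model is asked to reconstruct.
import random

CONFIGS = [  # illustrative combinations of configuration parameters
    {"mean_span": 3, "corrupt_rate": 0.15},
    {"mean_span": 8, "corrupt_rate": 0.15},
    {"mean_span": 3, "corrupt_rate": 0.50},
]


def corrupt(tokens, mean_span, corrupt_rate, rng):
    """Drop random spans, insert sentinel markers, return (input, targets)."""
    corrupted, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_rate:
            # Sample a span length around the configured mean and drop the span.
            span_len = max(1, round(rng.expovariate(1.0 / mean_span)))
            targets.append((f"<extra_id_{sentinel}>", tokens[i:i + span_len]))
            corrupted.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, targets


rng = random.Random(0)
example = "the quick brown fox jumps over the lazy dog".split()
for cfg in CONFIGS:
    corrupted_input, targets = corrupt(example, rng=rng, **cfg)
    print(cfg, corrupted_input, targets)
```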

    Neural Network Architectures with Multiple Normalization Layers for Machine Vision

    Publication Number: US20240257511A1

    Publication Date: 2024-08-01

    Application Number: US18419170

    Application Date: 2024-01-22

    Applicant: Google LLC

    CPC classification number: G06V10/82 G06V10/26

    Abstract: One example aspect of the present disclosure is directed to a neural network for machine vision. The neural network may include a stem block that includes a set of stem layers. The neural network may additionally include a visual transformer block. The set of stem layers may include a patch layer, a first normalization layer, an embedding layer, and a second normalization layer. The patch layer subdivides an input image into a set of image patches. The first normalization layer generates a set of normalized image patches by performing a first normalization process on each image patch of the set of image patches. The patch layer feeds forward to the first normalization layer. The embedding layer generates a set of vector embeddings. Each vector embedding of the set of vector embeddings is a projection of a corresponding normalized image patch from the set of normalized image patches onto a visual token. The first normalization layer feeds forward to the embedding layer. The second normalization layer generates a set of normalized vector embeddings by performing a second normalization process on each vector embedding of the set of vector embeddings. The embedding layer feeds forward to the second normalization layer. The visual transformer block enables one or more machine vision tasks for the input image based on the set of normalized vector embeddings. The second normalization layer feeds forward to the visual transformer block.
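
    The stem ordering in this abstract (patch layer, first normalization layer, embedding layer, second normalization layer, then the visual transformer block) can be traced in a short NumPy sketch. Layer normalization is assumed here as the normalization process, and the patch size and embedding width are illustrative; none of these specifics come from the patent.

```python
# Minimal sketch (assumptions, not the patented architecture): the stem ordering
# described above -- patchify, normalize, embed, normalize -- using layer
# normalization as the (assumed) normalization process.
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each row (one patch or one embedding) to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def stem(image, patch=16, d_model=256):
    rng = np.random.default_rng(0)
    h, w, c = image.shape
    # Patch layer: subdivide the image into non-overlapping patches.
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # First normalization layer: normalize each flattened image patch.
    patches = layer_norm(patches)
    # Embedding layer: project each normalized patch onto a visual token.
    projection = rng.standard_normal((patch * patch * c, d_model)) * 0.02
    embeddings = patches @ projection
    # Second normalization layer: normalize each vector embedding before it is
    # fed forward to the visual transformer block.
    return layer_norm(embeddings)


tokens = stem(np.random.rand(224, 224, 3))
print(tokens.shape)  # (196, 256) -- 14x14 patches, each a 256-dim visual token
```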
