Dense video captioning
    91.
    Granted Patent

    Publication No.: US10958925B2

    Publication Date: 2021-03-23

    Application No.: US16687405

    Filing Date: 2019-11-18

    Abstract: Systems and methods for dense captioning of a video include a multi-layer encoder stack configured to receive information extracted from a plurality of video frames, a proposal decoder coupled to the encoder stack and configured to receive one or more outputs from the encoder stack, a masking unit configured to mask the one or more outputs from the encoder stack according to one or more outputs from the proposal decoder, and a decoder stack coupled to the masking unit and configured to receive the masked one or more outputs from the encoder stack. The dense captioning is generated based on one or more outputs of the decoder stack. In some embodiments, the one or more outputs from the proposal decoder include a differentiable mask. In some embodiments, during training, error in the dense captioning is back propagated to the decoder stack, the encoder stack, and the proposal decoder.
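    The differentiable mask mentioned above can be illustrated by building a soft 0-to-1 gate from two sigmoids, so that captioning error can flow back into the proposal decoder. A minimal NumPy sketch; the function names, the (center, length) proposal parametrization, and the sharpness constant are illustrative assumptions, not taken from the patent:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def differentiable_mask(num_frames, center, length, sharpness=4.0):
        # Soft 0..1 gate over frame positions, built from two sigmoids so that
        # gradients can flow back through the proposal's (center, length).
        t = np.arange(num_frames, dtype=float)
        left = sigmoid(sharpness * (t - (center - length / 2.0)))
        right = sigmoid(sharpness * ((center + length / 2.0) - t))
        return left * right

    def mask_encoder_outputs(enc_out, mask):
        # Scale each frame's encoder feature vector by the proposal mask.
        return enc_out * mask[:, None]
    ```

    Frames near the proposal center get weights near one, frames far outside it near zero, and the transition stays smooth so the whole pipeline remains trainable end to end.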

    Multitask Learning As Question Answering
    92.
    Patent Application

    Publication No.: US20200380213A1

    Publication Date: 2020-12-03

    Application No.: US16996726

    Filing Date: 2020-08-18

    Abstract: Approaches for multitask learning as question answering include an input layer for encoding a context and a question, a self-attention based transformer including an encoder and a decoder, a first bi-directional long short-term memory (biLSTM) for further encoding an output of the encoder, a long short-term memory (LSTM) for generating a context-adjusted hidden state from the output of the decoder and a hidden state, an attention network for generating first attention weights based on an output of the first biLSTM and an output of the LSTM, a vocabulary layer for generating a distribution over a vocabulary, a context layer for generating a distribution over the context, and a switch for generating a weighting between the distributions over the vocabulary and the context, generating a composite distribution based on the weighting, and selecting a word of an answer using the composite distribution.
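    The switch described above mixes a generation distribution with a copy distribution, in the style of a pointer-generator. A minimal NumPy sketch of the composite distribution; the function name and the scalar gate are illustrative assumptions:

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def composite_distribution(vocab_logits, context_scores, context_token_ids, gate):
        # gate in [0, 1]: weighting produced by the switch between generating a
        # word from the vocabulary and copying a word from the context.
        p_vocab = softmax(vocab_logits)
        p_positions = softmax(context_scores)
        p_context = np.zeros_like(p_vocab)
        for pos, tok in enumerate(context_token_ids):
            p_context[tok] += p_positions[pos]  # scatter position mass onto word ids
        return gate * p_vocab + (1.0 - gate) * p_context
    ```

    The answer word is then selected from the composite distribution, e.g. by argmax or sampling; a context word that appears at several positions accumulates copy mass from each of them.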

    Weakly Supervised Natural Language Localization Networks

    Publication No.: US20200372116A1

    Publication Date: 2020-11-26

    Application No.: US16531343

    Filing Date: 2019-08-05

    Abstract: Systems and methods are provided for weakly supervised natural language localization (WSNLL), for example, as implemented in a neural network or model. The WSNLL network is trained with long, untrimmed videos, i.e., videos that have not been temporally segmented or annotated. The WSNLL network or model defines or generates a video-sentence pair, which corresponds to a pairing of an untrimmed video with an input text sentence. According to some embodiments, the WSNLL network or model is implemented with a two-branch architecture, where one branch performs segment sentence alignment and the other one conducts segment selection.
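    The two-branch scoring can be roughly sketched as follows. This is a simplified NumPy illustration that reuses cosine alignment scores to drive the selection branch, whereas the actual model learns a separate selection classifier:

    ```python
    import numpy as np

    def wsnll_scores(segment_feats, sentence_feat):
        # Branch 1: segment-sentence alignment via cosine similarity.
        segs = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
        sent = sentence_feat / np.linalg.norm(sentence_feat)
        align = segs @ sent
        # Branch 2: segment selection, here a softmax competition across segments.
        sel = np.exp(align) / np.exp(align).sum()
        # Fused score: segments that both match the sentence and win the
        # selection competition rank highest.
        return align * sel
    ```

    Because supervision is only at the video-sentence level, the model never sees ground-truth temporal boundaries; the competition across segments is what localizes the sentence within the untrimmed video.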

    Interpretable counting in visual question answering

    Publication No.: US10592767B2

    Publication Date: 2020-03-17

    Application No.: US15882220

    Filing Date: 2018-01-29

    Abstract: Approaches for interpretable counting for visual question answering include a digital image processor, a language processor, a scorer, and a counter. The digital image processor identifies objects in an image, maps the identified objects into an embedding space, generates bounding boxes for each of the identified objects, and outputs the embedded objects paired with their bounding boxes. The language processor embeds a question into the embedding space. The scorer determines scores for the identified objects. Each respective score indicates how well a corresponding one of the identified objects is responsive to the question. The counter determines a count of the objects in the digital image that are responsive to the question based on the scores. The count and a corresponding bounding box for each object included in the count are output. In some embodiments, the counter determines the count interactively based on interactions between counted and uncounted objects.
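    A crude sketch of the scorer/counter interplay: greedily count the highest-scoring object, then suppress uncounted objects that overlap one already counted. The IoU suppression here is an illustrative stand-in for the learned counted/uncounted interaction, not the patented mechanism:

    ```python
    def iou(a, b):
        # Intersection-over-union of two (x0, y0, x1, y1) boxes.
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def interactive_count(scores, boxes, threshold=0.5, overlap=0.5):
        # Count objects in descending score order, skipping any object whose
        # box heavily overlaps one that has already been counted.
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        counted = []
        for i in order:
            if scores[i] < threshold:
                break
            if all(iou(boxes[i], boxes[j]) < overlap for j in counted):
                counted.append(i)
        return len(counted), [boxes[i] for i in counted]
    ```

    Returning the counted boxes alongside the count is what makes the result interpretable: each unit of the count is grounded in a visible region of the image.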

    SPATIAL ATTENTION MODEL FOR IMAGE CAPTIONING
    95.
    Patent Application

    Publication No.: US20200057805A1

    Publication Date: 2020-02-20

    Application No.: US16661869

    Filing Date: 2019-10-23

    Abstract: The technology disclosed presents a novel spatial attention model that uses current hidden state information of a decoder long short-term memory (LSTM) to guide attention and to extract spatial image features for use in image captioning. The technology disclosed also presents a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network (CNN) and linguistic information from an LSTM. At each timestep, the adaptive attention model automatically decides how heavily to rely on the image, as opposed to the linguistic model, to emit the next caption word. The technology disclosed further adds a new auxiliary sentinel gate to an LSTM architecture and produces a sentinel LSTM (Sn-LSTM). The sentinel gate produces a visual sentinel at each timestep, which is an additional representation, derived from the LSTM's memory, of long and short term visual and linguistic information.
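    The sentinel mechanism can be illustrated by attending over the spatial CNN features plus one extra sentinel slot; the attention weight landing on that slot measures how much the model falls back on linguistic rather than visual information. A minimal NumPy sketch using plain dot-product scoring (the disclosed model uses a learned scoring network, so the details here are illustrative):

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def adaptive_attention(spatial_feats, sentinel, hidden):
        # Attend over k spatial features plus one "visual sentinel" slot derived
        # from the LSTM's memory; beta is the weight on that slot.
        candidates = np.vstack([spatial_feats, sentinel[None, :]])
        scores = candidates @ hidden
        alpha = softmax(scores)
        context = alpha @ candidates
        beta = alpha[-1]  # reliance on linguistic info vs. the image
        return context, beta
    ```

    A high beta at a timestep means the next caption word (e.g. "of" or "the") is driven mostly by the language model; a low beta means the model is looking at the image.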

    TRAINING A NEURAL NETWORK USING AUGMENTED TRAINING DATASETS

    Publication No.: US20190258901A1

    Publication Date: 2019-08-22

    Application No.: US16399163

    Filing Date: 2019-04-30

    Abstract: A computer system generates augmented training datasets to train neural network models. The computer system receives an initial training dataset comprising images for training a neural network model, and generates an augmented training dataset by modifying images from the initial training dataset. The computer system identifies a representation of a target object against a background from the initial training dataset and extracts a portion of the image displaying the target object. The computer system generates samples for inclusion in the augmented training dataset based on the image. For example, new images may be obtained by performing transformations on the portion of the image displaying the target object and/or by overlaying the transformed portion of the image over a different background. The modified images are included in the augmented training dataset used for training the neural network model to recognize the target object.
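    The extract-transform-overlay step can be sketched in a few lines of NumPy, assuming a single-channel image, an axis-aligned bounding box, and a horizontal flip as the one example transformation (the patent covers a broader family of transformations):

    ```python
    import numpy as np

    def augment_sample(image, bbox, background):
        # Extract the patch showing the target object, apply one example
        # transformation (a horizontal flip), and overlay it on a new background.
        x0, y0, x1, y1 = bbox
        patch = image[y0:y1, x0:x1]
        patch = patch[:, ::-1]
        out = background.copy()
        out[y0:y1, x0:x1] = patch
        return out
    ```

    Repeating this with varied transformations and backgrounds multiplies the number of distinct views of the target object without collecting new photographs.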

    Training a neural network using augmented training datasets

    Publication No.: US10346721B2

    Publication Date: 2019-07-09

    Application No.: US15801297

    Filing Date: 2017-11-01

    Abstract: A computer system generates augmented training datasets to train neural network models. The computer system receives an initial training dataset comprising images for training a neural network model, and generates an augmented training dataset by modifying images from the initial training dataset. The computer system identifies a representation of a target object against a background from the initial training dataset and extracts a portion of the image displaying the target object. The computer system generates samples for inclusion in the augmented training dataset based on the image. For example, new images may be obtained by performing transformations on the portion of the image displaying the target object and/or by overlaying the transformed portion of the image over a different background. The modified images are included in the augmented training dataset used for training the neural network model to recognize the target object.

    END-TO-END SPEECH RECOGNITION WITH POLICY LEARNING

    Publication No.: US20190130897A1

    Publication Date: 2019-05-02

    Application No.: US15878113

    Filing Date: 2018-01-23

    Abstract: The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria updates the model's parameters over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.
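    The multi-objective criterion combines a likelihood term with a policy-gradient term whose reward comes from a non-differentiable metric. A toy sketch using word-level edit distance as that metric; the mixing weight `lam` and the length normalization are illustrative assumptions, not values from the patent:

    ```python
    def edit_distance(ref, hyp):
        # Levenshtein distance: a non-differentiable performance metric that
        # penalizes transcriptions by their deviation from the ground truth.
        m, n = len(ref), len(hyp)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
        return d[m][n]

    def multi_objective_loss(log_p_truth, log_p_sample, ref_words, sampled_words, lam=0.1):
        # Maximum-likelihood term plus a REINFORCE-style policy-gradient term
        # whose reward is the negative, length-normalized edit distance of a
        # transcription sampled from the model.
        ml = -log_p_truth
        reward = -edit_distance(ref_words, sampled_words) / max(len(ref_words), 1)
        pg = -reward * log_p_sample
        return ml + lam * pg
    ```

    The likelihood term teaches the model the ground-truth transcription directly, while the policy-gradient term optimizes the evaluation metric itself, which gradients cannot reach through the metric alone.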

    NATURAL LANGUAGE PROCESSING USING A NEURAL NETWORK

    Publication No.: US20180349359A1

    Publication Date: 2018-12-06

    Application No.: US16000638

    Filing Date: 2018-06-05

    Abstract: A system includes a neural network for performing a first natural language processing task. The neural network includes a first rectifier linear unit capable of executing an activation function on a first input related to a first word sequence, and a second rectifier linear unit capable of executing an activation function on a second input related to a second word sequence. A first encoder is capable of receiving the result from the first rectifier linear unit and generating a first task specific representation relating to the first word sequence, and a second encoder is capable of receiving the result from the second rectifier linear unit and generating a second task specific representation relating to the second word sequence. A biattention mechanism is capable of computing, based on the first and second task specific representations, an interdependent representation related to the first and second word sequences. In some embodiments, the first natural language processing task performed by the neural network is one of sentiment classification and entailment classification.
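    The biattention computation can be sketched as an affinity matrix normalized in both directions, each sequence then summarized conditioned on the other. A simplified NumPy version; the real mechanism operates on the learned task-specific representations described above:

    ```python
    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def biattention(X, Y):
        # X: (len_x, d) and Y: (len_y, d) token representations.
        A = X @ Y.T                # (len_x, len_y) affinity matrix
        Ax = softmax(A, axis=1)    # for each X token, attention over Y
        Ay = softmax(A, axis=0)    # for each Y token, attention over X
        Cx = Ax @ Y                # Y-conditioned representation of X tokens
        Cy = Ay.T @ X              # X-conditioned representation of Y tokens
        return Cx, Cy
    ```

    For entailment, X and Y would be the premise and hypothesis; for sentiment, the two inputs can be the same sequence attended against itself.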

    THREE-DIMENSIONAL (3D) CONVOLUTION WITH 3D BATCH NORMALIZATION
    100.
    Patent Application
    Status: Pending - Published

    Publication No.: US20170046616A1

    Publication Date: 2017-02-16

    Application No.: US15237575

    Filing Date: 2016-08-15

    Abstract: The technology disclosed uses a 3D deep convolutional neural network architecture (DCNNA) equipped with so-called subnetwork modules which perform dimensionality reduction operations on 3D radiological volume before the 3D radiological volume is subjected to computationally expensive operations. Also, the subnetworks convolve 3D data at multiple scales by subjecting the 3D data to parallel processing by different 3D convolutional layer paths. Such multi-scale operations are computationally cheaper than the traditional CNNs that perform serial convolutions. In addition, performance of the subnetworks is further improved through 3D batch normalization (BN) that normalizes the 3D input fed to the subnetworks, which in turn increases learning rates of the 3D DCNNA. After several layers of 3D convolution and 3D sub-sampling across a series of subnetwork modules, a feature map with reduced vertical dimensionality is generated from the 3D radiological volume and fed into one or more fully connected layers.
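    3D batch normalization normalizes each channel of an (N, C, D, H, W) volume over the batch and all three spatial dimensions before scaling and shifting. A minimal NumPy sketch of the training-mode computation (the function name and parameter shapes are illustrative):

    ```python
    import numpy as np

    def batchnorm3d(x, gamma, beta, eps=1e-5):
        # x has shape (N, C, D, H, W); compute per-channel statistics over the
        # batch and the depth/height/width axes, normalize, then apply the
        # learned scale (gamma) and shift (beta).
        mean = x.mean(axis=(0, 2, 3, 4), keepdims=True)
        var = x.var(axis=(0, 2, 3, 4), keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + eps)
        g = gamma.reshape(1, -1, 1, 1, 1)
        b = beta.reshape(1, -1, 1, 1, 1)
        return g * x_hat + b
    ```

    Keeping every channel's activations zero-mean and unit-variance is what lets the DCNNA train with the higher learning rates the abstract mentions.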
