-
Publication Number: US10963782B2
Publication Date: 2021-03-30
Application Number: US15421193
Application Date: 2017-01-31
Applicant: salesforce.com, inc.
Inventor: Caiming Xiong, Victor Zhong, Richard Socher
Abstract: The technology disclosed relates to an end-to-end neural network for question answering, referred to herein as “dynamic coattention network (DCN)”. Roughly described, the DCN includes an encoder neural network and a coattentive encoder that capture the interactions between a question and a document in a so-called “coattention encoding”. The DCN also includes a decoder neural network and highway maxout networks that process the coattention encoding to estimate start and end positions of a phrase in the document that responds to the question.
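A minimal sketch may make the coattention step concrete. The following PyTorch fragment (tensor names, shapes, and the exact composition are assumptions drawn from the published DCN description, not the patent's claims) computes an affinity matrix between document and question encodings and derives a coattention encoding from it:

```python
import torch

def coattention(D, Q):
    """Coattention sketch. D: (batch, m, h) document encoding;
    Q: (batch, n, h) question encoding. Shapes are assumptions."""
    L = torch.bmm(D, Q.transpose(1, 2))            # (batch, m, n) affinity matrix
    A_q = torch.softmax(L, dim=1)                  # document attention per question word
    A_d = torch.softmax(L.transpose(1, 2), dim=1)  # question attention per document word
    C_q = torch.bmm(D.transpose(1, 2), A_q)        # (batch, h, n) document summaries
    # Re-attend question features and their document summaries over the document.
    C_d = torch.bmm(torch.cat([Q.transpose(1, 2), C_q], dim=1), A_d)  # (batch, 2h, m)
    return C_d.transpose(1, 2)                     # (batch, m, 2h) coattention encoding
```

The decoder and highway maxout networks that estimate start and end positions would consume this encoding; they are omitted here for brevity.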
-
Publication Number: US10958925B2
Publication Date: 2021-03-23
Application Number: US16687405
Application Date: 2019-11-18
Applicant: salesforce.com, inc.
Inventor: Yingbo Zhou, Luowei Zhou, Caiming Xiong, Richard Socher
IPC: H04N19/46, H04N19/44, H04N19/60, H04N19/187, H04N21/81, H04N19/33, H04N19/126, H04N19/132, H04N21/488
Abstract: Systems and methods for dense captioning of a video include a multi-layer encoder stack configured to receive information extracted from a plurality of video frames, a proposal decoder coupled to the encoder stack and configured to receive one or more outputs from the encoder stack, a masking unit configured to mask the one or more outputs from the encoder stack according to one or more outputs from the proposal decoder, and a decoder stack coupled to the masking unit and configured to receive the masked one or more outputs from the encoder stack. The dense captioning is generated based on one or more outputs of the decoder stack. In some embodiments, the one or more outputs from the proposal decoder include a differentiable mask. In some embodiments, during training, error in the dense captioning is back propagated to the decoder stack, the encoder stack, and the proposal decoder.
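As a rough illustration of how a differentiable mask can gate encoder outputs, the sketch below (the function name, the soft-window form, and the sharpness constant are assumptions, not the patent's formulation) multiplies per-frame encoder features by a smooth window derived from a proposal's predicted center and length:

```python
import torch

def masked_encoding(encoder_out, center, length, sharpness=10.0):
    """Gate per-frame encoder features with a smooth proposal window.
    encoder_out: (m, h); center, length: scalars in frame units (assumed)."""
    positions = torch.arange(encoder_out.size(0), dtype=encoder_out.dtype)
    left = torch.sigmoid(sharpness * (positions - (center - length / 2)))
    right = torch.sigmoid(sharpness * ((center + length / 2) - positions))
    mask = left * right                       # ~1 inside the proposal, ~0 outside
    return encoder_out * mask.unsqueeze(-1)   # gradients flow back through the mask
```

Because the mask is smooth rather than binary, the captioning error can be back-propagated through it into the proposal decoder, as the abstract notes.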
-
Publication Number: US20200380213A1
Publication Date: 2020-12-03
Application Number: US16996726
Application Date: 2020-08-18
Applicant: salesforce.com, inc.
Inventor: Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
IPC: G06F40/30, G06N3/08, G06N5/04, G06N3/04, G06F40/56, G06F16/242, G06F16/33, G06F16/332
Abstract: Approaches for multitask learning as question answering include an input layer for encoding a context and a question, a self-attention based transformer including an encoder and a decoder, a first bi-directional long short-term memory (biLSTM) for further encoding an output of the encoder, a long short-term memory (LSTM) for generating a context-adjusted hidden state from the output of the decoder and a hidden state, an attention network for generating first attention weights based on an output of the first biLSTM and an output of the LSTM, a vocabulary layer for generating a distribution over a vocabulary, a context layer for generating a distribution over the context, and a switch for generating a weighting between the distributions over the vocabulary and the context, generating a composite distribution based on the weighting, and selecting a word of an answer using the composite distribution.
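The switch at the end of that pipeline is essentially a learned mixture of two distributions. A minimal sketch follows (the layer size and the feature feeding the gate are assumptions):

```python
import torch
import torch.nn as nn

class OutputSwitch(nn.Module):
    """Vocabulary/context switch sketch; layer sizes are assumptions."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, state, p_vocab, p_context):
        # state: (batch, hidden) context-adjusted hidden state from the LSTM;
        # p_vocab, p_context: (batch, V) distributions over a shared output space.
        gamma = torch.sigmoid(self.gate(state))          # (batch, 1) mixing weight
        p = gamma * p_vocab + (1.0 - gamma) * p_context  # composite distribution
        return p.argmax(dim=-1), p                       # chosen word id, mixture
```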
-
Publication Number: US20200372116A1
Publication Date: 2020-11-26
Application Number: US16531343
Application Date: 2019-08-05
Applicant: salesforce.com, inc.
Inventor: Mingfei Gao, Richard Socher, Caiming Xiong
Abstract: Systems and methods are provided for weakly supervised natural language localization (WSNLL), for example, as implemented in a neural network or model. The WSNLL network is trained with long, untrimmed videos, i.e., videos that have not been temporally segmented or annotated. The WSNLL network or model defines or generates a video-sentence pair, which corresponds to a pairing of an untrimmed video with an input text sentence. According to some embodiments, the WSNLL network or model is implemented with a two-branch architecture, where one branch performs segment-sentence alignment and the other performs segment selection.
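A toy rendering of the two-branch idea (the module shapes and the fusion by elementwise product are assumptions, not the patent's design):

```python
import torch
import torch.nn as nn

class TwoBranchWSNLL(nn.Module):
    """Two-branch sketch: an alignment branch plus a selection branch."""
    def __init__(self, dim):
        super().__init__()
        self.align = nn.Linear(2 * dim, 1)    # segment-sentence alignment branch
        self.select = nn.Linear(dim, 1)       # segment selection branch

    def forward(self, segments, sentence):
        # segments: (batch, k, dim) candidate segment features;
        # sentence: (batch, dim) encoded input text sentence.
        s = sentence.unsqueeze(1).expand_as(segments)
        align = torch.sigmoid(self.align(torch.cat([segments, s], dim=-1))).squeeze(-1)
        select = torch.softmax(self.select(segments).squeeze(-1), dim=-1)
        return align * select                 # per-segment localization score
```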
-
Publication Number: US10592767B2
Publication Date: 2020-03-17
Application Number: US15882220
Application Date: 2018-01-29
Applicant: salesforce.com, inc.
Inventor: Alexander Richard Trott, Caiming Xiong, Richard Socher
IPC: G06K9/00, G06K9/46, G06F16/332, G06N5/04, G06N3/04
Abstract: Approaches for interpretable counting for visual question answering include a digital image processor, a language processor, a scorer, and a counter. The digital image processor identifies objects in an image, maps the identified objects into an embedding space, generates bounding boxes for each of the identified objects, and outputs the embedded objects paired with their bounding boxes. The language processor embeds a question into the embedding space. The scorer determines scores for the identified objects. Each respective score determines how well a corresponding one of the identified objects is responsive to the question. The counter determines a count of the objects in the digital image that are responsive to the question based on the scores. The count and a corresponding bounding box for each object included in the count are output. In some embodiments, the counter determines the count interactively based on interactions between counted and uncounted objects.
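The scorer-to-counter handoff can be sketched in a few lines (the fixed threshold is an assumption; as the abstract notes, the actual counter works interactively over counted and uncounted objects):

```python
import torch

def soft_count(object_scores, threshold=0.5):
    """Turn per-object relevance scores into a count and a box mask.
    object_scores: (batch, k) scorer outputs for k detected objects."""
    responsive = object_scores > threshold     # which objects answer the question
    return responsive.sum(dim=-1), responsive  # count, mask selecting output boxes
```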
-
Publication Number: US20200057805A1
Publication Date: 2020-02-20
Application Number: US16661869
Application Date: 2019-10-23
Applicant: salesforce.com, inc.
Inventor: Jiasen Lu, Caiming Xiong, Richard Socher
IPC: G06F17/27, G06N3/08, G06K9/66, G06K9/48, G06K9/46, G06K9/00, G06F17/24, G06K9/62, G06N3/04
Abstract: The technology disclosed presents a novel spatial attention model that uses the current hidden state information of a decoder long short-term memory (LSTM) to guide attention and to extract spatial image features for use in image captioning. The technology disclosed also presents a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network (CNN) and linguistic information from an LSTM. At each timestep, the adaptive attention model automatically decides how heavily to rely on the image, as opposed to the linguistic model, to emit the next caption word. The technology disclosed further adds a new auxiliary sentinel gate to an LSTM architecture and produces a sentinel LSTM (Sn-LSTM). The sentinel gate produces a visual sentinel at each timestep, which is an additional representation, derived from the LSTM's memory, of long-term and short-term visual and linguistic information.
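The adaptive attention mixture can be sketched as an extended softmax over the image regions plus the sentinel (the scoring callables below stand in for learned layers and are assumptions):

```python
import torch

def adaptive_attention(V, sentinel, h, score_regions, score_sentinel):
    """Adaptive attention sketch. V: (batch, k, d) region features;
    sentinel, h: (batch, d). Shapes and scoring forms are assumptions."""
    z = score_regions(V, h)                         # (batch, k) region scores
    z_s = score_sentinel(sentinel, h)               # (batch, 1) sentinel score
    alpha = torch.softmax(torch.cat([z, z_s], dim=-1), dim=-1)  # extended softmax
    beta = alpha[:, -1:]                            # mass won by the sentinel
    spatial = torch.softmax(z, dim=-1)              # attention over regions only
    c = (spatial.unsqueeze(-1) * V).sum(dim=1)      # spatial visual context
    return beta * sentinel + (1.0 - beta) * c       # mix sentinel and image context
```

Here beta is the model's per-timestep decision to rely on the linguistic model rather than the image: the more attention mass the sentinel wins, the less the next word depends on visual features.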
-
Publication Number: US20190258901A1
Publication Date: 2019-08-22
Application Number: US16399163
Application Date: 2019-04-30
Applicant: salesforce.com, inc.
Inventor: Evan Albright, Caiming Xiong
Abstract: A computer system generates augmented training datasets to train neural network models. The computer system receives an initial training dataset comprising images for training a neural network model, and generates an augmented training dataset by modifying images from the initial training dataset. The computer system identifies a representation of a target object against a background from the initial training dataset and extracts a portion of the image displaying the target object. The computer system generates samples for inclusion in the augmented training dataset based on the image. For example, new images may be obtained by performing transformations on the portion of the image displaying the target object and/or by overlaying the transformed portion of the image over a different background. The modified images are included in the augmented training dataset used for training the neural network model to recognize the target object.
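A hypothetical version of the described augmentation loop, operating on Pillow images (the specific transforms, their ranges, and the sample count are illustrative assumptions):

```python
import random

def augment(object_img, backgrounds, n_samples=10):
    """Transform an extracted object crop (a Pillow image) and paste it over
    varied background images to synthesize new training samples."""
    samples = []
    for _ in range(n_samples):
        obj = object_img.rotate(random.uniform(-30, 30), expand=True)
        scale = random.uniform(0.5, 1.5)
        obj = obj.resize((max(1, int(obj.width * scale)),
                          max(1, int(obj.height * scale))))
        bg = random.choice(backgrounds).copy()
        x = random.randint(0, max(0, bg.width - obj.width))
        y = random.randint(0, max(0, bg.height - obj.height))
        # Use the alpha channel as a paste mask so only the object is overlaid.
        bg.paste(obj, (x, y), obj if obj.mode == "RGBA" else None)
        samples.append(bg)
    return samples
```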
-
Publication Number: US10346721B2
Publication Date: 2019-07-09
Application Number: US15801297
Application Date: 2017-11-01
Applicant: salesforce.com, inc.
Inventor: Evan Albright, Caiming Xiong
Abstract: A computer system generates augmented training datasets to train neural network models. The computer system receives an initial training dataset comprising images for training a neural network model, and generates an augmented training dataset by modifying images from the initial training dataset. The computer system identifies a representation of a target object against a background from the initial training dataset and extracts a portion of the image displaying the target object. The computer system generates samples for inclusion in the augmented training dataset based on the image. For example, new images may be obtained by performing transformations on the portion of the image displaying the target object and/or by overlaying the transformed portion of the image over a different background. The modified images are included in the augmented training dataset used for training the neural network model to recognize the target object.
-
Publication Number: US20190130897A1
Publication Date: 2019-05-02
Application Number: US15878113
Application Date: 2018-01-23
Applicant: salesforce.com, inc.
Inventor: Yingbo Zhou, Caiming Xiong
Abstract: The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train the model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria update model parameters over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize the probability of outputting a correct transcription and a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric that penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions. Upon convergence after a final backpropagation iteration, the modified model parameters learned using the multi-objective learning criteria are persisted with the model to be applied to further end-to-end speech recognition.
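The combined objective can be sketched as a weighted sum of a likelihood term and a REINFORCE-style policy-gradient term (the interpolation weight and the reward definition, e.g. negative edit distance against the ground truth transcription, are assumptions):

```python
def multi_objective_loss(log_probs, sampled_log_probs, reward, alpha=0.5):
    """Combine maximum-likelihood and policy-gradient objectives.
    log_probs: tensor of token log-probabilities for the ground truth;
    sampled_log_probs: log-probabilities of a sampled transcription;
    reward: scalar score of that sample under a non-differentiable metric."""
    ml_loss = -log_probs.sum()                     # maximize p(correct transcription)
    pg_loss = -reward * sampled_log_probs.sum()    # REINFORCE-style surrogate loss
    return alpha * ml_loss + (1.0 - alpha) * pg_loss
```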
-
Publication Number: US20180349359A1
Publication Date: 2018-12-06
Application Number: US16000638
Application Date: 2018-06-05
Applicant: salesforce.com, inc.
Inventor: Bryan McCann, Caiming Xiong, Richard Socher
Abstract: A system includes a neural network for performing a first natural language processing task. The neural network includes a first rectified linear unit (ReLU) capable of executing an activation function on a first input related to a first word sequence, and a second ReLU capable of executing an activation function on a second input related to a second word sequence. A first encoder is capable of receiving the result from the first ReLU and generating a first task-specific representation relating to the first word sequence, and a second encoder is capable of receiving the result from the second ReLU and generating a second task-specific representation relating to the second word sequence. A biattention mechanism is capable of computing, based on the first and second task-specific representations, an interdependent representation related to the first and second word sequences. In some embodiments, the first natural language processing task performed by the neural network is one of sentiment classification and entailment classification.
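Biattention itself reduces to a pair of softmax-normalized affinity products. A minimal sketch (shapes are assumptions):

```python
import torch

def biattention(X, Y):
    """Biattention sketch. X: (batch, m, d), Y: (batch, n, d) are the two
    task-specific sequence representations."""
    A = torch.bmm(X, Y.transpose(1, 2))           # (batch, m, n) affinity matrix
    C_x = torch.bmm(torch.softmax(A, dim=-1), Y)  # Y summarized per X position
    C_y = torch.bmm(torch.softmax(A.transpose(1, 2), dim=-1), X)  # X per Y position
    return C_x, C_y                               # interdependent representations
```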
-