Abstract:
An example apparatus for visual question answering includes a receiver to receive an input image and a question. The apparatus also includes an encoder to encode the input image and the question into a query representation including visual attention features. The apparatus includes a knowledge spotter to retrieve a knowledge entry from a visual knowledge base pre-built on a set of question-answer pairs. The apparatus further includes a joint embedder to jointly embed the visual attention features and the knowledge entry to generate visual-knowledge features. The apparatus also further includes an answer generator to generate an answer based on the query representation and the visual-knowledge features.
Abstract:
An example apparatus for semantic image segmentation includes a receiver to receive an image to be segmented. The apparatus also includes a gated dense pyramid network including a plurality of gated dense pyramid (GDP) blocks to be trained to generate semantic labels for respective pixels in the received image. The apparatus further includes a generator to generate a segmented image based on the generated semantic labels.
Abstract:
A mechanism is described for facilitating slimming of neural networks in machine learning environments. A method of embodiments, as described herein, includes learning a first neural network associated with machine learning processes to be performed by a processor of a computing device, where learning includes analyzing a plurality of channels associated with one or more layers of the first neural network. The method may further include computing a plurality of scaling factors to be associated with the plurality of channels such that each channel is assigned a scaling factor, wherein each scaling factor to indicate relevance of a corresponding channel within the first neural network. The method may further include pruning the first neural network into a second neural network by removing one or more channels of the plurality of channels having low relevance as indicated by one or more scaling factors of the plurality of scaling factors assigned to the one or more channels.
Abstract:
Techniques are provided for training and operation of a topic-guided image captioning system. A methodology implementing the techniques according to an embodiment includes generating image feature vectors, for an image to be captioned, based on application of a convolutional neural network (CNN) to the image. The method further includes generating the caption based on application of a recurrent neural network (RNN) to the image feature vectors. The RNN is configured as a long short-term memory (LSTM) RNN. The method further includes training the LSTM RNN with training images and associated training captions. The training is based on a combination of: feature vectors of the training image; feature vectors of the associated training caption; and a multimodal compact bilinear (MCB) pooling of the training caption feature vectors and an estimated topic of the training image. The estimated topic is generated by an application of the CNN to the training image.
Abstract:
Techniques related to implementing convolutional neural networks for object recognition are discussed. Such techniques may include generating a set of binary neural features via convolutional neural network layers based on input image data and applying a strong classifier to the set of binary neural features to generate an object label for the input image data.
Abstract:
Techniques related to object detection using directional filtering are discussed. Such techniques may include determining directional weighted averages for pixels of an input image, generating a feature representation of the input image based on the directional weighted averages, and performing object detection by applying a multi-stage cascade classifier to the feature representation.
Abstract:
Methods and systems may provide for facial recognition of at least one input image utilizing hierarchical feature learning and pair-wise classification. Receptive field theory may be used on the input image to generate a pre-processed multi-channel image. Channels in the pre-processed image may be activated based on the amount of feature rich details within the channels. Similarly, local patches may be activated based on the discriminant features within the local patches. Features may be extracted from the local patches and the most discriminant features may be selected in order to perform feature matching on pair sets. The system may utilize patch feature pooling, pair-wise matching, and large-scale training in order to quickly and accurately perform facial recognition at a low cost for both system memory and computation.
Abstract:
Techniques related to implementing convolutional neural networks for object recognition are discussed. Such techniques may include generating a set of binary neural features via convolutional neural network layers based on input image data and applying a strong classifier to the set of binary neural features to generate an object label for the input image data.
Abstract:
Stereo image reconstruction techniques are described. An image from a root viewpoint is translated to an image from another viewpoint. Homography fitting is used to translate the image between viewpoints. Inverse compositional image alignment is used to determine a homography matrix and determine a pixel in the translated image.
Abstract:
Disclosed in some examples are various modifications to the shape regression technique for use in real-time applications, and methods, systems, and machine readable mediums which utilize the resulting facial landmark tracking methods.