Abstract:
Devices, systems, and methods are disclosed for improving story assembly and video summarization. For example, video clips may be received and a theme may be determined from the received video clips based on annotation data or other characteristics of the video clips. Individual moments may be extracted from the video clips based on the determined theme and the annotation data. The moments may be ranked based on a priority metric corresponding to content determined to be desirable for purposes of video summarization. Select moments may be chosen based on the priority metric, and a structure may be determined based on the theme. Finally, a video summarization including the select moments may be generated using the theme and the structure.
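A minimal sketch of the described selection flow, in Python. The Moment fields, the interest score standing in for the priority metric, and the most-frequent-tag heuristic standing in for theme determination are all illustrative assumptions rather than the disclosed implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Moment:
    clip_id: str
    start: float        # seconds into the source clip
    end: float
    tags: tuple         # annotation data, e.g. ("beach", "smiling")
    interest: float     # stand-in for the priority metric

def select_theme(moments):
    """Pick the most frequent annotation tag across all moments as the theme."""
    counts = Counter(tag for m in moments for tag in m.tags)
    return counts.most_common(1)[0][0]

def summarize(moments, max_moments=5):
    theme = select_theme(moments)
    # Keep only moments matching the theme and rank by the priority metric.
    themed = [m for m in moments if theme in m.tags]
    themed.sort(key=lambda m: m.interest, reverse=True)
    selected = themed[:max_moments]
    # Structure: here simply chronological order within and across clips.
    selected.sort(key=lambda m: (m.clip_id, m.start))
    return theme, selected
```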
Abstract:
Techniques are generally described for object detection in image data. First image data comprising a three-dimensional model representing an object may be received. First background image data comprising a first plurality of pixel values may be received. A first feature vector representing the three-dimensional model may be generated. A second feature vector representing the first plurality of pixel values of the first background image data may be generated. A first machine learning model may generate a transformed representation of the three-dimensional model using the first feature vector. First foreground image data comprising a two-dimensional representation of the transformed representation of the three-dimensional model may be generated. A frame of composite image data may be generated by combining the first foreground image data with the first background image data.
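As a rough illustration of the final compositing step only: the sketch below alpha-composites a pre-rendered 2D view of an already transformed 3D model onto background image data. The feature vectors and the learned transformation are out of scope here, and the function and variable names are assumptions.

```python
import numpy as np

def composite(foreground_rgba, background_rgb, top, left):
    """Alpha-composite a 2D rendering of the transformed 3D model
    onto background pixels at the given location."""
    out = background_rgb.astype(np.float32).copy()
    h, w = foreground_rgba.shape[:2]
    fg = foreground_rgba[..., :3].astype(np.float32)
    alpha = foreground_rgba[..., 3:4].astype(np.float32) / 255.0
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * fg + (1.0 - alpha) * region
    return out.astype(np.uint8)

# Example: a 16x16 opaque red square composited onto a gray background.
bg = np.full((64, 64, 3), 128, dtype=np.uint8)
fg = np.zeros((16, 16, 4), dtype=np.uint8)
fg[..., 0] = 255   # red channel
fg[..., 3] = 255   # fully opaque
frame = composite(fg, bg, top=10, left=20)
```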
Abstract:
A system and method for selecting portions of video data from preview video data are provided. The system may extract image features from the preview video data and discard video frames associated with poor image quality based on the image features. The system may determine similarity scores between individual video frames and corresponding transition costs, and may identify transition points in the preview video data based on the similarity scores and/or transition costs. The system may select portions of the video data for further processing based on the transition points and the image features. By selecting portions of the video data, the system may reduce the bandwidth consumption, processing burden, and/or latency associated with uploading the video data or performing further processing.
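A rough sketch of the frame-selection idea, assuming grayscale frames as 2D NumPy arrays. The Laplacian-variance quality score and the pixelwise similarity measure are simple stand-ins for the disclosed image features and similarity scores, and the thresholds are arbitrary.

```python
import numpy as np

def sharpness(frame):
    """Variance of a discrete Laplacian as a crude image-quality score."""
    f = frame.astype(np.float32)
    lap = (np.roll(f, 1, 0) + np.roll(f, -1, 0)
           + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f)
    return float(lap.var())

def similarity(a, b):
    """Negative mean absolute difference: higher means more similar frames."""
    return -float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))))

def transition_points(frames, blur_thresh=10.0, sim_thresh=-20.0):
    # Discard frames whose quality score suggests blur or low detail.
    kept = [f for f in frames if sharpness(f) > blur_thresh]
    points = []
    for i in range(1, len(kept)):
        # A sharp drop in similarity marks a candidate transition point.
        if similarity(kept[i - 1], kept[i]) < sim_thresh:
            points.append(i)
    return kept, points
```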
Abstract:
The subject technology provides embodiments for tracking a user's face/head (or another object) using one or more cameras provided by a computing device. Embodiments implement exposure sweeping from an average intensity of a current scene toward a target intensity for a given image. If a face is not detected, an exposure duration and/or gain may be adjusted and face detection is performed again. Once a face is detected, an average intensity of a virtual bounding box surrounding the detected face is determined, and exposure sweeping may be performed solely within the virtual bounding box to reach a target intensity. When the average intensity is within a predetermined threshold of the target intensity, the detected face may be at an optimal exposure. Embodiments also provide for switching to another camera (or cameras) of the computing device when no face is detected in the image after performing a full exposure sweep.
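A simplified sketch of one step of the exposure-sweep loop. The camera object with get_exposure()/set_exposure(), the detect_face() callable returning a bounding box, and the target and tolerance values are all hypothetical stand-ins.

```python
import numpy as np

TARGET = 118.0     # target mean intensity (hypothetical)
TOLERANCE = 8.0    # acceptable band around the target

def mean_intensity(frame, box=None):
    """Average intensity of the whole frame, or of a bounding box within it."""
    if box is not None:
        top, left, bottom, right = box
        frame = frame[top:bottom, left:right]
    return float(frame.mean())

def sweep_step(camera, frame, detect_face):
    box = detect_face(frame)           # None if no face was found
    intensity = mean_intensity(frame, box)
    if box is not None and abs(intensity - TARGET) <= TOLERANCE:
        return box, True               # face found at (near-)optimal exposure
    # Otherwise nudge the exposure toward the target and try again.
    step = 1.1 if intensity < TARGET else 0.9
    camera.set_exposure(camera.get_exposure() * step)
    return box, False
```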
Abstract:
Object tracking, such as may involve face tracking, can utilize different detection templates that can be trained using different data. A computing device can determine state information, such as the orientation of the device, an active illumination, or an active camera, to select an appropriate template for detecting an object, such as a face, in a captured image. Information about the object, such as the age range or gender of a person, can also be used, if available, to select an appropriate template. In some embodiments, instances of templates can be used to process various orientations, while in other embodiments specific orientations, such as upside-down orientations, may not be processed for reasons such as a high rate of inaccuracies or use too infrequent to justify the corresponding additional resource overhead.
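A toy sketch of state-based template selection. The template registry, the state fields, and the demographic refinement are invented for illustration.

```python
TEMPLATES = {
    ("portrait", "ir", "front"): "face_portrait_ir",
    ("portrait", "ambient", "front"): "face_portrait_rgb",
    ("landscape", "ambient", "front"): "face_landscape_rgb",
    # Upside-down orientations intentionally absent: not worth the overhead.
}

def select_template(orientation, illumination, camera, age_range=None):
    key = (orientation, illumination, camera)
    template = TEMPLATES.get(key)
    if template and age_range:
        template += "_" + age_range   # refine when demographic info is known
    return template                   # None means: skip detection in this state

print(select_template("portrait", "ir", "front"))          # face_portrait_ir
print(select_template("upside_down", "ambient", "front"))  # None
```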
Abstract:
Devices and techniques are generally described for estimating three-dimensional pose data. In some examples, a first machine learning network may generate first three-dimensional (3D) data representing input 2D data. In various examples, a first 2D projection of the first 3D data may be generated. A determination may be made that the first 2D projection conforms to a distribution of natural 2D data. A second machine learning network may generate parameters of a 3D model based at least in part on the input 2D data and based at least in part on the first 3D data. In some examples, second 3D data may be generated using the parameters of the 3D model.
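A skeletal sketch of the lift-and-check flow, assuming 2D keypoints of shape (J, 2) and a simple orthographic projection. The lifter, discriminator, and model_regressor callables stand in for the trained networks and are not specified by the abstract.

```python
import numpy as np

def project_orthographic(points_3d):
    """Drop the depth axis: a minimal 2D projection of (J, 3) joint positions."""
    return points_3d[:, :2]

def plausible(projection_2d, discriminator, threshold=0.5):
    """Ask a learned critic whether the projection resembles natural 2D pose data."""
    return discriminator(projection_2d) >= threshold

def estimate(keypoints_2d, lifter, discriminator, model_regressor):
    pose_3d = lifter(keypoints_2d)                  # first network: 2D -> 3D
    reprojection = project_orthographic(pose_3d)
    if not plausible(reprojection, discriminator):
        raise ValueError("2D projection rejected by the critic")
    # Second network: regress parameters of a parametric 3D body model.
    return model_regressor(keypoints_2d, pose_3d)
```

Passing the networks in as callables keeps the sketch self-contained while leaving their architectures open.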
Abstract:
Techniques are generally described for object detection in image data. First image data comprising a first plurality of pixel values representing an object and a second plurality of pixel values representing a background may be received. First foreground image data and first background image data may be generated from the first image data. A first feature vector representing the first plurality of pixel values may be generated. A second feature vector representing a first plurality of pixel values of second background image data may be generated. A first machine learning model may determine a first operation to perform on the first foreground image data. A transformed representation of the first foreground image data may be generated by performing the first operation on the first foreground image data. Composite image data may be generated by compositing the transformed representation of the first foreground image data with the second background image data.
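An illustrative sketch of applying a model-chosen operation to an extracted foreground before compositing onto new background image data. The operation table and the choose_op stub, which stands in for the first machine learning model, are assumptions.

```python
import numpy as np

OPS = {
    "flip_h": lambda img: img[:, ::-1],
    "flip_v": lambda img: img[::-1, :],
    "identity": lambda img: img,
}

def transform_and_composite(foreground, mask, background, choose_op):
    """`choose_op` stands in for the first machine learning model: it maps
    the foreground/background pair to the name of an operation to perform."""
    op = OPS[choose_op(foreground, background)]
    fg, m = op(foreground), op(mask)
    out = background.copy()
    out[m > 0] = fg[m > 0]       # paste transformed foreground pixels
    return out

# Example with a flat gray square pasted (mirrored) onto a black background.
bg = np.zeros((32, 32, 3), dtype=np.uint8)
fg = np.full((32, 32, 3), 200, dtype=np.uint8)
mask = np.zeros((32, 32), dtype=np.uint8)
mask[8:24, 8:24] = 1
frame = transform_and_composite(fg, mask, bg, lambda f, b: "flip_h")
```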
Abstract:
The techniques described herein may identify images that likely depict one or more items by comparing features of the items to features of different regions-of-interest (ROIs) of the images. For instance, some of the images may include a user, and the techniques may define multiple regions within the image corresponding to different portions of the user. The techniques may then use a trained convolutional neural network or any other type of trained classifier to determine, for each region of the image, whether the region depicts a particular item. If so, the techniques may designate the corresponding image as depicting the item and may output an indication that the image depicts the item. The techniques may perform this process for multiple images, outputting an indication of each image deemed to depict the particular item.
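A bare-bones sketch of the per-region scoring loop. The head/torso/legs region layout and the classify callable, standing in for the trained convolutional neural network, are invented for illustration.

```python
import numpy as np

def body_regions(h, w):
    """Coarse head / torso / legs bands of a person-centered image."""
    return {
        "head": (0, int(0.25 * h), 0, w),
        "torso": (int(0.25 * h), int(0.6 * h), 0, w),
        "legs": (int(0.6 * h), h, 0, w),
    }

def image_depicts_item(image, classify, item, threshold=0.8):
    h, w = image.shape[:2]
    for name, (top, bottom, left, right) in body_regions(h, w).items():
        # `classify` stands in for a trained CNN returning a confidence score.
        score = classify(image[top:bottom, left:right], item)
        if score >= threshold:
            return True, name          # image deemed to depict the item
    return False, None
```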
Abstract:
The techniques described herein may identify images that likely depict one or more items by comparing features of the items to features of different regions-of-interest (ROIs) of the images. When a user requests to identify images that depict a particular item, the techniques may determine an ROI size based on the size of the requested item. The techniques may then search multiple images using the ROI size.
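A tiny sketch of deriving a search-window size from the requested item's typical extent. The size table, scale factor, and margin are made up for illustration.

```python
# Rough typical extents in meters; values invented for illustration.
ITEM_EXTENT = {"watch": 0.05, "handbag": 0.35, "jacket": 0.8}

def roi_size(item, pixels_per_meter=400, margin=1.2):
    """Scale the item's expected extent into a square ROI, with some margin."""
    side = int(ITEM_EXTENT[item] * pixels_per_meter * margin)
    return side, side

print(roi_size("watch"))    # small search window for a small item
print(roi_size("jacket"))   # much larger window for a large item
```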
Abstract:
Various examples are directed to systems and methods for determining foreground regions in video frames. A computing device may select, from a first frame of a video, a plurality of scene point locations and divide the first frame into a plurality of sections. For a first section of the first frame, the computing device may generate a first vector subspace, where basis vectors of the first vector subspace are trajectories of scene point locations in the first section. The computing device may determine that a projection error for a first scene point location in the first section is greater than a projection error threshold and write an indication of a first pixel value, associated with the first scene point location, to a listing of foreground pixel values.
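A condensed sketch of the projection-error test, assuming each trajectory is a 2F-dimensional vector of x, y positions over F frames within one section. The rank and error threshold are illustrative, as is the choice of an SVD basis for the section's dominant motion.

```python
import numpy as np

def foreground_points(trajectories, rank=3, err_thresh=2.0):
    """trajectories: (N, 2F) array; each row is one scene point's tracked
    x, y positions over F frames within a single section of the frame."""
    mean = trajectories.mean(axis=0)
    centered = trajectories - mean
    # Low-rank basis spanning the section's dominant (background) motion.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                   # (rank, 2F)
    projected = centered @ basis.T @ basis              # project onto subspace
    errors = np.linalg.norm(centered - projected, axis=1)
    # Points whose trajectories project poorly are flagged as foreground.
    return np.where(errors > err_thresh)[0]
```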