Abstract:
Apparatuses, methods and storage medium associated with processing an image are disclosed herein. In embodiments, a method for processing one or more images may include generating a plurality of pairs of keypoint features for a pair of images. Each pair of keypoint features may include a keypoint feature from each image. Further, for each pair of keypoint features, corresponding adjoin features may be generated. Additionally, for each pair of keypoint features, whether the adjoin features are similar may be determined. Whether the pair of images has at least one similar object may also be determined, based at least in part on a result of the determination of similarity between the corresponding adjoin features. Other embodiments may be described and claimed.
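A minimal sketch of the pairwise keypoint / adjoin-feature check described above. The abstract does not define an "adjoin feature"; here it is assumed to be the descriptors of the k keypoints spatially nearest to a matched keypoint, and all function names, thresholds, and the cosine-similarity test are illustrative rather than the claimed method.

    import numpy as np

    def adjoin_features(keypoints, descriptors, idx, k=3):
        """Descriptors of the k keypoints spatially closest to keypoint idx (assumption)."""
        d = np.linalg.norm(keypoints - keypoints[idx], axis=1)
        order = np.argsort(d)[1:k + 1]          # skip the keypoint itself
        return descriptors[order]

    def images_share_object(kp1, desc1, kp2, desc2, sim_thresh=0.8, min_pairs=10):
        similar_pairs = 0
        for i in range(len(kp1)):
            # pair each keypoint in image 1 with its nearest descriptor in image 2
            j = int(np.argmin(np.linalg.norm(desc2 - desc1[i], axis=1)))
            a1 = adjoin_features(kp1, desc1, i)
            a2 = adjoin_features(kp2, desc2, j)
            # adjoin features are "similar" if their mean cosine similarity is high
            cos = np.sum(a1 * a2, axis=1) / (
                np.linalg.norm(a1, axis=1) * np.linalg.norm(a2, axis=1) + 1e-8)
            if cos.mean() > sim_thresh:
                similar_pairs += 1
        # the images are deemed to share at least one similar object if enough
        # keypoint pairs have similar adjoin features
        return similar_pairs >= min_pairs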
Abstract:
Systems, apparatuses and methods may generate a map of a search environment based on a probability of a target human being present within the search environment, capture a red, green, blue, depth (RGBD) image of one or more potential target humans in the search environment based on the map, and cause a robot apparatus to obtain a frontal view position with respect to at least one of the one or more potential target humans based on the RGBD image.
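An illustrative sketch of the search behaviour summarized above: build a probability-of-presence map, inspect the most likely cell, and compute a goal pose that places the robot in front of a detected person. Sensor capture and navigation are stubbed out, and every function name and the standoff geometry are assumptions, not the patented API.

    import numpy as np

    def build_search_map(prior, observations):
        """Probability-of-presence grid; here simply a normalised prior times evidence."""
        belief = prior * observations
        return belief / belief.sum()

    def frontal_view_goal(person_xy, person_heading, standoff=1.5):
        """Pose 'standoff' metres in front of the person, facing them (assumed geometry)."""
        facing = np.array([np.cos(person_heading), np.sin(person_heading)])
        goal_xy = person_xy + standoff * facing
        goal_yaw = person_heading + np.pi          # turn back toward the person
        return goal_xy, goal_yaw

    belief = build_search_map(np.full((10, 10), 0.01), np.random.rand(10, 10))
    cell = np.unravel_index(np.argmax(belief), belief.shape)   # most likely cell to search
    # rgbd = capture_rgbd(cell)   # hypothetical sensor call returning aligned RGB + depth
    goal_xy, goal_yaw = frontal_view_goal(np.array([2.0, 3.0]), person_heading=0.5)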
Abstract:
Disclosed in some examples are various modifications to the shape regression technique for use in real-time applications, and methods, systems, and machine readable mediums which utilize the resulting facial landmark tracking methods.
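A minimal cascaded-shape-regression sketch in the spirit of this abstract, with the real-time modification of seeding each frame from the previous frame's landmarks. The shape-indexed feature extractor and the learned linear stages below are placeholders, not the disclosed modifications.

    import numpy as np

    def shape_indexed_features(image, shape):
        """Toy feature: grayscale intensities sampled at the current landmark estimates."""
        h, w = image.shape
        xs = np.clip(shape[:, 0].astype(int), 0, w - 1)
        ys = np.clip(shape[:, 1].astype(int), 0, h - 1)
        return image[ys, xs].astype(float)

    def track_landmarks(image, init_shape, regressors):
        """Cascaded update: shape <- shape + R @ features at each learned stage."""
        shape = init_shape.astype(float).copy()    # (n_landmarks, 2) array
        for R in regressors:                       # R: (2 * n_landmarks, n_landmarks) matrix
            f = shape_indexed_features(image, shape)
            shape += (R @ f).reshape(-1, 2)
        return shape

    # For video, the previous frame's result typically initialises the next frame:
    # shape_t = track_landmarks(frame_t, shape_t_minus_1, regressors)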
Abstract:
Generally this disclosure describes a video communication system that replaces actual live images of the participating users with animated avatars. A method may include selecting an avatar, initiating communication, capturing an image, detecting a face in the image, extracting features from the face, converting the facial features to avatar parameters, and transmitting at least one of the avatar selection or avatar parameters.
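A sketch of the transmission side of the avatar pipeline: instead of sending video frames, only a small packet of avatar parameters is sent per frame. Face detection and feature extraction are stubbed, and the parameter names and packet layout are illustrative assumptions rather than the described protocol.

    import json, socket

    def facial_features_to_avatar_params(features):
        """Map detected facial features to avatar animation parameters (assumed mapping)."""
        return {
            "mouth_open":  features.get("mouth_gap", 0.0),
            "left_blink":  features.get("left_eye_closure", 0.0),
            "right_blink": features.get("right_eye_closure", 0.0),
            "head_yaw":    features.get("yaw", 0.0),
            "head_pitch":  features.get("pitch", 0.0),
        }

    def send_frame(sock, avatar_id, features):
        """Transmit the avatar selection and per-frame parameters, not the captured image."""
        packet = {"avatar": avatar_id, "params": facial_features_to_avatar_params(features)}
        sock.sendall(json.dumps(packet).encode() + b"\n")

    # Usage (hypothetical): features come from any face tracker run on the camera image.
    # sock = socket.create_connection(("peer.example", 9000))
    # send_frame(sock, avatar_id=3, features={"mouth_gap": 0.4, "yaw": 0.1})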
Abstract:
An apparatus to facilitate learning reliable keypoints in situ with introspective self-supervision is disclosed. The apparatus includes one or more processors to provide a view-overlapped keyframe pair from a pose graph that is generated by a visual simultaneous localization and mapping (VSLAM) process executed by the one or more processors; determine a keypoint match from the view-overlapped keyframe pair based on a keypoint detection and matching process, the keypoint match corresponding to a keypoint; calculate an inverse reliability score based on matched pixels corresponding to the keypoint match in the view-overlapped keyframe pair; identify a supervision signal associated with the keypoint match, the supervision signal comprising a keypoint reliability score of the keypoint based on a final pose output of the VSLAM process; and train a keypoint detection neural network using the keypoint match, the inverse reliability score, and the keypoint reliability score.
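A hedged sketch of such a self-supervised training step. The concrete reliability definitions below (reprojection error of the matched pixels for the inverse score, and a pose-agreement weight for the supervision signal) are plausible stand-ins, not the claimed formulas, and the network is assumed to map a keypoint patch to a single reliability score.

    import torch

    def inverse_reliability(px_a, px_b, pose_ab, K, depth_a):
        """Reprojection error of a matched pixel pair under the keyframe-pair pose (assumption)."""
        pt_a = depth_a * torch.linalg.inv(K) @ torch.cat([px_a, torch.ones(1)])
        pt_b = pose_ab[:3, :3] @ pt_a + pose_ab[:3, 3]
        proj = (K @ pt_b)[:2] / (K @ pt_b)[2]
        return torch.norm(proj - px_b)

    def training_step(net, optimizer, patch, inv_rel, final_pose_rel):
        """One step: the network predicts a per-keypoint reliability score."""
        pred = net(patch)                           # predicted reliability, scalar tensor
        # keypoints that reproject badly, or disagree with the final pose, get low targets
        target = final_pose_rel * torch.exp(-inv_rel)
        loss = ((pred - target) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()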
Abstract:
A long-term object tracker employs a continuous learning framework to overcome drift in the tracking position of a tracked object. The continuous learning framework consists of a continuous learning module that accumulates samples of the tracked object to improve the accuracy of object tracking over extended periods of time. The continuous learning module can include a sample pre-processor to refine a location of a candidate object found during object tracking, and a cropper to crop a portion of a frame containing a tracked object as a sample and to insert the sample into a continuous learning database to support future tracking.
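An illustrative sketch of the continuous-learning components named above: a sample pre-processor that refines a candidate box, a cropper that cuts the sample out of the frame, and a growing sample database. The refinement rule (re-centring the box on the strongest response inside it) is only a placeholder for whatever refinement the tracker actually uses.

    import numpy as np

    class ContinuousLearningModule:
        def __init__(self):
            self.database = []                     # accumulated samples of the tracked object

        def refine_location(self, response_map, box):
            """Shift the box toward the response peak inside it (placeholder rule)."""
            x, y, w, h = box
            local = response_map[y:y + h, x:x + w]
            dy, dx = np.unravel_index(np.argmax(local), local.shape)
            return (x + dx - w // 2, y + dy - h // 2, w, h)

        def crop_sample(self, frame, box):
            x, y, w, h = box
            return frame[max(y, 0):y + h, max(x, 0):x + w].copy()

        def accumulate(self, frame, response_map, box):
            """Refine, crop, and store a new sample to counteract long-term drift."""
            refined = self.refine_location(response_map, box)
            self.database.append(self.crop_sample(frame, refined))
            return refined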
Abstract:
A method for performing online learning for a model to detect unseen actions in an action recognition system is disclosed. The method includes extracting semantic features in a semantic domain from semantic action labels, transforming the semantic features from the semantic domain into mixed features in a mixed domain, and storing the mixed features in a feature database. The method further includes extracting visual features in a visual domain from a video stream and determining if the visual features indicate an unseen action in the video stream. If no unseen action is determined, the method applies an offline classification model to the visual features to identify seen actions, assigns identifiers to the identified seen actions, transforms the visual features from the visual domain into mixed features in the mixed domain, and stores the mixed features and seen action identifiers in the feature database. If an unseen action is determined, the method transforms the visual features from the visual domain into mixed features in the mixed domain, applies a continual learner model to mixed features from the feature database to identify unseen actions in the video stream, assigns identifiers to the identified unseen actions, and stores the unseen action identifiers in the feature database.
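A condensed sketch of that branching pipeline. The domain transforms, the unseen-action test, and both classifiers are simple stand-ins (a linear projection and a nearest-prototype rule), not the claimed models; all names are illustrative.

    import numpy as np

    feature_db = []                                    # (mixed_feature, label) pairs

    def to_mixed_domain(features, W):
        """Project semantic or visual features into the shared mixed domain (assumed linear map)."""
        return W @ features

    def is_unseen(visual_features, seen_prototypes, thresh=0.5):
        """Treat an action as 'unseen' if it is far from every known visual prototype (assumption)."""
        dists = [np.linalg.norm(visual_features - p) for p in seen_prototypes]
        return min(dists) > thresh

    def process_clip(visual_features, W_vis, offline_model, continual_model, seen_prototypes):
        mixed = to_mixed_domain(visual_features, W_vis)
        if not is_unseen(visual_features, seen_prototypes):
            label = offline_model(visual_features)      # seen action: offline classifier
        else:
            # unseen action: continual learner operates on the mixed-domain database
            label = continual_model(mixed, feature_db)
        feature_db.append((mixed, label))               # store for future learning
        return label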
Abstract:
Methods and apparatus to match images using semantic features are disclosed. An example apparatus includes a semantic labeler to determine a semantic label for each of a first set of points of a first image and each of a second set of points of a second image; a binary robust independent elementary features (BRIEF) determiner to determine semantic BRIEF descriptors for a first subset of the first set of points and a second subset of the second set of points based on the semantic labels; and a point matcher to match first points of the first subset of points to second points of the second subset of points based on the semantic BRIEF descriptors.
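A small sketch of a semantic BRIEF-style matcher along these lines. Here each descriptor bit compares the semantic labels (rather than intensities) at a fixed random pair of offsets around the point, and points are matched by Hamming distance; the descriptor length, sampling pattern, and threshold are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    PAIRS = rng.integers(-8, 9, size=(256, 4))           # fixed (dx1, dy1, dx2, dy2) pattern

    def semantic_brief(label_map, point):
        """256-bit descriptor built from semantic-label comparisons around the point."""
        x, y = point
        h, w = label_map.shape
        bits = []
        for dx1, dy1, dx2, dy2 in PAIRS:
            l1 = label_map[np.clip(y + dy1, 0, h - 1), np.clip(x + dx1, 0, w - 1)]
            l2 = label_map[np.clip(y + dy2, 0, h - 1), np.clip(x + dx2, 0, w - 1)]
            bits.append(l1 == l2)
        return np.array(bits, dtype=bool)

    def match_points(points1, labels1, points2, labels2, max_dist=60):
        d1 = [semantic_brief(labels1, p) for p in points1]
        d2 = [semantic_brief(labels2, p) for p in points2]
        matches = []
        for i, a in enumerate(d1):
            hamming = [np.count_nonzero(a ^ b) for b in d2]
            j = int(np.argmin(hamming))
            if hamming[j] <= max_dist:                    # accept sufficiently close descriptor pairs
                matches.append((i, j))
        return matches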
Abstract:
Techniques are provided for estimation of human orientation and facial pose in images that include depth information. A methodology embodying the techniques includes detecting a human in an image generated by a depth camera and estimating an orientation category associated with the detected human. The estimation is based on application of a random forest classifier, with leaf node template matching, to the image. The orientation category defines a range of angular offsets relative to an angle corresponding to the human facing the depth camera. The method also includes performing a three-dimensional (3D) facial pose estimation of the detected human, based on detected facial landmarks, in response to a determination that the estimated orientation category includes the angle corresponding to the human facing the depth camera.
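A sketch of that two-stage flow: first estimate a coarse orientation category from the depth image, then run 3D facial pose estimation only when the person is roughly facing the camera. A stock random forest stands in for the forest-with-leaf-node-template-matching step, and the landmark-based pose solver is left as a hypothetical stub.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    FRONTAL = 0                                        # category covering the camera-facing range

    def train_orientation_forest(depth_patches, categories):
        """Fit a coarse orientation-category classifier on flattened depth patches."""
        forest = RandomForestClassifier(n_estimators=50)
        forest.fit(depth_patches.reshape(len(depth_patches), -1), categories)
        return forest

    def process_detection(forest, depth_patch):
        category = int(forest.predict(depth_patch.reshape(1, -1))[0])
        if category == FRONTAL:
            # landmarks = detect_facial_landmarks(depth_patch)   # hypothetical landmark detector
            # return solve_3d_pose(landmarks)                    # e.g. a PnP-style pose solver
            return "run 3D facial pose estimation"
        return f"non-frontal orientation category {category}"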