Abstract:
A navigation system for providing driving instructions to a driver of a vehicle traveling on a route is provided. The driving instructions are generated by executing a multimodal fusion method that comprises extracting features from sensor measurements, annotating the features with directions for the vehicle to follow the route with respect to objects sensed by the sensors, and encoding the annotated features with a multimodal attention neural network to produce encodings. The encodings are transformed into a common latent space, and the transformed encodings are fused using an attention mechanism, producing an encoded representation of the scene. The method further comprises decoding the encoded representation with a sentence generation neural network to generate a driving instruction and submitting the driving instruction to an output device.
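A minimal sketch of the transform-and-fuse step, assuming per-modality encodings already produced upstream; the modality names, dimensions, projection matrices, and attention query are hypothetical stand-ins for learned parameters, not the system's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-modality feature encodings (e.g., camera, LiDAR, map),
# each with its own feature dimension.
encodings = {"camera": rng.standard_normal((5, 64)),
             "lidar": rng.standard_normal((7, 32)),
             "map": rng.standard_normal((3, 16))}

d_latent = 48
# Random stand-ins for learned projections into the common latent space.
proj = {m: rng.standard_normal((e.shape[1], d_latent)) / np.sqrt(e.shape[1])
        for m, e in encodings.items()}

# 1. Transform every modality's encodings into the shared latent space.
latent = np.concatenate([e @ proj[m] for m, e in encodings.items()], axis=0)

# 2. Fuse with a single attention query (learned in practice): attention
#    weights over all latent vectors yield one encoded scene representation.
query = rng.standard_normal(d_latent)
weights = softmax(latent @ query / np.sqrt(d_latent))
scene_encoding = weights @ latent

print(scene_encoding.shape)  # (48,), input to the sentence-generation decoder
```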
Abstract:
A point cloud encoder includes an input interface to accept a dynamic point cloud including a sequence of point cloud frames of a scene. A processor encodes blocks of a current point cloud frame to produce an encoded frame. To encode a current block of the current point cloud frame, a reference block similar to the current block according to a similarity metric is selected to serve as a reference for encoding the current block. Each point in the current block is paired to a point in the reference block based on the values of the paired points. The current block is encoded based on a combination of an identification of the reference block and residuals between the values of the paired points, wherein the residuals are ordered according to an order of the values of the points in the reference block. A transmitter transmits the encoded frame over a communication channel.
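The pairing and residual ordering can be illustrated with a small sketch; the scalar point values, the MSE-of-sorted-values similarity metric, and the function names are illustrative assumptions rather than the encoder's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def similarity(a, b):
    # Similarity metric: MSE between value-sorted blocks (lower is better).
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

def pair_and_encode(current, reference):
    # Pair points of the two blocks by value order; residuals are stored in
    # the order of the reference block's sorted values, so the decoder can
    # redo the same sort on the reference and invert the pairing.
    return np.sort(current) - np.sort(reference)

# Hypothetical scalar point values (e.g., an attribute such as intensity).
candidates = [rng.standard_normal(16) for _ in range(4)]  # decoded blocks
current = candidates[2] + 0.05 * rng.standard_normal(16)  # current block

# Select the reference block most similar to the current block.
ref_id = min(range(len(candidates)),
             key=lambda i: similarity(current, candidates[i]))

# Encoded block = identification of the reference + ordered residuals.
residuals = pair_and_encode(current, candidates[ref_id])

# Decoder side: reconstruct the (sorted) current values from the reference.
decoded = np.sort(candidates[ref_id]) + residuals
print(ref_id, np.max(np.abs(np.sort(current) - decoded)))  # ~0
```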
Abstract:
A method processes keypoint trajectories in a video, wherein the keypoint trajectories describe the motion of a plurality of keypoints across pictures of the video over time, by first acquiring the video of a scene using a camera. Keypoints and associated feature descriptors are detected in each picture. The keypoints and associated feature descriptors are matched between neighboring pictures to generate keypoint trajectories. Then, the keypoint trajectories are coded predictively into a bitstream, which is outputted.
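A sketch of the predictive coding stage only, assuming detection and matching have already produced the trajectories; the quantization step and delta scheme are illustrative choices, not the method's specified syntax:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical keypoint trajectories: K keypoints tracked over T pictures,
# each with an (x, y) position (detection and matching are omitted here).
K, T = 4, 10
traj = np.cumsum(rng.normal(0.0, 1.5, size=(T, K, 2)), axis=0) + 100.0

def encode_trajectories(traj, q=0.5):
    # Predictive coding: the first positions are intra-coded; each later
    # position is coded as a quantized delta from the decoder's running
    # reconstruction of the previous picture.
    bitstream = [np.round(traj[0] / q).astype(int)]
    recon = bitstream[0] * q
    for t in range(1, len(traj)):
        delta = np.round((traj[t] - recon) / q).astype(int)  # prediction residual
        bitstream.append(delta)
        recon = recon + delta * q  # mirror the decoder's state
    return bitstream

def decode_trajectories(bitstream, q=0.5):
    recon = [bitstream[0] * q]
    for delta in bitstream[1:]:
        recon.append(recon[-1] + delta * q)
    return np.stack(recon)

bits = encode_trajectories(traj)
recon = decode_trajectories(bits)
print(np.abs(recon - traj).max())  # bounded by half the quantization step
```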
Abstract:
A method processes a signal represented as a graph by first determining a graph spectral transform based on the graph. In a spectral domain, parameters of a graph filter are estimated using a training data set of unenhanced and corresponding enhanced signals. The graph filter is derived based on the graph spectral transform and the estimated graph filter parameters. Then, the signal is processed using the graph filter to produce an output signal. The processing can enhance signals such as images by denoising or interpolating missing samples.
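A sketch under simple assumptions: a path graph stands in for the signal's graph, and a per-frequency least-squares fit stands in for the parameter estimation; none of these choices are mandated by the method:

```python
import numpy as np

rng = np.random.default_rng(3)

# Path graph over N samples; the eigenvectors of its Laplacian give the
# graph spectral transform (graph Fourier basis).
N = 32
A = np.zeros((N, N))
idx = np.arange(N - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)  # graph frequencies and transform

# Training set: enhanced (clean) signals and unenhanced (noisy) versions.
t = np.linspace(0, 3 * np.pi, N)
clean = np.sin(t)[None, :] * rng.uniform(0.5, 2.0, (20, 1))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# Estimate per-frequency filter parameters h by least squares in the
# spectral domain: h[k] minimizes |h[k] * X_noisy[k] - X_clean[k]|^2.
Xn, Xc = noisy @ U, clean @ U
h = (Xn * Xc).sum(axis=0) / ((Xn ** 2).sum(axis=0) + 1e-12)

# Graph filter derived from the transform and the estimated parameters,
# applied to a new noisy signal (denoising).
test = np.sin(t) + 0.3 * rng.standard_normal(N)
enhanced = U @ (h * (U.T @ test))
print(np.linalg.norm(test - np.sin(t)), np.linalg.norm(enhanced - np.sin(t)))
```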
Abstract:
A set of input images is acquired sequentially as image tensors. A low-tubal-rank tensor and a sparse tensor are initialized using the image tensor, wherein the low-tubal-rank tensor is a tensor product of a low-rank spanning tensor basis and corresponding tensor coefficients. For each image, the image tensor, the tensor coefficients, and the sparse tensor are updated iteratively using the image tensor and the low-rank spanning tensor basis from a previous iteration. The spanning tensor basis is updated using the tensor coefficients, the sparse tensor, and the low-tubal-rank tensor, wherein the low-tubal-rank tensor represents a set of output images and the sparse tensor represents a set of sparse images.
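A condensed sketch of one online step, with the tensor product (t-product) defined in the FFT domain; the dimensions, threshold, alternation count, and least-squares coefficient update are illustrative assumptions, and the basis update itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

def t_prod(A, B):
    # Tensor t-product: slice-wise matrix products in the FFT domain
    # along the third (tube) dimension.
    Af, Bf = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)
    return np.real(np.fft.ifft(Cf, axis=2))

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

n1, n3, r = 20, 8, 3                      # image height, tube length, tubal rank
basis = rng.standard_normal((n1, r, n3))  # low-rank spanning tensor basis

def process_image(img, basis, lam=0.3, iters=10):
    # One online step: alternately update the tensor coefficients and the
    # sparse tensor for a single image, holding the basis fixed (as carried
    # over from the previous iteration).
    S = np.zeros_like(img)
    Bf = np.fft.fft(basis, axis=2)
    for _ in range(iters):
        # Coefficients: per-frequency least squares against (image - sparse).
        Rf = np.fft.fft(img - S, axis=2)
        Cf = np.stack([np.linalg.lstsq(Bf[:, :, k], Rf[:, :, k], rcond=None)[0]
                       for k in range(n3)], axis=2)
        coef = np.real(np.fft.ifft(Cf, axis=2))
        # Sparse tensor: soft-thresholded reconstruction residual.
        S = soft(img - t_prod(basis, coef), lam)
    return coef, S

# A hypothetical incoming image tensor: low-tubal-rank part plus sparse outliers.
truth = t_prod(basis, rng.standard_normal((r, 1, n3)))
img = truth + soft(rng.standard_normal(truth.shape), 1.5)  # sparse corruption
coef, S = process_image(img, basis)
print(np.linalg.norm(t_prod(basis, coef) - truth) / np.linalg.norm(truth))
```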
Abstract:
A method segments an image acquired by a sensor of a scene by first obtaining motion vectors corresponding to the image and generating a motion vanishing point image, wherein each pixel in the motion vanishing point image represents a number of intersections of pairs of motion vectors at the pixel. In the motion vanishing point image, a representation point for each motion vector is generated, and distances between the motion vectors are determined based on the representation points. Then, a motion graph is constructed, wherein each node represents a motion vector and each edge is assigned a weight based on the distance between its nodes. Graph spectral clustering is performed on the motion graph to produce segments of the image.
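A sketch of the pipeline for two motions; for brevity the vanishing point image is collapsed into per-vector lists of pairwise intersections (the median of each list serving as the representation point), and the clustering uses a simple Fiedler-vector bipartition; all of these are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical motion field: two motions whose vectors point at two
# different vanishing points.
vps = np.array([[40.0, 30.0], [10.0, 55.0]])
pos = rng.uniform(0, 64, size=(30, 2))
true_labels = np.repeat([0, 1], 15)
dirs = vps[true_labels] - pos
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

def intersect(p1, d1, p2, d2):
    # Intersection of two motion-vector lines, or None if near-parallel.
    M = np.array([d1, -d2]).T
    if abs(np.linalg.det(M)) < 1e-9:
        return None
    t = np.linalg.solve(M, p2 - p1)
    return p1 + t[0] * d1

# Pairwise intersections (the votes that would be accumulated into the
# motion vanishing point image); each vector's representation point is the
# median of the intersections it participates in.
n = len(pos)
hits = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        x = intersect(pos[i], dirs[i], pos[j], dirs[j])
        if x is not None:
            hits[i].append(x)
            hits[j].append(x)
rep = np.array([np.median(h, axis=0) for h in hits])

# Motion graph: nodes are motion vectors, edge weights derived from the
# distances between their representation points.
D = np.linalg.norm(rep[:, None] - rep[None], axis=2)
W = np.exp(-(D / (D.mean() + 1e-12)) ** 2)

# Graph spectral clustering: the sign of the Fiedler vector of the graph
# Laplacian splits the motion vectors into two segments.
Lap = np.diag(W.sum(axis=1)) - W
_, vecs = np.linalg.eigh(Lap)
segments = (vecs[:, 1] > 0).astype(int)
print(segments)
```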
Abstract:
In a decoder, a desired image is estimated by first retrieving coding modes from an encoded side information image. For each bitplane in the encoded side information image, syndrome bits or parity bits are decoded to obtain an estimated bitplane of quantized transform coefficients of the desired image. A transform and a quantization are applied to a prediction residual obtained using the coding modes, wherein the decoding uses the resulting quantized transform coefficients of the encoded side information image and is based on previously decoded bitplanes in a causal neighborhood. The estimated bitplanes of quantized transform coefficients of the desired image are combined to produce combined bitplanes. Then, an inverse quantization, an inverse transform, and a prediction based on the coding modes are applied to the combined bitplanes to recover the estimate of the desired image.
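A sketch of only the bitplane combination and the inverse stages; the syndrome/parity decoding is stubbed out (a real LDPC or turbo decoder correcting side-information bitplanes is beyond this sketch), and the block size, quantizer, and DCT are illustrative assumptions:

```python
import numpy as np

def idct2(X):
    # Separable inverse DCT-II (orthonormal) for small square blocks.
    N = X.shape[0]
    k = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[:, None] + 1) * k[None, :] / (2 * N))
    C[:, 0] = np.sqrt(1.0 / N)
    return C @ X @ C.T

def combine_bitplanes(planes):
    # Combine per-bitplane arrays (MSB first) into quantized coefficient
    # magnitudes; a sign plane would be handled separately in practice.
    coeff = np.zeros(planes[0].shape, dtype=int)
    for p in planes:
        coeff = (coeff << 1) | p
    return coeff

def decode_bitplane(syndrome_bits, side_info_plane, true_plane):
    # Stand-in for syndrome/parity decoding of one bitplane, conditioned on
    # previously decoded bitplanes in a causal neighborhood; here it simply
    # returns the true plane.
    return true_plane

# Hypothetical 4x4 block of quantized transform coefficients of the desired image.
rng = np.random.default_rng(6)
q_step = 4.0
true_coeff = rng.integers(0, 16, size=(4, 4))
planes = [(true_coeff >> b) & 1 for b in (3, 2, 1, 0)]  # MSB -> LSB

decoded_planes = [decode_bitplane(None, None, p) for p in planes]
coeff = combine_bitplanes(decoded_planes)

# Inverse quantization and inverse transform; the prediction based on the
# coding modes would then be added to this block.
block = idct2(coeff * q_step)
print(np.array_equal(coeff, true_coeff))
```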
Abstract:
A method decodes blocks in pictures of a video in an encoded bitstream by storing previously decoded blocks in a buffer. The previously decoded blocks are displaced less than a predetermined range relative to a current block being decoded. Cached blocks are maintained in a cache, the cached blocks including a set of best-matching previously decoded blocks that are displaced greater than the predetermined range relative to the current block. The bitstream is parsed to obtain a prediction indicator that determines whether the current block is predicted from the previously decoded blocks in the buffer or from the cached blocks in the cache. Based on the prediction indicator, a prediction block is generated and, in a summation process, added to a reconstructed residual block to form a decoded block as output.
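A sketch of the decoder-side choice between the near-range buffer and the cache; the LRU eviction policy, the parsed symbol fields, and the block sizes are assumptions made for illustration:

```python
from collections import OrderedDict

import numpy as np

class BlockCache:
    # Cache of best-matching previously decoded blocks that lie beyond the
    # near range; an LRU policy (an assumption here) bounds its size.
    def __init__(self, capacity=8):
        self.blocks = OrderedDict()
        self.capacity = capacity

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

    def get(self, key):
        self.blocks.move_to_end(key)
        return self.blocks[key]

def decode_block(symbol, near_buffer, cache, recon_residual):
    # Form the prediction from the buffer or the cache according to the
    # parsed prediction indicator, then add the reconstructed residual.
    if symbol["use_cache"]:
        prediction = cache.get(symbol["cache_id"])
    else:
        prediction = near_buffer[symbol["displacement"]]
    return prediction + recon_residual

# Toy usage with hypothetical parsed symbols.
rng = np.random.default_rng(7)
near_buffer = {(-1, 0): rng.integers(0, 256, (8, 8))}  # nearby decoded block
cache = BlockCache()
cache.put(3, rng.integers(0, 256, (8, 8)))             # far, best-matching block
residual = rng.integers(-2, 3, (8, 8))

decoded = decode_block({"use_cache": True, "cache_id": 3},
                       near_buffer, cache, residual)
print(decoded.shape)
```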
Abstract:
A method decodes a bitstream including compressed pictures of a video, wherein each picture includes one or more slices, each slice includes one or more blocks of pixels, and each pixel has a value corresponding to a color. For each slice, a reduced number of colors corresponding to the slice is first obtained, wherein each color is represented as a color triplet and the reduced number of colors is less than or equal to the number of colors in the slice. Then, for each block, a prediction mode is determined, wherein an independent uniform prediction mode is included in a candidate set of prediction modes. For each block, a predictor block is generated, wherein all values of the predictor block have a uniform value according to a color index when the prediction mode is set as the independent uniform prediction mode. Lastly, the predictor block is added to a reconstructed residue block to form a decoded block as output.
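A sketch of the independent uniform prediction mode, assuming the slice's reduced color set (a palette of color triplets) has already been decoded; the palette contents, block size, and residue are illustrative:

```python
import numpy as np

# Hypothetical reduced color set for a slice: three color triplets.
palette = np.array([[255, 0, 0], [0, 128, 0], [0, 0, 255]], dtype=np.int16)

def predict_block(mode, shape, color_index=None, neighbor_block=None):
    # Under the independent uniform prediction mode, every value of the
    # predictor block takes the palette color selected by color_index;
    # another candidate mode (copying a neighbor) is sketched for contrast.
    if mode == "independent_uniform":
        return np.broadcast_to(palette[color_index], shape).copy()
    return neighbor_block

rng = np.random.default_rng(8)
residue = rng.integers(-3, 4, size=(4, 4, 3)).astype(np.int16)  # reconstructed residue
pred = predict_block("independent_uniform", (4, 4, 3), color_index=1)
decoded = np.clip(pred + residue, 0, 255)  # predictor + residue = decoded block
print(decoded[0, 0])  # close to the uniform green palette entry
```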
Abstract:
A method processes a video acquired of a scene by first aligning a group of video images using compressed domain motion information and then solving for a low rank component and a sparse component of the video. A homography map is computed from the motion information to determine image alignment parameters. The video images are then warped using the homography map so that they share a similar camera perspective. A Newton root step is used to traverse separately the Pareto curves of the low rank component and the sparse component. The solving for the low rank component and the sparse component is repeated alternately until a termination condition is reached. Then, the low rank component and the sparse component are outputted. The low rank component represents the background in the video, and the sparse component represents moving objects in the video.
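A sketch of the alternating low rank / sparse separation on frames assumed to be already warped to a common perspective; a GoDec-style fixed-rank, fixed-threshold alternation stands in for the Pareto-curve Newton root steps, and the homography computation is omitted:

```python
import numpy as np

rng = np.random.default_rng(9)

def rank_r(X, r=1):
    # Project onto the best rank-r approximation (truncated SVD).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def soft(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

# Hypothetical aligned video: columns are vectorized frames that already
# share a similar camera perspective.
n_pix, n_frames = 400, 30
background = rng.standard_normal((n_pix, 1)) @ np.ones((1, n_frames))  # rank 1
objects = np.zeros((n_pix, n_frames))
for t in range(n_frames):                     # a small moving "object"
    s0 = (5 * t) % (n_pix - 20)
    objects[s0:s0 + 20, t] = 4.0
video = background + objects

# Alternate the two updates until they stabilize; the low rank component
# captures the background, the sparse component the moving objects.
L = np.zeros_like(video)
S = np.zeros_like(video)
for _ in range(30):
    L = rank_r(video - S, r=1)
    S = soft(video - L, tau=1.0)
print(np.linalg.norm(L - background) / np.linalg.norm(background))
```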