Abstract:
According to an example aspect, there is provided an apparatus comprising at least one processing core and at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processing core, cause the apparatus at least to: obtain, using a first image recognition or depth sensing mechanism, first information defining an input space; obtain, using the first or a second image recognition or depth sensing mechanism, second information defining a virtual output area; and cause rendering of sensor information captured from the input space into the virtual output area.
Abstract:
A method, apparatus and computer program product are provided for extracting semantic information from user-generated media content to create a semantically enriched video remix. An exemplary method comprises extracting media content data and sensor data from a plurality of media content items, wherein the sensor data comprises a plurality of data modalities. The method may also include classifying the extracted media content data and the sensor data. The method may further include detecting predefined objects or events utilizing the sensor data to create the remix video.
Abstract:
Data may be encoded to minimize distortion after decoding, but the quality required for presentation of the decoded data to a machine and the quality required for presentation to a human may be different. To accommodate different quality requirements, video data may be encoded to produce a first set of encoded data and a second set of encoded data, where the first set may be decoded for use by one of a machine consumer or a human consumer, and a combination of the first set and the second set may be decoded for use by the other of a machine consumer or a human consumer. The first and second sets may be produced with a neural encoder and a neural decoder, and/or may be produced with the use of prediction and transform neural network modules. A human-targeted structure and a machine-targeted structure may produce the sets of encoded data.
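The two-set idea above can be sketched numerically. In this hypothetical illustration, coarse quantization stands in for the neural encoder: the first set alone yields a reconstruction adequate for one consumer (say, a machine task), while combining the first and second sets yields the higher-fidelity reconstruction for the other consumer. The quantization steps and frame data are assumptions for the sketch, not the encoding described in the abstract.

```python
import numpy as np

def encode_two_sets(frame, coarse_step=32, fine_step=4):
    """Split a frame into a coarse base set and a residual enhancement set."""
    base = np.round(frame / coarse_step).astype(np.int64)   # first set
    residual = frame - base * coarse_step                   # detail missed by the base
    enh = np.round(residual / fine_step).astype(np.int64)   # second set
    return base, enh

def decode_for_machine(base, coarse_step=32):
    """One consumer decodes the first set only."""
    return base * coarse_step

def decode_for_human(base, enh, coarse_step=32, fine_step=4):
    """The other consumer combines both sets for higher quality."""
    return base * coarse_step + enh * fine_step

frame = np.random.default_rng(0).integers(0, 256, size=(8, 8)).astype(np.float64)
base, enh = encode_two_sets(frame)
machine_err = np.abs(decode_for_machine(base) - frame).mean()
human_err = np.abs(decode_for_human(base, enh) - frame).mean()
```

The combined reconstruction is strictly closer to the original than the base-only reconstruction, mirroring the different quality levels served to the two consumers.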
Abstract:
A method, apparatus, and computer program product are provided for training a neural network or providing a pre-trained neural network with the weight-updates being compressible using at least a weight-update compression loss function and/or task loss function. The weight-update compression loss function can comprise a weight-update vector defined as a latest weight vector minus an initial weight vector before training. A pre-trained neural network can be compressed by pruning one or more small-valued weights. The training of the neural network can consider the compressibility of the neural network, for instance, using a compression loss function, such as a task loss and/or a weight-update compression loss. The compressed neural network can be applied within a decoding loop of an encoder side or in a post-processing stage, as well as at a decoder side.
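The weight-update vector defined above (latest weights minus initial weights) can be sketched in a small training loop. Here an L1 penalty on that vector is one hypothetical choice of weight-update compression loss, combined with a simple mean-squared-error task loss, followed by pruning of small-valued updates; the linear model, data, and hyperparameters are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
true_w = np.zeros(8)
true_w[:2] = 1.0
y = X @ true_w

w0 = rng.normal(scale=0.1, size=8)   # initial weight vector before training
w = w0.copy()
lam, lr = 0.2, 0.01

for _ in range(500):
    err = X @ w - y
    task_grad = X.T @ err / len(X)        # gradient of the MSE task loss
    comp_grad = lam * np.sign(w - w0)     # subgradient of the L1 weight-update loss
    w -= lr * (task_grad + comp_grad)

update = w - w0                                        # weight-update vector
pruned = np.where(np.abs(update) < 1e-2, 0.0, update)  # prune small-valued updates
```

The penalty drives most update entries toward zero, so after pruning the update is sparse and hence more compressible, while the task loss keeps the trained weights useful.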
Abstract:
An apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: encode or decode a high-level bitstream syntax for at least one neural network; wherein the high-level bitstream syntax comprises at least one information unit having metadata or compressed neural network data of a portion of the at least one neural network; and wherein a serialized bitstream comprises one or more of the at least one information unit.
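A serialized bitstream of information units, each carrying metadata or compressed data for a portion of a neural network, can be sketched as follows. The unit layout (a 1-byte type, a 4-byte big-endian length, then the payload) is a hypothetical framing chosen for illustration, not the syntax defined by the source.

```python
import struct

UNIT_METADATA, UNIT_NN_DATA = 0, 1   # illustrative information-unit types

def pack_unit(unit_type: int, payload: bytes) -> bytes:
    """Serialize one information unit: type, payload length, payload."""
    return struct.pack(">BI", unit_type, len(payload)) + payload

def unpack_units(bitstream: bytes):
    """Walk the serialized bitstream and recover its information units."""
    units, off = [], 0
    while off < len(bitstream):
        unit_type, length = struct.unpack_from(">BI", bitstream, off)
        off += 5
        units.append((unit_type, bitstream[off:off + length]))
        off += length
    return units

stream = (pack_unit(UNIT_METADATA, b'{"layers": 2}')
          + pack_unit(UNIT_NN_DATA, b"\x01\x02\x03"))
units = unpack_units(stream)
```

Because each unit is self-describing, a decoder can skip units it does not understand and extract only the portions of the network it needs.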
Abstract:
Optimization of a neural network, for example in a video codec at the decoder side, may be guided to limit overfitting. The encoder may encode a video with different qualities for different frames. Low-quality frames may be used as both input and ground truth during optimization. High-quality frames may be used to optimize the neural network so that higher-quality versions of lower-quality inputs may be predicted. The neural network may be trained to make such predictions by making a prediction based on a constructed low-quality input for which the corresponding high-quality version is known, comparing the prediction to the high-quality version, and fine-tuning the neural network to improve its ability to predict a high-quality version of a low-quality input. To limit overfitting, the neural network may be trained, concurrently or in an alternating fashion, with low-quality input for which a higher-quality version of the low-quality input is known.
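The fine-tuning scheme above can be sketched with a deliberately tiny stand-in for the neural network: a single enhancement gain trained on pairs where the high-quality version of a constructed low-quality frame is known, concurrently with a regularizing term that uses low-quality frames as both input and ground truth. All data, the scalar model, and the loss weighting are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
hq_frames = rng.normal(size=(32, 16))                     # known high-quality frames
lq_from_hq = 0.5 * hq_frames + 0.05 * rng.normal(size=hq_frames.shape)
lq_only = rng.normal(scale=0.5, size=(32, 16))            # frames with no HQ version

w = 1.0      # scalar enhancement gain, standing in for the neural network
lr = 0.5
for _ in range(200):
    # supervised term: predict the known HQ frame from its constructed LQ input
    g_sup = np.mean((w * lq_from_hq - hq_frames) * lq_from_hq)
    # regularizing term: LQ frames used as both input and ground truth
    g_reg = np.mean((w * lq_only - lq_only) * lq_only)
    w -= lr * (g_sup + 0.1 * g_reg)
```

The supervised term alone would push the gain toward undoing the quality loss exactly; the concurrent low-quality term pulls it back toward the identity, limiting how far the model overfits to the few known high-quality examples.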
Abstract:
A method comprising: using a tracked real point of view of a user in a real space and a first mapping between the real space and a virtual space to determine a point of view of a virtual user within the virtual space; causing rendering to the user of at least part of a virtual scene determined by the point of view of the virtual user within the virtual space; and using a selected one of a plurality of different mappings to map tracked user actions in the real space to actions of the virtual user in the virtual space, wherein, when a first mode is selected, the method comprises mapping tracked user actions in the real space, using the first mapping, to spatially-equivalent actions of the virtual user in the virtual space, and wherein, when a second mode is selected, the method comprises mapping tracked user actions in the real space, using a second mapping different from the first mapping, to non-spatially-equivalent actions of the virtual user in the virtual space, wherein the second mapping makes available, within a zone of the virtual space, user interactions that are unavailable using the first mapping.
Abstract:
Apparatuses, methods, and computer programs for compressing a neural network are disclosed. An apparatus includes at least one processor; and at least one non-transitory memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to: receive information from a second device, where the information comprises at least one parameter configured to be used for compression of a neural network, where the at least one parameter is in regard to at least one first aspect or task of the neural network; and compress the neural network, where the neural network is compressed based, at least partially, upon the at least one parameter received from the second device. The apparatus may also receive a compressed neural network from the second device, and further compress the compressed neural network based on the information.
Abstract:
The invention relates to a method comprising receiving a set of input images, said set comprising real images and generated images; extracting a set of feature maps from multiple layers of a pre-trained neural network for both the real images and the generated images; determining statistics for each feature map of the set of feature maps; comparing the statistics of the feature maps for the real images to the statistics of the feature maps for the generated images by using a distance function to obtain a vector of distances; and averaging the distances of the vector of distances to obtain a value indicating a diversity of the generated images. The invention also relates to technical equipment for implementing the method.
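The pipeline above (feature maps from multiple layers, per-map statistics, a distance per map, then an average) can be sketched numerically. A small fixed random network stands in for the pre-trained model, and per-feature mean and standard deviation stand in for the statistics; these choices and the L2 distance are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
layers = [rng.normal(size=(16, 16)) for _ in range(3)]  # stand-in pre-trained layers

def feature_maps(images):
    """Per-layer activations for a batch of flattened images."""
    maps, x = [], images
    for W in layers:
        x = np.tanh(x @ W)      # simple fixed nonlinear layer
        maps.append(x)
    return maps

def diversity(real, generated):
    """Average, over layers, of the distance between feature-map statistics."""
    distances = []
    for fr, fg in zip(feature_maps(real), feature_maps(generated)):
        stats_real = np.concatenate([fr.mean(0), fr.std(0)])
        stats_gen = np.concatenate([fg.mean(0), fg.std(0)])
        distances.append(np.linalg.norm(stats_real - stats_gen))  # distance per map
    return float(np.mean(distances))                              # averaged distance

real = rng.normal(size=(64, 16))
similar = real + 0.01 * rng.normal(size=real.shape)       # near-copies of real data
different = rng.uniform(-2, 2, size=(64, 16))             # a distinct distribution
```

Generated images drawn from a distribution close to the real one yield a small averaged distance, while a clearly different distribution yields a larger value.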
Abstract:
An apparatus for identifying which sound sources are associated with which microphone audio signals, the apparatus including a processor configured to: determine or receive a position or orientation of at least one sound source relative to a microphone array; receive at least one microphone audio signal, each microphone audio signal received from a microphone; receive an audio-focussed audio signal from the microphone array, wherein the audio-focussed audio signal is directed from the microphone array towards one of the at least one sound source so as to enhance the audio-focussed audio signal; compare the audio-focussed audio signal against each microphone audio signal to identify a match between one of the at least one microphone audio signal and the audio-focussed audio signal; and associate the microphone corresponding to the matched microphone audio signal with the at least one sound source, based on the identified match.
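The comparison step can be sketched with normalized cross-correlation as one hypothetical matching criterion: the beamformed (audio-focussed) signal is correlated against each individual microphone signal, and the most strongly correlated microphone is associated with the sound source. The signal construction below is illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(3)
source = rng.normal(size=2000)                        # waveform of the tracked source
mic_signals = [
    rng.normal(size=2000),                            # mic capturing unrelated audio
    0.8 * source + 0.2 * rng.normal(size=2000),       # mic near the tracked source
    rng.normal(size=2000),                            # another unrelated mic
]
focused = source + 0.1 * rng.normal(size=2000)        # array beam toward the source

def best_match(focused_signal, mics):
    """Index of the microphone signal most correlated with the focused signal."""
    scores = [abs(np.corrcoef(focused_signal, m)[0, 1]) for m in mics]
    return int(np.argmax(scores))

match_index = best_match(focused, mic_signals)
```

Because the beam enhances the tracked source, only the microphone physically near that source produces a signal that correlates strongly with the focused signal, which yields the association.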