Abstract:
Embodiments include computing devices, apparatus, and methods implemented by the apparatus for accelerating machine learning on a computing device. Raw data may be received in the computing device from a raw data source device. The apparatus may identify key features as two dimensional matrices of the raw data such that the key features are mutually exclusive from each other. The key features may be translated into key feature vectors. The computing device may generate a feature vector from at least one of the key feature vectors. The computing device may receive a first partial output resulting from an execution of a basic linear algebra subprogram (BLAS) operation using the feature vector and a weight factor. The first partial output may be combined with a plurality of partial outputs to produce an output matrix. Receiving the raw data on the computing device may include receiving streaming raw data.
Abstract:
Embodiments include computing devices, systems, and methods for task-based handling of repetitive processes in parallel. At least one processor of the computing device, or a specialized hardware controller, may be configured to partition iterations of a repetitive process and assign the partitions to initialized tasks to be executed in parallel by a plurality of processor cores. Upon completing a task, remaining divisible partitions of the repetitive process of ongoing tasks may be subpartitioned and assigned to the ongoing task, and the completed task or a newly initialized task. Information about the iteration space for a repetitive process may be stored in a descriptor table, and status information for all partitions of a repetitive process stored in a status table. Each processor core may have an associated local table that tracks iteration execution of each task, and is synchronized with the status table.
Abstract:
Removing invalid literal load values, and related circuits, methods, and computer-readable media are disclosed. In one aspect, an instruction processing circuit provides a literal load table containing one or more entries comprising an address and a cached literal load value. Upon detecting a literal load instruction in an instruction stream, the instruction processing circuit determines whether the literal load table contains an entry having an address of the literal load instruction. If so, the instruction processing circuit removes the literal load instruction from the instruction stream, and provides the cached literal load value stored in the entry to at least one dependent instruction. The instruction processing circuit further determines whether an invalidity indicator for the literal load table has been received. If so, the instruction processing circuit flushes the literal load table. The invalidity indicator may be generated responsive to modification of a constant table.
Abstract:
Methods, and devices implementing the methods, use device-specific classifiers in a privacy-preserving behavioral monitoring and analysis system for crowd-sourcing of device behaviors. Diverse devices having varying degrees of “smart” capabilities may monitor operational behaviors. Gathered operational behavior information may be transmitted to a nearby device having greater processing capabilities than a respective collecting device, or may be transmitted directly to an “always on” device. The behavior information may be used to generate behavior vectors, which may be analyzed for anomalies. Vectors containing anomaly flags may be anonymized to remove any user-identifying information and subsequently transmitted to a remote recipient such as a service provider or device manufacture. In this manner, operational behavior information may be gathered about different devices from a large number of users, to obtain statistical analysis of operational behavior for specific makes and models of devices, without divulging personal information about device users.
Abstract:
The aspects include systems and methods of managing processor device power consumption. A processor may determine a thread execution metric for each of a plurality of threads scheduled for execution in a processor comprising a plurality of processing cores. The processor may allocate to a selected processing core or cores those threads whose thread execution metric satisfies a threshold. The processor may reduce a frequency of the selected processing core or cores to reduce the power consumption.
Abstract:
Systems and methods are provided for allocating memory to dissimilar memory devices. An exemplary embodiment includes a method for allocating memory to dissimilar memory devices. An interleave bandwidth ratio is determined, which comprises a ratio of bandwidths for two or more dissimilar memory devices. The dissimilar memory devices are interleaved according to the interleave bandwidth ratio to define two or more memory zones having different performance levels. Memory address requests are allocated to the memory zones based on a quality of service (QoS).
Abstract:
Aspects include a computing devices, systems, and methods for hardware acceleration for inline caches in dynamic languages. An inline cache may be initialized for an instance of a dynamic software operation. A call of an initialized instance of the dynamic software operation may be executed by an inline cache hardware accelerator. The inline cache may be checked to determine that its data is current. When the data is current, the initialized instance of the dynamic software operation may be executed using the related inline cache data. When the data is not current, a new inline cache may be initialized for the instance of the dynamic software operation, including the not current data of a previously initialized instance of the dynamic software operation. The inline cache hardware accelerator may include an inline cache memory, a coprocessor, and/or a functional until one an inline cache pipeline connected to a processor pipeline.