Abstract:
Various embodiments include methods for data management in a computing device utilizing a plurality of processing units. Embodiment methods may include generating a data transfer heuristic model based on measurements from a plurality of sample data transfers between a plurality of data storage units. The generated data transfer heuristic model may be used to calculate data transfer costs for each of a plurality of tasks. The calculated data transfer costs may be used to schedule execution of the plurality of tasks in an execution order on selected ones of the plurality of processing units. The data transfer heuristic model may be updated based on measurements (e.g., time, power consumption) of data transfers occurring during the executions of the plurality of tasks. Code executing on the processing units may indicate to a runtime when certain data blocks are no longer needed and thus may be evicted and/or pre-fetched for others.
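The cost-driven scheduling described in this abstract might be sketched as follows. This is a minimal illustration, not the claimed method: the linear model (fixed latency plus a per-byte cost), the least-squares fit, and the names `TransferModel`, `transfer_cost`, and `schedule` are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TransferModel:
    """Assumed linear cost model: seconds = latency + sec_per_byte * bytes."""
    samples: list = field(default_factory=list)  # (num_bytes, seconds) pairs
    latency: float = 0.0
    sec_per_byte: float = 0.0

    def observe(self, num_bytes, seconds):
        """Record one measured transfer and refit the model by least squares."""
        self.samples.append((num_bytes, seconds))
        n = len(self.samples)
        mean_x = sum(b for b, _ in self.samples) / n
        mean_y = sum(t for _, t in self.samples) / n
        var_x = sum((b - mean_x) ** 2 for b, _ in self.samples)
        cov = sum((b - mean_x) * (t - mean_y) for b, t in self.samples)
        self.sec_per_byte = cov / var_x if var_x else 0.0
        self.latency = mean_y - self.sec_per_byte * mean_x

    def cost(self, num_bytes):
        """Predicted transfer time in seconds for num_bytes."""
        return self.latency + self.sec_per_byte * num_bytes


def transfer_cost(inputs, unit, models):
    """Sum predicted costs of moving every input not already local to `unit`."""
    return sum(models[(loc, unit)].cost(size)
               for loc, size in inputs if loc != unit)


def schedule(tasks, units, models):
    """Pick the cheapest unit per task, then order tasks by that cost."""
    plan = []
    for name, inputs in tasks.items():
        unit = min(units, key=lambda u: transfer_cost(inputs, u, models))
        plan.append((name, unit, transfer_cost(inputs, unit, models)))
    return sorted(plan, key=lambda entry: entry[2])
```

In this sketch, calling `observe` with measurements taken while tasks run plays the role of the model update; a real runtime would likely track power as well as time.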
Abstract:
Methods for a multi-processor computing device manage resource accesses using a signaling event manager. The signaling event manager signals a processor element requesting access to a resource either to wake up and access the resource when the resource is available, or to wait for an event when the resource is busy. Processor elements may enter a sleep state while awaiting access to the requested resource. When multiple processor elements are waiting for the resource, the processor element with the highest assigned priority is signaled to wake up when the resource becomes available, without waking the other elements. Priorities may be assigned to processor elements waiting for the resource based on a heuristic or parameter that may depend on a state of the computing device or the processor elements. A sleep duration may be estimated for a processor element waiting for a resource, and the processor element may be removed from a scheduling queue or assigned another thread during the sleep duration.
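The wake-one-by-priority behavior described above could be modeled roughly as below. This is a software sketch under stated assumptions: `threading.Event` stands in for the hardware wake-up signal, and the class name `SignalingEventManager` and its `request`/`release` methods are hypothetical names chosen for illustration.

```python
import heapq
import itertools
import threading


class SignalingEventManager:
    """Grants one resource; on release, wakes only the top-priority waiter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._busy = False
        self._waiters = []              # heap of (-priority, seq, id, event)
        self._seq = itertools.count()   # tie-breaker: first come, first served

    def request(self, element_id, priority):
        """Return an Event that is set once the element may use the resource."""
        event = threading.Event()
        with self._lock:
            if not self._busy:
                self._busy = True
                event.set()             # resource free: proceed immediately
            else:
                heapq.heappush(
                    self._waiters,
                    (-priority, next(self._seq), element_id, event))
        return event

    def release(self):
        """Hand the resource to the highest-priority waiter, if any."""
        with self._lock:
            if self._waiters:
                _, _, element_id, event = heapq.heappop(self._waiters)
                event.set()             # wake exactly one element
                return element_id
            self._busy = False
            return None
```

A waiting element would block on `event.wait()`, which corresponds to the sleep state; all other waiters stay asleep because their events are never set on release.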
Abstract:
Embodiments include computing devices, systems, and methods for task-based handling of repetitive processes in parallel. At least one processor of the computing device, or a specialized hardware controller, may be configured to partition iterations of a repetitive process and assign the partitions to initialized tasks to be executed in parallel by a plurality of processor cores. Upon completion of a task, the remaining divisible partitions of the repetitive process held by ongoing tasks may be subpartitioned, with the subpartitions assigned to the ongoing task and to either the completed task or a newly initialized task. Information about the iteration space for a repetitive process may be stored in a descriptor table, and status information for all partitions of the repetitive process may be stored in a status table. Each processor core may have an associated local table that tracks iteration execution of each task and is synchronized with the status table.
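The partition-then-subpartition idea can be illustrated with a small sketch. The dictionary used as a status table, the choice of the largest remaining range as the donor, the halving rule, and the `min_size` threshold are assumptions made for this example only.

```python
def partition(start, end, parts):
    """Split the iteration space [start, end) into `parts` contiguous ranges."""
    size, extra = divmod(end - start, parts)
    ranges, lo = [], start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def subpartition(status, finished, min_size=2):
    """Give a finished task the upper half of the largest remaining range.

    `status` maps task name -> [next_iteration, end] and stands in for the
    status table. Returns the donor task, or None if nothing is divisible.
    """
    donor = max(status, key=lambda t: status[t][1] - status[t][0])
    nxt, end = status[donor]
    if end - nxt < min_size:
        return None
    mid = nxt + (end - nxt + 1) // 2
    status[donor][1] = mid          # ongoing task keeps [nxt, mid)
    status[finished] = [mid, end]   # completed task takes [mid, end)
    return donor
```

For example, `partition(0, 10, 3)` yields three ranges covering all ten iterations, and a task that finishes early can then call `subpartition` to take over part of a slower task's range.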
Abstract:
Aspects include computing devices, systems, and methods for task-based handling of nested repetitive processes in parallel. At least one processor of the computing device may be configured to partition iterations of an outer repetitive process and assign the partitions to initialized tasks to be executed in parallel by a plurality of processor cores. A shadow task may be initialized for each task to execute iterations of an inner repetitive process. Upon completion of a task, divisible partitions of the outer repetitive process held by ongoing tasks may be subpartitioned, with the subpartitions assigned to the ongoing task and to either the completed task and its shadow task or a newly initialized task and shadow task. Upon completing all but one task and one iteration of the outer repetitive process, shadow tasks may be initialized to execute partitions of iterations of the inner repetitive process.
Abstract:
Various embodiments provide methods, devices, and non-transitory processor-readable storage media enabling joint goals, such as joint power and performance goals, to be realized on a per heterogeneous processing device basis for heterogeneous parallel computing constructs. Various embodiments may enable assignments of power states for heterogeneous processing devices on a per heterogeneous processing device basis to satisfy an overall goal on the heterogeneous processing construct. Various embodiments may enable dynamic adjustment of power states for heterogeneous processing devices on a per heterogeneous processing device basis.
Abstract:
Aspects include computing devices, systems, and methods for implementing scheduling and execution of lightweight kernels as simple tasks directly by a thread without setting up a task structure. A computing device may determine whether a task pointer in a task queue is a simple task pointer for the lightweight kernel. The computing device may schedule a first simple task for the lightweight kernel for execution by the thread. The computing device may retrieve, from an entry of a simple task table, a kernel pointer for the lightweight kernel. The entry in the simple task table may be associated with the simple task pointer. The computing device may directly execute the lightweight kernel as the simple task.
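One way to picture the simple task path is the sketch below. The abstract does not say how a simple task pointer is recognized, so the low-bit tag, the `SimpleTaskTable` class, and the dictionary of regular task structures are assumptions introduced for illustration.

```python
from collections import deque

SIMPLE_TAG = 1  # assumption: a set low bit marks a simple task pointer


class SimpleTaskTable:
    """Maps simple task pointers to kernel callables ("kernel pointers")."""

    def __init__(self):
        self._kernels = []

    def add(self, kernel):
        """Store a kernel and return its tagged simple task pointer."""
        self._kernels.append(kernel)
        return ((len(self._kernels) - 1) << 1) | SIMPLE_TAG

    def kernel_for(self, task_pointer):
        """Look up the kernel in the entry associated with the pointer."""
        return self._kernels[task_pointer >> 1]


def is_simple(task_pointer):
    return (task_pointer & SIMPLE_TAG) == SIMPLE_TAG


def run_next(task_queue, table, regular_tasks):
    """Pop one pointer; run simple tasks directly, others via a task structure."""
    ptr = task_queue.popleft()
    if is_simple(ptr):
        return table.kernel_for(ptr)()        # direct call, no task set-up
    task = regular_tasks[ptr]                 # hypothetical full task record
    return task["kernel"](*task["args"])
```

The point of the design is that the simple path touches only the queue and one table entry, which avoids the cost of building a task structure for a very short kernel.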
Abstract:
Embodiments include computing devices, apparatus, and methods implemented by the apparatus for implementing speculative loop iteration partitioning (SLIP) for heterogeneous processing devices. A computing device may receive iteration information for a first partition of iterations of a repetitive process and select a SLIP heuristic based on available SLIP information and iteration information for the first partition. The computing device may determine a split value for the first partition using the SLIP heuristic, and partition the first partition using the split value to produce a plurality of next partitions.
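As a rough illustration of selecting a heuristic and applying a split value, consider two devices sharing one partition. The two heuristics shown (an even split when no history exists, a throughput-proportional split otherwise), the device names `cpu` and `gpu`, and all function names are assumptions, not the heuristics of the embodiments.

```python
def even_split(start, end, history):
    """No usable history: split the partition in the middle."""
    return start + (end - start) // 2


def throughput_split(start, end, history):
    """Split in proportion to measured iterations per second per device."""
    cpu, gpu = history["cpu"], history["gpu"]
    return start + round((end - start) * cpu / (cpu + gpu))


def select_heuristic(history):
    """Choose a heuristic based on which SLIP information is available."""
    if {"cpu", "gpu"} <= history.keys():
        return throughput_split
    return even_split


def slip_partition(start, end, history):
    """Partition [start, end) at the split value into two next partitions."""
    split = select_heuristic(history)(start, end, history)
    return [(start, split), (split, end)]
```

With an empty history the partition is halved; once throughput measurements exist, the faster device receives the proportionally larger next partition.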
Abstract:
Embodiments include computing devices, apparatus, and methods implemented by the apparatus for implementing shared virtual index translation on a computing device. The computing device may receive a base virtual address for storing an output of a kernel function execution to a dedicated memory and determine whether the base virtual address is in a range of virtual addresses for a privatized output buffer within the dedicated memory, which may be smaller than the dedicated memory. The computing device may calculate a first modified physical address using a physical address mapped to the base virtual address and an offset of a first processing device associated with the dedicated memory in response to determining that the base virtual address is in the range of virtual addresses. The computing device may store the output of the kernel function execution to the privatized output buffer at the first modified physical address.
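The address calculation can be sketched in a few lines. A dictionary stands in for the virtual-to-physical mapping, and the assumption that the modified address is the mapped physical address plus a per-device offset is one plausible reading of the abstract; the function and parameter names are hypothetical.

```python
def translate(base_va, page_table, buffer_range, device_offset):
    """Return the physical address at which a device should store its output.

    page_table:    dict mapping virtual address -> physical address (assumed)
    buffer_range:  (low, high) virtual range of the privatized output buffer
    device_offset: per-device byte offset within the dedicated memory (assumed)
    """
    physical = page_table[base_va]
    low, high = buffer_range
    if low <= base_va < high:
        # Inside the privatized buffer: shift by this device's offset.
        return physical + device_offset
    # Outside the range: use the ordinary mapping unchanged.
    return physical
```

Under this reading, two processing devices can write to the same base virtual address and still land in separate regions of the dedicated memory, because each applies its own offset.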