Abstract:
A system and method for protecting memory instructions against faults are described. The system and method include converting the slave instructions to dummy operations, modifying memory arbiter to issue up to N master and N slave global/shared memory instructions per cycle, sending master memory requests to memory system, using slave requests for error checking, entering master requests to the GM/LM FIFO, storing slave requests in a register, and comparing the entered master requests with the stored slave requests.
Abstract:
A system and method for protecting memory instructions against faults are described. The system and method include converting the slave instructions to dummy operations, modifying memory arbiter to issue up to N master and N slave global/shared memory instructions per cycle, sending master memory requests to memory system, using slave requests for error checking, entering master requests to the GM/LM FIFO, storing slave requests in a register, and comparing the entered master requests with the stored slave requests.
Abstract:
A system, method and computer program product to execute a first and a second work-group, and compare the signature variables of the first work-group to the signature variables of the second work-group via a synchronization mechanism. The first and the second work-group are mapped to an identifier via software. This mapping ensures that the first and second work-groups execute exactly the same data for exactly the same code without changes to the underlying hardware. By executing the first and second work-groups independently, the underlying computation of the first and second work-groups can be verified. Moreover, system performance is not substantially affected because the execution results of the first and second work-groups are compared only at specified comparison points.
Abstract:
A system and method for optimizing redundant output verification, are provided. A hardware-based store fingerprint buffer receives multiple instances of output from multiple instances of computation. The store fingerprint buffer generates a signature from the content included in the multiple instances of output. When a barrier is reached, the store fingerprint buffer uses the signature to verify the content is error-free.
Abstract:
A system and method for optimizing redundant output verification, are provided. A hardware-based store fingerprint buffer receives multiple instances of output from multiple instances of computation. The store fingerprint buffer generates a signature from the content included in the multiple instances of output. When a barrier is reached, the store fingerprint buffer uses the signature to verify the content is error-free.
Abstract:
A method and apparatus for mitigating row hammer attacks is provided. A row hammer alert is generated by a component of a memory architecture controlling operation of a memory device. The component may be a memory controller, coherency logic, or data fabric. The component obtains a physical address of an aggressor row that caused the alert and obtains an identifier of an execution context corresponding to the physical address. The component generates an error message for a processing device, the error message including the identifier of the execution context. The processing device retrieves the error message when performing a context switch. The processing device then generates an event received by the operating system. The operating system then takes action to reduce row hammer by the execution context, such as ending, restarting, or throttling the execution context.
Abstract:
Methods and processing devices are provided for error protection to support instruction replay for executing idempotent instructions at a processing in memory PIM device. The processing apparatus includes a PIM device configured to execute an idempotent instruction. The processing apparatus also includes a processor, in communication with the PIM device, configured to issue the idempotent instruction to the PIM device for execution at the PIM device and reissue the idempotent instruction to the PIM device when one of execution of the idempotent instruction at the PIM device results in an error and a predetermined latency period expires from when the idempotent instruction is issued.
Abstract:
Techniques for performing in-memory matrix multiplication, taking into account temperature variations in the memory, are disclosed. In one example, the matrix multiplication memory uses ohmic multiplication and current summing to perform the dot products involved in matrix multiplication. One downside to this analog form of multiplication is that temperature affects the accuracy of the results. Thus techniques are provided herein to compensate for the effects of temperature increases on the accuracy of in-memory matrix multiplications. According to the techniques, portions of input matrices are classified as effective or ineffective. Effective portions are mapped to low temperature regions of the in-memory matrix multiplier and ineffective portions are mapped to high temperature regions of the in-memory matrix multiplier. The matrix multiplication is then performed.
Abstract:
Providing host-based error detection capabilities in a remote execution device is disclosed. A remote execution device performs a host-offloaded operation that modifies a block of data stored in memory. Metadata is generated locally for the modified of block of data such that the local metadata generation emulates host-based metadata generation. Stored metadata for the block of data is updated with the locally generated metadata for the modified portion of the block of data. When the host performs an integrity check on the modified block of data using the updated metadata, the host does not distinguish between metadata generated by the host and metadata generated in the remote execution device.
Abstract:
A method and apparatus for predicting and managing a fault in memory includes detecting an error in data. The error is compared to one or more stored errors in a filter, and based upon the comparison, the error is predicted as a transient error or a permanent error for further action.