Dynamic voltage and frequency scaling based on memory channel slack

    Publication No.: US11119665B2

    Publication Date: 2021-09-14

    Application No.: US16212388

    Application Date: 2018-12-06

    Abstract: A processing system scales power to memory and memory channels based on identifying the causes of stalls of threads of a wavefront. If the cause is something other than an outstanding memory request, the processing system throttles power to the memory to save power. If the stall is due to memory stalls on a subset of the memory channels servicing memory access requests for threads of a wavefront, the processing system adjusts the power of the memory channels servicing memory access requests for the wavefront based on that subset. By boosting power to the subset of channels, the processing system enables the wavefront to complete processing more quickly, resulting in increased processing speed. Conversely, by throttling power to the remaining channels, the processing system saves power without affecting processing speed.
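
    To make the power decision concrete, the following is a minimal Python sketch of the per-channel adjustment the abstract describes; the class and function names (Channel, StallCause, adjust_channel_power) and the boost/throttle values are illustrative assumptions, not terminology from the patent.

```python
from dataclasses import dataclass
from enum import Enum, auto

class StallCause(Enum):
    OUTSTANDING_MEMORY_REQUEST = auto()
    OTHER = auto()

@dataclass
class Channel:
    channel_id: int
    power_level: float  # normalized power/frequency setting, 0.0 .. 1.0

def adjust_channel_power(channels, stall_cause, stalled_channel_ids,
                         boost=0.2, throttle=0.2):
    """Boost the channels blocking the wavefront and throttle the rest;
    throttle everything if the stall is not due to outstanding memory requests."""
    for ch in channels:
        if stall_cause is not StallCause.OUTSTANDING_MEMORY_REQUEST:
            # Stall is unrelated to memory: save power across all channels.
            ch.power_level = max(0.0, ch.power_level - throttle)
        elif ch.channel_id in stalled_channel_ids:
            # This channel services the outstanding requests: speed it up.
            ch.power_level = min(1.0, ch.power_level + boost)
        else:
            # Remaining channels have slack: throttle without hurting latency.
            ch.power_level = max(0.0, ch.power_level - throttle)
    return channels

channels = [Channel(i, 0.5) for i in range(4)]
print(adjust_channel_power(channels, StallCause.OUTSTANDING_MEMORY_REQUEST, {1, 2}))
```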

    Method and apparatus to expedite system services using processing-in-memory (PIM)

    Publication No.: US12197378B2

    Publication Date: 2025-01-14

    Application No.: US17804949

    Application Date: 2022-06-01

    Abstract: An apparatus configured for offloading system service tasks to a processing-in-memory (“PIM”) device includes an agent configured to: receive, from a host processor, a request to offload a memory task associated with a system service to the PIM device; determine at least one PIM command and at least one memory page associated with the host processor based upon the request; and issue the at least one PIM command to the PIM device for execution by the PIM device to perform the memory task upon the at least one memory page.
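
    Below is a hedged Python sketch of the offload flow the abstract describes: an agent maps a host request onto PIM commands and the memory pages they touch, then issues the commands. The OffloadAgent and PimDevice names, the command table, and the page-size arithmetic are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class OffloadRequest:
    service: str           # e.g. a system service such as page zeroing
    virtual_address: int   # start of the host memory region involved
    length: int            # size of the memory task in bytes

class PimDevice:
    def execute(self, command, page_addr):
        print(f"PIM executes {command} on page 0x{page_addr:x}")

class OffloadAgent:
    PAGE_SIZE = 4096

    def __init__(self, pim_device, command_table):
        self.pim = pim_device
        self.command_table = command_table  # service name -> PIM command

    def handle(self, request: OffloadRequest):
        # Determine the PIM command and the host pages the task touches.
        command = self.command_table[request.service]
        first = request.virtual_address // self.PAGE_SIZE
        last = (request.virtual_address + request.length - 1) // self.PAGE_SIZE
        # Issue one PIM command per affected page for execution by the device.
        for page in range(first, last + 1):
            self.pim.execute(command, page * self.PAGE_SIZE)

agent = OffloadAgent(PimDevice(), {"page_zeroing": "PIM_ZERO"})
agent.handle(OffloadRequest("page_zeroing", 0x1000, 8192))
```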

    METHOD AND APPARATUS FOR MANAGING A CACHE DIRECTORY

    Publication No.: US20220206946A1

    Publication Date: 2022-06-30

    Application No.: US17135657

    Application Date: 2020-12-28

    Abstract: Method and apparatus monitor eviction conflicts among cache directory entries in a cache directory and produce cache directory victim entry information for a memory manager. In some examples, the memory manager reduces future cache directory conflicts by changing a page level physical address assignment for a page of memory based on the produced cache directory victim entry information. In some examples, a scalable data fabric includes hardware control logic that performs the monitoring of the eviction conflicts among cache directory entries in the cache directory and produces the cache directory victim entry information.
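
    The following is a small Python sketch, under assumed names (CacheDirectoryMonitor, remap_page), of how eviction-conflict counts per directory set could be gathered and then used to steer a page's physical-address assignment away from conflict-heavy sets; it illustrates the idea, not the patented logic.

```python
from collections import Counter

class CacheDirectoryMonitor:
    """Stand-in for the hardware control logic: counts eviction conflicts per set."""
    def __init__(self, num_sets, conflict_threshold=8):
        self.num_sets = num_sets
        self.conflict_threshold = conflict_threshold
        self.victims = Counter()  # directory set -> eviction-conflict count

    def record_eviction(self, victim_page_addr):
        self.victims[victim_page_addr % self.num_sets] += 1

    def hot_sets(self):
        """Victim-entry information produced for the memory manager."""
        return {s for s, n in self.victims.items() if n >= self.conflict_threshold}

def remap_page(page_addr, hot_sets, num_sets):
    """Memory-manager stand-in: choose a new page-level physical address whose
    directory set avoids the conflict-heavy sets reported by the monitor."""
    candidate = page_addr
    while candidate % num_sets in hot_sets:
        candidate += 1
    return candidate

monitor = CacheDirectoryMonitor(num_sets=4, conflict_threshold=2)
for _ in range(3):
    monitor.record_eviction(victim_page_addr=8)       # set 0 keeps conflicting
print(remap_page(12, monitor.hot_sets(), num_sets=4))  # 12 maps to set 0 -> moved to 13
```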

    Processing Element-Centric All-to-All Communication

    Publication No.: US20240220336A1

    Publication Date: 2024-07-04

    Application No.: US18147081

    Application Date: 2022-12-28

    CPC classification number: G06F9/54 G06F9/5044 G06F15/17356

    Abstract: In accordance with described techniques for PE-centric all-to-all communication, a distributed computing system includes processing elements, such as graphics processing units, distributed in clusters. An all-to-all communication procedure is performed by the processing elements that are each configured to generate data packets in parallel for all-to-all data communication between the clusters. The all-to-all communication procedure includes a first stage of intra-cluster parallel data communication between respective processing elements of each of the clusters; a second stage of inter-cluster data exchange for all-to-all data communication between the clusters; and a third stage of intra-cluster data distribution to the respective processing elements of each of the clusters.
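
    The three stages can be illustrated with a toy Python sketch operating on nested lists; the data layout (data[cluster][pe][destination_cluster]) and the simplification that every PE receives the full set delivered to its cluster are assumptions made for brevity.

```python
def pe_centric_all_to_all(data):
    num_clusters = len(data)
    # Stage 1: intra-cluster parallel communication -- each cluster combines
    # what its PEs want to send to every destination cluster.
    staged = [[sum((pe[d] for pe in cluster), []) for d in range(num_clusters)]
              for cluster in data]
    # Stage 2: inter-cluster data exchange -- cluster dst receives the block
    # every other cluster prepared for it.
    exchanged = [[staged[src][dst] for src in range(num_clusters)]
                 for dst in range(num_clusters)]
    # Stage 3: intra-cluster data distribution -- each cluster hands the
    # received blocks back to its PEs (simplified: every PE gets the full set).
    return [[list(received) for _ in range(len(data[dst]))]
            for dst, received in enumerate(exchanged)]

# Two clusters with two PEs each; each PE holds one payload per destination cluster.
data = [[[["a0"], ["a1"]], [["b0"], ["b1"]]],
        [[["c0"], ["c1"]], [["d0"], ["d1"]]]]
print(pe_centric_all_to_all(data))
```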

    Memory request priority assignment techniques for parallel processors

    Publication No.: US11507522B2

    Publication Date: 2022-11-22

    Application No.: US16706421

    Application Date: 2019-12-06

    Abstract: Systems, apparatuses, and methods for implementing memory request priority assignment techniques for parallel processors are disclosed. A system includes at least a parallel processor coupled to a memory subsystem, where the parallel processor includes at least a plurality of compute units for executing wavefronts in lock-step. The parallel processor assigns priorities to memory requests of wavefronts on a per-work-item basis by indexing into a first priority vector, with the index generated based on lane-specific information. If a given event is detected, a second priority vector is generated by applying a given priority promotion vector to the first priority vector. Then, for subsequent wavefronts, memory requests are assigned priorities by indexing into the second priority vector with lane-specific information. The use of priority vectors to assign priorities to memory requests helps to reduce the memory divergence problem experienced by different work-items of a wavefront.
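
    A minimal Python sketch of the priority-vector idea follows: lane-specific information indexes a first priority vector, and a promotion vector yields a second vector once a given event is detected. The vector contents, the modulo indexing, and the priority ceiling are illustrative assumptions.

```python
def assign_priority(priority_vector, lane_info):
    # Lane-specific information selects a per-work-item priority.
    return priority_vector[lane_info % len(priority_vector)]

def promote(priority_vector, promotion_vector, max_priority=7):
    # Second priority vector: element-wise promotion, clamped to the maximum.
    return [min(p + boost, max_priority)
            for p, boost in zip(priority_vector, promotion_vector)]

first_vector = [0, 1, 2, 3]   # baseline priorities indexed by lane information
promotion = [2, 2, 1, 0]      # applied when the given event is detected

wavefront_lane_ids = range(8)
print([assign_priority(first_vector, lane) for lane in wavefront_lane_ids])

# After the event, subsequent wavefronts index the promoted vector instead.
second_vector = promote(first_vector, promotion)
print([assign_priority(second_vector, lane) for lane in wavefront_lane_ids])
```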

    Multi-Tree Reduction with Execution Skew
    Invention Publication

    Publication No.: US20240311182A1

    Publication Date: 2024-09-19

    Application No.: US18185641

    Application Date: 2023-03-17

    CPC classification number: G06F9/4881

    Abstract: A device includes a communication scheduler to generate schedule trees for scheduling data communication among a plurality of nodes configured to perform a collective operation using data contributed from the plurality of nodes. The device includes data reduction logic to: identify one or more skewed nodes among the plurality of nodes, perform, according to a first set of schedule trees, a first operation to generate partial results based on data contributed from non-skewed nodes, and perform, according to a second set of schedule trees, a second operation to generate final results based on the partial results and data contributed from the one or more skewed nodes.
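
    A simple Python sketch of the two-phase split follows; it elides the schedule trees entirely and only shows partial results from non-skewed nodes being combined later with the skewed nodes' data. Function and node names are illustrative.

```python
def multi_tree_reduce(contributions, skewed_nodes, reduce_op=sum):
    non_skewed = {n: v for n, v in contributions.items() if n not in skewed_nodes}
    skewed = {n: v for n, v in contributions.items() if n in skewed_nodes}

    # First operation (first set of schedule trees): partial results from the
    # nodes whose data is ready on time.
    partial = reduce_op(non_skewed.values())

    # Second operation (second set of schedule trees): final results from the
    # partial results plus the data contributed by the skewed nodes.
    final = reduce_op([partial, *skewed.values()])
    return partial, final

contributions = {"node0": 3, "node1": 5, "node2": 7, "node3": 11}
print(multi_tree_reduce(contributions, skewed_nodes={"node3"}))  # (15, 26)
```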

    Data compression and decompression for processing in memory

    Publication No.: US12050531B2

    Publication Date: 2024-07-30

    Application No.: US17952697

    Application Date: 2022-09-26

    CPC classification number: G06F12/0292 G06F2212/1024 G06F2212/401

    Abstract: In accordance with the described techniques for data compression and decompression for processing in memory, a page address is received by a processing in memory component that maps to a first location in memory where data of a page is maintained. The data of the page is compressed by the processing in memory component. Further, compressed data of the page is written by the processing in memory component to a compressed block device responsive to the compressed data satisfying one or more compressibility criteria. The compressed block device is a portion of the memory dedicated to storing data in a compressed form.
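
    The flow can be sketched in Python as below, with zlib standing in for whatever compression the PIM hardware actually performs and a dictionary standing in for the compressed block device; the names and the size-based compressibility criterion are assumptions.

```python
import zlib

PAGE_SIZE = 4096

class CompressedBlockDevice:
    """Portion of memory dedicated to storing data in compressed form."""
    def __init__(self):
        self.blocks = {}

    def write(self, page_addr, payload):
        self.blocks[page_addr] = payload

class PimCompressor:
    def __init__(self, memory, block_device, max_ratio=0.5):
        self.memory = memory            # page address -> raw page bytes
        self.block_device = block_device
        self.max_ratio = max_ratio      # compressibility criterion

    def compress_page(self, page_addr):
        raw = self.memory[page_addr]
        compressed = zlib.compress(raw)
        if len(compressed) <= self.max_ratio * PAGE_SIZE:
            # Criterion satisfied: commit the compressed copy.
            self.block_device.write(page_addr, compressed)
            return True
        return False                    # keep the page uncompressed

memory = {0x1000: bytes(PAGE_SIZE)}     # a highly compressible all-zero page
pim = PimCompressor(memory, CompressedBlockDevice())
print(pim.compress_page(0x1000))        # True: the zero page compresses well
```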

    DISTRIBUTED CACHING POLICY FOR LARGE-SCALE DEEP LEARNING TRAINING DATA PRE-PROCESSING

    Publication No.: US20240211399A1

    Publication Date: 2024-06-27

    Application No.: US18089480

    Application Date: 2022-12-27

    CPC classification number: G06F12/0813 G06N20/00

    Abstract: A distributed cache network used for machine learning is provided which comprises a network fabric having file systems that store data and a plurality of processing devices, each comprising cache memory and a processor configured to execute training of a machine learning model and to selectively cache portions of the data based on the frequency with which the data is accessed by the processor. Each processing device stores metadata identifying the portions of data that are cached in its cache memory and the portions that are cached in other processing devices of the network. When requested data is not cached in another processing device, it is accessed from a network file system via a client-to-server channel; when the requested data is cached in another processing device, it is accessed from that device via a client-to-client channel.
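
    A hedged Python sketch of the lookup path follows: per-device metadata decides whether a portion is served from the local cache, from a peer over a client-to-client channel, or from the network file system over a client-to-server channel, with frequently accessed portions cached locally. The class names, the access threshold, and the dictionary-based transports are illustrative assumptions.

```python
from collections import Counter

class CachingClient:
    def __init__(self, device_id, peers, filesystem, hot_threshold=3):
        self.device_id = device_id
        self.peers = peers              # device_id -> CachingClient (client-to-client path)
        self.filesystem = filesystem    # key -> data (client-to-server path)
        self.hot_threshold = hot_threshold
        self.cache = {}                 # locally cached data portions
        self.access_counts = Counter()

    def location_of(self, key):
        """Metadata lookup: which device, if any, has this portion cached."""
        if key in self.cache:
            return self.device_id
        for peer_id, peer in self.peers.items():
            if key in peer.cache:
                return peer_id
        return None

    def get(self, key):
        self.access_counts[key] += 1
        owner = self.location_of(key)
        if owner == self.device_id:
            return self.cache[key]
        if owner is not None:
            # Cached on a peer: fetch over the client-to-client channel.
            data = self.peers[owner].cache[key]
        else:
            # Not cached anywhere: fetch from the network file system
            # over the client-to-server channel.
            data = self.filesystem[key]
        if self.access_counts[key] >= self.hot_threshold:
            self.cache[key] = data      # selectively cache frequently accessed data
        return data

fs = {"shard-0": b"raw-training-records"}
a = CachingClient("devA", {}, fs)
b = CachingClient("devB", {"devA": a}, fs)
a.cache["shard-0"] = fs["shard-0"]       # devA already cached this portion
print(b.get("shard-0"))                  # served from devA via the client-to-client channel
```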
