SYSTEM AND METHOD FOR PERFORMING SHAPED MEMORY ACCESS OPERATIONS
    11.
    Patent application (under examination, published)

    Publication No.: US20130145124A1

    Publication Date: 2013-06-06

    Application No.: US13312954

    Filing Date: 2011-12-06

    IPC Class: G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique that provides an efficient way to retrieve operands from a register file. Specifically, the instruction dispatch unit receives one or more instructions, each of which includes one or more operands. Collectively, the operands are organized into one or more operand groups from which a shaped access may be formed. The operands are retrieved from the register file and stored in a collector. Once all operands are read and collected in the collector, the instruction dispatch unit transmits the instructions and corresponding operands to functional units within the streaming multiprocessor for execution. One advantage of the present invention is that multiple operands are retrieved from the register file in a single register access operation without resource conflict. Performance in retrieving operands from the register file is improved by forming shaped accesses that efficiently retrieve operands exhibiting recognized memory access patterns.

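The grouping idea in the abstract — reading multiple operands in one register access when they do not conflict on register-file resources — can be illustrated with a minimal Python sketch. This is not the patented implementation: the bank count, the register-to-bank mapping, and the greedy grouping policy are all assumptions made for illustration.

```python
# Hypothetical sketch of forming "shaped" register accesses: operand
# reads whose registers fall in distinct banks are grouped so each group
# can be retrieved in a single register access without bank conflicts.
NUM_BANKS = 4  # assumed bank count, not taken from the patent

def bank_of(reg):
    """Map a register index to its register-file bank (assumed modulo map)."""
    return reg % NUM_BANKS

def form_shaped_accesses(operand_regs):
    """Partition operand reads into groups with no bank conflicts.

    Each returned group touches every bank at most once, so the whole
    group can be serviced by one register access operation.
    """
    groups = []
    for reg in operand_regs:
        # Place the read in the first group whose banks it does not
        # conflict with; otherwise start a new group (a later access).
        for group in groups:
            if bank_of(reg) not in {bank_of(r) for r in group}:
                group.append(reg)
                break
        else:
            groups.append([reg])
    return groups

# Registers 0..3 map to distinct banks -> a single shaped access.
print(form_shaped_accesses([0, 1, 2, 3]))   # [[0, 1, 2, 3]]
# Registers 0 and 4 share bank 0 -> the reads split into two accesses.
print(form_shaped_accesses([0, 4, 1, 2]))   # [[0, 1, 2], [4]]
```

The greedy first-fit grouping is just one possible policy; the abstract only requires that each formed access be conflict-free.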

    Throttling instruction issue rate based on updated moving average to avoid surges in DI/DT
    12.
    Granted patent (in force)

    Publication No.: US09430242B2

    Publication Date: 2016-08-30

    Application No.: US13437765

    Filing Date: 2012-04-02

    Abstract: Systems and methods for throttling GPU execution performance to avoid surges in DI/DT. A processor includes one or more execution units coupled to a scheduling unit configured to select instructions for execution by the one or more execution units. The execution units may be connected to one or more decoupling capacitors that store power for the circuits of the execution units. The scheduling unit is configured to throttle the instruction issue rate of the execution units based on a moving average issue rate over a large number of scheduling periods. The number of instructions issued during the current scheduling period is less than or equal to a throttling rate maintained by the scheduling unit that is greater than or equal to a minimum throttling issue rate. The throttling rate is set equal to the moving average plus an offset value at the end of each scheduling period.

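The update rule in the abstract — cap next period's issue count at the moving average plus an offset, never below a minimum — can be sketched in a few lines of Python. The smoothing factor, offset, and minimum below are illustrative constants, not values from the patent.

```python
# Hypothetical sketch of the throttling rule: at the end of each
# scheduling period, the issue cap for the next period is the moving
# average issue rate plus an offset, clamped to a minimum rate.
ALPHA = 0.125        # moving-average smoothing factor (assumed)
OFFSET = 2.0         # headroom above the average (assumed)
MIN_THROTTLE = 1.0   # minimum throttling issue rate (assumed)

def next_throttle_rate(moving_avg, issued_this_period):
    """Update the moving average and derive next period's issue cap."""
    moving_avg = (1 - ALPHA) * moving_avg + ALPHA * issued_this_period
    throttle = max(moving_avg + OFFSET, MIN_THROTTLE)
    return moving_avg, throttle

avg, cap = 4.0, 4.0 + OFFSET
for demand in [4, 4, 16, 16]:        # a sudden jump in issue demand
    issued = min(demand, cap)        # issue count is bounded by the cap
    avg, cap = next_throttle_rate(avg, issued)
    print(f"issued={issued:.2f} next_cap={cap:.2f}")
```

Because the cap tracks the slowly moving average, a sudden burst of demand ramps up the issue rate over many periods instead of all at once, which is exactly the di/dt surge the abstract is avoiding.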

    Multi-level instruction cache prefetching
    13.
    Granted patent (in force)

    Publication No.: US09110810B2

    Publication Date: 2015-08-18

    Application No.: US13312962

    Filing Date: 2011-12-06

    Abstract: One embodiment of the present invention sets forth an improved way to prefetch instructions in a multi-level cache. The fetch unit initiates a prefetch operation to transfer one of a set of multiple cache lines, based on a function of a pseudorandom number generator and the sector corresponding to the current instruction L1 cache line. The fetch unit selects a prefetch target from the set of multiple cache lines according to some probability function. If the current instruction L1 cache line is located within the first sector of the corresponding L1.5 cache line, then the selected prefetch target is located at a sector within the next L1.5 cache line. The result is that the instruction L1 cache hit rate is improved and instruction fetch latency is reduced, even where the processor consumes instructions in the instruction L1 cache at a fast rate.

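The sector-based selection the abstract describes can be sketched as follows. Everything quantitative here is an assumption for illustration — the L1 line size, the number of L1 sectors per L1.5 line, and the policy for non-first sectors (the abstract only specifies the first-sector case) are not taken from the patent.

```python
import random

# Illustrative sketch of sector-based prefetch target selection: each
# L1.5 cache line is divided into SECTORS sectors of one instruction
# L1 line each; the target sector is chosen pseudorandomly.
L1_LINE = 64                   # bytes per instruction L1 line (assumed)
SECTORS = 4                    # L1 sectors per L1.5 line (assumed)
L15_LINE = L1_LINE * SECTORS   # bytes per L1.5 line

def prefetch_target(addr, rng):
    """Pick the address of the sector to prefetch next.

    If the current L1 line sits in the first sector of its L1.5 line,
    prefetch a pseudorandomly chosen sector of the *next* L1.5 line
    (the case the abstract specifies); otherwise prefetch a later
    sector of the current L1.5 line (an assumed fallback).
    """
    l15_base = addr - (addr % L15_LINE)
    sector = (addr % L15_LINE) // L1_LINE
    if sector == 0:
        target_base = l15_base + L15_LINE          # next L1.5 line
        target_sector = rng.randrange(SECTORS)     # pseudorandom sector
    else:
        target_base = l15_base                     # same L1.5 line
        target_sector = rng.randrange(sector, SECTORS)
    return target_base + target_sector * L1_LINE

rng = random.Random(0)
print(hex(prefetch_target(0, rng)))    # lands somewhere in the next L1.5 line
```

Randomizing the target sector spreads prefetches across the candidate lines according to a probability function, matching the abstract's "selects a prefetch target ... according to some probability function".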

    Methods and apparatus for source operand collector caching
    14.
    Granted patent (in force)

    Publication No.: US08639882B2

    Publication Date: 2014-01-28

    Application No.: US13326183

    Filing Date: 2011-12-14

    IPC Class: G06F12/00

    Abstract: Methods and apparatus for source operand collector caching. In one embodiment, a processor includes a register file that may be coupled to storage elements (i.e., an operand collector) that provide inputs to the datapath of the processor core for executing an instruction. In order to reduce bandwidth between the register file and the operand collector, operands may be cached and reused in subsequent instructions. A scheduling unit maintains a cache table for monitoring which register values are currently stored in the operand collector. The scheduling unit may also configure the operand collector to select the particular storage elements that are coupled to the inputs to the datapath for a given instruction.

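The cache-table idea in the abstract — track which register values already sit in the operand collector and skip the register-file read on reuse — can be sketched minimally. The slot count and the round-robin replacement policy below are assumptions, not details from the patent.

```python
# Hypothetical sketch of the scheduling unit's cache table: it records
# which register currently occupies each operand-collector storage
# element, so a reused source operand becomes a hit that avoids a
# register-file read and saves register-file bandwidth.
class OperandCollectorCache:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # register held in each storage element
        self.next_victim = 0              # round-robin replacement (assumed)

    def fetch(self, reg):
        """Return ('hit'|'miss', slot) for a source operand read."""
        if reg in self.slots:
            return "hit", self.slots.index(reg)   # reuse the cached value
        slot = self.next_victim                   # read from the register file
        self.slots[slot] = reg
        self.next_victim = (slot + 1) % len(self.slots)
        return "miss", slot

cache = OperandCollectorCache(num_slots=4)
# Back-to-back instructions reusing r1: the second read of r1 hits.
accesses = [cache.fetch(r)[0] for r in [1, 2, 1, 3]]
print(accesses)   # ['miss', 'miss', 'hit', 'miss']
```

The hit on the repeated register is the bandwidth saving: only three of the four operand reads actually touch the register file.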

    MULTI-LEVEL INSTRUCTION CACHE PREFETCHING
    15.
    Patent application (granted, in force)

    Publication No.: US20130145102A1

    Publication Date: 2013-06-06

    Application No.: US13312962

    Filing Date: 2011-12-06

    IPC Class: G06F12/02

    Abstract: One embodiment of the present invention sets forth an improved way to prefetch instructions in a multi-level cache. The fetch unit initiates a prefetch operation to transfer one of a set of multiple cache lines, based on a function of a pseudorandom number generator and the sector corresponding to the current instruction L1 cache line. The fetch unit selects a prefetch target from the set of multiple cache lines according to some probability function. If the current instruction L1 cache line is located within the first sector of the corresponding L1.5 cache line, then the selected prefetch target is located at a sector within the next L1.5 cache line. The result is that the instruction L1 cache hit rate is improved and instruction fetch latency is reduced, even where the processor consumes instructions in the instruction L1 cache at a fast rate.


    Thread group scheduler for computing on a parallel thread processor
    18.
    Granted patent (in force)

    Publication No.: US08732713B2

    Publication Date: 2014-05-20

    Application No.: US13247819

    Filing Date: 2011-09-28

    IPC Class: G06F9/46

    CPC Class: G06F9/4881 G06F2209/483

    Abstract: A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.

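The three-step selection (i)–(iii) in the abstract maps directly onto a short Python sketch. The field names, data layout, and tie-breaking behavior of `max` are illustrative assumptions, not details from the patent.

```python
# Sketch of the abstract's selection: (i) gather the pool of available
# thread groups, (ii) find the CTA with the greatest seniority value,
# (iii) pick the greatest-credit thread group within that CTA.
from dataclasses import dataclass

@dataclass
class ThreadGroup:
    cta: int        # cooperative thread array this group belongs to
    credit: int     # per-group credit value
    ready: bool     # eligible to issue this cycle

def select_thread_group(groups, cta_seniority):
    """Select the next thread group to issue, or None if none is ready."""
    pool = [g for g in groups if g.ready]                 # (i) available pool
    if not pool:
        return None
    oldest_cta = max({g.cta for g in pool},
                     key=lambda c: cta_seniority[c])      # (ii) most senior CTA
    return max((g for g in pool if g.cta == oldest_cta),
               key=lambda g: g.credit)                    # (iii) greatest credit

groups = [ThreadGroup(cta=0, credit=5, ready=True),
          ThreadGroup(cta=1, credit=9, ready=True),
          ThreadGroup(cta=0, credit=7, ready=True)]
seniority = {0: 10, 1: 3}   # CTA 0 is the most senior
print(select_thread_group(groups, seniority))
# -> ThreadGroup(cta=0, credit=7, ready=True)
```

Note the CTA filter comes first: the credit-9 group in CTA 1 loses to the credit-7 group in the more senior CTA 0, because credit only breaks ties *within* the chosen CTA.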

    METHODS AND APPARATUS TO AVOID SURGES IN DI/DT BY THROTTLING GPU EXECUTION PERFORMANCE
    19.
    Patent application (granted, in force)

    Publication No.: US20130262831A1

    Publication Date: 2013-10-03

    Application No.: US13437765

    Filing Date: 2012-04-02

    IPC Class: G06F9/30

    Abstract: Systems and methods for throttling GPU execution performance to avoid surges in DI/DT. A processor includes one or more execution units coupled to a scheduling unit configured to select instructions for execution by the one or more execution units. The execution units may be connected to one or more decoupling capacitors that store power for the circuits of the execution units. The scheduling unit is configured to throttle the instruction issue rate of the execution units based on a moving average issue rate over a large number of scheduling periods. The number of instructions issued during the current scheduling period is less than or equal to a throttling rate maintained by the scheduling unit that is greater than or equal to a minimum throttling issue rate. The throttling rate is set equal to the moving average plus an offset value at the end of each scheduling period.


    METHODS AND APPARATUS FOR SCHEDULING INSTRUCTIONS USING PRE-DECODE DATA
    20.
    Patent application (granted, in force)

    Publication No.: US20130166881A1

    Publication Date: 2013-06-27

    Application No.: US13333879

    Filing Date: 2011-12-21

    IPC Class: G06F9/30 G06F9/312

    Abstract: Systems and methods for scheduling instructions using pre-decode data corresponding to each instruction. In one embodiment, a multi-core processor includes a scheduling unit in each core for selecting instructions from two or more threads each scheduling cycle for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The pre-decode data is determined by a compiler and is extracted by the scheduling unit during runtime and used to control selection of threads for execution. The pre-decode data may specify a number of scheduling cycles to wait before scheduling the instruction. The pre-decode data may also specify a scheduling priority for the instruction. Once the scheduling unit selects an instruction to issue for execution, a decode unit fully decodes the instruction.

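The two kinds of pre-decode data the abstract names — a wait count and a scheduling priority carried by each still-undecoded instruction — can be illustrated with a small scheduler sketch. The encoding and the cycle model below are assumptions for illustration, not the patent's.

```python
# Illustrative sketch: each fetched (undecoded) instruction carries
# compiler-supplied pre-decode data — a wait count in scheduling cycles
# and a priority — that the scheduler reads without a full decode.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class PreDecoded:
    neg_priority: int                        # heap key: higher priority pops first
    wait_cycles: int = field(compare=False)  # cycles to stall before eligible
    name: str = field(compare=False)

def schedule(instructions):
    """Issue instructions honoring pre-decode wait counts and priority."""
    issued, cycle = [], 0
    ready = []                               # max-priority heap of eligible instructions
    pending = sorted(instructions, key=lambda i: i.wait_cycles)
    while pending or ready:
        # Instructions become eligible once their wait count elapses.
        while pending and pending[0].wait_cycles <= cycle:
            heapq.heappush(ready, pending.pop(0))
        if ready:
            # Only the selected instruction would be fully decoded here.
            issued.append(heapq.heappop(ready).name)
        cycle += 1
    return issued

insts = [PreDecoded(neg_priority=-1, wait_cycles=0, name="load"),
         PreDecoded(neg_priority=-3, wait_cycles=0, name="fma"),
         PreDecoded(neg_priority=-2, wait_cycles=2, name="store")]
print(schedule(insts))   # ['fma', 'load', 'store']
```

The point of the sketch is the abstract's division of labor: selection uses only the cheap pre-decode fields, and the expensive full decode happens only for the instruction actually chosen to issue.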