Dynamic control of SIMDs
    1.
    Granted patent
    Dynamic control of SIMDs (in force)

    Publication No.: US09311102B2

    Publication date: 2016-04-12

    Application No.: US13180721

    Filing date: 2011-07-12

    Abstract: Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium-grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock-on-demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.
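
The abstract distinguishes two responses to idleness: clock gating while a SIMD has no work, and power gating once it has been idle for an extended period. A minimal sketch of such a policy follows. The state names, the `next_state` function, and the `power_gate_threshold` value are illustrative assumptions, not details taken from the patent.

```python
from enum import Enum


class SimdState(Enum):
    ACTIVE = "active"            # clocks running, work assigned
    CLOCK_GATED = "clock_gated"  # clock tree shut off, saves switching power
    POWER_GATED = "power_gated"  # supply cut, also saves leakage power


def next_state(has_work: bool, idle_cycles: int,
               power_gate_threshold: int = 1000) -> SimdState:
    """Pick a power state for one SIMD (hypothetical policy).

    power_gate_threshold is an assumed tuning parameter: the number of
    idle cycles after which the SIMD counts as idle for an extended period.
    """
    if has_work:
        return SimdState.ACTIVE
    if idle_cycles >= power_gate_threshold:
        return SimdState.POWER_GATED
    return SimdState.CLOCK_GATED
```

In this sketch a briefly idle SIMD is only clock gated, which is cheap to undo, while the more expensive power gating is reserved for long idle periods.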

    Scalable and unified compute system
    2.
    Granted patent
    Scalable and unified compute system (in force)

    Publication No.: US08558836B2

    Publication date: 2013-10-15

    Application No.: US12476161

    Filing date: 2009-06-01

    IPC classification: G06T15/50

    Abstract: A Scalable and Unified Compute System performs scalable, repairable general purpose and graphics shading operations, memory load/store operations and texture filtering. A Scalable and Unified Compute Unit Module comprises a shader pipe array, a texture mapping unit, and a level one texture cache system. It accepts ALU instructions, input/output instructions, and texture or memory requests for a specified set of pixels, vertices, primitives, surfaces, or general compute work items from a shader program and performs associated operations to compute the programmed output data. The texture mapping unit accepts source data addresses and instruction constants in order to fetch, format, and perform instructed filtering interpolations to generate formatted results based on the specific corresponding data stored in a level one texture cache system. The texture mapping unit consists of an address generating system, a pre-formatter module, an interpolator module, an accumulator module and a format module.

    Distributed clock gating with centralized state machine control
    3.
    Granted patent
    Distributed clock gating with centralized state machine control (in force)

    Publication No.: US08316252B2

    Publication date: 2012-11-20

    Application No.: US12192530

    Filing date: 2008-08-15

    IPC classification: G06F1/00 H04L7/00

    Abstract: A method, computer program product, and system are provided for controlling a clock distribution network. For example, an embodiment of the method can include programming a predetermined delay time into a plurality of processing elements and controlling an activation and de-activation of these processing elements in a sequence based on the predetermined delay time. The processing elements are located in a system incorporating the clock distribution network, where the predetermined delay time can be programmed in a control register of a clock gate control circuit residing in the processing element. Further, when controlling the activation and de-activation of the processing elements, this activity can be controlled with a state machine based on the system's mode of operation. In controlling the activation and de-activation of the processing elements, the method described above can not only control the effects of di/dt in the system but also shut off clock signals in the clock distribution network when idle, thus reducing dynamic power consumption.
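
One way to picture the sequenced activation is as a staggered schedule: each processing element turns its clock on a fixed number of cycles after the previous one, so the current step (di/dt) is spread over time. The sketch below assumes a single uniform delay and strictly one-after-another ordering; the function name and parameters are hypothetical, and the patent's actual register layout and state machine are not modeled.

```python
from typing import List


def activation_schedule(num_elements: int, delay_cycles: int,
                        start_cycle: int = 0) -> List[int]:
    """Return the cycle at which each processing element is enabled.

    Assumes every element is programmed with the same delay and that
    elements are enabled one after another (an illustrative simplification).
    """
    return [start_cycle + i * delay_cycles for i in range(num_elements)]


def deactivation_schedule(num_elements: int, delay_cycles: int,
                          start_cycle: int = 0) -> List[int]:
    """Mirror of the above: elements are gated off in reverse order."""
    return list(reversed(activation_schedule(num_elements, delay_cycles,
                                             start_cycle)))
```

With four elements and a delay of 8 cycles, the elements would be enabled at cycles 0, 8, 16 and 24 instead of all at once.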

    Video instruction processing of desired bytes in multi-byte buffers by shifting to matching byte location
    4.
    Granted patent
    Video instruction processing of desired bytes in multi-byte buffers by shifting to matching byte location (in force)

    Publication No.: US08473721B2

    Publication date: 2013-06-25

    Application No.: US12762020

    Filing date: 2010-04-16

    IPC classification: G06F9/30

    CPC classification: G06T1/00

    Abstract: Disclosed herein is a processing unit configured to process video data, and applications thereof. In an embodiment, the processing unit includes a buffer and an execution unit. The buffer is configured to store a data word, wherein the data word comprises a plurality of bytes of video data. The execution unit is configured to execute a single instruction to (i) shift bytes of video data contained in the data word to align a desired byte of video data and (ii) process the desired byte of the video data to provide processed video data.
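
The two steps that the abstract folds into a single instruction, aligning a desired byte by shifting and then processing it, can be imitated in software as follows. This is only a sketch: the 32-bit word size, the numbering of bytes from the least significant end, and the default processing step (normalizing to the range 0 to 1) are all assumptions made for illustration.

```python
from typing import Callable


def shift_and_process(word: int, byte_index: int,
                      process: Callable[[int], float] = lambda b: b / 255.0
                      ) -> float:
    """Select one byte of a 32-bit word by shifting, then process it.

    byte_index 0 is the least significant byte (assumed convention).
    `process` stands in for whatever per-byte video operation is needed.
    """
    if not 0 <= byte_index < 4:
        raise ValueError("byte_index must be in 0..3 for a 32-bit word")
    desired = (word >> (8 * byte_index)) & 0xFF  # (i) shift to align
    return process(desired)                      # (ii) process the byte
```

Passing `process=lambda b: b` returns the raw byte, which is convenient for checking that the shift selects the expected position.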

    Method and system for multi-precision computation
    5.
    Granted patent
    Method and system for multi-precision computation (in force)

    Publication No.: US08468191B2

    Publication date: 2013-06-18

    Application No.: US12813074

    Filing date: 2010-06-10

    IPC classification: G06F7/48

    CPC classification: G06F7/5443 G06F2207/382

    Abstract: Systems and methods for multi-precision computation are disclosed. One embodiment of the present invention includes a plurality of multiply-add units (MADDs) configured to perform one or more single precision operations and an arrangement generator to generate one or more mantissa arrangements using a plurality of double precision numbers. Each MADD is configured to receive and load said mantissa arrangements from the arrangement generator. The MADDs compute a result of a multi-precision computation using the mantissa arrangements. In an embodiment, the MADDs are configured to simultaneously perform operations that include single precision operations, double-precision additions, and double-precision multiply and additions.
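
The general idea of building a wide multiplication out of narrower multiply-add steps can be shown with integer mantissas. The sketch below uses the textbook high/low split of each operand; it is not the mantissa arrangement defined in the patent, and the function name and the 27-bit split point are assumptions chosen so that a 53-bit double-precision mantissa divides into two narrower pieces.

```python
def split_multiply(a: int, b: int, low_bits: int = 27) -> int:
    """Multiply two non-negative integer mantissas via four narrow products.

    Each operand is split as x = x_hi * 2**low_bits + x_lo, and the partial
    products are recombined with shifts and adds.
    """
    mask = (1 << low_bits) - 1
    a_hi, a_lo = a >> low_bits, a & mask
    b_hi, b_lo = b >> low_bits, b & mask
    high = (a_hi * b_hi) << (2 * low_bits)
    middle = (a_hi * b_lo + a_lo * b_hi) << low_bits  # a multiply-add step
    low = a_lo * b_lo
    return high + middle + low
```

Each of the four partial products involves operands of at most 27 bits, which is the kind of work a narrower multiplier can handle, and the additions that combine them correspond to the accumulate half of a multiply-add.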

    Shader complex with distributed level one cache system and centralized level two cache
    6.
    Granted patent
    Shader complex with distributed level one cache system and centralized level two cache (in force)

    Publication No.: US08195882B2

    Publication date: 2012-06-05

    Application No.: US12476159

    Filing date: 2009-06-01

    IPC classification: G06F12/00

    Abstract: A shader pipe texture filter utilizes a level one cache system as a primary method of storage but with the ability to have the level one cache system read and write to a level two cache system when necessary. The level one cache system communicates with the level two cache system via a wide channel memory bus. In addition, the level one cache system can be configured to support dual shader pipe texture filters while maintaining access to the level two cache system. A method utilizing a level one cache system as a primary method of storage with the ability to have the level one cache system read and write to a level two cache system when necessary is also presented. In addition, level one cache systems can allocate a defined area of memory to be sharable amongst other resources.

    Unified Shader Engine Filtering System
    7.
    Patent application
    Unified Shader Engine Filtering System (under examination, published)

    Publication No.: US20090315909A1

    Publication date: 2009-12-24

    Application No.: US12476152

    Filing date: 2009-06-01

    IPC classification: G06T1/20 G09G5/10

    Abstract: Each row of a row based shader engine comprises a shader pipe array, a texture filter, and a level one texture cache system. The shader pipe array accepts texture requests for a specified pixel from a resource and performs associated rendering calculations, outputting texel data. The texture mapping unit receives texel data from a level one cache system and, through formatting and bilinear filtering interpolations, generates a formatted bilinear result based on a specific pixel's corresponding four texels. Utilizing multiple rows of a row based shader engine within the shader engine allows for the parallel processing of multiple simultaneous resource requests. A method for texture filtering utilizing a row based shader engine is also presented.
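
Bilinear filtering over a pixel's four surrounding texels, as mentioned in the abstract, is a standard interpolation: blend horizontally along the top and bottom texel pairs, then blend the two results vertically. The sketch below shows that arithmetic for single-channel values; the function and parameter names are illustrative, and formatting steps are left out.

```python
def bilinear(t00: float, t10: float, t01: float, t11: float,
             fx: float, fy: float) -> float:
    """Bilinearly interpolate four texels.

    t00 and t10 are the top-left and top-right texels, t01 and t11 the
    bottom-left and bottom-right. fx and fy are the fractional offsets
    of the sample point within the texel quad, each in [0, 1].
    """
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy
```

At `fx = fy = 0` the result is exactly the top-left texel, and at `fx = fy = 1` it is the bottom-right texel, which is a quick sanity check for any implementation.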

    Distributed Clock Gating with Centralized State Machine Control
    8.
    Patent application
    Distributed Clock Gating with Centralized State Machine Control (in force)

    Publication No.: US20090300388A1

    Publication date: 2009-12-03

    Application No.: US12192530

    Filing date: 2008-08-15

    IPC classification: G06F1/32 G06F1/08

    Abstract: A method, computer program product, and system are provided for controlling a clock distribution network. For example, an embodiment of the method can include programming a predetermined delay time into a plurality of processing elements and controlling an activation and de-activation of these processing elements in a sequence based on the predetermined delay time. The processing elements are located in a system incorporating the clock distribution network, where the predetermined delay time can be programmed in a control register of a clock gate control circuit residing in the processing element. Further, when controlling the activation and de-activation of the processing elements, this activity can be controlled with a state machine based on the system's mode of operation. In controlling the activation and de-activation of the processing elements, the method described above can not only control the effects of di/dt in the system but also shut off clock signals in the clock distribution network when idle, thus reducing dynamic power consumption.

    Method and apparatus for executing a predefined instruction set
    9.
    Granted patent
    Method and apparatus for executing a predefined instruction set (in force)

    Publication No.: US06784888B2

    Publication date: 2004-08-31

    Application No.: US09969669

    Filing date: 2001-10-03

    IPC classification: G06T15/00

    Abstract: The occurrence of an (n+m) input operand instruction that requires more than n of its input operands from an n-output data source is recognized by a programmable vertex shader (PVS) controller. In turn, the PVS controller provides at least two substitute instructions, neither of which requires more than n operands from the n-output data source, to a PVS engine. A first of the substitute instructions is executed by the PVS engine to provide an intermediate result that is temporarily stored and used as an input to another of the at least two substitute instructions. In this manner, the present invention avoids the expense of additional or significantly modified memory. In one embodiment of the present invention, a pre-accumulator register internal to the PVS engine is used to store the intermediate result. In this manner, the present invention provides a relatively inexpensive solution for a relatively infrequent occurrence.
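
The substitution described in the abstract can be illustrated with a three-operand multiply-add whose operands all come from a data source that can supply only two values at once. The sketch below rewrites it as a multiply followed by an add, with the intermediate result held in a register named `pre_acc`. The tuple encoding of instructions, the opcode names, and the assumption that every operand comes from the limited source are hypothetical choices for illustration only.

```python
from typing import List, Tuple

# (opcode, destination, operands); a made-up encoding for this sketch
Instr = Tuple[str, str, Tuple[str, ...]]


def expand_mad(instr: Instr, max_source_reads: int = 2) -> List[Instr]:
    """Replace an over-wide multiply-add with two narrower instructions.

    Instructions that already fit within max_source_reads operands,
    or that are not "mad", are passed through unchanged.
    """
    opcode, dst, operands = instr
    if opcode != "mad" or len(operands) <= max_source_reads:
        return [instr]
    a, b, c = operands
    return [
        ("mul", "pre_acc", (a, b)),  # intermediate result, stored temporarily
        ("add", dst, ("pre_acc", c)),  # consumes the intermediate result
    ]
```

Neither substitute instruction reads more than two operands from the limited source, because the second one takes its other input from the temporary register.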
