Manifold Array Processor
    1.
    Patent application
    Manifold Array Processor (status: under examination, published)

    Publication No.: US20130019082A1

    Publication date: 2013-01-17

    Application No.: US13616942

    Filing date: 2012-09-14

    IPC classification: G06F15/80

    Abstract: An array processor includes processing elements arranged in clusters to form a rectangular array. Inter-cluster communication paths are mutually exclusive. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path, thus eliminating half the wiring required for the path. The length of the longest communication path is not directly determined by the overall dimension of the array, as in conventional torus arrays. Rather, the longest communications path is limited by the inter-cluster spacing. Transpose elements of an N×N torus may be combined in clusters and communicate with one another through intra-cluster communications paths. Transpose operation latency is eliminated in this approach. Each PE may have a single transmit port and a single receive port. Thus, the individual PEs are decoupled from the array topology.

    Methods and Apparatus for Video Decoding
    2.
    Patent application
    Methods and Apparatus for Video Decoding (status: in force)

    Publication No.: US20100238999A1

    Publication date: 2010-09-23

    Application No.: US12792228

    Filing date: 2010-06-02

    IPC classification: H04N7/12

    Abstract: Techniques for performing the processing of blocks of video in multiple stages. Each stage is executed for blocks of data in the frame that need to go through that stage, based on the coding type, before moving to the next stage. This order of execution allows blocks of data to be processed in a nonsequential order, unless the blocks need to go through the same processing stages. Multiple processing elements (PEs) operating in SIMD mode executing the same task and operating on different blocks of data may be utilized, avoiding idle times for the PEs. In another aspect, inverse scan and dequantization operations for blocks of data are merged in a single procedure operating on multiple PEs operating in SIMD mode. This procedure makes efficient use of the multiple PEs and speeds up processing by combining two operations, inverse scan (reordering) and dequantization, which load the execution units differently. The reordering loads mainly the load and store units of the PEs, while the dequantization loads mainly other units. By combining the inverse scan and dequantization in an efficient VLIW packing performance, processing gain is achieved.
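
The merging of inverse scan and dequantization described in this abstract can be illustrated with a small scalar sketch. It assumes a zigzag scan order and a plain coefficient-times-quantizer multiply; the function names, the scan pattern, and the quantizer matrix are hypothetical choices made for illustration and are not taken from the abstract. The code is ordinary Python, not SIMD or VLIW code.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in zigzag scan order for an n x n block."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def scan_then_dequant(coeffs, qmatrix, n=8):
    """Two passes: reorder into a block, then scale every entry."""
    block = [[0] * n for _ in range(n)]
    for k, (r, c) in enumerate(zigzag_order(n)):
        block[r][c] = coeffs[k]
    return [[block[r][c] * qmatrix[r][c] for c in range(n)] for r in range(n)]


def merged_scan_dequant(coeffs, qmatrix, n=8):
    """One pass: each coefficient is scaled as it is stored in place."""
    block = [[0] * n for _ in range(n)]
    for k, (r, c) in enumerate(zigzag_order(n)):
        block[r][c] = coeffs[k] * qmatrix[r][c]
    return block


# Assumed sample data: 64 scanned coefficients and a made-up quantizer matrix.
coeffs = list(range(64))
qmatrix = [[r + c + 1 for c in range(8)] for r in range(8)]
merged = merged_scan_dequant(coeffs, qmatrix)
```

In the merged version the store (reordering) and the multiply (dequantization) sit in the same loop body, which is the kind of pairing that could be packed into one VLIW per iteration on hardware with separate load/store and arithmetic units.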

    Methods and apparatus to support conditional execution in a VLIW-based array processor with subword execution
    4.
    Granted patent
    Methods and apparatus to support conditional execution in a VLIW-based array processor with subword execution (status: in force)

    Publication No.: US06366999B1

    Publication date: 2002-04-02

    Application No.: US09238446

    Filing date: 1999-01-28

    IPC classification: G06F15/80

    Abstract: General purpose flags (ACFs) are defined and encoded utilizing a hierarchical one-, two- or three-bit encoding. Each added bit provides a superset of the previous functionality. With condition combination, a sequential series of conditional branches based on complex conditions may be avoided and complex conditions can then be used for conditional execution. ACF generation and use can be specified by the programmer. By varying the number of flags affected, conditional operation parallelism can be widely varied, for example, from mono-processing to octal-processing in VLIW execution, and across an array of processing elements (PEs). Multiple PEs can generate condition information at the same time with the programmer being able to specify a conditional execution in one processor based upon a condition generated in a different processor using the communications interface between the processing elements to transfer the conditions. Each processor in a multiple processor array may independently have different units conditionally operate based upon their ACFs.
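
The idea of combining conditions into per-lane flags and then executing conditionally on them, instead of chaining branches, can be sketched as follows. This is only an illustration under stated assumptions: the four-lane layout, the `combine_flags` and `conditional_add` helpers, and the sample values are hypothetical, and the abstract's hierarchical flag encoding is not modeled.

```python
def combine_flags(flags_a, flags_b, op="and"):
    """Combine two per-lane flag vectors into one condition per lane."""
    if op == "and":
        return [a and b for a, b in zip(flags_a, flags_b)]
    if op == "or":
        return [a or b for a, b in zip(flags_a, flags_b)]
    raise ValueError(f"unsupported op: {op}")


def conditional_add(dest, src_a, src_b, flags):
    """Per-lane add: a lane is written only where its flag is set."""
    return [a + b if f else d for d, a, b, f in zip(dest, src_a, src_b, flags)]


# Four subword lanes; the flags come from two separate comparisons.
values = [5, -3, 12, 7]
positive = [v > 0 for v in values]
below_ten = [v < 10 for v in values]

# The compound condition (0 < v < 10) is formed once, with no branch per lane.
flags = combine_flags(positive, below_ten)
result = conditional_add([0, 0, 0, 0], values, [100] * 4, flags)
print(result)  # [105, 0, 0, 107]
```

Lanes whose combined flag is clear keep their previous destination value, so all lanes can be processed by the same instruction stream.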

    Communicaton across shared mutually exclusive direction paths between clustered processing elements
    5.
    Granted patent
    Communicaton across shared mutually exclusive direction paths between clustered processing elements (status: in force)

    Publication No.: US09390057B2

    Publication date: 2016-07-12

    Application No.: US13616942

    Filing date: 2012-09-14

    IPC classification: G06F15/173 G06F15/80 G06F9/30

    Abstract: An array processor includes processing elements arranged in clusters to form a rectangular array. Inter-cluster communication paths are mutually exclusive. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path, thus eliminating half the wiring required for the path. The length of the longest communication path is not directly determined by the overall dimension of the array, as in conventional torus arrays. Rather, the longest communications path is limited by the inter-cluster spacing. Transpose elements of an N×N torus may be combined in clusters and communicate with one another through intra-cluster communications paths. Transpose operation latency is eliminated in this approach. Each PE may have a single transmit port and a single receive port. Thus, the individual PEs are decoupled from the array topology.

    Methods and apparatus for efficient cosine transform implementations
    7.
    Granted patent
    Methods and apparatus for efficient cosine transform implementations (status: in force)

    Publication No.: US06754687B1

    Publication date: 2004-06-22

    Application No.: US09711218

    Filing date: 2000-11-09

    IPC classification: G06F17/14

    Abstract: Many video processing applications, such as the decoding and encoding standards promulgated by the Moving Picture Experts Group (MPEG), are time constrained applications with multiple complex compute intensive algorithms such as the two-dimensional 8×8 IDCT. In addition, for encoding applications, cost, performance, and programming flexibility for algorithm optimizations are important design requirements. Consequently, it is of great advantage to meeting performance requirements to have a programmable processor that can achieve extremely high performance on the 2D 8×8 IDCT function. The ManArray 2×2 processor is able to process the 2D 8×8 IDCT in 34 cycles and meet the IEEE standard 1180-1990 for precision of the IDCT. A unique distributed 2D 8×8 IDCT process is presented along with the unique data placement supporting the high performance algorithm. In addition, a scalable 2D 8×8 IDCT algorithm that is operable on a 1×0, 1×1, 1×2, 2×2, 2×3, and further arrays of greater numbers of processors is presented that minimizes the VIM memory size by reuse of VLIWs and streamlines further application processing by having the IDCT results output in a standard row-major order. The techniques are applicable to cosine transforms more generally, such as discrete cosine transforms (DCTs).
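
For readers unfamiliar with the 2D 8×8 IDCT mentioned in this abstract, the sketch below shows the textbook row-column decomposition: a 1D inverse DCT over each row, then over each column. This is not the distributed process or data placement claimed in the patent, and it makes no claim about IEEE 1180-1990 precision; the function names and the orthonormal scaling are assumptions chosen for illustration.

```python
import math

N = 8


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT (DCT-III) of an 8-point vector."""
    out = []
    for n in range(N):
        total = 0.0
        for k in range(N):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            total += scale * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(total)
    return out


def idct_2d(block):
    """2D IDCT as a 1D IDCT over every row, then over every column."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # cols[c][r] holds output sample (r, c); transpose back to row-major order.
    return [[cols[c][r] for c in range(N)] for r in range(N)]


# A block with only a DC coefficient decodes to a flat block of samples.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
```

Because the two passes are independent 1D transforms, the rows (and later the columns) can in principle be split across processing elements, which is the general property that array implementations exploit.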

    Manifold array processor
    9.
    Granted patent
    Manifold array processor (status: in force)

    Publication No.: US06338129B1

    Publication date: 2002-01-08

    Application No.: US09323609

    Filing date: 1999-06-01

    IPC classification: G06F15/16

    Abstract: An array processor includes processing elements arranged in clusters which are, in turn, combined in a rectangular array. Each cluster is formed of processing elements which preferably communicate with the processing elements of at least two other clusters. Additionally, each inter-cluster communication path is mutually exclusive, that is, each path carries either north and west, south and east, north and east, or south and west communications. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path. That is, communications from a cluster which communicates to the north and east with another cluster may be combined in one path, thus eliminating half the wiring required for the path. Additionally, the length of the longest communication path is not directly determined by the overall dimension of the array, as it is in conventional torus arrays. Rather, the longest communications path is limited only by the inter-cluster spacing. In one implementation, transpose elements of an N×N torus are combined in clusters and communicate with one another through intra-cluster communications paths. Since transpose elements have direct connections to one another, transpose operation latency is eliminated in this approach. Additionally, each PE may have a single transmit port and a single receive port. As a result, the individual PEs are decoupled from the topology of the array.
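
The statement that transpose elements of an N×N torus end up in the same cluster can be checked with a small sketch. The grouping rule used here, cluster index = (row + column) mod N, is an assumption chosen for illustration because it has the stated property; the abstract itself does not give the rule, and the function names are hypothetical.

```python
def cluster_of(row, col, n):
    """Hypothetical grouping rule: cluster index is (row + col) mod n."""
    return (row + col) % n


def build_clusters(n):
    """Group every PE (row, col) of an n x n array by its cluster index."""
    clusters = {k: [] for k in range(n)}
    for row in range(n):
        for col in range(n):
            clusters[cluster_of(row, col, n)].append((row, col))
    return clusters


def transposes_share_cluster(n):
    """True if PE(r, c) and PE(c, r) always fall in the same cluster."""
    return all(
        cluster_of(r, c, n) == cluster_of(c, r, n)
        for r in range(n)
        for c in range(n)
    )


clusters = build_clusters(4)
print(clusters[0])  # [(0, 0), (1, 3), (2, 2), (3, 1)]
print(transposes_share_cluster(4))  # True
```

Under this rule a transpose pair such as PE(1, 3) and PE(3, 1) is local to one cluster, so exchanging their data needs only an intra-cluster path rather than a multi-hop route across the array.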

    Manifold array processor
    10.
    Granted patent

    Publication No.: US6023753A

    Publication date: 2000-02-08

    Application No.: US885310

    Filing date: 1997-06-30

    Abstract: An array processor includes processing elements arranged in clusters which are, in turn, combined in a rectangular array. Each cluster is formed of processing elements which preferably communicate with the processing elements of at least two other clusters. Additionally, each inter-cluster communication path is mutually exclusive, that is, each path carries either north and west, south and east, north and east, or south and west communications. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path. That is, communications from a cluster which communicates to the north and east with another cluster may be combined in one path, thus eliminating half the wiring required for the path. Additionally, the length of the longest communication path is not directly determined by the overall dimension of the array, as it is in conventional torus arrays. Rather, the longest communications path is limited only by the inter-cluster spacing. In one implementation, transpose elements of an N×N torus are combined in clusters and communicate with one another through intra-cluster communications paths. Since transpose elements have direct connections to one another, transpose operation latency is eliminated in this approach. Additionally, each PE may have a single transmit port and a single receive port. As a result, the individual PEs are decoupled from the topology of the array.