DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY
    1.
    Type: Patent application
    Status: Lapsed

    Publication number: US20110173366A1

    Publication date: 2011-07-14

    Application number: US12684804

    Filing date: 2010-01-08

    IPC classification: G06F15/76 G06F9/06 G06F13/14

    Abstract: A plurality of processing cores and a central storage unit having at least a memory are connected in a daisy chain manner, forming a daisy chain ring layout on an integrated chip. At least one of the plurality of processing cores places trace data on the daisy chain connection for transmitting the trace data to the central storage unit, and the central storage unit detects the trace data and stores the trace data in the memory co-located with the central storage unit.

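The ring arrangement described in this abstract can be pictured with a small simulation. This is only a sketch under stated assumptions, not the patented design: the `Packet` fields, the `is_trace` flag, the ring order (core 0, core 1, ..., last core, central unit) and the function names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Packet:
    source: int     # id of the core that placed the data on the ring
    is_trace: bool  # hypothetical flag that marks trace traffic
    payload: int


@dataclass
class CentralUnit:
    # Memory co-located with the central storage unit.
    memory: List[Packet] = field(default_factory=list)

    def observe(self, packet: Packet) -> None:
        # The central unit detects trace data and stores it; other traffic is ignored.
        if packet.is_trace:
            self.memory.append(packet)


def send_on_ring(num_cores: int, packet: Packet, central: CentralUnit) -> int:
    """Forward a packet hop by hop until it reaches the central unit.

    Assumed ring order: core 0 -> core 1 -> ... -> core num_cores-1 -> central unit.
    Position `num_cores` stands for the central unit. Returns the hop count.
    """
    position = packet.source
    hops = 0
    while position != num_cores:
        position += 1  # pass the packet to the next element on the daisy chain
        hops += 1
    central.observe(packet)
    return hops


central = CentralUnit()
hops = send_on_ring(4, Packet(source=1, is_trace=True, payload=0xBEEF), central)
send_on_ring(4, Packet(source=2, is_trace=False, payload=7), central)
```

In this toy model only the trace packet ends up in the central memory; the second packet passes the central unit without being stored.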

    Performance Evaluation of Algorithmic Tasks and Dynamic Parameterization on Multi-Core Processing Systems
    2.
    Type: Patent application
    Status: Under examination (published)

    Publication number: US20090144745A1

    Publication date: 2009-06-04

    Application number: US11947185

    Filing date: 2007-11-29

    IPC classification: G06F9/50 G06F9/44

    Abstract: Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.

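The template-driven benchmark in this abstract can be illustrated with a rough host-side stand-in. Everything here is an assumption made for illustration: `TaskTemplate`, its fields and `run_benchmark` are hypothetical names, and a plain byte-slice copy stands in for a real DMA transfer, which Python cannot issue.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TaskTemplate:
    transfer_bytes: int              # size of each simulated DMA transfer
    num_transfers: int               # how many transfers the task performs
    compute: Callable[[bytes], int]  # computation routine applied to each chunk


def run_benchmark(template: TaskTemplate) -> Dict[str, float]:
    """Generate the access pattern named by the template and time it."""
    source = bytes(template.transfer_bytes * template.num_transfers)
    checksum = 0
    start = time.perf_counter()
    for k in range(template.num_transfers):
        lo = k * template.transfer_bytes
        # The slice copy is only a stand-in for a DMA transfer into local storage.
        chunk = source[lo:lo + template.transfer_bytes]
        checksum += template.compute(chunk)
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "bytes_moved": len(source), "checksum": checksum}


result = run_benchmark(TaskTemplate(transfer_bytes=1024, num_transfers=8, compute=len))
```

The returned dictionary plays the role of the "results of the benchmark"; on real hardware the measured time would depend on the DMA engine rather than on a memory copy.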

    Executing Multiple Instructions Multiple Data ('MIMD') Programs on a Single Instruction Multiple Data ('SIMD') Machine
    3.
    Type: Patent application
    Status: Lapsed

    Publication number: US20090024830A1

    Publication date: 2009-01-22

    Application number: US11780072

    Filing date: 2007-07-19

    IPC classification: G06F15/00

    CPC classification: G06F15/161

    Abstract: Executing Multiple Instructions Multiple Data (‘MIMD’) programs on a Single Instruction Multiple Data (‘SIMD’) machine, the SIMD machine including a plurality of compute nodes, each compute node capable of executing only a single thread of execution, the compute nodes initially configured exclusively for SIMD operations, the SIMD machine further comprising a data communications network, the network comprising synchronous data communications links among the compute nodes, including establishing a SIMD partition comprising a plurality of the compute nodes; booting the SIMD partition in MIMD mode; executing by launcher programs a plurality of MIMD programs on compute nodes in the SIMD partition; and re-executing a launcher program by an operating system on a compute node in the SIMD partition upon termination of the MIMD program executed by the launcher program.


    Method and structure for producing high performance linear algebra routines using register block data format routines
    4.
    Type: Granted patent
    Status: Lapsed

    Publication number: US07469266B2

    Publication date: 2008-12-23

    Application number: US10671888

    Filing date: 2003-09-29

    IPC classification: G06F7/38

    CPC classification: G06F12/0875 G06F17/16

    Abstract: A method (and structure) of executing a matrix operation includes, for a matrix A, separating the matrix A into blocks, each block having a size p-by-q. The blocks of size p-by-q are then stored in a cache or memory in at least one of the following two ways: the elements in at least one of the blocks are stored in a format in which elements of the block occupy a location different from an original location in the block, and/or the blocks of size p-by-q are stored in a format in which at least one block occupies a position different relative to its original position in the matrix A.

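One familiar way to store p-by-q blocks so that each block is contiguous is a block-major layout. The sketch below shows that generic layout only; it is not claimed to be the register block format of the patent, and the function name `to_block_major` is hypothetical.

```python
from typing import List


def to_block_major(a: List[List[int]], p: int, q: int) -> List[int]:
    """Flatten a row-major matrix so that every p-by-q block is contiguous.

    Assumes the matrix dimensions are exact multiples of p and q.
    """
    rows, cols = len(a), len(a[0])
    if rows % p or cols % q:
        raise ValueError("matrix dimensions must be multiples of the block size")
    out: List[int] = []
    for bi in range(0, rows, p):          # block row
        for bj in range(0, cols, q):      # block column
            for i in range(bi, bi + p):   # rows inside the block
                for j in range(bj, bj + q):
                    out.append(a[i][j])
    return out


matrix = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
blocked = to_block_major(matrix, 2, 2)
```

With 2-by-2 blocks, the element 4 moves from flat position 4 in the row-major layout to position 2, which is the kind of relocation the abstract refers to.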

    Distributed trace using central performance counter memory
    5.
    Type: Granted patent
    Status: Lapsed

    Publication number: US08356122B2

    Publication date: 2013-01-15

    Application number: US12684804

    Filing date: 2010-01-08

    IPC classification: G06F13/00 G06F3/00 G06F11/00

    Abstract: A plurality of processing cores and a central storage unit having at least a memory are connected in a daisy chain manner, forming a daisy chain ring layout on an integrated chip. At least one of the plurality of processing cores places trace data on the daisy chain connection for transmitting the trace data to the central storage unit, and the central storage unit detects the trace data and stores the trace data in the memory co-located with the central storage unit.


    Optimizing layout of an application on a massively parallel supercomputer
    6.
    Type: Granted patent
    Status: Lapsed

    Publication number: US08117288B2

    Publication date: 2012-02-14

    Application number: US10963101

    Filing date: 2004-10-12

    IPC classification: G06F15/177

    CPC classification: G06F9/5066

    Abstract: A general computer-implemented method and apparatus to optimize problem layout on a massively parallel supercomputer is described. The method takes as input the communication matrix of an arbitrary problem in the form of an array whose entries C(i, j) are the amount of data communicated from domain i to domain j. Given C(i, j), first a heuristic map is implemented which attempts sequentially to map a domain and its communications neighbors either to the same supercomputer node or to near-neighbor nodes on the supercomputer torus while keeping the number of domains mapped to a supercomputer node constant (as much as possible). Next, a Markov chain of maps is generated from the initial map using Monte Carlo simulation with Free Energy (cost function) F = Σ_{i,j} C(i,j) H(i,j), where H(i,j) is the smallest number of hops on the supercomputer torus between domain i and domain j. On the cases tested, it was found that the method produces good mappings and has the potential to be used as a general layout optimization tool for parallel codes. At the moment, the serial code implemented to test the method is unoptimized, so the computation time to find the optimum map can be several hours on a typical PC. For production implementation, good parallel code for our algorithm would be required, which could itself be implemented on a supercomputer.

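The cost function F and the Monte Carlo search over maps can be sketched as follows. This is a minimal illustration under assumptions, not the patent's code: the Metropolis acceptance rule, the fixed temperature, the swap move and all function names are choices made here. Swapping the nodes of two domains is used because it keeps the number of domains per node unchanged.

```python
import math
import random
from typing import List, Sequence, Tuple

Coord = Tuple[int, ...]


def torus_hops(a: Coord, b: Coord, dims: Sequence[int]) -> int:
    """H(i,j): smallest hop count between two nodes on a torus with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def free_energy(comm: List[List[float]], placement: List[Coord], dims: Sequence[int]) -> float:
    """F = sum over i,j of C(i,j) * H(i,j) for the given domain-to-node map."""
    n = len(comm)
    return sum(
        comm[i][j] * torus_hops(placement[i], placement[j], dims)
        for i in range(n)
        for j in range(n)
    )


def monte_carlo_layout(comm, placement, dims, steps=2000, temperature=1.0, seed=0):
    """Walk a Markov chain of maps with Metropolis acceptance; return the best map seen."""
    rng = random.Random(seed)
    current = list(placement)
    energy = free_energy(comm, current, dims)
    best, best_energy = list(current), energy
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]  # propose: swap two domains
        new_energy = free_energy(comm, current, dims)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy                          # accept the move
            if energy < best_energy:
                best, best_energy = list(current), energy
        else:
            current[i], current[j] = current[j], current[i]  # reject: undo the swap
    return best, best_energy


# Four domains communicating in a chain, placed on a 1-D torus of four nodes.
comm = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
initial = [(0,), (2,), (1,), (3,)]
dims = (4,)
initial_energy = free_energy(comm, initial, dims)
best, best_energy = monte_carlo_layout(comm, initial, dims)
```

Recomputing F from scratch at every step keeps the sketch short; a practical implementation would update only the terms that involve the two swapped domains.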

    METHOD AND STRUCTURE FOR FAST IN-PLACE TRANSFORMATION OF STANDARD FULL AND PACKED MATRIX DATA FORMATS
    7.
    Type: Patent application
    Status: In force

    Publication number: US20090063607A1

    Publication date: 2009-03-05

    Application number: US11849272

    Filing date: 2007-09-01

    IPC classification: G06F7/32

    Abstract: A method and structure for an in-place transformation of matrix data. For a matrix A stored in one of a standard full format or a packed format and a transformation T having a compact representation, blocking parameters MB and NB are chosen based on a cache size. A sub-matrix A1 of A, A1 having size M1=m*MB by N1=n*NB, is worked on, and any residual remainder of A is saved in a buffer B. Sub-matrix A1 is worked on by contiguously moving and contiguously transforming A1 in-place into a New Data Structure (NDS), applying the transformation T in units of MB*NB contiguous double words to the NDS format of A1, thereby replacing A1 with the contents of T(A1), and moving and transforming NDS T(A1) to standard data format T(A1) with holes for the remainder of A in buffer B. The contents of buffer B are contiguously copied into the holes of A2, thereby providing in-place transformed matrix T(A).

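The sizes of the sub-matrix A1 and of the residual that goes into buffer B follow from simple arithmetic. The sketch below assumes that m and n are the largest whole numbers of blocks that fit in an M-by-N matrix; the abstract does not state this, and the function name `split_dims` is hypothetical.

```python
from typing import Tuple


def split_dims(M: int, N: int, MB: int, NB: int) -> Tuple[int, int, int]:
    """Return (M1, N1, residual_elements) for an M-by-N matrix and MB-by-NB blocks.

    Assumption: m = M // MB and n = N // NB, so A1 is the largest sub-matrix
    made of whole blocks. The residual is every element of A outside A1.
    """
    m, n = M // MB, N // NB
    M1, N1 = m * MB, n * NB
    residual = M * N - M1 * N1
    return M1, N1, residual


M1, N1, residual = split_dims(10, 7, 4, 3)
```

For a 10-by-7 matrix with MB=4 and NB=3 this gives an 8-by-6 sub-matrix A1 and 22 residual elements, so the buffer stays small relative to the matrix.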

    Executing Multiple Instructions Multiple Data ('MIMD') Programs on a Single Instruction Multiple Data ('SIMD') Machine
    8.
    Type: Patent application
    Status: In force

    Publication number: US20090024831A1

    Publication date: 2009-01-22

    Application number: US11780112

    Filing date: 2007-07-19

    IPC classification: G06F15/76 G06F9/30

    CPC classification: G06F15/177 G06F9/5061

    Abstract: Executing MIMD programs on a SIMD machine, including establishing on the SIMD machine a plurality of SIMD partitions; booting a first SIMD partition in MIMD mode; executing, on a compute node of the first SIMD partition booted in MIMD mode, a MIMD accelerator program; executing a SIMD program in a second SIMD partition, one instance of the SIMD program executing on each compute node of the second SIMD partition, each instance of the SIMD program carrying out a portion of the data processing effected by the SIMD program; and accelerating, by an instance of the SIMD program through the MIMD accelerator program, a portion of the data processing of the instance of the SIMD program.


    Performance evaluation of algorithmic tasks and dynamic parameterization on multi-core processing systems
    10.
    Type: Granted patent
    Status: In force

    Publication number: US08037215B2

    Publication date: 2011-10-11

    Application number: US12130167

    Filing date: 2008-05-30

    IPC classification: G06F13/28 G06F17/50

    Abstract: Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.
