Abstract:
An integrated circuit device having memory is disclosed. The integrated circuit device comprises programmable resources; programmable interconnect elements coupled to the programmable resources, the programmable interconnect elements enabling communication of signals with the programmable resources; a plurality of memory blocks; and dedicated interconnect elements coupled to the plurality of memory blocks, the dedicated interconnect elements enabling access to the plurality of memory blocks. A method of implementing memory in an integrated circuit device is also disclosed.
Abstract:
An integrated circuit comprises a memory matrix including: a first memory cell array; a first multiplexer (MUX) coupled to an input of the first memory cell array; a second MUX coupled to an output of the first memory cell array; a second memory cell array; a third MUX coupled to an input of the second memory cell array; and a fourth MUX coupled to an output of the second memory cell array. The second MUX is coupled to the fourth MUX. The fourth MUX is configured to pass a selected one of: (1) an output from the third MUX, (2) an output from the second memory cell array, or (3) an output from the second MUX.
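A rough functional model of this arrangement is sketched below in Python. The fourth MUX's three selectable sources follow the abstract directly; the select encodings, the array depth, and the inputs feeding the first and third MUXes are illustrative assumptions.

```python
def mux(select, *inputs):
    """Generic multiplexer: pass the selected input through."""
    return inputs[select]

class MemoryCellArray:
    def __init__(self, depth):
        self.cells = [0] * depth
    def write(self, addr, value):
        self.cells[addr] = value
    def read(self, addr):
        return self.cells[addr]

array1 = MemoryCellArray(16)
array2 = MemoryCellArray(16)

def read_matrix(addr, sel1, sel2, sel3, sel4, data_in=0):
    mux1_out = mux(sel1, data_in, 0)                   # first MUX feeds array1's input
    mux2_out = mux(sel2, array1.read(addr), mux1_out)  # second MUX sits on array1's output
    mux3_out = mux(sel3, data_in, 0)                   # third MUX feeds array2's input
    # Fourth MUX passes one of: (1) the third MUX output, (2) the second
    # array's output, or (3) the second MUX output (the cascade path).
    return mux(sel4, mux3_out, array2.read(addr), mux2_out)

array1.write(3, 7)
print(read_matrix(3, sel1=0, sel2=0, sel3=0, sel4=2))  # 7, via the cascade path
```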
Abstract:
An integrated circuit (IC) includes a plurality of dies. The IC includes a plurality of memory channel interfaces configured to communicate with a memory, wherein the plurality of memory channel interfaces are disposed within a first die of the plurality of dies. The IC may include a compute array distributed across the plurality of dies and a plurality of remote buffers distributed across the plurality of dies. The plurality of remote buffers are coupled to the plurality of memory channel interfaces and to the compute array. The IC further includes a controller configured to determine that each of the plurality of remote buffers has data stored therein and, in response, broadcast a read enable signal to each of the plurality of remote buffers, initiating data transfers from the plurality of remote buffers to the compute array across the plurality of dies.
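A behavioral sketch of the broadcast scheme follows, in Python. The RemoteBuffer and Controller names, the buffer count, and the FIFO model are illustrative assumptions, not taken from the abstract.

```python
from collections import deque

class RemoteBuffer:
    def __init__(self):
        self.fifo = deque()
    def has_data(self):
        return len(self.fifo) > 0
    def push(self, word):
        self.fifo.append(word)
    def pop(self):
        return self.fifo.popleft()

class Controller:
    def __init__(self, buffers):
        self.buffers = buffers
    def step(self):
        # Broadcast a read enable only once *every* remote buffer holds
        # data, so the compute array receives one aligned word per buffer.
        if all(b.has_data() for b in self.buffers):
            return [b.pop() for b in self.buffers]   # read enable asserted
        return None                                  # stall: wait for data

bufs = [RemoteBuffer() for _ in range(4)]
ctrl = Controller(bufs)
bufs[0].push(1); bufs[1].push(2); bufs[2].push(3)
print(ctrl.step())   # None -- one buffer is still empty
bufs[3].push(4)
print(ctrl.step())   # [1, 2, 3, 4] -- broadcast read, one word per buffer
```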
Abstract:
A circuit arrangement includes an array of multiply-accumulate (MAC) circuits, wherein each MAC circuit includes a cache configured for storage of a plurality of kernels. The MAC circuits are configured to receive a first set of data elements of an input feature map (IFM) at a first rate. The MAC circuits are configured to perform first MAC operations on the first set of the data elements and a first one of the kernels associated with a first output feature map (OFM) depth index during a first MAC cycle, wherein a rate of MAC cycles is faster than the first rate. The MAC circuits are configured to perform second MAC operations on the first set of the data elements and a second one of the kernels associated with a second OFM depth index during a second MAC cycle that consecutively follows the first MAC cycle.
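The rate decoupling can be sketched as below in Python. The shapes, the two-kernel cache, and the use of NumPy dot products as stand-ins for the MAC array are assumptions for illustration.

```python
import numpy as np

# Each set of IFM data elements arrives at the slow input rate and is
# held while the MAC array cycles through its cached kernels, producing
# one OFM depth index per (faster) MAC cycle.
ifm_sets = [np.random.rand(8) for _ in range(4)]   # data at the first (slow) rate
kernels = [np.random.rand(8) for _ in range(2)]    # kernels cached per MAC circuit

ofm = np.zeros((len(ifm_sets), len(kernels)))
for t, data in enumerate(ifm_sets):            # one slow input period...
    for d, kernel in enumerate(kernels):       # ...spans len(kernels) MAC cycles
        ofm[t, d] = np.dot(data, kernel)       # OFM depth index d in MAC cycle d
```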
Abstract:
A memory device can be used with a shared routing resource that provides access to the memory device. The memory device can include a random access memory (RAM) circuit that includes a plurality of ports configured to provide access to the RAM circuit by the shared routing resource. A memory partition register circuit can be configured to store a plurality of addresses specifying respective context partitions within the RAM circuit. A plurality of pointer register circuits can each be associated with a corresponding port of the plurality of ports and can be configured to store a respective set of pointers that specify a location in the RAM circuit relative to a respective context partition. Addressing logic can be configured to provide access to the RAM circuit using the respective set of pointers for each port.
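A sketch of the partition-relative addressing, in Python: the register layout, the two-port configuration, and the separate read/write pointers per port are illustrative assumptions.

```python
class PartitionedRAM:
    def __init__(self, size, partition_bases, num_ports):
        self.ram = [0] * size
        self.partition_bases = partition_bases   # memory partition register contents
        # One set of pointers per port, each relative to a context partition.
        self.pointers = [dict(rd=0, wr=0) for _ in range(num_ports)]

    def _resolve(self, port, context, which):
        # Addressing logic: pointer offset plus the context partition base.
        return self.partition_bases[context] + self.pointers[port][which]

    def write(self, port, context, value):
        self.ram[self._resolve(port, context, "wr")] = value
        self.pointers[port]["wr"] += 1

    def read(self, port, context):
        value = self.ram[self._resolve(port, context, "rd")]
        self.pointers[port]["rd"] += 1
        return value

ram = PartitionedRAM(size=64, partition_bases=[0, 32], num_ports=2)
ram.write(port=0, context=1, value=42)   # port 0's write pointer, partition 1 -> addr 32
print(ram.read(port=0, context=1))       # 42
```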
Abstract:
In disclosed circuit arrangements, memory cell arrays are addressed by a first portion of an input address, and memory cells within each memory cell array are addressed by a second portion of the input address. A first first-in-first-out (FIFO) buffer is coupled to the memory cell arrays and delays the second portion of each input address to the memory cell arrays by a sleep period. Control circuits respectively coupled to the memory cell arrays include second FIFO buffers; each control circuit decodes the first portion of each input address and generates corresponding states of enable signals. The control circuits store the corresponding states of the enable signals in the second FIFO buffers concurrently with input of the second portion of each input address to the first FIFO buffer. The second FIFO buffers delay output of the corresponding states of the enable signals to the memory cell arrays by the sleep period. Each control circuit further switches a corresponding memory cell array into a sleep mode in response to all states of the enable signal in the corresponding second FIFO buffer being in a non-enabled state.
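A behavioral sketch of the delayed-enable scheme, in Python: the four-cycle sleep period, the two arrays, and the encoding of an input address as an (array, cell) pair are assumptions for illustration.

```python
from collections import deque

SLEEP_PERIOD = 4   # pipeline delay, in cycles
NUM_ARRAYS = 2

addr_fifo = deque([None] * SLEEP_PERIOD)   # first FIFO: delays the second portion
enable_fifos = [deque([False] * SLEEP_PERIOD) for _ in range(NUM_ARRAYS)]
asleep = [False] * NUM_ARRAYS

def access(array, addr):
    print(f"array {array} accessed at cell {addr}")

def clock(input_address):
    # First portion selects the array; second portion selects the cell.
    array_sel, cell_addr = input_address
    delayed_cell_addr = addr_fifo.popleft()
    addr_fifo.append(cell_addr)
    for i in range(NUM_ARRAYS):
        # Decode the first portion into an enable state, queued in the
        # second FIFO concurrently with the second portion's entry above.
        enable_fifos[i].append(array_sel == i)
        delayed_enable = enable_fifos[i].popleft()
        # Sleep when every queued enable state is non-enabled; a queued
        # enable wakes the array before its delayed address arrives.
        asleep[i] = not any(enable_fifos[i])
        if delayed_enable:
            access(i, delayed_cell_addr)

for t in range(6):
    clock((0, t))    # stream of accesses, all targeting array 0
print(asleep)        # [False, True]: array 1 has no pending enables
```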
Abstract:
A memory circuit includes an input stage having N input ports and N output ports, wherein N is an integer greater than one. The memory circuit further includes an N:1 port multiplexer coupled to the N output ports of the input stage and configured to time-division multiplex the N output ports to one multiplexed port. The memory circuit also includes a random access memory matrix and a 1:N port multiplexer. The random access memory matrix is coupled to the multiplexed port. The 1:N port multiplexer is coupled to the random access memory matrix and is configured to de-multiplex signals received from the random access memory matrix into N output ports.
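A behavioral sketch of the time-division multiplexing, in Python: the value N = 4, the fixed round-robin slot order, and the single-port RAM model are illustrative assumptions.

```python
N = 4
ram = [0] * 256   # random access memory matrix behind the multiplexed port

def tdm_access(requests):
    """requests: list of N (op, addr, data) tuples, one per input port."""
    responses = [None] * N
    for slot in range(N):            # N:1 port MUX: one input port per time slot
        op, addr, data = requests[slot]
        if op == "write":
            ram[addr] = data
        else:                        # "read"
            # 1:N port MUX routes the RAM output back to port `slot`.
            responses[slot] = ram[addr]
    return responses

out = tdm_access([("write", 7, 99), ("read", 7, None),
                  ("write", 8, 12), ("read", 8, None)])
print(out)   # [None, 99, None, 12]
```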
Abstract:
Embodiments herein describe circuitry with improved efficiency when executing layers in a nested neural network. A nested neural network has at least one split operation where a tensor generated by a first layer is transmitted to, and processed by, several branches in the neural network. Each of these branches can have several layers with data dependencies that result in a multiply-add array sitting idle. In one embodiment, the circuitry can include a dedicated pre-pooler for performing a pre-pooling operation. Thus, the pre-pooling operation can be performed in parallel with other operations (e.g., the convolution performed by another layer). Once the multiply-add array is idle, the pre-pooling operation has already completed (or at least has already started), which means the time the multiply-add array must wait before it can perform the next operation is reduced or eliminated.
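A scheduling sketch of this overlap, in Python: the thread-pool concurrency and the placeholder convolution/pooling bodies stand in for the multiply-add array and the dedicated pre-pooler; they are not the patented circuitry.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def convolution(tensor):
    return tensor * 2          # placeholder for the multiply-add array's work

def pre_pool(tensor, k=2):
    # 1-D max pooling placeholder for the dedicated pre-pooler.
    return tensor.reshape(-1, k).max(axis=1)

tensor = np.arange(8.0)        # output of the split layer, fed to both branches

with ThreadPoolExecutor() as pool:
    conv_future = pool.submit(convolution, tensor)   # multiply-add array busy
    pool_future = pool.submit(pre_pool, tensor)      # pre-pooler runs in parallel
    # By the time the multiply-add array is free, the pre-pooled tensor is
    # ready (or underway), shrinking the array's idle wait.
    conv_out, pooled = conv_future.result(), pool_future.result()
```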
Abstract:
The embodiments herein store tabulated values representing a linear or non-linear function in separate memory banks, reducing the size of the memory used to store the tabulated values while still providing the upper and lower values needed for linear interpolation in parallel (e.g., in the same cycle). To do so, a linear interpolation system includes a first memory bank that stores the even-indexed tabulated values while a second memory bank stores the odd-indexed tabulated values. During each clock cycle, the first and second memory banks can output the upper and lower values for linear interpolation (although which memory bank outputs the upper value and which outputs the lower value can vary). Using the upper and lower values, the linear interpolation system performs linear interpolation to approximate the value of the function that lies between the upper and lower values.
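A minimal sketch of the banking scheme, in Python: the tabulated function (sine), the table size, and the fixed input range are assumptions; the even/odd split and the alternating upper/lower roles follow the abstract.

```python
import math

TABLE_SIZE = 16
STEP = (math.pi / 2) / (TABLE_SIZE - 1)
TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE)]

bank_even = TABLE[0::2]   # even-indexed tabulated values
bank_odd = TABLE[1::2]    # odd-indexed tabulated values

def interpolate(x):
    i = min(int(x / STEP), TABLE_SIZE - 2)   # index of the lower tabulated value
    # Indices i and i+1 always land in different banks, so both values are
    # available in the same cycle; which bank holds the upper value
    # alternates with the parity of i.
    if i % 2 == 0:
        lower, upper = bank_even[i // 2], bank_odd[i // 2]
    else:
        lower, upper = bank_odd[i // 2], bank_even[i // 2 + 1]
    frac = (x - i * STEP) / STEP
    return lower + frac * (upper - lower)

print(interpolate(0.5), math.sin(0.5))   # approximation vs. exact value
```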
Abstract:
An example semiconductor device includes a first integrated circuit (IC) die including a first column of cascade-coupled resource blocks; a second IC die including a second column of cascade-coupled resource blocks, where an active side of the second IC die is mounted to an active side of the first IC die; and a plurality of electrical connections between the active side of the first IC die and the active side of the second IC die, the plurality of electrical connections including at least one electrical connection between the first column of cascade-coupled resource blocks and the second column of cascade-coupled resource blocks.