Mechanism for efficient data access and communication in parallel computations on an emulated spatial lattice
First Claim
1. A method of performing operations associated with a process occurring in at least one emulated lattice of at least one sector having lattice sites therein, the operations being performed by at least one processing node, the processing node associated with the at least one sector and including a memory, comprising:
- associating each of the lattice sites with data in a data structure;
partitioning the data at the lattice sites into sets of homologous bits, one from each data structure at each lattice site, to form corresponding bit-fields;
partitioning the lattice sites in a shift-invariant manner into groups each including a plurality of lattice sites to form corresponding first site-aggregates, the data in each bit-field being correspondingly partitioned into first bit-aggregates;
grouping together the first site-aggregates to form a plurality of second site-aggregates that partition the lattice sites in a shift-invariant manner;
grouping together pluralities of the first bit-aggregates to form second bit-aggregates, each second bit-aggregate aggregating data associated with the lattice sites of a corresponding second site-aggregate;
storing in the memory each second bit-aggregate as an addressable unit composed of separately addressable first bit-aggregates; and
shifting data for at least one of the bit-fields within the at least one sector by addressing each second bit-aggregate in which a portion of the at least one of the bit-fields is stored, and addressing each of the constituent first bit-aggregates in the addressed each second bit-aggregate.
0 Assignments
0 Petitions
Accused Products
Abstract
A mechanism for performing parallel computations on an emulated spatial lattice by scheduling memory and communication operations on a static mesh-connected array of synchronized processing nodes. The lattice data are divided up among the array of processing nodes, each having a memory and a plurality of processing elements within each node. The memory is assumed to have a hierarchical granular structure that distinguishes groups of bits that are most efficiently accessed together, such as words or rows. The lattice data is organized in memory so that the sets of bits that interact during processing are always accessed together. Such an organization is based on mapping the lattice data into the granular structure of the memories in a manner that has simple spatial translation properties in the emulated space. The mapping permits data movement in the emulated lattice to be achieved by a combination of scheduled memory access and scheduled communication. Moreover, the same mapping spreads interprocessor communication demands evenly over time.
-
Citations
69 Claims
-
1. A method of performing operations associated with a process occurring in at least one emulated lattice of at least one sector having lattice sites therein, the operations being performed by at least one processing node, the processing node associated with the at least one sector and including a memory, comprising:
-
associating each of the lattice sites with data in a data structure;
partitioning the data at the lattice sites into sets of homologous bits, one from each data structure at each lattice site, to form corresponding bit-fields;
partitioning the lattice sites in a shift-invariant manner into groups each including a plurality of lattice sites to form corresponding first site-aggregates, the data in each bit-field being correspondingly partitioned into first bit-aggregates;
grouping together the first site-aggregates to form a plurality of second site-aggregates that partition the lattice sites in a shift-invariant manner;
grouping together pluralities of the first bit-aggregates to form second bit-aggregates, each second bit-aggregate aggregating data associated with the lattice sites of a corresponding second site-aggregate;
storing in the memory each second bit-aggregate as an addressable unit composed of separately addressable first bit-aggregates; and
shifting data for at least one of the bit-fields within the at least one sector by addressing each second bit-aggregate in which a portion of the at least one of the bit-fields is stored, and addressing each of the constituent first bit-aggregates in the addressed each second bit-aggregate. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
processing bit-field data for each of the lattice sites to be updated to transform the value of the associated data structure.
-
-
3. The method of claim 2, wherein processing comprises performing a symbolic operation.
-
4. The method of claim 2, wherein processing comprises performing a numerical operation.
-
5. The method of claim 2, wherein processing comprises:
-
reading from the memory the bit-field data for each lattice site to be updated;
updating the read bit-field data; and
writing the updated bit-field data to the memory.
-
-
6. The method of claim 5, wherein the step of updating occurs after the step of shifting and the bit-field data read from the memory are shifted bit-field data.
-
7. The method of claim 5, wherein the step of updating occurs before the step of shifting and the bit-field data written to the memory are shifted bit-field data.
-
8. The method of claim 1, wherein the at least one sector comprises a plurality of sectors and the operations are performed by an array of processing nodes, each associated with a different one of the sectors in the plurality of sectors and communicating with others of the processing nodes associated with neighboring ones of the sectors in the plurality of sectors.
-
9. The method of claim 8, further comprising:
shifting periodically the bit-field data within each sector of each associated processing node, whereby the data that shifts past an edge of the sector wraps to the beginning of an opposite edge of the sector, the periodic shifting being performed by memory addressing and by re-ordering bits within addressed ones of the first bit-aggregates.
-
10. The method of claim 9, further comprising:
-
reading by the processing nodes the periodically shifted bit-field data, each accessing data for a one of the first site-aggregates to be processed; and
communicating the wrapped data to a nearby one of the processing nodes, the communicated wrapped data being substituted for the wrapped data within the nearby one of the processing nodes to which it is communicated.
-
-
11. The method of claim 10, further comprising:
processing the shifted bit-field data.
-
12. The method of claim 11, wherein processing includes using a table lookup.
-
13. The method of claim 12, wherein each of the processing nodes includes a plurality of processing elements for processing a parallel stream of the bit-field data and the table lookup is shared by all of the processing elements in each processing node.
-
14. The method of claim 13, further comprising:
loading the bit-field data into the shared lookup table so that data from all of the lattice sites in a given one of the sectors can be used to randomly access data belonging to a fixed set of the lattice sites.
-
15. The method of claim 11, wherein the plurality of lattice sites aggregated within each of the first site-aggregates have a uniform spacing relative to each edge of the at least one sector, the difference for any two of the first site-aggregates in the respective numbers of lattice sites lying within a give distance of an edge being at most one.
-
16. The method of claim 11, wherein each second bit-aggregate aggregates first bit-aggregates associated with a single one of the sectors in the plurality of sectors, and which in their pattern of grouping of data associated with the lattice sites, are all periodic translations of each other along a single line in the single sector.
-
17. The method of claim 11, wherein each of the processing nodes includes a plurality of processing elements for processing a parallel stream of the bit-field data, wherein the second bit-aggregate aggregates first bit-aggregates associated with a plurality of the bit-fields, and wherein each processing element includes bit-serial arithmetic hardware.
-
18. The method of claim 11, wherein the at least one emulated lattice includes at least two emulated lattices, the at least two emulated lattices having unequal numbers of the bit-fields, and wherein shifted bit-field data from the at least two emulated lattices are processed together.
-
19. The method of claim 11, wherein the at least one sector has at least two dimensions, and each of the first site-aggregates includes a set of the lattice sites that is unsymmetric about every parallel to at least one edge of the at least one sector.
-
20. The method of claim 1, wherein the memory has the property that consecutive accesses to each of the plurality of first bit-aggregates is fastest if each first bit-aggregate in the plurality of first bit-aggregates is a part of a single one of the second bit-aggregates.
-
21. The method of claim 20, wherein the at least one sector has at least two dimensions, and each of the first site-aggregates includes a set of the lattice sites that is unsymmetric about every parallel to at least one edge of the at least one sector.
-
22. The method of claim 21, wherein the at least one sector comprises a plurality of sectors and the operations are performed by an array of processing nodes, each associated with a different one of the sectors in the plurality of sectors and communicating with others of the processing nodes associated with neighboring ones of the sectors in the plurality of sectors.
-
23. A processor for performing operations associated with a process occurring in at least one emulated lattice having at least one sector and having a plurality of lattice sites therein, the processor comprising:
-
a processing node associated with the at least one sector, the processing node including a memory for storing lattice site data associated with the plurality of lattice sites, each of the lattice sites having an associated data structure;
wherein sets of homologous bits, one from each associated data structure at each lattice site, form bit-fields;
wherein a shift-invariant partition of the least one sector into pluralities of lattice sites form first site-aggregates;
wherein first site-aggregates are grouped to partition the lattice sites of the at least one sector in a shift-invariant manner to form a plurality of second site-aggregates;
wherein a portion of each bit-field associated with each first site-aggregate forms a first bit-aggregate;
wherein pluralities of the first bit-aggregates are grouped together to form second bit-aggregates, each aggregating data associated with a corresponding second site-aggregate;
wherein the memory stores each second bit-aggregate as an addressable unit composed of separately addressable first bit-aggregates; and
wherein the processing node shifts data for at least one of the bit-fields within the at least one sector by addressing each second bit-aggregate in which each portion of the at least one of the bit-fields is stored, and addressing each of the constituent first bit-aggregates in the addressed each second bit-aggregate. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
a plurality of the processing nodes, each of the processing nodes in the plurality of processing nodes connected by mesh I/O links to neighboring processing nodes in the plurality of processing nodes to form a mesh array, each of the processing nodes in the plurality of processing nodes being associated with an equal-sized sector of the emulated lattice; and
wherein the performance of the operations is divided among the plurality of the processing nodes.
-
-
30. The processor of claim 28, further comprising:
a barrel shifter connected to the at least one memory block for re-ordering bits within memory words.
-
31. The processor of claim 28, further comprising:
a butterfly network connected to the at least one memory block for re-ordering bits within memory words.
-
32. The processor of claim 28, further comprising:
a mesh I/O unit coupled to the at least one memory block for identifying a bit as having shifted beyond a sector boundary and transferring the identified bit to a next adjacent sector for a corresponding bit substitution.
-
33. The processor of claim 24, wherein the operations are performed under the control of a host to which the processor is connected.
-
34. The processor of claim 24, wherein the processing node is coupled to a nonvolatile memory device for storing a program and a copy of the program is loaded into the processing node at boot time.
-
35. The processor of claim 24, wherein the processing node includes reprogrammable logic blocks of the sort used in FPGA devices, along with reprogrammable I/O pins, for interfacing with other electronic devices.
-
36. The processor of claim 24, wherein the processing node controls an external memory device used for storing bit-field data and for storing control information.
-
37. The processor of claim 24, wherein the memory has the property that consecutive memory accesses to each of a set of several first bit-aggregates is fastest if each first bit-aggregate of the set of several first bit-aggregates is a part of a single one of the second bit-aggregates.
-
38. The processor of claim 23, wherein the at least one sector has at least two dimensions, and each of the first site-aggregates includes a set of lattice sites that is unsymmetric about every parallel to at least one edge of the at least one sector.
-
39. The processor of claim 23, wherein constituent first bit-aggregates are ordered within each second bit-aggregate, such ordering being reflected in the associated memory addresses, and wherein the grouping and ordering of first bit-aggregates is such that the shifting of the at least one bit-field involves only a cyclic permutation in the order of each set of constituent first bit-aggregates within the corresponding second bit-aggregate.
-
40. A method of performing operations associated with a process occurring in at least one emulated lattice of at least one sector of at least two dimensions having lattice sites therein, the operations being performed by at least one processing node, the at least one processing node associated with the at least one sector and including a memory, comprising:
-
associating each of the lattice sites with data in a data structure;
partitioning the data at the lattice sites into sets of homologous bits, one from each data structure at each lattice site, to form corresponding bit-fields;
partitioning the lattice sites in a shift-invariant manner into groups of lattice sites, each group being unsymmetric about every parallel to at least one edge of the at least one sector, to form a plurality of corresponding site-aggregates, the data in each bit-field being correspondingly partitioned to form bit-aggregates;
storing each bit-aggregate as an addressable unit in the memory; and
shifting data for at least one of the bit-fields within the at least one sector of the emulated lattice by addressing each bit-aggregate in which a portion of the at least one of the bit-fields is stored. - View Dependent Claims (41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55)
processing bit-field data for each of the lattice sites to be updated to transform the value of the associated data structure.
-
-
42. The method of claim 41, wherein processing comprises performing a symbolic operation.
-
43. The method of claim 41, wherein processing comprises performing a numerical operation.
-
44. The method of claim 41, wherein processing comprises:
-
reading from the memory the bit-field data for each lattice site to be updated;
updating the read bit-field data; and
writing the updated bit-field data to the memory.
-
-
45. The method of claim 44, wherein the step of updating occurs after the step of shifting and the bit-field data read from the memory are shifted bit-field data.
-
46. The method of claim 44, wherein the step of updating occurs before the step of shifting and the bit-field data written to the memory are shifted bit-field data.
-
47. The method of claim 40, wherein the at least one sector comprises a plurality of sectors and the operations are performed by an array of processing nodes, each associated with a different one of the sectors in the plurality of sectors and communicating with others of the processing nodes associated with neighboring ones of the sectors in the plurality of sectors.
-
48. The method of claim 47, further comprising:
shifting periodically the bit-field data within each sector of each processing node, whereby the data that shifts past an edge of the sector wraps to the beginning of an opposite edge of the sector, the periodic shifting being performed by memory addressing and by re-ordering bits within the addressed ones of the bit-aggregates.
-
49. The method of claim 48, further comprising:
-
reading by the processing nodes the periodically shifted bit-field data, each accessing data for a one of the site-aggregates to be processed; and
communicating the wrapped data to a nearby one of the processing nodes, the communicated wrapped data being substituted for the wrapped data within the nearby one of the processing nodes to which it is communicated.
-
-
50. The method of claim 49, further comprising:
processing the shifted bit-field data.
-
51. The method of claim 50, wherein processing includes using a table lookup.
-
52. The method of claim 51, wherein each of the processing nodes includes a plurality of processing elements for processing a parallel stream of the bit-field data and the table lookup is shared by all of the processing elements in each processing node.
-
53. The method of claim 52, further comprising:
loading the bit-field data into the shared lookup table so that data from all of the lattice sites in a given one of the sectors in the plurality of sectors can be used to randomly access data belonging to a fixed set of the lattice sites.
-
54. The method of claim 50, wherein the lattice sites that are aggregated within each of the site-aggregates have a uniform spacing relative to each edge of the at least one sector, the difference for any two of the site-aggregates in the respective numbers of lattice sites lying within a given distance of an edge being at most one.
-
55. The method of claim 50, wherein there are at least two emulated lattices, the at least two emulated lattices having unequal numbers of the bit-fields, and wherein shifted bit-field data from the at least two emulated lattices are processed together.
-
56. A processor for performing operations associated with a process occurring in at least one emulated lattice having at least one sector of at least two dimensions having lattice sites therein, the processor comprising:
-
a processing node associated with the at least one sector, the processing node including a memory for storing lattice site data associated with the lattice sites, each of the lattice sites having an associated data structure;
wherein sets of homologous bits, one from each associated data structure at each lattice site, form bit-fields;
wherein a shift-invariant partition of the at least one sector into pluralities of lattice sites forms pluralities of site-aggregates, each site-aggregate being unsymmetric about every parallel to at least one edge of the at least one sector;
wherein a portion of each bit-field associated with each site-aggregate forms a bit-aggregate;
wherein the memory stores each bit-aggregate as an addressable unit; and
wherein the processing node shifts data for at least one of the bit-fields within the at least one sector of the emulated lattice by addressing each bit-aggregate in which each portion of the at least one of the bit-fields is stored. - View Dependent Claims (57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69)
a plurality of the processing nodes, each of the processing nodes in the plurality of processing nodes connected by mesh I/O links to neighboring processing nodes in the plurality of processing nodes to form a mesh array, each of the processing nodes in the plurality of processing nodes being associated with an equal-sized sector of the emulated lattice; and
wherein the performance of the operations is divided among the plurality of the processing nodes.
-
-
63. The processor of claim 61, further comprising:
a barrel shifter connected to the at least one memory block for re-ordering bits within memory words.
-
64. The processor of claim 61, further comprising:
a butterfly network connected to the at least one memory block for re-ordering bits within memory words.
-
65. The processor of claim 61, further comprising:
a mesh I/O unit coupled to the at least one memory block for identifying a bit as having shifted beyond a sector boundary and transferring the identified bit to a next adjacent sector for a corresponding bit substitution.
-
66. The processor of claim 57, wherein the operations are performed under the control of a host to which the processor is connected.
-
67. The processor of claim 57, wherein the processing node is coupled to a nonvolatile memory device for storing a program and a copy of the program is loaded into the processing node at boot time.
-
68. The processor of claim 57, wherein the processing node includes reprogrammable logic blocks of the sort used in FPGA devices, along with reprogrammable I/O pins, for interfacing with other electronic devices.
-
69. The processor of claim 57, wherein the processing node controls an external memory device used for storing bit-field data and for storing control information.
Specification