INSTRUCTION EXECUTION THAT BROADCASTS AND MASKS DATA VALUES AT DIFFERENT LEVELS OF GRANULARITY
Abstract
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and to mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
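The replicate-then-mask behavior summarized in the abstract can be illustrated with a small software model. This is only a sketch, not the patented circuitry: the helper names `replicate` and `apply_mask` are hypothetical, the 512-bit result length (four copies of a 128-bit structure) is an assumption, and the zeroing of masked-out elements follows the claims rather than the abstract.

```python
def replicate(packed, copies):
    """Repeat a packed data structure (a list of element values) `copies` times."""
    return list(packed) * copies


def apply_mask(elements, mask):
    """Keep element i when bit i of `mask` is 1; otherwise write zero (zeroing assumed)."""
    return [value if (mask >> i) & 1 else 0 for i, value in enumerate(elements)]


COPIES = 4  # assumption: 512-bit result built from four 128-bit copies

# First data structure: two values, each twice as wide as those of the second.
first = [0xA, 0xB]
first_result = apply_mask(replicate(first, COPIES), 0b01100101)  # 8 mask bits

# Second data structure: four narrower values, so the mask is twice as fine.
second = [1, 2, 3, 4]
second_result = apply_mask(replicate(second, COPIES), 0b1111000000001010)  # 16 mask bits

print(first_result)
print(second_result)
```

For the same result length, the second structure yields twice as many elements as the first, so its mask needs twice as many bits. That is the sense in which the second granularity is twice as fine.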
31 Claims
1. A system comprising:
  an integrated memory controller unit; and
  a processor core coupled with the integrated memory controller unit, the processor core comprising:
    multiple levels of cache, including a Level 2 (L2) cache,
    a plurality of vector registers,
    a plurality of mask registers,
    a decode unit circuit to decode a first instruction and a second instruction,
      the first instruction having fields to specify a base and an index corresponding to a location in a memory of a first 128-bit packed data structure having two 64-bit elements, having a field to specify a mask register of the plurality of mask registers as a source of a first mask, and having a field to specify a destination register of the plurality of vector registers,
      the second instruction having fields to specify a base and an index corresponding to a location in the memory of a second 128-bit packed data structure having four 32-bit elements, having a field to specify a mask register of the plurality of mask registers as a source of a second mask, and having a field to specify a destination register of the plurality of vector registers, and
    an execution unit circuit coupled with the decode unit circuit, the plurality of vector registers, and the plurality of mask registers,
      the execution unit circuit to perform the first instruction to:
        load at least one 64-bit element of the first 128-bit packed data structure,
        generate a first masked replication data structure from the first 128-bit packed data structure based on applying the first mask at a 64-bit data element granularity, and with zeroed masking where masked out elements are zeroed, and
        store a first result including the first masked replication data structure in the destination register specified by the first instruction, wherein a length of the first masked replication data structure is a multiple of 128-bits and is the same as the destination register specified by the first instruction, and
      the execution unit circuit to perform the second instruction to:
        load at least one 32-bit element of the second 128-bit packed data structure,
        generate a second masked replication data structure from the second 128-bit packed data structure based on applying the second mask at a 32-bit data element granularity, and with the zeroed masking where masked out elements are zeroed, and
        store a second result including the second masked replication data structure in the destination register specified by the second instruction, wherein a length of the second masked replication data structure is a multiple of 128-bits and is the same as the destination register specified by the second instruction.
Dependent claims: 2-19
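The load, replicate, mask and store sequence recited in claim 1 can be modeled in software roughly as follows. This is a hedged sketch only. The function names `load_packed` and `broadcast_masked` are hypothetical, the address is assumed to be simply base plus index (the claim does not specify a scale or displacement), little-endian byte order is assumed, and the sketch always reads all 16 bytes even though the claim only requires loading at least one element.

```python
def load_packed(memory, base, index, elem_bytes):
    """Read a 128-bit packed data structure and split it into elements."""
    addr = base + index  # assumption: no scale factor or displacement
    chunk = memory[addr:addr + 16]  # 128 bits = 16 bytes
    return [int.from_bytes(chunk[i:i + elem_bytes], "little")
            for i in range(0, 16, elem_bytes)]


def broadcast_masked(memory, base, index, mask, elem_bits, dest_bits):
    """Replicate the 128-bit structure to fill the destination, zeroing masked-out elements."""
    if dest_bits % 128 != 0:
        raise ValueError("destination length must be a multiple of 128 bits")
    elements = load_packed(memory, base, index, elem_bits // 8)
    replicated = elements * (dest_bits // 128)
    return [e if (mask >> i) & 1 else 0 for i, e in enumerate(replicated)]


# Toy memory image holding one structure of each kind.
memory = bytearray(64)
memory[32:48] = b"".join(v.to_bytes(8, "little") for v in (0x1111, 0x2222))
memory[48:64] = b"".join(v.to_bytes(4, "little") for v in (1, 2, 3, 4))

# First instruction: 64-bit granularity, 256-bit destination (assumed length).
first_dest = broadcast_masked(memory, base=24, index=8, mask=0b0110,
                              elem_bits=64, dest_bits=256)

# Second instruction: 32-bit granularity, 256-bit destination (assumed length).
second_dest = broadcast_masked(memory, base=48, index=0, mask=0b10000001,
                               elem_bits=32, dest_bits=256)

print([hex(v) for v in first_dest])
print(second_dest)
```

The returned lists stand in for the destination vector register, with position 0 corresponding to the least significant element.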
20. A system comprising:
  an integrated memory controller unit; and
  a processor core coupled with the integrated memory controller unit, the processor core comprising:
    multiple levels of cache, including a Level 2 (L2) cache,
    a plurality of vector registers,
    a plurality of mask registers,
    a decode unit circuit to decode an instruction, the instruction having fields to specify a base and an index corresponding to a location in a memory of a 128-bit packed data structure having two 64-bit elements, having a field to specify a mask register of the plurality of mask registers as a source of a mask, and having a field to specify a destination register of the plurality of vector registers, and
    an execution unit circuit coupled with the decode unit circuit, the plurality of vector registers, and the plurality of mask registers, the execution unit circuit to perform the instruction to:
      load at least one 64-bit element of the 128-bit packed data structure,
      generate a masked replication data structure from the 128-bit packed data structure based on applying the mask at a 64-bit data element granularity, and with zeroed masking where masked out elements are zeroed, and
      store a result including the masked replication data structure in the destination register, wherein a length of the masked replication data structure is a multiple of 128-bits and is the same as the destination register.
Dependent claims: 21-30
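The length condition in claim 20 (a multiple of 128 bits, equal to the destination register length) fixes how many copies of the 128-bit structure are written and how many mask bits are consumed. The arithmetic can be sketched as below; the function name `broadcast_layout` and the example register lengths of 256 and 512 bits are assumptions for illustration, not values taken from the claim.

```python
def broadcast_layout(dest_bits, elem_bits):
    """Return the copy count and the number of mask bits used for a destination length."""
    if dest_bits % 128 != 0:
        raise ValueError("destination length must be a multiple of 128 bits")
    return {
        "copies": dest_bits // 128,             # 128-bit structures written
        "mask_bits": dest_bits // elem_bits,    # one mask bit per data element
    }


# 64-bit granularity, as in claim 20, for two assumed destination lengths.
layout_256 = broadcast_layout(256, 64)
layout_512 = broadcast_layout(512, 64)
print(layout_256)
print(layout_512)
```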
31. A method comprising:
  accessing a memory with an integrated memory controller unit;
  decoding a first instruction having fields specifying a base and an index corresponding to a location in a memory of a first 128-bit packed data structure having two 64-bit elements, having a field specifying a mask register of a plurality of mask registers as a source of a first mask, and having a field specifying a destination register of a plurality of vector registers;
  performing the first instruction, including:
    loading at least one 64-bit element of the first 128-bit packed data structure,
    generating a first masked replication data structure from the first 128-bit packed data structure based on applying the first mask at a 64-bit data element granularity, and with zeroed masking where masked out elements are zeroed, and
    storing a first result including the first masked replication data structure in the destination register specified by the first instruction, wherein a length of the first masked replication data structure is a multiple of 128-bits and is the same as the destination register specified by the first instruction;
  decoding a second instruction having fields specifying a base and an index corresponding to a location in the memory of a second 128-bit packed data structure having four 32-bit elements, having a field specifying a mask register of the plurality of mask registers as a source of a second mask, and having a field specifying a destination register of the plurality of vector registers; and
  performing the second instruction, including:
    loading at least one 32-bit element of the second 128-bit packed data structure,
    generating a second masked replication data structure from the second 128-bit packed data structure based on applying the second mask at a 32-bit data element granularity, and with the zeroed masking where masked out elements are zeroed, and
    storing a second result including the second masked replication data structure in the destination register specified by the second instruction, wherein a length of the second masked replication data structure is a multiple of 128-bits and is the same as the destination register specified by the second instruction.
Specification