Staggering execution of a single packed data instruction using the same circuit
Abstract
A method and apparatus are disclosed for staggering execution of an instruction. According to one embodiment of the invention, a macro instruction specifying an operation, and specifying a first and a second data operand in first and second registers, respectively, is received. The macro instruction is then split into a first micro instruction and a second micro instruction, the first micro instruction specifying the operation on a first corresponding segment including a first portion of the first data operand and a first portion of the second data operand, and the second micro instruction specifying the operation on a second corresponding segment including a second portion of the first data operand and a second portion of the second data operand. The first and second micro instructions are then executed.
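The splitting step described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `MicroOp`, `split_macro`, and `execute` are hypothetical and do not come from the patent, and a Python loop stands in for hardware that would process each segment in its own pass.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical micro instruction: one operation over one segment."""
    op: Callable[[float, float], float]
    lanes: Tuple[int, ...]  # indexes of the data elements in this segment


def split_macro(op: Callable[[float, float], float], width: int = 4) -> List[MicroOp]:
    """Split one macro instruction into two micro instructions,
    one per half of the operand (the first and second segments)."""
    half = width // 2
    return [
        MicroOp(op, tuple(range(0, half))),
        MicroOp(op, tuple(range(half, width))),
    ]


def execute(micro_ops: Sequence[MicroOp],
            a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Run the micro instructions one after another and collect
    the per-element results into a single result operand."""
    result: List[float] = [0.0] * len(a)
    for uop in micro_ops:          # each iteration models a separate pass
        for i in uop.lanes:
            result[i] = uop.op(a[i], b[i])
    return result


micro_ops = split_macro(operator.add)
total = execute(micro_ops, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

The model keeps the operation identical across both micro instructions and changes only which elements each one covers, which mirrors the abstract's "first corresponding segment" and "second corresponding segment".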
98 Citations
26 Claims
-
1. A method comprising:
-
receiving a single macro instruction specifying at least two logical registers, wherein the two logical registers respectively store first and second 128-bit packed data operands, each of the first and second 128-bit packed data operands have four 32-bit single precision floating point data elements; and
independently performing an operation specified by the single macro instruction on a first and a second plurality of corresponding ones of the 32-bit single precision floating point data elements of the first and second 128-bit packed data operands, at different times, using the same circuit, to independently generate a first and a second plurality of resulting data elements, wherein the first and the second plurality of resulting data elements are stored in a single logical register as a third 128-bit packed data operand. - View Dependent Claims (2, 3)
-
-
4. An apparatus comprising:
-
a register file to contain first and second 128-bit packed data operands, the first and the second 128-bit packed data operands including four pairs of corresponding 32-bit single precision floating point data elements; and
a circuit coupled to the register file, the circuit in response to a single packed data instruction specifying an operation to:
retrieve the corresponding data elements from the register file;
execute the operation in an execution unit on a lower order two of the four pairs of corresponding 32-bit single precision floating point data elements to output a first result including two 32-bit single precision floating point data elements;
at a different time, execute the operation in the execution unit on a higher order two of the four pairs of corresponding 32-bit single precision floating point data elements to output a second result including two 32-bit single precision floating point data elements; and
store the first and the second results in the register file as a third 128-bit packed data operand. - View Dependent Claims (5, 6)
-
-
7. A method comprising:
-
receiving a first packed data instruction, the first packed data instruction specifying logical registers in a 128-bit logical register file of a processor storing a first 128-bit packed data operand and a second 128-bit packed data operand, the first packed data instruction also specifying an operation to be performed on corresponding 32-bit single precision floating point data elements of the first and the second 128-bit packed data operands, each of the 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half including two of the 32-bit single precision floating point data elements; and
performing the operation specified by the first packed data instruction on the corresponding 32-bit single precision floating point data elements of the lower order halves of the first and the second 128-bit packed data operands using a circuit; and
at a different time, performing the operation specified by the first packed data instruction on the corresponding 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands using the circuit. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. An apparatus comprising:
-
a plurality of physical registers to operate as a 128-bit logical register file of a processor;
a decoder to receive and decode instructions including a packed data instruction that specifies a first 128-bit packed data operand and a second 128-bit packed data operand by specifying logical registers in the 128-bit logical register file, each of the 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half including two 32-bit single precision floating point data elements, the packed data instruction also specifying an operation to be performed on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 128-bit packed data operands; and
an execution unit, coupled with the decoder, to execute the packed data instruction to generate a 128-bit packed data result operand by performance of the operation specified by the packed data instruction on the corresponding ones of the 32-bit single precision floating point data elements in the lower order halves of the first and the second 128-bit packed data operands, and, at a different time, performance of the operation specified by the packed data instruction on the corresponding ones of the 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands. - View Dependent Claims (14, 15, 16, 17, 18)
-
-
19. A method comprising:
-
receiving a single macro instruction that specifies an operation to be independently performed on corresponding 32-bit single precision floating point data elements from a first 128-bit packed data operand and a second 128-bit packed data operand to generate a third 128-bit packed data operand;
performing the operations on lower halves of the first and second 128-bit packed data operands at a different time than on upper halves of the first and second 128-bit packed data operands using the same hardware; and
storing results of the four operations as four 32-bit single precision floating point packed data elements of the third 128-bit packed data operand.
-
-
20. A method comprising:
-
receiving a packed data instruction specifying logical registers storing a first 128-bit packed data operand and a second 128-bit packed data operand;
accessing full widths of the first 128-bit packed data operand and the second 128-bit packed data operand from the logical registers;
splitting the full width of the first 128-bit packed data operand into a first 64-bit lower order segment and a first 64-bit higher order segment, the first 64-bit lower order segment and the first 64-bit higher order segment each having two 32-bit single precision floating point data elements;
splitting the full width of the second 128-bit packed data operand into a second 64-bit lower order segment and a second 64-bit higher order segment, the second 64-bit lower order segment and the second 64-bit higher order segment each having two 32-bit single precision floating point data elements;
generating a first result by using a circuit to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first 64-bit lower order segment and the second 64-bit lower order segment; and
at a different time, generating a second result by using the circuit to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first 64-bit higher order segment and the second 64-bit higher order segment.
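As a rough illustration of the steps in claim 20, the sketch below reads two full 128-bit operands, splits each into 64-bit segments, and passes the lower and higher segments through the same half-width function in turn. Several details are assumptions made for the example and are not stated in the claim: the little-endian byte layout with element 0 in the lowest bytes, the helper names `pack128`, `split128`, `half_width_circuit`, and `staggered`, and the final concatenation of the two results into one 128-bit value.

```python
import operator
import struct
from typing import Callable, Sequence, Tuple


def pack128(elems: Sequence[float]) -> bytes:
    """Pack four single precision values into 16 bytes (assumed little-endian)."""
    return struct.pack("<4f", *elems)


def split128(operand: bytes) -> Tuple[bytes, bytes]:
    """Split a full-width 128-bit operand into lower and higher 64-bit segments."""
    if len(operand) != 16:
        raise ValueError("expected a 128-bit (16-byte) operand")
    return operand[:8], operand[8:]


def half_width_circuit(op: Callable[[float, float], float],
                       seg_a: bytes, seg_b: bytes) -> bytes:
    """Model of a 64-bit-wide circuit: two element-wise operations per pass."""
    a = struct.unpack("<2f", seg_a)
    b = struct.unpack("<2f", seg_b)
    return struct.pack("<2f", *(op(x, y) for x, y in zip(a, b)))


def staggered(op: Callable[[float, float], float],
              a128: bytes, b128: bytes) -> bytes:
    a_lo, a_hi = split128(a128)
    b_lo, b_hi = split128(b128)
    first = half_width_circuit(op, a_lo, b_lo)    # first result, earlier pass
    second = half_width_circuit(op, a_hi, b_hi)   # second result, later pass
    return first + second                         # assumed: results rejoined


out = staggered(operator.add,
                pack128([1.0, 2.0, 3.0, 4.0]),
                pack128([0.5, 0.5, 0.5, 0.5]))
```

Calling `struct.unpack("<4f", out)` recovers the four result elements; the same `half_width_circuit` is invoked twice, which is the software analogue of reusing one circuit at different times.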
-
-
21. An apparatus comprising:
-
a first port to receive a full width of a first 128-bit packed data operand from a logical register specified by a packed data instruction, the first 128-bit packed data operand having four 32-bit single precision floating point data elements;
a second port to receive a full width of a second 128-bit packed data operand from a logical register specified by the packed data instruction, the second 128-bit packed data operand also having four 32-bit single precision floating point data elements;
a first circuit coupled with the first port to receive the full width of the first 128-bit packed data operand and to split the first 128-bit packed data operand into a first 64-bit lower order half and a first 64-bit higher order half;
a second circuit coupled with the second port to receive the full width of the second 128-bit packed data operand and to split the second 128-bit packed data operand into a second 64-bit lower order half and a second 64-bit higher order half;
a first delay element coupled with the first circuit to receive and delay the first 64-bit higher order half;
a second delay element coupled with the second circuit to receive and delay the second 64-bit higher order half;
an execution unit coupled with the first circuit, the second circuit, the first delay element, and the second delay element, the execution unit to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 64-bit lower order halves, and, after a delay to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 64-bit higher order halves.
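The role of the delay elements in claim 21 can be pictured with a small cycle-by-cycle model. This is a sketch under assumptions: the claim says only that the higher order halves are operated on "after a delay", so the one-cycle delay used here, along with the names `DelayElement` and `run`, is hypothetical.

```python
import operator
from typing import Callable, List, Optional, Sequence, Tuple


class DelayElement:
    """Holds a value for one tick, then releases it (assumed one-cycle delay)."""

    def __init__(self) -> None:
        self._held: Optional[Sequence[float]] = None

    def tick(self, incoming: Optional[Sequence[float]] = None) -> Optional[Sequence[float]]:
        released, self._held = self._held, incoming
        return released


def run(op: Callable[[float, float], float],
        a: Sequence[float], b: Sequence[float]) -> List[Tuple[int, str, List[float]]]:
    delay_a, delay_b = DelayElement(), DelayElement()
    trace: List[Tuple[int, str, List[float]]] = []

    # Cycle 0: operands are split; lower halves go straight to the
    # execution unit while higher halves enter the delay elements.
    lo_a, hi_a = a[:2], a[2:]
    lo_b, hi_b = b[:2], b[2:]
    delay_a.tick(hi_a)
    delay_b.tick(hi_b)
    trace.append((0, "lower", [op(x, y) for x, y in zip(lo_a, lo_b)]))

    # Cycle 1: the delay elements release the higher halves to the
    # same execution unit.
    d_a, d_b = delay_a.tick(), delay_b.tick()
    trace.append((1, "higher", [op(x, y) for x, y in zip(d_a, d_b)]))
    return trace


trace = run(operator.mul, [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0])
```

Each trace entry records the cycle number, which half was processed, and the two results produced in that cycle, so the lower-half result always appears one cycle before the higher-half result.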
-
-
22. A method comprising:
-
receiving a packed data instruction specifying logical registers having a first 128-bit packed data operand having four 32-bit single precision floating point data elements and a second 128-bit packed data operand having four 32-bit single precision floating point data elements;
converting the packed data instruction into a first micro instruction and a second micro instruction;
retrieving lower order halves of the first and the second 128-bit packed data operands specified by the first micro instruction, each of the lower order halves having two 32-bit single precision floating point data elements;
generating a first result by using hardware to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the lower order halves of the first and the second 128-bit packed data operands;
retrieving higher order halves of the first and the second 128-bit packed data operands specified by the second micro instruction, each of the higher order halves having two 32-bit single precision floating point data elements;
generating a second result by using the hardware to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands;
writing the first and the second results to a 128-bit logical register specified by the packed data instruction.
-
-
23. An apparatus comprising:
-
a decoder to receive a packed data instruction, the packed data instruction specifying logical registers having a first 128-bit packed data operand and a second 128-bit packed data operand, each of the first and the second 128-bit packed data operands having four 32-bit single precision floating point data elements, the decoder to convert the packed data instruction into at least a first micro instruction and a second micro instruction;
an execution unit coupled with the decoder to execute the first micro instruction, and, at a different time, to execute the second micro instruction, in the execution of the first micro instruction the execution unit to retrieve lower order halves of the first and the second 128-bit packed data operands and to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in the lower order halves, and in the execution of the second micro instruction the execution unit to retrieve higher order halves of the first and the second 128-bit packed data operands and to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in the higher order halves.
-
-
24. A method comprising:
-
receiving a first packed data instruction specifying logical registers that respectively store a first 128-bit packed data operand and a second 128-bit packed data operand, each of the first and the second 128-bit packed data operands having four 32-bit single precision floating point data elements, the first packed data instruction specifying an ADD operation on the first and the second 128-bit packed data operands;
receiving a second packed data instruction specifying logical registers that respectively store a third 128-bit packed data operand and a fourth 128-bit packed data operand, each of the third and the fourth 128-bit packed data operands having four 32-bit single precision floating point data elements, the second packed data instruction specifying a MULTIPLY operation on the third and the fourth 128-bit packed data operands;
at a first time, performing the ADD operation specified by the first packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in lower order halves of the first 128-bit packed data operand and the second 128-bit packed data operand using an ADD execution unit;
at a second time, which is different than the first time, performing the ADD operation specified by the first packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in higher order halves of the first 128-bit packed data operand and the second 128-bit packed data operand using the ADD execution unit;
at the second time, performing the MULTIPLY operation specified by the second packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in lower order halves of the third 128-bit packed data operand and the fourth 128-bit packed data operand using a MULTIPLY execution unit; and
at a third time, which is different than the second time, performing the MULTIPLY operation specified by the second packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in higher order halves of the third 128-bit packed data operand and the fourth 128-bit packed data operand using the MULTIPLY execution unit. - View Dependent Claims (25, 26)
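The timing relationship in claim 24 is easier to see as a timeline: the ADD unit handles the lower halves at the first time and the higher halves at the second time, while the MULTIPLY unit starts its lower halves at the second time and finishes at the third. The sketch below builds that timeline; the function name `schedule`, the unit labels, and the assumption that the two halves occupy consecutive time steps are illustrative choices, not language from the claim.

```python
from typing import Dict, List, Sequence, Tuple

Slot = Tuple[str, str, str]  # (execution unit, instruction, which half)


def schedule(instrs: Sequence[Tuple[str, str, int]]) -> Dict[int, List[Slot]]:
    """Build a time -> work map for staggered packed instructions.

    Each instruction is (name, unit, start_time). Its lower halves run at
    start_time and its higher halves at start_time + 1 (assumed spacing).
    """
    timeline: Dict[int, List[Slot]] = {}
    for name, unit, start in instrs:
        timeline.setdefault(start, []).append((unit, name, "lower"))
        timeline.setdefault(start + 1, []).append((unit, name, "higher"))
    return dict(sorted(timeline.items()))


timeline = schedule([
    ("ADD", "ADD unit", 1),            # first packed data instruction
    ("MULTIPLY", "MULTIPLY unit", 2),  # second packed data instruction
])
for t, slots in timeline.items():
    print(t, slots)
```

In this model the second time step is the only one in which both execution units are busy, each working on a different instruction and a different half.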
-
Specification