Staggering execution of a single packed data instruction using the same circuit

US 6,925,553 B2
Filed: 10/20/2003
Issued: 08/02/2005
Est. Priority Date: 03/31/1998
Status: Expired due to Fees

Abstract

A method and apparatus are disclosed for staggering execution of an instruction. According to one embodiment of the invention, a macro instruction specifying an operation, and specifying a first and a second data operand in first and second registers, respectively, is received. The macro instruction is then split into a first micro instruction and a second micro instruction, the first micro instruction specifying the operation on a first corresponding segment including a first portion of the first data operand and a first portion of the second data operand, and the second micro instruction specifying the operation on a second corresponding segment including a second portion of the first data operand and a second portion of the second data operand. The first and second micro instructions are then executed.
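The split described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `MicroOp`, `split_macro`, `execute`, and `run_macro` are hypothetical and do not come from the patent, operands are modeled as Python lists of four floats, and 32-bit rounding is ignored.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MicroOp:
    """Hypothetical micro instruction: one operation on one operand segment."""
    op: Callable[[float, float], float]  # operation named by the macro instruction
    lo: int                              # index of the first element in the segment
    hi: int                              # one past the last element in the segment


def split_macro(op: Callable[[float, float], float], width: int = 4) -> List[MicroOp]:
    """Split one packed macro instruction into two micro-ops, one per half."""
    half = width // 2
    return [MicroOp(op, 0, half), MicroOp(op, half, width)]


def execute(uop: MicroOp, a: List[float], b: List[float]) -> List[float]:
    """Apply the micro-op to the corresponding segment of both operands."""
    return [uop.op(x, y) for x, y in zip(a[uop.lo:uop.hi], b[uop.lo:uop.hi])]


def run_macro(op: Callable[[float, float], float],
              a: List[float], b: List[float]) -> List[float]:
    """Execute the two micro-ops one after the other and join their results."""
    result: List[float] = []
    for uop in split_macro(op, len(a)):
        result.extend(execute(uop, a, b))
    return result


print(run_macro(operator.add, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

The loop in `run_macro` stands in for "at different times": both micro-ops pass through the same `execute` routine, one after the other, and their results are concatenated into a single four-element result.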

98 Citations

26 Claims

1. A method comprising:
- receiving a single macro instruction specifying at least two logical registers, wherein the two logical registers respectively store first and second 128-bit packed data operands, each of the first and second 128-bit packed data operands have four 32-bit single precision floating point data elements; and
  
  independently performing an operation specified by the single macro instruction on a first and a second plurality of corresponding ones of the 32-bit single precision floating point data elements of the first and second 128-bit packed data operands, at different times, using the same circuit, to independently generate a first and a second plurality of resulting data elements, wherein the first and the second plurality of resulting data elements are stored in a single logical register as a third 128-bit packed data operand.
- - 2. The method of claim 1, wherein said independently performing comprises:
    - accessing full widths of the first and second 128-bit packed data operands;
      
      splitting the full widths of each of the first and second 128-bit packed data operands into lower and higher halves that each include two 32-bit single precision floating point data elements; and
      
      delaying the higher halves of each of the first and second 128-bit packed data operands.
  - 3. The method of claim 1, wherein said independently performing comprises:
    - converting the single macro instruction into a first micro instruction and a second micro instruction;
      
      accessing lower halves of the first and second 128-bit packed data operands and performing an operation specified by the first micro instruction on each of two corresponding pairs of the 32-bit single precision floating point data elements of the lower halves; and
      
      accessing higher halves of the first and second 128-bit packed data operands and performing the operation specified by the second micro instruction on each of two corresponding pairs of the 32-bit single precision floating point data elements of the higher halves.

4. An apparatus comprising:
- a register file to contain first and second 128-bit packed data operands, the first and the second 128-bit packed data operands including four pairs of corresponding 32-bit single precision floating point data elements; and
  
  a circuit coupled to the register file, the circuit in response to a single packed data instruction specifying an operation to:
  
  retrieve the corresponding data elements from the register file;
  
  execute the operation in an execution unit on a lower order two of the four pairs of corresponding 32-bit single precision floating point data elements to output a first result including two 32-bit single precision floating point data elements;
  
  at a different time, execute the operation in the execution unit on a higher order two of the four pairs of corresponding 32-bit single precision floating point data elements to output a second result including two 32-bit single precision floating point data elements; and
  
  store the first and the second results in the register file as a third 128-bit packed data operand.
- - 5. The apparatus of claim 4, further comprising:
    - ports of the circuit to receive full widths of the first and the second 128-bit packed data operands;
      
      logic to divide the full widths of each of the first and the second 128-bit packed data operands into the lower and the higher order two; and
      
      delay elements coupled with the logic to delay the higher order two.
  - 6. The apparatus of claim 4, further comprising a decoder to convert the single packed data instruction into a first micro instruction that causes the lower order two to be accessed from the register file and a second micro instruction that causes the higher order two to be accessed from the register file.

7. A method comprising:
- receiving a first packed data instruction, the first packed data instruction specifying logical registers in a 128-bit logical register file of a processor storing a first 128-bit packed data operand and a second 128-bit packed data operand, the first packed data instruction also specifying an operation to be performed on corresponding 32-bit single precision floating point data elements of the first and the second 128-bit packed data operands, each of the 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half including two of the 32-bit single precision floating point data elements; and
  
  performing the operation specified by the first packed data instruction on the corresponding 32-bit single precision floating point data elements of the lower order halves of the first and the second 128-bit packed data operands using a circuit; and
  
  at a different time, performing the operation specified by the first packed data instruction on the corresponding 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands using the circuit.
- - 8. The method of claim 7, further comprising:
    - receiving a second packed data instruction, the second packed data instruction specifying logical registers in the 128-bit logical register file of the processor storing a third 128-bit packed data operand and a fourth 128-bit packed data operand, the second packed data instruction also specifying an operation to be performed on corresponding 64-bit data elements of the third and the fourth 128-bit packed data operands, each of the third and the fourth 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half of the third and the fourth 128-bit packed data operands including one of the 64-bit data elements; and
      
      performing the operation specified by the second packed data instruction on the corresponding 64-bit data elements of the lower order halves of the third and the fourth 128-bit packed data operands using the circuit; and
      
      at a different time performing the operation specified by the second packed data instruction on the corresponding 64-bit data elements of the higher order halves of the third and the fourth 128-bit packed data operands using the circuit.
  - 9. The method of claim 7:
    - wherein the circuit comprises a 64-bit execution unit; and
      
      wherein the operation is one of ADD and MULTIPLY.
  - 10. The method of claim 7, further comprising:
    - converting the first packed data instruction into at least a first micro instruction and a second micro instruction;
      
      accessing the lower order halves of the first and the second 128-bit packed data operands from the logical registers responsive to the first micro instruction; and
      
      accessing the higher order halves of the first and the second 128-bit packed data operands from the logical registers responsive to the second micro instruction.
  - 11. The method of claim 7, further comprising, prior to said performing the operation, accessing full widths of the first and the second 128-bit packed data operands from the logical registers.
  - 12. The method of claim 7, wherein the operation of the first packed data instruction comprises an ADD operation, and further comprising scheduling a second packed data instruction, which specifies a MULTIPLY operation, out-of-order so that it follows the first packed data instruction.

13. An apparatus comprising:
- a plurality of physical registers to operate as a 128-bit logical register file of a processor;
  
  a decoder to receive and decode instructions including a packed data instruction that specifies a first 128-bit packed data operand and a second 128-bit packed data operand by specifying logical registers in the 128-bit logical register file, each of the 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half including two 32-bit single precision floating point data elements, the packed data instruction also specifying an operation to be performed on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 128-bit packed data operands; and
  
  an execution unit, coupled with the decoder, to execute the packed data instruction to generate a 128-bit packed data result operand by performance of the operation specified by the packed data instruction on the corresponding ones of the 32-bit single precision floating point data elements in the lower order halves of the first and the second 128-bit packed data operands, and, at a different time, performance of the operation specified by the packed data instruction on the corresponding ones of the 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands.
- - 14. The apparatus of claim 13:
    - wherein the decoder is to receive and decode a second packed data instruction that specifies a third 128-bit packed data operand and a fourth 128-bit packed data operand by specifying logical registers in the 128-bit logical register file, each of the third and the fourth 128-bit packed data operands including a lower order half and a higher order half, each of the lower order half and the higher order half of the third and the fourth 128-bit packed data operands including a 64-bit data element, the second packed data instruction also specifying an operation to be performed on corresponding ones of the 64-bit data elements of the third and the fourth 128-bit packed data operands; and
      
      wherein the execution unit is to execute the second packed data instruction to generate a second 128-bit packed data result operand by performance of the operation specified by the second packed data instruction on the corresponding ones of the 64-bit single precision floating point data elements in the lower order halves of the third and fourth 128-bit packed data operands, and, at a different time, performance of the operation specified by the second packed data instruction on the corresponding ones of the 64-bit single precision floating point data elements of the higher order halves of the third and the fourth 128-bit packed data operands.
  - 15. The apparatus of claim 13:
    - wherein the execution unit comprises a 64-bit execution unit;
      
      wherein the operation is one of ADD and MULTIPLY; and
      
      wherein the 128-bit packed data result operand is to be stored over the first 128-bit packed data operand.
  - 16. The apparatus of claim 13, wherein the decoder is to convert the packed data instruction into at least a first micro instruction and a second micro instruction, wherein the first micro instruction causes the lower order halves of the first and the second 128-bit packed data operands to be accessed from the logical registers, and wherein the second micro instruction causes the higher order halves of the first and the second 128-bit packed data operands to be accessed from the logical registers.
  - 17. The apparatus of claim 13, wherein the execution unit comprises logic to access full widths of the first and the second 128-bit packed data operands from the logical registers, and wherein the execution unit comprises delay elements to delay the higher order halves of the first and the second 128-bit packed data operands while performing the operation on the lower order halves of the first and the second 128-bit packed data operands.
  - 18. The apparatus of claim 13, further comprising a scheduling unit, coupled between the decoder and the execution unit, to try to schedule packed data instructions that specify ADD and MULTIPLY operations on 128-bit single precision floating point operands out-of-order so that they alternate.

19. A method comprising:
- receiving a single macro instruction that specifies an operation to be independently performed on corresponding 32-bit single precision floating point data elements from a first 128-bit packed data operand and a second 128-bit packed data operand to generate a third 128-bit packed data operand;
  
  performing the operations on lower halves of the first and second 128-bit packed data operands at a different time than on upper halves of the first and second 128-bit packed data operands using the same hardware; and
  
  storing results of the four operations as four 32-bit single precision floating point packed data elements of the third 128-bit packed data operand.

20. A method comprising:
- receiving a packed data instruction specifying logical registers storing a first 128-bit packed data operand and a second 128-bit packed data operand;
  
  accessing full widths of the first 128-bit packed data operand and the second 128-bit packed data operand from the logical registers;
  
  splitting the full width of the first 128-bit packed data operand into a first 64-bit lower order segment and a first 64-bit higher order segment, the first 64-bit lower order segment and the first 64-bit higher order segment each having two 32-bit single precision floating point data elements;
  
  splitting the full width of the second 128-bit packed data operand into a second 64-bit lower order segment and a second 64-bit higher order segment, the second 64-bit lower order segment and the second 64-bit higher order segment each having two 32-bit single precision floating point data elements;
  
  generating a first result by using a circuit to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first 64-bit lower order segment and the second 64-bit lower order segment; and
  
  at a different time, generating a second result by using the circuit to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first 64-bit higher order segment and the second 64-bit higher order segment.
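The access, split, and staggered-execution steps of claim 20 can be sketched at the byte level. This is an illustrative model only: `pack4`, `split_halves`, `circuit_64`, and `staggered` are hypothetical names, a little-endian layout is assumed so that the lower order segment occupies the first eight bytes, and the "delay" is modeled simply by running the higher order segments second.

```python
import operator
import struct


def pack4(values):
    """Pack four floats into a 16-byte (128-bit) little-endian operand."""
    return struct.pack("<4f", *values)


def split_halves(operand):
    """Split a full-width 128-bit operand into 64-bit lower and higher segments."""
    assert len(operand) == 16
    return operand[:8], operand[8:]


def circuit_64(op, seg_a, seg_b):
    """Hypothetical 64-bit circuit: applies op to two pairs of 32-bit floats."""
    a = struct.unpack("<2f", seg_a)
    b = struct.unpack("<2f", seg_b)
    return struct.pack("<2f", *(op(x, y) for x, y in zip(a, b)))


def staggered(op, operand_a, operand_b):
    """Run the same 64-bit circuit twice, lower segments first, then higher."""
    a_lo, a_hi = split_halves(operand_a)
    b_lo, b_hi = split_halves(operand_b)
    first = circuit_64(op, a_lo, b_lo)    # first time slot: lower order segments
    second = circuit_64(op, a_hi, b_hi)   # later time slot: held-back higher segments
    return first + second                 # reassembled 128-bit result


result = staggered(operator.mul,
                   pack4([1.0, 2.0, 3.0, 4.0]),
                   pack4([2.0, 2.0, 2.0, 2.0]))
print(struct.unpack("<4f", result))
```

Because `circuit_64` only ever sees eight bytes per operand, the model makes the point of the claim concrete: a 128-bit packed operation is completed on hardware that is half as wide, at the cost of a second pass.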

21. An apparatus comprising:
- a first port to receive a full width of a first 128-bit packed data operand from a logical register specified by a packed data instruction, the first 128-bit packed data operand having four 32-bit single precision floating point data elements;
  
  a second port to receive a full width of a second 128-bit packed data operand from a logical register specified by the packed data instruction, the second 128-bit packed data operand also having four 32-bit single precision floating point data elements;
  
  a first circuit coupled with the first port to receive the full width of the first 128-bit packed data operand and to split the first 128-bit packed data operand into a first 64-bit lower order half and a first 64-bit higher order half;
  
  a second circuit coupled with the second port to receive the full width of the second 128-bit packed data operand and to split the second 128-bit packed data operand into a second 64-bit lower order half and a second 64-bit higher order half;
  
  a first delay element coupled with the first circuit to receive and delay the first 64-bit higher order half;
  
  a second delay element coupled with the second circuit to receive and delay the second 64-bit higher order half;
  
  an execution unit coupled with the first circuit, the second circuit, the first delay element, and the second delay element, the execution unit to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 64-bit lower order halves, and, after a delay to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the first and the second 64-bit higher order halves.

22. A method comprising:
- receiving a packed data instruction specifying logical registers having a first 128-bit packed data operand having four 32-bit single precision floating point data elements and a second 128-bit packed data operand having four 32-bit single precision floating point data elements;
  
  converting the packed data instruction into a first micro instruction and a second micro instruction;
  
  retrieving lower order halves of the first and the second 128-bit packed data operands specified by the first micro instruction, each of the lower order halves having two 32-bit single precision floating point data elements;
  
  generating a first result by using hardware to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the lower order halves of the first and the second 128-bit packed data operands;
  
  retrieving higher order halves of the first and the second 128-bit packed data operands specified by the second micro instruction, each of the higher order halves having two 32-bit single precision floating point data elements;
  
  generating a second result by using the hardware to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements of the higher order halves of the first and the second 128-bit packed data operands;
  
  writing the first and the second results to a 128-bit logical register specified by the packed data instruction.

23. An apparatus comprising:
- a decoder to receive a packed data instruction, the packed data instruction specifying logical registers having a first 128-bit packed data operand and a second 128-bit packed data operand, each of the first and the second 128-bit packed data operands having four 32-bit single precision floating point data elements, the decoder to convert the packed data instruction into at least a first micro instruction and a second micro instruction;
  
  an execution unit coupled with the decoder to execute the first micro instruction, and, at a different time, to execute the second micro instruction, in the execution of the first micro instruction the execution unit to retrieve lower order halves of the first and the second 128-bit packed data operands and to perform an operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in the lower order halves, and in the execution of the second micro instruction the execution unit to retrieve higher order halves of the first and the second 128-bit packed data operands and to perform the operation specified by the packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in the higher order halves.

24. A method comprising:
- receiving a first packed data instruction specifying logical registers that respectively store a first 128-bit packed data operand and a second 128-bit packed data operand, each of the first and the second 128-bit packed data operands having four 32-bit single precision floating point data elements, the first packed data instruction specifying an ADD operation on the first and the second 128-bit packed data operands;
  
  receiving a second packed data instruction specifying logical registers that respectively store a third 128-bit packed data operand and a fourth 128-bit packed data operand, each of the third and the fourth 128-bit packed data operands having four 32-bit single precision floating point data elements, the second packed data instruction specifying a MULTIPLY operation on the third and the fourth 128-bit packed data operands;
  
  at a first time, performing the ADD operation specified by the first packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in lower order halves of the first 128-bit packed data operand and the second 128-bit packed data operand using an ADD execution unit;
  
  at a second time, which is different than the first time, performing the ADD operation specified by the first packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in higher order halves of the first 128-bit packed data operand and the second 128-bit packed data operand using the ADD execution unit;
  
  at the second time, performing the MULTIPLY operation specified by the second packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in lower order halves of the third 128-bit packed data operand and the fourth 128-bit packed data operand using a MULTIPLY execution unit; and
  
  at a third time, which is different than the second time, performing the MULTIPLY operation specified by the second packed data instruction on corresponding ones of the 32-bit single precision floating point data elements in higher order halves of the third 128-bit packed data operand and the fourth 128-bit packed data operand using the MULTIPLY execution unit.
- - 25. The method of claim 24, further comprising executing code having instructions with operations arranged in the pattern ADD, MULTIPLY, ADD, MULTIPLY.
  - 26. The method of claim 24, further comprising scheduling packed data instructions out-of-order so that packed data instructions specifying MULTIPLY operations are interleaved with packed data instructions specifying ADD operations.
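The timing recited in claims 24 to 26 can be explored with a toy scheduling model. The assumptions here are not taken from the patent: one instruction may issue per time slot, in order; each execution unit is half width and is therefore busy for two consecutive slots per instruction; and `schedule` and `total_slots` are hypothetical helper names.

```python
def schedule(ops):
    """Return (time, unit, instruction index, half) tuples for a stream of ops.

    Each op name doubles as the name of the execution unit that handles it.
    """
    free_at = {}    # unit name -> first time slot at which that unit is free
    timeline = []
    t = 0           # earliest issue slot for the next instruction
    for i, unit in enumerate(ops):
        start = max(t, free_at.get(unit, 0))   # stall while the unit is busy
        timeline.append((start, unit, i, "lower"))
        timeline.append((start + 1, unit, i, "higher"))
        free_at[unit] = start + 2
        t = start + 1
    return timeline


def total_slots(ops):
    """Number of time slots needed to finish every half of every instruction."""
    return max(time for time, *_ in schedule(ops)) + 1


for entry in schedule(["ADD", "MULTIPLY"]):
    print(entry)
print(total_slots(["ADD", "MULTIPLY", "ADD", "MULTIPLY"]))
print(total_slots(["ADD", "ADD", "MULTIPLY", "MULTIPLY"]))
```

For the two-instruction case the model reproduces the three time slots of claim 24: the ADD unit works on the higher order halves in the same slot in which the MULTIPLY unit starts on its lower order halves. Under these assumptions an alternating stream finishes in fewer slots than a grouped one, which is the motivation for the interleaving in claims 25 and 26.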

Current Assignee
Intel Corporation
Original Assignee
Intel Corporation
Inventors
Thakkar, Shreekant S.; Roussel, Patrice; Hinton, Glenn J.; Boswell, Brent R.; Menezes, Karol F.
Primary Examiner(s)
Coleman, Eric

Application Number

US10/689,291
Publication Number

US 20040083353A1
Time in Patent Office

652 Days
Field of Search

712/245, 712/22, 712/300
US Class Current

712/245
CPC Class Codes

G06F 9/30014   with variable precision

G06F 9/30036   Instructions to perform ope...

G06F 9/3826   Bypassing or forwarding of ...

G06F 9/3828   with global bypass, e.g. be...

G06F 9/3875   Pipelining a single stage, ...

G06F 9/3885   using a plurality of indepe...
