Method for fast large-integer arithmetic on IA processors

US 9,292,283 B2
Filed: 12/06/2012
Issued: 03/22/2016
Est. Priority Date: 07/11/2012
Status: Expired due to Fees

Abstract

Methods, systems, and apparatuses are disclosed for implementing fast large-integer arithmetic within an integrated circuit, such as on IA (Intel Architecture) processors, in which such means include receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits and performing a 512-bit squaring algorithm by: (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, (iv) adding all sub-elements within their respective columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once. Other related embodiments are disclosed.
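
The decomposition in the abstract can be checked numerically. The following is a minimal sketch in Python, not the patented register-level routine: it splits the input into eight 64-bit sub-elements, forms T1 from the eight squares and T2 from the 28 cross products, and returns T1 + 2*T2. The names `split_limbs` and `square_512` are hypothetical, and the weighting of each product by its limb position is an assumption implied by ordinary positional arithmetic.

```python
MASK64 = (1 << 64) - 1

def split_limbs(x, n=8):
    """Split x into n 64-bit sub-elements, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]

def square_512(x):
    """Square a 512-bit integer as T1 + 2*T2 (illustrative helper)."""
    a = split_limbs(x)
    # T1: the eight squares a[i]*a[i], each weighted by 2^(128*i).
    t1 = sum((a[i] * a[i]) << (128 * i) for i in range(8))
    # T2: the 28 cross products a[i]*a[j] with i < j, weighted by 2^(64*(i+j)).
    t2 = sum((a[i] * a[j]) << (64 * (i + j))
             for i in range(8) for j in range(i + 1, 8))
    # Every cross product appears twice in the full square, hence 2*T2.
    return t1 + 2 * t2
```

Python integers are unbounded, so the carries that the claims manage explicitly are handled implicitly here; the returned square can be up to 1024 bits wide.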

17 Citations

30 Claims

1. A method in an integrated circuit, the integrated circuit having a plurality of registers for storing operands, the method comprising:
- receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits;
  
  performing a 512-bit squaring algorithm by:
  
  (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,
  
  (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,
  
  (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;
  
  (iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and
  
  (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and
  
  wherein the (iv) adding all sub-elements within their respective columns, further includes:
  
  (a) adding a first of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch,
  
  (b) adding a second and a third of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and
  
  (c) adding a fourth of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load.
- - 2. The method of claim 1, wherein the (iii) reorganizing constitutes a change in the timing and sequencing of performing additions for the 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns without creating a structure of a different shape or organization through a change in timing of performing additions or a change in sequencing of performing additions, or both.
  - 3. The method of claim 1:
    - wherein the asymmetric intermediate result having the seven diagonals therein, each of a different length, requires seven overhead carries within a non-optimized 512-bit squaring algorithm; and
      
      wherein the symmetric intermediate result having the four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, requires four overhead carries within an optimized 512-bit squaring algorithm.
  - 4. The method of claim 1, wherein the asymmetric intermediate result having the seven diagonals therein comprises an asymmetric triangular shaped structure in which each of the seven diagonals are of a different length.
  - 5. The method of claim 1, wherein multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, comprises:
    - each of the seven diagonals offset by 64-bits such that a first of the seven diagonals is a 7×1 diagonal in length, then a second is a 6×1 diagonal in length, then a third is a 5×1 diagonal in length, then a fourth is a 4×1 diagonal in length, then a fifth is a 3×1 diagonal in length, then a sixth is a 2×1 diagonal in length, and then a seventh is a 1×1 diagonal in length.
  - 6. The method of claim 1, wherein (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once, comprises performing a computation according to one of:
    - computing T1+(2*T2); or
      
      computing T1+T2+T2.
  - 7. The method of claim 1, wherein (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once, comprises:
    - computing T1+T2+T2 using an adcx operation for a first of the two additions and an adox operation for a second of the two additions;
      
      wherein the adcx operation constitutes an add with carry operation having an extension utilizing a Carry Flag (CF flag); and
      
      wherein the adox operation constitutes an add with carry operation having an extension utilizing an Overflow Flag (OF flag).
  - 8. The method of claim 7, wherein using the adcx operation for the first of the two additions and the adox operation for the second of the two additions comprises:
    - using two distinct carry chains for each of the respective adcx and adox operations resulting in the fewer load and store operations being required in comparison with using a single carry chain, and further wherein latency is reduced over performing the computation using two single passes using a legacy x86 add-with-carry (adc) instruction.
  - 9. The method of claim 1, wherein the integrated circuit comprises an Intel Architecture type Central Processing Unit (CPU) or alternatively wherein the integrated circuit comprises a 64-bit processor core.
  - 10. The method of claim 1, wherein the integrated circuit is embodied within one of a tablet computing device or a smartphone.
  - 11. The method of claim 1, wherein the (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once comprises yielding the final 512-bit squared result further by using two distinct carry chains for adding the value of T2 twice and the value of T1 once, wherein fewer load and store operations are required in comparison with using a single carry chain.
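
Claims 7, 8, and 11 describe computing T1+T2+T2 with two distinct carry chains. The sketch below models that idea in Python under stated assumptions: `cf` and `of` are plain integers standing in for the Carry and Overflow flags, the two additions are interleaved limb by limb, and `add_t1_t2_t2` and `limbs` are hypothetical names. It illustrates why the two chains do not interfere with each other; it is not a cycle-accurate model of the adcx and adox instructions.

```python
MASK64 = (1 << 64) - 1

def limbs(x, n):
    """Split x into n 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]

def add_t1_t2_t2(t1, t2, n=16):
    """Compute t1 + t2 + t2 limb by limb with two independent carry bits."""
    a, b = limbs(t1, n), limbs(t2, n)
    cf = of = 0                      # stand-ins for the CF and OF flags
    out = []
    for i in range(n):
        s = a[i] + b[i] + cf         # first addition, carry kept in cf
        cf, s = s >> 64, s & MASK64
        s = s + b[i] + of            # second addition, carry kept in of
        of, s = s >> 64, s & MASK64
        out.append(s)
    out.append(cf + of)              # both final carries land in the top limb
    return sum(v << (64 * i) for i, v in enumerate(out))
```

Because each chain reads and writes only its own carry bit, both additions can proceed in a single pass over the limbs, whereas a single carry flag would force two separate passes.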

12. A method in an integrated circuit, the method comprising:
- receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits;
  
  performing a 512-bit squaring algorithm by:
  
  (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,
  
  (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,
  
  (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns,
  
  (iv) adding all sub-elements within their respective columns, the added sub-elements collectively identified as T2, wherein the adding all sub-elements within their respective columns, comprises:
  
  (a) adding a first of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch;
  
  (b) adding a second and a third of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and
  
  (c) adding a fourth of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load; and
  
  (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once.
- - 13. The method of claim 12, wherein the integrated circuit operates such that loading and switching the operands comprises retrieving operands from and storing operands to the plurality of registers.
  - 14. The method of claim 12, wherein the integrated circuit comprises an Intel Architecture type Central Processing Unit (CPU) or alternatively wherein the integrated circuit comprises a 64-bit processor core.
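
A quick count suggests why four diagonals of seven sub-elements can hold every cross product. The sketch below assumes that the product of sub-elements i and j (with i < j) lands in 64-bit column i + j, then compares the number of products per column against a hypothetical layout of four seven-column rows starting at columns 1, 3, 5, and 7. The row offsets are an inference from the column counts, not something the claims state, and the sketch does not say which product goes in which row.

```python
from collections import Counter

# Cross product a[i]*a[j] (i < j) is assumed to land in 64-bit column i + j.
pairs = [(i, j) for i in range(8) for j in range(i + 1, 8)]
column_load = Counter(i + j for i, j in pairs)
heights = [column_load[c] for c in range(1, 14)]

# Hypothetical layout: four rows of seven columns, each starting two columns
# after the previous one (columns 1-7, 3-9, 5-11, 7-13).
row_starts = [1, 3, 5, 7]
coverage = [sum(1 for s in row_starts if s <= c < s + 7) for c in range(1, 14)]

print(heights)
print(coverage)
```

Under these assumptions both lists are identical, so the 28 cross products fit the four staggered rows with no slot left empty.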

15. One or more non-transitory computer readable storage media having instructions stored thereon that, when executed by an integrated circuit having a plurality of registers for storing operands, the instructions cause the integrated circuit to perform operations including:
- receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits;
  
  performing a 512-bit squaring algorithm by:
  
  (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,
  
  (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,
  
  (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;
  
  (iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and
  
  (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and
  
  wherein the (iv) adding all sub-elements within their respective columns, further includes:
  
  (a) adding a first of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch,
  
  (b) adding a second and a third of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and
  
  (c) adding a fourth of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load.
- - 16. The one or more non-transitory computer readable storage media of claim 15, wherein the (iii) reorganizing constitutes a change in the timing and sequencing of performing additions for the 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns without creating a structure of a different shape or organization through a change in timing of performing additions or a change in sequencing of performing additions, or both.
  - 17. The one or more non-transitory computer readable storage media of claim 15:
    - wherein the asymmetric intermediate result having the seven diagonals therein, each of a different length, requires seven overhead carries within a non-optimized 512-bit squaring algorithm; and
      
      wherein the symmetric intermediate result having the four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, requires four overhead carries within an optimized 512-bit squaring algorithm.
  - 18. The one or more non-transitory computer readable storage media of claim 15, wherein multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, comprises:
    - each of the seven diagonals offset by 64-bits such that a first of the seven diagonals is a 7×1 diagonal in length, then a second is a 6×1 diagonal in length, then a third is a 5×1 diagonal in length, then a fourth is a 4×1 diagonal in length, then a fifth is a 3×1 diagonal in length, then a sixth is a 2×1 diagonal in length, and then a seventh is a 1×1 diagonal in length.
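
The diagonal lengths recited above (7×1 down to 1×1) follow from pairing each sub-element only with those of higher index. A minimal Python sketch, assuming diagonal k holds the products of sub-element k with every sub-element j > k (the claims do not spell out this indexing, so it is an assumption):

```python
# Assumption: diagonal k holds the products a[k] * a[j] for every j > k.
diagonals = [[(k, j) for j in range(k + 1, 8)] for k in range(7)]
lengths = [len(d) for d in diagonals]
total = sum(lengths)

print(lengths)   # one entry per diagonal, longest first
print(total)     # number of distinct cross products
```

The total equals 4 × 7, which is consistent with regrouping the same products into four diagonals of seven sub-elements each.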

19. An integrated circuit, comprising:
- a plurality of registers for storing operands;
  
  an input to receive a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; and
  
  a 512-bit squaring algorithm implemented as a multiply extension (“mulx”) of an Instruction Set Architecture (ISA) instruction, wherein the 512-bit squaring algorithm to operate by:
  
  (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,
  
  (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,
  
  (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;
  
  (iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and
  
  (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once;
  
  an output to egress the final 512-bit squared result of the 512-bit value; and
  
  wherein the (iv) adding all sub-elements within their respective columns, further includes:
  
  (a) adding a first of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch,
  
  (b) adding a second and a third of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and
  
  (c) adding a fourth of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load.
- - 20. The integrated circuit of claim 19:
    - wherein the integrated circuit is embodied within one of a tablet computing device or smartphone; and
      
      wherein the tablet computing device or smartphone further comprises a touch screen interface.
  - 21. The integrated circuit of claim 19, wherein the integrated circuit comprises an Intel Architecture type Central Processing Unit (CPU).
  - 22. The integrated circuit of claim 19, wherein the integrated circuit comprises a 64-bit processor core.
  - 23. The integrated circuit of claim 19:
    - wherein the asymmetric intermediate result having the seven diagonals therein, each of a different length, requires seven overhead carries within a non-optimized 512-bit squaring algorithm; and
      
      wherein the symmetric intermediate result having the four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, requires four overhead carries within an optimized 512-bit squaring algorithm.
  - 24. The integrated circuit of claim 19, wherein multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, comprises:
    - each of the seven diagonals offset by 64-bits such that a first of the seven diagonals is a 7×1 diagonal in length, then a second is a 6×1 diagonal in length, then a third is a 5×1 diagonal in length, then a fourth is a 4×1 diagonal in length, then a fifth is a 3×1 diagonal in length, then a sixth is a 2×1 diagonal in length, and then a seventh is a 1×1 diagonal in length.
  - 25. The integrated circuit of claim 19, wherein the integrated circuit comprises an Intel Architecture type Central Processing Unit (CPU) or alternatively wherein the integrated circuit comprises a 64-bit processor core.
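
Claim 19 ties the algorithm to the mulx multiply. On x86-64, mulx is an unsigned 64×64-bit multiply that writes the high and low halves of the 128-bit product to two destination registers and leaves the flags untouched. The Python sketch below models only that arithmetic and uses it to lay out T1; the helper names `mulx` and `squares_t1` and the limb ordering (least significant first) are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1

def mulx(a, b):
    """Model of a 64x64 -> 128-bit unsigned multiply returning (high, low)."""
    p = a * b
    return p >> 64, p & MASK64

def squares_t1(a):
    """Lay out the eight squares as 16 limbs; square i fills limbs 2i and 2i+1."""
    t1 = []
    for x in a:
        hi, lo = mulx(x, x)
        t1 += [lo, hi]   # low half first, then high half
    return t1
```

Each square fits in exactly two limbs and no two squares share a limb, so T1 can be assembled without any additions or carries in this model.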

26. A system comprising:
- a system bus;
  
  a touch screen interface coupled with the system bus;
  
  a memory coupled with the system bus;
  
  a processor coupled with the system bus; and
  
  a 512-bit squaring algorithm to operate at an integrated circuit of the system, the integrated circuit having a plurality of registers for storing operands, wherein the 512-bit squaring algorithm is to operate in conjunction with the memory and the processor by:
  
  (i) receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; and
  
  (ii) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,
  
  (iii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,
  
  (iv) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;
  
  (v) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and
  
  (vi) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and
  
  wherein the (iv) adding all sub-elements within their respective columns, further includes:
  
  (a) adding a first of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch,
  
  (b) adding a second and a third of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and
  
  (c) adding a fourth of the four diagonals each of 7×1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load.
- - 27. The system of claim 26:
    - wherein the system, including the system bus, the memory, the processor, and the 512-bit squaring algorithm, is embodied within one of a tablet computing device or smartphone.
  - 28. The system of claim 26, wherein the processor comprises a 64-bit Intel Architecture type Central Processing Unit Core (CPU).
  - 29. The system of claim 26:
    - wherein the asymmetric intermediate result having the seven diagonals therein, each of a different length, requires seven overhead carries within a non-optimized 512-bit squaring algorithm; and
      
      wherein the symmetric intermediate result having the four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, requires four overhead carries within an optimized 512-bit squaring algorithm.
  - 30. The system of claim 26, wherein the integrated circuit comprises an Intel Architecture type Central Processing Unit (CPU) or alternatively wherein the integrated circuit comprises a 64-bit processor core.

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Ozturk, Erdinc; Gopal, Vinodh; Guilford, James
Primary Examiner(s): Caldwell, Andrew
Assistant Examiner(s): Brien, Calvin M

Application Number: US13/707,105
Publication Number: US 20140019725A1
Time in Patent Office: 1,202 Days
Field of Search: 708/606, 708/620, 708/653-714, 708/490, 708/495, 708/503, 708/625
US Class Current: 1/1
CPC Class Codes

G06F 2207/5523   Calculates a power, e.g. th...

G06F 7/52   Multiplying; Dividing G06F7...

G06F 7/523   Multiplying only

G06F 7/544   for evaluating functions by...

G06F 9/3001   Arithmetic instructions

G06F 9/30036   Instructions to perform ope...
