Method for fast large-integer arithmetic on IA processors
First Claim
1. A method in an integrated circuit, the integrated circuit having a plurality of registers for storing operands, the method comprising:
- receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits;
performing a 512-bit squaring algorithm by;
(i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1,(ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length,(iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;
(iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and(v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and
wherein the (iv) adding all sub-elements within their respective columns, further includes;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch, (b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load.
1 Assignment
0 Petitions
Accused Products
Abstract
Methods, systems, and apparatuses are disclosed for implementing fast large-integer arithmetic within an integrated circuit, such as on IA (Intel Architecture) processors, in which such means include receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits and performing a 512-bit squaring algorithm by: (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of the 64-bits in length arranged across a plurality of columns, (iv) adding all sub-elements within their respective columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once. Other related embodiments are disclosed.
17 Citations
30 Claims
-
1. A method in an integrated circuit, the integrated circuit having a plurality of registers for storing operands, the method comprising:
-
receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; performing a 512-bit squaring algorithm by; (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;(iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and wherein the (iv) adding all sub-elements within their respective columns, further includes;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch, (b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A method in an integrated circuit, the method comprising:
-
receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; performing a 512-bit squaring algorithm by; (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns,(iv) adding all sub-elements within their respective columns, the added sub-elements collectively identified as T2, wherein the adding all sub-elements within their respective columns, comprises;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch;
(b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load; and(v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once. - View Dependent Claims (13, 14)
-
-
15. One or more non-transitory computer readable storage media having instructions stored thereon that, when executed by an integrated circuit having a plurality of registers for storing operands, the instructions cause the integrated circuit to perform operations including:
-
receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; performing a 512-bit squaring algorithm by; (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;(iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and wherein the (iv) adding all sub-elements within their respective columns, further includes;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch, (b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load. - View Dependent Claims (16, 17, 18)
-
-
19. An integrated circuit, comprising:
-
a plurality of registers for storing operands; an input to receive a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; and a 512-bit squaring algorithm implemented as a multiply extension (“
mulx”
) of an Instruction Set Architecture (ISA) instruction, wherein the 512-bit squaring algorithm to operate by;(i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;(iv) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; an output to egress the final 512-bit squared result of the 512-bit value; and wherein the (iv) adding all sub-elements within their respective columns, further includes;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch, (b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load. - View Dependent Claims (20, 21, 22, 23, 24, 25)
-
-
26. A system comprising:
-
a system bus; a touch screen interface coupled with the system bus; a memory coupled with the system bus; a processor coupled with the system bus; and a 512-bit squaring algorithm to operate at an integrated circuit of the system, the integrated circuit having a plurality of registers for storing operands, wherein the 512-bit squaring algorithm is to operate in conjunction with the memory and the processor by; (i) receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64-bits; and (ii) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (iii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals are of a different length, (iv) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×
1 sub-elements of the 64-bits in length arranged across a plurality of columns, wherein fewer load and store operations from the plurality of registers are required after the reorganizing;(v) for each of the plurality of columns, adding all sub-elements within the respective one of the plurality of columns, the added sub-elements collectively identified as T2, and (vi) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice with the value of T1 once; and wherein the (iv) adding all sub-elements within their respective columns, further includes;
(a) adding a first of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which one operand is loaded once and does not switch, (b) adding a second and a third of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which only one operand is switched after an initial load, and (c) adding a fourth of the four diagonals each of 7×
1 sub-elements of the 64-bits in length in which a plurality of operand switches occur after an initial load. - View Dependent Claims (27, 28, 29, 30)
-
Specification