Outer Product Engine
Abstract
In an embodiment, an outer product engine is configured to perform outer product operations. The outer product engine may perform numerous multiplication operations in parallel on input vectors, in an embodiment, generating a resulting outer product matrix. In an embodiment, the outer product engine may be configured to accumulate results in a result matrix, performing fused multiply add (FMA) operations to produce the outer product elements (multiply) and to accumulate the outer product elements with previous elements from the result matrix memory (add). A processor may fetch outer product instructions, and may transmit the instructions to the outer product engine when the instructions become non-speculative in an embodiment. The processor may be configured to retire the outer product instructions responsive to transmitting them to the outer product engine.
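The accumulate behavior described in the abstract can be sketched in software. The following is a minimal illustration only, not the hardware design: the function name `outer_product_accumulate` and the list-of-lists result matrix are assumptions made for this example. Each result element is updated as a multiply (the outer product element) followed by an add (accumulation with the previous value of that element).

```python
def outer_product_accumulate(x, y, z):
    """Accumulate the outer product of vectors x and y into matrix z, in place.

    z must have len(x) rows and len(y) columns. Hypothetical helper for
    illustration; a hardware FMA rounds once, whereas the Python expression
    below rounds after the multiply and again after the add.
    """
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            # multiply produces the outer product element;
            # add accumulates it with the previous result element
            z[i][j] = xi * yj + z[i][j]
    return z


# Result matrix memory starts at zero, then accumulates two outer products.
z = [[0.0] * 3 for _ in range(2)]
outer_product_accumulate([1.0, 2.0], [3.0, 4.0, 5.0], z)
outer_product_accumulate([1.0, 1.0], [1.0, 1.0, 1.0], z)
```

In the engine described, the element updates would be performed by many multipliers in parallel rather than by nested loops; the loops here only make the per-element arithmetic explicit.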
20 Claims
1. An apparatus comprising:
a processor configured to fetch an outer product instruction; and
an outer product engine coupled to the processor, wherein:
the outer product engine is configured to perform an outer product operation specified for the outer product instruction;
the outer product engine comprises at least two input memories configured to store input vectors for the outer product operation and an output memory configured to accumulate outer product results;
the processor is configured to retire the outer product instruction in response to transmitting the outer product operation to the outer product engine and prior to the outer product operation being completed by the outer product engine; and
a size of each input memory exceeds a size of vector registers in the processor.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
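The retirement behavior recited in claim 1 (the instruction retires once the operation is transmitted, before the engine completes it) can be modeled with a simple queue. This is a toy sketch under stated assumptions: the class names `ProcessorModel` and `OuterProductEngineModel`, the string operation labels, and the one-operation-per-step engine are all hypothetical and are not taken from the patent.

```python
from collections import deque


class OuterProductEngineModel:
    """Hypothetical engine model: queues operations and completes them later."""

    def __init__(self):
        self.pending = deque()   # operations received but not yet executed
        self.completed = []      # operations the engine has finished

    def accept(self, op):
        self.pending.append(op)

    def step(self):
        # Complete at most one pending operation per step (an assumption).
        if self.pending:
            self.completed.append(self.pending.popleft())


class ProcessorModel:
    """Hypothetical processor model: retires an instruction on transmission."""

    def __init__(self, engine):
        self.engine = engine
        self.retired = []

    def issue(self, op):
        self.engine.accept(op)   # transmit the operation to the engine
        self.retired.append(op)  # retire without waiting for completion


engine = OuterProductEngineModel()
cpu = ProcessorModel(engine)
cpu.issue("op0")
cpu.issue("op1")
# Both instructions are retired while the engine has completed nothing yet.
retired_before_completion = len(cpu.retired) == 2 and len(engine.completed) == 0
engine.step()
engine.step()
```

The point of the sketch is the ordering: the processor's retired list fills before the engine's completed list does, so the processor is free to continue with other work in the meantime.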
11. An outer product engine comprising:
a circuit configured to perform an outer product operation on a first vector operand and a second vector operand, producing a resulting outer product matrix;
a first operand memory coupled to the circuit, wherein the first operand memory is sized to store a first number of elements of the first vector operand at a first element size and a second number of elements of the first vector operand at a second element size, wherein the second element size is larger than the first element size;
a second operand memory coupled to the circuit, wherein the second operand memory is sized to store a third number of elements of the second vector operand at the first element size and a fourth number of elements of the second vector operand at the second element size;
a third memory coupled to the circuit, wherein the third memory is sized to store the resulting outer product matrix for the outer product operation performed on the first element size, and wherein a portion of the third memory is unused for the outer product operation performed at the second element size.
Dependent claims: 12, 13, 14, 15, 16
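The sizing relationship in claim 11 can be checked with simple arithmetic. The numbers below are hypothetical and chosen only for illustration: the claim does not state a 64-byte operand memory or 4-byte and 8-byte element sizes, and the sketch further assumes both operand memories have the same capacity. With a fixed operand memory, doubling the element size halves the element count per vector, so the result matrix needs only a quarter as many elements, each twice as large, and half of the result memory goes unused.

```python
def result_memory_usage(operand_bytes, small_elem_bytes, large_elem_bytes):
    """Compute result-memory usage for two element sizes.

    Hypothetical helper: assumes both operand memories hold operand_bytes
    and that result elements are the same size as input elements.
    """
    n_small = operand_bytes // small_elem_bytes  # elements per vector, small size
    n_large = operand_bytes // large_elem_bytes  # elements per vector, large size
    # Result memory is sized for the full matrix at the small element size.
    result_bytes = n_small * n_small * small_elem_bytes
    used_at_large = n_large * n_large * large_elem_bytes
    return {
        "elements_small": n_small,
        "elements_large": n_large,
        "result_bytes": result_bytes,
        "used_bytes_at_large": used_at_large,
        "unused_bytes_at_large": result_bytes - used_at_large,
    }


# Assumed sizes: 64-byte operand memories, 4-byte and 8-byte elements.
usage = result_memory_usage(64, 4, 8)
```

Under these assumed sizes, the result memory holds a 16 x 16 matrix of 4-byte elements (1024 bytes), while the 8 x 8 matrix of 8-byte elements occupies 512 bytes and leaves the other 512 bytes unused.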
17. An apparatus comprising:
a processor configured to fetch an outer product instruction; and
an outer product engine coupled to the processor, wherein:
the outer product engine is configured to perform an outer product operation specified for the outer product instruction;
the outer product engine comprises at least two input memories configured to store input vectors for the outer product operation and an output memory configured to accumulate outer product results; and
the outer product engine is configured to read the elements of the output memory and accumulate corresponding elements of the outer product operation with existing data in the output memory in response to the outer product instruction.
Dependent claims: 18, 19, 20
Specification