Method and apparatus for floating point to fixed point conversion with compensation for lost precision
First Claim
1. A method of compensating lost precision for a truncated binary number stored in a register having a fewer number of bits than the original untruncated binary number, the method comprising the steps of:
(a) determining if any of the bits to be truncated to reduce the precision of the binary number are 1's;
(b) truncating the original binary number to reduce its precision;
(c) storing the truncated binary number in a fixed point format having a fractional portion; and
(d) altering the least significant bit of the truncated binary number to be a 1 if the fractional portion of the truncated binary number is all 0's and the determination in step (a) was in the affirmative.
Abstract
A floating point binary number that is to be converted to a fixed point representation, or a fixed point number to be reduced in precision, is originally located in a source register. A conversion mechanism connects the source register to a destination register. After the conversion, the least significant bit of the fixed point representation may deliberately retain an indication of the existence of less significant non-zero bits that were truncated. When such retention is desired, it is accomplished by forcing that least significant bit to be a one if the fractional portion of the converted number is zero and there were such truncated non-zero bits of lesser significance. To do this, the direction and amount of mantissa shift needed during conversion are inspected to reveal which bit positions in the original floating point number are going to be truncated. An array of two-input AND gates has one AND gate per possible truncated bit. A mask is generated by a lookup table according to the number of bits to be truncated. The mask supplies a logic 1 to one input of each such corresponding gate; the other input of each gate is driven by the bit to be truncated. If any such bit to be truncated is a one, then the output of the corresponding gate will be true. The outputs of all these AND gates are OR'ed together and the result is stored in a latch; a SET latch then indicates the impending truncation of at least one 1. After the conversion, the fractional portion of the destination register is checked to see if it is all zeros. If it is, and if the latch is also SET, then the least significant bit of the fractional portion of the destination register is forced to be understood as a 1 when the register is read.
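The mask, AND-gate array, and OR reduction described in the abstract can be modelled in a few lines of software. This is a minimal sketch rather than the patented circuit: the names `MASK_TABLE` and `loss_latch`, the 23-bit mantissa width, and the assumption that the truncated bits are the low-order bits of the mantissa are all illustrative choices that the abstract does not specify.

```python
MANTISSA_BITS = 23  # assumed width; the abstract does not fix one

# Lookup table: entry n is a mask with 1s in the n low-order positions,
# i.e. one enabled AND gate for each bit that is about to be truncated.
MASK_TABLE = [(1 << n) - 1 for n in range(MANTISSA_BITS + 1)]


def loss_latch(mantissa: int, bits_truncated: int) -> bool:
    """Return True (latch SET) if any bit about to be truncated is a 1."""
    mask = MASK_TABLE[bits_truncated]
    gated = mantissa & mask   # models the array of two-input AND gates
    return gated != 0         # models OR'ing all the gate outputs together
```

For instance, `loss_latch(0b1001, 3)` sets the latch because bit 0 is a 1 and lies within the three bits to be dropped, whereas `loss_latch(0b1000, 3)` leaves it clear.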
3 Claims
1. A method of compensating lost precision for a truncated binary number stored in a register having a fewer number of bits than the original untruncated binary number, the method comprising the steps of:
(a) determining if any of the bits to be truncated to reduce the precision of the binary number are 1's;
(b) truncating the original binary number to reduce its precision;
(c) storing the truncated binary number in a fixed point format having a fractional portion; and
(d) altering the least significant bit of the truncated binary number to be a 1 if the fractional portion of the truncated binary number is all 0's and the determination in step (a) was in the affirmative.
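Steps (a) through (d) of claim 1 can be illustrated for the case of reducing the precision of a fixed point value. The sketch below is one possible reading under stated assumptions, not the claimed method itself: the function name `truncate_with_compensation` and its parameters are hypothetical, the value is assumed non-negative, and the result is assumed to keep at least one fractional bit.

```python
def truncate_with_compensation(value: int, drop_bits: int, frac_bits: int) -> int:
    """Drop `drop_bits` low-order bits of a non-negative fixed point value.

    `frac_bits` (assumed >= 1) is the number of fractional bits that
    remain in the result after truncation.
    """
    # (a) determine whether any of the bits to be truncated are 1s
    lost_a_one = (value & ((1 << drop_bits) - 1)) != 0
    # (b) truncate; (c) the result is held as fixed point with `frac_bits`
    truncated = value >> drop_bits
    fraction = truncated & ((1 << frac_bits) - 1)
    # (d) force the LSB to 1 only if the fraction would otherwise read as zero
    if fraction == 0 and lost_a_one:
        truncated |= 1
    return truncated
```

With eight fractional bits reduced to four, `0b11_00000001` (3 + 1/256) becomes `0b11_0001` instead of `0b11_0000`, so the result can still be told apart from an exact 3. A value whose surviving fraction is already non-zero, such as `0b11_10000001`, is left as plain truncation gives it.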
2. A method of converting a floating point binary number having an exponent and a mantissa to a fixed point number that includes a fractional portion, the method comprising the steps of:
(a) inspecting the exponent to determine the direction and number of bits to shift the mantissa of the floating point binary number to align it with the bit positions of a fixed point destination;
(b) inspecting those bits of the mantissa that will be lost by truncation to determine if any of those bits are 1's;
(c) storing the shifted version of the mantissa as a fixed point binary number that includes a fractional portion;
(d) inspecting the bits of the fractional portion to determine if they are all 0's;
(e) altering the least significant bit of the fixed point binary number to be a 1 if the fractional portion is all 0's and the determination in step (b) was in the affirmative.
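The floating point case in claim 2 adds the exponent inspection that decides the shift. The following sketch walks through steps (a) to (e) using Python's own floats as the source format. It rests on several assumptions that are not in the claim: the 53-bit significand of an IEEE 754 double, non-negative finite inputs only, an unbounded destination (overflow of a real register is not modelled), and the hypothetical name `float_to_fixed`.

```python
import math

MANT_BITS = 53  # significand width of a Python float (IEEE 754 double)


def float_to_fixed(x: float, frac_bits: int) -> int:
    """Convert non-negative finite x to fixed point with `frac_bits` (>= 1)
    fractional bits, flagging lost precision in the LSB."""
    if not math.isfinite(x) or x < 0:
        raise ValueError("sketch handles non-negative finite values only")
    m, e = math.frexp(x)                       # x == m * 2**e, 0.5 <= m < 1
    mantissa = int(math.ldexp(m, MANT_BITS))   # exact integer significand
    # (a) the exponent gives the direction and amount of the shift
    shift = e - MANT_BITS + frac_bits
    if shift >= 0:
        return mantissa << shift               # left shift: nothing is lost
    n = -shift
    # (b) inspect the mantissa bits that truncation will discard
    lost_a_one = (mantissa & ((1 << n) - 1)) != 0
    # (c) the shifted mantissa is the fixed point number
    fixed = mantissa >> n
    # (d) check whether the fractional portion is all 0s
    fraction_is_zero = (fixed & ((1 << frac_bits) - 1)) == 0
    # (e) alter the LSB if a 1 was lost and the fraction shows no trace of it
    if fraction_is_zero and lost_a_one:
        fixed |= 1
    return fixed
```

With four fractional bits, `float_to_fixed(3.0, 4)` gives 48 (exactly 3), while `float_to_fixed(3.00390625, 4)` gives 49, because the 1/256 component cannot be represented but is still signalled in the least significant bit.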
3. A circuit for compensating a binary fixed point representation of a binary floating point number for precision lost in a conversion from floating point to fixed point, the circuit comprising:
a first register containing a binary floating point number including an exponent and a mantissa;
a first circuit, coupled to the exponent, that produces a shift signal that indicates the direction and number of shifts that the mantissa is to be shifted as part of its conversion to a binary fixed point representation that includes a fractional part;
a shift circuit having a first input coupled to the mantissa contained in the first register, a second input coupled to the shift signal, and an output at which appears a bit pattern at the first input after being shifted in the direction and the amount indicated by the shift signal;
a second register, coupled to the output of the shifting circuit, that receives a binary fixed point representation of the binary floating point number in the first register;
a second circuit, coupled to the mantissa contained in the first register and to the shift signal, that produces a loss signal that indicates if any of the bits in that mantissa that are not represented in the second register, owing to truncation, were 1's;
a third circuit, coupled to the bits in the second register representing the fractional part of the binary fixed point number, that produces an output signal indicative that the fractional part is all zeros; and
a fourth circuit, coupled to the least significant bit of the fractional part contained in the second register, to the output from the third circuit and to the loss signal, that represents the least significant bit of the fractional part as a 1 when the output from the third circuit indicates that the fractional part is all 0's and the loss signal indicates that at least one 1 was truncated, and that at other times indicates the least significant bit of the fractional part as it actually is in the second register.
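One feature of claim 3 worth noting is that the fourth circuit changes how the least significant bit is represented to a reader, not what is stored in the second register. A rough software model of the third and fourth circuits, under the assumption that the loss signal is already available as a boolean, might look like this. The class name `DestinationRegister` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DestinationRegister:
    bits: int        # contents of the second register, never modified here
    frac_bits: int   # width of the fractional part (assumed >= 1)
    loss: bool       # loss signal produced by the second circuit

    def fraction_is_zero(self) -> bool:
        """Third circuit: true when the fractional part is all 0s."""
        return (self.bits & ((1 << self.frac_bits) - 1)) == 0

    def read(self) -> int:
        """Fourth circuit: the register value as presented to a reader."""
        if self.fraction_is_zero() and self.loss:
            return self.bits | 1   # LSB represented as a 1
        return self.bits           # LSB as it actually is in the register
```

Reading `DestinationRegister(0b110000, 4, True)` yields `0b110001` while the `bits` field still holds `0b110000`, which mirrors the distinction the claim draws between the represented bit and the stored bit.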
Specification