Method of identifying redundant text in an electronic document
Abstract
A method of identifying redundant text fragments, which create artificial artifacts only, in an electronic page description language document includes a) providing a page having a plurality of text fragments, each text fragment comprising at least one glyph, the document including Unicode values for all glyphs and geometric information of all text fragments on the page and page description language parameters of all glyphs, b) identifying two text fragments as redundant candidates if their corresponding Unicode sequences are identical, c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics, d) calculating the overlapping area of the two bounding boxes, and e) determining whether the two candidates form redundant text fragments by comparing the ratio of the overlapping area to the area of the smaller bounding box of both text fragments with a predetermined threshold.
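The ratio in step e) can be written compactly. The symbols below are illustrative assumptions, not notation from the patent: $B_1$ and $B_2$ are the two bounding boxes, $A(\cdot)$ is area, and $\tau$ is the predetermined threshold. The abstract states only that the ratio is compared with the threshold; the direction of the comparison ($\ge$) is assumed here.

```latex
r = \frac{A(B_1 \cap B_2)}{\min\bigl(A(B_1),\, A(B_2)\bigr)},
\qquad
\text{candidates treated as redundant if } r \ge \tau .
```

For example, two boxes of width 30 and height 12 whose left edges differ by 0.5 overlap in an area of $29.5 \times 12 = 354$, so $r = 354 / 360 \approx 0.983$.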
7 Claims
1. A method of identifying redundant text fragments, which create artificial artifacts only, in an electronic document, comprising:
a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment comprising at least one glyph, the document further comprising Unicode values for all glyphs as well as geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs that include at least one of font size, character spacing, and text distortion;
b) identifying two text fragments as redundant candidates, if corresponding Unicode sequences of the two text fragments are identical;
c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics wherein the height of the bounding box is essentially equal to the font size of the first glyph in a text fragment, and wherein the width of the bounding box is essentially equal to the accumulated widths of all glyphs in the text fragment;
d) calculating the overlapping area of the two bounding boxes; and
e) determining whether the two candidates form redundant text fragments wherein a ratio of the overlapping area to the area of the smaller bounding box of both text fragments is calculated and the ratio is compared with a predetermined threshold.
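Steps b) through e) can be sketched for a single pair of fragments as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Fragment` type and all function names are hypothetical, boxes are treated as axis-aligned rectangles (the claim only requires a quadrangular shape, and text distortion could produce other shapes), the fragment position is taken to be the lower-left corner of the box, and the threshold value of 0.8 and the `>=` comparison are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned (assumption)


@dataclass
class Fragment:
    """Hypothetical container for the per-fragment data named in step a)."""
    text: str                  # Unicode sequence of the fragment
    x: float                   # left edge on the page
    y: float                   # bottom edge on the page (assumed reference point)
    font_size: float           # font size of the first glyph
    glyph_widths: List[float]  # width of each glyph, in page units


def bounding_box(f: Fragment) -> Box:
    # Step c): height ~ font size of the first glyph,
    # width ~ accumulated widths of all glyphs.
    return (f.x, f.y, f.x + sum(f.glyph_widths), f.y + f.font_size)


def area(b: Box) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def overlap_area(a: Box, b: Box) -> float:
    # Step d): area of the intersection rectangle, or 0 if the boxes are disjoint.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def is_redundant(f1: Fragment, f2: Fragment, threshold: float = 0.8) -> bool:
    # Step b): only fragments with identical Unicode sequences are candidates.
    if f1.text != f2.text:
        return False
    b1, b2 = bounding_box(f1), bounding_box(f2)
    smaller = min(area(b1), area(b2))
    if smaller <= 0:
        return False
    # Step e): ratio of overlap to the smaller box, compared with the threshold.
    return overlap_area(b1, b2) / smaller >= threshold
```

Dividing by the smaller box means a short duplicate lying entirely inside a larger box still yields a ratio of 1.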
7. A program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform an operation of identifying redundant text fragments in an electronic document, the operation comprising the steps of:
a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment comprising at least one glyph, the document further comprising Unicode values for all glyphs as well as geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs that include at least one of font size, character spacing, and text distortion;
b) identifying two text fragments as redundant candidates, if corresponding Unicode sequences of the two text fragments are identical;
c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics wherein a height of the bounding box is essentially equal to the font size of the first glyph in the text fragment, and wherein a width of the bounding box is essentially equal to accumulated widths of all glyphs in the text fragment;
d) calculating the overlapping area of the two bounding boxes; and
e) determining whether the two candidates form redundant text fragments, wherein a ratio of the overlapping area to the area of the smaller bounding box of both text fragments is calculated and this ratio is compared with a predetermined threshold.
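When the operation is run over a whole page rather than a single pair, one plausible arrangement is to group fragments by their Unicode sequence first, so that only fragments with identical text are compared geometrically. The sketch below assumes bounding boxes have already been computed as axis-aligned `(x0, y0, x1, y1)` tuples. The function names, the input layout, the default threshold of 0.8, and the `>=` comparison are all illustrative assumptions rather than details from the claims.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned (assumption)


def overlap_ratio(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the smaller box (steps d and e)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    overlap = w * h if w > 0 and h > 0 else 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return overlap / smaller if smaller > 0 else 0.0


def find_redundant_pairs(fragments: List[Tuple[str, Box]],
                         threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return index pairs of fragments judged redundant on one page."""
    # Step b): bucket fragment indices by identical Unicode sequence.
    by_text: Dict[str, List[int]] = defaultdict(list)
    for index, (text, _box) in enumerate(fragments):
        by_text[text].append(index)

    pairs: List[Tuple[int, int]] = []
    for indices in by_text.values():
        # Compare every pair of candidates that share the same text.
        for i, j in combinations(indices, 2):
            if overlap_ratio(fragments[i][1], fragments[j][1]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Grouping by text avoids geometric comparisons between fragments that could never be candidates under step b).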
Specification