Method of identifying redundant text in an electronic document

  • US 20060282769A1
  • Filed: 04/18/2006
  • Published: 12/14/2006
  • Est. Priority Date: 06/09/2005
  • Status: Active Grant
First Claim

1. A method of identifying redundant text fragments, which create artificial artifacts only, in an electronic document, comprising:

    a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment comprising at least one glyph, the document further comprising Unicode values for all glyphs as well as geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs that include at least one of font size, character spacing, and text distortion;

    b) identifying two text fragments as redundant candidates, if corresponding Unicode sequences of the two text fragments are identical;

    c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics wherein the height of the bounding box is essentially equal to the font size of the first glyph in a text fragment, and wherein the width of the bounding box is essentially equal to the accumulated widths of all glyphs in the text fragment;

    d) calculating the overlapping area of the two bounding boxes; and

    e) determining whether the two candidates form redundant text fragments wherein a ratio of the overlapping area to the area of the smaller bounding box of both text fragments is calculated and the ratio is compared with a predetermined threshold.
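
Steps b) through e) of the claim can be illustrated with a minimal sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not state: bounding boxes are treated as axis-aligned rectangles (the claim only requires a quadrangular shape, and text distortion could rotate or skew it), the fragment position is taken as the lower-left corner of the box, the threshold value 0.8 is purely illustrative, and a ratio at or above the threshold is read as "redundant" (the claim says only that the ratio is compared with a predetermined threshold). The names `Fragment`, `overlap_area`, and `is_redundant` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned (assumption)


@dataclass
class Fragment:
    text: str                  # Unicode sequence of the fragment
    x: float                   # assumed: left edge of the fragment on the page
    y: float                   # assumed: bottom edge of the fragment on the page
    glyph_widths: List[float]  # width of each glyph in the fragment
    font_size: float           # font size of the first glyph

    def bbox(self) -> Box:
        # Step c): height ~ font size of the first glyph,
        # width ~ accumulated widths of all glyphs.
        width = sum(self.glyph_widths)
        return (self.x, self.y, self.x + width, self.y + self.font_size)


def area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_area(a: Box, b: Box) -> float:
    # Step d): area of the intersection of two axis-aligned rectangles.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def is_redundant(f1: Fragment, f2: Fragment, threshold: float = 0.8) -> bool:
    # Step b): only fragments with identical Unicode sequences are candidates.
    if f1.text != f2.text:
        return False
    b1, b2 = f1.bbox(), f2.bbox()
    smaller = min(area(b1), area(b2))
    if smaller == 0.0:
        return False  # degenerate box; the ratio is undefined
    # Step e): ratio of the overlap to the smaller box, compared with the threshold.
    # Treating ">=" as "redundant" is an assumption of this sketch.
    return overlap_area(b1, b2) / smaller >= threshold


# A fragment drawn twice with a slight offset (e.g. a simulated bold or shadow effect).
original = Fragment("Hello", 10.0, 20.0, [6.0] * 5, 12.0)
shadow = Fragment("Hello", 10.5, 20.5, [6.0] * 5, 12.0)
elsewhere = Fragment("Hello", 100.0, 20.0, [6.0] * 5, 12.0)

print(is_redundant(original, shadow))     # boxes overlap almost entirely
print(is_redundant(original, elsewhere))  # same text, but no overlap
```

Dividing by the area of the smaller box, as the claim specifies, keeps the ratio within 0 to 1 even when the two fragments are set in different font sizes, so a small copy lying entirely inside a larger one still yields a ratio of 1.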
