Method of identifying semantic units in an electronic document

  • US 20070002054A1
  • Filed: 04/18/2006
  • Published: 01/04/2007
  • Est. Priority Date: 07/01/2005
  • Status: Active Grant
First Claim

1. A method of identifying semantic units in an electronic document comprising the steps of:

    a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further comprising geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs, the parameters including at least one of font size, character spacing, and text distortion;

    b) determining a strip as a geometrical area of at least one glyph by comparing a geometric position of each glyph with a geometric position of the subsequent glyph in a description of the page, wherein every two glyphs are considered part of one strip if:

    i) the difference of their vertical coordinates is smaller than a first predetermined value; and

    ii) the difference of their horizontal coordinates is greater than zero, and smaller than a second predetermined value;

    c) determining a zone of at least one strip wherein a zone is defined by the combined area of strips that overlap each other;

    d) determining a potential boundary between two semantic units in a zone by a process comprising the steps of:

    i) defining a list of glyphs in a zone according to their location in the page description with associated font characteristics and geometric position;

    ii) identifying the presence of a potential boundary between semantic units between two glyphs if the difference between the vertical coordinate of a selected glyph and the vertical coordinate of the preceding glyph is larger than a third predetermined value; and/or

    iii) defining a space box of rectangular shape for each font/encoding combination used in the document, the width of the space box being defined as the width of one of an existing and a hypothetical space glyph for the corresponding font size, and identifying the presence of a potential boundary between semantic units, if the distance between a selected glyph and the preceding glyph equals the width of the space box or exceeds it by a fourth predetermined value, thereby rendering a potential semantic unit; and

    iv) repeating steps i) to iii) until the end of the glyph list has been reached;

    e) sorting the identified potential semantic units in the zone according to the vertical position and, if the vertical positions differ by at most a fifth predetermined value, according to the horizontal position, whereby a sorted list of potential semantic units is provided;

    f) processing the sorted list of potential semantic units by combining two potential semantic units if the distance between the last glyph of a first potential semantic unit and the first glyph of a next potential semantic unit is smaller than the sum of the width of the space box and a sixth predetermined value, until the end of the list has been reached, thereby rendering a sorted list of finally identified semantic units in the zone; and

    g) repeating steps d) to f) until all zones have been processed.
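
For illustration only, steps b) and d) of the claim could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Glyph` record, the function names, the threshold parameters, and the sample coordinates are all hypothetical. The distance between two glyphs is assumed to be the horizontal gap between the right edge of the preceding glyph and the left edge of the selected glyph, and the space-box test is simplified to "gap at least the space width", leaving out the fourth predetermined value.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record; fields are assumptions, not from the claim."""
    char: str
    x: float      # left edge of the glyph
    y: float      # vertical coordinate (e.g. baseline)
    width: float  # advance width of the glyph


def group_into_strips(glyphs, max_dy, max_dx):
    """Step b): walk glyphs in page-description order and keep a glyph in the
    current strip if |dy| < max_dy and 0 < dx < max_dx relative to the
    preceding glyph; otherwise start a new strip."""
    strips = []
    for g in glyphs:
        if strips:
            prev = strips[-1][-1]
            dy = abs(g.y - prev.y)
            dx = g.x - prev.x
            if dy < max_dy and 0 < dx < max_dx:
                strips[-1].append(g)
                continue
        strips.append([g])
    return strips


def split_into_units(glyphs, space_width, max_dy):
    """Step d), simplified: start a new potential semantic unit when the
    vertical jump exceeds max_dy (ii) or when the horizontal gap to the
    preceding glyph is at least the space-box width (iii)."""
    units = []
    for g in glyphs:
        if units:
            prev = units[-1][-1]
            gap = g.x - (prev.x + prev.width)
            if abs(g.y - prev.y) <= max_dy and gap < space_width:
                units[-1].append(g)
                continue
        units.append([g])
    return units


# Made-up sample: "Hi yo" on one line, then "Z" on a lower line.
sample = [
    Glyph("H", 0, 0, 10),
    Glyph("i", 10, 0, 4),
    Glyph("y", 20, 0, 8),
    Glyph("o", 28, 0, 8),
    Glyph("Z", 0, 20, 10),
]

strips = group_into_strips(sample, max_dy=2, max_dx=15)
units = split_into_units(sample, space_width=5, max_dy=2)
words = ["".join(g.char for g in unit) for unit in units]
```

With these sample values the first four glyphs fall into one strip and "Z" into a second, while the gap of 6 units between "i" and "y" exceeds the assumed space width of 5 and therefore separates "Hi" from "yo". Zone construction, sorting, and merging (steps c), e), and f)) are not covered by this sketch.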
