Method of identifying semantic units in an electronic document
First Claim
1. A method of identifying semantic units in an electronic document comprising the steps of:
a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further comprising geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs, the parameters including at least one of font size, character spacing, and text distortion;
b) determining a strip as a geometrical area of at least one glyph by comparing a geometric position of each glyph with a geographic position of the subsequent glyph in a description of the page, wherein every two glyphs are considered part of one strip if:
i) the difference of their vertical coordinates is smaller than a first predetermined value; and
ii) the difference of their horizontal coordinates is greater than zero, and smaller than a second predetermined value;
c) determining a zone of at least one strip wherein a zone is defined by the combined area of strips that overlap each other;
d) determining a potential boundary between two semantic units in a zone by a process comprising the steps of:
i) defining a list of glyphs in a zone according to their location in the page description with associated font characteristics and geometric position;
ii) identifying the presence of a potential boundary between semantic units between two glyphs if the difference between the vertical coordinate of a selected glyph and the vertical coordinate of the preceding glyph is larger than a third predetermined value; and/or
iii) defining a space box of rectangular shape for each font/encoding combination used in the document, the width of the space box being defined as the width of one of an existing and a hypothetical space glyph for the corresponding font size, and identifying the presence of a potential boundary between semantic units, if the distance between a selected glyph and the preceding glyph equals the width of the space box or exceeds it by a fourth predetermined value, thereby rendering a potential semantic unit; and
iv) repeating steps i) to iii) until the end of the glyph list has been reached;
e) sorting the identified potential semantic units in the zone according to the vertical position and, if the vertical positions differ by at most a fifth predetermined value, according to the horizontal position, whereby a sorted list of potential semantic units is provided;
f) processing the sorted list of potential semantic units by combining two potential semantic units if the distance between the last glyph of a first potential semantic unit and the first glyph of a next potential semantic unit is smaller than the sum of the width of the space box and a sixth predetermined value, until the end of the list has been reached, thereby rendering a sorted list of finally identified semantic units in the zone; and
g) repeating steps d) to f) until all zones have been processed.
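Steps b) and c) of the claim can be illustrated with a minimal Python sketch. It is not the patentee's implementation. The `Glyph` fields, the helper names, the use of left edges for the horizontal difference, and the representation of a strip or zone as a bounding box `(x0, y0, x1, y1)` are all assumptions made for illustration; `max_dy` and `max_dx` stand in for the first and second predetermined values.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


@dataclass
class Glyph:
    x: float       # left edge (assumed reference point)
    y: float       # baseline
    width: float
    height: float


def glyph_box(g: Glyph) -> Box:
    return (g.x, g.y, g.x + g.width, g.y + g.height)


def union(a: Box, b: Box) -> Box:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_strips(glyphs: list[Glyph], max_dy: float, max_dx: float) -> list[Box]:
    """Step b): walk glyphs in page-description order; a glyph joins the
    current strip when |dy| < max_dy and 0 < dx < max_dx versus its predecessor."""
    strips: list[Box] = []
    current = None
    prev = None
    for g in glyphs:
        box = glyph_box(g)
        same_strip = (
            prev is not None
            and abs(g.y - prev.y) < max_dy
            and 0 < g.x - prev.x < max_dx
        )
        if same_strip:
            current = union(current, box)
        else:
            if current is not None:
                strips.append(current)
            current = box
        prev = g
    if current is not None:
        strips.append(current)
    return strips


def build_zones(strips: list[Box]) -> list[Box]:
    """Step c): repeatedly merge overlapping boxes until nothing changes.
    A zone is approximated here by the bounding box of its strips."""
    zones = list(strips)
    changed = True
    while changed:
        changed = False
        merged: list[Box] = []
        for box in zones:
            for i, z in enumerate(merged):
                if overlaps(box, z):
                    merged[i] = union(box, z)
                    changed = True
                    break
            else:
                merged.append(box)
        zones = merged
    return zones
```

The fixed-point loop in `build_zones` is a design choice of this sketch: merging two boxes can make the result overlap a box that was previously separate, so a single pass would not be enough.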
Abstract
A method of identifying semantic units in an electronic document includes the steps of: providing an electronic document being described in a page description language, the document having at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further including geometric information and page description language parameters; determining strips of at least one glyph by comparing the geometric position of subsequent glyphs; determining zones of at least one strip wherein a zone is defined by the combined area of strips, the geometrical areas of which overlap with each other; determining a boundary between two semantic units in a zone based on the geometric properties of the glyphs; sorting the identified semantic units in the zone in a sorted list; and, combining subsequent semantic units in the sorted list according to geometric considerations.
10 Claims
1. A method of identifying semantic units in an electronic document comprising the steps of:
a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further comprising geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs, the parameters including at least one of font size, character spacing, and text distortion;
b) determining a strip as a geometrical area of at least one glyph by comparing a geometric position of each glyph with a geographic position of the subsequent glyph in a description of the page, wherein every two glyphs are considered part of one strip if:
i) the difference of their vertical coordinates is smaller than a first predetermined value; and
ii) the difference of their horizontal coordinates is greater than zero, and smaller than a second predetermined value;
c) determining a zone of at least one strip wherein a zone is defined by the combined area of strips that overlap each other;
d) determining a potential boundary between two semantic units in a zone by a process comprising the steps of:
i) defining a list of glyphs in a zone according to their location in the page description with associated font characteristics and geometric position;
ii) identifying the presence of a potential boundary between semantic units between two glyphs if the difference between the vertical coordinate of a selected glyph and the vertical coordinate of the preceding glyph is larger than a third predetermined value; and/or
iii) defining a space box of rectangular shape for each font/encoding combination used in the document, the width of the space box being defined as the width of one of an existing and a hypothetical space glyph for the corresponding font size, and identifying the presence of a potential boundary between semantic units, if the distance between a selected glyph and the preceding glyph equals the width of the space box or exceeds it by a fourth predetermined value, thereby rendering a potential semantic unit; and
iv) repeating steps i) to iii) until the end of the glyph list has been reached;
e) sorting the identified potential semantic units in the zone according to the vertical position and, if the vertical positions differ by at most a fifth predetermined value, according to the horizontal position, whereby a sorted list of potential semantic units is provided;
f) processing the sorted list of potential semantic units by combining two potential semantic units if the distance between the last glyph of a first potential semantic unit and the first glyph of a next potential semantic unit is smaller than the sum of the width of the space box and a sixth predetermined value, until the end of the list has been reached, thereby rendering a sorted list of finally identified semantic units in the zone; and
g) repeating steps d) to f) until all zones have been processed.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9)
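The boundary test of step d) can be sketched as follows. This is one possible reading of the claim language, not a definitive implementation. The `Glyph` fields, the function names, and the choice to measure "distance" as the horizontal gap between the right edge of the preceding glyph and the left edge of the selected glyph are assumptions; `max_dy` and `extra_gap` stand in for the third and fourth predetermined values.

```python
import math
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float            # left edge
    y: float            # baseline
    width: float
    space_width: float  # space-box width for this glyph's font/encoding and size


def is_boundary(prev: Glyph, cur: Glyph, max_dy: float, extra_gap: float) -> bool:
    # Step d) ii): a vertical jump larger than the third value marks a boundary.
    if abs(cur.y - prev.y) > max_dy:
        return True
    # Step d) iii): the gap equals the space-box width, or exceeds it by at
    # least the fourth value (literal reading; extra_gap=0 gives gap >= width).
    gap = cur.x - (prev.x + prev.width)
    space = cur.space_width
    return math.isclose(gap, space, abs_tol=1e-6) or gap >= space + extra_gap


def split_units(glyphs: list[Glyph], max_dy: float, extra_gap: float) -> list[list[Glyph]]:
    """Step d) iv): scan the glyph list once, opening a new potential
    semantic unit at every detected boundary."""
    units: list[list[Glyph]] = []
    for g in glyphs:
        if units and not is_boundary(units[-1][-1], g, max_dy, extra_gap):
            units[-1].append(g)
        else:
            units.append([g])
    return units
```

For example, glyphs spelling "Hi", a gap of one space-box width, "yo", and then an "a" on a lower baseline would be split into three potential units.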
10. A program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer for identifying semantic units in a portable electronic document, the program including the steps of:
a) providing an electronic document being described in a page description language, the document comprising at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further comprising geometric information including position and width of all text fragments on the page and page description language parameters of all glyphs, the parameters including at least one of font size, character spacing, and text distortion;
b) determining a strip as a geometrical area of at least one glyph by comparing a geometric position of each glyph with a geographic position of the subsequent glyph in a description of the page, wherein every two glyphs are considered part of one strip if:
i) the difference of their vertical coordinates is smaller than a first predetermined value; and
ii) the difference of their horizontal coordinates is greater than zero, and smaller than a second predetermined value;
c) determining a zone of at least one strip wherein a zone is defined by the combined area of strips that overlap each other;
d) determining a potential boundary between two semantic units in a zone by a process comprising the steps of:
i) defining a list of glyphs in a zone according to their location in the page description with associated font characteristics and geometric position;
ii) identifying the presence of a potential boundary between semantic units between two glyphs if the difference between the vertical coordinate of a selected glyph and the vertical coordinate of the preceding glyph is larger than a third predetermined value; and/or
iii) defining a space box of rectangular shape for each font/encoding combination used in the document, the width of the space box being defined as the width of one of an existing and a hypothetical space glyph for the corresponding font size, and identifying the presence of a potential boundary between semantic units, if the distance between a selected glyph and the preceding glyph equals the width of the space box or exceeds it by a fourth predetermined value, thereby rendering a potential semantic unit; and
iv) repeating steps i) to iii) until the end of the glyph list has been reached;
e) sorting the identified potential semantic units in the zone according to the vertical position and, if the vertical positions differ by at most a fifth predetermined value, according to the horizontal position, whereby a sorted list of potential semantic units is provided;
f) processing the sorted list of potential semantic units by combining two potential semantic units if the distance between the last glyph of a first potential semantic unit and the first glyph of a next potential semantic unit is smaller than the sum of the width of the space box and a sixth predetermined value, until the end of the list has been reached, thereby rendering a sorted list of finally identified semantic units in the zone; and
g) repeating steps d) to f) until all zones have been processed.
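Steps e) and f) can be sketched in the same spirit. The `Unit` record, the function names, the top-to-bottom ordering (larger y first, as in page description languages whose origin is at the bottom left), and the use of Euclidean distance between the end of one unit and the start of the next are assumptions of this sketch, not statements from the claim; `same_line_tol` and `extra` stand in for the fifth and sixth predetermined values.

```python
import math
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Unit:
    text: str
    x0: float           # left edge of the first glyph
    x1: float           # right edge of the last glyph
    y: float            # baseline
    space_width: float  # space-box width for the unit's font


def sort_units(units: list[Unit], same_line_tol: float) -> list[Unit]:
    """Step e): order by vertical position; units whose baselines differ by
    at most same_line_tol are ordered by horizontal position instead."""
    def compare(a: Unit, b: Unit) -> int:
        if abs(a.y - b.y) <= same_line_tol:
            return (a.x0 > b.x0) - (a.x0 < b.x0)
        return -1 if a.y > b.y else 1  # assumed: larger y is higher on the page
    # Note: a tolerance-based comparison is not strictly transitive, so
    # borderline inputs may sort differently depending on input order.
    return sorted(units, key=cmp_to_key(compare))


def combine_units(sorted_units: list[Unit], extra: float) -> list[Unit]:
    """Step f): merge neighbours whose distance is smaller than the
    space-box width plus the sixth value."""
    result: list[Unit] = []
    for u in sorted_units:
        if result:
            last = result[-1]
            dist = math.hypot(u.x0 - last.x1, u.y - last.y)
            if dist < last.space_width + extra:
                result[-1] = Unit(last.text + u.text, last.x0, u.x1, u.y, u.space_width)
                continue
        result.append(u)
    return result
```

Using a two-dimensional distance keeps a unit at the end of one line from being merged with the unit that starts the next line, where a purely horizontal gap would be negative and would pass the threshold.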
Specification