Identification of guides and gutters of a document
Abstract
Some embodiments provide a method for analyzing an unstructured document that includes a number of words. Each word is an associated set of glyphs and each glyph has location coordinates. The method identifies clusters of words based on the location coordinates. Based on the identified clusters, the method defines a set of boundary elements for the glyphs that identify a set of borders for the glyphs. The method defines a structured document for the unstructured document based on the glyphs and the defined boundary elements. To identify clusters of words, the method orders the location coordinates and identifies several partitions of the location coordinates. Each partition specifies a particular grouping of the coordinates into subsets. For each partition, the method identifies a particular set of subsets of location values that satisfy a particular set of constraints and determines a set of subsets of location values that optimizes a particular measure.
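The clustering step in the abstract (order the coordinates, form several partitions, keep the subsets that satisfy constraints, pick the best partition) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the patented method: partitions are formed by splitting the sorted values at gaps of at least a given size, the constraints are a minimum subset size (`min_size`) and a maximum spread (`max_spread`), and the measure being optimized is simply the number of coordinates covered by the kept subsets.

```python
from typing import List


def candidate_partitions(values: List[float]) -> List[List[List[float]]]:
    """Order the values, then build one partition per distinct gap size.

    For each threshold, the ordered values are split wherever the gap
    between neighbours is at least that threshold (an assumed rule).
    """
    ordered = sorted(values)
    if not ordered:
        return []
    gaps = sorted({b - a for a, b in zip(ordered, ordered[1:]) if b > a},
                  reverse=True)
    partitions = [[ordered]]  # trivial partition: everything in one subset
    for threshold in gaps:
        subsets, current = [], [ordered[0]]
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev >= threshold:
                subsets.append(current)
                current = [cur]
            else:
                current.append(cur)
        subsets.append(current)
        partitions.append(subsets)
    return partitions


def best_clusters(values: List[float], min_size: int = 3,
                  max_spread: float = 2.0) -> List[List[float]]:
    """Return the constraint-satisfying subsets of the best partition.

    Assumed constraints: at least `min_size` values, spread <= `max_spread`.
    Assumed measure: total number of values covered by the kept subsets.
    """
    best, best_score = [], 0
    for subsets in candidate_partitions(values):
        kept = [s for s in subsets
                if len(s) >= min_size and s[-1] - s[0] <= max_spread]
        score = sum(len(s) for s in kept)
        if score > best_score:
            best, best_score = kept, score
    return best


# Hypothetical left-edge x coordinates of words on a page.
left_edges = [72.0, 72.1, 72.0, 71.9, 300.0, 300.2, 300.1, 150.0]
clusters = best_clusters(left_edges)
```

With these hypothetical inputs, the two tight groups of coordinates near 72 and 300 survive, while the isolated value at 150 is dropped because its subset is too small.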
89 Citations
24 Claims
1. A machine readable medium storing a program which when executed by at least one processing unit analyzes a document that comprises a plurality of words, each word comprising an associated set of glyphs, each glyph having location coordinates, the program comprising sets of instructions for:
- identifying a group of words aligned in a first direction based on the location coordinates of a particular glyph in each of the words in the group of aligned words;
- based on the identified group of aligned words, defining an alignment element for the particular glyphs at a particular location in the first direction, wherein the alignment element that indicates a particular alignment for a plurality of text lines containing the group of aligned words;
- removing particular locations from the alignment element in a second direction based on an analysis of the group of aligned words, wherein each particular location of the alignment element corresponds to a particular text line; and
- defining a structured document based on the glyphs and remaining locations of the defined alignment element.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14
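As a rough illustration of the steps recited in claim 1, the following Python sketch treats the left edge of each word's first glyph as the "particular glyph" coordinate, builds a vertical alignment element at a given x position, and removes the text-line locations that do not support it. The `Word` type, the tolerance `tol`, and the removal rule (drop lines that have no aligned word or that contain a word crossing the guide) are all assumptions chosen for the example; the claim itself does not prescribe them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    left: float   # x of the first glyph's left edge (assumed representation)
    right: float  # x of the last glyph's right edge
    line: int     # index of the text line containing the word


def left_aligned_group(words: List[Word], x: float, tol: float = 0.5) -> List[Word]:
    """Words whose left edge lies within `tol` of the guide position x."""
    return [w for w in words if abs(w.left - x) <= tol]


def guide_locations(words: List[Word], x: float, tol: float = 0.5) -> List[int]:
    """Text-line indices that remain on the alignment element at x.

    The element initially spans every line between the first and last
    aligned word; lines are then removed if no word starts at x or if
    some word crosses x (assumed removal rule).
    """
    group = left_aligned_group(words, x, tol)
    if not group:
        return []
    aligned_lines = {w.line for w in group}
    candidate = range(min(aligned_lines), max(aligned_lines) + 1)
    crossed = {w.line for w in words if w.left < x - tol and w.right > x + tol}
    return [ln for ln in candidate if ln in aligned_lines and ln not in crossed]


# Hypothetical page: a two-column list interrupted by a wide heading on line 2.
page = [
    Word("1", 50, 55, 0), Word("Name", 100, 130, 0),
    Word("2", 50, 55, 1), Word("Alice", 100, 128, 1),
    Word("Heading-across-columns", 50, 200, 2),
    Word("3", 50, 55, 3), Word("Bob", 100, 120, 3),
]
remaining = guide_locations(page, x=100)
```

In this hypothetical page, line 2 is removed from the guide at x = 100 because the heading runs across it, leaving lines 0, 1, and 3.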
12. A method for (i) analyzing a document that comprises a plurality of words, each word comprising an associated set of glyphs, each glyph having location coordinates, and (ii) generating structural elements that define structure in said document based on said analysis, the method comprising:
- defining an alignment guide for a set of words in different text lines based on the location coordinates of the glyphs at the edges of the words;
- removing locations from the alignment guide based on an analysis of the words in different text lines, wherein each particular location of the alignment guide corresponds to a particular text line;
- associating groups of text lines as aligned groups of text lines based on remaining locations of the alignment guide;
- using the aligned groups of text lines to specify the structural elements; and
- defining a structured document based on the glyphs, the aligned groups of text lines, and the structural elements.
Dependent claims: 13, 15, 16
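One way to picture the "associating groups of text lines" step of claim 12 is to split the text-line indices that remain on an alignment guide into runs of consecutive lines. The sketch below does exactly that in Python; treating "consecutive line indices" as the grouping criterion is an assumption made for illustration, and `aligned_line_groups` is a hypothetical helper name.

```python
from typing import Iterable, List


def aligned_line_groups(remaining: Iterable[int]) -> List[List[int]]:
    """Split remaining guide locations into runs of consecutive text lines.

    Each run is treated as one aligned group of text lines (assumed rule).
    """
    groups: List[List[int]] = []
    for line in sorted(set(remaining)):
        if groups and line == groups[-1][-1] + 1:
            groups[-1].append(line)   # extends the current run
        else:
            groups.append([line])     # a removed location starts a new run
    return groups


# Hypothetical remaining locations after lines 2, 6, 7 and 8 were removed.
groups = aligned_line_groups([0, 1, 3, 4, 5, 9])
```

Each resulting group could then serve as a candidate structural element, such as one column of a table or one indented block.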
17. A machine readable medium storing a program which when executed by at least one processing unit analyzes a document that comprises a plurality of glyphs, each glyph having location coordinates, the program comprising sets of instructions for:
- identifying sets of glyphs to associate as words;
- defining a potential alignment guide for a set of words in different text lines based on the location coordinates of the glyphs at the edges of the words;
- identifying a plurality of alignment elements within the potential alignment guide by removing particular locations from the potential alignment guide based on an analysis of the group of vertically aligned words within each set of text lines, wherein each particular location of the potential alignment guide corresponds to a particular text line, each identified alignment element indicating a particular alignment for a set of text lines that contain a group of vertically aligned words;
- using the identified alignment elements to specify additional structural elements within the document; and
- defining a structured document based on the glyphs, the aligned groups of text lines, and the structural elements.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
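Claim 17 starts from bare glyphs, so its first step is to associate glyphs into words. A minimal Python sketch of that step for a single text line follows. The `Glyph` type and the fixed `max_gap` threshold are assumptions for illustration only; the claim does not state how glyphs are associated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph (assumed representation)
    width: float  # advance width of the glyph


def group_into_words(glyphs: List[Glyph], max_gap: float = 1.5) -> List[List[Glyph]]:
    """Group the glyphs of one text line into words.

    A glyph joins the current word when the horizontal gap to the previous
    glyph's right edge is at most `max_gap` (assumed threshold).
    """
    words: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: g.x):
        if words:
            prev = words[-1][-1]
            if g.x - (prev.x + prev.width) <= max_gap:
                words[-1].append(g)
                continue
        words.append([g])
    return words


# Hypothetical glyphs for the text "ab cd", given out of order.
line = [Glyph("c", 14.0, 5.0), Glyph("a", 0.0, 5.0),
        Glyph("d", 19.2, 5.0), Glyph("b", 5.5, 5.0)]
texts = ["".join(g.char for g in w) for w in group_into_words(line)]
```

The left edge of the first glyph and the right edge of the last glyph in each word then provide the edge coordinates from which a potential alignment guide can be defined.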
Specification