Identification of Guides and Gutters of a Document
First Claim
1. A computer readable medium storing a computer program which when executed by at least one processor analyzes a document that comprises a plurality of words, each word comprising an associated set of glyphs, each glyph having location coordinates, the computer program comprising sets of instructions for:
identifying a cluster of words based on the location coordinates of the glyphs;
based on the identified cluster of words, defining a set of boundary elements for the glyphs;
defining a structured document based on the glyphs and the defined set of boundary elements.
Abstract
Some embodiments provide a method for analyzing an unstructured document that includes a number of words. Each word is an associated set of glyphs and each glyph has location coordinates. The method identifies clusters of words based on the location coordinates. Based on the identified clusters, the method defines a set of boundary elements for the glyphs that identify a set of borders for the glyphs. The method defines a structured document for the unstructured document based on the glyphs and the defined boundary elements. To identify clusters of words, the method orders the location coordinates and identifies several partitions of the location coordinates. Each partition specifies a particular grouping of the coordinates into subsets. For each partition, the method identifies a particular set of subsets of location values that satisfy a particular set of constraints and determines a set of subsets of location values that optimizes a particular measure.
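The clustering steps in the abstract (order the coordinates, enumerate partitions, keep subsets that satisfy constraints, pick the set that optimizes a measure) can be illustrated with a minimal sketch. The abstract does not say how partitions are generated, what the constraints are, or which measure is optimized, so everything specific here is an assumption made for illustration: partitions are formed by splitting the sorted values at gaps larger than a threshold, the constraints are a minimum subset size (`min_size`) and a maximum spread (`max_spread`), and the measure is the number of values covered by qualifying subsets. The function names are hypothetical.

```python
from typing import List, Sequence


def partitions_by_gap(values: Sequence[float]) -> List[List[List[float]]]:
    """Order the values, then build one partition per distinct gap size.

    Assumption: a partition splits the ordered values wherever two
    neighbours are farther apart than the chosen gap threshold.
    """
    ordered = sorted(values)
    if not ordered:
        return []
    gaps = sorted({b - a for a, b in zip(ordered, ordered[1:])}) or [0]
    partitions = []
    for threshold in gaps:
        subsets = [[ordered[0]]]
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev > threshold:
                subsets.append([cur])      # gap too large: start a new subset
            else:
                subsets[-1].append(cur)    # close enough: extend current subset
        partitions.append(subsets)
    return partitions


def best_clusters(values: Sequence[float],
                  min_size: int = 3,
                  max_spread: float = 5) -> List[List[float]]:
    """Pick the set of qualifying subsets that covers the most values.

    Assumed constraints: at least `min_size` members and a spread of at
    most `max_spread`. Assumed measure: total number of covered values.
    """
    best: List[List[float]] = []
    best_score = 0
    for subsets in partitions_by_gap(values):
        kept = [s for s in subsets
                if len(s) >= min_size and s[-1] - s[0] <= max_spread]
        score = sum(len(s) for s in kept)
        if score > best_score:
            best, best_score = kept, score
    return best


# Hypothetical left-edge x-coordinates of words on a page.
left_edges = [72, 73, 75, 150, 300, 302, 301, 299, 410]
clusters = best_clusters(left_edges)
```

With these sample coordinates, the sketch keeps two clusters, one near x = 72 and one near x = 300, which is the kind of grouping that could then serve as candidate alignment guides.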
24 Claims
1. A computer readable medium storing a computer program which when executed by at least one processor analyzes a document that comprises a plurality of words, each word comprising an associated set of glyphs, each glyph having location coordinates, the computer program comprising sets of instructions for:
identifying a cluster of words based on the location coordinates of the glyphs;
based on the identified cluster of words, defining a set of boundary elements for the glyphs;
defining a structured document based on the glyphs and the defined set of boundary elements.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

12. A method for defining a program for (i) analyzing a document that comprises a plurality of words, each word comprising an associated set of glyphs, each glyph having location coordinates, and (ii) generating structural elements that define structure in said document based on said analysis, the method comprising:
defining a module for identifying clusters of words based on the location coordinates of the glyphs;
defining a module for defining a set of boundary elements for the glyphs based on the identified clusters of words, the boundary elements identifying a set of borders for the glyphs; and
defining a module for using the identified boundary elements to specify the structural elements.
Dependent claims: 13

14. A computer readable medium storing a computer program which when executed by at least one processor analyzes a document comprising a plurality of words, each word comprising a plurality of glyphs, each word having a particular location value, the computer program comprising sets of instructions for:
ordering the location values of the words;
identifying a plurality of different groupings of the location values into subsets;
for each different grouping, identifying a set of the subsets of location values that satisfy a particular set of constraints; and determining a particular one of the sets of the subsets of location values that optimizes a particular measure; and
defining boundary elements for the glyphs based on the particular set of subsets that optimizes the particular measure.
Dependent claims: 15, 16, 17, 18

19. A computer readable medium storing a computer program which when executed by at least one processor identifies clusters of data, the computer program comprising sets of instructions for:
receiving a set of data values to be clustered;
identifying a plurality of partitions of the data values, each different partition specifying a particular grouping of the data values into subsets;
for each group of subsets of data values, identifying a set of the subsets that satisfy a particular set of constraints; and
determining a set of subsets that optimizes a particular measure.
Dependent claims: 20

21. A computer readable medium storing a computer program which when executed by at least one processor analyzes a document that comprises a plurality of words, each word comprising an associated set of glyphs, the computer program comprising sets of instructions for:
identifying a set of left alignment points and a set of right alignment points in the document;
identifying an empty space in the document between a left alignment point and a right alignment point that satisfies particular criteria; and
defining a structured document using the identified empty space.
Dependent claims: 22, 23, 24

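The search in claim 21 for an empty space between alignment points can be sketched as follows. The claim leaves the "particular criteria" open, so this sketch assumes two simple ones: the strip must be at least `min_width` wide, and no word box may overlap it horizontally. It also ignores vertical extent, treating the strip as spanning the whole page. The `Box` type, the function name, and the sample coordinates are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Bounding box of a word (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


def find_gutters(words: Sequence[Box],
                 right_points: Sequence[float],
                 left_points: Sequence[float],
                 min_width: float = 10) -> List[Tuple[float, float]]:
    """Return (right_point, left_point) pairs that bound an empty strip.

    A strip starts at a right alignment point (where text to its left
    ends) and stops at a left alignment point (where text to its right
    begins). Assumed criteria: minimum width, and no word overlapping
    the strip horizontally.
    """
    gutters = []
    for r in sorted(right_points):
        for l in sorted(left_points):
            if l - r < min_width:
                continue  # too narrow, or the left point lies before r
            if all(w.x1 <= r or w.x0 >= l for w in words):
                gutters.append((r, l))
    return gutters


# Two columns of words: one spanning x = 50..200, one spanning x = 230..380.
words = [Box(50, 100, 200, 112), Box(50, 120, 200, 132),
         Box(230, 100, 380, 112), Box(230, 120, 380, 132)]
gutters = find_gutters(words, right_points=[200], left_points=[50, 230])
```

In this example the only qualifying strip is the 30-unit gap between the two columns. A word spanning both columns would block it under the assumed criteria, which is why a fuller treatment would also need to account for vertical extent.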
Specification