Sorting image segments into clusters based on a distance measurement
Abstract
A programming interface of a document search system enables a user to dynamically specify features of documents recorded in a corpus of documents. The programming interface provides category and format flexibility for defining different genres of documents. The document search system initially segments document images into one or more layout objects. Each layout object identifies a structural element in a document, such as a text block, a graphic, or a halftone. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The set of attributes is used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. Using the set of attributes, a user defines features of a document with the programming interface. After receiving a feature or attribute and a set of document images selected by a user, the system forms a set of image segments by identifying those layout objects in the set of document images that make up the selected feature or attribute. The system then sorts the set of image segments into meaningful groupings of objects that have similarities and/or recurring patterns. Subsequently, document images in the set of document images are ordered and displayed to a user in accordance with the meaningful groupings.
23 Claims
-
1. A method for sorting document images stored in a memory of a document management system, comprising the steps of:
-
segmenting each document image recorded in the memory into a set of layout objects;
each layout object in each of the sets of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
selecting a feature of a document from a set of features;
each of the features in the set of features identifying groups of layout objects in different ones of the sets of layout objects recorded in the memory;
assembling in the memory a set of image segments;
each image segment in the set of image segments identifying those layout objects of a document image stored in the memory that form the selected feature;
computing a distance measurement between a selected image segment and ones of the image segments in the assembled set of image segments; and
sorting the assembled set of image segments into clusters in the memory with the computed distance measurements;
each cluster defining a grouping of image segments that have similar layout objects forming the selected feature.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
selecting a first image segment from the set of image segments to define the selected image segment;
computing a distance measurement between the first image segment and image segments remaining in the set of image segments; and
defining a first cluster with the first image segment and certain of the remaining image segments having a distance measurement that is within a threshold distance.
-
-
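The threshold step recited just above (a first image segment, distances to the remaining segments, and a first cluster of those within a threshold distance) can be sketched as follows. This is only an illustration under stated assumptions: the claims do not fix a distance measurement, so the L1 distance over bounding boxes `(x, y, width, height)` is hypothetical, as are the names `l1_distance` and `threshold_clusters`. Repeating the step on the leftover segments to obtain further clusters is also an assumption, since the text above only defines the first cluster.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: an image segment reduced to one bounding box
# (x, y, width, height). The claims do not prescribe this representation.
Segment = Tuple[float, float, float, float]


def l1_distance(a: Segment, b: Segment) -> float:
    """Assumed distance measurement: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def threshold_clusters(
    segments: Sequence[Segment],
    distance: Callable[[Segment, Segment], float],
    threshold: float,
) -> List[List[Segment]]:
    """Greedy clustering: take the first remaining segment as the selected
    segment, group every remaining segment within `threshold` of it, then
    repeat on whatever is left (the repetition is an assumption)."""
    remaining = list(segments)
    clusters: List[List[Segment]] = []
    while remaining:
        seed = remaining.pop(0)
        cluster = [seed]
        leftover = []
        for seg in remaining:
            if distance(seed, seg) <= threshold:
                cluster.append(seg)
            else:
                leftover.append(seg)
        clusters.append(cluster)
        remaining = leftover
    return clusters


segments = [
    (10, 10, 200, 40),
    (12, 11, 198, 42),
    (300, 500, 100, 100),
    (305, 498, 95, 105),
]
clusters = threshold_clusters(segments, l1_distance, threshold=20)
```

With these sample boxes the first two segments fall into one cluster and the last two into another, because each pair differs by less than the threshold of 20 while the pairs are far apart from each other.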
6. The method according to claim 1, further comprising the steps of:
-
selecting a document image from the memory;
assembling a single image segment to define the selected image segment by identifying those layout objects of the selected document image that form the selected feature;
computing a distance measurement between the single image segment and each image segment in the set of image segments; and
forming clusters of document images by ranking the computed distance measurement between the single image segment and each image segment in the set of image segments.
-
-
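The query-by-example variant of claim 6 can be sketched in the same spirit: one image segment is assembled from a selected document image, and every segment in the set is ranked by its distance to it. The bounding-box representation, the L1 distance, the document identifiers, and the function name `rank_by_distance` are all hypothetical. The claim does not say how clusters are cut from the ranking, so the sketch stops at the ranked list.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: one bounding box (x, y, width, height) per segment.
Segment = Tuple[float, float, float, float]


def l1_distance(a: Segment, b: Segment) -> float:
    """Assumed distance measurement: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def rank_by_distance(
    query: Segment,
    segments: Dict[str, Segment],
    distance: Callable[[Segment, Segment], float],
) -> List[Tuple[str, float]]:
    """Return (document id, distance) pairs ordered from nearest to farthest."""
    scored = [(doc_id, distance(query, seg)) for doc_id, seg in segments.items()]
    return sorted(scored, key=lambda pair: pair[1])


query_segment = (10, 10, 200, 40)
segment_set = {
    "doc_a": (300, 500, 100, 100),
    "doc_b": (12, 11, 198, 42),
    "doc_c": (10, 10, 200, 40),
}
ranking = rank_by_distance(query_segment, segment_set, l1_distance)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

Here `ranked_ids` lists `doc_c` first (identical to the query), then `doc_b`, then `doc_a`.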
7. The method according to claim 1, further comprising the step of displaying the assembled image segments in the clusters sorted by said sorting step.
-
8. The method according to claim 1, further comprising the step of computing attributes for each layout object in the set of layout objects;
- the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects in the document image.
-
9. The method according to claim 8, further comprising the step of executing a routine for identifying a feature of the document image;
- the routine having a sequence of selection operations that consumes the set of layout objects and uses the computed attributes to produce a subset of layout objects;
said executing step identifying the subset of layout objects as the feature of the document image.
-
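Claims 8 and 9 describe attributes computed per layout object and a routine whose sequence of selection operations narrows the set of layout objects to the subset that forms a feature. A minimal sketch follows, assuming a hypothetical `LayoutObject` record and two illustrative attributes (`area` and `objects_above`); none of these names or attribute choices come from the claims, and the "topmost text block" routine is only an example of what such a routine might select.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class LayoutObject:
    """Hypothetical layout object: a typed bounding box on a page image."""
    kind: str      # e.g. "text", "graphics", "halftone"
    x: float
    y: float       # top edge; y grows downward
    width: float
    height: float


def compute_attributes(obj: LayoutObject, page: Sequence[LayoutObject]) -> Dict[str, float]:
    """Illustrative attributes: one quantifies a property of the object itself,
    the other its spatial relationship to the other objects on the page."""
    return {
        "area": obj.width * obj.height,
        "objects_above": sum(1 for other in page if other.y + other.height <= obj.y),
    }


# A selection operation decides whether to keep an object given its attributes.
Selection = Callable[[LayoutObject, Dict[str, float]], bool]


def run_routine(page: Sequence[LayoutObject], operations: Sequence[Selection]) -> List[LayoutObject]:
    """Apply each selection operation in turn, consuming the current subset
    and producing a smaller one; the final subset is taken as the feature."""
    subset = list(page)
    for keep in operations:
        subset = [obj for obj in subset if keep(obj, compute_attributes(obj, page))]
    return subset


page = [
    LayoutObject("text", 50, 20, 500, 40),       # short block at the top
    LayoutObject("text", 50, 100, 500, 600),     # large body block
    LayoutObject("halftone", 50, 720, 200, 150),
]
topmost_text = [
    lambda obj, attrs: obj.kind == "text",
    lambda obj, attrs: attrs["objects_above"] == 0,
]
feature = run_routine(page, topmost_text)
```

For this sample page the routine keeps only the short text block at the top, since the body block has one object above it and the halftone is not text.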
10. The method according to claim 1, further comprising the step of defining a structural model for identifying a genre of document;
- wherein the structural model defines a class of document images which express a common communicative purpose that is independent of document content.
-
11. The method according to claim 1, further comprising the step of providing a user interface for selecting the feature.
-
12. The method according to claim 1, wherein said assembling step assembles more than one layout object to form the selected feature of a document image stored in the memory.
-
13. The method according to claim 1, further comprising the steps of:
-
specifying a set of features in addition to the selected feature;
wherein said sorting step assembles the set of image segments into clusters that include the selected feature and the specified set of features; and
wherein the selected feature includes a subset of layout objects of ones of the specified set of features.
-
-
14. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for sorting document images stored in a memory of a document management system, said method steps comprising:
-
segmenting each document image recorded in the memory into a set of layout objects;
each layout object in each of the sets of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
selecting a feature of a document from a set of features;
each of the features in the set of features identifying groups of layout objects in different ones of the sets of layout objects recorded in the memory;
assembling in the memory a set of image segments;
each image segment in the set of image segments identifying those layout objects of a document image stored in the memory that form the selected feature;
computing a distance measurement between a selected image segment and ones of the image segments in the assembled set of image segments; and
sorting the assembled set of image segments into clusters in the memory with the computed distance measurements;
each cluster defining a grouping of image segments that have similar layout objects forming the selected feature.
Dependent Claims (15, 16, 17)
selecting a first image segment from the set of image segments to define the selected image segment;
computing a distance measurement between the first image segment and image segments remaining in the set of image segments; and
defining a first cluster with the first image segment and certain of the remaining image segments having a distance measurement that is within a threshold distance.
-
-
16. The program storage device as recited in claim 14, wherein said method steps further comprise the steps of:
-
selecting a document image from the memory;
assembling a single image segment to define the selected image segment by identifying those layout objects of the selected document image that form the selected feature;
computing a distance measurement between the single image segment and each image segment in the set of image segments; and
forming clusters of document images by ranking the computed distance measurement between the single image segment and each image segment in the set of image segments.
-
-
17. The program storage device as recited in claim 14, wherein said method steps further comprise the step of:
-
specifying a set of features in addition to the selected feature;
wherein said sorting step assembles the set of image segments into clusters that include the selected feature and the specified set of features; and
wherein the selected feature includes a subset of layout objects of ones of the specified set of features.
-
-
18. A document management system for sorting document images, comprising:
-
a memory for storing the document images and image processing instructions of the document management system; and
a processor coupled to the memory for executing the image processing instructions of the document management system;
the processor in executing the image processing instructions:
segmenting each document image recorded in the memory into a set of layout objects;
each layout object in each of the sets of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
selecting a feature of a document from a set of features;
each of the features in the set of features identifying groups of layout objects in different ones of the sets of layout objects recorded in the memory;
assembling in the memory a set of image segments;
each image segment in the set of image segments identifying those layout objects of a document image stored in the memory that form the selected feature;
computing a distance measurement between a selected image segment and ones of the image segments in the assembled set of image segments; and
sorting the assembled set of image segments into clusters in the memory with the computed distance measurements;
each cluster defining a grouping of image segments that have similar layout objects forming the selected feature.
Dependent Claims (19, 20, 21, 22, 23)
specifies a set of features in addition to the selected feature; and
wherein said sorting image processing instruction assembles the set of image segments into clusters that include the selected feature and the specified set of features; and
wherein the selected feature includes a subset of layout objects of ones of the specified set of features.
-
Specification