System for searching a corpus of document images by user specified document layout components
Abstract
A document search system provides a user with a programming interface for dynamically specifying features of documents recorded in a corpus of documents. The programming interface operates at a high level that is suitable for interactive user specification of layout components and structures of documents. In operation, a bitmap image of a document is analyzed by the document search system to identify layout objects such as text blocks or graphics. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The identified set of attributes is used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. After computing attributes for each layout object, a user can operate the programming interface to define unique document features. Each document feature is a routine defined by a sequence of selection operations which consume a first set of layout objects and produce a second set of layout objects. The second set of layout objects constitutes the feature in a page image of a document. Using the programming interface, a user flexibly defines a genre of document using the user-specified document features.
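The idea of a feature as a sequence of selection operations, each consuming one set of layout objects and producing another, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `LayoutObject` fields, the helper names `of_kind`, `in_top_fraction`, `widest`, and `run_routine`, and the sample coordinates. For brevity the only frame of reference used is the page itself, not another layout object.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LayoutObject:
    """Hypothetical layout object with a type and a few computed attributes."""
    kind: str      # e.g. "text_block" or "graphics"
    left: float
    top: float
    width: float
    height: float


# A selection operation consumes a list of layout objects and returns a subset.
Selection = Callable[[List[LayoutObject]], List[LayoutObject]]


def of_kind(kind: str) -> Selection:
    """Keep only layout objects of the given type."""
    return lambda objs: [o for o in objs if o.kind == kind]


def in_top_fraction(fraction: float, page_height: float) -> Selection:
    """Keep objects that lie entirely within the top fraction of the page."""
    limit = fraction * page_height
    return lambda objs: [o for o in objs if o.top + o.height <= limit]


def widest() -> Selection:
    """Keep the single widest object, or nothing if the input is empty."""
    return lambda objs: [max(objs, key=lambda o: o.width)] if objs else []


def run_routine(routine: List[Selection],
                objs: List[LayoutObject]) -> List[LayoutObject]:
    """Apply each selection in order; the final list is the 'second set'."""
    for select in routine:
        objs = select(objs)
    return objs


# Assumed sample page, 1000 units tall.
page = [
    LayoutObject("text_block", 100, 40, 400, 30),    # short block near the top
    LayoutObject("text_block", 50, 120, 500, 600),   # large body block
    LayoutObject("graphics", 200, 750, 200, 100),
]

# A hypothetical "title" feature: the widest text block in the top 20% of the page.
title_feature = [of_kind("text_block"), in_top_fraction(0.2, 1000.0), widest()]
result = run_routine(title_feature, page)
```

Because every operation has the same input and output shape, operations can be reordered or recombined freely, which is one plausible reading of how a user could compose features interactively.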
20 Claims
1. A method for searching a corpus of document images stored in a memory, comprising the steps of:
segmenting each document image in the corpus of document images into a first set of layout objects;
each layout object in the first set of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
for each segmented document image, computing attributes for each layout object in the first set of layout objects;
the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images;
the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images;
for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects;
said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and
displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
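The executing and identifying steps of claim 1 can be sketched in Python as a loop over a corpus that has already been segmented: a document image is reported when the routine's output set for it holds at least one layout object. This is a rough illustration under stated assumptions, not the patented implementation. The dictionary-based attribute records, the file names, the normalized coordinates, and the functions `execute` and `search_corpus` are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical representation: each layout object is a record of computed attributes.
LayoutObject = Dict[str, Any]
Selection = Callable[[List[LayoutObject]], List[LayoutObject]]


def execute(routine: List[Selection],
            layout_objects: List[LayoutObject]) -> List[LayoutObject]:
    """Run the selection operations in sequence on one document's first set."""
    selected = layout_objects
    for operation in routine:
        selected = operation(selected)
    return selected  # the "second set" of layout objects


def search_corpus(corpus: Dict[str, List[LayoutObject]],
                  routine: List[Selection]) -> List[str]:
    """Return the ids of documents whose second set is non-empty."""
    return [doc_id for doc_id, objs in corpus.items() if execute(routine, objs)]


# Assumed pre-segmented corpus; "top" and "height" are fractions of page height.
corpus = {
    "memo.tif": [
        {"type": "text", "top": 0.05, "height": 0.05},
        {"type": "text", "top": 0.20, "height": 0.70},
    ],
    "letter.tif": [
        {"type": "graphics", "top": 0.02, "height": 0.10},
        {"type": "text", "top": 0.20, "height": 0.60},
    ],
    "chart.tif": [
        {"type": "graphics", "top": 0.50, "height": 0.40},
    ],
}

# Hypothetical feature: a graphics object near the top of the page (e.g. a logo).
logo_routine = [
    lambda objs: [o for o in objs if o["type"] == "graphics"],
    lambda objs: [o for o in objs if o["top"] < 0.15],
]

matches = search_corpus(corpus, logo_routine)
```

Here only `letter.tif` matches, since `chart.tif` has a graphics object but not near the top of the page. A display step would then present `matches` to the user.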
9. A program storage device readable by a machine, embodying a program of instructions executable by the machine to perform method steps for searching a corpus of document images stored in a memory of a document management system, said method steps comprising:
segmenting each document image in the corpus of document images into a first set of layout objects;
each layout object in the first set of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
for each segmented document image, computing attributes for each layout object in the first set of layout objects;
the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images;
the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images;
for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects;
said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and
displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.
Dependent claims: 10
11. A document management system for searching a corpus of document images, comprising:
a memory for storing the corpus of document images and image processing instructions of the document management system;
a display; and
a processor coupled to the memory and the display for executing the document image processing instructions of the document management system;
the processor in executing the document image processing instructions;
segmenting each document image in the corpus of document images into a first set of layout objects;
each layout object in the first set of layout objects being one of a plurality of layout object types;
each of the plurality of layout object types identifying a structural element of a document;
for each segmented document image, computing attributes for each layout object in the first set of layout objects;
the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;
providing a program interface on the display for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images;
the certain layout objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images;
for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects;
identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects to have the feature of the example document image; and
displaying at the program interface on the display the ones of the document images in the corpus of document images identified as having the feature of the example document image.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification