Method and apparatus for creating, indexing and viewing abstracted documents
Abstract
Method and apparatus for storing document images, for creating a retrieval index by which the document images may be retrieved, and for displaying the retrieved document images. When a document image is obtained, it is subjected to rule-based block selection techniques whereby individual regions within the document image are identified, together with the type of each region, such as title-type, text-type, line art-type, halftone-type and color image-type regions. The identification is used to create structural information, and both the document image and the structural information are stored. A word-based retrieval index is created from title-type regions and/or text-type regions, and the retrieval index is used in conjunction with a search query to retrieve documents that match the query. The retrieved documents are displayed in either a full image mode or a rapid browsing mode. In the rapid browsing mode, the full image of the document is not displayed; instead, only an abstract structural view of the document image is displayed, based on the stored structural information. The level of abstraction may be specified by the operator in connection with the identified structural regions of the document, so that, for example, only a structural view is displayed, or only title-type regions are displayed, mixed with the remaining structure.
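The word-based retrieval index described in the abstract can be illustrated with a minimal sketch. It assumes that block selection and text conversion have already run, so each region arrives with a type label and, for text-bearing regions, its recognized text. The names `Region`, `RetrievalIndex`, `add_document` and `search` are hypothetical and do not come from the patent, and the data layout is only one of many that would permit retrieval both by word and by region type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One identified region; `text` stands in for recognized text (empty for non-text regions)."""
    region_type: str  # e.g. "title", "text", "halftone"
    text: str = ""


class RetrievalIndex:
    """Hypothetical word index that also records the region type each word came from."""

    # Only these region types contribute words, following the abstract.
    INDEXED_TYPES = ("title", "text")

    def __init__(self):
        # word -> region type -> set of document ids
        self._postings = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, regions):
        for region in regions:
            if region.region_type not in self.INDEXED_TYPES:
                continue  # non-text regions carry structure only, no words
            for word in region.text.lower().split():
                self._postings[word][region.region_type].add(doc_id)

    def search(self, query, region_type=None):
        """Return ids of documents containing every query word.

        If region_type is given, each word must occur in a region of that type.
        """
        result = None
        for word in query.lower().split():
            by_type = self._postings.get(word, {})
            if region_type is None:
                hits = set().union(*by_type.values())
            else:
                hits = set(by_type.get(region_type, set()))
            result = hits if result is None else result & hits
        return result or set()


index = RetrievalIndex()
index.add_document("doc1", [Region("title", "Quarterly Report"),
                            Region("text", "Sales rose"),
                            Region("halftone")])
index.add_document("doc2", [Region("title", "Sales Memo"),
                            Region("text", "See the quarterly report")])

anywhere = index.search("report")            # matches in any indexed region
in_titles = index.search("report", "title")  # matches restricted to title-type regions
```

In this sketch a query for "report" matches both documents, while the same query restricted to title-type regions matches only the first, which is the kind of region-aware retrieval the abstract describes.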
216 Citations
118 Claims
1. A method for creating a retrieval index by which an image of an arbitrarily formatted document may be retrieved, the method comprising the steps of:
processing the image of the arbitrarily formatted document to identify text regions on the document and non-text regions on the document, said processing step including, for each text region on the document, the step of automatically determining a region type using rule-based decisions automatically applied to the image of the text region without regard to a position of the text region in the document and without regard to a predetermined format for the document, the region type being one of plural different predefined region types encompassed by the rules;
converting the image of the document in text regions into text;
indexing the converted text so as to permit retrieval by reference to the converted text;
indexing the automatically determined region types so as to permit retrieval by reference to one of the determined region types; and
storing the image of the document such that the stored document image may be retrieved by reference to whether text in a text query appears in the indexed text and by reference to whether the text in the text query appears in one of the indexed region types.
Dependent claims: 2 to 18.

19. Apparatus for creating a retrieval index by which an image of an arbitrarily formatted document may be retrieved, said apparatus comprising:
a scanner for scanning an original arbitrarily formatted document and for outputting a bit map document image corresponding to the original arbitrarily formatted document;
a first memory region for storing a document image and a retrieval index;
a second memory region for storing process steps; and
a processor for executing the process steps stored in said second memory region;
wherein said second memory region includes process steps to
(a) receive the bit map document image scanned by said scanner,
(b) process the bit map document image to identify text regions in the document and non-text regions in the document, said processing step including, for each text region on the document, the step of automatically determining a region type using rule-based decisions automatically applied to the image of the text region without regard to a position of the text region in the document and without regard to a predetermined format for the document, the region type being one of plural different predefined region types encompassed by the rules,
(c) convert the document image in text regions into text,
(d) index the converted text so as to permit retrieval by reference to the converted text,
(e) index the automatically determined region types so as to permit retrieval by reference to one of the predetermined region types, and
(f) store the document image in said first memory region such that the stored document image may be retrieved by reference to whether text in a text query appears in the indexed text and by reference to whether the text in the text query appears in one of the indexed region types.
Dependent claims: 20 to 36.

37. A method for displaying documents comprising the steps of:
providing an image of a document and corresponding composition information for the document, the composition information including region type information for each of up to plural regions of the document;
displaying an abstracted view of the document using the composition information;
replacing, within the abstracted view itself, at least one region of the abstracted view with a corresponding image of the document, so that composition information and image information are mixedly displayed.
Dependent claims: 38 to 50.
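
The mixed display recited in the display claim can be pictured with a small sketch. It assumes the composition information is a list of typed regions and that each region can hand back its portion of the stored document image. The names `Region` and `abstracted_view` are hypothetical, the `image` strings are stand-ins for pixel data, and region geometry (position and size on the page) is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Region:
    region_type: str  # e.g. "title", "text", "halftone"
    image: str        # placeholder for this region's pixels in the stored document image


def abstracted_view(regions, show_image_for=()):
    """Return one display item per region.

    A region is shown as a type-labelled placeholder unless its type is listed in
    show_image_for, in which case the placeholder is replaced by the region's image.
    """
    view = []
    for region in regions:
        if region.region_type in show_image_for:
            view.append(("image", region.image))
        else:
            view.append(("placeholder", region.region_type))
    return view


page = [Region("title", "<title pixels>"),
        Region("text", "<body pixels>"),
        Region("halftone", "<photo pixels>")]

structure_only = abstracted_view(page)              # purely structural view
titles_shown = abstracted_view(page, {"title"})     # title image mixed with remaining structure
```

Here `structure_only` contains placeholders only, while `titles_shown` carries the title region's image alongside placeholders for the other regions, so composition information and image information appear in the same view.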

51. Apparatus for displaying a document comprising:
means for providing an image of a document and corresponding composition information for the document, the composition information including region type information for each of up to plural regions of the document;
means for displaying an abstracted view of the document using the composition information; and
means for replacing, within the abstracted view itself, at least one region of the abstracted view with a corresponding image of the document, so that composition information and image information are mixedly displayed.
Dependent claims: 52 to 64.

65. A document display apparatus comprising:
a display;
a first memory region from which an image of a document and corresponding composition information for the document may be retrieved;
a second memory region for storing process steps;
a processor for executing the process steps stored in said second memory region;
wherein said second memory region includes process steps to
(a) display on said display an abstracted view of a retrieved document in accordance with the composition information, and
(b) replace, within the abstracted view itself, a selected region of the abstracted view with a corresponding document image, so that composition information and image information are mixedly displayed.
Dependent claims: 66 to 78.

79. A method for creating a retrieval index by which an image of an arbitrarily formatted document may be retrieved, the method comprising the steps of:
processing the image of the arbitrarily formatted document to identify text regions on the document and non-text regions on the document, said processing step including, for each text region on the document, the step of automatically determining a region type using rule-based decisions automatically applied to the image of the text region without regard to a position of the text region in the document and without regard to a predetermined format for the document, the region type being one of plural different predefined region types encompassed by the rules;
converting the image of the document in text regions into text;
indexing the converted text so as to permit retrieval by reference to the converted text;
indexing the automatically determined region types so as to permit retrieval by reference to one of the determined region types; and
associating the image of the document with the indexed text such that the image of the document may be retrieved by reference to whether text in a text query appears in the indexed text and by reference to whether the text in the text query appears in one of the indexed region types.
Dependent claims: 80 to 96.

97. Apparatus for creating a retrieval index by which an image of an arbitrarily formatted document may be retrieved, said apparatus comprising:
a scanner for scanning an arbitrarily formatted original document and for outputting a bit map document image corresponding to the original arbitrarily formatted document;
a first memory region for storing a document image and a retrieval index;
a second memory region for storing process steps; and
a processor for executing the process steps stored in said second memory region;
wherein said second memory region includes process steps to
(a) receive the bit map document image scanned by said scanner,
(b) process the bit map document image to identify text regions in the document and non-text regions in the document, said processing step including, for each text region on the document, the step of automatically determining a region type using rule-based decisions applied to the image of the text region without regard to a position of the text region in the document and without regard to a predetermined format for the document, the region type being one of plural different predefined region types encompassed by the rules,
(c) convert the document image in text regions into text,
(d) index the converted text so as to permit retrieval by reference to the converted text,
(e) index the automatically determined region types so as to permit retrieval by reference to one of the region types, and
(f) associate the document image with the indexed text such that the document image may be retrieved by reference to whether text in a text query appears in the indexed text and by reference to whether the text in the text query appears in one of the indexed region types.
Dependent claims: 98 to 114.

115. A computer-readable memory medium storing computer-executable process steps to create a retrieval index by which an image of an arbitrarily formatted document may be retrieved, comprising:
a processing step to process the image of the arbitrarily formatted document to identify text regions on the document and non-text regions on the document, said processing step including, for each text region on the document, a step to determine a region type using rule-based decisions automatically applied to the image of the text region without regard to a position of the text region in the document and without regard to a predetermined format for the document, the region type being one of plural different predefined region types encompassed by the rules;
a converting step to convert the image of the document in text regions into text;
a text indexing step to index the converted text so as to permit retrieval by reference to the converted text;
a region type indexing step to index the automatically determined region types so as to permit retrieval by reference to one of the determined region types; and
a storing step to store the image of the document such that the stored document image may be retrieved by reference to whether text in a text query appears in the indexed text and by reference to whether the text in the text query appears in one of the indexed region types.
Dependent claims: 116 and 117.

118. A computer-readable memory medium storing computer-executable process steps to display documents comprising:
a providing step to provide an image of a document and corresponding composition information for the document, the composition information including region type information for each of up to plural regions of the document;
a displaying step to display an abstracted view of the document using the composition information;
a replacing step to replace, within the abstracted view itself, at least one region of the abstracted view with a corresponding image of the document, so that composition information and image information are mixedly displayed.
Specification