High-speed retrieval by example
Abstract
An improved document management system with high-speed retrieval by example retrieves a document matching a target document, in whole or part, by comparing descriptors of documents. A descriptor is derived from a pattern of labels, where each label is associated with a character, or more precisely, a character bounding box. A bounding box is found by examining contiguous pixels in an image. The particular label associated with a bounding box depends on the value of a metric measured from that bounding box. In one system, the metric is the spacing between the bounding box and an adjacent bounding box, in which case the labels approximately reflect a pattern of word lengths. In other systems, where word lengths are not present, the metric might be pixel density, and the pattern of labels approximately reflects a pattern of denser characters and sparser characters. The document management system, or just the query portion of the document management system, could be part of a copier, where a sample page is input to the copier and the copier retrieves the matching document and prints it.
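As a rough illustration of the first two steps the abstract describes, the sketch below finds bounding boxes by flood-filling contiguous ink pixels in a binary image and then measures the pixel density of a box. The function names `find_bounding_boxes` and `pixel_density`, the use of 4-connectivity, and the 0/1 image encoding are assumptions made for this example; the abstract does not specify them.

```python
from collections import deque

def find_bounding_boxes(image):
    """Return one (top, left, bottom, right) box, inclusive, per 4-connected ink blob.

    `image` is a list of rows of 0/1 values, where 1 marks an ink pixel
    (an assumed encoding for this sketch).
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over contiguous ink pixels.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def pixel_density(image, box):
    """Fraction of pixels inside the box that are ink."""
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    ink = sum(image[y][x]
              for y in range(top, bottom + 1)
              for x in range(left, right + 1))
    return ink / area
```

Boxes are returned in raster-scan order of their first ink pixel. A real system would also need to merge the parts of multi-blob characters such as "i", which this sketch does not attempt.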
17 Claims
1. A document retrieval apparatus, wherein a target document is input and a matching document is retrieved from a document database, comprising:
character detecting means for detecting character bounds in the target document based on image content of the target document;
discrimination means, coupled to the character detecting means, for discriminating the character bounds into classes, wherein discrimination is done based on at least one unambiguous characteristic of the character bound including the pixel density over an area of the character bound;
descriptor generating means, coupled to receive class indications of character bounds from the character detecting means, for generating target document descriptors based on patterns of class indications;
searching means, coupled to receive the target document descriptors from the descriptor generating means, for searching the document database for potentially matching documents which have descriptors in common with the target document;
evaluation means, coupled to receive a set of potentially matching documents from the searching means, for determining at least one matching document from among the potentially matching documents; and
output means, coupled to the evaluation means, for outputting the at least one matching document or indication thereof as a result of a retrieval request wherein the target document is input.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method of using a target document to specify a matching document in a document retrieval system which includes a document database in which the matching document is stored and the matching document is the target document or a document with elements in common with the target document, the method comprising the steps of:
inputting an image of the target document to the document retrieval system;
detecting character bounds in the target document image;
measuring at least one metric for each character bound in a sample of the character bounds, thereby resulting in a distribution of metric values, the at least one metric including a pixel density over an area of the character bound;
dividing the distribution of metric values into a plurality of ranges where each range is associated with one class of character bounds;
labelling each character bound in the sample with an indication of its class based on the metric value for said each character bound;
forming descriptors for the target document based on patterns of class indications;
searching an index of descriptors for documents in the document database using the formed descriptors of the target document;
identifying at least one document in the document database as a matching document when the at least one document has more descriptors in common with the target document than a nonmatching document.
Dependent claims: 9
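The dividing, labelling, and descriptor-forming steps of this method can be sketched as follows. This is only one possible reading: the equal-count split of the distribution, the fixed descriptor length, and the names `range_boundaries`, `label`, and `form_descriptors` are all assumptions introduced for the example, since the claim does not say how the ranges are chosen or how long a pattern is.

```python
def range_boundaries(values, n_classes):
    """Pick n_classes - 1 cut points so each range holds roughly equal counts.

    Equal-count (quantile) ranges are an assumption of this sketch.
    """
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_classes] for k in range(1, n_classes)]

def label(value, boundaries):
    """Class index: the number of cut points the value meets or exceeds."""
    return sum(value >= b for b in boundaries)

def form_descriptors(labels, length):
    """Each descriptor is a run of `length` consecutive class labels."""
    return [tuple(labels[i:i + length])
            for i in range(len(labels) - length + 1)]

# Hypothetical pixel densities for six character bounds, in reading order.
densities = [0.2, 0.8, 0.3, 0.9, 0.7, 0.1]
cuts = range_boundaries(densities, n_classes=2)
labels = [label(d, cuts) for d in densities]
descriptors = form_descriptors(labels, length=3)
```

With two classes, each label simply records whether a character is denser or sparser than the cut point, so a descriptor is a short pattern of dense and sparse characters.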
10. An apparatus for matching an input document to a reference document in a document database, comprising:
a document database, wherein reference descriptors are derived from content of reference documents in said document database;
a descriptor database associating reference descriptors and reference documents, wherein a reference descriptor describes, at least in part, a pattern of character densities and a specific reference descriptor and a specific reference document are associated in the descriptor database when the pattern of character densities described by the specific reference descriptor is found in the specific reference document;
input means for inputting content of an input document to be matched against said reference documents of said document database;
descriptor derivation means, coupled to said input means, for deriving at least one input descriptor from the input document where said input descriptor describes, at least in part, a pattern of character densities found in the input document; and
output means, coupled to said descriptor derivation means, for outputting an indication of reference documents which are associated with reference descriptors which match the input descriptor.
Dependent claims: 11
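One way to picture the descriptor database of this claim is as an inverted index from descriptors to the reference documents that contain them. The sketch below is an assumption-laden illustration, not the claimed apparatus: the class name `DescriptorDatabase`, its methods, and the ranking by count of shared descriptors are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

class DescriptorDatabase:
    """Inverted index: descriptor -> set of reference document ids."""

    def __init__(self):
        self._docs_by_descriptor = defaultdict(set)

    def add(self, doc_id, descriptors):
        """Associate each descriptor found in a reference document with it."""
        for descriptor in descriptors:
            self._docs_by_descriptor[descriptor].add(doc_id)

    def match(self, input_descriptors):
        """Return (doc_id, shared_count) pairs, most shared descriptors first."""
        votes = Counter()
        # Count each distinct input descriptor once per matching document.
        for descriptor in set(input_descriptors):
            for doc_id in self._docs_by_descriptor.get(descriptor, ()):
                votes[doc_id] += 1
        return votes.most_common()

db = DescriptorDatabase()
db.add("A", [(0, 1, 0), (1, 0, 1), (0, 1, 1)])
db.add("B", [(1, 1, 1), (0, 1, 1)])
ranked = db.match([(0, 1, 0), (0, 1, 1), (1, 1, 0)])
```

Here document "A" shares two descriptors with the input and "B" shares one, so "A" is ranked first. Lookup cost depends on the number of input descriptors rather than on the size of the database, which is the usual reason for choosing an inverted index.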
12. A method of using a target document to specify a matching document in a document retrieval system which includes a document database in which the matching document is stored and the matching document is the target document or a document with elements in common with the target document, the method comprising the steps of:
inputting an image of the target document to the document retrieval system;
detecting character features in the target document image according to a plurality of classes of character features, wherein at least one of the character features detected is a pixel density over an area of a character bound;
forming descriptors for the target document based on the detected character features of each of the plurality of classes of character features;
for each class of character features, searching an index of descriptors for documents in the document database using the formed descriptors of the target document for said each class; and
identifying at least one document in the document database as a matching document when the at least one document has more descriptors in common with the target document than a nonmatching document.
Dependent claims: 13
14. A method of using a target document to specify a matching document in a document retrieval system which includes a document database in which the matching document is stored and the matching document is the target document or a document with elements in common with the target document, the method comprising the steps of:
inputting an image of the target document to the document retrieval system;
detecting character features in the target document image according to a plurality of classes of character features;
forming descriptors for the target document based on the detected character features of each of the plurality of classes of character features;
for each class of character features, searching an index of descriptors for documents in the document database using the formed descriptors of the target document for said each class; and
identifying at least one document in the document database as a matching document when the at least one document has more descriptors in common with the target document than a nonmatching document
wherein the plurality of classes of character features includes a word length class of features describing patterns of word lengths and a phoneme word length class of features describing patterns of the number of phonemes per word.
15. A method of using a target document to specify a matching document in a document retrieval system which includes a document database in which the matching document is stored and the matching document is the target document or a document with elements in common with the target document, the method comprising the steps of:
inputting an image of the target document to the document retrieval system;
detecting character features in the target document image according to a plurality of classes of character features;
forming descriptors for the target document based on the detected character features of each of the plurality of classes of character features;
for each class of character features, searching an index of descriptors for documents in the document database using the formed descriptors of the target document for said each class; and
identifying at least one document in the document database as a matching document when the at least one document has more descriptors in common with the target document than a nonmatching document
wherein the plurality of classes of character features includes an intercharacter spacing class of features describing patterns of word lengths, a character pixel density class of features describing a pattern of at least character densities compared with a threshold, and a phoneme spacing class of features describing the number of phonemes per word.
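To illustrate how intercharacter spacing can describe a pattern of word lengths, the sketch below walks the character boxes of one text line and starts a new word whenever the gap to the previous box exceeds a threshold. The function name `word_length_pattern`, the `(left, right)` box representation, and the fixed pixel threshold are assumptions for this example; the claim does not state how the spacing is classified.

```python
def word_length_pattern(boxes, gap_threshold):
    """Return the number of characters in each word on one text line.

    `boxes` holds (left, right) horizontal extents of character boxes in
    reading order; a gap wider than `gap_threshold` pixels is treated as a
    word break (an assumed rule for this sketch).
    """
    if not boxes:
        return []
    lengths = [1]
    for (_, prev_right), (left, _) in zip(boxes, boxes[1:]):
        if left - prev_right > gap_threshold:
            lengths.append(1)      # wide gap: a new word starts here
        else:
            lengths[-1] += 1       # narrow gap: same word continues
    return lengths

# Hypothetical boxes: three closely spaced characters, a wide gap, then two more.
line = [(0, 4), (6, 10), (12, 16), (25, 29), (31, 35)]
pattern = word_length_pattern(line, gap_threshold=5)
```

The resulting list of word lengths can then be cut into fixed-length runs and used as descriptors in the same way as patterns of density labels.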
16. A method of using a target document to specify a matching document in a document retrieval system which includes a document database in which the matching document is stored and the matching document is the target document or a document with elements in common with the target document, the method comprising the steps of:
inputting an image of the target document to the document retrieval system;
detecting phoneme features in the target document image according to a plurality of classes of phoneme features;
forming descriptors for the target document based on the detected phoneme features of each of the plurality of classes of phoneme features;
for each class of phoneme features, searching an index of descriptors for documents in the document database using the formed descriptors of the target document for said each class;
identifying at least one document in the document database as a matching document when the at least one document has more descriptors in common with the target document than a nonmatching document.
Dependent claims: 17
Specification