Index extraction from documents
Abstract
Systems, methods, and programs embodied in a computer readable medium are provided for index extraction. Stored in a database are ground truth documents that are organized according to a plurality of classifications, each classification having a group of predefined indices. A document to be indexed is classified by drawing an association between the document and one of the classifications. An attempt is made to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications. Upon a failure to extract the subset of the group of predefined indices, attempts are made to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications.
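The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `CLASSIFICATIONS` table, the regular-expression index patterns, the token-overlap classifier, the use of `difflib` for fuzzy matching, and the 0.7 similarity cutoff are all assumptions chosen for illustration. In this sketch, "failure to extract" is taken to mean that any predefined index is missing.

```python
import difflib
import re

# Hypothetical stand-in for the ground truth database: each classification
# carries its predefined indices (as regex patterns) and a salient dictionary.
CLASSIFICATIONS = {
    "invoice": {
        "indices": {
            "invoice_number": re.compile(r"Invoice\s+Number:\s*(\S+)"),
            "total": re.compile(r"Total:\s*(\S+)"),
        },
        "salient_dictionary": ["Invoice", "Number", "Total"],
    },
    "receipt": {
        "indices": {"store": re.compile(r"Store:\s*(\S+)")},
        "salient_dictionary": ["Store", "Receipt"],
    },
}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def closest(token, words):
    """Return the salient word closest to token, or None (assumed cutoff 0.7)."""
    matches = difflib.get_close_matches(token, words, n=1, cutoff=0.7)
    return matches[0] if matches else None


def classify(text):
    """Associate the document with the classification whose salient words
    approximately match the most tokens (an assumed, simplistic classifier)."""
    tokens = TOKEN.findall(text)

    def score(name):
        words = CLASSIFICATIONS[name]["salient_dictionary"]
        return sum(1 for t in tokens if closest(t, words))

    return max(CLASSIFICATIONS, key=score)


def extract(text, classification):
    """Attempt to extract each predefined index; return those that were found."""
    found = {}
    for name, pattern in CLASSIFICATIONS[classification]["indices"].items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def correct(text, classification):
    """Replace tokens that nearly match a salient word with that word."""
    words = CLASSIFICATIONS[classification]["salient_dictionary"]
    return TOKEN.sub(lambda m: closest(m.group(0), words) or m.group(0), text)


def index_document(text):
    """Classify, attempt extraction, and retry after correction on failure."""
    classification = classify(text)
    wanted = set(CLASSIFICATIONS[classification]["indices"])
    found = extract(text, classification)
    if set(found) != wanted:  # extraction failed: correct the text and retry
        found = extract(correct(text, classification), classification)
    return classification, found


# Simulated recognition errors: "lnvoice", "Nurnber", "Tota1".
scanned = "lnvoice Nurnber: A123\nTota1: 99.50"
result = index_document(scanned)
```

Correction is attempted only after the first extraction pass fails, which mirrors the order of steps in the abstract; a document that recognizes cleanly never goes through the salient dictionary lookup.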
30 Claims
1. A method for index extraction, comprising the steps of:
storing a plurality of ground truth documents in a database, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
classifying a document by drawing an association between the document to be indexed and one of the classifications;
attempting to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
attempting to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A program embodied in a computer-readable medium for index extraction, comprising:
a database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
at least one indexing entity that attempts to extract from a document to be indexed at least a subset of the group of predefined indices associated with one of the classifications; and
a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An apparatus for index extraction, comprising:
a processor circuit having a processor and a memory;
a database stored in the memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
a document to be indexed stored in the memory, the document being associated with one of the classifications; and
an automated document indexing system stored in the memory and executable by the processor, the automated document indexing system comprising:
at least one indexing entity that attempts to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
28. An apparatus for index extraction, comprising:
a database stored in a memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
a document to be indexed stored in the memory, the document being associated with one of the classifications;
means for attempting to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
means for attempting to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices.
Dependent claims: 29, 30
Specification