Index extraction from documents
Abstract
Systems, methods, and programs embodied in a computer readable medium are provided for index extraction. Stored in a database are ground truth documents that are organized according to a plurality of classifications, each classification having a group of predefined indices. A document to be indexed is classified by drawing an association between the document and one of the classifications. An attempt is made to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications. Upon a failure to extract the subset of the group of predefined indices, attempts are made to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications.
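The flow described in the abstract (classify, attempt extraction, and fall back to dictionary-based correction only on failure) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the token-overlap classifier, the regex-based extraction, the sample data, and every name (`GROUND_TRUTH`, `INDICES`, `SALIENT`, `classify`, `extract_indices`, `correct_errors`, `index_document`) are hypothetical.

```python
import re

# Hypothetical ground truth documents, grouped by classification.
GROUND_TRUTH = {
    "invoice": ["Invoice No. 1001 Due 2020-01-01 Total: $10.00"],
    "receipt": ["Receipt No. 77 Paid: $5.00"],
}

# Hypothetical predefined indices per classification:
# index name -> pattern whose first group captures the index value.
INDICES = {
    "invoice": {
        "invoice_number": re.compile(r"Invoice\s+No\.\s*(\w+)"),
        "total": re.compile(r"Total:\s*\$([\d.]+)"),
    },
    "receipt": {
        "receipt_number": re.compile(r"Receipt\s+No\.\s*(\w+)"),
        "paid": re.compile(r"Paid:\s*\$([\d.]+)"),
    },
}

# Hypothetical salient dictionary per classification:
# anticipated misspelling -> corrected term.
SALIENT = {
    "invoice": {"lnvoice": "Invoice", "Tota1": "Total"},
    "receipt": {"Rece1pt": "Receipt", "Pald": "Paid"},
}


def tokens(text):
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def classify(text):
    """Assumed classifier: pick the classification whose ground truth
    documents share the most tokens with the document."""
    def score(cls):
        return max(len(tokens(text) & tokens(doc)) for doc in GROUND_TRUTH[cls])
    return max(GROUND_TRUTH, key=score)


def extract_indices(text, cls):
    """Attempt to extract each predefined index of the classification."""
    found = {}
    for name, pattern in INDICES[cls].items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def correct_errors(text, cls):
    """Replace anticipated misspellings using the classification's dictionary."""
    for wrong, right in SALIENT[cls].items():
        text = text.replace(wrong, right)
    return text


def index_document(text):
    cls = classify(text)
    found = extract_indices(text, cls)
    if len(found) < len(INDICES[cls]):  # some index was not extracted
        text = correct_errors(text, cls)
        found = extract_indices(text, cls)  # retry on the corrected text
    return cls, found
```

In this sketch the corrective step runs only when the first extraction pass misses an index, which mirrors the "upon a failure to extract" condition in the abstract.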
30 Claims
1. A method for index extraction, comprising the steps of:
storing a plurality of ground truth documents in a database, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
classifying a document by drawing an association in a computer system between the document to be indexed and one of the classifications;
attempting in the computer system to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
attempting in the computer system to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9)
10. A non-transitory computer-readable medium embodying a program for index extraction, comprising:
a database including a plurality of ground truth documents stored on the computer-readable medium, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
code stored on the computer-readable medium that, when executed by a computer system, provides at least one indexing entity that attempts to extract from a document to be indexed at least a subset of the group of predefined indices associated with one of the classifications; and
code stored on the computer-readable medium that, when executed by a computer system, provides a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document so that they may be corrected.
(Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18)
19. An apparatus for index extraction, comprising:
a processor circuit having a processor and a memory;
a database stored in the memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
a document to be indexed stored in the memory, the document being associated with one of the classifications; and
an automated document indexing system stored in the memory and executable by the processor, the automated document indexing system comprising;
at least one indexing entity that attempts to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
(Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27)
28. An apparatus for index extraction, comprising:
a database stored in a memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
a document to be indexed stored in the memory, the document being associated with one of the classifications;
means for attempting to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
means for attempting to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
(Dependent claims: 29, 30)
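Each independent claim ends with the same limitation: anticipated misspellings are stored per classification, and the document is searched only for misspellings of indices that have not yet been extracted. One possible reading of that limitation is sketched below. The dictionary layout, the sample misspellings, and the function name `correct_missing_indices` are assumptions made for illustration and are not taken from the specification.

```python
# Hypothetical salient dictionary: classification -> index label ->
# anticipated text recognition misspellings of that label.
SALIENT_DICTIONARY = {
    "invoice": {
        "Invoice": ["lnvoice", "Invo1ce", "lnvo1ce"],
        "Total": ["Tota1", "T0tal", "Tolal"],
    },
}


def correct_missing_indices(text, classification, missing):
    """Search the text only for anticipated misspellings of the index
    labels listed in `missing` and replace each hit with the label.

    Returns the corrected text and the list of (misspelling, label)
    corrections that were applied.
    """
    corrections = []
    entries = SALIENT_DICTIONARY.get(classification, {})
    for label in missing:
        for misspelling in entries.get(label, []):
            if misspelling in text:
                text = text.replace(misspelling, label)
                corrections.append((misspelling, label))
    return text, corrections
```

Restricting the search to the labels in `missing` leaves text that relates to already extracted indices untouched, so a correction cannot disturb an extraction that has already succeeded.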
Specification