Index extraction from documents

US 8,805,803 B2
Filed: 08/12/2004
Issued: 08/12/2014
Est. Priority Date: 08/12/2004
Status: Expired due to Fees

Abstract

Systems, methods, and programs embodied in a computer readable medium are provided for index extraction. Stored in a database are ground truth documents that are organized according to a plurality of classifications, each classification having a group of predefined indices. A document to be indexed is classified by drawing an association between the document and one of the classifications. An attempt is made to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications. Upon a failure to extract the subset of the group of predefined indices, attempts are made to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications.
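
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Classification` class, the function names, and the choice to model each predefined index as a literal term that must appear in the text are all assumptions made for the example. The salient dictionary is modeled here as a map from anticipated misspelling to correct spelling.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Classification:
    """One document class, its predefined indices, and its salient dictionary."""
    name: str
    # Assumption: index name -> term that must appear in the document text.
    predefined_indices: Dict[str, str]
    # Assumption: anticipated misspelling -> correct spelling of an index term.
    salient_dictionary: Dict[str, str] = field(default_factory=dict)


def extract_indices(text: str, cls: Classification) -> Dict[str, str]:
    """Attempt extraction: keep each predefined index whose term occurs in the text."""
    return {name: term for name, term in cls.predefined_indices.items() if term in text}


def correct_text(text: str, cls: Classification, missing: Set[str]) -> str:
    """Search for anticipated misspellings of the indices not yet extracted and fix them."""
    wanted = {cls.predefined_indices[name] for name in missing}
    for wrong, right in cls.salient_dictionary.items():
        if right in wanted:
            text = text.replace(wrong, right)
    return text


def index_document(text: str, cls: Classification,
                   required: Set[str]) -> Tuple[Dict[str, str], str]:
    """Extract, and on failure correct recognition errors and extract again."""
    found = extract_indices(text, cls)
    missing = required - set(found)
    if missing:
        text = correct_text(text, cls, missing)
        found = extract_indices(text, cls)
    return found, text


# Hypothetical example data: an OCR pass has read "Invoice" as "lnvoice".
invoice = Classification(
    name="invoice",
    predefined_indices={"number": "Invoice Number", "total": "Total Due"},
    salient_dictionary={"lnvoice Number": "Invoice Number", "Tota1 Due": "Total Due"},
)
found, fixed = index_document(
    "lnvoice Number: 1234\nTotal Due: $50", invoice, {"number", "total"}
)
print(found)
```

Only misspellings of indices that are still missing are corrected, which mirrors the wording of claim 1; the second call to `extract_indices` corresponds to the subsequent attempt recited in claim 2.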

37 Citations

30 Claims

1. A method for index extraction, comprising the steps of:
- storing a plurality of ground truth documents in a database, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
  
  classifying a document by drawing an association in a computer system between the document to be indexed and one of the classifications;
  
  attempting in the computer system to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
  
  attempting in the computer system to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
  - 2. The method of claim 1, further comprising the step of making a subsequent attempt in the computer system to extract from the document at least the subset of the group of predefined indices after correcting at least one text recognition error in the document.
  - 3. The method of claim 1, further comprising the step of reclassifying in the computer system the document upon a failure to extract at least the subset of the group of predefined indices from the document.
  - 4. The method of claim 3, wherein the step of reclassifying the document further comprises the steps of:
    - organizing a plurality of terms in the document into a number of language groups, each of the language groups being defined by a characteristic of a native language of the document, wherein at least one of the terms is considered to be a putative index;
      
      searching indices associated with the ground truth documents for occurrences of the putative index; and
      
      reclassifying the document as belonging to a classification that includes at least one ground truth document having at least one occurrence of the putative index.
  - 5. The method of claim 4, wherein at least one of the language groups is defined by capitalization of the terms.
  - 6. The method of claim 4, wherein at least one of the language groups is defined by whether a term appears in a native language dictionary.
  - 7. The method of claim 3, wherein the step of reclassifying the document further comprises the steps of:
    - determining a relative frequency of each of the terms in the document;
      
      identifying at least one putative index of the document based in part upon the relative frequency of each of the terms in the document;
      
      searching indices associated with the ground truth documents for occurrences of the at least one putative index; and
      
      reclassifying the document as belonging to a classification that includes at least one ground truth document having at least one occurrence of the at least one putative index.
  - 8. The method of claim 7, wherein the step of identifying the at least one putative index of the document further comprises:
    - calculating a metric for each of the terms in the document as a function of a relative frequency of each term in the document multiplied, respectively, by an inverse of a generalized relative frequency of each term in a native language; and
      
      selecting at least the term in the document that has the highest one of the metrics calculated as the at least one putative index.
  - 9. The method of claim 3, wherein the step of reclassifying the document further comprises the steps of:
    - comparing a structure of the document with a structure of each one of a plurality of the ground truth documents; and
      
      reclassifying the document as belonging to a classification that includes at least one ground truth document having a structure that substantially matches the structure of the document.
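
The metric recited in claim 8 is a term's relative frequency in the document multiplied by the inverse of its general relative frequency in the native language, i.e. metric(t) = f_doc(t) × 1 / f_lang(t). A minimal sketch follows. The function name, the `language_freq` table, and the `floor` value used for terms missing from that table are assumptions for illustration; the claim does not say how unseen terms are handled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def putative_indices(terms: Iterable[str], language_freq: Dict[str, float],
                     floor: float = 1e-6, top_n: int = 1) -> Tuple[List[str], Dict[str, float]]:
    """Score each term and return the top_n highest-scoring terms plus all scores."""
    counts = Counter(t.lower() for t in terms)
    total = sum(counts.values())
    scores = {}
    for term, n in counts.items():
        doc_rf = n / total                          # relative frequency in the document
        lang_rf = language_freq.get(term, floor)    # assumed floor for unknown terms
        scores[term] = doc_rf * (1.0 / lang_rf)     # metric of claim 8
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores


# Hypothetical background frequencies for a handful of English words.
language_freq = {"the": 0.05, "for": 0.01, "invoice": 0.0001, "widget": 0.0001}
top, scores = putative_indices("the invoice for the widget invoice".split(), language_freq)
print(top)
```

Common words such as "the" score low even when they are frequent in the document, while terms that are rare in the language but repeated in the document rise to the top and become putative indices.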

10. A non-transitory computer-readable medium embodying a program for index extraction, comprising:
- a database including a plurality of ground truth documents stored on the computer-readable medium, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
  
  code stored on the computer-readable medium that, when executed by a computer system, provides at least one indexing entity that attempts to extract from a document to be indexed at least a subset of the group of predefined indices associated with one of the classifications; and
  
  code stored on the computer-readable medium that, when executed by a computer system, provides a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document so that they may be corrected.
  - 11. The non-transitory computer-readable medium of claim 10, further comprising code that makes a subsequent attempt to extract from the document at least the subset of the group of predefined indices after correcting at least one text recognition error in the document.
  - 12. The non-transitory computer-readable medium of claim 10, wherein the document is classified by association with the one of the classifications, the program further comprising a reclassification engine that reclassifies the document upon a failure to extract at least the subset of the group of predefined indices from the document.
  - 13. The non-transitory computer-readable medium of claim 12, wherein the reclassification engine further comprises:
    - code that organizes a plurality of terms in the document into a number of language groups, each of the language groups being defined by a characteristic of a native language of the document, wherein at least one of the terms is considered to be a putative index;
      
      code that searches indices associated with the ground truth documents for occurrences of the putative index; and
      
      code that reclassifies the document as belonging to a classification that includes at least one ground truth document having at least one occurrence of the putative index.
  - 14. The non-transitory computer-readable medium of claim 13, wherein at least one of the language groups is defined by capitalization of the terms.
  - 15. The non-transitory computer-readable medium of claim 13, wherein at least one of the language groups is defined by whether a term appears in a native language dictionary.
  - 16. The non-transitory computer-readable medium of claim 12, wherein the reclassification engine further comprises:
    - code that determines a relative frequency of each of the terms in the document;
      
      code that identifies at least one putative index of the document based in part upon the relative frequency of each of the terms in the document;
      
      code that searches indices associated with the ground truth documents for occurrences of the at least one putative index; and
      
      code that reclassifies the document as belonging to a classification that includes at least one ground truth document having at least one occurrence of the at least one putative index.
  - 17. The non-transitory computer-readable medium of claim 16, wherein the code that identifies the at least one putative index of the document further comprises:
    - code that calculates a metric for each of the terms in the document as a function of a relative frequency of each term in the document multiplied, respectively, by an inverse of a generalized relative frequency of each term in a native language; and
      
      code that selects at least the term in the document that has the highest one of the metrics calculated as the at least one putative index.
  - 18. The non-transitory computer-readable medium of claim 12, wherein the reclassification engine further comprises:
    - code that compares a structure of the document with a structure of each one of a plurality of the ground truth documents; and
      
      code that reclassifies the document as belonging to a classification that includes at least one ground truth document having a structure that substantially matches the structure of the document.
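
The language-group reclassification of claims 13 to 15 might look something like the sketch below. It groups terms by capitalization and by membership in a native language dictionary, then searches the indices of the ground truth documents for a putative index. The rule used to pick putative indices (capitalized terms that are not dictionary words) and the data layout of `ground_truth` are assumptions for the example; the claims only require that at least one term be treated as a putative index.

```python
from typing import Dict, Iterable, List, Optional, Set


def group_terms(terms: Iterable[str], native_dictionary: Set[str]) -> Dict[str, List[str]]:
    """Organize terms into language groups by capitalization and dictionary membership."""
    groups = {"capitalized": [], "in_dictionary": [], "out_of_dictionary": []}
    for term in terms:
        if term[:1].isupper():
            groups["capitalized"].append(term)
        if term.lower() in native_dictionary:
            groups["in_dictionary"].append(term)
        else:
            groups["out_of_dictionary"].append(term)
    return groups


def candidate_indices(groups: Dict[str, List[str]]) -> List[str]:
    """Assumed rule: capitalized terms that are not ordinary dictionary words."""
    in_dict = set(groups["in_dictionary"])
    return [t for t in groups["capitalized"] if t not in in_dict]


def reclassify(candidates: Iterable[str],
               ground_truth: Dict[str, List[Set[str]]]) -> Optional[str]:
    """Return the first classification with a ground truth document containing a candidate.

    ground_truth maps a classification name to one set of indices per ground truth document.
    """
    for candidate in candidates:
        for name, documents in ground_truth.items():
            if any(candidate in indices for indices in documents):
                return name
    return None


# Hypothetical data for illustration.
native_dictionary = {"invoice", "from", "widgets"}
ground_truth = {"purchase_order": [{"Globex"}], "invoice": [{"Acme", "Net 30"}]}
groups = group_terms(["Invoice", "from", "Acme", "Widgets"], native_dictionary)
print(reclassify(candidate_indices(groups), ground_truth))
```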

19. An apparatus for index extraction, comprising:
- a processor circuit having a processor and a memory;
  
  a database stored in the memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
  
  a document to be indexed stored in the memory, the document being associated with one of the classifications; and
  
  an automated document indexing system stored in the memory and executable by the processor, the automated document indexing system comprising:
  
  at least one indexing entity that attempts to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
  
  a corrective engine that attempts to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
  - 20. The apparatus of claim 19, wherein the automated document indexing system further comprises logic that makes a subsequent attempt to extract from the document at least the subset of the group of predefined indices after correcting at least one text recognition error in the document.
  - 21. The apparatus of claim 19, wherein the automated document indexing system further comprises a reclassification engine that reclassifies the document upon a failure to extract at least the subset of the group of predefined indices from the document.
  - 22. The apparatus of claim 21, wherein the reclassification engine further comprises:
    - logic that organizes a plurality of terms in the document into a number of language groups, each of the language groups being defined by a characteristic of a native language of the document, wherein at least one of the terms is considered to be a putative index;
      
      logic that searches indices associated with the ground truth documents for occurrences of the putative index; and
      
      logic that reclassifies the document as belonging to a classification that includes at least one ground truth document having the occurrence of the putative index.
  - 23. The apparatus of claim 22, wherein at least one of the language groups is defined by capitalization of the terms.
  - 24. The apparatus of claim 22, wherein at least one of the language groups is defined by whether a term appears in a native language dictionary.
  - 25. The apparatus of claim 21, wherein the reclassification engine further comprises:
    - logic that determines a relative frequency of each of the terms in the document;
      
      logic that identifies at least one putative index of the document based in part upon the relative frequency of each of the terms in the document;
      
      logic that searches indices associated with the ground truth documents for occurrences of the at least one putative index; and
      
      logic that reclassifies the document as belonging to a classification that includes at least one ground truth document having at least one occurrence of the at least one putative index.
  - 26. The apparatus of claim 25, wherein the logic that identifies the at least one putative index of the document further comprises:
    - logic that calculates a metric for each of the terms in the document as a function of a relative frequency of each term in the document multiplied, respectively, by an inverse of a generalized relative frequency of each term in a native language; and
      
      logic that selects at least the term in the document that has the highest one of the metrics calculated as the at least one putative index.
  - 27. The apparatus of claim 21, wherein the reclassification engine further comprises:
    - logic that compares a structure of the document with a structure of each one of a plurality of the ground truth documents; and
      
      logic that reclassifies the document as belonging to a classification that includes at least one ground truth document having a structure that substantially matches the structure of the document.
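
Claim 27 does not say how document structure is represented or what counts as a substantial match. As one possible reading, the sketch below assumes a structure is a sequence of block labels (for example "header", "table") and treats a similarity ratio at or above a chosen threshold as a substantial match. The representation, the use of Python's `difflib.SequenceMatcher`, the function names, and the 0.8 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def structure_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two sequences of block labels."""
    return SequenceMatcher(None, a, b).ratio()


def reclassify_by_structure(doc_structure: List[str],
                            ground_truth: Dict[str, List[List[str]]],
                            threshold: float = 0.8) -> Optional[str]:
    """Return the classification whose ground truth document best matches the structure.

    Returns None when no ground truth document reaches the assumed threshold.
    """
    best_name, best_score = None, 0.0
    for name, structures in ground_truth.items():
        for structure in structures:
            score = structure_similarity(doc_structure, structure)
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical ground truth structures, one list of block labels per document.
ground_truth = {
    "invoice": [["header", "address", "table", "total"]],
    "letter": [["header", "paragraph", "paragraph", "signature"]],
}
print(reclassify_by_structure(["header", "address", "table", "total"], ground_truth))
```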

28. An apparatus for index extraction, comprising:
- a database stored in a memory, the database including a plurality of ground truth documents, the documents being organized according to a plurality of classifications, each classification having a group of predefined indices;
  
  a document to be indexed stored in the memory, the document being associated with one of the classifications;
  
  means for attempting to extract from the document at least a subset of the group of predefined indices associated with the one of the classifications; and
  
  means for attempting to find and correct at least one text recognition error in the document based upon a salient dictionary associated with the one of the classifications upon a failure to extract the subset of the group of predefined indices, wherein anticipated misspellings associated with each of the classifications are stored in the salient dictionary and the document is searched for anticipated misspellings of predefined indices that have not been extracted from the document.
  - 29. The apparatus of claim 28, further comprising means for making a subsequent attempt to extract from the document at least the subset of the group of predefined indices after correcting at least one text recognition error in the document.
  - 30. The apparatus of claim 28, further comprising means for reclassifying the document upon a failure to extract at least the subset of the group of predefined indices from the document.

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Simske, Steven J.; Wright, David W.
Primary Examiner(s): Badawi, Sherief
Assistant Examiner(s): Raab, Christopher J
Application Number: US10/916,877
Publication Number: US 20060036614A1
Time in Patent Office: 3,652 Days
Field of Search: 707/694, 707/696, 707/741, 715/257
US Class Current: 707/694
CPC Class Codes:
- G06F 16/31 Indexing; Data structures t...
- G06F 16/35 Clustering; Classification
