Document fingerprint

US 8,843,493 B1
Filed: 09/18/2012
Issued: 09/23/2014
Est. Priority Date: 09/18/2012
Status: Active Grant

Abstract

A method for comparing documents, including extracting, by a computer processor, a plurality of extracted elements from a first image of a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, extracting, by the computer processor, a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first image, comparing, by the computer processor, the first text feature and a second formatted document to generate a comparison result, and determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content.
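
The steps summarized in the abstract can be illustrated with a minimal sketch, assuming that the text fingerprints are word n-grams and that the match rate is the fraction of the first document's n-grams that also occur in the second. The function names, the default n-gram length of 3, and the threshold of 0.5 are hypothetical choices for illustration and are not taken from the patent.

```python
from typing import Set, Tuple

Fingerprint = Tuple[str, ...]


def word_ngrams(text: str, n: int = 3) -> Set[Fingerprint]:
    """Return the set of word n-grams (the text fingerprints) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def match_rate(first: Set[Fingerprint], second: Set[Fingerprint]) -> float:
    """Fraction of the first document's fingerprints also found in the second."""
    if not first:
        return 0.0
    return len(first & second) / len(first)


def contains_common_text(doc_a: str, doc_b: str,
                         n: int = 3, threshold: float = 0.5) -> bool:
    """Decide whether two documents share text content.

    The criterion (match rate strictly above a fixed threshold) is an
    assumed reading of "exceeding a pre-determined threshold".
    """
    rate = match_rate(word_ngrams(doc_a, n), word_ngrams(doc_b, n))
    return rate > threshold
```

For example, `contains_common_text("the quick brown fox jumps over the lazy dog", "a quick brown fox jumps over the lazy cat")` compares seven trigrams from the first text against the second and finds five of them, so the match rate is 5/7 and the documents are reported as sharing content.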

25 Claims

1. A method for comparing documents, comprising:
- extracting, by a computer processor, a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
  
  extracting, by the computer processor, a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
  
  comparing, by the computer processor, the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
  
  determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold.
  - 2. The method of claim 1, further comprising:
    - calculating a frequency of occurrence of the first plurality of text fingerprints in the second formatted document, wherein the comparison result is further based on the frequency of occurrence.
  - 3. The method of claim 1, further comprising:
    - extracting a second plurality of text fingerprints from the second formatted document to form the second text feature of the second formatted document; and
      
      generating, based at least on the second text feature, an inverted index data structure for a document library comprising the second formatted document, wherein the inverted index data structure comprises a tally of at least one of the second plurality of text fingerprints occurring in the second formatted document, and wherein the comparison result is generated using the inverted index data structure.
  - 4. The method of claim 1, wherein the plurality of extracted elements are extracted from a first image generated from at least one selected from a group consisting of a displayed copy and a printed copy of the first formatted document.
  - 5. The method of claim 4, wherein the first plurality of text fingerprints comprises a plurality of n-grams, the method further comprising:
    - identifying an error rate model of an optical character recognition (OCR) module used to extract the plurality of extracted elements from the first image; and
      
      determining a length of the n-gram based on the error rate model.
  - 6. The method of claim 1, wherein the first formatted document and the second formatted document contain the same text content.
  - 7. The method of claim 1, wherein the first formatted document comprises at least one selected from a group consisting of a subset and a superset of the second formatted document.
  - 8. The method of claim 1, wherein the plurality of extracted elements comprises a plurality of words, the method further comprising:
    - extracting a segment of consecutive words from the plurality of words, wherein the first plurality of text fingerprints is based at least on the segment.
  - 9. The method of claim 1, wherein the plurality of extracted elements comprises a plurality of word lengths, the method further comprising:
    - extracting a segment of consecutive word lengths from the plurality of word lengths, wherein the first plurality of text fingerprints is based at least on the segment.
  - 10. The method of claim 9, wherein at least one word length of the plurality of word lengths is normalized based on a line length where the at least one word length belongs.
  - 11. The method of claim 1, wherein the first and second text geometric features are extracted in response to determining that the text content match rate is less than the pre-determined threshold.
  - 12. The method of claim 1, further comprising:
    - dividing the first plurality of text fingerprints into a plurality of subsets, wherein comparing the first text feature and the second formatted document comprises a first comparison and a second comparison performed concurrently, the first comparison being between a first subset of the plurality of subsets and the second formatted document, the second comparison being between a second subset of the plurality of subsets and the second formatted document.
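
Claims 2 and 3 describe an inverted index over a document library that tallies fingerprint occurrences. A minimal sketch follows, assuming word n-gram fingerprints and a dictionary that maps each fingerprint to per-document counts. The names `build_inverted_index` and `library_match_rates`, and the layout of the index, are hypothetical; the claims do not prescribe a particular data layout.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Fingerprint = Tuple[str, ...]
InvertedIndex = Dict[Fingerprint, Dict[str, int]]


def ngram_list(text: str, n: int = 3) -> List[Fingerprint]:
    """Word n-grams in document order, repeats included."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_inverted_index(library: Dict[str, str], n: int = 3) -> InvertedIndex:
    """Map each fingerprint to {doc_id: tally of occurrences in that document}."""
    index: InvertedIndex = defaultdict(dict)
    for doc_id, text in library.items():
        for fingerprint, tally in Counter(ngram_list(text, n)).items():
            index[fingerprint][doc_id] = tally
    return index


def library_match_rates(query: str, index: InvertedIndex,
                        n: int = 3) -> Dict[str, float]:
    """Match rate of the query against every library document sharing a fingerprint."""
    fingerprints = set(ngram_list(query, n))
    if not fingerprints:
        return {}
    hits: Counter = Counter()
    for fingerprint in fingerprints:
        # .get avoids creating empty entries in the defaultdict
        for doc_id in index.get(fingerprint, {}):
            hits[doc_id] += 1
    return {doc_id: count / len(fingerprints) for doc_id, count in hits.items()}
```

With an index like this, one pass over the query's fingerprints scores the whole library, and documents that share no fingerprint with the query are never visited. The stored tallies are available if the comparison should also weigh frequency of occurrence, as claim 2 describes; this sketch only counts distinct matches.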

13. A system for comparing documents, comprising:
- a processor;
  
  a text analyzer executing on the processor and configured to:
  
  extract a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
  
  a fingerprint extractor executing on the processor and configured to:
  
  extract a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
  
  a comparison module executing on the processor and configured to:
  
  compare the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
  
  determine, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains a common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold; and
  
  a repository coupled to the processor and configured to store the first formatted document, the plurality of extracted elements, first text feature, and second text feature.
  - 14. The system of claim 13, the comparison module further configured to:
    - calculate a frequency of occurrence of the first plurality of text fingerprints in the second formatted document, wherein the comparison result is further based on the frequency of occurrence.
  - 15. The system of claim 14, the text analyzer further configured to:
    - extract a second plurality of text fingerprints from the second formatted document to form the second text feature of the second formatted document; and
      
      generate, based at least on the second text feature, an inverted index data structure for a document library comprising the second formatted document, wherein the inverted index data structure comprises a tally of at least one of the second plurality of text fingerprints occurring in the second formatted document, and wherein the comparison result is generated using the inverted index data structure.
  - 16. The system of claim 13, wherein the plurality of extracted elements are extracted from a first image generated from at least one selected from a group consisting of a displayed copy and a printed copy of the first formatted document.
  - 17. The system of claim 16, wherein the first plurality of text fingerprints comprises a plurality of n-grams, the fingerprint extractor further configured to:
    - identify an error rate model of an optical character recognition (OCR) module used to extract the plurality of extracted elements from the first image; and
      
      determine a length of the n-gram based on the error rate model.
  - 18. The system of claim 13, wherein the first formatted document and the second formatted document contain the same text content.
  - 19. The system of claim 13, wherein the first formatted document comprises at least one selected from a group consisting of a subset and a superset of the second formatted document.
  - 20. The system of claim 13, wherein the text analyzer comprises a text content analyzer configured to extract a plurality of words from the first formatted document as at least a portion of the plurality of extracted elements, wherein the fingerprint extractor comprises a text content fingerprint extractor configured to extract a segment of consecutive words from the plurality of words, and wherein the first plurality of text fingerprints is based at least on the segment.
  - 21. The system of claim 13, wherein the text analyzer comprises a text geometric analyzer configured to extract a plurality of word lengths from the first formatted document as at least a portion of the plurality of extracted elements, wherein the fingerprint extractor comprises a text geometric fingerprint extractor configured to extract a segment of consecutive word lengths from the plurality of word lengths, and wherein the first plurality of text fingerprints is based at least on the segment.
  - 22. The system of claim 21, wherein at least one word length of the plurality of word lengths is normalized based on a line length where the at least one word length belongs.
  - 23. The system of claim 13, wherein the first and second text geometric features are extracted in response to determining that the text content match rate is less than the pre-determined threshold.
  - 24. The system of claim 13, the comparison module further configured to:
    - divide the first plurality of text fingerprints into a plurality of subsets, wherein comparing the first text feature and the second formatted document comprises a first comparison and a second comparison performed concurrently, the first comparison being between a first subset of the plurality of subsets and the second formatted document, the second comparison being between a second subset of the plurality of subsets and the second formatted document.
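
The concurrent comparison in claims 12 and 24 can be sketched as follows, assuming the fingerprints are hashable tuples and that each subset is checked against the second document's fingerprint set in its own worker thread. The helper names and the round-robin split are hypothetical, and the thread pool is only one possible way to run the comparisons concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import FrozenSet, Iterable, List, Set, Tuple

Fingerprint = Tuple[str, ...]


def split_into_subsets(fingerprints: Iterable[Fingerprint],
                       parts: int) -> List[List[Fingerprint]]:
    """Divide the fingerprints into `parts` subsets (round-robin, sorted for determinism)."""
    items = sorted(fingerprints)
    return [items[i::parts] for i in range(parts)]


def count_matches(subset: List[Fingerprint],
                  second_doc: FrozenSet[Fingerprint]) -> int:
    """One comparison: how many fingerprints of this subset occur in the second document."""
    return sum(1 for fingerprint in subset if fingerprint in second_doc)


def concurrent_match_rate(first: Set[Fingerprint], second: Set[Fingerprint],
                          parts: int = 2) -> float:
    """Match rate computed from per-subset comparisons run concurrently."""
    if not first:
        return 0.0
    subsets = split_into_subsets(first, parts)
    target = frozenset(second)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        matched = sum(pool.map(count_matches, subsets, [target] * parts))
    return matched / len(first)
```

Because the subsets are disjoint, the per-subset match counts simply add up to the count a single sequential comparison would produce. In CPython, threads mainly help when the lookups involve I/O (for example, a remote index); a process pool would be the usual substitute for CPU-bound work.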

25. A non-transitory computer readable medium embodying instructions for comparing documents, the instructions when executed by a processor comprising functionality for:
- extracting a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
  
  extracting a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
  
  comparing the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
  
  determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold.
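
The text geometric feature recited in the independent claims is built from word length n-grams, and claims 10 and 22 add normalization by line length. The sketch below is one possible reading, assuming each word length is a measured width (for example, a bounding-box width in pixels), that normalization divides by the sum of the widths on the same line, and that the result is quantized into a fixed number of buckets. The bucket count of 20 and all function names are hypothetical.

```python
from typing import List, Set, Tuple

LengthGram = Tuple[int, ...]


def normalized_word_lengths(lines: List[List[float]],
                            buckets: int = 20) -> List[int]:
    """Quantized word widths, each relative to the total width of its line."""
    normalized: List[int] = []
    for widths in lines:
        line_length = sum(widths)
        if line_length <= 0:
            continue  # skip empty lines
        for width in widths:
            normalized.append(round(buckets * width / line_length))
    return normalized


def length_ngrams(lengths: List[int], n: int = 3) -> Set[LengthGram]:
    """Word length n-grams over the sequence of normalized lengths."""
    return {tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)}


def geometric_match_rate(lines_a: List[List[float]],
                         lines_b: List[List[float]], n: int = 3) -> float:
    """Fraction of the first document's length n-grams found in the second."""
    first = length_ngrams(normalized_word_lengths(lines_a), n)
    second = length_ngrams(normalized_word_lengths(lines_b), n)
    if not first:
        return 0.0
    return len(first & second) / len(first)
```

Under these assumptions, a page captured at twice the resolution produces the same feature: `[[30, 50, 20], [40, 60]]` and `[[60, 100, 40], [80, 120]]` both normalize to `[6, 10, 4, 8, 12]`. This is why a geometric feature can still match when the recognized characters themselves are unreliable.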

Current Assignee: The Boeing Co.
Original Assignee: Narus, Inc. (Gen Digital Inc.)
Inventors: Liao, Yong; Nucci, Antonio
Primary Examiner(s): Casanova, Jorge A
Application Number: US13/622,316
Time in Patent Office: 735 Days
Field of Search: 707/737, 707/739
US Class Current: 707/737
CPC Class Codes:
G06F 16/319   Inverted lists
G06F 16/90344   by using string matching te...
G06F 21/6209   to a single file or object,...
G06F 40/194   Calculation of difference b...
G06F 40/253   Grammatical analysis; Style...
