Document fingerprint

  • US 8,843,493 B1
  • Filed: 09/18/2012
  • Issued: 09/23/2014
  • Est. Priority Date: 09/18/2012
  • Status: Active Grant
First Claim

1. A method for comparing documents, comprising:

    extracting, by a computer processor, a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;

    extracting, by the computer processor, a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;

    comparing, by the computer processor, the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and

    determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold.
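
The four steps of the claim (element extraction, n-gram fingerprinting, comparison, threshold test) can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim does not specify: the input is taken to be text already extracted from a formatted document, words are split with a simple regular expression, the n-gram size is 3, the match rate is defined as shared n-grams divided by the size of the smaller fingerprint set, and the threshold is 0.5. All function names are hypothetical.

```python
import re


def extract_words(text):
    # Assumption: the text elements of the formatted document are already
    # available as a plain string; words are lowercased alphanumeric runs.
    return re.findall(r"\w+", text.lower())


def ngrams(seq, n):
    # Set of all length-n windows over the sequence (the "fingerprints").
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def text_features(text, n=3):
    # Content feature: word n-grams. Geometric feature: word-length n-grams.
    words = extract_words(text)
    lengths = [len(w) for w in words]
    return ngrams(words, n), ngrams(lengths, n)


def match_rate(a, b):
    # Assumed definition: shared fingerprints over the smaller set's size.
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def contains_common_text(doc_a, doc_b, n=3, threshold=0.5):
    # Assumed criterion: either match rate exceeds the (hypothetical) threshold.
    content_a, geom_a = text_features(doc_a, n)
    content_b, geom_b = text_features(doc_b, n)
    content_rate = match_rate(content_a, content_b)
    geometric_rate = match_rate(geom_a, geom_b)
    return content_rate > threshold or geometric_rate > threshold


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "a quick brown fox jumps over the lazy cat"
    print(contains_common_text(a, b))
```

Word-length n-grams depend only on the shape of the text, so in this sketch the geometric feature can still match when individual characters differ (for example after a lossy extraction step), whereas the content feature requires the words themselves to agree.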
