Document fingerprint
Abstract
A method for comparing documents, including extracting, by a computer processor, a plurality of extracted elements from a first image of a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, extracting, by the computer processor, a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first image, comparing, by the computer processor, the first text feature and a second formatted document to generate a comparison result, and determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content.
25 Claims
1. A method for comparing documents, comprising:
extracting, by a computer processor, a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
extracting, by the computer processor, a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
comparing, by the computer processor, the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
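The four steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the regular-expression tokenizer, the n-gram size `n=3`, the threshold `0.5`, and the definition of match rate as the shared fraction of the first document's n-grams are all assumptions made for illustration; the claim does not fix any of them. The sketch computes both optional features (word n-grams and word length n-grams) and treats either rate exceeding the threshold as meeting the criterion.

```python
from collections import Counter
import re


def extract_elements(text):
    """Step 1: return the word sequence and the matching word-length sequence.

    The tokenizer (lowercased runs of word characters) is an assumption.
    """
    words = re.findall(r"\w+", text.lower())
    return words, [len(w) for w in words]


def ngrams(sequence, n):
    """Step 2: return the multiset of contiguous n-grams (the fingerprints)."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


def match_rate(first, second):
    """Step 3: fraction of the first feature's n-grams also found in the second.

    This particular definition of "match rate" is an assumption.
    """
    total = sum(first.values())
    if total == 0:
        return 0.0
    shared = sum((first & second).values())  # multiset intersection
    return shared / total


def contains_common_text(text_a, text_b, n=3, threshold=0.5):
    """Step 4: decide whether either match rate exceeds the threshold."""
    words_a, lengths_a = extract_elements(text_a)
    words_b, lengths_b = extract_elements(text_b)
    # Text content feature: word n-grams.
    content_rate = match_rate(ngrams(words_a, n), ngrams(words_b, n))
    # Text geometric feature: word-length n-grams.
    geometric_rate = match_rate(ngrams(lengths_a, n), ngrams(lengths_b, n))
    matched = content_rate > threshold or geometric_rate > threshold
    return matched, content_rate, geometric_rate
```

Under these assumptions, the geometric feature depends only on word lengths, so it can still match where a word has been replaced by another of the same length (for example "dog" by "cat"), while the content feature cannot.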
13. A system for comparing documents, comprising:
a processor;
a text analyzer executing on the processor and configured to:
extract a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
a fingerprint extractor executing on the processor and configured to:
extract a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
a comparison module executing on the processor and configured to:
compare the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
determine, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains a common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold; and
a repository coupled to the processor and configured to store the first formatted document, the plurality of extracted elements, first text feature, and second text feature.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
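The component layout recited in claim 13 can be sketched as a small set of Python classes. This is only an illustration under stated assumptions: the class and method names, the in-memory dictionaries standing in for the repository, the set-based n-grams, and the defaults `n=3` and `threshold=0.5` are all hypothetical, and only the word n-gram (text content) feature is shown.

```python
from dataclasses import dataclass, field
import re


class TextAnalyzer:
    """Extracts the word sequence from a document's text (assumed tokenizer)."""

    def extract(self, text):
        return re.findall(r"\w+", text.lower())


class FingerprintExtractor:
    """Builds a text content feature as a set of word n-grams."""

    def __init__(self, n=3):
        self.n = n

    def extract(self, elements):
        n = self.n
        return {tuple(elements[i:i + n]) for i in range(len(elements) - n + 1)}


class ComparisonModule:
    """Computes a match rate and applies the threshold criterion."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def compare(self, first, second):
        # Assumed match rate: share of the first feature found in the second.
        return len(first & second) / len(first) if first else 0.0

    def determine(self, rate):
        return rate > self.threshold


@dataclass
class Repository:
    """In-memory stand-in for the repository coupled to the processor."""

    documents: dict = field(default_factory=dict)
    elements: dict = field(default_factory=dict)
    features: dict = field(default_factory=dict)


class DocumentComparisonSystem:
    def __init__(self, n=3, threshold=0.5):
        self.analyzer = TextAnalyzer()
        self.extractor = FingerprintExtractor(n)
        self.comparator = ComparisonModule(threshold)
        self.repository = Repository()

    def ingest(self, doc_id, text):
        # Store the document, its extracted elements, and its text feature.
        self.repository.documents[doc_id] = text
        elements = self.analyzer.extract(text)
        self.repository.elements[doc_id] = elements
        self.repository.features[doc_id] = self.extractor.extract(elements)

    def has_common_text(self, first_id, second_id):
        features = self.repository.features
        rate = self.comparator.compare(features[first_id], features[second_id])
        return self.comparator.determine(rate)
```

Storing the extracted elements and features alongside each document, as the claim's repository does, means a second document's feature can be reused across many comparisons instead of being recomputed.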
25. A non-transitory computer readable medium embodying instructions for comparing documents, the instructions when executed by a processor comprising functionality for:
extracting a plurality of extracted elements from a first formatted document, wherein each of the plurality of extracted elements corresponds to a text element of the first formatted document, wherein the plurality of extracted elements comprises at least one selected from a group consisting of a plurality of words and a plurality of word lengths;
extracting a first plurality of text fingerprints from a sequence of the plurality of extracted elements to form a first text feature of the first formatted document, wherein the first plurality of text fingerprints comprises at least one selected from a group consisting of a plurality of word n-grams and a plurality of word length n-grams, wherein the first text feature comprises at least one selected from a group consisting of a first text content feature based on the plurality of word n-grams and a first text geometric feature based on the plurality of word length n-grams;
comparing the first text feature and a second text feature of a second formatted document to generate a comparison result, wherein the second text feature comprises at least one selected from a group consisting of a second text content feature and a second text geometric feature, wherein the comparison result comprises at least one selected from a group consisting of a text content match rate between the first and second text content features and a text geometric match rate between the first and second text geometric features; and
determining, in response to the comparison result meeting a pre-determined criterion, that each of the first formatted document and the second formatted document contains common text content, wherein the comparison result meeting the pre-determined criterion is based on at least one selected from a group consisting of the text content match rate and the text geometric match rate exceeding a pre-determined threshold.
Specification