Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments
First Claim
Patent Images
1. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
- initially, selecting distinctive features contained in the collection of documents,then, for each document, identifying the distinctive features contained in the document, andthen, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents,wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection,wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents,wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments,wherein the distinctiveness score is the reciprocal of the number of documents containing the text fragment multiplied by a monotonic function of the number of words in the text fragment.
1 Assignment
0 Petitions
Accused Products
Abstract
Disclosed is a computer-assisted method for finding duplicate or near-duplicate documents or text spans within a document collection by using high-discriminability text fragments. Distinctive features of the documents or text spans are identified. For each pair of documents or text spans with at least one distinctive feature in common, the distinctive features of each document or text span are compared to determine whether the pair is duplicates or near-duplicates. An apparatus for performing this computer-assisted method is also disclosed.
-
Citations
34 Claims
-
1. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
-
initially, selecting distinctive features contained in the collection of documents, then, for each document, identifying the distinctive features contained in the document, and then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents, wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection, wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents, wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments, wherein the distinctiveness score is the reciprocal of the number of documents containing the text fragment multiplied by a monotonic function of the number of words in the text fragment. - View Dependent Claims (2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
-
-
4. The computer-assisted method according to claim to 1, wherein the distinctive features appear in a different order in each of the documents.
-
15. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
-
initially, selecting distinctive features contained in the collection of documents, then, for each document, identifying the distinctive features contained in the document, and then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents, wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection, wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents, further including the step of, for each pair of documents having at least one distinctive feature in common, counting the number of distinctive features in common, wherein determining whether the pair of documents is duplicates or near-duplicates includes the steps of; for each pair of documents, calculating an overlap ratio by dividing the number of distinctive features in common by the smaller of the number of distinctive features per document, and comparing the overlap ratio to a threshold and if the overlap ratio is greater than the threshold, then the pair of documents are duplicates or near-duplicates, otherwise the pair of documents is not duplicates or near-duplicates, building a document index that maps each document to its associated distinctive features, wherein if one distinctive feature is repeated within one document the index maps the document to the distinctive feature once, and building a feature index that maps each distinctive feature to its associated document, wherein if one distinctive feature is repeated within one document, the index maps the distinctive feature to the document once, wherein determining whether the pair of documents are duplicates or near-duplicates further includes the steps of; creating a list of unique distinctive features from the document index, for each unique distinctive feature, creating a list of documents which contain the unique distinctive feature, and for each document, creating a list of documents that have at least one feature in common with the document and the number of features in common with the document. - View Dependent Claims (16, 17, 18, 19, 20)
-
-
21. A computer-assisted method for identifying duplicate and near-duplicate documents in a large collection of documents, comprising the steps of:
-
initially, selecting distinctive features contained in the collection of documents, then, for each document, identifying the distinctive features contained in the document, and then, for each pair of documents having at least one distinctive feature in common, comparing the distinctive features of the documents to determine whether the documents are duplicate or near-duplicate documents, wherein the distinctive features are text fragments, which are sequences of at least two words that appear in a limited number of documents in the document collection, wherein the text fragments are determined to be distinctive features based upon a function of the frequency of a text fragment within a document in the large collection of documents, wherein for each sequence of at least two words, a distinctiveness score is calculated, and the highest scoring sequences that are found in at least two documents in the document collection are considered distinctive text fragments, wherein the distinctiveness score is the percentage of documents not containing the phrase multiplied by a monotonic function of the number of words in the text fragment. - View Dependent Claims (22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34)
-
-
25. The computer-assisted method according to claim to 21, wherein the distinctive features appear in a different order in each of the documents.
Specification