Phrase-based detection of duplicate documents in an information retrieval system
First Claim
1. A method of detecting a duplicate document, the method comprising:
selecting a first document and a second document from a set of documents;
comparing, by operation of a processor adapted to manipulate data within a computer system, a document description of the first document with a document description of the second document, wherein the document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold, the information gain of gj with respect to gk being a function of both actual and expected co-occurrence rates of gj and gk in the set of documents; and
responsive to the document description of the first document matching the document description of the second document, identifying the first document and the second document as duplicate documents in the set of documents.
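The related-phrase test in the claim can be illustrated with a short sketch. The claim only requires the information gain to be a function of the actual and expected co-occurrence rates; the ratio used below, the independence-based expected rate, the threshold value, and the name related_phrases are all assumptions made for illustration, not language from the claim.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold=1.5):
    """Return {(gj, gk): gain} for phrase pairs whose gain exceeds threshold.

    docs is a list of sets, each holding the phrases found in one document.
    ASSUMPTION: gain = actual co-occurrence rate / expected co-occurrence rate,
    where the expected rate treats the two phrases as independent.
    """
    n = len(docs)
    phrase_count = Counter()
    pair_count = Counter()
    for phrases in docs:
        phrase_count.update(phrases)
        # Sort so each unordered pair is always counted under the same key.
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (gj, gk), together in pair_count.items():
        actual = together / n
        expected = (phrase_count[gj] / n) * (phrase_count[gk] / n)
        gain = actual / expected
        if gain > threshold:
            related[(gj, gk)] = gain
    return related


docs = [
    {"australian shepherd", "border collie"},
    {"australian shepherd", "border collie"},
    {"tax return"},
    {"tax return", "australian shepherd"},
]
print(related_phrases(docs, threshold=1.2))
```

With this ratio, a pair that co-occurs more often than chance would predict scores above 1, so the threshold controls how strong the association must be before two phrases count as related.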
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
14 Claims
1. A method of detecting a duplicate document, the method comprising:
selecting a first document and a second document from a set of documents;
comparing, by operation of a processor adapted to manipulate data within a computer system, a document description of the first document with a document description of the second document, wherein the document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold, the information gain of gj with respect to gk being a function of both actual and expected co-occurrence rates of gj and gk in the set of documents; and
responsive to the document description of the first document matching the document description of the second document, identifying the first document and the second document as duplicate documents in the set of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
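The document description recited in claim 1 (a subset of sentences, selected and ordered by how many related phrases each contains) might be built along the following lines. This is a minimal sketch under stated assumptions: the regex sentence splitter, the substring phrase matching, the tie-break by original position, the max_sentences cutoff, and the name document_description are all hypothetical choices that the claim does not specify.

```python
import re


def document_description(text, related, max_sentences=3):
    """Return a tuple of sentences ranked by their count of related phrases.

    related is a set of lowercase phrase strings already known to be related.
    ASSUMPTION: sentences with no related phrase are dropped, and ties are
    broken by the sentence's original position in the document.
    """
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        count = sum(1 for phrase in related if phrase in lowered)
        if count > 0:
            scored.append((count, position, sentence))

    # Highest related-phrase count first; earlier sentences win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return tuple(sentence for _, _, sentence in scored[:max_sentences])


related = {"australian shepherd", "border collie", "herding dogs"}
first = ("Border collies herd sheep. The Australian shepherd and the border "
         "collie are herding dogs. Taxes are due in April.")
second = first  # an exact copy, for illustration

# Under this sketch, "matching" is taken to mean the descriptions are equal.
print(document_description(first, related) == document_description(second, related))
```

Because the description depends only on the ranked sentences, two documents that differ in unrelated material (for example navigation text) can still produce the same description.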
13. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
selecting a first document and a second document from a set of documents;
comparing a document description of the first document with a document description of the second document, wherein the document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold, the information gain of gj with respect to gk being a function of both actual and expected co-occurrence rates of gj and gk in the set of documents; and
responsive to the document description corresponding to the first document matching the document description corresponding to the second document, identifying the first document and the second document as duplicate documents in the set of documents.
14. A system for detecting a duplicate document, comprising:
a document description system, executed by a processor, and configured to associate a set of documents with a set of corresponding document descriptions and store the associations in a memory, wherein a document description of the first document comprises a selected subset of sentences of the first document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein the document description of the second document comprises a selected subset of sentences of the second document, the sentences being selected and ordered in the document description as a function of a number of related phrases in the selected sentences, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold, the information gain of gj with respect to gk being a function of both actual and expected co-occurrence rates of gj and gk in the set of documents; and
a duplicate detection system, executed by a processor and configured to:
select a first document and a second document from the document description system;
compare the document description corresponding to the first document with the document description corresponding to the second document; and
responsive to the document description corresponding to the first document matching the document description corresponding to the second document, identify the first document and the second document as duplicate documents in the set of documents.
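The two components of claim 14 (a store of document-to-description associations and a detector that compares stored descriptions) can be sketched as a single grouping pass. This is only an illustration: the claim describes a pairwise select-and-compare, while the version below groups documents by description, which yields the same duplicate sets when "matching" is taken to mean exact equality. That reading of "matching", the skipping of empty descriptions, and the name find_duplicates are assumptions.

```python
from collections import defaultdict


def find_duplicates(descriptions):
    """Group document ids whose stored descriptions are identical.

    descriptions maps a document id to its description, given as a tuple of
    sentences in ranked order (the stored association from the claim).
    ASSUMPTION: two descriptions match only if they are exactly equal,
    including sentence order; empty descriptions are never treated as matches.
    """
    groups = defaultdict(list)
    for doc_id, description in descriptions.items():
        if not description:
            continue
        # Tuples are hashable, so the description itself serves as the key.
        groups[description].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


descriptions = {
    "d1": ("Sentence A.", "Sentence B."),
    "d2": ("Sentence A.", "Sentence B."),
    "d3": ("Sentence B.", "Sentence A."),  # same sentences, different order
    "d4": (),
}
print(find_duplicates(descriptions))
```

Grouping by key avoids comparing every pair of documents, which matters when the set of documents is large.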
Specification