PHRASE-BASED DETECTION OF DUPLICATE DOCUMENTS IN AN INFORMATION RETRIEVAL SYSTEM
Abstract
An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
17 Claims
1. (canceled)
2. (canceled)
3. A method of detecting a duplicate document, the method comprising:
selecting a first document and a second document from a set of documents;
comparing a document description of the first document with a document description of the second document, wherein the document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
responsive to the document description of the first document matching the document description of the second document, discarding at least one of the first document or the second document from the set of documents.
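The steps of claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the phrase list `KNOWN_PHRASES`, the naive sentence splitter, the top-N sentence selection (`max_sentences`), and the function names are all hypothetical choices made for illustration, and "matching" is assumed here to mean exact equality of descriptions.

```python
import re

# Hypothetical phrase list; how phrases are actually identified is not shown here.
KNOWN_PHRASES = ("information retrieval", "duplicate document", "search results")


def split_sentences(text):
    """Naive splitter on sentence-ending punctuation (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def phrase_count(sentence, phrases=KNOWN_PHRASES):
    """Count occurrences of the known phrases in one sentence."""
    lowered = sentence.lower()
    return sum(lowered.count(p) for p in phrases)


def describe(text, max_sentences=2, phrases=KNOWN_PHRASES):
    """Build a document description: sentences ordered by phrase count
    (highest first), truncated to the top max_sentences (assumed selection rule)."""
    ranked = sorted(
        split_sentences(text),
        key=lambda s: phrase_count(s, phrases),
        reverse=True,
    )
    return tuple(ranked[:max_sentences])


def remove_duplicates(documents):
    """Keep the first document seen for each distinct description and
    discard later documents whose description matches an earlier one."""
    seen = set()
    kept = {}
    for doc_id, text in documents.items():
        description = describe(text)
        if description in seen:
            continue  # description matches an earlier document: discard
        seen.add(description)
        kept[doc_id] = text
    return kept


docs = {
    "a": "Search results improve. An information retrieval system finds "
         "a duplicate document. Hello there.",
    "b": "An information retrieval system finds a duplicate document. "
         "Footer text. Search results improve.",
    "c": "Something else entirely.",
}
unique_docs = remove_duplicates(docs)
```

In this example, documents "a" and "b" differ in sentence order and in low-phrase filler text, yet they produce the same description, so only the first is kept.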
16. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
selecting a first document and a second document from a set of documents;
comparing a document description of the first document with a document description of the second document, wherein the document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
responsive to the document description of the first document matching the document description of the second document, discarding at least one of the first document or the second document from the set of documents.
17. A system for detecting a duplicate document, comprising:
a document description system, executed by a processor, and configured to associate a set of documents with a set of corresponding document descriptions and store the associations in a memory, wherein the corresponding document description of each document comprises selected sentences of the document that are ordered in the document description as a function of a number of phrases in each sentence; and
a duplicate detection system, executed by a processor and configured to:
select a first document and a second document from the document description system;
compare the document description corresponding to the first document with the document description corresponding to the second document, and
responsive to the document description corresponding to the first document matching the document description corresponding to the second document, disassociate at least one of the first document or the second document from the set of documents.
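The two components recited in claim 17 could be arranged roughly as in the following Python sketch. The class and method names are hypothetical, the per-sentence phrase counts are assumed to be computed elsewhere and passed in, and exact equality stands in for "matching"; none of these details come from the claim itself.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentDescriptionSystem:
    """Holds the document -> description associations in memory."""
    descriptions: dict = field(default_factory=dict)

    def associate(self, doc_id, sentences, phrase_counts):
        # Order the selected sentences by their phrase counts, highest first.
        order = sorted(
            range(len(sentences)),
            key=lambda i: phrase_counts[i],
            reverse=True,
        )
        self.descriptions[doc_id] = tuple(sentences[i] for i in order)

    def disassociate(self, doc_id):
        # Remove the document from the set of associated documents.
        self.descriptions.pop(doc_id, None)


@dataclass
class DuplicateDetectionSystem:
    """Compares pairs of stored descriptions and drops duplicates."""
    store: DocumentDescriptionSystem

    def check_pair(self, first_id, second_id):
        known = self.store.descriptions
        if first_id not in known or second_id not in known:
            return False
        if known[first_id] == known[second_id]:
            # Assumed policy: keep the first document, drop the second.
            self.store.disassociate(second_id)
            return True
        return False


store = DocumentDescriptionSystem()
store.associate("d1", ["x", "y"], [1, 3])
store.associate("d2", ["y", "x"], [3, 1])
store.associate("d3", ["z"], [0])

detector = DuplicateDetectionSystem(store)
first_result = detector.check_pair("d1", "d2")
second_result = detector.check_pair("d1", "d3")
```

Keeping the association store separate from the comparison logic mirrors the claim's split between the document description system and the duplicate detection system. Which document of a matching pair is dropped is a policy choice that the claim leaves open.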
Specification