Phrase-based detection of duplicate documents in an information retrieval system
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
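The idea of one phrase predicting the presence of another can be illustrated with a short sketch. This excerpt does not define how the information gain is computed, so the code below assumes it is the ratio of the observed co-occurrence rate of two phrases to the rate expected if they occurred independently. The function name `related_phrases`, the input shape (one set of phrases per document), and the threshold value are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import permutations


def related_phrases(doc_phrases, threshold):
    """Map each phrase gk to the set of phrases gj related to it.

    doc_phrases: list of sets, one set of phrases per document.
    ASSUMPTION: information gain = observed co-occurrence rate divided by
    the rate expected under independence. The excerpt gives no formula.
    """
    n = len(doc_phrases)
    count = Counter()       # number of documents containing each phrase
    pair_count = Counter()  # number of documents containing both phrases
    for phrases in doc_phrases:
        count.update(phrases)
        pair_count.update(permutations(sorted(phrases), 2))

    related = {}
    for (gk, gj), both in pair_count.items():
        expected = (count[gk] / n) * (count[gj] / n)
        actual = both / n
        if actual / expected > threshold:
            related.setdefault(gk, set()).add(gj)
    return related


docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c", "a"}]
result = related_phrases(docs, threshold=1.2)
```

Under this particular measure the relation comes out symmetric; a directional measure would be needed if gj being related to gk should not imply the reverse.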
218 Citations
38 Claims
1. A method of detecting duplicate documents in search results, the method comprising:
- receiving a query comprising at least one phrase;
- retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
- for each of the retrieved documents, generating, by operation of a processor within a computer system, a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each selected sentence, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold;
- responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
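The description and discarding steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: sentences are split on end punctuation, a related phrase is detected by case-insensitive substring match, the description keeps the top three sentences, and two descriptions "match" only when they are exactly equal. The names `describe`, `drop_duplicates`, and `max_sentences` are hypothetical.

```python
import re


def describe(text, related, max_sentences=3):
    """Build a description: sentences ordered by related-phrase count.

    ASSUMPTIONS: naive sentence splitting, substring phrase matching,
    and a fixed number of selected sentences.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        lowered = sentence.lower()
        return sum(1 for phrase in related if phrase in lowered)

    # sorted() is stable, so sentences with equal scores keep document order.
    ranked = sorted(sentences, key=score, reverse=True)
    return tuple(ranked[:max_sentences])


def drop_duplicates(results, related):
    """Keep the first document per distinct description, discard the rest."""
    seen = set()
    kept = []
    for doc_id, text in results:
        description = describe(text, related)
        if description in seen:
            continue  # description matches an earlier document: discard
        seen.add(description)
        kept.append(doc_id)
    return kept


related = {"australian shepherd", "blue merle"}
doc_a = ("Welcome to my page. The Australian Shepherd is a herding dog. "
         "A blue merle Australian Shepherd has a mottled coat.")
doc_b = ("The Australian Shepherd is a herding dog. Welcome to my page. "
         "A blue merle Australian Shepherd has a mottled coat.")
doc_c = "Border collies are a different breed."
kept = drop_duplicates([("a", doc_a), ("b", doc_b), ("c", doc_c)], related)
```

Because the sentences are reordered by score before comparison, `doc_a` and `doc_b` yield the same description even though their sentences appear in a different order, so only the first of the two is kept.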
11. A method of detecting duplicate documents in search results, the method comprising:
- receiving a query comprising at least one phrase;
- retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
- for each of the retrieved documents, by operation of a processor within a computer system, retrieving a stored document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each sentence, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold;
- responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- receiving a query comprising at least one phrase;
- retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
- for each of the retrieved documents, generating, by operation of a processor within a computer system, a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each selected sentence, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold;
- responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28, 29
30. A tangible computer readable storage medium storing a computer program executable by a processor for detecting a duplicate document, the operations of the computer program comprising:
- receiving a query comprising at least one phrase;
- retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;
- for each of the retrieved documents, by operation of a processor within a computer system, retrieving a stored document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each sentence, wherein a phrase gj is a related phrase of another phrase gk occurring in the set of documents when an information gain of gj with respect to gk exceeds a predetermined threshold;
- responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
Dependent claims: 31, 32, 33, 34, 35, 36, 37, 38
Specification