Phrase-based detection of duplicate documents in an information retrieval system
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
213 Citations
58 Claims
1. A computer-implemented method of selecting documents from a document collection in response to a query, the method comprising:
- receiving a query;
- identifying, by at least one processor of a computing system, an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
- identifying, by at least one processor of the computing system, a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
- selecting, by at least one processor of the computing system, documents from the document collection, wherein the selected documents contain the phrase extension.
Dependent claims: 2, 3, 4, 5, 6, 7
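The prediction test and document selection recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that rates are per-document fractions, that the expected co-occurrence rate is the product of the two individual rates (an independence baseline), and that documents are lowercase with single spaces. The names `contains`, `rate`, `predicts`, `select_documents` and the `threshold` value are hypothetical.

```python
from typing import List

def contains(doc: str, phrase: str) -> bool:
    """True if `phrase` occurs in `doc` as a whole-word sequence.
    Assumes lowercase text with single spaces between words."""
    return f" {phrase} " in f" {doc} "

def rate(docs: List[str], *phrases: str) -> float:
    """Fraction of documents that contain every phrase in `phrases`."""
    hits = sum(1 for d in docs if all(contains(d, p) for p in phrases))
    return hits / len(docs)

def predicts(docs: List[str], incomplete: str, candidate: str,
             threshold: float = 1.5) -> bool:
    """True if `candidate` is a longer phrase starting with `incomplete`
    and their actual co-occurrence rate exceeds the expected rate.
    `threshold` is a hypothetical margin, not a value from the claims."""
    if not candidate.startswith(incomplete + " "):
        return False
    actual = rate(docs, incomplete, candidate)
    # Assumption: expected rate under independence of the two phrases.
    expected = rate(docs, incomplete) * rate(docs, candidate)
    return expected > 0 and actual > threshold * expected

def select_documents(docs: List[str], incomplete: str,
                     candidates: List[str], threshold: float = 1.5) -> List[str]:
    """Return documents containing at least one predicted phrase extension."""
    extensions = [c for c in candidates
                  if predicts(docs, incomplete, c, threshold)]
    return [d for d in docs if any(contains(d, e) for e in extensions)]

docs = [
    "the president of the united states spoke today",
    "the president of the company resigned",
    "president of the united states visits france",
    "stock markets fell today",
    "weather report for france",
    "the company hired a new manager",
]
selected = select_documents(docs, "president of",
                            ["president of the united states"])
```

In this toy collection, `selected` holds the two documents that contain "president of the united states".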
8. A computer-implemented method of selecting documents from a document collection in response to a query, the method comprising:
- receiving a query;
- identifying, by at least one processor of a computing system, an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
- identifying, by at least one processor of the computing system, a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
- suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
Dependent claims: 9, 10, 11, 12, 13, 14
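The suggestion step of claim 8 can be sketched in the same spirit. This is an illustration under stated assumptions, not the claimed system: candidate extensions come from a hypothetical list of known phrases, the expected co-occurrence rate is again the product of the individual per-document rates, and passing extensions are ordered by their actual co-occurrence rate. The function name `suggest_extensions` and the `threshold` value are hypothetical.

```python
from typing import List

def contains(doc: str, phrase: str) -> bool:
    """Whole-word phrase match; assumes lowercase, single-spaced text."""
    return f" {phrase} " in f" {doc} "

def rate(docs: List[str], *phrases: str) -> float:
    """Fraction of documents that contain every phrase in `phrases`."""
    hits = sum(1 for d in docs if all(contains(d, p) for p in phrases))
    return hits / len(docs)

def suggest_extensions(docs: List[str], incomplete: str,
                       known_phrases: List[str],
                       threshold: float = 1.5) -> List[str]:
    """Return phrase extensions of `incomplete` whose actual co-occurrence
    rate exceeds the expected rate, most frequent first."""
    scored = []
    for phrase in known_phrases:
        # An extension begins with the incomplete phrase and is longer.
        if not phrase.startswith(incomplete + " "):
            continue
        actual = rate(docs, incomplete, phrase)
        # Assumption: independence baseline for the expected rate.
        expected = rate(docs, incomplete) * rate(docs, phrase)
        if expected > 0 and actual > threshold * expected:
            scored.append((actual, phrase))
    # Order by descending actual rate, then alphabetically for stability.
    return [p for _, p in sorted(scored, key=lambda t: (-t[0], t[1]))]

docs = [
    "the president of the united states spoke today",
    "the president of the company resigned",
    "president of the united states visits france",
    "stock markets fell today",
    "weather report for france",
    "the company hired a new manager",
]
known_phrases = [
    "president of the united states",
    "president of the company",
    "united states",
]
suggestions = suggest_extensions(docs, "president of", known_phrases)
```

Here "united states" is skipped because it does not begin with the incomplete phrase, and the remaining two extensions are offered in order of how often they appear.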
15. A computer-implemented method of:
- receiving a query from a user;
- identifying a multiple word phrase in the query;
- identifying, by at least one processor of a computing system, a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
- selecting, by at least one processor of the computing system, documents from the document collection containing the phrase extension.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
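Claim 15 begins by identifying a multiple word phrase in the query. The claims do not say how this is done; the sketch below shows one plausible approach, a greedy longest match against a set of known phrases. The function `find_phrases`, the `max_len` window and the sample phrase set are all hypothetical.

```python
from typing import List, Set

def find_phrases(query: str, known_phrases: Set[str],
                 max_len: int = 5) -> List[str]:
    """Scan the query left to right and return the longest known
    multi-word phrase starting at each position, without overlaps."""
    words = query.lower().split()
    found = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest window first; stop at two words (multi-word only).
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_phrases:
                match = candidate
                break
        if match:
            found.append(match)
            i += len(match.split())  # skip past the matched phrase
        else:
            i += 1
    return found

known = {"president of", "president of the united states", "united states"}
long_query = find_phrases("Biography of the President of the United States", known)
short_query = find_phrases("president of france", known)
```

The longest-match rule means the first query yields the full phrase rather than its shorter prefix, while the second query yields only "president of", which could then be checked for phrase extensions.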
23. A computer-implemented method of:
- receiving a query from a user;
- identifying a multiple word phrase in the query;
- identifying, by at least one processor of a computing system, a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
- suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
Dependent claims: 24, 25, 26, 27, 28, 29
30. A system for selecting documents from a document collection in response to a query, the system comprising:
- one or more memory devices configured to store executable instructions; and
- one or more processors configured to execute the stored instructions to cause the system to:
- receive a query;
- identify an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
- identify a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
- select documents from the document collection, wherein the selected documents contain the phrase extension.
Dependent claims: 31, 32, 33, 34, 35, 36
37. A system for selecting documents from a document collection in response to a query, the system comprising:
- one or more memory devices configured to store executable instructions; and
- one or more processors configured to execute the stored instructions to cause the system to:
- receive a query;
- identify an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
- identify a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
- suggest the phrase extension to the user to use in the query.
Dependent claims: 38, 39, 40, 41, 42, 43
44. A system comprising:
- one or more memory devices configured to store executable instructions; and
- one or more processors configured to execute the stored instructions to cause the system to:
- receive a query from a user;
- identify a multiple word phrase in the query;
- identify a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
- select documents from the document collection containing the phrase extension.
Dependent claims: 45, 46, 47, 48, 49, 50, 51
52. A system comprising:
- one or more memory devices configured to store executable instructions; and
- one or more processors configured to execute the stored instructions to cause the system to:
- receive a query from a user;
- identify a multiple word phrase in the query;
- identify a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
- suggest the phrase extension to the user to use in the query.
Dependent claims: 53, 54, 55, 56, 57, 58
Specification