Phrase-based detection of duplicate documents in an information retrieval system

US 8,489,628 B2
Filed: 12/01/2011
Issued: 07/16/2013
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.

213 Citations

58 Claims

1. A computer-implemented method of selecting documents from a document collection in response to a query, the method comprising:
- receiving a query;
  
  identifying, by at least one processor of a computing system, an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
  
  identifying, by at least one processor of the computing system, a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
  
  selecting, by at least one processor of the computing system, documents from the document collection, wherein the selected documents contain the phrase extension.
  - 2. The method of claim 1, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.
  - 3. The method of claim 1, wherein identifying the incomplete phrase in the query includes:
    - identifying a candidate phrase in the query;
      
      comparing the candidate phrase to an incomplete phrase in a list of incomplete phrases that are present in documents of the document collection; and
      
      if the candidate phrase matches to the incomplete phrase, identifying the candidate phrase as an incomplete phrase and identifying a phrase extension associated with the incomplete phrase as the phrase extension of the incomplete phrase.
  - 4. The method of claim 1, further comprising suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
  - 5. The method of claim 4, wherein the suggested phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 6. The method of claim 4, further comprising receiving a selection from the user of the suggested phrase extension to use in the query.
  - 7. The method of claim 1, further comprising providing the selected documents to a user in response to the query.
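
The prediction test recited in claims 1 and 2 can be illustrated with a minimal sketch. The claims say only that the expected rate is "a function of" two document counts and that the actual rate is "a function of" a windowed co-occurrence count, so the formulas below are assumptions: the expected rate is taken as the product of the two document frequencies, and the actual rate as the co-occurrence count divided by the collection size. The function and parameter names are hypothetical.

```python
def expected_rate(docs_with_phrase, docs_with_extension, total_docs):
    """Assumed form: the rate expected if the phrase and its extension
    occurred independently of each other across the collection."""
    return (docs_with_phrase / total_docs) * (docs_with_extension / total_docs)


def actual_rate(cooccurrences, total_docs):
    """Assumed form: observed co-occurrences (extension found within a
    threshold number of words of the phrase), normalized by collection size."""
    return cooccurrences / total_docs


def predicts(docs_with_phrase, docs_with_extension, cooccurrences, total_docs):
    """True when the actual co-occurrence rate exceeds the expected rate."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return actual_rate(cooccurrences, total_docs) > expected_rate(
        docs_with_phrase, docs_with_extension, total_docs
    )
```

With 1000 documents, a phrase in 100 of them, an extension in 40, and 35 co-occurrences, the actual rate (0.035) exceeds the assumed expected rate (0.004), so under these assumptions the phrase predicts the extension.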

8. A computer-implemented method of selecting documents from a document collection in response to a query, the method comprising:
- receiving a query;
  
  identifying, by at least one processor of a computing system, an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
  
  identifying, by at least one processor of the computing system, a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
  
  suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
  - 9. The method of claim 8, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.
  - 10. The method of claim 8, wherein identifying the incomplete phrase in the query includes:
    - identifying a candidate phrase in the query;
      
      comparing the candidate phrase to an incomplete phrase in a list of incomplete phrases that are present in documents of the document collection; and
      
      if the candidate phrase matches to the incomplete phrase, identifying the candidate phrase as an incomplete phrase and identifying a phrase extension associated with the incomplete phrase as the phrase extension of the incomplete phrase.
  - 11. The method of claim 8, wherein the suggested phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 12. The method of claim 8, further comprising receiving a selection from the user of the suggested phrase extension to use in the query.
  - 13. The method of claim 8, further comprising selecting, by at least one processor of the computing system, documents from the document collection, wherein the selected documents contain the phrase extension.
  - 14. The method of claim 13, further comprising providing the selected documents to a user in response to the query.
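
Claims 5 and 11 select the suggested extension "based on information gains" but do not define the measure. The sketch below assumes information gain is the ratio of the actual to the expected co-occurrence rate and picks the candidate with the highest gain; both that definition and the names used are assumptions made for illustration.

```python
def information_gain(actual, expected):
    """Assumed definition: ratio of actual to expected co-occurrence rate."""
    if expected <= 0:
        raise ValueError("expected rate must be positive")
    return actual / expected


def best_extension(candidates):
    """Return the extension with the highest information gain.

    `candidates` maps each phrase extension to an
    (actual_rate, expected_rate) pair. Returns None when empty.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda ext: information_gain(*candidates[ext]))


# Hypothetical rates for two extensions of the phrase "president of".
suggestion = best_extension({
    "president of the united states": (0.03, 0.001),
    "president of the company": (0.01, 0.002),
})
```

Here the first candidate has the larger ratio, so it would be the one suggested to the user.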

15. A computer-implemented method of:
- receiving a query from a user;
  
  identifying a multiple word phrase in the query;
  
  identifying, by at least one processor of a computing system, a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
  
  selecting, by at least one processor of the computing system, documents from the document collection containing the phrase extension.
  - 16. The method of claim 15, wherein the identified phrase is an incomplete phrase, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase.
  - 17. The method of claim 15, further comprising suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
  - 18. The method of claim 17, wherein the phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 19. The method of claim 17, further comprising receiving a selection from the user of the suggested phrase extension to use in the query.
  - 20. The method of claim 15, wherein selecting the documents comprises:
    - combining a posting list of the identified phrase and a posting list of the phrase extension of the identified phrase to form a combined posting list; and
      
      selecting documents appearing in the combined posting list.
  - 21. The method of claim 15, further comprising providing the selected documents to a user in response to the query.
  - 22. The method of claim 15, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and of a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.
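
The posting-list step in claim 20 can be sketched as a merge of two sorted lists of document identifiers. The claim does not say whether "combining" means a union or an intersection; this sketch assumes a union, and assumes each posting list is sorted in ascending order with no repeated identifiers. The function name is hypothetical.

```python
def combine_posting_lists(phrase_postings, extension_postings):
    """Merge two ascending lists of document IDs into one ascending list,
    keeping a single copy of any ID that appears in both (assumed union)."""
    merged = []
    i = j = 0
    while i < len(phrase_postings) and j < len(extension_postings):
        a, b = phrase_postings[i], extension_postings[j]
        if a == b:
            merged.append(a)
            i += 1
            j += 1
        elif a < b:
            merged.append(a)
            i += 1
        else:
            merged.append(b)
            j += 1
    # At most one of the two lists still has unread entries.
    merged.extend(phrase_postings[i:])
    merged.extend(extension_postings[j:])
    return merged
```

Documents would then be selected from the merged list. If an intersection were intended instead, only the equal-ID branch would append to the result.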

23. A computer-implemented method of:
- receiving a query from a user;
  
  identifying a multiple word phrase in the query;
  
  identifying, by at least one processor of a computing system, a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
  
  suggesting, by at least one processor of the computing system, the phrase extension to the user to use in the query.
  - 24. The method of claim 23, wherein the identified phrase is an incomplete phrase, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase.
  - 25. The method of claim 23, wherein the phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 26. The method of claim 23, further comprising receiving a selection from the user of the suggested phrase extension to use in the query.
  - 27. The method of claim 23, further comprising selecting, by at least one processor of the computing system, documents from the document collection containing the phrase extension.
  - 28. The method of claim 27, further comprising providing the selected documents to a user in response to the query.
  - 29. The method of claim 23, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and of a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.

30. A system for selecting documents from a document collection in response to a query, the system comprising:
- one or more memory devices configured to store executable instructions; and
  
  one or more processors configured to execute the stored instructions to cause the system to:
  
  receive a query;
  
  identify an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
  
  identify a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
  
  select documents from the document collection, wherein the selected documents contain the phrase extension.
  - 31. The system of claim 30, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.
  - 32. The system of claim 30, wherein identifying the incomplete phrase in the query includes:
    - identifying a candidate phrase in the query;
      
      comparing the candidate phrase to an incomplete phrase in a list of incomplete phrases that are present in documents of the document collection; and
      
      if the candidate phrase matches to the incomplete phrase, identifying the candidate phrase as an incomplete phrase and identifying a phrase extension associated with the incomplete phrase as the phrase extension of the incomplete phrase.
  - 33. The system of claim 30, wherein the one or more processors are further configured to execute the stored instructions to cause the system to suggest the phrase extension to the user to use in the query.
  - 34. The system of claim 33, wherein the suggested phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 35. The system of claim 33, wherein the one or more processors are further configured to execute the stored instructions to cause the system to receive a selection from the user of the suggested phrase extension to use in the query.
  - 36. The system of claim 30, wherein the one or more processors are further configured to execute the stored instructions to cause the system to provide the selected documents to a user in response to the query.
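
The lookup recited in claim 32 (and in the parallel claims 3, 10, and 39) can be sketched as follows. The sketch assumes the list of incomplete phrases is held as a dictionary mapping each incomplete phrase to its associated phrase extensions, and that candidate phrases are contiguous word sequences of the query, tried longest first. The data layout, the `max_len` limit, and all names are assumptions, not details taken from the claims.

```python
def candidate_phrases(query, max_len=4):
    """Yield contiguous word sequences of the query, longest first."""
    words = query.lower().split()
    for n in range(min(max_len, len(words)), 0, -1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def find_incomplete_phrase(query, incomplete_phrases):
    """Return (incomplete_phrase, extensions) for the first candidate that
    matches an entry in `incomplete_phrases`, or None if nothing matches."""
    for candidate in candidate_phrases(query):
        if candidate in incomplete_phrases:
            return candidate, incomplete_phrases[candidate]
    return None


# Hypothetical table of incomplete phrases and their extensions.
table = {"president of": ["president of the united states"]}
match = find_incomplete_phrase("President of", table)
```

A match yields both the incomplete phrase and its stored extensions, which mirrors the final step of the claim; a query with no matching candidate returns None.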

37. A system for selecting documents from a document collection in response to a query, the system comprising:
- one or more memory devices configured to store executable instructions; and
  
  one or more processors configured to execute the stored instructions to cause the system to:
  
  receive a query;
  
  identify an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;
  
  identify a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and
  
  suggest the phrase extension to the user to use in the query.
  - 38. The system of claim 37, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.
  - 39. The system of claim 37, wherein identifying the incomplete phrase in the query includes:
    - identifying a candidate phrase in the query;
      
      comparing the candidate phrase to an incomplete phrase in a list of incomplete phrases that are present in documents of the document collection; and
      
      if the candidate phrase matches to the incomplete phrase, identifying the candidate phrase as an incomplete phrase and identifying a phrase extension associated with the incomplete phrase as the phrase extension of the incomplete phrase.
  - 40. The system of claim 37, wherein the suggested phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 41. The system of claim 37, wherein the one or more processors are further configured to execute the stored instructions to cause the system to receive a selection from the user of the suggested phrase extension to use in the query.
  - 42. The system of claim 37, wherein the one or more processors are further configured to execute the stored instructions to cause the system to select documents from the document collection, wherein the selected documents contain the phrase extension.
  - 43. The system of claim 42, wherein the one or more processors are further configured to execute the stored instructions to cause the system to provide the selected documents to a user in response to the query.

44. A system comprising:
- one or more memory devices configured to store executable instructions; and
  
  one or more processors configured to execute the stored instructions to cause the system to:
  
  receive a query from a user;
  
  identify a multiple word phrase in the query;
  
  identify a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
  
  select documents from the document collection containing the phrase extension.
  - 45. The system of claim 44, wherein the identified phrase is an incomplete phrase, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase.
  - 46. The system of claim 44, wherein the one or more processors are further configured to execute the stored instructions to cause the system to suggest the phrase extension to the user to use in the query.
  - 47. The system of claim 46, wherein the phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 48. The system of claim 46, wherein the one or more processors are further configured to execute the stored instructions to cause the system to receive a selection from the user of the suggested phrase extension to use in the query.
  - 49. The system of claim 44, wherein selecting the documents includes:
    - combining a posting list of the identified phrase and a posting list of the phrase extension of the identified phrase to form a combined posting list; and
      
      selecting documents appearing in the combined posting list.
  - 50. The system of claim 44, wherein the one or more processors are further configured to execute the stored instructions to cause the system to provide the selected documents to a user in response to the query.
  - 51. The system of claim 44, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and of a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.

52. A system comprising:
- one or more memory devices configured to store executable instructions; and
  
  one or more processors configured to execute the stored instructions to cause the system to:
  
  receive a query from a user;
  
  identify a multiple word phrase in the query;
  
  identify a phrase extension of the identified phrase, wherein the phrase extension of the identified phrase is a sequence of words that begins with the identified phrase but is longer than the identified phrase, and wherein the identified phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the identified phrase exceeding an expected co-occurrence rate of the phrase extension and the identified phrase in the document collection; and
  
  suggest the phrase extension to the user to use in the query.
  - 53. The system of claim 52, wherein the identified phrase is an incomplete phrase, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase.
  - 54. The system of claim 52, wherein the phrase extension is selected from a plurality of phrase extensions of the identified phrase, based on information gains of the plurality of phrase extensions given the identified phrase.
  - 55. The system of claim 52, wherein the one or more processors are further configured to execute the stored instructions to cause the system to receive a selection from the user of the suggested phrase extension to use in the query.
  - 56. The system of claim 52, wherein the one or more processors are further configured to execute the stored instructions to cause the system to select documents from the document collection containing the phrase extension.
  - 57. The system of claim 56, wherein the one or more processors are further configured to execute the stored instructions to cause the system to provide the selected documents to a user in response to the query.
  - 58. The system of claim 52, wherein the expected co-occurrence rate of the phrase extension is a function of a number of documents in the document collection that include the identified phrase and of a number of documents in the document collection that include the phrase extension, and the actual co-occurrence rate being a function of a number of times the phrase extension appears within a threshold number of words of the identified phrase in the document collection.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna L.
Primary Examiner(s): Vy, Hung T
Application Number: US13/309,273
Publication Number: US 20120310902A1
Time in Patent Office: 593 Days
Field of Search: 707/705, 707/758, 707/759, 707/763, 707/765, 707/767, 704/2
US Class Current: 707/767

CPC Class Codes:
G06F 16/2237   Vectors, bitmaps or matrices
G06F 16/243   Natural language query form...
G06F 16/24578   using ranking
G06F 16/3322   using system suggestions G0...
G06F 16/3344   using natural language anal...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06Q 10/10   Office automation; Time man...
