Phrase-based searching in an information retrieval system

US 9,569,505 B2
Filed: 05/15/2015
Issued: 02/14/2017
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.

236 Citations

19 Claims

1. A computer-implemented method of selecting documents in a document collection in response to a query, the method comprising:
- receiving a query including a first phrase and a second phrase;
  
  retrieving, by at least one processor of a computing system, a posting list of documents containing the first phrase;
  
  for each document in the posting list:
  
  accessing, by at least one processor of the computing system, a list of related phrases of the first phrase, wherein the list indicates whether a related phrase is present in the document, the first phrase predicting the occurrence of each of the related phrases in the document collection, wherein the first phrase predicts an occurrence of a related phrase based on a measure of an actual co-occurrence rate of the related phrase and the first phrase in the document collection exceeding an expected co-occurrence rate of the related phrase and the first phrase in the document collection;
  
  comparing, by at least one processor of the computing system, the second phrase to the list of related phrases that are present in the document; and
  
  when the comparison indicates that the second phrase is a related phrase of the first phrase that is present in the document, then selecting the document to include in a result to the query, without retrieving a posting list of documents containing the second phrase.
- - 2. The method of claim 1, further comprising:
    - when the comparison indicates that the second phrase is a related phrase of the first phrase that is not present in a document, then excluding the document from the result to the query, without retrieving a posting list of documents containing the second phrase.
  - 3. The method of claim 1, further comprising, when the comparison indicates that the second phrase is not a related phrase of the first phrase:
    - intersecting the posting list of documents containing the first phrase with a posting list of documents for the second phrase to select documents containing both the first phrase and the second phrase.
  - 4. The method of claim 1, further comprising:
    - storing the list of related phrases for a first phrase with respect to a document in a bit vector, wherein a bit of the bit vector is set for each related phrase of the first phrase that is present in the document, and a bit of the vector is unset for each related phrase of the first phrase that is not present in the document, wherein the bit vector has a numerical value; and
      
      scoring a selected document by determining an adjusted value of the bit vector according to the bits set for related phrases of the first phrase that are present in the document.
  - 5. The method of claim 1, further comprising:
    - determining the first phrase has a phrase extension;
      
      accessing a posting list for the phrase extension; and
      
      joining the posting list for the phrase extension with the posting list for the first phrase to generate a union posting list, wherein the accessing, comparing, and selecting is performed for each document in the union posting list.
  - 6. The method of claim 5, wherein the phrase extension is selected from a plurality of phrase extensions of the first phrase, based on information gains of the plurality of phrase extensions given the first phrase.
  - 7. The method of claim 5, wherein the first phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the first phrase exceeding an expected co-occurrence rate of the phrase extension and the first phrase in the document collection.
  - 8. The method of claim 1, further comprising providing the selected documents to a user in response to the query.
  - 9. The method of claim 1, where the expected co-occurrence rate of the related phrase and the first phrase is a function of a number of documents in the document collection that include the first phrase and a number of documents in the document collection that include the related phrase, and the actual co-occurrence rate being a function of a number of times the first phrase appears within a threshold number of words of the related phrase in the document collection.
  - 10. The method of claim 9, wherein the threshold is about 100.
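
As an illustration of claims 1 to 3, the following minimal Python sketch shows one way the selection step might be organized. It is not taken from the patent: the `Posting` and `PhraseIndex` classes, the `fetches` log, and the choice to store each document's related-phrase list as an integer bit mask (in the spirit of claim 4) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One entry in the posting list of a phrase (hypothetical layout)."""
    doc_id: int
    # Bit i is set when the i-th related phrase of the indexed phrase
    # is present in this document.
    related_bits: int = 0


@dataclass
class PhraseIndex:
    """Toy in-memory index; names and structure are assumptions."""
    postings: dict = field(default_factory=dict)  # phrase -> list[Posting]
    related: dict = field(default_factory=dict)   # phrase -> ordered related phrases
    fetches: list = field(default_factory=list)   # log of posting lists retrieved

    def posting_list(self, phrase):
        self.fetches.append(phrase)
        return self.postings.get(phrase, [])


def select_documents(index, first, second):
    """Return ids of documents matching both phrases of the query."""
    first_postings = index.posting_list(first)
    related = index.related.get(first, [])

    if second in related:
        # Claims 1 and 2: the second phrase is a related phrase of the first,
        # so each document is kept or excluded by testing one bit. The
        # posting list of the second phrase is never retrieved.
        bit = 1 << related.index(second)
        return [p.doc_id for p in first_postings if p.related_bits & bit]

    # Claim 3: the second phrase is not a related phrase, so fall back to
    # intersecting the two posting lists.
    second_ids = {p.doc_id for p in index.posting_list(second)}
    return [p.doc_id for p in first_postings if p.doc_id in second_ids]
```

In this sketch a query whose second phrase is related to the first is answered from the first phrase's posting list alone; only the unrelated case retrieves a second posting list.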

11. A system for selecting documents from a document collection in response to a query, the system comprising:
- one or more memory devices configured to store executable instructions; and
  
  one or more processors configured to execute the stored instructions to cause the system to:
  
  receive a query including a first phrase and a second phrase;
  
  retrieve a posting list of documents containing the first phrase;
  
  for each document in the posting list:
  
  access a list of related phrases of the first phrase, wherein the list indicates whether a related phrase is present in the document, the first phrase predicting the occurrence of each of the related phrases in the document collection based on a measure of an actual co-occurrence rate of the related phrase and the first phrase in the document collection exceeding an expected co-occurrence rate of the related phrase and the first phrase in the document collection;
  
  compare the second phrase to the list of related phrases that are present in the document; and
  
  when the comparison indicates that the second phrase is a related phrase of the first phrase that is present in the document, then select the document to include in a result to the query, without retrieving a posting list of documents containing the second phrase.
- - 12. The system of claim 11, wherein the one or more processors are further configured to execute the stored instructions to cause the system, when the comparison indicates that the second phrase is a related phrase of the first phrase but is not present in a document, to exclude the document from the result to the query, without retrieving a posting list of documents containing the second phrase.
  - 13. The system of claim 11, wherein the one or more processors are further configured to execute the stored instructions to cause the system, when the comparison indicates that the second phrase is not a related phrase of the first phrase, to intersect the posting list of documents containing the first phrase with a posting list of documents for the second phrase to select documents containing both the first phrase and the second phrase.
  - 14. The system of claim 11, wherein the one or more processors are further configured to execute the stored instructions to cause the system to:
    - store the list of related phrases for a first phrase with respect to a document in a bit vector, wherein a bit of the bit vector is set for each related phrase of the first phrase that is present in the document, and a bit of the vector is unset for each related phrase of the first phrase that is not present in the document, wherein the bit vector has a numerical value; and
      
      score a selected document by determining an adjusted value of the bit vector according to the bits set for related phrases of the first phrase that are present in the document.
  - 15. The system of claim 11, wherein the one or more processors are further configured to execute the stored instructions to cause the system to:
    - determine the first phrase has a phrase extension;
      
      access a posting list for the phrase extension; and
      
      join the posting list for the phrase extension with the posting list for the first phrase to generate a union posting list, wherein the accessing, comparing, and selecting is performed for each document in the union posting list.
  - 16. The system of claim 15, wherein the phrase extension is selected from a plurality of phrase extensions of the first phrase, based on information gains of the plurality of phrase extensions given the first phrase.
  - 17. The system of claim 15, wherein the first phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the first phrase exceeding an expected co-occurrence rate of the phrase extension and the first phrase in the document collection.
  - 18. The system of claim 11, wherein the one or more processors are further configured to execute the stored instructions to cause the system to provide the selected documents to a user in response to the query.
  - 19. The system of claim 11, where the expected co-occurrence rate of the related phrase and the first phrase is a function of a number of documents in the document collection that include the first phrase and a number of documents in the document collection that include the related phrase, and the actual co-occurrence rate being a function of a number of times the first phrase appears within a threshold number of words of the related phrase in the document collection.
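
Claims 9 and 19 say only that the expected and actual co-occurrence rates are "a function of" certain counts. The Python sketch below picks one simple pair of functions to make the comparison concrete. The specific formulas, the `min_ratio` parameter, the function names, and the representation of a document as a list of phrase tokens are all assumptions for illustration and are not stated in the claims; the default window of 100 follows the threshold named in claim 10.

```python
def expected_rate(docs_with_first, docs_with_related, total_docs):
    """Assumed form: the rate expected if the two phrases were independent."""
    return (docs_with_first / total_docs) * (docs_with_related / total_docs)


def count_near(tokens, first, related, window):
    """Count appearances of `first` within `window` tokens of `related`."""
    related_positions = [i for i, t in enumerate(tokens) if t == related]
    return sum(
        1
        for i, t in enumerate(tokens)
        if t == first and any(abs(i - j) <= window for j in related_positions)
    )


def predicts(collection, first, related, window=100, min_ratio=1.0):
    """True when the actual co-occurrence rate exceeds the expected rate.

    `collection` is a list of documents, each a list of phrase tokens.
    `min_ratio` is a hypothetical knob for how far the actual rate must
    exceed the expected one.
    """
    total = len(collection)
    docs_first = sum(first in doc for doc in collection)
    docs_related = sum(related in doc for doc in collection)
    if total == 0 or docs_first == 0 or docs_related == 0:
        return False

    # Assumed form: nearby co-occurrences per document in the collection.
    actual = sum(count_near(doc, first, related, window) for doc in collection) / total
    return actual > min_ratio * expected_rate(docs_first, docs_related, total)
```

Under these assumptions, a phrase that appears near another more often than independence would suggest is treated as predicting it, while narrowing `window` or raising `min_ratio` makes the test stricter.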


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna L.
Primary Examiner(s): Vy, Hung T
Application Number: US14/713,374
Publication Number: US 20150248415A1
Time in Patent Office: 641 Days
Field of Search: 707/754, 707/706, 707/705, 707/709, 707/711, 707/741, 707/758, 707/722, 707/723
US Class Current: 1/1
CPC Class Codes:
G06F 16/2237   Vectors, bitmaps or matrices
G06F 16/243   Natural language query form...
G06F 16/24578   using ranking
G06F 16/3322   using system suggestions G0...
G06F 16/3344   using natural language anal...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06Q 10/10   Office automation; Time man...
