Multi-stage query processing system and method for use with tokenspace repository

US 9,146,967 B2
Filed: 03/26/2013
Issued: 09/29/2015
Est. Priority Date: 08/13/2004
Status: Active Grant

Abstract

A multi-stage query processing system and method enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. At one or more stages of the multi-stage query processing system, a set of relevancy scores is used to select a subset of documents for presentation as an ordered list to a user. The set of relevancy scores can be derived in part from one or more sets of relevancy scores determined in prior stages of the multi-stage query processing system. In some embodiments, the multi-stage query processing system is capable of executing one or more passes on a user query and using information from each pass to expand the user query for use in a subsequent pass, improving the relevancy of documents in the ordered list.
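
As a rough illustration of the staged scoring described in the abstract, the Python sketch below scores candidates cheaply in a first stage, rescores them with a positional signal in a second stage, and orders them using both score sets. The function names, the toy documents, the proximity formula, and the way the two scores are summed are all assumptions made for illustration; the patent text does not specify them.

```python
from collections import Counter

def first_stage_score(query_terms, tokens, popularity):
    """Cheap score from term presence, term frequency, and popularity."""
    counts = Counter(tokens)
    presence = sum(1 for t in query_terms if counts[t] > 0)
    frequency = sum(counts[t] for t in query_terms)
    return presence + 0.1 * frequency + popularity  # weights are arbitrary

def second_stage_score(query_terms, tokens):
    """Positional score: reward query terms that occur close together."""
    positions = [i for i, tok in enumerate(tokens) if tok in query_terms]
    matched = {tokens[i] for i in positions}
    if len(matched) < 2:
        return 0.0
    # Smallest distance between occurrences of two different query terms.
    best = min(
        abs(i - j)
        for i in positions
        for j in positions
        if tokens[i] != tokens[j]
    )
    return 1.0 / best

def rank(query_terms, docs, popularity):
    """Keep both score sets, then order documents by their sum."""
    first = {d: first_stage_score(query_terms, toks, popularity[d])
             for d, toks in docs.items()}
    second = {d: second_stage_score(query_terms, docs[d]) for d in first}
    return sorted(docs, key=lambda d: first[d] + second[d], reverse=True)

docs = {
    "a": "fast query processing for web search".split(),
    "b": "query logs and slow batch processing".split(),
    "c": "cooking recipes".split(),
}
popularity = {"a": 0.2, "b": 0.2, "c": 0.9}
ordered = rank(["query", "processing"], docs, popularity)
print(ordered)
```

Documents "a" and "b" tie in the first stage here, so only the second-stage proximity signal separates them, which is the kind of refinement a later stage is meant to provide.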

33 Citations

21 Claims

1. A method of processing a query in a multi-stage query processing system having one or more processors and memory storing one or more programs for execution by the one or more processors to perform the method comprising:
- performing a first stage processing of a query, including:
  - retrieving a first set of document identifiers from an index in response to one or more query terms;
  - generating a first set of relevancy scores for a first set of compressed documents corresponding to at least a subset of the first set of document identifiers based on one or more of: presence of query terms, term frequency, and document popularity; and
  - storing the first set of relevancy scores in the memory;
- performing a second stage processing of the query, including:
  - generating a second set of relevancy scores for the documents in the first set of compressed documents based on one or more of: a list of token positions for one or more query terms in the query, distances between query terms in the documents, attributes of tokens in the documents, and text that appears around a query term used in a document of the first set of documents; and
  - storing the second set of relevancy scores in the memory;
- reading the first and second set of relevancy scores from the memory, and generating an ordered list of documents for further processing based on the first and second set of relevancy scores;
- automatically generating additional query terms from the documents in the ordered list of documents;
- formulating a new query using the additional query terms;
- processing the new query to retrieve a second set of document identifiers from the index and to generate a third set of relevancy scores based at least in part on the additional query terms; and
- using the third set of relevancy scores to select a set of top documents for presentation to the user.

2. The method of claim 1, wherein a respective token, of the tokens in the documents, is a phrase.

3. The method of claim 1, wherein the second set of relevancy scores are at least based on attributes of tokens in the documents, wherein the attributes comprise font attributes of tokens in the documents.

4. The method of claim 1, further comprising:
- decompressing at least a portion of the first set of compressed documents to recover a first set of tokens, wherein the first set of recovered tokens are associated with positions in the first set of compressed documents corresponding to the first set of document identifiers.

5. The method of claim 4, further comprising:
- reconstructing one or more portions of the first set of compressed documents using the first set of recovered tokens.

6. The method of claim 5, further comprising:
- presenting the reconstructed portions to a user in an ordered list of the set of top documents.

7. The method of claim 1, wherein the third set of relevancy scores are based on one or more positions of the query terms in the set of compressed documents corresponding to the second set of document identifiers.

8. The method of claim 1, wherein the first set of document identifiers corresponds to locations of tokens corresponding to the query terms in the tokenspace repository storing a set of compressed documents.

9. The method of claim 1, wherein retrieving the first set of document identifiers comprises using the index to produce a list of token positions for the one or more query terms and accessing a map to produce a set of document identifiers corresponding to the token positions.
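
Claims 8 and 9 describe going from query terms to token positions, and then through a map to document identifiers. A minimal sketch of that lookup follows, assuming a single flat token sequence, an inverted index from term to global positions, and a sorted list of document start offsets. These structures and all names are hypothetical; the multi-tiered mapping scheme mentioned in the abstract is not reproduced here.

```python
from bisect import bisect_right

# Hypothetical tokenspace: all documents concatenated into one token sequence.
documents = {
    101: ["multi", "stage", "query"],
    102: ["query", "expansion", "terms"],
    103: ["token", "positions"],
}

tokenspace = []
doc_starts = []  # global position of each document's first token
doc_ids = []     # document identifier stored at the same list index
for doc_id, tokens in documents.items():
    doc_starts.append(len(tokenspace))
    doc_ids.append(doc_id)
    tokenspace.extend(tokens)

# Inverted index: term -> list of global token positions.
index = {}
for pos, tok in enumerate(tokenspace):
    index.setdefault(tok, []).append(pos)

def position_to_doc(pos):
    """Map a global token position to the document that contains it."""
    return doc_ids[bisect_right(doc_starts, pos) - 1]

def retrieve_doc_ids(query_terms):
    """Index lookup yields positions; the map turns positions into doc ids."""
    positions = [p for term in query_terms for p in index.get(term, [])]
    return sorted({position_to_doc(p) for p in positions})

result = retrieve_doc_ids(["query"])
print(result)
```

Because the start offsets are sorted, a binary search is enough to find the owning document for any position, so the index itself only needs to store positions.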

10. A multi-stage query processing system, comprising:
- one or more processors;
- memory; and
- one or more programs stored in the memory, the one or more programs comprising instructions for:
  - performing a first stage processing of a query, including:
    - retrieving a first set of document identifiers from an index in response to one or more query terms;
    - generating a first set of relevancy scores for a first set of compressed documents corresponding to at least a subset of the first set of document identifiers based on one or more of: presence of query terms, term frequency, and document popularity; and
    - storing the first set of relevancy scores in the memory;
  - performing a second stage processing of the query, including:
    - generating a second set of relevancy scores for the documents in the first set of compressed documents based on one or more of: a list of token positions for one or more query terms in the query, distances between query terms in the documents, attributes of tokens in the documents, and text that appears around a query term used in a document of the first set of documents; and
    - storing the second set of relevancy scores in the memory;
  - reading the first and second set of relevancy scores from the memory, and generating an ordered list of documents for further processing based on the first and second set of relevancy scores;
  - automatically generating additional query terms from the documents in the ordered list of documents;
  - formulating a new query using the additional query terms;
  - processing the new query to retrieve a second set of document identifiers from the index and to generate a third set of relevancy scores based at least in part on the additional query terms; and
  - using the third set of relevancy scores to select a set of top documents for presentation to the user.

11. The system of claim 10, wherein a respective token of the tokens in the documents is a phrase.

12. The system of claim 10, wherein the second set of relevancy scores are at least based on attributes of tokens in the documents, wherein the attributes comprise font attributes of tokens in the documents.

13. The system of claim 10, further comprising instructions for:
- decompressing at least a portion of the first set of compressed documents to recover a first set of tokens, wherein the first set of recovered tokens are associated with positions in the first set of compressed documents corresponding to the first set of document identifiers.

14. The system of claim 13, further comprising instructions for:
- reconstructing one or more portions of the first set of compressed documents using the first set of recovered tokens.

15. The system of claim 14, further comprising instructions for:
- presenting the reconstructed portions to a user in an ordered list of the set of top documents.
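
The decompress, reconstruct, and present steps of claims 13 to 15 can be pictured as decoding only a small window of a compressed document around a query hit. The sketch below is one way this might look, assuming documents are stored as lists of token ids and a lexicon maps ids back to strings. Dictionary encoding is used here only as a stand-in for whatever compression the repository actually applies, and every name is hypothetical.

```python
# Hypothetical lexicon: token id -> token string.
lexicon = ["the", "index", "maps", "query", "terms", "to", "token", "positions"]
token_ids = {tok: i for i, tok in enumerate(lexicon)}

def compress(text):
    """Encode whitespace-separated text as a list of token ids."""
    return [token_ids[tok] for tok in text.split()]

def recover_tokens(encoded, start, end):
    """Decode only the requested slice, keeping each token's position."""
    start = max(start, 0)
    return [(pos, lexicon[tid])
            for pos, tid in enumerate(encoded[start:end], start)]

def snippet(encoded, query_term, radius=2):
    """Reconstruct the text around the first occurrence of query_term."""
    tid = token_ids.get(query_term)
    if tid is None or tid not in encoded:
        return None
    hit = encoded.index(tid)
    window = recover_tokens(encoded, hit - radius, hit + radius + 1)
    return " ".join(tok for _, tok in window)

encoded_doc = compress("the index maps query terms to token positions")
text = snippet(encoded_doc, "query")
print(text)
```

Only the tokens inside the window are decoded, which is the point of reconstructing portions rather than whole documents when building a result list.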

16. A non-transitory computer-readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for:
- performing a first stage processing of a query, including:
  - retrieving a first set of document identifiers from an index in response to one or more query terms;
  - generating a first set of relevancy scores for a first set of compressed documents corresponding to at least a subset of the first set of document identifiers based on one or more of: presence of query terms, term frequency, and document popularity; and
  - storing the first set of relevancy scores in the memory;
- performing a second stage processing of the query, including:
  - generating a second set of relevancy scores for the documents in the first set of compressed documents based on one or more of: a list of token positions for one or more query terms in the query, distances between query terms in the documents, attributes of tokens in the documents, and text that appears around a query term used in a document of the first set of documents; and
  - storing the second set of relevancy scores in the memory;
- reading the first and second set of relevancy scores from the memory, and generating an ordered list of documents for further processing based on the first and second set of relevancy scores;
- automatically generating additional query terms from the documents in the ordered list of documents;
- formulating a new query using the additional query terms;
- processing the new query to retrieve a second set of document identifiers from the index and to generate a third set of relevancy scores based at least in part on the additional query terms; and
- using the third set of relevancy scores to select a set of top documents for presentation to the user.

17. The non-transitory computer-readable storage medium of claim 16, wherein a respective token of the tokens in the documents is a phrase.

18. The non-transitory computer-readable storage medium of claim 16, wherein the second set of relevancy scores are at least based on attributes of tokens in the documents, wherein the attributes comprise font attributes of tokens in the documents.

19. The non-transitory computer-readable storage medium of claim 16, further comprising instructions for:
- decompressing at least a portion of the first set of compressed documents to recover a first set of tokens, wherein the first set of recovered tokens are associated with positions in the first set of compressed documents corresponding to the first set of document identifiers.

20. The non-transitory computer-readable storage medium of claim 19, further comprising instructions for:
- reconstructing one or more portions of the first set of compressed documents using the first set of recovered tokens.

21. The non-transitory computer-readable storage medium of claim 20, further comprising instructions for:
- presenting the reconstructed portions to a user in an ordered list of the set of top documents.
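
Each independent claim ends with a feedback pass: additional query terms are generated automatically from the ordered documents, and a new query is formed from them. The claims do not say how the terms are chosen. The sketch below uses a simple frequency rule over the top-ranked documents, in the spirit of pseudo-relevance feedback; the stopword list, the selection rule, and the function names are assumptions for illustration only.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "for", "in", "to"}  # illustrative only

def expansion_terms(query_terms, top_docs, k=2):
    """Pick the k most frequent terms in top_docs not already in the query."""
    counts = Counter(
        tok
        for tokens in top_docs
        for tok in tokens
        if tok not in STOPWORDS and tok not in query_terms
    )
    # Sort by descending count, then alphabetically, for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def expand_query(query_terms, top_docs, k=2):
    """Form the new query from the original terms plus the expansion terms."""
    return list(query_terms) + expansion_terms(query_terms, top_docs, k)

top_docs = [
    "jaguar speed in the wild cat family".split(),
    "the jaguar is a big cat of the americas".split(),
    "cat habitat and jaguar range".split(),
]
new_query = expand_query(["jaguar"], top_docs)
print(new_query)
```

The expanded query would then be run through retrieval and scoring again to produce the third set of relevancy scores used for the final selection.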

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Dean, Jeffrey A.; Haahr, Paul G.; Sercinoglu, Olcan; Singhal, Amitabh K.
Primary Examiner(s): Reyes, Mariela
Assistant Examiner(s): Almani, Mohsen
Application Number: US13/851,036
Publication Number: US 20130212092A1
Time in Patent Office: 917 Days
Field of Search: 707/764
US Class Current: 1/1
CPC Class Codes:
- G06F 16/24578   using ranking
- G06F 16/30   of unstructured textual dat...
- G06F 16/951   Indexing; Web crawling tech...
