Efficient retrieval algorithm by query term discrimination

US 20080288483A1
Filed: 05/18/2007
Published: 11/20/2008
Est. Priority Date: 05/18/2007
Status: Active Grant

Abstract

Described is an efficient retrieval mechanism that quickly locates documents (e.g., corresponding to online advertisements) based on query term discrimination. A topmost subset (e.g., two) of search terms is selected according to their ranked importance, e.g., as ranked by inverted document frequency. The topmost terms are then used to narrow the number of rows of an inverted query index that are searched to find document identifiers and associated scores, such as computed offline by a BM25 algorithm. For example, for each document identifier of each important term, a fast search within each of the narrowed subset of rows (that also contain that document identifier) may be performed by comparing document identifiers to jump a pointer within each other row, followed by a binary search to locate a particular document. The scores of the set of particular documents may then be used to rank their relative importance for returning as results.
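The term-selection step in the abstract can be illustrated with a minimal sketch. It assumes that "most important" means rarest (highest inverse document frequency), that the index maps each term to a sorted list of document identifiers, and that a smoothed logarithmic IDF is acceptable; the names `idf`, `choose_important_terms`, and `toy_index` are hypothetical and do not come from the patent.

```python
import math

def idf(term, index, num_docs):
    """Smoothed inverse document frequency: rarer terms get larger values."""
    doc_freq = len(index[term])  # row length = number of documents with the term
    return math.log((num_docs + 1) / (doc_freq + 1))

def choose_important_terms(query_terms, index, num_docs, subset_size=2):
    """Rank query terms by IDF (highest first) and keep the topmost subset."""
    # Drop duplicates (keeping first-seen order) and terms with no index row.
    candidates = [t for t in dict.fromkeys(query_terms) if t in index]
    ranked = sorted(candidates, key=lambda t: idf(t, index, num_docs), reverse=True)
    return ranked[:subset_size]

# Hypothetical toy index: term -> sorted document identifiers.
toy_index = {
    "cheap": [1, 2, 3, 4, 5, 6],
    "flights": [2, 4, 6],
    "reykjavik": [4],
}

print(choose_important_terms(["cheap", "flights", "reykjavik"], toy_index, num_docs=6))
```

With this toy data the common term "cheap" is ranked last, so only the shorter rows for "reykjavik" and "flights" would drive the subsequent search.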

20 Claims

1. In a computing environment, a method comprising:
- ranking search terms of a query based on one or more importance criterion;
  
  choosing a subset of the search terms based on the ranking; and
  
  locating documents in an index that is indexed by terms including terms that correspond to the search terms, including using the subset to determine a reduced number of rows in the index to search for relevant documents.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1 wherein ranking the search terms comprises sorting the search terms based on each one's inverted document frequency.
  - 3. The method of claim 1 wherein choosing the subset of the search terms based on the ranking comprises selecting two top-ranked search terms.
  - 4. The method of claim 1 wherein using the subset to determine the reduced number of rows in the index comprises selecting other rows in the index that include a document identifier that matches a document identifier within a row corresponding to the subset.
  - 5. The method of claim 4 wherein locating the documents comprises moving a pointer within each other row based on a comparison of a document identifier within a subset row and a document identifier within each other row to locate a particular document within each other row.
  - 6. The method of claim 5 further comprising ranking results including using a score associated with each of the particular documents.
  - 7. The method of claim 1 further comprising, building the index, including computing a score for each document with respect to a term.
  - 8. The method of claim 7 wherein computing the score comprises using a BM25 algorithm.
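
Claims 7 and 8 describe building the index with a per-document, per-term score computed by a BM25 algorithm. The following is a minimal sketch of one way that could look, not the patented implementation. It assumes the widely used BM25 form with parameters `k1=1.2` and `b=0.75` and a Lucene-style IDF term, and it assumes documents arrive as lists of tokens; `build_index` and `docs` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def build_index(docs, k1=1.2, b=0.75):
    """Build term -> [(doc_id, bm25_score), ...], each row sorted by doc_id.

    docs maps a document identifier to its list of tokens (an assumed layout).
    """
    num_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / num_docs
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)

    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep every row sorted
        doc_len = len(docs[doc_id])
        for term, freq in term_freqs[doc_id].items():
            idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = freq + k1 * (1 - b + b * doc_len / avg_len)
            index[term].append((doc_id, idf * freq * (k1 + 1) / norm))
    return dict(index)

# Hypothetical toy corpus, e.g. tokenized advertisement texts.
docs = {
    1: ["cheap", "flights"],
    2: ["cheap", "hotels"],
    3: ["flights", "reykjavik", "flights"],
}
for term, row in sorted(build_index(docs).items()):
    print(term, [(doc_id, round(score, 3)) for doc_id, score in row])
```

Because the scores depend only on the document collection, they can be computed once offline and simply read from the rows at query time.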

9. In a computing environment, a system comprising:
- a ranking mechanism that ranks terms of an incoming query based on one or more importance criteria into a set of ranked terms; and
  
  a merge mechanism that searches an index of terms, each term having a set of one or more document identifiers identifying a data structure that contains that term, each document identifier associated with a score, the merge mechanism configured to search the index by choosing a topmost subset of the ranked terms based on their ranked importance, and to search for documents in a subset of rows of the index in which the subset of the rows is selected by having a relationship with at least one document identifier corresponding to a term of the topmost subset.
- Dependent Claims (10, 11, 12, 13, 14, 15)
  - 10. The system of claim 9 wherein the topmost subset corresponds to the top two ranked terms of the query as determined according to the one or more importance criteria.
  - 11. The system of claim 9 wherein the one or more importance criteria includes the inverted document frequency of each term.
  - 12. The system of claim 9 wherein the merge mechanism searches for documents by locating a particular document within each row of the subset of the rows, including by jumping within that row to jump points corresponding to document identifiers in that row, based on a comparison of document identifiers within the row and a document identifier within a row corresponding to the topmost subset.
  - 13. The system of claim 12 wherein the merge mechanism locates the particular document by further performing a binary search between one jump point corresponding to a document identifier in the subset of rows that is less than the document identifier within the row corresponding to the topmost subset, and another jump point corresponding to document identifier in the subset of rows that is greater than the document identifier within the row corresponding to the topmost subset.
  - 14. The system of claim 9 further comprising means for building the index from a set of documents, in which the index is built in an offline state independent of providing results based on queries.
  - 15. The system of claim 9 wherein the document identifiers correspond to advertisements.
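
The lookup in claims 12 and 13 (jump to jump points within a row, then binary-search between the bracketing jump points) can be sketched as follows. This is an illustrative reading only: it assumes each row is a sorted list of document identifiers and that jump points sit at a fixed stride, neither of which the claims specify. The function name `seek` and the `stride` parameter are hypothetical.

```python
from bisect import bisect_left

def seek(row, target, pos=0, stride=4):
    """Look for target in a sorted row, starting from pointer position pos.

    Returns (index_or_None, new_pos). The pointer only moves forward, so
    successive calls with increasing targets never rescan earlier entries.
    """
    n = len(row)
    lo = pos
    # Jump ahead while the next jump point still holds a smaller identifier.
    while lo + stride < n and row[lo + stride] < target:
        lo += stride
    # The target, if present, now lies between this jump point and the next.
    hi = min(lo + stride + 1, n)
    i = bisect_left(row, target, lo, hi)
    if i < n and row[i] == target:
        return i, i
    return None, i

# Hypothetical rows: identifiers from a selected row probed in another row.
selected_row = [5, 21, 29]
other_row = [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]

pos, matches = 0, []
for doc_id in selected_row:
    found, pos = seek(other_row, doc_id, pos)
    if found is not None:
        matches.append(doc_id)
print(matches)
```

Carrying `pos` from one call to the next is what lets a short selected row be matched against a long row without visiting most of the long row's entries.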

16. A computer-readable medium having computer-executable instructions, which when executed perform steps, comprising:
- (a) parsing a query into terms;
  
  (b) ranking the terms based on one or more importance criteria into a set of ranked terms;
  
  (c) selecting a subset of the ranked terms as a most important term set;
  
  (d) selecting as a selected row of an inverted query index a row that corresponds to the most important term set;
  
  (e) for each document identifier in each selected row, finding other rows in the index that have a matching document identifier;
  
  (f) moving a pointer in each of the other rows based upon a comparison of the document identifier in the selected row with the document identifier in each of the other rows, followed by a binary search, to locate a particular document in each other row; and
  
  (g) using a score associated with each particular document identifier to rank the corresponding documents for returning as a result set.
- Dependent Claims (17, 18, 19, 20)
  - 17. The computer-readable medium of claim 16 having further computer-executable instructions comprising, selecting as a new selected row a different row that corresponds to the most important term set, and returning to step (d).
  - 18. The computer-readable medium of claim 16 wherein ranking the terms based on one or more importance criteria includes ranking the terms based on an inverted document frequency for each term, and wherein selecting the subset of the ranked terms as a most important term set comprises selecting the two terms having the lowest inverted document frequency.
  - 19. The computer-readable medium of claim 16 wherein moving the pointer in each of the other rows further includes performing a binary search between one jump point corresponding to a document identifier in one of the other rows that is less than the document identifier within the selected row, and another jump point corresponding to a document identifier in that other row that is greater than the document identifier within the selected row.
  - 20. The computer-readable medium of claim 16 having further computer-executable instructions comprising, building the inverted query index from a set of documents corresponding to advertisements, in which the index is built in an offline state.
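
Steps (a) through (g) of claim 16 can be traced end to end in a short sketch. It is one possible interpretation under stated assumptions: terms are split on whitespace, importance is highest IDF, each index row is a pair of parallel lists (sorted document identifiers and precomputed scores), a document's final score is the sum of its per-term scores, and a plain binary search with a forward-only pointer stands in for the jump-point search. The names `search` and `toy_index` are hypothetical.

```python
import math
from bisect import bisect_left

def search(query, index, num_docs, subset_size=2, limit=10):
    """index maps term -> (sorted doc ids, parallel list of precomputed scores)."""
    # (a) Parse the query into distinct terms that have a non-empty index row.
    terms = [t for t in dict.fromkeys(query.lower().split())
             if t in index and index[t][0]]
    # (b) Rank the terms by IDF, rarest first.
    ranked = sorted(terms, key=lambda t: math.log(num_docs / len(index[t][0])),
                    reverse=True)
    # (c) Keep the most important terms.
    important = ranked[:subset_size]

    totals = {}
    for selected in important:            # (d) take each selected row in turn
        pointers = {t: 0 for t in terms}  # one forward-only pointer per row
        for doc_id in index[selected][0]: # (e) each identifier in that row
            if doc_id in totals:
                continue                  # already scored via an earlier row
            total = 0.0
            for term in terms:            # (f) locate the identifier in each row
                ids, scores = index[term]
                i = bisect_left(ids, doc_id, pointers[term])
                pointers[term] = i
                if i < len(ids) and ids[i] == doc_id:
                    total += scores[i]
            totals[doc_id] = total
    # (g) Rank by score, highest first; ties broken by identifier.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical toy index with made-up scores.
toy_index = {
    "cheap": ([1, 2, 3, 4, 5, 6], [0.25, 0.25, 0.25, 0.25, 0.25, 0.25]),
    "flights": ([2, 4, 6], [0.5, 0.75, 0.5]),
    "reykjavik": ([4], [1.5]),
}
print(search("cheap flights Reykjavik", toy_index, num_docs=6))
```

In this toy run, documents 1, 3, and 5 contain only the least important term and are never scored, which is the reduction in searched rows that the claims describe.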

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation), Mitsubishi Electric Corporation
Original Assignee: Microsoft Corporation
Inventors: Chen, Zheng; Zhang, Benyu; Wang, Jian; Ji, Lei; Lin, Chenxi; Zeng, Huajun
Granted Patent: US 7,822,752 B2
US Class Current: 707/5
CPC Class Codes: G06F 16/334 Query execution G06F16/335 ...
