Efficient retrieval algorithm by query term discrimination

US 7,822,752 B2
Filed: 05/18/2007
Issued: 10/26/2010
Est. Priority Date: 05/18/2007
Status: Expired due to Fees

Abstract

Described is an efficient retrieval mechanism that quickly locates documents (e.g., corresponding to online advertisements) based on query term discrimination. A topmost subset (e.g., two) of search terms is selected according to their ranked importance, e.g., as ranked by inverted document frequency. The topmost terms are then used to narrow the number of rows of an inverted query index that are searched to find document identifiers and associated scores, such as computed offline by a BM25 algorithm. For example, for each document identifier of each important term, a fast search within each of the narrowed subset of rows (that also contain that document identifier) may be performed by comparing document identifiers to jump a pointer within each other row, followed by a binary search to locate a particular document. The scores of the set of particular documents may then be used to rank their relative importance for returning as results.
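
The term-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that "inverted document frequency" refers to the usual inverse document frequency, that rarer terms rank as more important, and that a smoothed logarithmic IDF is acceptable. The names `idf`, `top_terms`, `doc_freq`, and `num_docs`, along with the sample counts, are hypothetical.

```python
import math

def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = doc_freq.get(term, 0)
    return math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)

def top_terms(query, doc_freq, num_docs, k=2):
    """Parse the query, rank its terms by IDF, and keep the k most important."""
    # Split on whitespace and drop repeated terms while keeping their order.
    terms = list(dict.fromkeys(query.lower().split()))
    ranked = sorted(terms, key=lambda t: idf(t, doc_freq, num_docs), reverse=True)
    return ranked[:k]

# Hypothetical document frequencies for a collection of 1000 documents.
doc_freq = {"cheap": 400, "flights": 120, "to": 950, "reykjavik": 3}
selected = top_terms("cheap flights to Reykjavik", doc_freq, num_docs=1000)
print(selected)  # ['reykjavik', 'flights']
```

With `k=2` the sketch mirrors the "topmost subset (e.g., two)" wording: only the rows for the two rarest terms are used to narrow the search.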

16 Claims

1. A system comprising:
- a ranking mechanism that ranks terms of a query based on one or more importance criteria into a set of ranked terms; and
  
  a merge mechanism that searches an index of terms, at least one term having a set of one or more document identifiers associated with a score and identifying a data structure that contains the at least one term, the merge mechanism configured to:
  
  search the index by choosing a topmost subset of the ranked terms based on a ranked importance of individual ranked terms; and
  
  search for documents in a subset of rows of the index, the subset of rows selected by having a relationship with at least one document identifier corresponding to a term of the topmost subset of the ranked terms, wherein the merge mechanism is configured to search for the documents by locating one or more particular documents within at least one row of the subset of rows by jumping within the at least one row to one or more jump points corresponding to associated document identifiers in the at least one row, wherein the jumping is based on a comparison of one or more document identifiers within the at least one row and a reference document identifier within another row corresponding to the topmost subset of the ranked terms; and
  
  at least one computing device configured to implement one or both of the ranking mechanism or the merge mechanism.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system of claim 1 wherein the topmost subset of the ranked terms comprises two ranked terms of the set of ranked terms as determined according to the one or more importance criteria.
  - 3. The system of claim 1 wherein the one or more importance criteria includes an inverted document frequency of individual terms of the query.
  - 4. The system of claim 1 wherein the merge mechanism is configured to locate the one or more particular documents by further performing a binary search between one of the jump points that is less than the reference document identifier, and another of the jump points that is greater than the reference document identifier.
  - 5. The system of claim 1 further comprising means for building the index from a document set that includes the documents, wherein the index is built in an offline state independent of providing results based on one or both of the query or another query.
  - 6. The system of claim 1 wherein the one or more document identifiers correspond to advertisements.
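
The jump-then-binary-search lookup recited in claims 1 and 4 might look like the sketch below. It is an illustration under stated assumptions, not the claimed implementation: each row is modeled as a sorted Python list of document identifiers, jump points are assumed to fall every `step` entries, and the function name `jump_search` and the sample row are hypothetical.

```python
from bisect import bisect_left

def jump_search(row, ref_id, start=0, step=4):
    """Look for ref_id in the sorted list `row`, starting at index `start`.

    Returns (found, position). The position can be passed back as `start`
    when searching for the next, larger reference identifier.
    """
    lo = start
    # Jump forward while the next jump point is still below the reference id.
    while lo + step < len(row) and row[lo + step] < ref_id:
        lo += step
    # The reference id, if present, now lies between two adjacent jump points.
    hi = min(lo + step + 1, len(row))
    pos = bisect_left(row, ref_id, lo, hi)  # binary search inside the window
    found = pos < len(row) and row[pos] == ref_id
    return found, pos

# Hypothetical row of document identifiers for one term.
row = [2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32]
print(jump_search(row, 23))  # (True, 7)
print(jump_search(row, 24))  # (False, 8)
```

Because the pointer only moves forward, a sequence of increasing reference identifiers can be resolved in one pass over the row, with the binary search confined to a window of at most `step + 1` entries.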

7. A computer-readable storage medium having computer executable-instructions, which when executed perform steps comprising:
- (a) parsing a query into terms;
  
  (b) ranking the terms based on one or more importance criteria into a set of ranked terms;
  
  (c) selecting a subset of the set of ranked terms as a most important term set;
  
  (d) selecting as a selected row of an inverted query index a row that corresponds to the most important term set;
  
  (e) for a document identifier in the selected row, finding other rows in the inverted query index that have a matching document identifier;
  
  (f) moving a pointer in individual other rows based upon a comparison of the document identifier in the selected row with the individual other rows' matching document identifiers, followed by one or more binary searches, to locate particular documents in the other rows; and

  (g) using a score associated with the other rows' matching document identifiers to rank the particular documents for returning as a result set.
- Dependent claims (8, 9, 10, 11):
  - 8. The computer-readable storage medium of claim 7, wherein the computer-executable instructions, when executed, further perform a step comprising selecting as a new selected row a different row that corresponds to the most important term set, and returning to step (d).
  - 9. The computer-readable storage medium of claim 7, wherein ranking the terms based on one or more importance criteria includes ranking the terms based on an inverted document frequency for each term, and wherein selecting the subset of the ranked terms as the most important term set comprises selecting two terms having a lower inverted document frequency.
  - 10. The computer-readable storage medium of claim 7, wherein moving the pointer in individual other rows further includes performing at least one of the one or more binary searches between one jump point corresponding to an associated document identifier in one other row of the other rows that is less than the document identifier within the selected row, and another jump point corresponding to another associated document identifier in the one other row that is greater than the document identifier within the selected row.
  - 11. The computer-readable storage medium of claim 7, wherein the computer-executable instructions, when executed, further perform a step comprising building the inverted query index in an offline state from a set of documents corresponding to advertisements.
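
Claim 11 recites building the inverted query index offline, and the abstract mentions scores "such as computed offline by a BM25 algorithm". A minimal sketch of such a build step follows. It assumes whitespace tokenization, the standard Okapi BM25 term weight with conventional defaults `k1=1.2` and `b=0.75`, and a non-empty document set. The function name `build_index`, the row layout, and the sample advertisements are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(docs, k1=1.2, b=0.75):
    """Map each term to a row of (doc_id, score) pairs sorted by doc_id.

    docs: dict of doc_id -> text. Scores are per-term BM25 weights.
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for t in tokens.values() for term in set(t))
    index = defaultdict(list)
    for doc_id in sorted(tokens):  # ascending ids keep every row sorted
        toks = tokens[doc_id]
        for term, tf in Counter(toks).items():
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            index[term].append((doc_id, idf * tf * (k1 + 1) / norm))
    return dict(index)

# Hypothetical advertisement texts keyed by document identifier.
ads = {3: "cheap flights to oslo", 1: "cheap hotel deals", 2: "flights and hotel bundles"}
index = build_index(ads)
print([doc_id for doc_id, _ in index["cheap"]])  # [1, 3]
```

Precomputing the scores this way means query time only needs lookups and additions, which is consistent with the offline build described in the claims.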

12. A method comprising:
- parsing a query into terms;
  
  ranking the terms based on one or more importance criteria into a set of ranked terms;
  
  selecting a subset of the set of ranked terms as a term set;
  
  selecting, as a selected row, a row of an inverted query index that corresponds to the term set;
  
  for a document identifier in the selected row, finding other rows in the inverted query index that have a matching document identifier;
  
  moving a pointer in individual other rows based upon a comparison of the document identifier in the selected row with the individual other rows' matching document identifiers, followed by one or more binary searches, to locate particular documents in the other rows; and

  using a score associated with the other rows' matching document identifiers to rank the particular documents for returning as a result set, wherein at least one of parsing the query, ranking the terms, selecting the subset of the ranked terms, selecting the row, finding the other rows, moving the pointer, or using the score is implemented by at least one computing device.
- Dependent claims (13, 14, 15, 16):
  - 13. The method of claim 12, further comprising selecting, as a new selected row, a different row that corresponds to the term set.
  - 14. The method of claim 12, wherein ranking the terms based on one or more importance criteria includes ranking the terms based on an inverted document frequency for each term, and wherein selecting the subset of the ranked terms comprises selecting two terms having a lower inverted document frequency.
  - 15. The method of claim 12, wherein moving the pointer in individual other rows includes performing at least one of the one or more binary searches between one jump point corresponding to an associated document identifier in one other row of the other rows that is less than the document identifier within the selected row, and another jump point corresponding to another associated document identifier in the one other row that is greater than the document identifier within the selected row.
  - 16. The method of claim 12, further comprising building the inverted query index in an offline state from a set of documents corresponding to advertisements.
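
The steps of the method in claim 12 can be strung together roughly as shown below. This is one possible reading, offered as a sketch rather than the claimed method. Row length stands in for the importance criterion (shorter rows mean rarer terms), candidate documents are taken from the rows of the top terms, and a lower-bounded `bisect_left` stands in for the jump points and binary search. The function name `retrieve`, the index layout, and the sample scores are hypothetical.

```python
from bisect import bisect_left

def retrieve(query, index, k=2, limit=10):
    """index: term -> list of (doc_id, score) pairs sorted by doc_id."""
    # Parse the query and keep only terms that have a row in the index.
    terms = [t for t in dict.fromkeys(query.lower().split()) if t in index]
    if not terms:
        return []
    # Rank the terms and split them into the top subset and the rest.
    ranked = sorted(terms, key=lambda t: len(index[t]))
    top, rest = ranked[:k], ranked[k:]
    # Candidate documents come only from the rows of the top terms.
    totals = {}
    for term in top:
        for doc_id, score in index[term]:
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    # Probe each remaining row once with a pointer that only moves forward.
    for term in rest:
        row = index[term]
        ids = [doc_id for doc_id, _ in row]
        ptr = 0
        for doc_id in sorted(totals):
            ptr = bisect_left(ids, doc_id, ptr)
            if ptr < len(ids) and ids[ptr] == doc_id:
                totals[doc_id] += row[ptr][1]
    # Rank the located documents by their accumulated score.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:limit]

# Hypothetical index rows with precomputed scores.
sample_index = {
    "reykjavik": [(2, 3.0), (7, 2.5)],
    "flights": [(2, 1.0), (4, 1.2), (7, 0.8), (9, 1.1)],
    "cheap": [(1, 0.3), (2, 0.4), (4, 0.2), (5, 0.3), (9, 0.5)],
}
results = retrieve("cheap flights reykjavik", sample_index)
print([doc_id for doc_id, _ in results])  # [2, 7, 9, 4]
```

Documents 1 and 5 contain only the least discriminating term, so they never enter the candidate set. That pruning is what narrows the number of rows and entries that need to be examined.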

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation), Mitsubishi Electric Corporation
Original Assignee: Microsoft Corporation
Inventors: Chen, Zheng; Zhang, Benyu; Wang, Jian; Ji, Lei; Lin, Chenxi; Zeng, Huajun
Primary Examiner(s): Vy, Hung T
Assistant Examiner(s): Lie, Angela M
Application Number: US11/804,627
Publication Number: US 20080288483A1
Time in Patent Office: 1,257 Days
Field of Search: 707/1-5, 707/713, 707/715, 707/736, 707/741, 707/748, 707/758, 707/769
US Class Current: 707/748
CPC Class Codes: G06F 16/334 Query execution G06F16/335 ...
