Efficient retrieval algorithm by query term discrimination

US 7,925,644 B2
Filed: 02/27/2008
Issued: 04/12/2011
Est. Priority Date: 03/01/2007
Status: Expired due to Fees

Abstract

A method and system for use in information retrieval includes, for each of a plurality of terms, selecting a predetermined number of top scoring documents for the term to form a corresponding document set for the term. When a plurality of terms are received, optionally as a query, the system ranks, using an inverse document frequency algorithm, the plurality of terms for importance based on the document sets for the plurality of terms. Then a number of ranked terms are selected based on importance and a union set is formed based on the document sets associated with the selected number of ranked terms.
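
The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the `postings` layout, the function names, the smoothed logarithmic IDF formula, and the use of full posting-list length as the document frequency are all hypothetical choices, since the abstract names only "an inverse document frequency algorithm".

```python
import math
from typing import Dict, List, Tuple

# Hypothetical layout: term -> [(doc_id, score), ...] for every document containing the term.
Postings = Dict[str, List[Tuple[int, float]]]


def build_top_sets(postings: Postings, n: int) -> Postings:
    """Keep the n top-scoring documents per term, stored sorted by document ID."""
    top_sets = {}
    for term, docs in postings.items():
        best = sorted(docs, key=lambda d: d[1], reverse=True)[:n]
        top_sets[term] = sorted(best)  # tuples sort by doc_id first
    return top_sets


def rank_terms_by_idf(query_terms: List[str], postings: Postings, num_docs: int) -> List[str]:
    """Order query terms from most to least discriminating (assumed smoothed IDF)."""
    def idf(term: str) -> float:
        doc_freq = len(postings.get(term, []))
        return math.log((num_docs + 1) / (doc_freq + 1))
    return sorted(query_terms, key=idf, reverse=True)


def union_of_selected(ranked_terms: List[str], top_sets: Postings, k: int) -> List[int]:
    """Union of the document IDs in the sets of the k highest-ranked terms."""
    ids = set()
    for term in ranked_terms[:k]:
        ids.update(doc_id for doc_id, _ in top_sets.get(term, []))
    return sorted(ids)
```

With a toy index of five documents, a very common term ends up unselected, and the union is formed only from the sets of the rarer terms.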

19 Claims

1. A method for use in information retrieval, the method comprising:
- for each of a plurality of terms, selecting a predetermined number of top scoring documents for the term to form a corresponding document set for the term;
  
  receiving a query comprising a plurality of query terms;
  
  ranking the plurality of query terms received in the query based at least in part on the corresponding document sets for each of the plurality of query terms, wherein the ranking comprises using an inverse document frequency algorithm;
  
  selecting a number of ranked query terms from the plurality of query terms, wherein each selected ranked query term comprises its corresponding document set and each document in a respective document set comprises a document identification number;
  
  forming a union set based on the document sets associated with the selected number of ranked query terms; and
  
  for a document identification number in the union set, scanning a document set corresponding to an unselected query term for a matching document identification number, wherein the unselected query term is included in the query comprising the plurality of query terms.
  - 2. The method of claim 1, wherein the scanning comprises use of a pointer that indicates a scanning position in the document set corresponding to the unselected query term.
  - 3. The method of claim 1, wherein the scanning comprises use of jumping.
  - 4. The method of claim 1, wherein the scanning comprises use of jumping and binary searching.
  - 5. The method of claim 1, wherein the number of ranked query terms comprise two.
  - 6. The method of claim 1 further comprising repeating the scanning for more than one document identification number in the union set.
  - 7. The method of claim 1, wherein the predetermined number of documents comprises a number less than 25.
  - 8. The method of claim 1 further comprising, after a matching document identification number is determined from the scanning, comparing a document score associated with the matching document identification number in the document set corresponding to the unselected query term to a document score associated with the document identification number in the union set.
  - 9. The method of claim 8 further comprising, based at least in part on the comparing, jumping a pointer for the document set corresponding to the unselected query term or binary searching in the document set corresponding to the unselected query term.
  - 10. The method of claim 9, wherein jumping occurs if the document score associated with a document identification number in the document set corresponding to the unselected query term is less than the document score associated with the document identification number in the union set.
  - 11. The method of claim 9, wherein binary searching occurs if the document score associated with a document identification number in the document set corresponding to the unselected query term is not less than the document score associated with the document identification number in the union set.
  - 12. The method of claim 1, wherein the scoring the documents comprises BM25 scoring.
  - 13. The method of claim 1 further comprising outputting a list of documents.
  - 14. One or more computer memory devices comprising computer-executable instructions to perform the method of claim 1.
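
The scanning step of claim 1, together with the pointer, jumping, and binary searching of claims 2 to 4, might look like the following sketch. It assumes both ID lists are sorted in ascending order, and it uses a hypothetical fixed stride `jump`. It does not attempt the score-based choice between jumping and binary searching described in claims 8 to 11, and the function name is illustrative only.

```python
from bisect import bisect_left
from typing import List


def scan_for_matches(union_ids: List[int], doc_set_ids: List[int], jump: int = 4) -> List[int]:
    """Return the IDs in union_ids that also occur in doc_set_ids.

    Both lists are assumed to be sorted ascending. `pos` is the scanning
    pointer into doc_set_ids; it only ever moves forward.
    """
    matches = []
    pos = 0
    n = len(doc_set_ids)
    for target in union_ids:
        # Jumping: skip ahead in fixed strides while the entry one stride away is still too small.
        while pos + jump < n and doc_set_ids[pos + jump] < target:
            pos += jump
        # Binary searching: locate the target inside the bracketed window.
        window_end = min(pos + jump + 1, n)
        pos = bisect_left(doc_set_ids, target, pos, window_end)
        if pos >= n:
            break  # every remaining target is larger than the last stored ID
        if doc_set_ids[pos] == target:
            matches.append(target)
    return matches
```

Because the pointer never moves backward, each document set for an unselected term is traversed at most once per query, however many IDs the union set holds.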

15. An offline method for use in online information retrieval, the method comprising:
- for each of a plurality of terms, selecting a predetermined number of top scoring documents for the term to form a corresponding document set for the term; and
  
  storing the document sets for subsequent access responsive to an online query;
  
  receiving a query comprising a plurality of query terms;
  
  ranking the plurality of query terms using an inverse document frequency algorithm;
  
  selecting at least two ranked query terms from the plurality of query terms, wherein each selected, ranked query term comprises a corresponding document set of top scoring documents, wherein the selecting the at least two ranked query terms leaves at least one unselected query term from the plurality of query terms;
  
  forming a union set based on the document sets associated with the at least two ranked query terms;
  
  merging the union set with a document set corresponding to the at least one unselected query term; and
  
  outputting results based on the merging.
  - 16. The method of claim 15, wherein the scoring comprises BM25 scoring and wherein the predetermined number of top scoring documents comprises a number less than 25.
  - 17. One or more computer memory devices comprising computer-executable instructions to perform the method of claim 15.
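
One way to picture the merging and output steps of claim 15 is the sketch below. The claim does not say how scores are combined, so summing per-term scores is an assumption made for illustration, as are the `DocSet` layout, the function name, and the `limit` parameter.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: doc_id -> score of that document for one term.
DocSet = Dict[int, float]


def merge_and_rank(selected_sets: List[DocSet],
                   unselected_sets: List[DocSet],
                   limit: int = 10) -> List[Tuple[int, float]]:
    """Merge the union of the selected sets with the unselected sets and rank the result."""
    totals: Dict[int, float] = {}
    # Union set: every document from a selected term's set becomes a candidate.
    for doc_set in selected_sets:
        for doc_id, score in doc_set.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    # Merge: unselected terms add to existing candidates but never introduce new ones.
    for doc_set in unselected_sets:
        for doc_id in totals:
            if doc_id in doc_set:
                totals[doc_id] += doc_set[doc_id]
    # Output: highest combined score first, ties broken by document ID.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

In this sketch the unselected terms can reorder the candidates but cannot enlarge the candidate list, which is what keeps the work per query bounded by the size of the union set.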

18. An online information retrieval method comprising:
- receiving a query that comprises a plurality of terms;
  
  accessing documents or information about documents;
  
  based on the accessing, ranking the plurality of terms using an inverse document frequency algorithm;
  
  selecting a number of ranked terms, wherein each selected ranked term comprises a corresponding document set and each document in a respective document set comprises a document identification number;
  
  forming a union set based on the document sets associated with the selected number of ranked terms; and
  
  for a document identification number in the union set, scanning a document set corresponding to an unselected term for a matching document identification number, wherein the unselected term is included in the query comprising the plurality of terms.
  - 19. One or more computer memory devices comprising processor executable instructions to perform the method of claim 18.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zeng, HuaJun; Lin, Chenxi; Chen, Zheng; Zhang, Benyu; Wang, Jian; Ji, Lei
Primary Examiner(s): Stevens, Robert
Application Number: US12/038,652
Publication Number: US 20080215574A1
Time in Patent Office: 1,140 Days
Field of Search: 707/713, 707/723, 707/736, 707/999.5
US Class Current: 707/713
CPC Class Codes:
G06F 16/334 Query execution G06F16/335 ...
G06Q 10/10 Office automation; Time man...