Machine-learned approach to determining document relevance for search over large electronic collections of documents

US 7,287,012 B2
Filed: 01/09/2004
Issued: 10/23/2007
Est. Priority Date: 01/09/2004
Status: Active Grant

Abstract

The present invention relates to a system and methodology that applies automated learning procedures for determining document relevance and assisting information retrieval activities. A system is provided that facilitates a machine-learned approach to determine document relevance. The system includes a storage component that receives a set of human selected items to be employed as positive test cases of highly relevant documents. A training component trains at least one classifier with the human selected items as positive test cases and one or more other items as negative test cases in order to provide a query-independent model, wherein the other items can be selected by a statistical search, for example. Also, the trained classifier can be employed to aid an individual in identifying and selecting new positive cases or utilized to filter or re-rank results from a statistical-based search.

30 Claims

1. A computer-implemented system that facilitates a machine-learned approach to determine document relevance, comprising:
- a storage component that receives a set of human or machine selected items to be employed as positive test cases; and
  
  a training component that trains at least one classifier with the human or machine selected items as positive test cases and one or more other items as negative test cases in order to provide a query-independent model, the trained classifier is employed to filter documents obtained from statistical-based or probabilistic-based searches.
- Dependent claims (2-21)
  - 2. The system of claim 1, the negative test cases selected by a statistical search.
  - 3. The system of claim 1, the trained classifier is employed to aid an individual in selecting new positive cases.
  - 4. The system of claim 1, outputs of the filter are ranked such that positive cases are ranked before negative cases.
  - 5. The system of claim 1, the outputs are ranked according to a probability they are a positive case.
  - 6. The system of claim 1, the storage component includes logs of relevant sites of interest for users, documents, or data items.
  - 7. The system of claim 6, the storage component includes information for a centralized store or from divergent sources such as web sites, document collections, encyclopedias, local data sources and remote data sources.
  - 8. The system of claim 1, the classifier is employed to automatically analyze data in the storage component in order to assist one or more tools that can interact with a user interface.
  - 9. The system of claim 8, the tools include at least one of an administrative tool, an editing tool, and a ranking tool.
  - 10. The system of claim 8, the tools are employed in at least one of an online and an offline manner.
  - 11. The system of claim 1, the classifiers are trained according to positive and negative test data in order to determine an item's relevance such as from documents or links that suggest other sites of useful information.
  - 12. The system of claim 11, further comprising a set of manually selected documents or items to train a machine-learned classifier.
  - 13. The system of claim 11, the classifier is applied to new terms to identify best bet or relevant documents.
  - 14. The system of claim 11, further comprising bootstrapping new models over various training iterations to facilitate a growing model of learned expressions that are employed for more accurate information retrieval activities.
  - 15. The system of claim 14, further comprising best bets that are hand-selected by an editor.
  - 16. The system of claim 15, further comprising a component to maximize a likelihood of displaying types of documents or items that users are likely to think are interesting enough to view or retrieve.
  - 17. The system of claim 1, the classifier includes at least one of the following learning techniques:
    - Support Vector Machines (SVM), a Naive Bayes, a Bayes Net, a decision tree, similarity-based, a vector-based, a Hidden Markov Model, or other learning technique.
  - 18. The system of claim 1, further comprising a component to perform post-processing of information to determine a document or site's relevance to a user or administrator.
  - 19. The system of claim 18, the post-processing includes ranking in accordance with predetermined probability thresholds, items having a higher probability of being relevant are presented before items of lower probability.
  - 20. The system of claim 18, further comprising explicit annotations that are added to displayed items to indicate a document or site's relevance or importance.
  - 21. A computer readable medium having computer readable instructions stored thereon for implementing the training component and the storage component of claim 1.
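
Claims 1, 5, 17, 19 and 20 together describe training a query-independent classifier on positive and negative items, then ranking and annotating search results by their probability of being a positive case. The following is a minimal sketch of that idea, not the patented implementation. It assumes a hand-written multinomial Naive Bayes (one of the techniques listed in claim 17), a whitespace tokenizer, and a 0.5 annotation threshold; the class name, function names, and sample documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesRelevance:
    """Two-class multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, positives, negatives):
        # Label 1 = positive test cases, label 0 = negative test cases.
        self.counts = {1: Counter(), 0: Counter()}
        for label, docs in ((1, positives), (0, negatives)):
            for doc in docs:
                self.counts[label].update(tokenize(doc))
        self.vocab = set(self.counts[1]) | set(self.counts[0])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        n = len(positives) + len(negatives)
        self.log_prior = {1: math.log(len(positives) / n),
                          0: math.log(len(negatives) / n)}
        return self

    def prob_relevant(self, doc):
        """Return P(positive | doc). No query is consulted, so the model is query-independent."""
        scores = {}
        for c in (0, 1):
            score = self.log_prior[c]
            denom = self.totals[c] + len(self.vocab)
            for tok in tokenize(doc):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.counts[c][tok] + 1) / denom)
            scores[c] = score
        # Normalize the two log scores into a probability.
        m = max(scores.values())
        exp = {c: math.exp(s - m) for c, s in scores.items()}
        return exp[1] / (exp[0] + exp[1])


def rerank(model, results, threshold=0.5):
    """Sort results by P(positive), highest first, and flag those at or above the threshold."""
    scored = [(model.prob_relevant(doc), doc) for doc in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc, p, p >= threshold) for p, doc in scored]


# Hypothetical training data: editor-selected positives, other items as negatives.
positives = ["official download page for product updates",
             "product support home page"]
negatives = ["forum thread complaining about download errors",
             "blog post about unrelated vacation photos"]

model = NaiveBayesRelevance().fit(positives, negatives)
ranked = rerank(model, ["random forum thread about errors",
                        "official product support page"])
```

Each entry of `ranked` is a `(document, probability, flag)` tuple. The flag stands in for the explicit annotation of claim 20, and the sort order corresponds to presenting higher-probability items before lower-probability ones.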

22. A computer-based information retrieval system, comprising:
- means for determining a training set for data terms;
  
  means for automatically classifying the training set;
  
  means for determining new items from the classified training set; and
  
  means for presenting the new items in accordance with an information retrieval request.
- Dependent claims (23)
  - 23. The system of claim 22, further comprising means for testing the classified training set.

24. A computer-implemented method to facilitate automated information retrieval, comprising:
- processing n queries from a data log, n being an integer;
  
  identifying relevant candidates from the n queries; and
  
  training classifiers to identify other relevant candidates for subsequent search activities.
- Dependent claims (25-29)
  - 25. The method of claim 24, further comprising forwarding an analysis to an editor that determines whether or not a piece of information is desirable to be presented for a given query or topic.
  - 26. The method of claim 24, further comprising extracting relevant candidates from a list of potential documents or sites and automatically placing the best bets before other statistical rankings.
  - 27. The method of claim 24, further comprising re-ranking results by a probability that a document is relevant, respective documents are downloaded, and terms are extracted and looked-up for terms appearing in the document.
  - 28. The method of claim 24, further comprising determining at least one category to be classified.
  - 29. The method of claim 28, further comprising employing a subset of a training data set to test the classified categories.
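
The first two steps of claim 24 (processing n queries from a data log and identifying relevant candidates) and the held-out test subset of claim 29 can be sketched as follows. The claims do not say how candidates are identified, so this sketch assumes a simple click-count heuristic over `(query, url)` log entries. The log format, the `min_clicks` cutoff, the every-third-item split, and the `example.com` URLs are all assumptions made for illustration.

```python
from collections import Counter


def candidates_from_log(log, n, min_clicks=2):
    """Process the first n (query, url) entries and keep URLs clicked at least min_clicks times."""
    clicks = Counter(url for _query, url in log[:n])
    return sorted(url for url, count in clicks.items() if count >= min_clicks)


def split_holdout(items, test_every=3):
    """Reserve every test_every-th item for testing; the rest are used for training."""
    train = [x for i, x in enumerate(items) if (i + 1) % test_every != 0]
    test = [x for i, x in enumerate(items) if (i + 1) % test_every == 0]
    return train, test


# Hypothetical query log: each entry pairs a query with the URL the user selected.
log = [
    ("printer driver", "example.com/drivers"),
    ("driver download", "example.com/drivers"),
    ("printer help", "example.com/support"),
    ("printer driver", "example.com/forum/123"),
    ("support", "example.com/support"),
    ("update", "example.com/updates"),
    ("windows update", "example.com/updates"),
]

candidates = candidates_from_log(log, n=7)
train, test = split_holdout(candidates)
```

The `train` list would feed whatever classifier is being trained, while `test` is kept back to check the classified categories. A deterministic split is used here only to keep the example reproducible.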

30. A computer readable medium having a data structure stored thereon, comprising:
- a first data field related to a training data set for a relevance category;
  
  a second data field that relates to a new set of data items pertaining to the relevance category; and
  
  a third data field that relates to a probability ranking for the new set of data items.
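
One possible in-memory shape for the three data fields of claim 30 is sketched below. The claim does not prescribe a layout, so the class name, field names, and the choice of a dictionary for the probability ranking are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RelevanceRecord:
    """Illustrative container for one relevance category."""
    category: str
    # First data field: the training data set for the category.
    training_items: List[str] = field(default_factory=list)
    # Second data field: new data items pertaining to the category.
    new_items: List[str] = field(default_factory=list)
    # Third data field: probability used to rank each new item.
    probabilities: Dict[str, float] = field(default_factory=dict)

    def ranked_new_items(self) -> List[str]:
        """Return new items ordered from most to least probable; unscored items count as 0.0."""
        return sorted(self.new_items,
                      key=lambda item: self.probabilities.get(item, 0.0),
                      reverse=True)


# Hypothetical record for a "best bets" category.
record = RelevanceRecord(
    category="best bets",
    training_items=["seed-1", "seed-2"],
    new_items=["a", "b", "c"],
    probabilities={"a": 0.2, "b": 0.9, "c": 0.5},
)
ordering = record.ranked_new_items()
```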

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chandrasekar, Raman; Chen, Harr; Corston, Simon H.
Primary Examiner(s): Holmes, Michael B.
Application Number: US10/754,159
Publication Number: US 20050154686A1
Time in Patent Office: 1,383 Days
Field of Search: 706/12
US Class Current: 706/12
CPC Class Codes:
- G06F 16/3346 using probabilistic model
- G06F 16/951 Indexing; Web crawling tech...
