System and method for optimized source selection in an information retrieval system
First Claim
1. In a distributed information system including databases as sources of documents for query searching, a method of optimizing the selection of sources for satisfying a query, comprising the steps of:
a) forming a training set of documents by randomly selecting significant portions of the documents from each of the sources;
b) forming a test set of documents by using the set of documents excluded in the training set;
c) defining each document in the training and test set in terms of features/attributes and a name as samples representing individual sources;
d) processing the samples using an algorithm to recognize patterns in the documents which distinguish one source from another source;
e) generating a set of rules from the patterns as a model using the algorithm; and
f) applying to the model a query in terms of desired features/attributes to predict the optimum sources satisfying the query.
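Steps a) through c) can be illustrated with a minimal Python sketch. It is not the patented implementation: the 70% training fraction, the bag-of-words feature representation, and the function names `split_source`, `to_sample`, and `build_samples` are all assumptions made for illustration, since the claim does not fix any of them.

```python
import random

def split_source(documents, train_fraction=0.7, seed=0):
    """Steps a) and b): randomly pick a training portion of one source;
    the documents left out become that source's test set.
    The 0.7 default is an assumed value, not taken from the claim."""
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def to_sample(text, source_name):
    """Step c): describe a document by features plus the name of its source.
    Features here are simply the set of lowercase words (an assumption)."""
    return (frozenset(text.lower().split()), source_name)

def build_samples(sources, train_fraction=0.7, seed=0):
    """Build training and test samples across all sources.
    `sources` maps a source name to a list of document texts."""
    train, test = [], []
    for name, docs in sources.items():
        train_docs, test_docs = split_source(docs, train_fraction, seed)
        train.extend(to_sample(d, name) for d in train_docs)
        test.extend(to_sample(d, name) for d in test_docs)
    return train, test
```

Each sample is a `(features, source_name)` pair, which is the form a rule-learning algorithm in steps d) and e) would consume. The rule-learning step itself is not shown because the claim leaves the algorithm unspecified.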
Abstract
In an information retrieval system, an automated system optimizes selection of sources in a distributed information system for query searching. A training set of documents is created for each source by randomly selecting significant portions of its documents. A test set of documents is created for each source from the documents not included in the training set. Each document in the training and test set is defined in terms of features/attributes and a name as samples representing individual sources. Pattern recognizing means process the samples to recognize patterns in the documents that distinguish one source from another source. Rule generating means provide a set of disjunctive normal form (DNF) rules from the patterns as a model representing each source. The test set of documents is expressed in terms of DNF rules. Evaluating means create a final classification model after minimizing any error between the DNF rules for the training and test sets. Query means enable a user to express a query in terms of features/attributes and DNF rules which, when applied to the final model, automatically select the optimal sources for query searching. The sources may also be expressed in taxonomic groupings, which reduces the number of data sources and speeds a user's query searching on a distributed information network.
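As a rough illustration of how a DNF rule model could be applied to a query and scored against test samples, consider the following Python sketch. The data layout (each source mapped to a list of conjunctions, each conjunction a set of required features), the function names, and the example model are hypothetical; the abstract does not specify how rules are stored, learned, or evaluated.

```python
def matches(dnf, features):
    """A DNF rule is a disjunction of conjunctions: it holds when at
    least one conjunction has all of its features present."""
    return any(conjunction <= features for conjunction in dnf)

def select_sources(model, query_features):
    """Return the names of the sources whose DNF rule the query satisfies."""
    return sorted(name for name, dnf in model.items()
                  if matches(dnf, query_features))

def error_rate(model, samples):
    """Fraction of (features, source_name) samples whose own source is
    not selected by the model. A simple stand-in for the evaluation step."""
    if not samples:
        return 0.0
    misses = sum(1 for features, name in samples
                 if name not in select_sources(model, features))
    return misses / len(samples)

# Hypothetical hand-written model; a real one would be learned from training samples.
model = {
    "patents": [frozenset({"claim", "invention"}), frozenset({"prior", "art"})],
    "medical": [frozenset({"patient", "dosage"})],
}

selected = select_sources(model, {"claim", "invention", "novel"})
```

Here `selected` contains only `"patents"`, because the query satisfies the first conjunction of that source's rule and no conjunction of the `"medical"` rule. Running `error_rate` on held-out test samples gives one possible measure of the mismatch the evaluating means would try to minimize.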
14 Claims
1. In a distributed information system including databases as sources of documents for query searching, a method of optimizing the selection of sources for satisfying a query, comprising the steps of:
a) forming a training set of documents by randomly selecting significant portions of the documents from each of the sources;
b) forming a test set of documents by using the set of documents excluded in the training set;
c) defining each document in the training and test set in terms of features/attributes and a name as samples representing individual sources;
d) processing the samples using an algorithm to recognize patterns in the documents which distinguish one source from another source;
e) generating a set of rules from the patterns as a model using the algorithm; and
f) applying to the model a query in terms of desired features/attributes to predict the optimum sources satisfying the query.
Dependent claims: 2, 3, 4, 5, 6, 7, 13, 14
8. An automated system for optimized selection of sources in a distributed information system for query searching, comprising:
a) means for forming a training set of documents at each source by randomly selecting significant portions of the documents;
b) means for forming a test set of the documents at each source not included in the training set;
c) means for defining each document in the training and test set in terms of features/attributes and a name as samples representing individual sources;
d) means for processing the samples to recognize patterns in the documents to distinguish one source from another source using an algorithm;
e) means for generating a set of rules from the patterns as a model representing each source using the algorithm; and
f) means for applying to the model a query in terms of desired features/attributes to predict the optimum sources satisfying the query.
Dependent claims: 9, 10, 11, 12
Specification