Content and quality assessment method and apparatus for quality searching
First Claim
1. A computer-based process utilizing a specifically programmed computer for filtering information according to content quality that is organized in documents comprising the steps of:
A) obtaining a selected set of documents;
B) labeling each document of the selected set based on content quality with the documents labeled as positive if they match content and quality criteria and labeled as negative if they do not match content and quality criteria;
C) extracting and representing features from each document in the selected set;
D) modifying the extracted represented features;
E) constructing models using pattern recognition algorithms to assign a label to a document;
F) constructing models for labeling documents based on content quality using pattern recognition algorithms stored in the storage device consisting of the following steps;
1) dividing the first set of documents into N subsets such that the union of all the subsets is the first set of documents;
2) choosing at least one pattern recognition algorithm;
a) instantiating a set of parameters for the pattern recognition algorithm;
i) processing each of the N subsets comprising the following steps;
a′) defining a first subset and defining a second subset mutually exclusive of the first subset;
b′) training the pattern recognition algorithm to build a model using the first subset and the parameter set;
c′) applying the model to the second subset of documents to obtain labels and scores for each document;
d′) evaluating the labels and scores;
e′) storing the evaluation measure, set of parameters, and current pattern recognition algorithm; and
b) repeating step 2a) until all appropriate sets of parameters for the pattern recognition algorithm have been applied; and
3) repeating step 2) until all pattern recognition algorithms have been applied;
4) aggregating the evaluation measures for the N subsets, the pattern recognition algorithms with the set of parameters;
5) selecting the parameter set and pattern recognition algorithm with an aggregate evaluation measure that meets a selection criteria;
6) applying the parameter set and the pattern recognition algorithm identified from step 5) to the first set of documents to build a final model; and
G) constructing a final model with the appropriate pattern recognition model and associated parameter to assign a label and/or a score to a set of previously unlabeled documents; and
H) displaying the label and/or score of at least one of the previously unlabeled documents.
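Steps F1) through F6) describe what is commonly called N-fold cross-validation combined with a search over algorithms and parameter sets. The following Python sketch illustrates that loop under stated assumptions: documents are already reduced to numeric feature vectors, accuracy stands in for the unspecified evaluation measure, and "highest mean accuracy" stands in for the unspecified selection criterion. All names (`split_folds`, `train_centroid`, `train_threshold`, `select_model`, `DOCS`, `ALGORITHMS`) and both toy learners are hypothetical and are not taken from the patent.

```python
from statistics import mean

def split_folds(docs, n):
    """Step F1: divide the documents into n subsets whose union is the full set."""
    return [docs[i::n] for i in range(n)]

def train_centroid(train, params):
    """Toy learner (assumption): nearest class centroid, shifted by params['bias']."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    c_pos = [mean(col) for col in zip(*pos)]
    c_neg = [mean(col) for col in zip(*neg)]
    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return d_neg - d_pos + params["bias"]
    return score

def train_threshold(train, params):
    """Toy learner (assumption): cut a single feature midway between class means."""
    f = params["feature"]
    pos = mean(x[f] for x, y in train if y == 1)
    neg = mean(x[f] for x, y in train if y == 0)
    cut = (pos + neg) / 2
    sign = 1.0 if pos >= neg else -1.0
    return lambda x: sign * (x[f] - cut)

def accuracy(model, docs):
    """Steps c') and d'): label each held-out document (score > 0) and evaluate."""
    hits = sum((model(x) > 0) == (y == 1) for x, y in docs)
    return hits / len(docs)

def select_model(docs, algorithms, n_folds):
    folds = split_folds(docs, n_folds)
    results = []                                        # step e'): stored evaluations
    for name, (train_fn, grid) in algorithms.items():   # steps 2) and 3)
        for params in grid:                             # steps 2a) and 2b)
            measures = []
            for i, held_out in enumerate(folds):        # step i)
                train = [d for j, f in enumerate(folds) if j != i for d in f]
                model = train_fn(train, params)         # step b')
                measures.append(accuracy(model, held_out))
            results.append((mean(measures), name, params))  # step 4)
    best_measure, name, params = max(results, key=lambda r: r[0])  # step 5)
    final_model = algorithms[name][0](docs, params)     # step 6): retrain on all docs
    return final_model, name, params, best_measure

# Toy labeled set: (feature vector, label), 1 = positive, 0 = negative.
DOCS = [([0.9, 0.5], 1), ([0.1, 0.4], 0), ([0.8, 0.1], 1),
        ([0.2, 0.9], 0), ([0.7, 0.6], 1), ([0.3, 0.3], 0)]
ALGORITHMS = {
    "centroid": (train_centroid, [{"bias": 0.0}, {"bias": 0.5}]),
    "threshold": (train_threshold, [{"feature": 0}, {"feature": 1}]),
}

final_model, name, params, measure = select_model(DOCS, ALGORITHMS, n_folds=3)
print(name, params, measure)
print("label for new document:", "positive" if final_model([0.85, 0.5]) > 0 else "negative")
```

The final model is retrained on the whole labeled set only after selection, which mirrors the separation between steps 5) and 6).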
Abstract
A computer-based process retrieves information organized in documents containing text and/or coded representations of text. The process involves obtaining and labeling a selected set of documents based on content quality, and extracting and representing features from each document in the selected set. The extracted and selected features are modified, and models are constructed using parametric learning algorithms. The constructed models are capable of assigning a label to each document. The model parameters are instantiated using a first subset of the selected set of documents. Parameters are chosen by validating the corresponding model against at least a second subset of the full document set. The constructed models also are capable of assigning labels to similar documents outside a selected subset not previously given to the process of model construction.
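The abstract does not say how features are represented or modified. As one possible reading, the sketch below represents each document as term counts and then modifies those counts by inverse-document-frequency weighting, dropping terms that occur in every document. The bag-of-words representation, the weighting scheme, the function names `extract_features` and `modify_features`, and the two sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def extract_features(text):
    """Represent a document as lowercase term counts (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def modify_features(count_vectors):
    """Reweight counts by log(N / document frequency); drop terms found in every document."""
    n = len(count_vectors)
    # Iterating a Counter yields each term once, so this counts documents per term.
    doc_freq = Counter(term for vec in count_vectors for term in vec)
    return [
        {t: c * math.log(n / doc_freq[t]) for t, c in vec.items() if doc_freq[t] < n}
        for vec in count_vectors
    ]

# Hypothetical labeled documents: (text, label).
labeled = [
    ("Randomized controlled trial of treatment", "positive"),
    ("Opinion letter about treatment", "negative"),
]
vectors = modify_features([extract_features(text) for text, _ in labeled])
for (_, label), vec in zip(labeled, vectors):
    print(label, sorted(vec))
```

Terms shared by every document carry no information for separating positive from negative labels, which is why this sketch discards them.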
27 Citations
42 Claims
1. A computer-based process utilizing a specifically programmed computer for filtering information according to content quality that is organized in documents comprising the steps of:
A) obtaining a selected set of documents;
B) labeling each document of the selected set based on content quality with the documents labeled as positive if they match content and quality criteria and labeled as negative if they do not match content and quality criteria;
C) extracting and representing features from each document in the selected set;
D) modifying the extracted represented features;
E) constructing models using pattern recognition algorithms to assign a label to a document;
F) constructing models for labeling documents based on content quality using pattern recognition algorithms stored in the storage device consisting of the following steps;
1) dividing the first set of documents into N subsets such that the union of all the subsets is the first set of documents;
2) choosing at least one pattern recognition algorithm;
a) instantiating a set of parameters for the pattern recognition algorithm;
i) processing each of the N subsets comprising the following steps;
a′) defining a first subset and defining a second subset mutually exclusive of the first subset;
b′) training the pattern recognition algorithm to build a model using the first subset and the parameter set;
c′) applying the model to the second subset of documents to obtain labels and scores for each document;
d′) evaluating the labels and scores;
e′) storing the evaluation measure, set of parameters, and current pattern recognition algorithm; and
b) repeating step 2a) until all appropriate sets of parameters for the pattern recognition algorithm have been applied; and
3) repeating step 2) until all pattern recognition algorithms have been applied;
4) aggregating the evaluation measures for the N subsets, the pattern recognition algorithms with the set of parameters;
5) selecting the parameter set and pattern recognition algorithm with an aggregate evaluation measure that meets a selection criteria;
6) applying the parameter set and the pattern recognition algorithm identified from step 5) to the first set of documents to build a final model; and
G) constructing a final model with the appropriate pattern recognition model and associated parameter to assign a label and/or a score to a set of previously unlabeled documents; and
H) displaying the label and/or score of at least one of the previously unlabeled documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system using a model to filter documents according to quality comprising:
A) a storage device;
B) at least one processor to implement the following steps;
C) obtaining from the storage device a first set of documents labeled according to content quality;
D) extracting and representing features from the first set of documents;
E) modifying the extracted represented features;
F) constructing models for labeling documents based on content quality using pattern recognitional algorithms stored in the storage device consisting of the following steps;
1) dividing the first set of documents into N subsets such that the union of all the subsets is the first set of documents;
2) choosing at least one pattern recognition algorithm;
a) instantiating a set of parameters for the pattern recognition algorithm;
i) processing each of the N subsets comprising the following steps;
a′) defining a first subset and defining a second subset mutually exclusive of the first subset;
b′) training the pattern recognition algorithm to build a model using the first subset and the parameter set;
c′) applying the model to the second subset of documents to obtain labels and scores for each document;
d′) evaluating the labels and scores;
e′) storing the evaluation measure, set of parameters, and current pattern recognition algorithm; and
b) repeating step 2a) until all appropriate sets of parameters for the pattern recognition algorithm have been applied; and
3) repeating step 2) until all pattern recognition algorithms have been applied;
4) aggregating the evaluation measures for the N subsets, the pattern recognition algorithms with the set of parameters;
5) selecting the parameter set and pattern recognition algorithm with an aggregate evaluation measure that meets a selection criteria;
6) applying the parameter set and the pattern recognition algorithm identified from step 5) to the first set of documents to build a final model; and
G) obtaining a second set of non-labeled documents related to the first set of documents;
H) using the final model to label and score the second set of documents according to content quality; and
I) displaying the label and/or score of at least one of the previously unlabeled documents.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
30. A system to estimate performance of a model to filter documents according to quality comprising:
A) a storage device;
B) at least one processor to implement the following steps;
C) obtaining from the storage device a set of documents labeled according to content quality;
D) extracting and representing features from the set of documents;
E) modifying the extracted represented features;
F) constructing models to estimate performance for labeling documents based on content quality using pattern recognition algorithms stored in the storage device consisting of the following steps;
1) dividing the set of documents into N subsets such that the union of all the subsets is the set of documents;
2) processing each of the N subsets comprising the following steps;
a) defining a first set of documents, a second set of documents, and a third set of documents in the subset;
b) choosing at least one pattern recognition algorithm;
i) instantiating a set of parameters for the pattern recognition algorithm;
a′) training the pattern recognition algorithm to build a model using the first set of documents and the parameter set;
b′) applying the model to the second set of documents to obtain labels and scores for each document;
c′) evaluating the labels and scores;
d′) storing the evaluation measure, set of parameters, and current pattern recognition algorithm; and
e′) repeating step i) until all appropriate sets of parameters for the pattern recognition algorithm have been applied; and
c) repeating step 2b) until all pattern recognition algorithm have been applied;
d) aggregating the evaluation measures, the pattern recognition algorithms with the set of parameters;
e) selecting the parameter set and pattern recognition algorithm with an aggregate evaluation measure that meets a selection criteria;
f) applying the parameter set and the pattern recognition algorithm identified from step e) to the union of the first and second subset to form an intermediate final model;
g) applying the intermediate final model to the third set of documents to obtain labels and scores for each document;
h) evaluating the labels and scores;
i) storing the intermediate evaluation measure;
j) repeating step 2 until all subsets have been applied; and
3) aggregating the intermediate evaluation measures to form an aggregate evaluation measure; and
4) displaying the aggregate evaluation measure as the estimated performance.
Dependent claims: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42
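Claim 30 reads like what is usually called nested cross-validation: parameters are chosen on a first (training) and second (validation) set, and the chosen configuration is then scored on a third set that played no part in the choice. The Python sketch below follows that reading under several assumptions: each document is reduced to a single numeric feature, accuracy is the evaluation measure, the mean is the aggregate, and only one toy algorithm is searched. The names `train`, `accuracy`, `estimate_performance`, `DOCS`, and `GRID` are hypothetical and do not come from the patent.

```python
from statistics import mean

def folds(items, n):
    """Step 1): divide the documents into n subsets whose union is the full set."""
    return [items[i::n] for i in range(n)]

def train(train_docs, params):
    """Toy learner (assumption): cut between class means, moved by params['shift']."""
    pos = mean(x for x, y in train_docs if y == 1)
    neg = mean(x for x, y in train_docs if y == 0)
    cut = (pos + neg) / 2 + params["shift"]
    return lambda x: x - cut          # score > 0 means "positive"

def accuracy(model, docs):
    """Fraction of documents whose predicted label matches the given label."""
    return mean(1.0 if (model(x) > 0) == (y == 1) else 0.0 for x, y in docs)

def estimate_performance(docs, grid, n_outer):
    outer = folds(docs, n_outer)
    intermediate = []
    for i, third in enumerate(outer):                      # step 2), once per subset
        rest = [d for j, f in enumerate(outer) if j != i for d in f]
        first, second = rest[0::2], rest[1::2]             # step a): train / validate
        # Steps b) to e): pick the parameter set that scores best on the second set.
        best = max(grid, key=lambda p: accuracy(train(first, p), second))
        interim_model = train(first + second, best)        # step f)
        intermediate.append(accuracy(interim_model, third))  # steps g) to i)
    return mean(intermediate)                              # step 3)

# Toy labeled set: (single feature value, label), 1 = positive, 0 = negative.
DOCS = [(0.9, 1), (0.1, 0), (0.8, 1), (0.2, 0),
        (0.3, 0), (0.7, 1), (0.25, 0), (0.75, 1)]
GRID = [{"shift": -0.3}, {"shift": 0.0}, {"shift": 0.3}]

estimate = estimate_performance(DOCS, GRID, n_outer=4)
print("estimated performance:", estimate)                  # step 4)
```

Because the third set is never consulted while parameters are chosen, the aggregate tends to be a less optimistic estimate than the best inner validation score.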
Specification