Classifying documents using multiple classifiers
First Claim
1. A computer-implemented method comprising:
- computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold aj, wherein documents having a score less than aj are unlikely to have the property P, and wherein each classifier has a respective upper threshold bj, wherein documents having a score greater than bj are likely to have the property P;
- determining, for each respective classifier, a plurality of intervals between aj and bj for the classifier;
- determining, for each document in the collection of documents, a combination of intervals I11 to IDK according to which interval of the plurality of intervals each respective score for the document belongs;
- determining, for each combination of intervals Ij1 to IjK, whether any documents in the collection of documents have a corresponding combination of intervals I11 to IDK;
- selecting no more than M documents for each combination of intervals Ij1 to IjK for which at least one document in the collection has the corresponding combination of intervals; and
- training a multiple classifier model for the D distinct classifiers using each selected document.
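The selection steps of the claim can be sketched roughly as follows. This is an illustration only, not the patented implementation. Several details are assumptions the claim does not state: the intervals are equal-width, a document with any score outside its classifier's [aj, bj] range is skipped, the at-most-M documents per combination are chosen at random, and the names `interval_index` and `select_training_documents` are hypothetical.

```python
import random
from collections import defaultdict


def interval_index(score, lower, upper, num_intervals):
    """Return the index of the equal-width interval of [lower, upper] that
    contains score, or None if the score lies outside the thresholds.
    Equal-width intervals are an assumption of this sketch."""
    if score < lower or score > upper:
        return None
    width = (upper - lower) / num_intervals
    # Clamp so that score == upper falls in the last interval.
    return min(int((score - lower) / width), num_intervals - 1)


def select_training_documents(scores, thresholds, num_intervals,
                              max_per_combination, seed=0):
    """Group documents by their combination of intervals and keep at most
    max_per_combination (the claim's M) documents per occupied combination.

    scores:     dict mapping doc id -> list of D classifier scores
    thresholds: list of D (a_j, b_j) pairs, one per classifier
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for doc_id, doc_scores in scores.items():
        combo = tuple(
            interval_index(s, a, b, num_intervals)
            for s, (a, b) in zip(doc_scores, thresholds)
        )
        # Assumption: documents with any out-of-range score are not sampled.
        if None in combo:
            continue
        buckets[combo].append(doc_id)

    # Only combinations that at least one document falls into appear here.
    selected = {}
    for combo, doc_ids in buckets.items():
        k = min(max_per_combination, len(doc_ids))
        selected[combo] = rng.sample(doc_ids, k)  # random choice is an assumption
    return selected
```

The selected documents would then be passed to whatever procedure trains the multiple classifier model. Capping each combination at M keeps densely populated regions of the score space from dominating the training set.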
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying resources using scores from multiple classifiers. In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of identifying a collection of documents to classify; receiving a plurality of classifiers for scoring a document with respect to a specified property; and, for each document in the collection, applying each of the plurality of classifiers, each classifier generating a score associated with a likelihood that the document has the specified property, combining the scores from each classifier, including applying a multiple classifier model that uses monotonic regression to combine the plurality of classifiers, and classifying the document as having the specified property based on the combined score.
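The classification flow in the abstract (score each document with every classifier, combine the scores, decide from the combined score) might look like the minimal sketch below. The combining function is a deliberate stand-in: a non-negative weighted average, which shares only the monotonicity property (raising any single score never lowers the result) with the monotonic regression model the abstract names. The function names, the weights, and the decision threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A classifier maps a document's text to a score for the property.
Classifier = Callable[[str], float]


def combine_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Stand-in combiner (NOT the patent's monotonic regression model):
    a weighted average with non-negative weights, which is monotone
    non-decreasing in each individual score."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative to stay monotone")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * s for w, s in zip(weights, scores)) / total


def classify_documents(documents: Dict[str, str],
                       classifiers: List[Classifier],
                       weights: Sequence[float],
                       decision_threshold: float) -> Dict[str, bool]:
    """Apply every classifier to every document, combine the scores, and
    mark the document as having the property when the combined score
    reaches the (hypothetical) decision threshold."""
    results = {}
    for doc_id, text in documents.items():
        scores = [classifier(text) for classifier in classifiers]
        results[doc_id] = combine_scores(scores, weights) >= decision_threshold
    return results
```

Monotonicity is the design constraint worth preserving in any replacement combiner: if one classifier becomes more confident that a document has the property, the combined score should not decrease.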
20 Claims
1. A computer-implemented method, as set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
- computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold aj, wherein documents having a score less than aj are unlikely to have the property P, and wherein each classifier has a respective upper threshold bj, wherein documents having a score greater than bj are likely to have the property P;
- determining, for each respective classifier, a plurality of intervals between aj and bj for the classifier;
- determining, for each document in the collection of documents, a combination of intervals I11 to IDK according to which interval of the plurality of intervals each respective score for the document belongs;
- determining, for each combination of intervals Ij1 to IjK, whether any documents in the collection of documents have a corresponding combination of intervals I11 to IDK;
- selecting no more than M documents for each combination of intervals Ij1 to IjK for which at least one document in the collection has the corresponding combination of intervals; and
- training a multiple classifier model for the D distinct classifiers using each selected document.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18.
19. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold aj, wherein documents having a score less than aj are unlikely to have the property P, and wherein each classifier has a respective upper threshold bj, wherein documents having a score greater than bj are likely to have the property P;
- determining, for each respective classifier, a plurality of intervals between aj and bj for the classifier;
- determining, for each document in the collection of documents, a combination of intervals I11 to IDK according to which interval of the plurality of intervals each respective score for the document belongs;
- determining, for each combination of intervals Ij1 to IjK, whether any documents in the collection of documents have a corresponding combination of intervals I11 to IDK;
- selecting no more than M documents for each combination of intervals Ij1 to IjK for which at least one document in the collection has the corresponding combination of intervals; and
- training a multiple classifier model for the D distinct classifiers using each selected document.
Dependent claim: 20.
Specification