Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems
First Claim
1. A method of classifying a document using a duplicate detector and an inductive classifier, the method comprising:
generating, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receiving, at the duplicate detector, a set of documents of known classification;
calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receiving, at the duplicate detector, an unknown document;
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes, the query signature being based at least in part on a correspondence between a unique attribute of the unknown document and the lexicon of attributes;
comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature; and
determining, based on the query signature not matching a class signature, and using the inductive classifier, a classification from among a plurality of possible classifications for the unknown document, comprising:
determining a plurality of attributes that discriminate between the plurality of possible classifications;
comparing the determined plurality of attributes to the unknown document to generate a probabilistic classification for the unknown document;
determining that the probabilistic classification has probability exceeding a threshold; and
selecting the probabilistic classification.
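The flow recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenizer, the SHA-1 hash over the sorted lexicon tokens, the function names, the `toy_classifier` stand-in for the inductive classifier, and the choice to return `None` when the probability does not exceed the threshold are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


def signature(text: str, lexicon: Set[str]) -> Optional[str]:
    """Hash the document's unique tokens that also appear in the lexicon."""
    kept = sorted(set(text.lower().split()) & lexicon)
    if not kept:
        return None  # assumption: no lexicon overlap means no usable signature
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def build_class_signatures(
    labeled_docs: Iterable[Tuple[str, str]], lexicon: Set[str]
) -> Dict[str, str]:
    """Map the signature of each document of known classification to its class."""
    table: Dict[str, str] = {}
    for text, label in labeled_docs:
        sig = signature(text, lexicon)
        if sig is not None:
            table[sig] = label
    return table


def classify(
    text: str,
    lexicon: Set[str],
    class_signatures: Dict[str, str],
    inductive_classifier: Callable[[str], Tuple[str, float]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Try the duplicate detector first; fall back to the inductive classifier."""
    sig = signature(text, lexicon)
    if sig is not None and sig in class_signatures:
        return class_signatures[sig]  # query signature matches a class signature
    label, probability = inductive_classifier(text)
    # Assumption: when the threshold is not exceeded, no class is selected.
    return label if probability > threshold else None


def toy_classifier(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in that returns (class, probability)."""
    spam_words = {"free", "winner", "prize"}
    tokens = text.lower().split()
    p_spam = sum(t in spam_words for t in tokens) / len(tokens) if tokens else 0.0
    return ("spam", p_spam) if p_spam >= 0.5 else ("legitimate", 1.0 - p_spam)
```

Because the signature depends only on the lexicon tokens a document contains, two documents that differ only in words outside the lexicon produce the same signature, which is how this sketch approximates near-duplicate matching.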
Abstract
A classification system includes a signature-based duplicate detector and an inductive classifier that share attribute information. Before duplicate detection and classification can be performed, both components are initialized: a lexicon of attributes is generated for the duplicate detector, and a classification model is generated for the classifier. To develop the classification model, the classifier uses a training set of documents of known class to determine which document attributes are most useful in classifying an unknown document, and the model is developed from these attributes. Attribute information containing the attributes determined by the classifier is then passed to the duplicate detector, which uses it to generate the lexicon of attributes.
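The initialization step described in the abstract can be sketched as follows. The abstract does not say how the classifier ranks attributes, so the scoring used here (the spread in per-class document frequency), the whitespace tokenizer, and the names `select_attributes` and `generate_lexicon` are assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def select_attributes(training_set: Iterable[Tuple[str, str]], top_k: int) -> List[str]:
    """Classifier side: rank tokens by how unevenly they occur across classes."""
    docs_per_class: Counter = Counter()
    doc_freq: Dict[str, Counter] = defaultdict(Counter)  # token -> class -> doc count
    for text, label in training_set:
        docs_per_class[label] += 1
        for token in set(text.lower().split()):
            doc_freq[token][label] += 1

    def spread(token: str) -> float:
        # Fraction of each class's documents that contain the token.
        rates = [doc_freq[token][c] / docs_per_class[c] for c in docs_per_class]
        return max(rates) - min(rates)

    # Highest spread first; ties broken alphabetically for a stable result.
    ranked = sorted(doc_freq, key=lambda t: (-spread(t), t))
    return ranked[:top_k]


def generate_lexicon(attribute_info: Iterable[str]) -> Set[str]:
    """Duplicate-detector side: adopt the classifier's attributes as the lexicon."""
    return set(attribute_info)
```

In this sketch the duplicate detector does no attribute selection of its own. It reuses whatever the classifier found discriminative, which is the simplification of lexicon creation that the title refers to.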
23 Claims
20. A computer-usable storage medium storing a computer program for classifying a document, the computer program comprising instructions for causing a computer to perform the following operations:
generate a lexicon of attributes based on attribute information, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receive a set of documents of known classification;
calculate class signatures based on the set of documents of known classification and the lexicon of attributes;
receive an unknown document;
generate a query signature based on the unknown document and the lexicon of attributes, the query signature being based at least in part on a correspondence between a unique attribute of the unknown document and the lexicon of attributes;
compare the query signature to the class signatures to determine whether the query signature matches a class signature; and
determine, based on the query signature not matching a class signature, and using the inductive classifier, a classification from among a plurality of possible classifications for the unknown document, wherein the determination is performed using instructions for causing the computer to perform the following operations:
determine a plurality of attributes that discriminate between the plurality of possible classifications;
compare the determined plurality of attributes to the unknown document to generate a probabilistic classification for the unknown document;
determine that the probabilistic classification has probability exceeding a threshold; and
select the probabilistic classification.
21. The computer-usable storage medium of claim 20, further comprising instructions for indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature.
22. An apparatus for classifying a document, the apparatus comprising:
one or more processing devices and a computer-readable medium coupled to the one or more processing devices, the medium storing instructions which, when executed by the one or more processing devices, cause the one or more computers to perform operations comprising:
generating, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receiving, at the duplicate detector, a set of documents of known classification;
calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receiving, at the duplicate detector, an unknown document;
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes, the query signature being based at least in part on a correspondence between a unique attribute of the unknown document and the lexicon of attributes;
comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature; and
determining, based on the query signature not matching a class signature, and using the inductive classifier, a classification from among a plurality of possible classifications for the unknown document, comprising:
determining a plurality of attributes that discriminate between the plurality of possible classifications;
comparing the determined plurality of attributes to the unknown document to generate a probabilistic classification for the unknown document;
determining that the probabilistic classification has probability exceeding a threshold; and
selecting the probabilistic classification.
Specification