Simplifying Lexicon Creation in Hybrid Duplicate Detection and Inductive Classifier System
Abstract
A classification system includes a signature-based duplicate detector and an inductive classifier that share attribute information. To perform the duplicate detection and the classification, the duplicate detector and inductive classifier are first initialized by generating a lexicon of attributes for the duplicate detector and a classification model for the classifier. To develop a classification model, a training set of documents of known class is used by the classifier to determine the attributes of the documents that are most useful in classifying an unknown document. The model is developed from these attributes. Attribute information containing the attributes determined by the classifier is then passed to the duplicate detector, and the duplicate detector uses the attribute information to generate the lexicon of attributes.
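The flow summarized in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: attributes are taken to be lowercase whitespace-separated words, the attribute information is assumed to be a mapping from attribute to score, and a signature is assumed to be a SHA-1 hash of the sorted intersection between a document's unique attributes and the lexicon. The names `tokenize`, `signature`, `DuplicateDetector`, and `classify` are hypothetical.

```python
import hashlib


def tokenize(text):
    """Assumption: attributes are lowercase whitespace-separated words."""
    return set(text.lower().split())


def signature(text, lexicon):
    """Assumption: hash the sorted intersection of the document's unique
    attributes with the lexicon. Returns None if the intersection is empty."""
    kept = sorted(tokenize(text) & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


class DuplicateDetector:
    def __init__(self, attribute_info, lexicon_size):
        # attribute_info: {attribute: score}, as handed over by the classifier.
        ranked = sorted(attribute_info, key=attribute_info.get, reverse=True)
        self.lexicon = set(ranked[:lexicon_size])
        self.class_signatures = {}  # signature -> class label

    def add_known(self, text, label):
        """Store a class signature for a document of known classification."""
        sig = signature(text, self.lexicon)
        if sig is not None:
            self.class_signatures[sig] = label

    def lookup(self, text):
        """Return the class of a matching signature, or None if no match."""
        sig = signature(text, self.lexicon)
        return self.class_signatures.get(sig) if sig is not None else None


def classify(text, detector, inductive_classifier):
    """Try the duplicate detector first; fall back to the inductive classifier."""
    label = detector.lookup(text)
    if label is not None:
        return label
    return inductive_classifier(text)
```

Here `inductive_classifier` stands in for any callable that applies a trained classification model to a document. The point of the sketch is only the ordering: the detector builds its lexicon from the classifier's attribute information, and the classifier is consulted only when no class signature matches.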
19 Claims
-
1. A method of classifying a document using a duplicate detector and an inductive classifier, the method comprising:
-
generating, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receiving, at the duplicate detector, a set of documents of known classification;
calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receiving, at the duplicate detector, an unknown document;
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes;
comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature;
when the query signature matches a class signature, indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature; and
when the query signature does not match a class signature, providing the unknown document to the inductive classifier.
-
-
2. The method of claim 1 further comprising applying, at the inductive classifier, a classification model to the unknown document to determine a class for the unknown document.
-
3. The method of claim 1 wherein generating, at the duplicate detector, a lexicon of attributes based on the attribute information comprises selecting a specified number of the attributes with the highest mutual information scores.
-
4. The method of claim 1 further comprising generating, at the inductive classifier, attribute information based on a set of training documents of known classification.
-
5. The method of claim 4 wherein the attribute information comprises the attributes and mutual information scores.
-
6. The method of claim 5 wherein the attribute information comprises the portion of the attributes and the mutual information scores corresponding to the portion of the attributes.
-
7. The method of claim 4 wherein generating, at the inductive classifier, attribute information based on the set of training documents of known classification comprises:
-
receiving the training set of documents of known classification;
analyzing the set of training documents to determine attributes in the set of training documents; and
calculating mutual information scores for the attributes in the set of training documents.
-
-
8. The method of claim 4 wherein generating, at the inductive classifier, attribute information based on the set of training documents of known classification further comprises selecting a portion of the attributes based on mutual information scores.
-
9. The method of claim 4 wherein generating, at the inductive classifier, attribute information based on the set of training documents of known classification comprises:
-
selecting a specified number of the attributes with the highest mutual information scores; and
creating attribute clusters from the selected attributes.
-
-
10. The method of claim 9 wherein the attribute information comprises the attribute clusters.
-
11. The method of claim 1 wherein:
-
generating, at the duplicate detector, a lexicon of attributes based on the attribute information comprises:
  generating a primary lexicon and a secondary lexicon based on the attribute information; and
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes comprises:
  determining unique attributes in the unknown document;
  determining an intersection between the unique attributes in the unknown document and the primary lexicon;
  determining whether the intersection exceeds a threshold;
  when the intersection does not exceed the threshold, adding attributes from the secondary lexicon that intersect with the unique attributes in the unknown document to the intersection to create an augmented intersection that exceeds the threshold; and
  calculating a signature for the document based on the augmented intersection.
-
-
12. The method of claim 11 wherein:
-
the attribute information comprises attributes in a set of training documents and mutual information scores for the attributes in the set of training documents, and
generating a primary lexicon and a secondary lexicon based on the attribute information comprises:
  designating a specified number of the attributes in the set of training documents with the highest mutual information scores as the primary lexicon; and
  designating at least a portion of the attributes other than the specified number of the attributes with the highest mutual information scores as the secondary lexicon.
-
-
13. The method of claim 1 wherein the set of documents of known classification comprises a set of spam e-mails such that receiving, at the duplicate detector, a set of documents of known classification comprises receiving, at the duplicate detector, a set of spam e-mails.
-
14. The method of claim 13 wherein calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes comprises calculating spam signatures based on the spam e-mails and the lexicon of attributes.
-
15. The method of claim 14 wherein, when the query signature matches a class signature, indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature comprises indicating the unknown document is spam when the query signature matches a spam signature.
-
16. The method of claim 15 wherein calculating spam signatures based on the spam e-mails and the lexicon of attributes comprises:
-
selecting a spam e-mail from the set of spam e-mails;
determining unique attributes in the selected spam e-mail;
determining an intersection between the unique attributes in the selected spam e-mail and the lexicon; and
calculating a spam signature based on the intersection.
-
-
17. The method of claim 16 wherein the unknown document comprises an unknown e-mail such that receiving, at the duplicate detector, an unknown document comprises receiving, at the duplicate detector, an unknown e-mail.
-
18. A computer-usable storage medium storing a computer program used in classifying a document using a duplicate detector and an inductive classifier, the computer program comprising instructions for causing a computer to perform the following operations:
-
generate, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receive, at the duplicate detector, a set of documents of known classification;
calculate, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receive, at the duplicate detector, an unknown document;
generate, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes;
compare, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature;
when the query signature matches a class signature, indicate the unknown document has a class of the document corresponding to the class signature that matches the query signature; and
when the query signature does not match a class signature, provide the unknown document to the inductive classifier.
-
-
19. An apparatus used in classifying a document, the apparatus comprising:
one or more processing devices and a computer-readable medium coupled to the one or more processing devices, the medium storing instructions which, when executed by the one or more processing devices, cause the one or more computers to perform operations comprising:
generating, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;
receiving, at the duplicate detector, a set of documents of known classification;
calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receiving, at the duplicate detector, an unknown document;
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes;
comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature;
when the query signature matches a class signature, indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature; and
when the query signature does not match a class signature, providing the unknown document to the inductive classifier.
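The mutual information scoring of claims 3 and 7 and the primary and secondary lexicons of claims 11 and 12 can be illustrated with a short Python sketch. Several points are assumptions rather than anything the claims state: each attribute is treated as a binary present/absent feature, the score is the standard mutual information between that feature and the class label (in bits), "the intersection exceeds a threshold" is read as the number of intersecting attributes being greater than the threshold, and secondary attributes are added in rank order until the threshold is exceeded. The function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs):
    """docs: list of (set_of_attributes, label) pairs.
    Returns {attribute: mutual information with the class label, in bits}."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    present = Counter()  # (attribute, label) -> number of docs containing it
    for attrs, label in docs:
        for a in attrs:
            present[(a, label)] += 1
    vocab = {a for a, _ in present}
    scores = {}
    for a in vocab:
        n_a = sum(present[(a, c)] for c in class_counts)
        mi = 0.0
        for c, n_c in class_counts.items():
            # Two cells per class: attribute present, attribute absent.
            for joint, marginal in ((present[(a, c)], n_a),
                                    (n_c - present[(a, c)], n - n_a)):
                if joint:
                    mi += (joint / n) * math.log2(joint * n / (marginal * n_c))
        scores[a] = mi
    return scores


def split_lexicon(scores, primary_size, secondary_size):
    """Top-scoring attributes form the primary lexicon; the next-best
    attributes form the secondary lexicon, kept as a list in rank order."""
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    primary = set(ranked[:primary_size])
    secondary = ranked[primary_size:primary_size + secondary_size]
    return primary, secondary


def query_attributes(unique_attrs, primary, secondary, threshold):
    """Intersect the document's unique attributes with the primary lexicon and,
    if the result is too small, augment it from the secondary lexicon."""
    kept = set(unique_attrs) & primary
    if len(kept) > threshold:
        return kept
    for a in secondary:
        if a in unique_attrs:
            kept.add(a)
            if len(kept) > threshold:
                break
    return kept
```

The set returned by `query_attributes` is what a signature would then be calculated from. Note that in this sketch the augmented intersection can still fall short of the threshold when the document shares too few attributes with the secondary lexicon; how that case is handled is not addressed by the claims quoted here.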
Specification