Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems
First Claim
1. A method of classifying a document using a duplicate detector and an inductive classifier, the method comprising:
receiving, at the inductive classifier, a training set of documents of known classification;
generating, at the inductive classifier, attribute information based on the set of training documents of known classification;
developing, at the inductive classifier, a classification model based on the attribute information;
providing the attribute information from the inductive classifier to the duplicate detector, the duplicate detector being configured to determine whether two or more data items are near duplicates;
generating, at the duplicate detector, a lexicon of attributes based on the attribute information received from the inductive classifier;
receiving, at the duplicate detector, a set of documents of known classification;
calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;
receiving, at the duplicate detector, an unknown document;
generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes, wherein the generating of the query signature comprises:
determining unique attributes in the unknown document;
determining an intersection between the unique attributes in the unknown document and the lexicon; and
calculating the query signature based on the intersection;
comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature;
when the query signature matches a class signature, indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature; and
when the query signature does not match a class signature:
providing the unknown document to the inductive classifier; and
applying, at the inductive classifier, the classification model to the unknown document to determine a class for the unknown document.
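The query-signature steps and the fallback to the inductive classifier recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that an attribute is a lowercased whitespace-delimited token, that a signature is a SHA-1 hash of the sorted intersection, and that the inductive classifier is any callable returning a class label. The function names (`unique_attributes`, `signature`, `build_class_signatures`, `classify`) are hypothetical and chosen for illustration only.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


def unique_attributes(text: str) -> Set[str]:
    # Assumption: an attribute is a lowercased whitespace-delimited token.
    return set(text.lower().split())


def signature(text: str, lexicon: Set[str]) -> Optional[str]:
    # Intersect the document's unique attributes with the lexicon, then
    # hash the sorted intersection so that token order does not matter.
    common = unique_attributes(text) & lexicon
    if not common:
        return None  # no lexicon attributes present, so no signature
    joined = " ".join(sorted(common))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def build_class_signatures(
    labeled_docs: Iterable[Tuple[str, str]], lexicon: Set[str]
) -> Dict[str, str]:
    # Map each known document's signature to that document's class.
    table: Dict[str, str] = {}
    for text, label in labeled_docs:
        sig = signature(text, lexicon)
        if sig is not None:
            table[sig] = label
    return table


def classify(
    text: str,
    lexicon: Set[str],
    class_signatures: Dict[str, str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    # Try the duplicate detector first; on a miss, hand the document
    # to the inductive classifier (here, any callable).
    sig = signature(text, lexicon)
    if sig is not None and sig in class_signatures:
        return class_signatures[sig], "duplicate_detector"
    return fallback(text), "inductive_classifier"


lexicon = {"free", "pills", "meeting", "agenda"}
class_signatures = build_class_signatures([("Free pills now", "spam")], lexicon)
stub_classifier = lambda text: "ham"  # stand-in for a trained model

hit = classify("now FREE pills for you", lexicon, class_signatures, stub_classifier)
miss = classify("meeting agenda attached", lexicon, class_signatures, stub_classifier)
```

In this sketch, `hit` is resolved by the duplicate detector because both documents reduce to the same lexicon intersection, while `miss` falls through to the stand-in classifier. Hashing only the intersection is what lets documents that differ in out-of-lexicon tokens share a signature.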
Abstract
A classification system includes a signature-based duplicate detector and an inductive classifier that share attribute information. To perform the duplicate detection and the classification, the duplicate detector and inductive classifier are first initialized by generating a lexicon of attributes for the duplicate detector and a classification model for the classifier. To develop a classification model, a training set of documents of known class is used by the classifier to determine the attributes of the documents that are most useful in classifying an unknown document. The model is developed from these attributes. Attribute information containing the attributes determined by the classifier is then passed to the duplicate detector, and the duplicate detector uses the attribute information to generate the lexicon of attributes.
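The initialization described in the abstract, in which the classifier selects attributes and the duplicate detector builds its lexicon from them, might look like the following sketch. The abstract does not name the selection criterion. As a stand-in, this example scores each token by the spread of its per-class document frequency, which is an assumption made purely for illustration. `select_attributes` and `build_lexicon` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def select_attributes(
    training: Iterable[Tuple[str, str]], top_k: int
) -> List[str]:
    # Classifier side: count, per class, how many documents contain each token.
    doc_freq: Dict[str, Counter] = {}
    class_size: Counter = Counter()
    for text, label in training:
        class_size[label] += 1
        doc_freq.setdefault(label, Counter()).update(set(text.lower().split()))

    vocab: Set[str] = set().union(*doc_freq.values())

    def score(term: str) -> float:
        # Stand-in usefulness score (an assumption): the gap between the
        # highest and lowest per-class document-frequency rates.
        rates = [doc_freq[label][term] / class_size[label] for label in doc_freq]
        return max(rates) - min(rates)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(vocab, key=lambda term: (-score(term), term))
    return ranked[:top_k]


def build_lexicon(attribute_info: Iterable[str]) -> Set[str]:
    # Duplicate-detector side: the lexicon is the set of attributes
    # received from the classifier.
    return set(attribute_info)


training = [
    ("free pills now", "spam"),
    ("free pills today", "spam"),
    ("meeting agenda now", "ham"),
    ("meeting notes today", "ham"),
]
attribute_info = select_attributes(training, top_k=3)
lexicon = build_lexicon(attribute_info)
```

Because the lexicon is derived from attributes the classifier already computed, the duplicate detector needs no separate lexicon-building pass over a corpus in this sketch.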
69 Citations
15 Claims
Claim 1 is reproduced in full under First Claim above. Claims 2 through 15 are dependent on claim 1.
Specification