Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems

  • US 7,725,475 B1
  • Filed: 12/21/2004
  • Issued: 05/25/2010
  • Est. Priority Date: 02/11/2004
  • Status: Active Grant
First Claim

1. A method of classifying a document using a duplicate detector and an inductive classifier, the method comprising:

    receiving, at the inductive classifier, a training set of documents of known classification;

    generating, at the inductive classifier, attribute information based on the set of training documents of known classification;

    developing, at the inductive classifier, a classification model based on the attribute information;

    providing the attribute information from the inductive classifier to the duplicate detector, the duplicate detector being configured to determine whether two or more data items are near duplicates;

    generating, at the duplicate detector, a lexicon of attributes based on the attribute information received from the inductive classifier;

    receiving, at the duplicate detector, a set of documents of known classification;

    calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;

    receiving, at the duplicate detector, an unknown document;

    generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes, wherein the generating of the query signature comprises:

        determining unique attributes in the unknown document;

        determining an intersection between the unique attributes in the unknown document and the lexicon; and

        calculating the query signature based on the intersection;

    comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature;

    when the query signature matches a class signature, indicating the unknown document has a class of the document corresponding to the class signature that matches the query signature; and

    when the query signature does not match a class signature:

        providing the unknown document to the inductive classifier; and

        applying, at the inductive classifier, the classification model to the unknown document to determine a class for the unknown document.
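
The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: attributes are taken to be lowercase whitespace-separated words, the "attribute information" is simply every word seen in training, a signature is a SHA-1 hash of the sorted intersection, and a "match" is exact hash equality. The names `tokenize`, `ToyInductiveClassifier`, `signature`, and `HybridClassifier` are hypothetical, and the word-count classifier is only a stand-in for a real inductive classifier.

```python
import hashlib
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical attribute extractor: lowercase whitespace-separated words."""
    return text.lower().split()


class ToyInductiveClassifier:
    """Stand-in inductive classifier that scores classes by word counts."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        # Assumption: the attribute information is every word seen in training.
        self.attributes = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        words = tokenize(doc)
        scores = {label: sum(counts[w] for w in words)
                  for label, counts in self.word_counts.items()}
        return max(scores, key=scores.get)


def signature(doc, lexicon):
    """Hash the intersection of the document's unique attributes and the lexicon."""
    unique_attrs = set(tokenize(doc))
    shared = sorted(unique_attrs & lexicon)
    return hashlib.sha1(" ".join(shared).encode("utf-8")).hexdigest()


class HybridClassifier:
    """Duplicate detector first, inductive classifier as the fallback."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.lexicon = set()
        self.class_signatures = {}

    def build(self, train_docs, train_labels, known_docs, known_labels):
        # The classifier trains, then hands its attributes to the detector.
        self.classifier.fit(train_docs, train_labels)
        self.lexicon = set(self.classifier.attributes)
        # Class signatures are computed from documents of known classification.
        for doc, label in zip(known_docs, known_labels):
            self.class_signatures[signature(doc, self.lexicon)] = label
        return self

    def classify(self, doc):
        query = signature(doc, self.lexicon)
        if query in self.class_signatures:
            return self.class_signatures[query], "duplicate detector"
        return self.classifier.predict(doc), "inductive classifier"


hybrid = HybridClassifier(ToyInductiveClassifier()).build(
    ["cheap pills buy now", "meeting agenda for monday"], ["spam", "ham"],
    ["buy cheap pills now"], ["spam"],
)
print(hybrid.classify("now buy cheap pills today"))      # ('spam', 'duplicate detector')
print(hybrid.classify("agenda for the monday meeting"))  # ('ham', 'inductive classifier')
```

In the first call the word "today" is outside the lexicon, so the query signature equals the stored class signature and the duplicate detector answers directly. In the second call no class signature matches, so the document falls through to the classifier. Because the sketch reuses the classifier's attributes as the lexicon, no separate lexicon-building pass over a corpus is needed, which is the simplification the title refers to.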

  • 7 Assignments