Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems

  • US 8,713,014 B1
  • Filed: 05/14/2010
  • Issued: 04/29/2014
  • Est. Priority Date: 02/11/2004
  • Status: Active Grant
First Claim

1. A method of classifying a document using a duplicate detector and an inductive classifier, the method comprising:

  • generating, at the duplicate detector, a lexicon of attributes based on attribute information received from the inductive classifier, wherein the duplicate detector is configured to determine whether two or more data items are near duplicates of each other;

    receiving, at the duplicate detector, a set of documents of known classification;

    calculating, at the duplicate detector, class signatures based on the set of documents of known classification and the lexicon of attributes;

    receiving, at the duplicate detector, an unknown document;

    generating, at the duplicate detector, a query signature based on the unknown document and the lexicon of attributes, the query signature being based at least in part on a correspondence between a unique attribute of the unknown document and the lexicon of attributes;

    comparing, at the duplicate detector, the query signature to the class signatures to determine whether the query signature matches a class signature; and

    determining, based on the query signature not matching a class signature, and using the inductive classifier, a classification from among a plurality of possible classifications for the unknown document, comprising:

        determining a plurality of attributes that discriminate between the plurality of possible classifications;

        comparing the determined plurality of attributes to the unknown document to generate a probabilistic classification for the unknown document;

        determining that the probabilistic classification has probability exceeding a threshold; and

        selecting the probabilistic classification.
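
The flow recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation; it only mirrors the order of the claimed steps under several assumptions that do not come from the claim text: attributes are lowercase whitespace-separated tokens, a signature is a SHA-1 hash over the sorted unique lexicon tokens of a document, the inductive classifier is a small multinomial naive Bayes model with add-one smoothing, and the threshold is 0.8. All names (`NaiveBayes`, `signature`, `classify`, the sample data) are hypothetical.

```python
import hashlib
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: attributes are lowercase whitespace-separated tokens.
    return text.lower().split()


class NaiveBayes:
    """Stand-in inductive classifier (multinomial naive Bayes, add-one smoothing)."""

    def __init__(self, labeled_docs):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def discriminating_attributes(self):
        # Assumption: a token discriminates if it is absent from at least one class.
        n = len(self.word_counts)
        return {w for w in self.vocab
                if sum(w in c for c in self.word_counts.values()) < n}

    def predict_proba(self, text):
        attrs = self.discriminating_attributes()
        tokens = [t for t in tokenize(text) if t in attrs]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        # Normalize log scores into probabilities.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


def signature(text, lexicon):
    # Hash the sorted unique tokens that also appear in the lexicon.
    unique = sorted(set(tokenize(text)) & lexicon)
    if not unique:
        return None  # no lexicon attributes, so no usable signature
    return hashlib.sha1(" ".join(unique).encode("utf-8")).hexdigest()


def classify(text, lexicon, class_sigs, clf, threshold=0.8):
    # Duplicate-detector path: compare the query signature to class signatures.
    sig = signature(text, lexicon)
    if sig is not None and sig in class_sigs:
        return class_sigs[sig], "duplicate"
    # Fallback path: probabilistic classification, accepted only above the threshold.
    probs = clf.predict_proba(text)
    label, p = max(probs.items(), key=lambda kv: kv[1])
    if p > threshold:
        return label, "inductive"
    return None, "unclassified"


# Hypothetical documents of known classification.
training = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("buy cheap watches now", "spam"),
    ("lunch on monday", "ham"),
]
clf = NaiveBayes(training)
lexicon = clf.discriminating_attributes()  # lexicon built from classifier attribute info
# If two classes shared a signature, the later document would win in this sketch.
class_sigs = {signature(t, lexicon): label for t, label in training}

print(classify("now buy pills cheap cheap", lexicon, class_sigs, clf))
print(classify("buy cheap now", lexicon, class_sigs, clf))
print(classify("hello world", lexicon, class_sigs, clf))
```

In this sketch the lexicon is taken directly from the classifier's discriminating attributes, which is one possible reading of "attribute information received from the inductive classifier". Because the signature ignores word order and repetition, a reordered copy of a training document still matches its class signature, while a document that shares only some tokens falls through to the probabilistic path.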
