Information classification paradigm

  • US 7,529,748 B2
  • Filed: 03/15/2006
  • Issued: 05/05/2009
  • Est. Priority Date: 11/15/2005
  • Status: Active Grant
First Claim

1. A machine-implemented method for determining whether documents being searched are relevant to a desired piece of information, the method comprising:

  • accessing documents stored in electronic form in a memory;

    automatically classifying, by a processor, documents in an initial set of source documents of the electronic documents into one of at least three groups, where the classifying is performed by a first classifier comprising an untrained rules-based classifier applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of source documents, the first classifier comprising software executing on a machine, the at least three groups comprising a first group containing documents determined by the untrained rules-based classifier to be of interest to the desired piece of information, a second group containing documents determined by the untrained rules-based classifier to be not of interest to the desired piece of information, and a third group containing documents that the untrained rules-based classifier did not place in the first group or the second group, the first classifier classifying by the processor;

    determining the presence or absence of at least one classification identifier in a source document;

    if the at least one classification identifier is absent within the source document then classifying the source document into the second group;

    if the at least one classification identifier is present within the source document then extracting, with a snippet extractor a snippet from the document and determining the presence or absence of at least one keyword within the snippet wherein the snippet is selected based on structures of the source document; and

    if at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the source document into the third group; and

    for each document classified into the third group, the processor;

    extracting from a document at least one feature vector; and

    classifying the document with a second classifier into either the first group or the second group based on the at least one feature vector, wherein the second classifier comprises a support vector machine (SVM) trained, prior to the classifying, with a plurality of labeled training documents, the labeled training documents having been labeled according to analysis of the documents prior to performance of the method, and where the second classifier comprises software executing on the machine.
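The two-stage flow recited in the claim can be illustrated with a minimal sketch. This is not the patentee's implementation. All names here (`extract_snippet`, `rules_classify`, `feature_vector`, `train_linear_svm`, `classify`) are hypothetical, and so are the "Acme" identifier, the keyword list, and the vocabulary. The sketch assumes that the snippet is the paragraph containing the identifier, that the feature vector is a bag-of-words count, and that the SVM is a small linear one trained by subgradient descent on the hinge loss, standing in for whatever SVM a real system would use.

```python
import re
from typing import List, Sequence, Tuple

FIRST, SECOND, THIRD = "of_interest", "not_of_interest", "undecided"


def extract_snippet(doc: str, identifier: str) -> str:
    """Assumed snippet rule: the first blank-line-delimited paragraph that
    contains the identifier (one simple use of document structure)."""
    for paragraph in re.split(r"\n\s*\n", doc):
        if identifier.lower() in paragraph.lower():
            return paragraph
    return ""


def rules_classify(doc: str, identifiers: Sequence[str],
                   keywords: Sequence[str]) -> str:
    """Untrained rules-based first stage: place a document in one of three groups."""
    present = [i for i in identifiers if i.lower() in doc.lower()]
    if not present:
        return SECOND  # no classification identifier in the document
    for ident in present:
        snippet = extract_snippet(doc, ident).lower()
        if any(k.lower() in snippet for k in keywords):
            return FIRST  # identifier present and a keyword in its snippet
    return THIRD  # identifier present, no keyword: defer to the second stage


def feature_vector(doc: str, vocabulary: Sequence[str]) -> List[float]:
    """Assumed features: term counts over a fixed vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [float(tokens.count(term)) for term in vocabulary]


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     epochs: int = 100, eta: float = 0.1,
                     lam: float = 0.01) -> Tuple[List[float], float]:
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.
    Labels are +1 (of interest) and -1 (not of interest)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # margin violated: step toward the example
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def classify(doc: str, identifiers: Sequence[str], keywords: Sequence[str],
             vocabulary: Sequence[str], model: Tuple[List[float], float]) -> str:
    """Full pipeline: rules first, SVM only for third-group documents."""
    group = rules_classify(doc, identifiers, keywords)
    if group != THIRD:
        return group
    w, b = model
    score = sum(wj * xj for wj, xj in zip(w, feature_vector(doc, vocabulary))) + b
    return FIRST if score >= 0.0 else SECOND


# Hypothetical configuration and labeled training documents (illustrative only).
IDENTIFIERS = ["acme"]
KEYWORDS = ["acquire", "merger"]
VOCABULARY = ["takeover", "picnic"]
TRAINING = [
    ("acme shares rose after takeover bid", +1),
    ("acme takeover talks continue", +1),
    ("acme picnic photos", -1),
    ("acme cafeteria menu picnic", -1),
]
MODEL = train_linear_svm(
    [feature_vector(text, VOCABULARY) for text, _ in TRAINING],
    [label for _, label in TRAINING],
)
```

Calling `classify(text, IDENTIFIERS, KEYWORDS, VOCABULARY, MODEL)` returns one of the two final group labels. The SVM is trained before any classification call and is consulted only for documents the rules leave in the third group, which mirrors the ordering of steps in the claim.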
