
Training procedure for N-gram-based statistical content classification

  • US 7,792,846 B1
  • Filed: 07/27/2007
  • Issued: 09/07/2010
  • Est. Priority Date: 07/27/2007
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method, comprising:

    selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;

    generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;

    providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;

    searching for the plurality of N-grams in the set of training documents;

    computing a plurality of scores for each of the plurality of N-grams with respect to the plurality of categories;

    searching for the plurality of N-grams in the set of validation documents; and

    determining a threshold for each of the plurality of categories.
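
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify a scoring formula or a threshold rule, so the smoothed log-ratio score, the accuracy-maximizing threshold search, and all function names (`byte_ngrams`, `select_ngrams`, `score_ngrams`, `document_score`, `determine_thresholds`) are assumptions chosen only for illustration.

```python
import math
from collections import defaultdict


def byte_ngrams(data: bytes, n_values):
    """Collect every byte sequence of length N in data, for each N in n_values."""
    grams = set()
    for n in n_values:
        for i in range(len(data) - n + 1):
            grams.add(data[i:i + n])
    return grams


def select_ngrams(candidates, sub_range):
    """Keep only the candidate N-grams whose length N lies in the sub-range."""
    allowed = set(sub_range)
    return {g for g in candidates if len(g) in allowed}


def score_ngrams(selected, training):
    """Search the training documents and score each N-gram per category.

    training is a list of (document_bytes, category) pairs.
    Assumed score: log ratio of the add-one-smoothed document frequency
    inside the category to the frequency outside it.
    """
    categories = sorted({cat for _, cat in training})
    hits = {cat: defaultdict(int) for cat in categories}
    docs_in = defaultdict(int)
    for doc, cat in training:
        docs_in[cat] += 1
        for g in selected:
            if g in doc:  # substring search on bytes
                hits[cat][g] += 1
    total_docs = len(training)
    scores = {}
    for g in selected:
        total_hits = sum(hits[cat][g] for cat in categories)
        scores[g] = {}
        for cat in categories:
            p_in = (hits[cat][g] + 1) / (docs_in[cat] + 2)
            p_out = (total_hits - hits[cat][g] + 1) / (total_docs - docs_in[cat] + 2)
            scores[g][cat] = math.log(p_in / p_out)
    return scores


def document_score(doc: bytes, scores, category):
    """Sum the category scores of every selected N-gram that occurs in doc."""
    return sum(s[category] for g, s in scores.items() if g in doc)


def determine_thresholds(scores, validation, categories):
    """Search the validation documents and pick one threshold per category.

    Assumed rule: among the observed document scores, choose the threshold
    that classifies the most validation documents correctly.
    """
    thresholds = {}
    for cat in categories:
        scored = [(document_score(doc, scores, cat), label == cat)
                  for doc, label in validation]
        best_t, best_correct = None, -1
        for t in sorted({s for s, _ in scored}):
            correct = sum((s >= t) == is_cat for s, is_cat in scored)
            if correct > best_correct:
                best_t, best_correct = t, correct
        thresholds[cat] = best_t
    return thresholds
```

In this sketch the "model" handed to a content filter would be the pair of `scores` and `thresholds`: a document is assigned to a category when its `document_score` for that category meets or exceeds the category's threshold.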
