Training procedure for N-gram-based statistical content classification
Abstract
A training procedure for N-gram-based statistical document classification is disclosed. In one embodiment, a set of N-grams is selected out of a second set of N-grams, each of the N-grams having a sequence of N bytes, where N is an integer. Then a statistical content classification model is generated based on occurrences of the N-grams, if any, in a set of training documents and a set of validation documents. The statistical content classification model is provided to content filters to classify content.
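As a rough illustration of what "a sequence of N bytes" means in practice, the sketch below slides an N-byte window over raw document bytes and counts each N-gram over a range of values of N. The function names `byte_ngrams` and `ngram_counts` are hypothetical and do not come from the patent.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of `data`."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

def ngram_counts(data: bytes, n_values):
    """Count byte N-grams of `data` for each N in `n_values`."""
    counts = Counter()
    for n in n_values:
        counts.update(byte_ngrams(data, n))
    return counts

# Example: the 4-byte sequence b"free" occurs twice in this input.
counts = ngram_counts(b"free free", [4])
```

Working on bytes rather than words means no tokenizer or character-set handling is needed before counting.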
28 Claims
1. A computer-implemented method, comprising:
selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
searching for the plurality of N-grams in the set of training documents;
computing a plurality of scores for each of the plurality of N-grams with respect to the plurality of categories;
searching for the plurality of N-grams in the set of validation documents; and
determining a threshold for each of the plurality of categories.
Dependent claims: 2, 3, 4, 5, 6.
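The searching and scoring steps of claim 1 can be pictured with the sketch below. The claim does not say how a score is computed, so the smoothed log-odds ratio used here is purely an assumption, as are the names `doc_frequency`, `compute_scores`, and `document_score` and the toy training data.

```python
import math

def doc_frequency(ngram: bytes, docs) -> int:
    """Number of documents that contain the N-gram at least once."""
    return sum(1 for d in docs if ngram in d)

def compute_scores(ngrams, training):
    """training maps category -> list of documents (bytes).

    Returns {ngram: {category: score}}. The score is an assumed
    smoothed log-odds ratio: how much more often the N-gram appears
    in documents of the category than in all other documents.
    """
    scores = {}
    for g in ngrams:
        scores[g] = {}
        for cat, docs in training.items():
            others = [d for c, ds in training.items() if c != cat for d in ds]
            p_in = (doc_frequency(g, docs) + 1) / (len(docs) + 2)
            p_out = (doc_frequency(g, others) + 1) / (len(others) + 2)
            scores[g][cat] = math.log(p_in / p_out)
    return scores

def document_score(doc: bytes, scores, category: str) -> float:
    """Sum the category scores of every selected N-gram found in `doc`."""
    return sum(s[category] for g, s in scores.items() if g in doc)

training = {
    "spam": [b"win cash now", b"cash prize"],
    "ham": [b"meeting at noon", b"lunch at noon"],
}
scores = compute_scores([b"cash", b"noon"], training)
```

The same `document_score` function could then be run over the validation documents to obtain the per-category score distributions from which thresholds are determined.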
7. A computer-implemented method, comprising:
selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
determining a utility for each of the second plurality of N-grams using a frequency of occurrence of a respective N-gram in a subset of training documents of the set of training documents that have been classified in a respective category and a frequency of occurrence of the respective N-gram in remaining training documents of the set of training documents; and
selecting the sub-range of values of N based on utilities of the second plurality of N-grams.
Dependent claims: 8, 9, 10, 11.
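A minimal sketch of the utility and sub-range selection steps of claim 7 follows. The claim only states that the utility uses the two frequencies, so the absolute difference of document-frequency fractions is an assumed formula, and picking the contiguous window of N values with the highest summed best utility is an assumed selection rule. The names `utility` and `select_subrange` are hypothetical.

```python
from collections import defaultdict

def utility(ngram: bytes, in_cat, rest) -> float:
    """Assumed utility: gap between the fraction of in-category documents
    and the fraction of remaining documents that contain the N-gram."""
    f_in = sum(ngram in d for d in in_cat) / max(len(in_cat), 1)
    f_rest = sum(ngram in d for d in rest) / max(len(rest), 1)
    return abs(f_in - f_rest)

def select_subrange(candidates, in_cat, rest, width=2):
    """Return (lo, hi), the contiguous window of `width` values of N
    whose best N-gram utilities sum highest (first window wins ties)."""
    best = defaultdict(float)
    for g in candidates:
        best[len(g)] = max(best[len(g)], utility(g, in_cat, rest))
    n_min, n_max = min(best), max(best)
    width = min(width, n_max - n_min + 1)  # never wider than the full range
    windows = [(lo, lo + width - 1) for lo in range(n_min, n_max - width + 2)]
    return max(windows, key=lambda w: sum(best.get(n, 0.0)
                                          for n in range(w[0], w[1] + 1)))

in_cat = [b"casino bonus", b"online casino"]
rest = [b"board meeting", b"meeting notes"]
candidates = [b"in", b"cas", b"casi", b"casin"]  # N ranges over 2..5
subrange = select_subrange(candidates, in_cat, rest)
```

Here the 2-gram `b"in"` appears in every document of both groups, so it has zero utility and the window starting at N = 2 loses to the window covering N = 3 and N = 4.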
12. A machine-accessible medium that provides instructions that, if executed by a processor, will cause the processor to perform operations comprising:
selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
searching for the plurality of N-grams in the set of training documents;
computing a plurality of scores for each of the plurality of N-grams with respect to the plurality of categories;
searching for the plurality of N-grams in the set of validation documents; and
determining a threshold for each of the plurality of categories.
Dependent claims: 13, 14, 15, 16, 17.
18. A machine-accessible medium that provides instructions that, if executed by a processor, will cause the processor to perform operations comprising:
selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
determining a utility for each of the second plurality of N-grams using a frequency of occurrence of a respective N-gram in a subset of training documents of the set of training documents that have been classified in a respective category and a frequency of occurrence of the respective N-gram in remaining training documents of the set of training documents; and
selecting the sub-range of values of N based on utilities of the second plurality of N-grams.
Dependent claims: 19, 20, 21, 22.
23. An apparatus comprising:
a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer; and
a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents, wherein the search engine is operable to compute a plurality of scores for each of the plurality of N-grams with respect to a plurality of categories;
wherein the model generator is operable to determine a plurality of thresholds for the plurality of categories using the plurality of scores and the set of validation documents, each of the plurality of thresholds being associated with a distinct one of the plurality of categories;
wherein the model generator is operable to compute each of the plurality of thresholds using a frequency of occurrences of each of the plurality of N-grams in the set of validation documents, the plurality of scores, and a predetermined false positive limit.
Dependent claims: 24, 25, 26.
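The threshold computation in claim 23 can be sketched as follows. The claim names the inputs (validation-set occurrences, the scores, and a predetermined false positive limit) but not the rule, so the rule below is an assumption: for each category, take the lowest cutoff at which the fraction of out-of-category validation documents scoring strictly above it stays within the limit. The names `pick_threshold` and `pick_thresholds` are hypothetical, and the validation scores are made-up numbers standing in for summed N-gram scores.

```python
def pick_threshold(negative_scores, fp_limit: float) -> float:
    """Lowest threshold t such that the share of out-of-category
    validation documents with score > t does not exceed `fp_limit`.
    A document is classified into the category when score > t."""
    allowed = int(fp_limit * len(negative_scores))  # false positives tolerated
    ranked = sorted(negative_scores, reverse=True)
    if allowed >= len(ranked):
        return float("-inf")  # limit is loose enough to accept everything
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]

def pick_thresholds(negatives_by_category, fp_limit: float) -> dict:
    """One threshold per category, as the claim requires."""
    return {cat: pick_threshold(s, fp_limit)
            for cat, s in negatives_by_category.items()}

# Summed N-gram scores of validation documents NOT in the category.
negatives = [0.1, 0.5, 2.0, 0.3, 1.2, 0.0, 0.7, 0.2]
t = pick_threshold(negatives, fp_limit=0.25)
```

With eight out-of-category documents and a limit of 0.25, two false positives are tolerated, so the cutoff lands on the third-highest score.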
27. A system comprising:
a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer;
a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents;
a repository coupled to the model generator to store the statistical content classification model;
an N-gram-based content rating engine coupled to the repository, to access the statistical content classification model and to rate content of documents in the natural language using the statistical content classification model, wherein the documents are from a network external to the system;
a content filtering module comprising the N-gram-based content rating engine; and
a client machine coupled to the content filtering module, wherein the content filtering module receives a request to access a web page from the client machine and the N-gram-based content rating engine rates content of the requested web page, wherein the content filtering module blocks the requested web page from the client machine if the content of the requested web page is in a prohibited category and the content filtering module passes the requested web page to the client machine if the content of the requested web page is in an allowable category.
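The block-or-pass behaviour of the content filtering module in claim 27 might look like the sketch below. The model layout (`CategoryModel` with per-N-gram scores and one threshold), the summed-score rating rule, and the names `rate` and `filter_page` are all assumptions made for illustration. The claim itself only requires that content in a prohibited category be blocked and content in an allowable category be passed.

```python
from dataclasses import dataclass

@dataclass
class CategoryModel:
    scores: dict      # byte N-gram -> score for this category (assumed layout)
    threshold: float  # per-category threshold from training

def rate(content: bytes, model: dict) -> set:
    """Return the categories whose summed N-gram score exceeds the threshold."""
    hits = set()
    for category, cm in model.items():
        total = sum(s for g, s in cm.scores.items() if g in content)
        if total > cm.threshold:
            hits.add(category)
    return hits

def filter_page(content: bytes, model: dict, prohibited: set) -> str:
    """Block the page if it is rated into any prohibited category."""
    return "block" if rate(content, model) & prohibited else "pass"

model = {"gambling": CategoryModel({b"casino": 2.0, b"poker": 1.5}, 1.0)}
decision = filter_page(b"<html>online casino</html>", model, {"gambling"})
```

The same decision function would apply to the electronic mail variant, with the message body in place of the web page content.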
28. A system comprising:
a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer;
a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents;
a repository coupled to the model generator to store the statistical content classification model;
an N-gram-based content rating engine coupled to the repository, to access the statistical content classification model and to rate content of documents in the natural language using the statistical content classification model, wherein the documents are from a network external to the system;
a content filtering module comprising the N-gram-based content rating engine; and
a client machine coupled to the content filtering module, wherein the content filtering module receives an incoming electronic mail message and the N-gram-based content rating engine rates content of the electronic mail message, wherein the content filtering module blocks the electronic mail message from the client machine if the content of the electronic mail message is in a prohibited category and the content filtering module passes the electronic mail message to the client machine if the content of the electronic mail message is in an allowable category.
Specification