Training procedure for N-gram-based statistical content classification

US 7,792,846 B1
Filed: 07/27/2007
Issued: 09/07/2010
Est. Priority Date: 07/27/2007
Status: Expired due to Fees

Abstract

A training procedure for N-gram based statistical document classification has been disclosed. In one embodiment, a set of N-grams is selected out of a second set of N-grams, each of the N-grams having a sequence of N bytes, where N is an integer. Then a statistical content classification model is generated based on occurrences of the N-grams, if any, in a set of training documents and a set of validation documents. The statistical content classification model is provided to content filters to classify content.
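
As an illustration of the abstract, the sketch below treats an N-gram as a window of N consecutive bytes and counts how many documents contain each one. It is a minimal sketch, not the patented implementation: the helper names `byte_ngrams` and `document_frequency`, the UTF-8 encoding, and the sample strings are all assumptions made for this example.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of data (hypothetical helper)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

def document_frequency(docs, n_values):
    """Count, for each byte N-gram, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        seen = set()  # count each N-gram once per document
        for n in n_values:
            seen.update(byte_ngrams(doc, n))
        df.update(seen)
    return df

# Assumed sample data: two short Chinese strings encoded as UTF-8,
# where each character occupies 3 bytes.
docs = ["新闻报道".encode("utf-8"), "体育新闻".encode("utf-8")]
df = document_frequency(docs, n_values=range(3, 7))
```

Working on bytes rather than words avoids any need for word segmentation, which is why the approach suits non-delimited languages such as Chinese.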

28 Claims

1. A computer-implemented method, comprising:
- selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
  
  providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
  
  searching for the plurality of N-grams in the set of training documents;
  
  computing a plurality of scores for each of the plurality of N-grams with respect to the plurality of categories;
  
  searching for the plurality of N-grams in the set of validation documents; and
  
  determining a threshold for each of the plurality of categories.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein determining the threshold for each of the plurality of categories comprises:
    - computing the threshold using a frequency of occurrence of each of the plurality of N-grams in a subset of validation documents of the set of validation documents that have been classified into a respective category, a frequency of non-occurrence of each of the plurality of N-grams in the subset of validation documents, the plurality of scores, and a predetermined false positive limit.
  - 3. The method of claim 1, wherein each of said plurality of N-grams representing at least a portion of a keyword in a non-delimited natural language.
  - 4. The method of claim 3, wherein the non-delimited natural language is Chinese.
  - 5. The method of claim 1, wherein the set of training documents includes one or more web pages.
  - 6. The method of claim 1, wherein the set of training documents includes one or more electronic mail messages.
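
The scoring and threshold steps of claims 1 and 2 can be sketched as follows. The claims do not give formulas, so everything specific here is an assumption: the smoothed log-ratio score, the additive document score, the strict `score > threshold` decision rule, and the function names. The threshold is chosen so that the share of out-of-category validation documents scoring above it stays within the false positive limit.

```python
import math

def ngram_scores(ngrams, training, categories, smoothing=1.0):
    """Score each N-gram per category from labeled training documents.

    training is a list of (doc_bytes, category) pairs. The score is an
    assumed smoothed log-ratio of document frequencies, not the patent's formula.
    """
    scores = {}
    for gram in ngrams:
        scores[gram] = {}
        for cat in categories:
            in_cat = [d for d, c in training if c == cat]
            rest = [d for d, c in training if c != cat]
            p_in = (sum(gram in d for d in in_cat) + smoothing) / (len(in_cat) + 2 * smoothing)
            p_rest = (sum(gram in d for d in rest) + smoothing) / (len(rest) + 2 * smoothing)
            scores[gram][cat] = math.log(p_in / p_rest)
    return scores

def document_score(doc, scores, cat):
    """Sum the category scores of every selected N-gram that occurs in doc."""
    return sum(s[cat] for gram, s in scores.items() if gram in doc)

def category_threshold(validation, scores, cat, fp_limit):
    """Lowest observed cut-off keeping false positives within fp_limit.

    A document is assigned to cat when its score is strictly above the result.
    """
    negatives = sorted(
        (document_score(d, scores, cat) for d, c in validation if c != cat),
        reverse=True,
    )
    allowed = int(fp_limit * len(negatives))  # negatives permitted above the cut-off
    if allowed >= len(negatives):
        return float("-inf")  # validation data imposes no constraint
    return negatives[allowed]

# Assumed toy data for illustration only.
training = [
    (b"sports news goal", "sports"), (b"goal match team", "sports"),
    (b"stock market bank", "finance"), (b"bank loan rate", "finance"),
]
validation = [
    (b"late goal wins", "sports"), (b"bank raises rate", "finance"),
    (b"weather today", "finance"),
]
scores = ngram_scores([b"goal", b"bank"], training, ["sports", "finance"])
threshold = category_threshold(validation, scores, "sports", fp_limit=0.0)
```

With a false positive limit of zero, the threshold lands on the highest score of any out-of-category validation document, so none of them is classified into the category.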

7. A computer-implemented method, comprising:
- selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
  
  providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
  
  determining a utility for each of the second plurality of N-grams using a frequency of occurrence of a respective N-gram in a subset of training documents of the set of training documents that have been classified in a respective category and a frequency of occurrence of the respective N-gram in remaining training documents of the set of training documents; and
  
  selecting the sub-range of values of N based on utilities of the second plurality of N-grams.
- Dependent claims (8, 9, 10, 11):
  - 8. The method of claim 7, wherein each of said plurality of N-grams representing at least a portion of a keyword in a non-delimited natural language.
  - 9. The method of claim 8, wherein the non-delimited natural language is Chinese.
  - 10. The method of claim 7, wherein the set of training documents includes one or more web pages.
  - 11. The method of claim 7, wherein the set of training documents includes one or more electronic mail messages.
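
The utility-based selection in claim 7 can be sketched as follows. Claim 7 states only which frequencies feed the utility, so the absolute-difference utility, the per-N averaging, the fixed window `width`, and all function names are assumptions chosen to keep the example small.

```python
from collections import defaultdict

def utility(gram, in_cat, rest):
    """Assumed utility: gap between the N-gram's document frequency in the
    category's training documents and in the remaining training documents."""
    f_in = sum(gram in d for d in in_cat) / max(len(in_cat), 1)
    f_rest = sum(gram in d for d in rest) / max(len(rest), 1)
    return abs(f_in - f_rest)

def candidate_ngrams(docs, n_range):
    """Collect every byte N-gram in docs for each N in n_range."""
    grams = set()
    for d in docs:
        for n in n_range:
            for i in range(len(d) - n + 1):
                grams.add(d[i:i + n])
    return grams

def select_subrange(in_cat, rest, n_range, width):
    """Return (lo, hi): the run of `width` consecutive N values whose N-grams
    have the highest summed mean utility. Assumes width <= len(n_range)."""
    by_n = defaultdict(list)
    for g in candidate_ngrams(in_cat + rest, n_range):
        by_n[len(g)].append(utility(g, in_cat, rest))
    mean_u = {n: (sum(by_n[n]) / len(by_n[n]) if by_n[n] else 0.0) for n in n_range}
    ns = list(n_range)
    best = max(
        range(len(ns) - width + 1),
        key=lambda i: sum(mean_u[n] for n in ns[i:i + width]),
    )
    return ns[best], ns[best + width - 1]

# Assumed toy data: two in-category documents and one remaining document.
lo, hi = select_subrange([b"abc", b"abd"], [b"abe"], range(1, 4), width=2)
```

In this toy data, single bytes such as `a` and `b` occur everywhere and carry no utility, so the selection shifts toward longer N-grams.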

12. A machine-accessible medium that provides instructions that, if executed by a processor, will cause the processor to perform operations comprising:
- selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
  
  providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
  
  searching for the plurality of N-grams in the set of training documents;
  
  computing a plurality of scores for each of the plurality of N-grams with respect to the plurality of categories;
  
  searching for the plurality of N-grams in the set of validation documents; and
  
  determining a threshold for each of the plurality of categories.
- Dependent claims (13, 14, 15, 16, 17):
  - 13. The machine-accessible medium of claim 12, wherein determining the threshold for each of the plurality of categories comprises:
    - computing the threshold using a frequency of occurrence of each of the plurality of N-grams in a subset of validation documents of the set of validation documents that have been classified into a respective category, a frequency of non-occurrence of each of the plurality of N-grams in the subset of validation documents, the plurality of scores, and a predetermined false positive limit.
  - 14. The machine-accessible medium of claim 12, wherein each of said plurality of N-grams representing at least a portion of a keyword in a non-delimited natural language.
  - 15. The machine-accessible medium of claim 14, wherein the non-delimited natural language is Chinese.
  - 16. The machine-accessible medium of claim 12, wherein the set of training documents includes one or more web pages.
  - 17. The machine-accessible medium of claim 12, wherein the set of training documents includes one or more electronic mail messages.

18. A machine-accessible medium that provides instructions that, if executed by a processor, will cause the processor to perform operations comprising:
- selecting a plurality of N-grams from a second plurality of N-grams, wherein the second plurality of N-grams are associated with a range of values of N and the plurality of N-grams are associated with a sub-range of the range of values of N, wherein each of the second plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  generating a statistical content classification model based on occurrences of the plurality of N-grams, if any, in a set of training documents and a set of validation documents;
  
  providing the statistical content classification model to content filters to classify content into one or more of a plurality of categories;
  
  determining a utility for each of the second plurality of N-grams using a frequency of occurrence of a respective N-gram in a subset of training documents of the set of training documents that have been classified in a respective category and a frequency of occurrence of the respective N-gram in remaining training documents of the set of training documents; and
  
  selecting the sub-range of values of N based on utilities of the second plurality of N-grams.
- Dependent claims (19, 20, 21, 22):
  - 19. The machine-accessible medium of claim 18, wherein each of said plurality of N-grams representing at least a portion of a keyword in a non-delimited natural language.
  - 20. The machine-accessible medium of claim 19, wherein the non-delimited natural language is Chinese.
  - 21. The machine-accessible medium of claim 18, wherein the set of training documents includes one or more web pages.
  - 22. The machine-accessible medium of claim 18, wherein the set of training documents includes one or more electronic mail messages.

23. An apparatus comprising:
- a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer; and
  
  a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents, wherein the search engine is operable to compute a plurality of scores for each of the plurality of N-grams with respect to a plurality of categories;
  
  wherein the model generator is operable to determine a plurality of thresholds for the plurality of categories using the plurality of scores and the set of validation documents, each of the plurality of thresholds being associated with a distinct one of the plurality of categories;
  
  wherein the model generator is operable to compute each of the plurality of thresholds using a frequency of occurrences of each of the plurality of N-grams in the set of validation documents, the plurality of scores, and a predetermined false positive limit.
- Dependent claims (24, 25, 26):
  - 24. The apparatus of claim 23, further comprising:
    - a processing module to select the plurality of N-grams from a second plurality of N-grams based on utilities of the plurality of N-grams in content classification with respect to the set of training documents.
  - 25. The apparatus of claim 23, wherein the model generator is operable to determine a plurality of thresholds for the plurality of categories using the plurality of scores and the set of validation documents, each of the plurality of thresholds being associated with a distinct one of the plurality of categories.
  - 26. The apparatus of claim 25, wherein the model generator is operable to compute each of the plurality of thresholds using a frequency of occurrences of each of the plurality of N-grams in the set of validation documents, the plurality of scores, and a predetermined false positive limit.

27. A system comprising:
- a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents;
  
  a repository coupled to the model generator to store the statistical content classification model;
  
  an N-gram-based content rating engine coupled to the repository, to access the statistical content classification model and to rate content of documents in the natural language using the statistical content classification model, wherein the documents are from a network external to the system;
  
  a content filtering module comprising the N-gram-based content rating engine; and
  
  a client machine coupled to the content filtering module, wherein the content filtering module receives a request to access a web page from the client machine and the N-gram-based content rating engine rates content of the requested web page, wherein the content filtering module blocks the requested web page from the client machine if the content of the requested web page is in a prohibited category and the content filtering module passes the requested web page to the client machine if the content of the requested web page is in an allowable category.

28. A system comprising:
- a pattern matching engine to search for a plurality of N-grams in a set of training documents and a set of validation documents, each of said plurality of N-grams representing at least a portion of a keyword in a natural language, and the set of training documents and the set of validation documents being written in the natural language, wherein each of said plurality of N-grams comprises a sequence of N bytes, where N is an integer;
  
  a model generator coupled to the search engine to generate a statistical content classification model based on occurrences of each of the plurality of N-grams in the set of training documents and the set of validation documents;
  
  a repository coupled to the model generator to store the statistical content classification model;
  
  an N-gram-based content rating engine coupled to the repository, to access the statistical content classification model and to rate content of documents in the natural language using the statistical content classification model, wherein the documents are from a network external to the system;
  
  a content filtering module comprising the N-gram-based content rating engine; and
  
  a client machine coupled to the content filtering module, wherein the content filtering module receives an incoming electronic mail message and the N-gram-based content rating engine rates content of the electronic mail message, wherein the content filtering module blocks the electronic mail message from the client machine if the content of the electronic mail message is in a prohibited category and the content filtering module passes the electronic mail message to the client machine if the content of the electronic mail message is in an allowable category.
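
The block-or-pass behavior described in claims 27 and 28 can be sketched as follows. This is a minimal sketch under stated assumptions: the `Model` container, the additive scoring rule, the category names, and the sample scores and thresholds are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class Model:
    """Hypothetical container for a trained classification model."""
    scores: dict      # scores[gram][category] -> float
    thresholds: dict  # thresholds[category] -> float

def rate(content: bytes, model: Model) -> set:
    """Return every category whose summed N-gram score exceeds its threshold."""
    categories = set()
    for cat, threshold in model.thresholds.items():
        total = sum(
            s.get(cat, 0.0) for gram, s in model.scores.items() if gram in content
        )
        if total > threshold:
            categories.add(cat)
    return categories

def filter_content(content: bytes, model: Model, prohibited: set) -> str:
    """Block content rated into any prohibited category; otherwise pass it."""
    return "block" if rate(content, model) & prohibited else "pass"

# Assumed model with one N-gram per category, for illustration only.
model = Model(
    scores={
        "赌博".encode("utf-8"): {"gambling": 2.0},
        "新闻".encode("utf-8"): {"news": 1.5},
    },
    thresholds={"gambling": 1.0, "news": 1.0},
)
page = "在线赌博网站".encode("utf-8")
mail = "今日新闻".encode("utf-8")
```

The same `filter_content` call serves both the web page case of claim 27 and the electronic mail case of claim 28, since only the source of the bytes differs.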

Current Assignee: Quest Software, Inc.
Original Assignee: SonicWALL, Inc. (SonicWall Holdings Ltd.)
Inventors: Raffill, Thomas E.; Gmuender, John; Yanovsky, Boris; Zhu, Shunhui; Yanovsky, Roman
Primary Examiner(s): Fleurantin, Jean B
Application Number: US11/881,770
Time in Patent Office: 1,138 Days
Field of Search: 707/1-10, 707/100-104.1, 707/200-206, 707/600-831, 704/1, 704/9, 704/10, 704/251, 704/257, 704/240, 379/88.09
US Class Current: 707/754
CPC Class Codes: G06F 16/35 Clustering; Classification
