Automated generation of spam-detection rules using optical character recognition and identifications of common features

US 20090077617A1
Filed: 09/13/2007
Published: 03/19/2009
Est. Priority Date: 09/13/2007
Status: Abandoned Application

Abstract

In a spam detection method and system, optical character recognition (OCR) techniques are applied to a set of images that have been identified as being spam. The images may be provided as the initial training of the spam detection system, but the preferred embodiment is one in which the images are provided for the purpose of updating the spam-detection rules of currently running systems at different locations. The OCR generates text strings representative of content of the individual images. Automated techniques are applied to the text strings to identify common features or patterns, such as misspellings which are either intentionally included in order to avoid detection or introduced through OCR errors due to the text being obscured. Spam-detection rules are automatically generated on the basis of identifications of the common features. Then, the spam-detection rules are applied to electronic communications, such as electronic mail, so as to detect occurrences of spam within the electronic communications.
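The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation. It assumes the OCR step has already produced text strings, treats any token missing from a small hypothetical `DICTIONARY` as a candidate feature, and treats each generated "rule" as a single token to look for. The names `generate_rules`, `matches_rules`, and `min_fraction` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical stand-in for a real dictionary word list (assumption).
DICTIONARY = {"buy", "cheap", "now", "stock", "pills", "online", "best", "price"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def generate_rules(ocr_strings, min_fraction=0.5):
    """Return non-dictionary tokens found in at least min_fraction of the strings."""
    counts = Counter()
    for text in ocr_strings:
        # Count each token at most once per OCR string.
        counts.update({t for t in tokenize(text) if t not in DICTIONARY})
    needed = min_fraction * len(ocr_strings)
    return sorted(t for t, n in counts.items() if n >= needed)


def matches_rules(message_text, rules):
    """Flag a message if any rule token appears in its text."""
    tokens = set(tokenize(message_text))
    return any(rule in tokens for rule in rules)


# OCR output for three images already identified as spam (made-up sample data).
spam_ocr = ["buy v1agra now", "cheap v1agra online", "best st0ck price"]
rules = generate_rules(spam_ocr)
```

With this sample data, `v1agra` appears in two of the three strings and becomes a rule, while `st0ck` appears only once and falls below the threshold. Whitespace tokenization is a simplification; punctuation attached to a token would prevent a match.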

20 Claims

1. A computer-implemented method of enabling spam detection comprising:
- identifying a set of images as being spam;
- applying optical character recognition (OCR) techniques to said images to provide text strings representative of content of individual said images;
- applying automated techniques to said text strings to identify common text-related features and patterns of a plurality of said text strings, wherein said common text-related features and patterns are determined to be indicative of spam;
- generating spam-detection rules based on identifications of said common text-related features and patterns; and
- applying said spam-detection rules to electronic communications to detect occurrences of spam within said electronic communications.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The computer-implemented method of claim 1 wherein applying said spam-detection rules includes transmitting said spam-detection rules to a plurality of spam firewalls of a plurality of independent networks.
  - 3. The computer-implemented method of claim 2 wherein identifying said set of images includes receiving said images from said independent networks as spam which was not identified as being spam by said spam firewalls.
  - 4. The computer-implemented method of claim 3 wherein said spam-detection rules are transmitted to said spam firewalls as an update to previously employed spam-detection rules.
  - 5. The computer-implemented method of claim 1 wherein applying said spam-detection rules includes determining whether email messages contain spam.
  - 6. The computer-implemented method of claim 1 wherein identifying said common text-related features includes determining occurrences of specific words not found in a dictionary which is accessible during application of said automated techniques.
  - 7. The computer-implemented method of claim 1 wherein identifying said common text-related features includes determining occurrences of words containing symbols not consistent with spelling words with respect to a particular language.
  - 8. The computer-implemented method of claim 1 wherein applying automated techniques includes applying a threshold to a frequency of occurrences of said text-related features and patterns.
  - 9. The computer-implemented method of claim 1 wherein applying automated techniques includes updating existing rules to optimize said existing rules on a basis of said common text-related features and patterns.
  - 10. The computer-implemented method of claim 1 wherein applying said OCR techniques includes forming a plurality of said text strings for at least one said image, including defining segments of said image and forming a separate said text string for each said segment.
  - 11. The computer-implemented method of claim 1 wherein generating and applying said spam-detection rules includes utilizing Bayesian analysis to determine probabilities as to whether said spam-detection rules are effective in detecting spam, said Bayesian analysis including establishing a threshold of probability which must be met by each said spam-detection rule.
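One possible reading of the Bayesian thresholding in claim 11 can be sketched as follows. The claim does not specify a formula, so this sketch assumes Bayes' theorem applied to the event "the rule token appears in the text", with add-one (Laplace) smoothing and an equal prior by default. The function names, the `threshold` value, and the sample texts are all illustrative assumptions.

```python
def rule_spam_probability(rule, spam_texts, ham_texts, prior_spam=0.5):
    """Estimate P(spam | rule matches) with Bayes' theorem and add-one smoothing."""
    spam_hits = sum(rule in t.lower().split() for t in spam_texts)
    ham_hits = sum(rule in t.lower().split() for t in ham_texts)
    # Smoothed likelihoods of a match in spam and in legitimate mail.
    p_match_given_spam = (spam_hits + 1) / (len(spam_texts) + 2)
    p_match_given_ham = (ham_hits + 1) / (len(ham_texts) + 2)
    numerator = p_match_given_spam * prior_spam
    return numerator / (numerator + p_match_given_ham * (1 - prior_spam))


def filter_rules(rules, spam_texts, ham_texts, threshold=0.7):
    """Keep only rules whose estimated spam probability meets the threshold."""
    return [
        r for r in rules
        if rule_spam_probability(r, spam_texts, ham_texts) >= threshold
    ]


# Made-up sample texts for illustration only.
spam = ["buy v1agra now", "cheap v1agra online", "v1agra best price"]
ham = ["meeting now at noon", "best price for lunch", "see you online"]
kept = filter_rules(["v1agra", "now"], spam, ham)
```

Here `v1agra` matches all three spam texts and no legitimate text, giving an estimate of 0.8, so it is kept. The token `now` appears equally often in both sets, scores 0.5, and is dropped.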

12. A system for determining spam-detection rules comprising:
- a supply of known image spam, each said known image spam including an image designated as being spam;
- an optical character recognition (OCR) component having an input to receive said known image spam, said OCR component being configured to form at least one text string for each said known image spam that includes text;
- a pattern recognition component connected to said OCR component to receive said text strings, said pattern recognition component being configured to identify common text-related features and patterns among text strings formed at said OCR component; and
- a rules generation component connected to said pattern recognition component, said rules generation component being configured to generate spam-detection rules on a basis of said common text-related features and patterns.
- Dependent claims (13, 14, 15, 16):
  - 13. The system of claim 12 further comprising an update facility to distribute said spam-detection rules to a plurality of spam firewalls of independent networks.
  - 14. The system of claim 12 wherein said pattern recognition is computer programming configured to detect misspellings of words.
  - 15. The system of claim 12 wherein said supply of known image spam is a storage of email.
  - 16. The system of claim 12 wherein said rules generation component is configured to apply Bayesian analysis.
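The component chain of claim 12 (OCR component, pattern recognition component, rules generation component) might be wired together as in the sketch below. It is an illustration under stated assumptions: the OCR component is a hypothetical callable injected by the caller rather than a real OCR library, and misspelling detection (claim 14) is approximated with the standard-library `difflib.get_close_matches`, which is this sketch's choice and not a technique named in the application. `RuleSystem`, `min_count`, and the `0.7` cutoff are illustrative.

```python
import difflib
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RuleSystem:
    ocr: Callable[[bytes], List[str]]  # hypothetical OCR component (assumption)
    dictionary: List[str]              # known correctly spelled words
    min_count: int = 2                 # images a misspelling must appear in

    def find_misspellings(self, text: str) -> List[str]:
        """Return tokens that are not dictionary words but resemble one."""
        found = []
        for token in text.lower().split():
            if token in self.dictionary:
                continue
            if difflib.get_close_matches(token, self.dictionary, n=1, cutoff=0.7):
                found.append(token)
        return found

    def generate_rules(self, images: Iterable[bytes]) -> List[str]:
        """Run OCR, collect misspellings, and keep those common to enough images."""
        counts = Counter()
        for image in images:
            seen = set()
            for text in self.ocr(image):  # one or more text strings per image
                seen.update(self.find_misspellings(text))
            counts.update(seen)           # count each token once per image
        return sorted(t for t, n in counts.items() if n >= self.min_count)


def fake_ocr(image: bytes) -> List[str]:
    """Stand-in OCR for the example: the 'image' bytes already hold its text."""
    return [image.decode()]


system = RuleSystem(ocr=fake_ocr, dictionary=["cheap", "viagra", "now", "buy", "online"])
rules = system.generate_rules([b"cheap v1agra now", b"buy v1agra online", b"xqzt now"])
```

Injecting the OCR step as a callable keeps the sketch runnable without an OCR engine and mirrors the claim's separation of components. The token `xqzt` is ignored because it resembles no dictionary word.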

17. A computer-implemented method comprising:
- utilizing spam-detection rules to identify spam;
- collecting spam images which remain unidentified as spam by said spam-detection rules;
- applying OCR processing to said spam images to generate text strings representative of text contained in said spam images;
- using automated techniques to identify commonalities among said text strings, where said commonalities are inconsistent with language construction;
- generating additional spam-detection rules based on said commonalities; and
- providing an update for subsequent detections of spam.
- Dependent claims (18, 19, 20):
  - 18. The computer-implemented method of claim 17 wherein identifying said commonalities includes detecting misspellings within a plurality of said spam images.
  - 19. The computer-implemented method of claim 17 wherein generating said additional spam-detection rules includes applying a frequency of occurrence algorithm.
  - 20. The computer-implemented method of claim 17 wherein said spam-detection rules are applied to email messages.
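The update loop of claim 17 can be sketched as a single function that takes the current rules plus OCR text from spam reported by other means, keeps only the texts the current rules missed, and adds frequently occurring non-dictionary tokens as new rules. This is an assumption-laden illustration: rules are modeled as a set of tokens, "inconsistent with language construction" is reduced to "not in a dictionary", and `update_rules`, `min_count`, and the sample data are hypothetical.

```python
from collections import Counter


def update_rules(rules, reported_spam_texts, dictionary, min_count=2):
    """Return the rule set extended with tokens common among missed spam."""
    # Keep only spam texts that none of the current rules matched.
    missed = [
        text for text in reported_spam_texts
        if not (set(text.lower().split()) & rules)
    ]
    counts = Counter()
    for text in missed:
        # Count each non-dictionary token once per text (frequency of occurrence).
        counts.update({tok for tok in text.lower().split() if tok not in dictionary})
    new_rules = {tok for tok, n in counts.items() if n >= min_count}
    return rules | new_rules


# Made-up sample data for illustration only.
current = {"v1agra"}
reported = ["buy v1agra now", "hot st0ck tip", "st0ck alert now", "random zzz"]
words = {"buy", "now", "hot", "tip", "alert", "random"}
updated = update_rules(current, reported, words)
```

The first reported text is already caught by the existing rule and is skipped. Of the remaining texts, `st0ck` occurs twice and is added, while `zzz` occurs once and is not.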

Current Assignee: Barracuda Networks Incorporated
Original Assignee: Barracuda Networks Incorporated
Inventors: Drako, Dean M.; Anderson, Shawn Paul; Levow, Zachary S.

Application Number: US11/900,741
Publication Number: US 20090077617A1

Field of Search
US Class Current: 726/1
CPC Class Codes:
H04L 51/212   using filtering or selectiv...
H04L 63/0227   Filtering policies mail mes...
H04L 63/1416   Event detection, e.g. attac...
