Automated generation of spam-detection rules using optical character recognition and identifications of common features
First Claim
1. A computer-implemented method of enabling spam detection comprising:
identifying a set of images as being spam;
applying optical character recognition (OCR) techniques to said images to provide text strings representative of content of individual said images;
applying automated techniques to said text strings to identify common text-related features and patterns of a plurality of said text strings, wherein said common text-related features and patterns are determined to be indicative of spam;
generating spam-detection rules based on identifications of said common text-related features and patterns; and
applying said spam-detection rules to electronic communications to detect occurrences of spam within said electronic communications.
Abstract
In a spam detection method and system, optical character recognition (OCR) techniques are applied to a set of images that have been identified as being spam. The images may be provided as the initial training of the spam detection system, but the preferred embodiment is one in which the images are provided for the purpose of updating the spam-detection rules of currently running systems at different locations. The OCR generates text strings representative of content of the individual images. Automated techniques are applied to the text strings to identify common features or patterns, such as misspellings which are either intentionally included in order to avoid detection or introduced through OCR errors due to the text being obscured. Spam-detection rules are automatically generated on the basis of identifications of the common features. Then, the spam-detection rules are applied to electronic communications, such as electronic mail, so as to detect occurrences of spam within the electronic communications.
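The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the OCR step is assumed to have already run, so `OCR_STRINGS` is a hypothetical stand-in for its output, `KNOWN_WORDS` is a tiny illustrative dictionary, and the function names and the `min_share` threshold are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical stand-in for OCR output: one text string per known spam image.
OCR_STRINGS = [
    "Buy V1AGRA now cheap",
    "v1agra best price today",
    "Cheap V1agra and more",
    "Hello from the team",
]

# Tiny illustrative word list; a real system would need a full dictionary.
KNOWN_WORDS = {
    "buy", "now", "cheap", "best", "price", "today",
    "and", "more", "hello", "from", "the", "team",
}


def common_features(strings, min_share=0.5):
    """Return non-dictionary tokens found in at least min_share of the strings."""
    counts = Counter()
    for s in strings:
        # Count each token once per string so repeats inside one image do not dominate.
        tokens = set(re.findall(r"[a-z0-9]+", s.lower()))
        counts.update(t for t in tokens if t not in KNOWN_WORDS)
    threshold = min_share * len(strings)
    return sorted(t for t, n in counts.items() if n >= threshold)


def generate_rules(features):
    """Turn each common feature into a case-insensitive whole-word regex rule."""
    return [re.compile(r"\b" + re.escape(f) + r"\b", re.IGNORECASE) for f in features]


def is_spam(text, rules):
    """Apply the generated rules to the text of an electronic communication."""
    return any(rule.search(text) for rule in rules)


rules = generate_rules(common_features(OCR_STRINGS))
```

With the sample data, the only feature shared by at least half of the strings and missing from the word list is the misspelling "v1agra", so a message containing that token in any letter case would be flagged.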
20 Claims
1. A computer-implemented method of enabling spam detection comprising:
identifying a set of images as being spam;
applying optical character recognition (OCR) techniques to said images to provide text strings representative of content of individual said images;
applying automated techniques to said text strings to identify common text-related features and patterns of a plurality of said text strings, wherein said common text-related features and patterns are determined to be indicative of spam;
generating spam-detection rules based on identifications of said common text-related features and patterns; and
applying said spam-detection rules to electronic communications to detect occurrences of spam within said electronic communications.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for determining spam-detection rules comprising:
a supply of known image spam, each said known image spam including an image designated as being spam;
an optical character recognition (OCR) component having an input to receive said known image spam, said OCR component being configured to form at least one text string for each said known image spam that includes text;
a pattern recognition component connected to said OCR component to receive said text strings, said pattern recognition component being configured to identify common text-related features and patterns among text strings formed at said OCR component; and
a rules generation component connected to said pattern recognition component, said rules generation component being configured to generate spam-detection rules on a basis of said common text-related features and patterns.
Dependent claims: 13, 14, 15, 16
17. A computer-implemented method comprising:
utilizing spam-detection rules to identify spam;
collecting spam images which remain unidentified as spam by said spam-detection rules;
applying OCR processing to said spam images to generate text strings representative of text contained in said spam images;
using automated techniques to identify commonalities among said text strings, where said commonalities are inconsistent with language construction;
generating additional spam-detection rules based on said commonalities; and
providing an update for subsequent detections of spam.
Dependent claims: 18, 19, 20
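The update loop recited in claim 17 might be sketched as follows. This is an illustration under stated assumptions, not the claimed method itself: rules are modeled as plain lowercase substrings, the inputs are assumed to be text strings already produced by OCR from known spam images, and `looks_unnatural` is a hypothetical heuristic standing in for "inconsistent with language construction" (here, digits or symbols embedded between letters).

```python
import re
from collections import Counter


def looks_unnatural(token):
    """Assumed heuristic: digits or symbols embedded between letters, e.g. 'm0rtgage'."""
    return bool(re.search(r"[a-z][0-9$@!]+[a-z]", token))


def build_update(existing_rules, spam_texts, min_count=2):
    """Return additional substring rules derived from spam the current rules miss."""
    # Collect OCR text from spam images that no existing rule identifies.
    missed = [
        text for text in spam_texts
        if not any(rule in text.lower() for rule in existing_rules)
    ]
    # Count unnatural tokens once per missed text to find commonalities.
    counts = Counter()
    for text in missed:
        tokens = set(text.lower().split())
        counts.update(tok for tok in tokens if looks_unnatural(tok))
    # Keep only tokens shared by at least min_count missed texts.
    return sorted(tok for tok, n in counts.items() if n >= min_count)


# Hypothetical sample data for illustration only.
existing = ["v1agra"]
ocr_texts = [
    "cheap v1agra",
    "m0rtgage rates low",
    "best m0rtgage deal",
    "l0an offer",
]
update = build_update(existing, ocr_texts)
```

In this sample, the first text is already caught by the existing rule, "l0an" appears in only one missed text, and "m0rtgage" appears in two, so the update contains the single new rule "m0rtgage". The returned list could then be distributed to running systems as the update for subsequent detections.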
Specification