System and method for identifying text-based SPAM in rasterized images
First Claim
1. A computer implemented method for identifying spam in an image, the method comprising:
(a) identifying a plurality of contours in the image, the contours corresponding to probable symbols;
(b) ignoring contours that are too small or too large;
(c) identifying text lines in the image, based on the remaining contours;
(d) parsing the text lines into words;
(e) ignoring words that are too short or too long from the identified text lines;
(f) ignoring text lines that are too short;
(g) verifying that the image contains text by comparing a number of pixels of a symbol color within remaining contours to a total number of pixels of the symbol color in the image; and
(h) if the image contains text, rendering a spam/no spam verdict based on a contour representation of the text that remains after step (f).
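Steps (b) and (g) can be illustrated with a minimal sketch. The claim gives no thresholds and does not say how a contour is stored, so everything concrete here is an assumption made for illustration: the `Box` bounding-box type, the `min_side`, `max_side` and `min_ratio` values, the image as a list of rows of color values, and the use of bounding boxes as a stand-in for the contour interiors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one contour (hypothetical representation)."""
    x: int
    y: int
    w: int
    h: int


def filter_by_size(boxes, min_side=3, max_side=60):
    """Step (b): drop contours whose box is too small or too large.

    The limits are illustrative assumptions, not values from the claim.
    """
    return [b for b in boxes
            if min_side <= b.w <= max_side and min_side <= b.h <= max_side]


def symbol_pixels_inside(image, boxes, symbol_color):
    """Count symbol-colored pixels covered by at least one box.

    Boxes are assumed to lie fully inside the image.
    """
    covered = set()
    for b in boxes:
        for yy in range(b.y, b.y + b.h):
            for xx in range(b.x, b.x + b.w):
                if image[yy][xx] == symbol_color:
                    covered.add((xx, yy))
    return len(covered)


def contains_text(image, boxes, symbol_color=1, min_ratio=0.5):
    """Step (g): compare symbol-colored pixels inside the remaining contours
    with all symbol-colored pixels in the image.

    The decision rule (ratio >= min_ratio) is an assumption.
    """
    total = sum(row.count(symbol_color) for row in image)
    if total == 0:
        return False
    return symbol_pixels_inside(image, boxes, symbol_color) / total >= min_ratio
```

If most symbol-colored pixels fall outside the surviving contours, the image is more likely a drawing or photograph than text, which is why a ratio test is a plausible reading of step (g).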
Abstract
A system, method and computer program product for identifying spam in an image, including (a) identifying a plurality of contours in the image, the contours corresponding to probable symbols; (b) ignoring contours that are too small or too large; (c) identifying text lines in the image, based on the remaining contours; (d) parsing the text lines into words; (e) ignoring words that are too short or too long from the identified text lines; (f) ignoring text lines that are too short; (g) verifying that the image contains text by comparing a number of pixels of a symbol color within remaining contours to a total number of pixels of the symbol color in the image, and that there is at least one text line after filtration; and (h) if the image contains text, rendering a spam/no spam verdict based on a contour representation of the text that appears after step (f).
19 Claims
1. A computer implemented method for identifying spam in an image, the method comprising:
(a) identifying a plurality of contours in the image, the contours corresponding to probable symbols;
(b) ignoring contours that are too small or too large;
(c) identifying text lines in the image, based on the remaining contours;
(d) parsing the text lines into words;
(e) ignoring words that are too short or too long from the identified text lines;
(f) ignoring text lines that are too short;
(g) verifying that the image contains text by comparing a number of pixels of a symbol color within remaining contours to a total number of pixels of the symbol color in the image; and
(h) if the image contains text, rendering a spam/no spam verdict based on a contour representation of the text that remains after step (f).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for identifying spam in an image, the system executing the steps of:
(a) identifying a plurality of contours in the image, the contours corresponding to probable symbols;
(b) ignoring contours that are too small or too large;
(c) identifying text lines in the image, based on the remaining contours;
(d) parsing the text lines into words;
(e) ignoring words that are too short or too long, from the identified text lines;
(f) ignoring text lines that are too short;
(g) verifying that the image contains text by comparing a number of pixels of a symbol color within remaining contours to a total number of pixels of the symbol color in the image; and
(h) rendering a spam/no spam verdict based on a contour representation of the remaining text lines and the remaining words.
Dependent claims: 9, 10, 11, 12, 13, 14
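Steps (d) through (f) can be sketched as follows. The claim does not state how words are delimited or how line length is measured, so this sketch assumes, purely for illustration, that each symbol is a horizontal span `(x, width)`, that a word ends where the gap to the next symbol exceeds `gap_threshold`, and that a line is "too short" when fewer than `min_line` symbols survive word filtering. All function names and default values are hypothetical.

```python
def split_into_words(line, gap_threshold):
    """Step (d): split a line into words at large horizontal gaps.

    `line` is a list of (x, width) spans sorted by x (assumed representation).
    """
    words, current = [], []
    prev_right = None
    for x, w in line:
        if prev_right is not None and x - prev_right > gap_threshold:
            words.append(current)
            current = []
        current.append((x, w))
        prev_right = x + w
    if current:
        words.append(current)
    return words


def filter_words_and_lines(lines, gap_threshold=4, min_word=2,
                           max_word=15, min_line=2):
    """Steps (e) and (f): drop implausible words, then drop short lines.

    Word length is counted in symbols; all limits are assumptions.
    """
    kept = []
    for line in lines:
        words = [w for w in split_into_words(sorted(line), gap_threshold)
                 if min_word <= len(w) <= max_word]
        if sum(len(w) for w in words) >= min_line:
            kept.append(words)
    return kept
```

For example, a line with gaps of 1, 1, 13, 15 and 1 between six symbols splits into words of 3, 1 and 2 symbols; the single-symbol word is ignored and the line is kept with two words.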
15. A computer implemented method for identifying text in an image, the method comprising:
(a) identifying a plurality of contours in the image, the contours corresponding to probable symbols, each contour forming a closed boundary around each probable symbol;
(b) for each contour, identifying adjacent contours that are within 2Xim of the contour, to the left and right, as belonging to the same text line, wherein Xim is a most frequent distance between all adjacent contour pairs in all text lines;
(c) identifying text lines in the image, based on the adjacent contours, wherein adjacent contours belong to the same text line;
(d) ignoring text lines that are too short and parsing the remaining text lines into words; and
(e) identifying presence of text in the image based on the words.
Dependent claims: 16, 17, 18, 19
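The 2Xim rule in steps (b) and (c) of claim 15 can be sketched as below, reading 2Xim as twice Xim. Several things are assumptions for illustration only: contours are reduced to horizontal spans `(x, width)`, they are already bucketed into candidate rows (for instance by vertical overlap), and "distance" means the horizontal gap between the right edge of one span and the left edge of the next. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_gap(rows):
    """Xim: the most frequent gap between neighbouring contours over all rows."""
    gaps = []
    for row in rows:
        spans = sorted(row)
        gaps += [b[0] - (a[0] + a[1]) for a, b in zip(spans, spans[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]


def group_into_lines(rows):
    """Join neighbours whose gap is at most 2 * Xim into the same text line."""
    xim = most_frequent_gap(rows)
    lines = []
    for row in rows:
        spans = sorted(row)
        if not spans:
            continue
        current = [spans[0]]
        for a, b in zip(spans, spans[1:]):
            gap = b[0] - (a[0] + a[1])
            if xim is not None and gap <= 2 * xim:
                current.append(b)
            else:
                # Gap too wide: close the current line and start a new one.
                lines.append(current)
                current = [b]
        lines.append(current)
    return lines
```

Using the most frequent gap rather than the mean keeps a few very wide gaps (column breaks, margins) from inflating the threshold, which is one plausible reason for defining Xim as a mode.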
Specification