Systems and methods for identifying spam messages using subject information
First Claim
1. A system for identifying a spam email message, the system comprising:
- a computing platform including computing hardware of at least one processor, a memory operably coupled to the at least one processor and configured to store instructions invoked by the at least one processor, an operating system implemented on the computing hardware, and input/output facilities;
a rules database configured to store a plurality of ratio determination rules including a set of conditions for a text string for which the rules are applied to determine an n-value of words in a gram and a k-value of words to skip in an input text;
a vectors database configured to store a plurality of known vectors, wherein the plurality of known vectors are classified by thematic category;
instructions that, when executed on the computing platform, cause the computing platform to implement;
a message processing tool configured to receive an email message via the input/output facilities, the email message containing a subject field,
a gram building tool configured to build a k-skip-n-gram set of word combinations according to the ratio of the k-value and the n-value for the subject field as the input text as determined by the ratio determination rules in the rules database,
a vector building tool configured to receive, from the gram building tool, the k-skip-n-gram set of word combinations, and build a vector for each k-skip-n-gram word combination, and
a spam identification tool configured to determine a spam presence threshold based on the cosine similarity for each k-skip-n-gram word combination and the plurality of known vectors for the particular email message subject field thematic category, and determine that the email message contains spam when the spam presence threshold is exceeded.
Abstract
Systems and methods for identifying a spam email message. A system can include a rules database configured to store a plurality of ratio determination rules, a vectors database configured to store a plurality of known vectors, a message processing tool configured to receive an email message, a gram building tool configured to build a k-skip-n-gram set of word combinations according to the ratio determination rules, a vector building tool configured to receive the k-skip-n-gram set of word combinations, and build a vector for each k-skip-n-gram word combination, and a spam identification tool configured to determine a spam presence threshold based on the cosine similarity for each k-skip-n-gram word combination and the plurality of known vectors for the particular email message subject field subject category, and determine that the email message contains spam when the spam presence threshold is exceeded.
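As an illustration of the gram building step named in the abstract, the following is a minimal Python sketch of k-skip-n-gram construction over a subject line. It assumes the common definition in which a gram is any in-order combination of n words that skips at most k words in total; the source does not state whether "at most k" or "exactly k" is intended. The function name `k_skip_n_grams` and the whitespace tokenization are hypothetical choices made for the example.

```python
from itertools import combinations


def k_skip_n_grams(text: str, n: int, k: int) -> list[tuple[str, ...]]:
    """Return in-order n-word combinations that skip at most k words in total.

    Assumption: "at most k skips" and lowercase whitespace tokenization;
    neither detail is specified in the source.
    """
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    words = text.lower().split()
    grams = []
    for idx in combinations(range(len(words)), n):
        # Words skipped = positions spanned by the gram minus the n words used.
        skipped = idx[-1] - idx[0] + 1 - n
        if skipped <= k:
            grams.append(tuple(words[i] for i in idx))
    return grams


subject = "win a free prize today"
one_skip_bigrams = k_skip_n_grams(subject, n=2, k=1)
```

With k=0 the function reduces to ordinary contiguous n-grams; raising k adds combinations that bridge filler words, which is the usual motivation for skip-grams on short texts such as subject lines.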
51 Citations
20 Claims
1. A system for identifying a spam email message, the system comprising:
a computing platform including computing hardware of at least one processor, a memory operably coupled to the at least one processor and configured to store instructions invoked by the at least one processor, an operating system implemented on the computing hardware, and input/output facilities;
a rules database configured to store a plurality of ratio determination rules including a set of conditions for a text string for which the rules are applied to determine an n-value of words in a gram and a k-value of words to skip in an input text;
a vectors database configured to store a plurality of known vectors, wherein the plurality of known vectors are classified by thematic category;
instructions that, when executed on the computing platform, cause the computing platform to implement;
a message processing tool configured to receive an email message via the input/output facilities, the email message containing a subject field,
a gram building tool configured to build a k-skip-n-gram set of word combinations according to the ratio of the k-value and the n-value for the subject field as the input text as determined by the ratio determination rules in the rules database,
a vector building tool configured to receive, from the gram building tool, the k-skip-n-gram set of word combinations, and build a vector for each k-skip-n-gram word combination, and
a spam identification tool configured to determine a spam presence threshold based on the cosine similarity for each k-skip-n-gram word combination and the plurality of known vectors for the particular email message subject field thematic category, and determine that the email message contains spam when the spam presence threshold is exceeded.
Dependent Claims (2, 3, 4, 5, 6, 7, 8)
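The ratio determination rules recited in claim 1 pair a set of conditions on a text string with an n-value and a k-value. One way to picture such a rules database is an ordered list of condition/value pairs, evaluated until one matches. The sketch below is only an assumption about how that could look: the `RatioRule` type, the `determine_n_k` function, the word-count conditions, and the specific n and k values are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RatioRule:
    """One hypothetical rule: a condition on the subject words plus (n, k)."""
    condition: Callable[[list[str]], bool]
    n: int
    k: int


# Hypothetical rules keyed on word count; the source lists no concrete
# conditions or values.
RULES = [
    RatioRule(lambda words: len(words) <= 3, n=2, k=0),
    RatioRule(lambda words: len(words) <= 8, n=2, k=1),
    RatioRule(lambda words: True, n=3, k=2),  # fallback for long subjects
]


def determine_n_k(subject: str, rules: list[RatioRule] = RULES) -> tuple[int, int]:
    """Return (n, k) from the first rule whose condition matches the subject."""
    words = subject.split()
    for rule in rules:
        if rule.condition(words):
            return rule.n, rule.k
    raise LookupError("no ratio determination rule matched")


n, k = determine_n_k("win a free prize today")
```

Keeping the rules as data rather than hard-coded branches mirrors the claim's separation between the rules database and the gram building tool that consumes the resulting n-value and k-value.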
9. A method for identifying a spam email message with a computing platform including computing hardware of at least one processor, a memory operably coupled to the at least one processor and configured to store instructions invoked by the at least one processor, the method comprising:
receiving an email message with a messaging processing tool invoked by the at least one processor, the email message containing a subject field text;
determining, with the messaging processing tool, at least one parameter of the subject field text;
determining, with a ratio determination tool invoked by the at least one processor, an n-value of words in a gram and a k-value of words to skip in the subject field text based on the at least one parameter of the subject field text and a plurality of ratio determination rules;
building, with a gram building tool invoked by the at least one processor, a k-skip-n-gram set of word combinations of the subject field text according to the ratio of the k-value and the n-value;
building, with a vector building tool invoked by the at least one processor, a vector for each k-skip-n-gram word in the set of word combinations;
calculating, with the vector building tool, a cosine similarity for each k-skip-n-gram word combination and a plurality of known vectors, wherein the plurality of known vectors are classified by thematic category;
determining, with the vector building tool, a message thematic category based on the calculated rates of cosine similarity for each k-skip-n-gram word combination to the known vectors and the classified thematic categories of the known vectors;
determining, with a spam identification tool invoked by the at least one processor, a spam presence threshold based on the cosine similarity for each k-skip-n-gram word combination and the plurality of known vectors for the determined email message thematic category;
calculating, with the spam identification tool, a current spam presence value based at least on the rates of cosine similarity between all built vectors; and
determining, with the spam identification tool, that the email message contains spam when the spam presence threshold is exceeded by the current spam presence value.
Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17)
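The comparison steps of claim 9 (cosine similarity against known vectors, selection of a thematic category, and a threshold test) can be sketched as follows. This is an illustrative simplification, not the claimed method: it assumes gram vectors have already been built as plain lists of floats, scores each category by the average best-match similarity, and takes the per-category threshold as a supplied input rather than deriving it. The names `cosine_similarity`, `classify_subject`, and the example categories are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_subject(gram_vectors, known_vectors, thresholds):
    """Return (category, spam_presence_value, is_spam).

    known_vectors: {category: [vector, ...]}; thresholds: {category: float}.
    Assumption: the spam presence value is the mean best-match similarity
    within the chosen category; the source does not give the exact formula.
    """
    if not gram_vectors:
        raise ValueError("at least one gram vector is required")
    scores = {}
    for category, vectors in known_vectors.items():
        # Best similarity of each gram vector to any known vector in this category.
        best = [
            max((cosine_similarity(g, v) for v in vectors), default=0.0)
            for g in gram_vectors
        ]
        scores[category] = sum(best) / len(best)
    category = max(scores, key=scores.get)
    value = scores[category]
    return category, value, value > thresholds[category]


# Hypothetical two-dimensional known vectors for two thematic categories.
known = {"lottery": [[1.0, 0.0]], "pharma": [[0.0, 1.0]]}
limits = {"lottery": 0.8, "pharma": 0.8}
category, value, is_spam = classify_subject([[1.0, 0.0], [1.0, 0.0]], known, limits)
```

In this toy input both gram vectors coincide with the known "lottery" vector, so the similarity is 1.0 and the threshold of 0.8 is exceeded; a vector lying midway between the two categories would score about 0.71 and fall below it.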
18. A system for subject-line email analysis on a computing platform including computing hardware of at least one processor, a memory operably coupled to the at least one processor and configured to store instructions invoked by the at least one processor, the system comprising:
instructions that, when executed on the computing platform, cause the computing platform to implement;
a message processing engine communicatively coupled to an email server and configured to receive an email message, the email message containing a subject line;
a gram building engine configured to build a set of word combinations for the subject-line according to the formula;
Dependent Claims (19, 20)
Specification