Identifying malicious text in advertisement content
First Claim
1. A method comprising:
retrieving, by a processor of an online system, text included in advertisement content of an advertisement (“ad”) request for presentation to a user of the online system;
identifying, by the processor of the online system, one or more words included in the advertisement content;
identifying, by the processor of the online system, one or more Unicode characters comprising each of the one or more words, each of the one or more Unicode characters being associated with a range of Unicode characters that comprise to a Unicode block of a plurality of Unicode blocks;
determining, for each Unicode character of the one or more Unicode characters included in each of the one or more words, a Unicode block associated with the Unicode character;
determining, by the processor of the online system, a score for each word of the one or more words by;
determining, for each of the identified one or more words, a most common Unicode block associated with the one or more Unicode characters in the word;
determining a conditional probability of the one or more Unicode characters being included in the word belonging to a specific Unicode block based at least in part on a number of Unicode characters in the word and a number of Unicode characters in the word associated with the most common Unicode block associated with the Unicode characters in the word; and
determining the score for the word based at least in part on the determined conditional probability, a word of the one or more words comprising Unicode characters associated with a same Unicode block having a higher determined score relative to a word comprising Unicode characters associated with two or more different Unicode blocks;
generating, by the processor of the online system, a combined score for the advertisement based on the determined scores of each word of the one or more words;
determining, by the processor of the online system, that the advertisement content is offensive based at least in part on the combined score for the advertisement being less than a threshold value; and
responsive to the combined score for the advertisement being less than the threshold value, determining, by the processor of the online system, that the advertisement content is ineligible for presentation to the user of the online system based at least in part on the determination that the advertisement content is offensive.
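The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and several parts are assumptions made only for illustration: the `BLOCKS` table covers just four Unicode blocks, the per-word score uses the fraction of characters in the most common block as a stand-in for the claimed conditional probability, the combined score is the minimum word score, and the threshold of 0.9 is arbitrary. All function names are hypothetical.

```python
from collections import Counter

# Small illustrative subset of the Unicode block ranges (assumption: a real
# system would load the full block list from the Unicode Character Database).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def block_of(ch):
    """Return the name of the Unicode block containing ch, or "Other"."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"


def word_score(word):
    """Score a word from its length and its most-common-block count.

    Stand-in for the claimed conditional probability (assumption): the
    fraction of characters that belong to the word's most common block.
    A single-block word scores 1.0; mixed-block words score lower.
    """
    counts = Counter(block_of(ch) for ch in word)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(word)


def ad_score(text):
    """Combine word scores; taking the minimum is a hypothetical choice."""
    words = text.split()
    if not words:
        return 1.0
    return min(word_score(w) for w in words)


def is_ineligible(text, threshold=0.9):
    """Flag the ad when its combined score falls below the threshold."""
    return ad_score(text) < threshold


if __name__ == "__main__":
    spoofed = "cheap p\u0430ypal login"  # U+0430 is CYRILLIC SMALL LETTER A
    print(ad_score(spoofed), is_ineligible(spoofed))
    print(ad_score("cheap paypal login"), is_ineligible("cheap paypal login"))
```

In this sketch the word containing one Cyrillic character among five Basic Latin characters scores 5/6, which pulls the combined score below the assumed threshold, while the all-Latin text scores 1.0 and stays eligible.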
Abstract
An online system receives advertisement requests from one or more advertisers and determines whether an advertisement request includes malicious content before presenting content from the advertisement request to a user. To determine whether the advertisement request includes malicious content, the online system identifies text in the advertisement request, identifies words in the text, and identifies characters in each word. The online system identifies a most common type of character in each word and generates a score for each word based on its constituent characters. For example, a word's score is based on the combination of characters in the word, such as a conditional probability of a word including a type of character given that the word includes a given number of the most common type of character. The scores are analyzed to determine if text in the advertisement request includes malicious content.
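The abstract does not say how the conditional probability is obtained. One hypothetical way to estimate it, sketched below in Python, is to count, over a list of trusted words, how often a word of length n has k characters in its most common block, and to use the relative frequency of k given n as the probability. The trusted word list, the four-entry `BLOCKS` table, and all function names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Small illustrative subset of Unicode block ranges (assumption).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def block_of(ch):
    """Return the name of the Unicode block containing ch, or "Other"."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"


def shape(word):
    """Return (n, k): word length and count of most-common-block characters."""
    counts = Counter(block_of(ch) for ch in word)
    return len(word), counts.most_common(1)[0][1]


def fit(trusted_words):
    """Tally how often each k occurs for each word length n."""
    by_length = defaultdict(Counter)
    for word in trusted_words:
        n, k = shape(word)
        by_length[n][k] += 1
    return by_length


def conditional_probability(model, word):
    """Relative frequency of k given n among the trusted words.

    Returns 0.0 for word lengths never seen in the trusted list; a real
    system would likely need smoothing here (assumption).
    """
    n, k = shape(word)
    seen = model.get(n, Counter())
    total = sum(seen.values())
    return seen[k] / total if total else 0.0


if __name__ == "__main__":
    model = fit(["paypal", "login", "secure", "\u043f\u0440\u0438\u0432\u0435\u0442"])
    print(conditional_probability(model, "paypal"))
    print(conditional_probability(model, "p\u0430ypal"))
```

Under this estimate, a six-character word drawn entirely from one block matches every six-character trusted word, while a six-character word with one out-of-block character matches none of them.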
14 Claims
1. A method comprising:
retrieving, by a processor of an online system, text included in advertisement content of an advertisement (“ad”) request for presentation to a user of the online system;
identifying, by the processor of the online system, one or more words included in the advertisement content;
identifying, by the processor of the online system, one or more Unicode characters comprising each of the one or more words, each of the one or more Unicode characters being associated with a range of Unicode characters that comprise to a Unicode block of a plurality of Unicode blocks;
determining, for each Unicode character of the one or more Unicode characters included in each of the one or more words, a Unicode block associated with the Unicode character;
determining, by the processor of the online system, a score for each word of the one or more words by;
determining, for each of the identified one or more words, a most common Unicode block associated with the one or more Unicode characters in the word;
determining a conditional probability of the one or more Unicode characters being included in the word belonging to a specific Unicode block based at least in part on a number of Unicode characters in the word and a number of Unicode characters in the word associated with the most common Unicode block associated with the Unicode characters in the word; and
determining the score for the word based at least in part on the determined conditional probability, a word of the one or more words comprising Unicode characters associated with a same Unicode block having a higher determined score relative to a word comprising Unicode characters associated with two or more different Unicode blocks;
generating, by the processor of the online system, a combined score for the advertisement based on the determined scores of each word of the one or more words;
determining, by the processor of the online system, that the advertisement content is offensive based at least in part on the combined score for the advertisement being less than a threshold value; and
responsive to the combined score for the advertisement being less than the threshold value, determining, by the processor of the online system, that the advertisement content is ineligible for presentation to the user of the online system based at least in part on the determination that the advertisement content is offensive.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method comprising:
retrieving, by a processor of an online system, text included in advertisement content of an advertisement (“ad”) request for presentation to a user of the online system;
identifying, by the processor of the online system, one or more words included in the advertisement content;
identifying a Unicode block associated with each of one or more characters in each of the identified one or more words, each of the one or more characters being associated with a range of characters that comprise to a Unicode block of a plurality of Unicode blocks;
scoring, by the processor of the online system, each word from the identified one or more words by;
determining, for each of the identified one or more words, a most common Unicode block associated with the one or more characters in the word;
determining a conditional probability of the one or more characters being included in the word belonging to a specific Unicode block based at least in part on a number of characters in the word and a number of characters in the word associated with the most common Unicode block associated with the characters in the word; and
determining a score for the word based at least in part on the determined conditional probability, wherein a word of the one or more words comprising characters associated with a same Unicode block having a higher determined score relative to a word comprising characters associated with two or more different Unicode blocks;
generating, by the processor of the online system, a combined score for the advertisement based on the determined scores of each word of the one or more words;
determining, by the processor of the online system, that the advertisement content includes offensive content based at least in part on the combined score for the advertisement being less than a threshold value; and
responsive to the combined score for the advertisement being less than the threshold value, determining, by the processor of the online system, that the advertisement content is ineligible for presentation to the user of the online system based at least in part on the determination that the advertisement content includes offensive content.
Dependent claims: 9, 10, 11
12. A computer program product comprising a non-transitory computer-readable storage medium having instructions encoded thereon that, when executed by a processor, cause the processor to:
retrieve text included in advertisement content of an advertisement (“ad”) request for presentation to a user of an online system;
identify one or more words included in the advertisement content;
identify a Unicode block associated with each of one or more characters in each of the identified one or more words, each of the one or more characters being associated with a range of characters that comprise to a Unicode block of a plurality of Unicode blocks;
score each word from the identified one or more words by;
determining, for each of the identified one or more words, a most common Unicode block associated with the one or more characters in the word;
determining a conditional probability of the one or more characters being included in the word belonging to a specific Unicode block based at least in part on a number of characters in the word and a number of characters in the word associated with the most common Unicode block associated with the characters in the word; and
determining the score associated with the word based at least in part on the determined conditional probability, wherein a word of the one or more words comprising characters associated with a same Unicode block having a higher determined score relative to a word comprising characters associated with two or more different Unicode blocks;
generate a combined score for the advertisement based on the determined scores of each word of the one or more words;
determine that the advertisement content includes offensive content based at least in part on the combined score for the advertisement being less than a threshold value; and
responsive to the combined score for the advertisement being less than the threshold value, determine that the advertisement content is ineligible for presentation to the user of the online system based at least in part on the determination that the advertisement content includes offensive content.
Dependent claims: 13, 14
Specification