MALICIOUS UNIFORM RESOURCE LOCATOR DETECTION
Abstract
The techniques described herein use training data to train classification models to detect malicious Uniform Resource Locators (URLs) that target authentic resources (e.g., a Web page, a Web site, or another network location accessed via a URL). The techniques train the classification models using one or more machine learning algorithms. The training data may include known benign URLs and known malicious URLs (e.g., training URLs) that are associated with a target authentic resource. The techniques then use the trained classification models to determine whether an unknown URL is a malicious URL. The malicious URL determination may be based on one or more extracted lexical features (e.g., brand name edit distances for a domain and path of the URL) and/or extracted site/page features (e.g., a domain age and a domain confidence level).
20 Claims
1. A method comprising:
receiving a uniform resource locator (URL) that includes one or more substrings, wherein each substring comprises a plurality of alphanumeric characters;
extracting, via one or more processors, a plurality of features associated with the URL;
determining, as at least one of the plurality of features, a similarity measure between a whole or part of the URL and a brand name associated with an authentic resource or a legitimate entity; and
applying one or more classification models to the one or more features to determine whether a resource located by the URL is an unauthentic resource.
Dependent claims: 2, 3, 4, 5, 6, 7
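The similarity measure recited in claim 1 can be pictured with the short sketch below. It is an illustration under stated assumptions, not the patented method: the claim does not name a particular measure, so the ratio from Python's `difflib.SequenceMatcher` stands in for it, and the function name `brand_similarity` and the example URLs are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def brand_similarity(url: str, brand: str) -> float:
    """Return the best similarity ratio (0.0 to 1.0) between the brand name
    and any alphanumeric substring of the URL's host or path.

    Assumption: SequenceMatcher.ratio() is used as a stand-in similarity
    measure; the claim itself does not prescribe one.
    """
    parts = urlsplit(url.lower())
    # Break the host and path into runs of alphanumeric characters.
    substrings = re.findall(r"[a-z0-9]+", f"{parts.netloc}/{parts.path}")
    brand = brand.lower()
    return max(
        (SequenceMatcher(None, s, brand).ratio() for s in substrings),
        default=0.0,  # no alphanumeric substrings at all
    )


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(brand_similarity("https://www.paypal.com/signin", "PayPal"))  # exact match: 1.0
    print(brand_similarity("http://paypa1.example/signin", "PayPal"))   # near miss, below 1.0
```

A score close to but below 1.0 indicates a near-miss spelling of the brand. In the claimed method such a value would be one feature among several passed to the classification models, not a verdict by itself.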
8. One or more computer-readable storage media comprising instructions that, when executed by a processor, perform operations comprising:
receiving a uniform resource locator (URL) that includes a plurality of tokens, wherein each token comprises one or more characters;
parsing the URL to identify a substring in the URL that includes one or more tokens;
determining an edit distance between the substring and a brand name;
applying classification criteria to the edit distance to determine whether a resource located by the URL is a counterfeit resource; and
classifying the URL as a malicious URL when the resource located by the URL is the counterfeit resource.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
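The operations of claim 8 (tokenize, form a substring, measure edit distance, apply criteria) can be sketched as follows. This is a minimal illustration with several assumptions: Levenshtein distance is used as the edit distance, tokens are split on non-alphanumeric delimiters, and the threshold rule in `looks_counterfeit` is a hypothetical classification criterion. None of these choices is stated in the claim.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tokenize(url: str) -> list[str]:
    """Drop the scheme, then split the URL on non-alphanumeric delimiters."""
    rest = re.sub(r"^[a-z][a-z0-9+.\-]*://", "", url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", rest) if t]


def min_brand_distance(url: str, brand: str, max_tokens: int = 2) -> int:
    """Smallest edit distance between the brand and any substring built
    from 1..max_tokens adjacent tokens of the URL."""
    tokens = tokenize(url)
    brand = brand.lower()
    best = len(brand)  # distance from the empty string, used as a ceiling
    for n in range(1, max_tokens + 1):
        for i in range(len(tokens) - n + 1):
            best = min(best, edit_distance("".join(tokens[i:i + n]), brand))
    return best


def looks_counterfeit(url: str, brand: str, threshold: int = 2) -> bool:
    """Hypothetical criterion: flag near misses (distance 1..threshold).
    Exact matches are left to other features in this sketch."""
    return 0 < min_brand_distance(url, brand) <= threshold


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for u in ("http://www.paypa1.com/signin", "https://www.paypal.com/"):
        print(u, looks_counterfeit(u, "PayPal"))
```

Joining adjacent tokens lets the sketch catch a brand name that has been split by a delimiter, such as a hyphen inserted into a misspelled name.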
16. A system comprising:
one or more processors;
one or more memories;
a uniform resource locator (URL) input module, stored on the one or more memories and operable by the one or more processors, to collect a plurality of training URLs;
a feature extraction module, stored on the one or more memories and operable by the one or more processors, to extract features associated with each of the plurality of training URLs;
one or more machine learning algorithms, stored on the one or more memories and operable by the one or more processors, to train one or more classification models based on the features associated with each of the plurality of training URLs; and
a malicious URL detection module, stored on the one or more memories and operable by the one or more processors, to apply the one or more classification models to an unknown URL and predict that the unknown URL is a malicious URL based at least in part on a similarity measure between a deceptive brand name text string included in the unknown URL and a real brand name text string.
Dependent claims: 17, 18, 19, 20
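The module pipeline of claim 16 (collect training URLs, extract features, train a model, apply it to an unknown URL) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the claim does not name a learning algorithm, so a plain perceptron is used; the features are host and path dissimilarity to a single hypothetical brand; and the labeled URLs are made-up toy examples, not real data.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit

BRAND = "paypal"  # hypothetical target brand for this sketch


def dissimilarity(text: str, brand: str = BRAND) -> float:
    """1 minus the best SequenceMatcher ratio between the brand and any
    alphanumeric token in text (0.0 means an exact token match)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    best = max((SequenceMatcher(None, t, brand).ratio() for t in tokens),
               default=0.0)
    return 1.0 - best


def extract_features(url: str) -> list[float]:
    """Feature vector: [host dissimilarity, path dissimilarity, bias]."""
    parts = urlsplit(url)
    return [dissimilarity(parts.hostname or ""), dissimilarity(parts.path), 1.0]


def train_perceptron(samples, max_epochs: int = 5000) -> list[float]:
    """Train on (url, label) pairs; label is +1 (malicious) or -1 (benign)."""
    data = [(extract_features(u), y) for u, y in samples]
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified: nudge weights toward y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every training URL is classified correctly
            break
    return w


def is_malicious(url: str, w: list[float]) -> bool:
    """Apply the trained model to a URL."""
    return sum(wi * xi for wi, xi in zip(w, extract_features(url))) > 0


# Made-up toy training set, for illustration only.
TRAINING_URLS = [
    ("https://www.paypal.com/signin", -1),
    ("https://paypal.com/help/contact", -1),
    ("http://paypa1.com/signin", +1),
    ("http://secure-paypol.example.net/login", +1),
    ("http://example-host.test/paypal/update", +1),
]
weights = train_perceptron(TRAINING_URLS)

if __name__ == "__main__":
    print(is_malicious("http://paypa1.net/signin", weights))
```

A perceptron was chosen only because it fits in a few lines and converges on this separable toy set. A practical system would presumably use richer features (for example, the site/page features the abstract mentions) and a sturdier learner.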
Specification