MALICIOUS UNIFORM RESOURCE LOCATOR DETECTION

US 20140298460A1
Filed: 03/26/2013
Published: 10/02/2014
Est. Priority Date: 03/26/2013
Status: Active Grant

Abstract

The techniques described herein use training data to train classification models to detect malicious Uniform Resource Locators (URLs) that target authentic resources (e.g., Web page, Web site, or other network locations accessed via a URL). The techniques train the classification models using one or more machine learning algorithms. The training data may include known benign URLs and known malicious URLs (e.g., training URLs) that are associated with a target authentic resource. The techniques then use the trained classification models to determine whether an unknown URL is a malicious URL. The malicious URL determination may be based on one or more lexical features (e.g., brand name edit distances for a domain and path of the URL) and/or site/page features (e.g., a domain age and a domain confidence level) extracted.
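
The lexical features named in the abstract are computed over the domain and the path of a URL. As a minimal sketch of that first step, a URL can be split into lower-cased alphanumeric tokens for each component. The helper name `url_tokens`, the example URL, and the rule of splitting on every non-alphanumeric character are assumptions made for illustration and are not taken from the patent.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url: str) -> dict:
    """Split a URL into lower-cased alphanumeric tokens for its domain and path.

    Hypothetical helper: the tokenization rule (split on any character that
    is not a letter or digit) is an assumption, not the patent's definition.
    """
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()
    return {
        "domain": [t for t in re.split(r"[^a-z0-9]+", host) if t],
        "path": [t for t in re.split(r"[^a-z0-9]+", path) if t],
    }


print(url_tokens("http://www.paypa1-secure.example.com/login/update.php"))
# {'domain': ['www', 'paypa1', 'secure', 'example', 'com'], 'path': ['login', 'update', 'php']}
```

Keeping the domain and path tokens separate allows a brand-name distance to be computed per component, as the abstract's example suggests.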

117 Citations

20 Claims

1. A method comprising:
- receiving a uniform resource locator (URL) that includes one or more substrings, wherein each substring comprises a plurality of alphanumeric characters;
  
  extracting, via one or more processors, a plurality of features associated with the URL;
  
  determining, as at least one of the plurality of features, a similarity measure between a whole or part of the URL and a brand name associated with an authentic resource or a legitimate entity; and
  
  applying one or more classification models to the one or more features to determine whether a resource located by the URL is an unauthentic resource.
  - 2. The method as recited in claim 1, wherein the similarity measure is the smallest edit distance of substrings of the whole or part of the URL and the brand name when the URL is not on a white list.
  - 3. The method as recited in claim 1, further comprising classifying the URL as a malicious URL that targets the brand name in response to the one or more classification models determining that the resource located by the URL is an unauthentic resource.
  - 4. The method as recited in claim 1, wherein the brand name is at least one of a company name, a product name, a team name, a trademark, a marketing slogan, a celebrity name, or a second level domain that is commonly used by the authentic resource or the legitimate entity.
  - 5. The method as recited in claim 1, wherein at least one of the plurality of features is a domain age that determines how long a domain of the URL has been in existence.
  - 6. The method as recited in claim 1, wherein at least one of the plurality of features is a domain confidence level that determines a reliability of a domain or a second level domain of the URL, wherein the reliability is based on a ratio of a number of known benign URLs hosted by the domain or the second level domain of the URL compared to a number of known malicious URLs hosted by the domain or the second level domain.
  - 7. The method as recited in claim 1, further comprising:
    - receiving a plurality of training URLs known to be malicious URLs or benign URLs; and
      
      learning the one or more classification models using one or more machine learning algorithms based on features extracted from the plurality of training URLs.
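
The similarity measure of claims 1 and 2 can be illustrated with a standard Levenshtein edit distance, taking the smallest distance between any URL token and the brand name. This is a sketch under stated assumptions: the function names `edit_distance` and `brand_similarity`, the use of pre-split tokens as the "substrings", and the choice to return `None` for a white-listed host are all hypothetical, since the claims do not say what happens to white-listed URLs.

```python
from typing import Iterable, Optional, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def brand_similarity(tokens: Iterable[str], brand: str,
                     host: str, whitelist: Set[str]) -> Optional[int]:
    """Smallest edit distance between any token and the brand name.

    Assumption: white-listed hosts are skipped and reported as None.
    """
    if host in whitelist:
        return None
    return min((edit_distance(t, brand) for t in tokens), default=None)


print(brand_similarity(["www", "paypa1", "secure"], "paypal",
                       "www.paypa1-secure.example.com", {"www.paypal.com"}))  # 1
```

A small non-zero distance, as in `paypa1` versus `paypal`, is the kind of near-match a classification model could treat as a signal of brand imitation.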

8. One or more computer-readable storage media comprising instructions that, when executed by a processor, perform operations comprising:
- receiving a uniform resource locator (URL) that includes a plurality of tokens, wherein each token comprises one or more characters;
  
  parsing the URL to identify a substring in the URL that includes one or more tokens;
  
  determining an edit distance between the substring and a brand name;
  
  applying classification criteria to the edit distance to determine whether a resource located by the URL is a counterfeit resource; and
  
  classifying the URL as a malicious URL when the resource located by the URL is the counterfeit resource.
  - 9. The one or more computer-readable storage media as recited in claim 8, wherein the malicious URL is included as part of a phishing cyber attack.
  - 10. The one or more computer-readable storage media as recited in claim 8, wherein the edit distance is a minimum number of character level corrections that need to be performed on the substring so that the substring matches the brand name to which the substring is directed.
  - 11. The one or more computer-readable storage media as recited in claim 8, wherein the brand name is at least one of a company name, a product name, a team name, a trademark, a marketing slogan, a celebrity name, or a second level domain that is commonly used by an authentic resource or a legitimate entity.
  - 12. The one or more computer-readable storage media as recited in claim 8, wherein the operations further comprise:
    - determining a domain age of the URL, the domain age indicating how long the URL has been in existence; and
      
      applying the classification criteria to the domain age to determine that the resource located by the URL is the counterfeit resource.
  - 13. The one or more computer-readable storage media as recited in claim 8, wherein the operations further comprise:
    - determining a domain confidence level for the URL, the domain confidence level indicating a reliability of a domain or a second level domain of the URL based on a ratio of a number of known benign URLs hosted by the domain or the second level domain of the URL compared to a number of known malicious URLs hosted by the domain or the second level domain; and
      
      applying the classification criteria to the domain confidence level to determine that the resource located by the URL is the counterfeit resource.
  - 14. The one or more computer-readable storage media as recited in claim 8, wherein the operations further comprise:
    - receiving a plurality of training URLs known to be malicious URLs or benign URLs; and
      
      learning the classification criteria using one or more machine learning algorithms based on features extracted from the plurality of training URLs.
  - 15. The one or more computer-readable storage media as recited in claim 8, wherein the operations further comprise classifying the URL as a malicious URL that targets the brand name.
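
The two site/page features recited in claims 12 and 13 can be sketched as simple functions. The claims describe the domain confidence level only as a ratio of known benign URLs compared to known malicious URLs, so the normalized form `benign / (benign + malicious)`, the neutral default of 0.5 for an unseen domain, and the names `domain_age_days` and `domain_confidence` are assumptions. The registration date is passed in directly; how it would be looked up is outside this sketch.

```python
from datetime import date


def domain_age_days(registered_on: date, today: date) -> int:
    """Days the domain has existed as of `today` (never negative)."""
    return max((today - registered_on).days, 0)


def domain_confidence(benign_count: int, malicious_count: int,
                      default: float = 0.5) -> float:
    """Share of known URLs on the domain that are benign.

    Assumption: the claimed ratio is normalized to the range 0..1, and a
    domain with no known URLs receives a neutral default value.
    """
    total = benign_count + malicious_count
    if total == 0:
        return default
    return benign_count / total


print(domain_age_days(date(2013, 3, 1), date(2013, 3, 26)))  # 25
print(domain_confidence(3, 1))                               # 0.75
```

Normalizing the ratio avoids division by zero when a domain hosts no known malicious URLs, which a raw benign-to-malicious quotient would not.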

16. A system comprising:
- one or more processors;
  
  one or more memories;
  
  a uniform resource locator (URL) input module, stored on the one or more memories and operable by the one or more processors, to collect a plurality of training URLs;
  
  a feature extraction module, stored on the one or more memories and operable by the one or more processors, to extract features associated with each of the plurality of training URLs;
  
  one or more machine learning algorithms, stored on the one or more memories and operable by the one or more processors, to train one or more classification models based on the features associated with each of the plurality of training URLs; and
  
  a malicious URL detection module, stored on the one or more memories and operable by the one or more processors, to apply the one or more classification models to an unknown URL and predict that the unknown URL is a malicious URL based at least in part on a similarity measure between a deceptive brand name text string included in the unknown URL and a real brand name text string.
  - 17. The system as recited in claim 16, wherein the similarity measure is a minimum of edit distances that need to be performed on the deceptive brand name text string included in the unknown URL so that the deceptive brand name text string matches the real brand name text string.
  - 18. The system as recited in claim 16, wherein the malicious URL detection module is part of an email filter component, a search engine component, or a Web browser component.
  - 19. The system as recited in claim 16, wherein the feature extraction module extracts a domain age feature from the plurality of training URLs and the malicious URL detection module predicts that the unknown URL is the malicious URL based on a domain age of the unknown URL, wherein the domain age feature indicates how long a URL has been in existence.
  - 20. The system as recited in claim 16, wherein the feature extraction module extracts a domain confidence level feature from the plurality of training URLs and the malicious URL detection module predicts that the unknown URL is the malicious URL based on a domain confidence level of the unknown URL, wherein the domain confidence level feature indicates a reliability of a domain or a second level domain of a URL based on a ratio of a number of known benign URLs hosted by the domain or the second level domain of the URL compared to a number of known malicious URLs hosted by the domain or the second level domain.
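
The train-then-detect flow of claim 16 can be illustrated with a deliberately small learner. The claims do not name a machine learning algorithm, so the decision stump below (a single feature and threshold chosen to minimize training errors) is a stand-in chosen for brevity, not the patent's method. The names `train_stump` and `predict`, the feature order (brand edit distance, domain age in days, domain confidence level), and the toy training rows are all hypothetical.

```python
from typing import Sequence, Tuple

# (feature index, threshold, True if values below the threshold mean "malicious")
Stump = Tuple[int, float, bool]


def train_stump(X: Sequence[Sequence[float]], y: Sequence[int]) -> Stump:
    """Pick the single feature/threshold split with the fewest training errors."""
    best: Stump = (0, 0.0, True)
    best_errors = len(y) + 1
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for below in (True, False):
                errors = sum(
                    ((row[f] < thr) == below) != bool(label)
                    for row, label in zip(X, y)
                )
                if errors < best_errors:  # strict: the first best split wins ties
                    best, best_errors = (f, thr, below), errors
    return best


def predict(stump: Stump, features: Sequence[float]) -> int:
    """Return 1 (malicious) or 0 (benign) for one feature vector."""
    f, thr, below = stump
    return int((features[f] < thr) == below)


# Toy rows: [brand edit distance, domain age in days, domain confidence level]
X_train = [[1, 12, 0.10], [2, 30, 0.20], [1, 2000, 0.60],   # known malicious
           [6, 2500, 0.90], [5, 3000, 0.95], [7, 20, 0.50]]  # known benign
y_train = [1, 1, 1, 0, 0, 0]

stump = train_stump(X_train, y_train)
print(stump)                          # (0, 3.5, True)
print(predict(stump, [1, 3, 0.10]))   # 1
print(predict(stump, [6, 10, 0.20]))  # 0
```

On this toy data the learner selects the edit-distance feature, because a near-match to the brand name is the only feature that separates the two classes without error. A practical system would presumably combine several features in a stronger model.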

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Xue, Feng; Zhu, Bin Benjamin; Chu, Weibo

Granted Patent

US 9,178,901 B2
Field of Search
US Class Current

726/23
CPC Class Codes

H04L 63/1425 Traffic logging, e.g. anoma...

H04L 63/1483 service impersonation, e.g....
