Filter-based identification of malicious websites

US 8,850,570 B1
Filed: 06/30/2008
Issued: 09/30/2014
Est. Priority Date: 06/30/2008
Status: Active Grant

Abstract

A candidate suspicious website is identified. A plurality of lightweight features associated with the candidate suspicious website is identified. A filter score is determined based on the plurality of lightweight features, wherein the filter score indicates a likelihood that the candidate suspicious website is a malicious website. Whether the filter score exceeds a threshold is determined. Responsive at least in part to the filter score exceeding the threshold it is determined that the candidate suspicious website is a suspicious website. Whether the suspicious website is a malicious website is determined by identifying software downloaded to the computing system responsive to accessing the suspicious website and determining whether the software downloaded to the computing system is malware based on characteristics associated with the downloaded software.
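
The two-stage flow in the abstract (a cheap filter score gates an expensive download scan) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `two_stage_check`, `score`, and `is_malware_download` are hypothetical, and the scoring and scanning logic are passed in as callables because the abstract does not specify them.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def two_stage_check(
    candidates: Dict[str, Features],
    score: Callable[[Features], float],
    is_malware_download: Callable[[str], bool],
    threshold: float,
) -> Dict[str, str]:
    """Label each candidate URL as 'skipped', 'clean', or 'malicious'.

    `score` is the cheap filter; `is_malware_download` stands in for the
    expensive step of visiting the site and inspecting downloaded software.
    """
    verdicts: Dict[str, str] = {}
    for url, features in candidates.items():
        if score(features) <= threshold:
            # Filter score does not exceed the threshold: not treated as suspicious.
            verdicts[url] = "skipped"
            continue
        # Only sites that pass the filter pay for the expensive scan.
        verdicts[url] = "malicious" if is_malware_download(url) else "clean"
    return verdicts
```

The point of the design is that the expensive scan runs only on the subset of candidates the filter flags.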

16 Claims

1. A method of identifying malicious websites, the method comprising:
- identifying a candidate suspicious website;
  
  identifying a plurality of lightweight features associated with the candidate suspicious website;
  
  identifying a dataset comprising a plurality of lightweight features associated with a plurality of known malicious websites and a plurality of lightweight features associated with a plurality of known innocuous websites;
  
  generating a filter classifier comprising a statistical model including weights for the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of lightweight features associated with the plurality of known innocuous websites that distinguish the plurality of known malicious websites from the plurality of known innocuous websites;
  
  determining, with the weights of the generated filter classifier, a continuous filter score for the candidate suspicious website based on the plurality of lightweight features associated with the candidate suspicious website, the continuous filter score indicating similarity between the lightweight features associated with the candidate suspicious website and the lightweight features of the known malicious websites;
  
  prioritizing a scan of the candidate suspicious website relative to other candidate suspicious websites in response to the continuous filter score for the candidate suspicious website and continuous filter scores for the other candidate suspicious websites;
  
  determining whether the candidate suspicious website is a malicious website responsive at least in part to the scan;
  
  updating, in response to determining that the suspicious website is a malicious website, the plurality of lightweight features associated with the plurality of known malicious websites in the dataset to include the plurality of lightweight features associated with the suspicious website; and
  
  re-generating the filter classifier to update the statistical model to include at least one modified weight for the plurality of lightweight features based on the updated dataset.
- Dependent Claims (2, 3, 4, 5, 6)
  - 2. The method of claim 1, wherein identifying a candidate suspicious website comprises:
    - identifying a set of heuristics including a set of seed links; and
      
      identifying a link to the candidate suspicious website based on the set of seed links.
  - 3. The method of claim 1, wherein identifying the plurality of lightweight features associated with the candidate suspicious website comprises:
    - identifying hypertext transfer protocol information associated with the candidate suspicious website; and
      
      identifying the plurality of lightweight features associated with the candidate suspicious website based on the hypertext transfer protocol information.
  - 4. The method of claim 1, wherein the plurality of lightweight features comprise one or more features from the set consisting of:
    - web server software of the candidate suspicious website;
      
      whether the candidate suspicious website contains a reference to an advertising network;
      
      whether the candidate suspicious website contains a reference to a public file hosting system; and
      
      a number of redirections or links to other websites in the candidate suspicious website.
  - 5. The method of claim 1, wherein determining whether the candidate suspicious website is a malicious website comprises:
    - identifying software downloaded to the computing system responsive to accessing the candidate suspicious website; and
      
      determining whether the software downloaded to the computing system is malware based on characteristics associated with the downloaded software.
  - 6. The method of claim 1, wherein generating the filter classifier based on the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of known innocuous websites in the dataset comprises:
    - storing information associated with the plurality of known malicious websites and the plurality of known innocuous websites in association with unique identifiers for the websites, the information including lightweight features extracted from each of the plurality of known innocuous and the plurality of known malicious websites.
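
The steps of claim 1 (train a weighted model on known malicious and innocuous sites, compute a continuous score, rank candidates for scanning, then fold confirmed malicious sites back into the dataset and retrain) can be sketched as below. The claim only requires "a statistical model including weights"; logistic regression trained by gradient descent is an assumption chosen for illustration, and all function and feature names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[str, float]
Weights = Dict[str, float]
BIAS = "__bias__"  # hypothetical reserved key for the intercept


def score(feats: Features, weights: Weights) -> float:
    """Continuous filter score in (0, 1); higher means more like known malicious sites."""
    z = weights.get(BIAS, 0.0) + sum(weights.get(n, 0.0) * v for n, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(malicious: List[Features], innocuous: List[Features],
          epochs: int = 200, lr: float = 0.5) -> Weights:
    """Fit one weight per feature with plain stochastic gradient descent."""
    data = [(f, 1.0) for f in malicious] + [(f, 0.0) for f in innocuous]
    weights: Weights = {BIAS: 0.0}
    for _ in range(epochs):
        for feats, label in data:
            err = score(feats, weights) - label
            weights[BIAS] -= lr * err
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) - lr * err * value
    return weights


def prioritize(candidates: Dict[str, Features], weights: Weights) -> List[Tuple[str, float]]:
    """Return (url, score) pairs ordered so the highest-scoring site is scanned first."""
    scored = [(url, score(f, weights)) for url, f in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def confirm_malicious(feats: Features, malicious: List[Features],
                      innocuous: List[Features]) -> Weights:
    """Add a scan-confirmed malicious site to the dataset and re-generate the weights."""
    malicious.append(feats)
    return train(malicious, innocuous)
```

Because the score is continuous rather than a yes/no label, candidates can be ordered relative to one another, which is what allows the scan queue to be prioritized.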

7. A computer system for identifying malicious websites, the system comprising:
- a non-transitory computer-readable storage medium storing executable computer program instructions comprising:
  
  a web crawler module adapted to:
  
  identify a candidate suspicious website; and
  
  identify a plurality of lightweight features associated with the candidate suspicious website;
  
  a filter module adapted to:
  
  identify a dataset comprising a plurality of lightweight features associated with a plurality of known malicious websites and a plurality of lightweight features associated with a plurality of known innocuous websites;
  
  generate a filter classifier comprising a statistical model including weights for the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of lightweight features associated with the plurality of known innocuous websites that distinguish the plurality of known malicious websites from the plurality of known innocuous websites; and
  
  determine, with the weights of the generated filter classifier, a continuous filter score for the candidate suspicious website based on the plurality of lightweight features associated with the candidate suspicious website, the continuous filter score indicating similarity between the lightweight features associated with the candidate suspicious website and the lightweight features of the known malicious websites;
  
  a malicious website scanning module adapted to:
  
  prioritize a scan of the candidate suspicious website relative to other candidate suspicious websites in response to the continuous filter score for the candidate suspicious website and continuous filter scores for the other candidate suspicious websites; and
  
  determine whether the candidate suspicious website is a malicious website responsive at least in part to the scan;
  
  wherein the filter module is further adapted to:
  
  update, in response to determining that the suspicious website is a malicious website, the plurality of lightweight features associated with the plurality of known malicious websites in the dataset to include the plurality of lightweight features associated with the suspicious website, and wherein the filter classifier is re-generated to update the statistical model to include at least one modified weight for the plurality of lightweight features based on the updated dataset; and
  
  a processor for executing the computer program instructions.
- Dependent Claims (8, 9, 10, 11, 12)
  - 8. The system of claim 7, wherein the web crawler module is further adapted to:
    - identify a set of heuristics including a set of seed links; and
      
      identify a link to the candidate suspicious website based on the set of seed links.
  - 9. The system of claim 7, wherein the web crawler module is further adapted to:
    - identify hypertext transfer protocol information associated with the candidate suspicious website; and
      
      identify the plurality of lightweight features associated with the candidate suspicious website based on the hypertext transfer protocol information.
  - 10. The system of claim 7, wherein the plurality of lightweight features comprise one or more features from the set consisting of:
    - web server software of the candidate suspicious website;
      
      whether the candidate suspicious website contains a reference to an advertising network;
      
      whether the candidate suspicious website contains a reference to a public file hosting system; and
      
      a number of redirections or links to other websites in the candidate suspicious website.
  - 11. The system of claim 7, wherein the malicious website scanning module is further adapted to:
    - identify software downloaded to the computing system responsive to accessing the candidate suspicious website; and
      
      determine whether the software downloaded to the computing system is malware based on characteristics associated with the downloaded software.
  - 12. The system of claim 7, wherein generating the filter classifier based on the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of known innocuous websites in the dataset comprises:
    - storing information associated with the plurality of known malicious websites and the plurality of known innocuous websites in association with unique identifiers for the websites, the information including lightweight features extracted from each of the plurality of known innocuous and the plurality of known malicious websites.
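
Claims 2 and 8 describe finding candidate websites by following links from a set of seed links. A minimal sketch of that step is shown below, assuming the seed pages have already been fetched and are supplied as HTML strings; the function name `candidates_from_seeds` and the choice to keep only `http`/`https` links are illustrative assumptions, not details from the claims.

```python
from html.parser import HTMLParser
from typing import Dict, List, Set
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def candidates_from_seeds(seed_pages: Dict[str, str]) -> List[str]:
    """Map of seed URL -> fetched HTML in, de-duplicated candidate URLs out."""
    seen: Set[str] = set(seed_pages)  # never re-report the seeds themselves
    found: List[str] = []
    for seed_url, html in seed_pages.items():
        collector = _LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            absolute = urljoin(seed_url, href)  # resolve relative links
            if urlparse(absolute).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if absolute not in seen:
                seen.add(absolute)
                found.append(absolute)
    return found
```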

13. A non-transitory computer-readable storage medium encoded with executable program code for identifying malicious websites, the program code comprising program code for:
- identifying a candidate suspicious website;
  
  identifying a plurality of lightweight features associated with the candidate suspicious website;
  
  identifying a dataset comprising a plurality of lightweight features associated with a plurality of known malicious websites and a plurality of lightweight features associated with a plurality of known innocuous websites;
  
  generating a filter classifier comprising a statistical model including weights for the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of lightweight features associated with the plurality of known innocuous websites that distinguish the plurality of known malicious websites from the plurality of known innocuous websites;
  
  determining, with the weights of the generated filter classifier, a continuous filter score for the candidate suspicious website based on the plurality of lightweight features associated with the candidate suspicious website, the continuous filter score indicating similarity between the lightweight features associated with the candidate suspicious website and the lightweight features of the known malicious websites;
  
  prioritizing a scan of the candidate suspicious website relative to other candidate suspicious websites in response to the continuous filter score for the candidate suspicious website and continuous filter scores for the other candidate suspicious websites;
  
  determining whether the candidate suspicious website is a malicious website responsive at least in part to the scan;
  
  updating, in response to determining that the suspicious website is a malicious website, the plurality of lightweight features associated with the plurality of known malicious websites in the dataset to include the plurality of lightweight features associated with the suspicious website; and
  
  re-generating the filter classifier to update the statistical model to include at least one modified weight for the plurality of lightweight features based on the updated dataset.
- Dependent Claims (14, 15, 16)
  - 14. The storage medium of claim 13, wherein program code for identifying the plurality of lightweight features associated with the candidate suspicious website comprises program code for:
    - identifying hypertext transfer protocol information associated with the candidate suspicious website; and
      
      identifying the plurality of lightweight features associated with the candidate suspicious website based on the hypertext transfer protocol information.
  - 15. The storage medium of claim 13, wherein generating the filter classifier based on the plurality of lightweight features associated with the plurality of known malicious websites and the plurality of known innocuous websites in the dataset comprises:
    - storing information associated with the plurality of known malicious websites and the plurality of known innocuous websites in association with unique identifiers for the websites, the information including lightweight features extracted from each of the plurality of known innocuous and the plurality of known malicious websites.
  - 16. The storage medium of claim 13, wherein the plurality of lightweight features comprise one or more features from the set consisting of:
    - web server software of the candidate suspicious website;
      
      whether the candidate suspicious website contains a reference to an advertising network;
      
      whether the candidate suspicious website contains a reference to a public file hosting system; and
      
      a number of redirections or links to other websites in the candidate suspicious website.
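
The lightweight features listed in claims 4, 10, and 16 could be derived from HTTP response information roughly as follows. This is a sketch under stated assumptions: the host lists `AD_NETWORK_HOSTS` and `FILE_HOSTING_HOSTS` are placeholders, the regular expression is a crude stand-in for real HTML parsing, and summing redirects and outbound links into one count is one possible reading of "a number of redirections or links".

```python
import re
from typing import Dict, Mapping

AD_NETWORK_HOSTS = ("ads.example.net",)      # hypothetical placeholder list
FILE_HOSTING_HOSTS = ("files.example.org",)  # hypothetical placeholder list

# Absolute http(s) URLs appearing in href= or src= attributes.
_LINK_RE = re.compile(r"""(?:href|src)\s*=\s*["']?(https?://[^"'\s>]+)""", re.IGNORECASE)


def lightweight_features(headers: Mapping[str, str], body: str,
                         redirect_count: int) -> Dict[str, float]:
    """Build a small numeric feature map from response headers and page body."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Keep only the product name from the Server header, e.g. "Apache/2.2.8" -> "apache".
    server = lowered.get("server", "").split("/")[0].strip().lower() or "unknown"
    links = _LINK_RE.findall(body)

    def mentions(hosts) -> float:
        return float(any(h in link for link in links for h in hosts))

    return {
        f"server={server}": 1.0,
        "ad_network_ref": mentions(AD_NETWORK_HOSTS),
        "file_hosting_ref": mentions(FILE_HOSTING_HOSTS),
        "redirects_and_links": float(redirect_count + len(links)),
    }
```

None of these features requires executing page content or downloading software, which is what keeps them cheap relative to the full scan.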

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Ramzan, Zulfikar
Primary Examiner(s): Eskandarnia, Arvin
Application Number: US12/165,467
Time in Patent Office: 2,283 Days
Field of Search: 726/22, 707/10
US Class Current: 726/22
CPC Class Codes:
G06F 1/00   Details not covered by grou...
G06F 21/566   Dynamic detection, i.e. det...
H04L 63/145   the attack involving the pr...
H04L 63/1483   service impersonation, e.g....
H04L 67/02   based on web technology, e....
H04W 4/21   for social networking appli...
