Finding phishing sites

US 20070192855A1
Filed: 01/18/2006
Published: 08/16/2007
Est. Priority Date: 01/18/2006
Status: Active Grant



Abstract

Described is a technology by which phishing-related data sources are processed into aggregated data, and a given site is evaluated against the aggregated data using a predictive model to automatically determine whether the given site is likely to be a phishing site. The predictive model may be built using machine learning based on training data, e.g., including known phishing sites and/or known non-phishing sites. To determine whether an object corresponding to a site is likely a phishing-related object, various criteria are evaluated, including one or more features of the object. The determination is output in some way, e.g., made available to a reputation service, used to block access to a site or warn a user before allowing access, and/or used to assist a hand grader in evaluating sites more efficiently.

191 Citations

20 Claims

1. In a computing environment, a method comprising:
- processing data from at least one data source related to phishing sites; and
  
  using a predictive model to determine whether a site is likely to be a phishing site.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1 wherein processing the data comprises generating a report for each of a plurality of data sources.
  - 3. The method of claim 2 wherein using the predictive model comprises aggregating the reports and applying the predictive model to the aggregated reports.
  - 4. The method of claim 1 further comprising, building the predictive model using machine learning based on training data.
  - 5. The method of claim 1 further comprising, building a plurality of predictive models using machine learning based on a plurality of different sets of training data.
  - 6. The method of claim 5 further comprising, retraining the model after new phishing-related data becomes available.
  - 7. The method of claim 5 wherein building the predictive model includes using a set of known phishing sites and a set of confirmed non-phishing sites.
  - 8. The method of claim 1 wherein using the predictive model comprises classifying the site into at least one class of a plurality of classes, including a class for blocking access to the site and a class for submitting data related to the site to a hand grader.
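
The flow in claims 1 through 4 (per-source reports, aggregation, a predictive model built from labeled training data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the patent: the source names, the example sites, the log-scaled count features, and the choice of a hand-rolled logistic regression as the learner.

```python
import math
from collections import defaultdict

# Hypothetical data source names; the patent does not name its sources this way.
SOURCES = ("email_feedback", "browser_reports")

def aggregate_reports(reports):
    """Sum (source, site, count) report rows into per-site, per-source totals."""
    totals = defaultdict(lambda: defaultdict(int))
    for source, site, count in reports:
        totals[site][source] += count
    return {site: dict(by_source) for site, by_source in totals.items()}

def to_vector(by_source):
    """Turn aggregated counts into a fixed-order, log-scaled feature vector."""
    return [math.log1p(by_source.get(source, 0)) for source in SOURCES]

def predict(weights, bias, vector):
    """Logistic model: estimated probability that the site is a phishing site."""
    z = bias + sum(w * x for w, x in zip(weights, vector))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(vectors, labels, epochs=1000, learning_rate=0.1):
    """Fit the logistic model with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(SOURCES), 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            error = predict(weights, bias, vector) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias

# Illustrative report rows: (source, site, number of phishing-related reports).
reports = [
    ("email_feedback", "phish-a.example", 40),
    ("browser_reports", "phish-a.example", 5),
    ("email_feedback", "phish-b.example", 25),
    ("email_feedback", "shop.example", 1),
    ("browser_reports", "news.example", 1),
]
aggregated = aggregate_reports(reports)

# Training labels: 1 = known phishing site, 0 = known non-phishing site.
known_labels = {"phish-a.example": 1, "phish-b.example": 1,
                "shop.example": 0, "news.example": 0}
sites = sorted(known_labels)
weights, bias = train([to_vector(aggregated[s]) for s in sites],
                      [known_labels[s] for s in sites])
scores = {s: predict(weights, bias, to_vector(aggregated[s])) for s in sites}
```

Retraining as in claim 6 would amount to calling `train` again once new labeled reports have been aggregated.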

9. In a computing environment, a system comprising:
- means for converting phishing-related source data into aggregated data; and
  
  means for determining whether an object corresponding to a site is likely a phishing-related object based on one or more features determined from the aggregated data.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17):
  - 10. The system of claim 9 wherein the means for converting comprises processing means that converts phishing-related source data from a plurality of sources into report data, and means for aggregating the report data into statistics.
  - 11. The system of claim 10 further comprising means for training the predictive model.
  - 12. The system of claim 9 further comprising, means for outputting information corresponding to whether the object corresponds to a likely phishing site.
  - 13. The system of claim 12 wherein outputting the information comprises making the information accessible to a reputation service.
  - 14. The system of claim 9 wherein the one or more features include at least one of:
    - a number of times the object appeared in the source, a ratio of the number of appearances in the last period to the current one, a ratio compared to a previous time, a ratio compared to long-term averages and standard deviations, a number of times the object appeared in the source along with a commonly-phished brand, a number of times the object appeared with a phishing-related word in the body or subject of a related message, whether common brands or phishing-related words appear in the host name or the URL, a number of times the object appeared in the source along with a link to a commonly-phished domain, a number of times the object was an exception when it appeared with commonly-phished domains, a number of times the object appeared in an email and a purported responsible domain of the message was a commonly-phished site, a number of times the object appeared in the source data and received a sender identification failure, a number of times the object appeared in the source and got a sender identification pass, a number of times the object appeared in a message that got a move/delete spam confidence level, a number of times the object appeared in the source with a phishing trick, a number of times the object appeared in an email message that had a fingerprint match with a known phishing message, a number of times the object appeared in an email message that had a fingerprint match with a message from a known phish target domain, a time duration since the object was first observed in the data source, a time duration since the object was last observed in the data source, and whether the host is a numeric IP address and if so whether the host matches zombie heuristics.
  - 15. The system of claim 9 further comprising recording one or more properties corresponding to the object when received in a message, wherein at least one property is from a set of properties containing:
    - a GUID of the message that contained the object, a time value indicative of when a feedback user reported the message as spam or good, a time indicative of when the feedback user received the message, a host of a URL, the URL, whether a body of the message contains one or more phishing-related words, whether a purported responsible domain of the message is a commonly phished domain, whether the message has any URL that triggers at least one phishing heuristic, whether the message has a fail or pass result code from a sender identification check, a number indicative of unique domains of web hosts in the message, whether the host from this report is not a commonly phished domain, but every other web host in the message is a commonly phished domain, whether the host from this message is not a commonly phished domain, but there is a web host in the message from a commonly phished domain, whether the host is from a commonly phished domain, whether the host is on a top traffic list, whether the host is a numeric IP address, and whether a feedback user indicated the message was spam or good.
  - 16. The system of claim 9 further comprising recording one or more properties corresponding to the object when received via a browser-based report submission, wherein at least one property is from a set of properties containing:
    - a GUID of the report, the host of a reported URL, a time that the URL was reported, whether the host from this report is from a commonly phished domain, whether the host from this report is on a top traffic list, whether a filter marked the URL as phish but the user reports that it is not phishing, whether the filter marked the URL as phish and the user reports that it is phishing, whether the filter marked the URL as not phish but the user reports that it is phishing, and whether the filter marked the URL as not phish and the user reports that it is not phishing.
  - 17. The system of claim 9 wherein the means for determining whether an object corresponding to a site is likely a phishing-related object tracks at least one of:
    - a number of times a URL appeared in a known spam source and not in a known good source, traffic, rank from search, geolocation, registrar information on the domain and last-hop router of a traceroute to the host.
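
A few of the features listed in claim 14 can be computed directly from a URL and its observation history. The sketch below covers only a handful of them (appearance count, time since first and last observation, numeric-IP host, brand or phishing-related words in the URL). The word lists, the function names, and the representation of sightings as timestamps in seconds are hypothetical choices for illustration, not details taken from the patent.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical word lists; a real system would maintain much larger ones.
COMMON_BRANDS = ("paypal", "ebay")
PHISHING_WORDS = ("verify", "account", "login")

def host_is_numeric_ip(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_features(url, sightings, now):
    """Build a feature dict for one URL.

    sightings: timestamps (in seconds) at which the URL was observed
    in a data source; now: the current time on the same clock.
    """
    host = urlsplit(url).hostname or ""
    lowered = url.lower()
    return {
        "appearances": len(sightings),
        "seconds_since_first_seen": now - min(sightings) if sightings else None,
        "seconds_since_last_seen": now - max(sightings) if sightings else None,
        "host_is_numeric_ip": host_is_numeric_ip(host),
        "brand_in_url": any(brand in lowered for brand in COMMON_BRANDS),
        "phishing_word_in_url": any(word in lowered for word in PHISHING_WORDS),
    }

suspicious = extract_features("http://192.0.2.7/paypal/login.html",
                              sightings=[100, 400, 700], now=1000)
benign = extract_features("https://www.example.com/", sightings=[], now=1000)
```

Here `suspicious` reports three appearances, a numeric-IP host, and both a brand and a phishing-related word in the URL, while `benign` has none of these signals.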

18. At least one computer-readable medium having computer-executable instructions, which when executed perform steps, comprising:
- aggregating phishing-related data from a plurality of sources including at least one source corresponding to an email service and at least one source corresponding to an internet access service; and
  
  predicting whether a site is likely to be a phishing site based on features of the site when evaluated against the aggregated data.
- Dependent claims (19, 20):
  - 19. The computer-readable medium of claim 18 wherein predicting whether the site is likely to be a phishing site comprises determining a probability value, and further comprising, using the probability value to automatically warn users from visiting a site with a probability of being a phishing site, using the probability value to automatically block users from visiting a site with another probability of being a phishing site, and/or using the probability value to assist a hand grader in grading sites more efficiently.
  - 20. The computer-readable medium of claim 18 wherein predicting whether the site is likely to be a phishing site comprises building a predictive model and applying the predictive model to the aggregated data based on the features of the site.
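
Claim 19 uses a probability value to warn, to block, or to support a hand grader, and claim 8 similarly classifies a site into classes. One way to picture this is a simple threshold mapping, sketched below. The threshold values, the action names, and the queue ordering are assumptions chosen for illustration; the patent does not specify them.

```python
def choose_action(probability, block_at=0.95, warn_at=0.70, grade_at=0.30):
    """Map a phishing probability to an action using hypothetical thresholds."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= block_at:
        return "block"        # confident enough to block access outright
    if probability >= warn_at:
        return "warn"         # warn the user before allowing access
    if probability >= grade_at:
        return "hand_grade"   # uncertain: send to a human grader
    return "allow"

def grading_queue(scored_sites):
    """Order the sites sent for hand grading, most likely phishing first."""
    pending = [(site, p) for site, p in scored_sites.items()
               if choose_action(p) == "hand_grade"]
    return sorted(pending, key=lambda item: item[1], reverse=True)

scored = {"a.example": 0.99, "b.example": 0.80, "c.example": 0.45,
          "d.example": 0.35, "e.example": 0.02}
actions = {site: choose_action(p) for site, p in scored.items()}
queue = grading_queue(scored)
```

Sorting the grading queue by probability is only one possible way to help a hand grader work more efficiently; ordering by traffic or by model uncertainty would be equally plausible under the claim language.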


Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Rounthwaite, Robert; Seshadrinathan, Gopalakrishnan; Mishra, Manav; Snelling, David; Penta, Anthony; Goodman, Joshua; Hulten, Geoffrey; Deyo, Roderic; Rehfuss, Paul; Haber, Elliott

Granted Patent

US 8,839,418 B2
US Class Current

726/22
CPC Class Codes

G06F 16/9566   URL specific, e.g. using al...

H04L 63/08   for authentication of entit...

H04L 63/1441   Countermeasures against mal...

H04L 63/1483   service impersonation, e.g....
