Advanced URL and IP features

US 7,409,708 B2
Filed: 05/28/2004
Issued: 08/05/2008
Est. Priority Date: 06/04/2003
Status: Active Grant

Abstract

Disclosed are systems and methods that facilitate spam detection and prevention at least in part by building or training filters using advanced IP address and/or URL features in connection with machine learning techniques. A variety of advanced IP address related features can be generated from performing a reverse IP lookup. Similarly, many different advanced URL based features can be created from analyzing at least a portion of any one URL detected in a message.
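
As a rough illustration of the reverse IP lookup mentioned in the abstract, the sketch below turns a sender IP address into string features, including an explicit token for a missing entry. The feature names (`rdns:NULL`, `rdns:PRESENT`, `rdns_host:`, `rdns_domain:`), the helper names, and the "last two labels" notion of a domain are assumptions made for this example and do not come from the patent.

```python
import socket
from typing import Callable, List, Optional

# Hypothetical token emitted when an IP address has no reverse DNS entry.
NULL_RDNS = "rdns:NULL"


def system_rdns(ip: str) -> Optional[str]:
    """Return the reverse DNS hostname for ip, or None when the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None


def rdns_features(ip: str,
                  lookup: Callable[[str], Optional[str]] = system_rdns) -> List[str]:
    """Map an IP address to presence, hostname and domain features."""
    host = lookup(ip)
    if not host:
        # A missing entry is itself a feature rather than a gap in the data.
        return [NULL_RDNS]
    host = host.rstrip(".").lower()
    labels = host.split(".")
    feats = ["rdns:PRESENT", f"rdns_host:{host}"]
    if len(labels) >= 2:
        # Naive assumption: the domain is the last two labels of the hostname.
        feats.append("rdns_domain:" + ".".join(labels[-2:]))
    return feats
```

Passing the lookup as a parameter keeps the feature logic testable without network access; in a live setting the default `system_rdns` would be used.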

36 Claims

1. A computer readable storage medium having stored thereon computer executable components that facilitate spam detection the components comprise:
- a component that receives an item and extracts a set of features associated with an origination of a message or part thereof and/or information that enables an intended recipient to contact, respond to, or act on the message, the features comprising at least one of IP address-based features and URL-based features, wherein the IP address-based features comprise at least one of presence of reverse DNS entry or domain name, hostname from the reverse DNS entry and missing reverse DNS entry;
  
  an analysis component that analyzes at least a subset of the features; and
  
  at least one filter that is trained on at least a subset of the features to facilitate distinguishing spam messages from good messages, wherein the filter is trained by analyzing at least a portion of the IP address-based data at least in part by taking null reverse DNS information and using a null RDNS entry as input into a machine learning algorithm.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The storage medium of claim 1, the at least one filter is trained to analyze text of a present reverse DNS address lookup.
  - 3. The storage medium of claim 2, the filter is a machine learning filter.
  - 4. The storage medium of claim 2, the filter is a hash-based or match-based filter.
  - 5. The storage medium of claim 2, the analysis component examines a reverse DNS entry to determine whether evidence of at least one of a DSL line, cable modem line or dialup line is included in the reverse DNS entry or to find other evidence that the computer that sent the message is one that should not be sending such messages.
  - 6. The storage medium of claim 2, the analysis component examines a reverse DNS entry that corresponds to an IP address detected in the message to determine whether at least one of “dsl”, “cable”, “dialup”, “client”, “pool”, “user”, “dyn”, “tele”, “cust”, “dial”, “dialin”, “modem”, “ppp”, “dhcp”, “mail”, “smtp”, and/or “.mx” appears in the DNS entry to facilitate identifying the message as spam or good.
  - 7. The storage medium of claim 1, the IP address-based features comprise at least one of the following:
    - length of the reverse DNS entry;
      
      depth of the DNS entry; and
      
      presence of at least a portion of IP address in the hostname of the reverse DNS entry in clear form or encoded in octal or hexadecimal.
  - 8. The storage medium of claim 1, the URL-based features comprise at least one of the following:
    - one or more absolute URL features;
      
      one or more count-based URL features;
      
      one or more combination-based URL features; and
      
      any combination of at least two of absolute URL features, count-based URL features, and combination-based URL features.
  - 9. The storage medium of claim 8, the count-based URL features comprise total-based features and sequence-based features.
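
The reverse DNS features named in claims 5 to 7 (keyword evidence, length, depth, and the sender's IP address embedded in the hostname) could be computed roughly as follows. This is a minimal sketch for IPv4 only; the function names, feature names, and the token-matching heuristic for spotting an embedded address are assumptions for illustration, not the patented method.

```python
import re
from typing import List

# Substrings listed in claim 6. Matching is plain substring search, so "dial"
# also fires on hostnames containing "dialup" or "dialin".
KEYWORDS = ["dsl", "cable", "dialup", "client", "pool", "user", "dyn", "tele",
            "cust", "dial", "dialin", "modem", "ppp", "dhcp", "mail", "smtp", ".mx"]


def encode_octets(ip: str, fmt: str) -> List[str]:
    """Render each IPv4 octet with a format spec such as 'd', 'o' or '02x'."""
    return [format(int(part), fmt) for part in ip.split(".")]


def rdns_text_features(hostname: str, ip: str) -> List[str]:
    """Length, depth, keyword and embedded-IP features of a reverse DNS name."""
    host = hostname.rstrip(".").lower()
    feats = [f"rdns_len:{len(host)}", f"rdns_depth:{host.count('.') + 1}"]
    feats += [f"rdns_kw:{kw}" for kw in KEYWORDS if kw in host]

    # Heuristic: the IP counts as embedded when every encoded octet appears as
    # a separate token, or when the concatenated octets appear as one run.
    tokens = set(re.split(r"[^0-9a-z]+", host))
    for name, fmt in (("clear", "d"), ("octal", "o"), ("hex", "02x")):
        octets = encode_octets(ip, fmt)
        if all(o in tokens for o in octets) or "".join(octets) in host:
            feats.append(f"rdns_ip:{name}")
    return feats
```

For example, `rdns_text_features("dsl-192-0-2-10.pool.example.net", "192.0.2.10")` reports a depth of 4, the `dsl` and `pool` keywords, and the address in clear form. The heuristic can produce false positives on short octets, which a trained filter would be expected to weigh accordingly.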

10. A computer implemented spam detection and filtering system comprising the following components executed on a processor:
- a component that uses traceroute to gather additional IP address or URL feature information about at least one message; and
  
  a filtering component that employs the traceroute information to facilitate distinguishing between spam and good messages, wherein the filter is trained by analyzing at least a portion of the IP address-based data at least in part by taking null reverse DNS information and using a null RDNS entry as input into a machine learning algorithm.
- Dependent claims (11, 12, 13, 14, 15)
  - 11. The system of claim 10, the filtering component comprises a machine learning filter.
  - 12. The system of claim 10, the filtering component comprises any one of a hash-based filter or a match-based filter.
  - 13. The system of claim 10, the traceroute is a traceroute of the IP address that the message was received from.
  - 14. The system of claim 10, the traceroute is a traceroute of the IP address of a URL contained in the message.
  - 15. The system of claim 10, the traceroute is a traceroute of the IP address of the DNS server of a URL in the message.
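
One way traceroute output might be turned into filter inputs is sketched below. It assumes the hops have already been collected elsewhere as a list of router addresses (with `None` for hops that did not answer) and simply names the last few hops before the target. The function name, the `source` labels, and the choice of looking only at the final hops are assumptions for this example.

```python
from typing import List, Optional, Sequence


def traceroute_features(hops: Sequence[Optional[str]],
                        last_k: int = 3,
                        source: str = "sender") -> List[str]:
    """Emit one feature per hop for the last_k hops nearest the target.

    hops   : router addresses ordered from the local side towards the target,
             excluding the target itself; None marks a hop with no reply.
    source : which address was traced, e.g. "sender" (claim 13),
             "url_host" (claim 14) or "url_dns" (claim 15).
    """
    feats = []
    tail = list(hops)[-last_k:] if last_k > 0 else []
    # distance 1 is the router immediately before the target.
    for distance, hop in enumerate(reversed(tail), start=1):
        label = hop if hop is not None else "NO_REPLY"
        feats.append(f"tr_{source}_hop-{distance}:{label}")
    return feats
```

Two messages whose senders sit behind the same upstream routers would then share features even when their own IP addresses differ.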

16. A computer implemented spam detection and filtering system comprising the following components executed on a processor:
- a component that receives an incoming message; and
  
  a filter that employs any combination of at least two of absolute URL features, count-based URL features, and combination-based URL features detected in a message to facilitate determining whether the message is spam.
- Dependent claims (17, 18, 19, 20)
  - 17. The system of claim 16, the URL-based inputs depend on a total number of URLs detected in the message.
  - 18. The system of claim 16, the filter treats any one URL detected in the message based in part on the number of distinct URLs or portions of URLs that precede or follow it in the message.
  - 19. The system of claim 16, the filter combines any number of URLs detected in the message into one or more subsets for use as inputs to facilitate distinguishing between spam and good messages.
  - 20. The system of claim 16, the filter is any one of the following:
    - a machine learning filter;
      
      a hash-based filter; and
      
      a match-based filter.
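
The three URL feature families in claims 16 to 19 can be illustrated with a short sketch. Here an absolute feature is taken to be the URL's hostname, count-based features cover the total number of distinct hosts (claim 17) and each host's position relative to the others (claim 18), and combination-based features are unordered host pairs (claim 19). The regular expression, the use of hostnames rather than full URLs, and all feature names are assumptions for this example.

```python
import re
from itertools import combinations
from typing import List
from urllib.parse import urlsplit

# Simplified URL matcher, adequate for plain-text bodies in this sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(message: str) -> List[str]:
    """Hostnames of URLs in order of first appearance, duplicates removed."""
    hosts: List[str] = []
    for match in URL_RE.finditer(message):
        host = urlsplit(match.group()).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def url_features(message: str) -> List[str]:
    hosts = url_hosts(message)
    feats = [f"url:{h}" for h in hosts]                 # absolute features
    feats.append(f"url_total:{len(hosts)}")             # count-based: total
    for i, h in enumerate(hosts):                       # count-based: sequence
        feats.append(f"url_seq:{i}before/{len(hosts) - 1 - i}after:{h}")
    # combination-based: every unordered pair of distinct hosts
    feats += [f"url_pair:{a}+{b}" for a, b in combinations(sorted(hosts), 2)]
    return feats
```

Pair features grow quadratically with the number of distinct hosts, so a practical system would likely cap the number of URLs considered per message.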

21. A computer implemented spam detection and filtering system comprising the following components executed on a processor:
- a component that receives an incoming message;
  
  a component that detects URLs and redirected URLs; and
  
  a machine learning filter that employs at least a portion of one or more redirected URLs detected in a message as inputs to facilitate determining whether the message is spam.
- Dependent claims (22, 23, 24)
  - 22. The system of claim 21, the machine learning filter is discriminatively trained with respect to URLs and redirected URLs or portions thereof.
  - 23. The system of claim 21, the machine learning filter employs counts of numbers of redirected URLs as inputs.
  - 24. The system of claim 21, the component further detects multilevel redirection to be used as an input to the machine learning filter.
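
Redirect detection as described in claims 21 to 24 might look like the sketch below. It assumes that a "redirected URL" is one that embeds another absolute URL in its path or query string, possibly percent-encoded, and treats nesting deeper than one level as multilevel redirection. That definition, the helper names, and the feature names are assumptions for illustration; redirects performed server-side would not be visible to this approach.

```python
import re
from typing import Iterable, List
from urllib.parse import unquote, urlsplit

EMBEDDED = re.compile(r"https?://", re.IGNORECASE)


def redirect_chain(url: str, max_depth: int = 5) -> List[str]:
    """Return [outer host, next host, ...] for URLs nested inside url."""
    chain: List[str] = []
    current = url
    for _ in range(max_depth):
        host = urlsplit(current).hostname
        if not host:
            break
        chain.append(host)
        # Decode one level of percent-encoding after the scheme, then look
        # for another absolute URL inside the remainder.
        rest = unquote(current[current.find("://") + 3:])
        match = EMBEDDED.search(rest)
        if not match:
            break
        current = rest[match.start():]
    return chain


def redirect_features(urls: Iterable[str]) -> List[str]:
    feats: List[str] = []
    redirected = 0
    for url in urls:
        chain = redirect_chain(url)
        if len(chain) > 1:
            redirected += 1
            feats.append(f"redir_target:{chain[-1]}")   # final destination host
            if len(chain) > 2:
                feats.append("redir_multilevel")        # claim 24
    feats.append(f"redir_count:{redirected}")           # claim 23
    return feats
```

The `max_depth` bound keeps a deliberately deep or malformed URL from consuming unbounded work.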

25. A computer implemented spam detection and filtering system comprising the following components executed on a processor:
- a component that detects URLs in a message;
  
  a contact process component comprising at least one of the following contact routes;
  
  URL detected in the message including at least one of an IP address of the URL, a DNS server of the URL, a traceroute of the IP address of the host of the URL, an IP address of the DNS server of the URL, version information of the DNS server, and the traceroute of the IP address of the DNS server; and
  
  a machine learning filter component that employs at least one of the contact routes to facilitate determining whether the message is spam, wherein the filter is trained by analyzing at least a portion of the IP address-based data at least in part by taking null reverse DNS information and using a null RDNS entry as input into a machine learning algorithm.
- Dependent claims (26, 27)
  - 26. The system of claim 25, the filter component employs the contact process comprising the IP address of the URL.
  - 27. The system of claim 25, the filter component employs the contact process comprising the DNS server of the URL.

28. A spam filtering method comprising:
- extracting at least one of IP address-based data and URL-based data from a message, wherein the IP address-based data comprising at least a portion of an IP address and the URL-based data comprising at least a portion of at least one URL;
  
  generating at least one of IP address-based features and the URL-based features from the respective data to be used as inputs to at least one filter; and
  
  employing at least one filter trained on at least a subset of the inputs to facilitate distinguishing spam messages from good messages, wherein the filter is trained by analyzing at least a portion of the IP address-based data at least in part by taking null reverse DNS information and using a null RDNS entry as input into a machine learning algorithm.
- Dependent claims (29, 30, 31, 32, 33)
  - 29. The method of claim 28, the filter is a machine learning filter.
  - 30. The method of claim 28, further comprising analyzing data returned from a reverse DNS lookup by taking non-null information comprising a name return and using the name return as an input to a machine learning algorithm to train a filter.
  - 31. The method of claim 28, the IP address-based features comprising at least one of the following:
    - presence of reverse DNS entry or domain name;
      
      length of reverse DNS entry;
      
      hostname from the reverse DNS entry;
      
      missing reverse DNS entry;
      
      presence of at least a portion of IP address in the hostname of the reverse DNS entry; and
      
      evidence of any one of “dsl”, “cable”, “dialup”, “client”, “pool”, “user”, “dyn”, “tele”, “cust”, “dial.”, “dialin”, “modem”, “ppp”, “dhcp”, “mail”, “smtp”, and/or “.mx” in the DNS entry.
  - 32. The method of claim 28, the URL-based features comprising at least one of the following:
    - one or more absolute URL features;
      
      one or more count-based URL features;
      
      one or more combination-based URL features;
      
      any combination of at least two of absolute URL features, count-based URL features, and combination-based URL features;
      
      redirected URLs; and
      
      presence of multilevel redirected URLs.
  - 33. The method of claim 28, further comprising performing a traceroute to obtain additional IP address or URL features for use as inputs to the filter.
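
Claim 28 recites training a filter by feeding a null reverse DNS entry into a machine learning algorithm, without naming a particular algorithm. The sketch below substitutes a small naive Bayes classifier over binary features purely for illustration; the class name, the `rdns:NULL` token, and the training data are invented for this example, and only features present in a message contribute to its score.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class NaiveBayesFilter:
    """Toy spam filter over sets of string features (illustrative only)."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "good": Counter()}
        self.totals = {"spam": 0, "good": 0}

    def train(self, examples: Iterable[Tuple[List[str], str]]) -> None:
        for features, label in examples:
            self.totals[label] += 1
            self.counts[label].update(set(features))

    def spam_probability(self, features: Iterable[str]) -> float:
        # Start from the smoothed class prior, then add one log-likelihood
        # ratio per feature present in the message (add-one smoothing).
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["good"] + 1))
        for f in set(features):
            p_spam = (self.counts["spam"][f] + 1) / (self.totals["spam"] + 2)
            p_good = (self.counts["good"][f] + 1) / (self.totals["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        return 1.0 / (1.0 + math.exp(-log_odds))


# Invented training data: the null RDNS token is an input like any other.
TRAINING = [
    (["rdns:NULL", "url:shop.example.com"], "spam"),
    (["rdns:NULL"], "spam"),
    (["rdns:PRESENT", "rdns_kw:dsl"], "spam"),
    (["rdns:PRESENT", "rdns_kw:mail"], "good"),
    (["rdns:PRESENT", "rdns_kw:smtp"], "good"),
    (["rdns:PRESENT"], "good"),
]
```

After `NaiveBayesFilter().train(TRAINING)`, a message whose only feature is `rdns:NULL` scores above 0.5, while one with `rdns:PRESENT` and `rdns_kw:mail` scores below it. Any learner that accepts sparse binary features could take the place of this classifier.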

34. A spam detection and filtering method comprising:
- receiving incoming messages;
  
  examining a contact process of obtaining data from a URL to determine commonalities among a plurality of hostnames to facilitate generating features, wherein examining the contact process comprises at least one of;
  
  performing a DNS lookup for the URL, identifying identity of DNS server, obtaining traceroute of a path from the URL to the DNS server, identifying version information of DNS server, converting a hostname to an IP address using the DNS server, identifying at least a portion of the IP address and performing a traceroute on the IP address to determine whether the IP addresses are connected in a similar way; and
  
  employing at least one filter trained at least in part on at least a subset of the features to facilitate determining whether messages are spam.
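
Part of the contact process in claim 34, converting hostnames to IP addresses and using portions of those addresses to expose commonalities among hostnames, can be sketched as follows. The example is IPv4 only, resolves names through an injectable lookup function, and uses the first two and three octets as the "portion" of the address; those choices and the feature names are assumptions for illustration. DNS server identity, version information, and traceroute steps are not covered here.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve(host: str) -> Optional[str]:
    """Resolve a hostname to an IPv4 address, or None on failure."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def contact_features(hostnames: Iterable[str],
                     lookup: Callable[[str], Optional[str]] = resolve) -> List[str]:
    """Features from the resolved IP address and its leading octets."""
    feats = set()
    for host in hostnames:
        ip = lookup(host)
        if ip is None:
            feats.add("url_ip:UNRESOLVED")
            continue
        octets = ip.split(".")
        feats.add(f"url_ip:{ip}")
        # Shared prefixes let different hostnames map to a common feature.
        feats.add("url_ip24:" + ".".join(octets[:3]))
        feats.add("url_ip16:" + ".".join(octets[:2]))
    return sorted(feats)
```

Two unrelated-looking hostnames that resolve into the same address block produce the same prefix feature, which is the kind of commonality a filter can learn from.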

35. A computer implemented spam filtering system comprising the following components executed on a processor:
- means for extracting at least one of IP address-based data and URL-based data from a message, wherein the IP address-based data comprising at least a portion of an IP address and the URL-based data comprising at least a portion of at least one URL;
  
  means for generating at least one of IP address-based features and the URL-based features from the respective data to be used as inputs to at least one filter; and
  
  means for employing at least one filter trained on at least a subset of the inputs to facilitate distinguishing spam messages from good messages, wherein the filter is trained by analyzing at least a portion of the IP address-based data at least in part by taking null reverse DNS information and using a null RDNS entry as input into a machine learning algorithm.

36. A computer-readable storage medium containing a data structure adapted to be transmitted between two or more computer processes facilitating improved detection of spam, the data structure comprising:
- information associated with generating at least one of IP address-based features and the URL-based features from respective data to be used as inputs to at least one filter; and
  
  employing at least one machine learning filter trained on at least a subset of the inputs to facilitate distinguishing spam messages from good messages.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Hulten, Geoffrey J; Mishra, Manav; Rounthwaite, Robert L; Goodman, Joshua T; Deurbrouck, John A; Penta, Anthony P
Primary Examiner(s): SMITHERS, MATTHEW
Application Number: US10/856,978
Publication Number: US 20050022031A1
Time in Patent Office: 1,530 Days
Field of Search: 726/13
US Class Current: 726/13
CPC Class Codes:
G06Q 10/107 Computer-aided management o...
H04L 51/212 using filtering or selectiv...
