Feedback loop for spam prevention

US 7,219,148 B2
Filed: 03/03/2003
Issued: 05/15/2007
Est. Priority Date: 03/03/2003
Status: Expired due to Fees



Abstract

The subject invention provides for a feedback loop system and method that facilitate classifying items in connection with spam prevention in server and/or client-based architectures. The invention makes use of a machine-learning approach as applied to spam filters, and in particular, randomly samples incoming email messages so that examples of both legitimate and junk/spam mail are obtained to generate sets of training data. Users who are identified as spam-fighters are asked to vote on whether a selection of their incoming email messages is individually either legitimate mail or junk mail. A database stores the properties for each mail and voting transaction such as user information, message properties and content summary, and polling results for each message to generate training data for machine learning systems. The machine learning systems facilitate creating improved spam filter(s) that are trained to recognize both legitimate mail and spam mail and to distinguish between them.
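The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names, the `poll_rate` parameter, and the in-memory `votes` dictionary (standing in for the database) are all hypothetical choices made for illustration. The sketch only shows the ordering the abstract stresses, namely that messages are sampled for polling without consulting any existing filter, and that each vote is stored with the user and the message ID so it can later become training data.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: str      # unique ID assigned to the message (hypothetical field)
    recipient: str
    body: str


@dataclass
class FeedbackLoop:
    spam_fighters: set                 # recipients who vote on sampled mail
    poll_rate: float = 0.1             # hypothetical sampling probability
    rng: random.Random = field(default_factory=random.Random)
    votes: dict = field(default_factory=dict)  # msg_id -> (user, label)

    def select_for_polling(self, messages):
        """Randomly sample messages addressed to spam fighters.

        No spam filter is consulted here, so mail that a current filter
        would have blocked is sampled on the same footing as other mail.
        """
        return [
            m for m in messages
            if m.recipient in self.spam_fighters
            and self.rng.random() < self.poll_rate
        ]

    def record_vote(self, msg_id, user, label):
        """Store one voting transaction (stand-in for the database write)."""
        if label not in ("spam", "legitimate"):
            raise ValueError(f"unknown label: {label!r}")
        self.votes[msg_id] = (user, label)

    def training_data(self, messages):
        """Pair each voted-on message body with its user-supplied label."""
        by_id = {m.msg_id: m for m in messages}
        return [
            (by_id[msg_id].body, label)
            for msg_id, (_user, label) in self.votes.items()
            if msg_id in by_id
        ]
```

The resulting `(body, label)` pairs could be handed to any machine learning classifier; the abstract does not name a specific learning algorithm, so none is assumed here.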

272 Citations

81 Claims

1. A system, that facilitates classifying items in connection with spam prevention, comprising computer-executable components embodied on a computer-readable storage medium, the system comprising:
- a component that receives a set of the items;
  
  a component that identifies intended recipients of the items, and tags a subset of the items to be polled, the subset of items corresponding to a subset of recipients that are known spam fighting users, wherein the subset of the items to be polled is determined before the items are labeled as spam or not spam, as such all items are considered for polling including those items which are designated as spam by a currently employed spam filter;
  
  a feedback component that receives information relating to the spam fighter's classification of the polled items, and employs the information in connection with training a spam filter and populating a spam list, wherein the feedback component employs machine learning techniques to train the spam filter; and
  
  a component that modifies an item tagged for polling to identify it as a polling item, wherein the modified item comprises voting instructions and any one of at least two voting buttons and links which correspond to at least two respective classes of items facilitate classification of the item by the user, wherein the voting buttons correspond to respective links such that when any one of the voting buttons is selected by the user, information relating to the selected voting button, the respective user, and the item's unique ID assigned thereto is sent to a database for storage.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)
- - 2. The system of claim 1, wherein the items comprise at least one of:
    - electronic mail (email) and messages.
  - 3. The system of claim 1, wherein the component that receives a set of the items is any one of an email server, a message server, and client email software.
  - 4. The system of claim 1, wherein the subset of items to be polled comprises all of the items received.
  - 5. The system of claim 1, wherein the subset of recipients comprises all recipients.
  - 6. The system of claim 1, wherein the subset of recipients are randomly selected.
  - 7. The system of claim 1, wherein the subset of recipients comprises paying users of the system.
  - 8. The system of claim 1, wherein at least a subset of messages that would ordinarily be filtered are considered for polling.
  - 9. The system of claim 1, wherein the subset of items tagged for polling is limited to at least one of the following:
    - a number of the items selected per user;
      
      a number of the items selected per user per time period; and
      
      a probability of tagging an item corresponding to a known user.
  - 10. The system of claim 1, wherein the tagged items are each assigned a unique ID, the unique ID corresponding to any one of the tagged item and contents of the tagged item.
  - 11. The system of claim 1, wherein the modified item comprises at least one of the following:
    - a modified “from” address;
      
      a modified subject line;
      
      a polling icon; and
      
      a polling color for identification as a polling item.
  - 12. The system of claim 1, wherein the modified item comprises the tagged item as an attachment.
  - 13. The system of claim 1, wherein the modified item comprises a summary of the tagged item, the summary comprising at least one of a subject, a date, text of the message, and a first few lines of the text.
  - 14. The system of claim 1, wherein the at least two voting buttons comprise a first voting button, the first voting button indicating “legitimate mail” and a second voting button, the second voting button indicating “spam”.
  - 15. The system of claim 1, wherein the voting buttons are implemented by modifying text of the item.
  - 16. The system of claim 1, wherein the voting buttons are implemented by modifying a user interface of client email software.
  - 17. The system of claim 1, further comprising a central database that stores information and data relating to user properties, item content and properties associated with tagged items, user classification and voting statistical data, frequency analysis data of polling per user and of polling per user per time period, spam lists, legitimate mail lists, and black hole lists.
  - 18. The system of claim 1, wherein items tagged for polling that are marked as spam by an existing filter are delivered to the user's inbox and considered for polling.
  - 19. The system of claim 1, wherein items tagged for polling are scanned for viruses such that one of the following occurs:
    - detected viruses are stripped out and the items are polled; and
      
      infected items are discarded.
  - 20. The system of claim 1 distributed across more than one spam-fighting company such that feedback from each company is sent to a central database operatively interfaced with each company, wherein some portion of the feedback is removed for privacy reasons.
  - 21. The system of claim 20, wherein the company feedback comprises one of the following:
    - only spam items, thereby excluding legitimate items; and
      
      spam items and sender name, domain name and IP address of legitimate items.
  - 22. The system of claim 1, further comprising a user classification validation component that tests user reliability and trustworthiness.
  - 23. The system of claim 22, wherein the user classification validation component is at least one of a cross-validation technique and a known result test message technique.
  - 24. The system of claim 22, wherein the user classification validation component can be applied to one or more suspected users.
  - 25. The system of claim 1, wherein the feedback component receives information relating to user feedback, honeypot feedback, and optionally, user recipient feedback of received items.
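The polling limits listed in claim 9 (a count per user, a count per user per time period, and a tagging probability) can be sketched as a small gatekeeper. This is only an illustration under assumptions: the class name `PollLimiter`, its parameters, and the sliding-window interpretation of "per time period" are hypothetical, and the claim requires only at least one of the three limits, whereas the sketch applies all three together.

```python
import random
from collections import defaultdict, deque


class PollLimiter:
    """Decide whether one more message may be tagged for a given user."""

    def __init__(self, max_per_user, max_per_window, window_seconds,
                 tag_probability, rng=None):
        self.max_per_user = max_per_user        # lifetime cap per user
        self.max_per_window = max_per_window    # cap per user per time period
        self.window_seconds = window_seconds    # length of the time period
        self.tag_probability = tag_probability  # chance of tagging an eligible item
        self.rng = rng or random.Random()
        self.total = defaultdict(int)           # user -> items tagged so far
        self.recent = defaultdict(deque)        # user -> timestamps inside the window

    def should_tag(self, user, now):
        """Return True and record the tag if every limit allows it."""
        if self.total[user] >= self.max_per_user:
            return False
        window = self.recent[user]
        # Forget tags that have aged out of the sliding window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_per_window:
            return False
        if self.rng.random() >= self.tag_probability:
            return False
        self.total[user] += 1
        window.append(now)
        return True
```

Timestamps are passed in explicitly (as seconds) rather than read from the clock, which keeps the sketch deterministic when `tag_probability` is 1.0.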

26. A method that facilitates classifying messages in connection with spam prevention comprising:
- receiving a set of the messages;
  
  identifying intended recipients of the messages;
  
  tagging a subset of the messages to be polled, the subset of messages corresponding to a subset of the recipients that are known spam fighting users, wherein the subset of messages to be polled is determined before the messages are labeled as spam or not spam, as such all messages are considered for polling including those messages which are designated as spam by a currently employed spam filter;
  
  receiving information relating to the user's classification of polling messages;
  
  employing the information in connection with training a spam filter and populating a spam list, wherein training the spam filter is employed via a machine learning technique; and
  
  modifying a message tagged for polling to identify it as a polling message, wherein the modified message comprises voting instructions and any one of at least two voting buttons and links which correspond to at least two respective classes of messages facilitate classification of the message by the user, wherein the voting buttons correspond to respective links such that when any one of the voting buttons is selected by the user, information relating to the selected voting button, the respective user, and the message's unique ID assigned thereto is sent to a database for storage.
- View Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79)
- - 27. The method of claim 26, wherein at least a subset of messages that would ordinarily be filtered are received by an email server and into a feedback loop system.
  - 28. The method of claim 26, wherein all incoming messages are handled by client email software such that messages selected for polling are specific to preferences of an individual user.
  - 29. The method of claim 26, wherein all messages received are considered for polling to mitigate bias of data.
  - 30. The method of claim 26, wherein the subset of messages to be polled comprises all messages.
  - 31. The method of claim 26, wherein the subset of recipients comprises all recipients.
  - 32. The method of claim 26, wherein the subset of recipients which are known spam fighting users is determined by each recipient performing at least one of the following:
    - opting in to provide feedback on messages to facilitate training a new spam filter;
      
      passively opting in to provide feedback on messages by not opting out;
      
      paying for email and message services provided by a participating message server; and
      
      opening an email account with a participating message server.
  - 33. The method of claim 26, wherein the subset of users selected to participate in the message polling is selected at random.
  - 34. The method of claim 26, wherein the subset of users selected to participate in the message polling are selected from all paying users, thereby making it more expensive for some spammers to subvert the spam filter training.
  - 35. The method of claim 26, wherein the subset of messages tagged for polling are selected at random.
  - 36. The method of claim 26, wherein the subset of messages tagged for polling is limited by one or more polling limits.
  - 37. The method of claim 26, wherein the one or more polling limits comprise a per user limit and a per user per time period limit to mitigate bias of data.
  - 38. The method of claim 26, further comprising modifying tagged messages to mark and identify them as polling messages.
  - 39. The method of claim 38, wherein modifying the tagged messages comprises performing at least one of the following:
    - moving the tagged message to a separate folder for polling messages;
      
      modifying the “from” address of the tagged message;
      
      modifying the subject line of the tagged message;
      
      using a polling icon on the tagged message to identify it as a polling message; and
      
      using a unique color to identify the tagged message as a polling message.
  - 40. The method of claim 26, wherein the polling message comprises an attachment of the message as originally received and a set of instructions instructing the user on how to vote.
  - 41. The method of claim 40, further comprising at least two voting buttons to facilitate classifying the message as spam and not spam.
  - 42. The method of claim 41, further comprising a third voting button to opt out of future polling.
  - 43. The method of claim 41, wherein the voting buttons are incorporated into the polling message by modifying text of the message before sending the polling message to the respective user.
  - 44. The method of claim 41, wherein the voting buttons are implemented by modifying a user interface of client email software.
  - 45. The method of claim 41, wherein the voting buttons are incorporated into the polling message.
  - 46. The method of claim 40, further comprising a summary of the message, the summary comprising at least one of a subject line, message sender, date the message was sent, date the message was received, and a first few lines of text from the message.
  - 47. The method of claim 26, further comprising scanning the tagged messages for viruses before they are downloaded for polling.
  - 48. The method of claim 47, further comprising removing the viruses from any infected messages.
  - 49. The method of claim 47, wherein tagged messages infected with a virus are discarded.
  - 50. The method of claim 26, further comprising making a copy of each tagged message as originally received such that the respective users receive a first copy of the message in its original form and a second copy of the message in a form for polling.
  - 51. The method of claim 26, wherein the tagged messages are individually assigned a unique ID corresponding to at least one of the tagged message and contents of the tagged message.
  - 52. The method of claim 51, wherein the tagged message and its associated ID are stored in a database in connection with training a spam filter and populating a spam list.
  - 53. The method of claim 26, wherein a feedback component receives the information relating to the user's classification of the polling message, the feedback component comprising a central database.
  - 54. The method of claim 53, wherein the database provides information in connection with training a spam filter and populating a spam list via a machine-learning technique.
  - 55. The method of claim 53, wherein identifying users and tagging messages for polling is distributed across one or more mail servers and one or more client email software such that data generated by the mail servers and client email software are returned to a central database for storage in connection with training a spam filter and populating a spam list.
  - 56. The method of claim 55, wherein key information is removed from any data that is sent to the central database by the mail servers and client email software for privacy reasons, such that only a portion of the data is sent to the central database to facilitate training the spam filter.
  - 57. The method of claim 56, wherein the portion of data sent to the central database comprises at least one of the following:
    - information relating to spam messages;
      
      domain names embedded in legitimate messages; and
      
      IP addresses embedded in legitimate messages.
  - 58. The method of claim 55, wherein the data generated by the mail servers and the data generated by the client email software are aggregated into statistical data, respectively, corresponding to polling results and polling messages, thus mitigating bandwidth required to transmit the data to the central database.
  - 59. The method of claim 58, wherein the messages are selected using active learning techniques, i.e. techniques that select messages based on their estimated value to learning new or updated filters.
  - 60. The method of claim 26, wherein the spam filter is trained using messages classified as spam and not spam to mitigate bias of polling data and misclassification of the polling messages.
  - 61. The method of claim 26, further comprising distributing the trained spam filter to one or more servers, the distribution occurring automatically and/or by request by at least one of an email message and a posting on a website for downloading.
  - 62. The method of claim 26, wherein training the spam filter and populating the spam list is performed by machine learning techniques using data based on user classification feedback and optionally, data generated by one or more additional sources, the one or more sources comprising honeypots, recipient non-user classification feedback, and active learning techniques.
  - 63. The method of claim 62, wherein data generated by the one or more sources is re-weighted proportionately with respect to the type of data generated by the source and relative to the user classification data to facilitate obtaining an unbiased sampling of data.
  - 64. The method of claim 62, wherein honeypots correspond to email addresses disclosed in a restrictive manner such that it is known who is sending them legitimate messages, thereby facilitating immediate identification of spammers, verification of suspect merchants who distribute user subscriber information to spammers, and immediate classification of spam messages without waiting for user classification.
  - 65. The method of claim 64, wherein information generated by the honeypots is down weighted selectively depending at least in part on a number of honeypots in use relative to a number of other sources including user classification feedback.
  - 66. The method of claim 64, wherein data generated by the honeypots is integrated in real time into a central database, where information relating to user classifications and polling messages are also stored for later use in connection with training a spam filter and populating a spam list.
  - 67. The method of claim 26 further comprising:
    - monitoring incoming messages for their respective one or more positive features;
      
      determining a frequency of positive features received;
      
      determining whether one or more positive features received exceeds a threshold frequency based at least in part upon historical data; and
      
      quarantining suspicious messages, which correspond to the one or more positive features that exceed the threshold frequency, until further classification data is available to determine whether suspicious messages are spam.
  - 68. The method of claim 67, wherein the feature used is information about the sender comprising at least one of the sender's IP address and domain.
  - 69. The method of claim 67, wherein quarantining suspicious messages is performed by at least one of the following acts:
    - provisionally labeling the suspicious messages as spam and moving them to a spam folder;
      
      delaying delivery of the suspicious messages to the user(s) until further classification data is available; and
      
      storing the suspicious messages in a folder not visible to the user(s).
  - 70. The method of claim 26, further comprising determining false positive and catch rates of the spam filter to facilitate optimization of the spam filter, wherein determining false positive and catch rates comprises:
    - training the spam filter using a training data set, the training data set comprising a first set of polling results;
      
      classifying a second set of polling messages using user feedback to yield a second set of polling results;
      
      running the second set of polling messages through the trained spam filter;
      
      comparing the second set of polling results to the trained spam filter results to determine false positive and catch rates of the filter to thereby evaluate and tune filter parameters according to optimal filter performance.
  - 71. The method of claim 70, wherein more than one spam filter is built, each having different parameters and each being trained on the same training data set, such that the false positive and catch rates of each spam filter is compared to at least one other spam filter to determine optimal parameters for spam filtering.
  - 72. The method of claim 26, further comprising building an improved spam filter using additional sets of incoming messages, subsets of which are subjected to polling to yield new information in connection with training the improved spam filter, wherein previously obtained information is re-weighted based at least in part upon how long ago it was obtained.
  - 73. The method of claim 26, further comprising employing the information to build a legitimate sender list.
  - 74. The method of claim 73, wherein the legitimate sender list comprises any one of IP addresses, domain names, and URLs that are substantially classified as sources of good mail according to a percentage of messages classified as good.
  - 75. The method of claim 26, wherein the spam lists are used to generate a black hole list of addresses from whom no mail would be accepted.
  - 76. The method of claim 26, further comprising employing the information to facilitate terminating accounts of spammers.
  - 77. The method of claim 76, further comprising identifying a spammer who is using an ISP and automatically notifying the ISP of the spamming.
  - 78. The method of claim 76, further comprising identifying a domain responsible for sending spam, and automatically notifying at least one of the domain's email provider and the domain's ISP of the spamming.
  - 79. The method of claim 26, further comprising distributing at least one of the spam filter and the spam list to any one of mail servers, email servers, and client email software, wherein distributing comprises at least one of the following:
    - posting a notification on a website notifying that the spam filter and spam list are available for downloading;
      
      automatically pushing the spam filter and spam list out to mail servers, email servers and client email software; and
      
      manually pushing the spam filter and spam list out to mail servers, email servers, and client email software.
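The evaluation step in claim 70 (comparing user votes on a second set of polling messages against the trained filter's output) can be sketched as follows. The claims do not define the two rates, so this sketch assumes the conventional meanings: catch rate is the fraction of user-labeled spam that the filter also marks as spam, and false positive rate is the fraction of user-labeled legitimate mail that the filter marks as spam. The function name and the dictionary inputs are hypothetical.

```python
def filter_rates(user_labels, filter_labels):
    """Return (false_positive_rate, catch_rate) for one spam filter.

    Both arguments map a message ID to "spam" or "legitimate".
    user_labels holds the polling votes, treated as ground truth;
    filter_labels holds the trained filter's verdicts on the same messages.
    """
    spam_total = caught = legit_total = false_positives = 0
    for msg_id, truth in user_labels.items():
        predicted = filter_labels[msg_id]
        if truth == "spam":
            spam_total += 1
            if predicted == "spam":
                caught += 1
        else:
            legit_total += 1
            if predicted == "spam":
                false_positives += 1
    # Guard against an evaluation set that lacks one of the two classes.
    catch_rate = caught / spam_total if spam_total else 0.0
    fp_rate = false_positives / legit_total if legit_total else 0.0
    return fp_rate, catch_rate
```

For the comparison described in claim 71, the same function could be called once per candidate filter on the same held-out votes, and the parameter setting with the preferred trade-off between the two rates selected.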

80. A computer-readable storage medium having stored thereon computer components, when executed by one or more processor for facilitating classification of messages in connection with spam prevention, the components comprising:
- a component that receives a set of messages;
  
  a component that identifies intended recipients of the messages, and tags a subset of the messages to be polled, the subset of messages corresponding to a subset of recipients that are known spam fighting users, wherein the subset of messages to be polled is determined before the messages are labeled as spam or not spam, as such all messages are considered for polling including those messages which are designated as spam by a currently employed spam filter;
  
  a message modification component that modifies the tagged messages to identify them as polling messages to users;
  
  a feedback component that receives information relating to the user's classification of the polled messages, and employs the information in connection with training a spam filter and populating a spam list, wherein the feedback component employs machine learning techniques to train the spam filter; and
  
  a component that modifies a message tagged for polling to identify it as a polling message, wherein the modified message comprises voting instructions and any one of at least two voting buttons and links which correspond to at least two respective classes of messages facilitate classification of the message by the user, wherein the voting buttons correspond to respective links such that when any one of the voting buttons is selected by the user, information relating to the selected voting button, the respective user, and the message's unique ID assigned thereto is sent to a database for storage.
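The message modification component can be illustrated with a short sketch in which each voting button is a plain link that carries the vote, the user, and the message's unique ID. Everything concrete here is an assumption: the endpoint `https://polling.example/vote`, the query parameter names, and the plain-text layout of the polling message are hypothetical, and a real deployment would also need to authenticate votes, which the sketch does not attempt.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint that would write incoming votes to the database.
VOTE_ENDPOINT = "https://polling.example/vote"


def vote_link(msg_id, user, label):
    """Build a link encoding the selected vote, the user, and the message ID."""
    query = urlencode({"id": msg_id, "user": user, "vote": label})
    return f"{VOTE_ENDPOINT}?{query}"


def make_polling_message(original_body, user, msg_id=None):
    """Prepend voting instructions and two vote links to a tagged message."""
    msg_id = msg_id or uuid.uuid4().hex   # assign a unique ID if none is given
    lines = [
        "POLLING MESSAGE: please classify the message below.",
        "Legitimate mail: " + vote_link(msg_id, user, "legitimate"),
        "Spam: " + vote_link(msg_id, user, "spam"),
        "",
        original_body,
    ]
    return msg_id, "\n".join(lines)


def parse_vote(url):
    """Recover (msg_id, user, vote) from a clicked link, as a server might."""
    params = parse_qs(urlparse(url).query)
    return params["id"][0], params["user"][0], params["vote"][0]
```

Encoding the three fields in the link itself means a click is enough to record the vote, which matches the claim's requirement that selecting a button sends the button, user, and ID to storage.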

81. A system that facilitates classifying messages in connection with spam prevention comprising a computer-readable storage medium, comprising:
- computer-executable means for receiving a set of the messages;
  
  computer-executable means for identifying intended recipients of the messages;
  
  computer-executable means for tagging a subset of the messages to be polled, the subset of messages corresponding to a subset of the recipients that are known spam fighting users wherein the subset of messages to be polled is determined before the messages are labeled as spam or not spam, as such all messages are considered for polling including those messages which are designated as spam by a currently employed spam filter;
  
  computer-executable means for receiving information relating to the user's classification of the polling messages;
  
  computer-executable means for employing the information in connection with training a spam filter and populating a spam list, wherein training the spam filter is employed via a machine learning technique; and
  
  computer-executable means for modifying a message tagged for polling to identify it as a polling message, wherein the modified message comprises voting instructions and any one of at least two voting buttons and links which correspond to at least two respective classes of messages facilitate classification of the message by the user, wherein the voting buttons correspond to respective links such that when any one of the voting buttons is selected by the user, information relating to the selected voting button, the respective user, and the message's unique ID assigned thereto is sent to a database for storage.


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Rounthwaite, Robert L., Goodman, Joshua T., Mehr, John D., Heckerman, David E., Slawson, Dean A., Rupersburg, Micah C., Howell, Nathan D.
Primary Examiner(s): Tran; Philip B.
Application Number: US10/378,463
Publication Number: US 20040177110A1
Time in Patent Office: 1,534 Days
Field of Search: 709/224, 709/206-207, 709/217, 709/223, 713/154
US Class Current: 709/224
CPC Class Codes:
G06Q 10/107 Computer-aided management o...
H04L 51/212 using filtering or selectiv...
