Feedback loop for spam prevention

US 20040177110A1
Filed: 03/03/2003
Published: 09/09/2004
Est. Priority Date: 03/03/2003
Status: Active Grant

Abstract

The subject invention provides for a feedback loop system and method that facilitate classifying items in connection with spam prevention in server and/or client-based architectures. The invention makes use of a machine-learning approach as applied to spam filters and, in particular, randomly samples incoming email messages so that examples of both legitimate and junk/spam mail are obtained to generate sets of training data. Users who are identified as spam-fighters are asked to vote on whether each of a selection of their incoming email messages is legitimate mail or junk mail. A database stores the properties of each mail and voting transaction, such as user information, message properties and content summary, and polling results for each message, to generate training data for machine learning systems. The machine learning systems facilitate creating improved spam filter(s) that are trained to recognize both legitimate mail and spam mail and to distinguish between them.
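
As a rough illustration of how stored voting transactions could become training data, the following Python sketch joins message text with poll votes. It is a minimal sketch and not the patented implementation: the `PollVote` record, its field names, the `"spam"`/`"good"` labels, and the majority-vote rule for messages polled to several users are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PollVote:
    """One voting transaction (all field names here are hypothetical)."""
    user_id: str
    message_id: str
    label: str  # "spam" or "good"


def build_training_set(messages, votes):
    """Join stored message text with poll votes to get (text, label) pairs.

    messages: dict mapping message_id -> message text
    votes: iterable of PollVote
    Messages with no stored text, or with a tied vote, are skipped.
    """
    # Count the votes each label received, per message.
    tallies = defaultdict(Counter)
    for vote in votes:
        tallies[vote.message_id][vote.label] += 1

    training = []
    for message_id, tally in sorted(tallies.items()):
        if message_id not in messages:
            continue
        ranked = tally.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between labels: no usable training label
        training.append((messages[message_id], ranked[0][0]))
    return training
```

The resulting `(text, label)` pairs contain both classes, which is the property the abstract relies on for training a filter that distinguishes legitimate mail from spam.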

95 Claims

1. A system that facilitates classifying items in connection with spam prevention, comprising:
- a component that receives a set of the items;
  
  a component that identifies intended recipients of the items, and tags a subset of the items to be polled, the subset of items corresponding to a subset of recipients that are known spam fighting users; and
  
  a feedback component that receives information relating to the spam fighter's classification of the polled items, and employs the information in connection with training a spam filter and populating a spam list.
- - 2. The system of claim 1, wherein the items comprise at least one of:
    - electronic mail (email) and messages.
  - 3. The system of claim 1, wherein the component that receives a set of the items is any one of an email server, a message server, and client email software.
  - 4. The system of claim 1, wherein the subset of items to be polled comprises all of the items received.
  - 5. The system of claim 1, wherein the subset of recipients comprises all recipients.
  - 6. The system of claim 1, wherein the subset of recipients are randomly selected.
  - 7. The system of claim 1, wherein the subset of recipients comprises paying users of the system.
  - 8. The system of claim 1, wherein at least a subset of messages that would ordinarily be filtered are considered for polling.
  - 9. The system of claim 1, wherein the subset of items tagged for polling is limited to at least one of the following:
    - a number of the items selected per user;
      
      a number of the items selected per user per time period; and
      
      a probability of tagging an item corresponding to a known user.
  - 10. The system of claim 1, wherein the tagged items are each assigned a unique ID, the unique ID corresponding to any one of the tagged item and contents of the tagged item.
  - 11. The system of claim 1, further comprising a component that modifies an item tagged for polling to identify it as a polling item.
  - 12. The system of claim 11, wherein the modified item comprises at least one of the following:
    - a modified “from” address;
      
      a modified subject line;
      
      a polling icon; and
      
      a polling color for identification as a polling item.
  - 13. The system of claim 11, wherein the modified item comprises the tagged item as an attachment.
  - 14. The system of claim 11, wherein the modified item comprises a summary of the tagged item, the summary comprising at least one of a subject, a date, text of the message, and a first few lines of the text.
  - 15. The system of claim 11, wherein the modified item comprises voting instructions and any one of at least two voting buttons and links which correspond to at least two respective classes of items facilitate classification of the item by the user.
  - 16. The system of claim 15, wherein the voting buttons correspond to respective links such that when any one of the voting buttons is selected by the user, information relating to the selected voting button, the respective user, and the item's unique ID assigned thereto is sent to a database for storage.
  - 17. The system of claim 15, wherein the at least two voting buttons comprise a first voting button, the first voting button indicating “legitimate mail” and a second voting button, the second voting button indicating “spam”.
  - 18. The system of claim 15, wherein the voting buttons are implemented by modifying text of the item.
  - 19. The system of claim 15, wherein the voting buttons are implemented by modifying a user interface of client email software.
  - 20. The system of claim 1, further comprising a central database that stores information and data relating to user properties, item content and properties associated with tagged items, user classification and voting statistical data, frequency analysis data of polling per user and of polling per user per time period, spam lists, legitimate mail lists, and black hole lists.
  - 21. The system of claim 1, wherein items tagged for polling that are marked as spam by an existing filter are delivered to the user's inbox and considered for polling.
  - 22. The system of claim 1, wherein items tagged for polling are scanned for viruses such that one of the following occurs:
    - detected viruses are stripped out and the items are polled; and
      
      infected items are discarded.
  - 23. The system of claim 1 distributed across more than one spam-fighting company such that feedback from each company is sent to a central database operatively interfaced with each company, wherein some portion of the feedback is removed for privacy reasons.
  - 24. The system of claim 23, wherein the company feedback comprises one of the following:
    - only spam items, thereby excluding legitimate items; and
      
      spam items and sender name, domain name and IP address of legitimate items.
  - 25. The system of claim 1, further comprising a user classification validation component that tests user reliability and trustworthiness.
  - 26. The system of claim 25, wherein the user classification validation component is at least one of a cross-validation technique and a known result test message technique.
  - 27. The system of claim 25, wherein the user classification validation component can be applied to one or more suspected users.
  - 28. The system of claim 1, wherein the feedback component receives information relating to user feedback, honeypot feedback, and optionally, user recipient feedback of received items.
  - 29. A server employing the system of claim 1.
  - 30. An e-mail architecture employing the system of claim 1.
  - 31. A computer readable medium having stored thereon the components of claim 1.
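
The tagging behavior described in claims 6, 9, and 10 (random selection, per-user and per-time-period limits, and a unique ID per tagged item) can be sketched roughly as follows. This is a hedged illustration only: the `PollTagger` class, its parameter names, the default probability and daily limit, and the use of a day string as the time period are assumptions made for this example and do not come from the claims.

```python
import random
import uuid
from collections import defaultdict


class PollTagger:
    """Tags a random subset of messages addressed to spam-fighting users."""

    def __init__(self, spam_fighters, tag_probability=0.1,
                 max_per_user_per_day=2, rng=None):
        self.spam_fighters = set(spam_fighters)
        self.tag_probability = tag_probability            # assumed default
        self.max_per_user_per_day = max_per_user_per_day  # assumed default
        self.rng = rng or random.Random()
        self._tagged = defaultdict(int)  # (recipient, day) -> tagged count

    def maybe_tag(self, recipient, day):
        """Return a unique poll ID if the message is tagged, else None.

        The decision ignores message content, so messages an existing
        filter would block are sampled at the same rate as any other.
        """
        if recipient not in self.spam_fighters:
            return None
        if self._tagged[(recipient, day)] >= self.max_per_user_per_day:
            return None  # per-user, per-period polling limit reached
        if self.rng.random() >= self.tag_probability:
            return None  # not selected by the random draw
        self._tagged[(recipient, day)] += 1
        return uuid.uuid4().hex


# Usage: a probability of 1.0 makes every eligible message a poll candidate.
tagger = PollTagger({"alice@example.com"}, tag_probability=1.0)
poll_id = tagger.maybe_tag("alice@example.com", "2003-03-03")
```

Passing a seeded `random.Random` as `rng` makes the sampling repeatable, which may help when testing the limits.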

32. A method that facilitates classifying messages in connection with spam prevention comprising:
- receiving a set of the messages;
  
  identifying intended recipients of the messages;
  
  tagging a subset of the messages to be polled, the subset of messages corresponding to a subset of the recipients that are known spam fighting users;
  
  receiving information relating to the user's classification of polling messages; and
  
  employing the information in connection with training a spam filter and populating a spam list.
- - 33. The method of claim 32, wherein at least a subset of messages that would ordinarily be filtered are received by an email server and into a feedback loop system.
  - 34. The method of claim 32, wherein all incoming messages are handled by client email software such that messages selected for polling are specific to preferences of an individual user.
  - 35. The method of claim 32, wherein all messages received are considered for polling to mitigate bias of data.
  - 36. The method of claim 32, wherein the subset of messages to be polled comprises all messages.
  - 37. The method of claim 32, wherein the subset of recipients comprises all recipients.
  - 38. The method of claim 32, wherein the subset of recipients which are known spam fighting users is determined by each recipient performing at least one of the following:
    - opting in to provide feedback on messages to facilitate training a new spam filter;
      
      passively opting in to provide feedback on messages by not opting out;
      
      paying for email and message services provided by a participating message server; and
      
      opening an email account with a participating message server.
  - 39. The method of claim 32, wherein the subset of users selected to participate in the message polling is selected at random.
  - 40. The method of claim 32, wherein the subset of users selected to participate in the message polling are selected from all paying users, thereby making it more expensive for some spammers to subvert the spam filter training.
  - 41. The method of claim 32, wherein the subset of messages tagged for polling are selected at random.
  - 42. The method of claim 32, wherein the subset of messages tagged for polling is limited by one or more polling limits.
  - 43. The method of claim 32, wherein the one or more polling limits comprises a per user limit and a per user per time period limit to mitigate bias of data.
  - 44. The method of claim 32, further comprising modifying tagged messages to mark and identify them as polling messages.
  - 45. The method of claim 44, wherein modifying the tagged messages comprises performing at least one of the following:
    - moving the tagged message to a separate folder for polling messages;
      
      modifying the “from” address of the tagged message;
      
      modifying the subject line of the tagged message;
      
      using a polling icon on the tagged message to identify it as a polling message; and
      
      using a unique color to identify the tagged message as a polling message.
  - 46. The method of claim 32, wherein the polling message comprises an attachment of the message as originally received and a set of instructions instructing the user on how to vote.
  - 47. The method of claim 46, further comprising at least two voting buttons to facilitate classifying the message as spam and not spam.
  - 48. The method of claim 47, wherein the voting buttons are links, which when selected by the user, create feedback that is employed in connection with training a spam filter and populating a spam list, the feedback comprising information relating to the selected classification, the user, the message, and a unique ID assigned to one of the message and contents of the message.
  - 49. The method of claim 47, further comprising a third voting button to opt out of future polling.
  - 50. The method of claim 47, wherein the voting buttons are incorporated into the polling message by modifying text of the message before sending the polling message to the respective user.
  - 51. The method of claim 47, wherein the voting buttons are implemented by modifying a user interface of client email software.
  - 52. The method of claim 47, wherein the voting buttons are incorporated into the polling message.
  - 53. The method of claim 32, further comprising scanning the tagged messages for viruses before they are downloaded for polling.
  - 54. The method of claim 53, further comprising removing the viruses from any infected messages.
  - 55. The method of claim 53, wherein tagged messages infected with a virus are discarded.
  - 56. The method of claim 46, further comprising a summary of the message, the summary comprising at least one of a subject line, message sender, date the message was sent, date the message was received, and a first few lines of text from the message.
  - 57. The method of claim 32, further comprising making a copy of each tagged message as originally received such that the respective users receive a first copy of the message in its original form and a second copy of the message in a form for polling.
  - 58. The method of claim 32, wherein the tagged messages are individually assigned a unique ID corresponding to at least one of the tagged message and contents of the tagged message.
  - 59. The method of claim 58, wherein the tagged message and its associated ID are stored in a database in connection with training a spam filter and populating a spam list.
  - 60. The method of claim 32, wherein a feedback component receives the information relating to the user's classification of the polling message, the feedback component comprising a central database.
  - 61. The method of claim 60, wherein the database provides information in connection with training a spam filter and populating a spam list via a machine-learning technique.
  - 62. The method of claim 32, wherein the spam filter is trained using messages classified as spam and not spam to mitigate bias of polling data and misclassification of the polling messages.
  - 63. The method of claim 32, further comprising distributing the trained spam filter to one or more servers, the distribution occurring automatically and/or by request by at least one of an email message and a posting on a website for downloading.
  - 64. The method of claim 60, wherein identifying users and tagging messages for polling is distributed across one or more mail servers and one or more client email software such that data generated by the mail servers and client email software are returned to a central database for storage in connection with training a spam filter and populating a spam list.
  - 65. The method of claim 64, wherein key information is removed from any data that is sent to the central database by the mail servers and client email software for privacy reasons, such that only a portion of the data is sent to the central database to facilitate training the spam filter.
  - 66. The method of claim 65, wherein the portion of data sent to the central database comprises at least one of the following:
    - information relating to spam messages;
      
      domain names embedded in legitimate messages; and
      
      IP addresses embedded in legitimate messages.
  - 67. The method of claim 64, wherein the data generated by the mail servers and the data generated by the client email software are aggregated into statistical data, respectively, corresponding to polling results and polling messages, thus mitigating bandwidth required to transmit the data to the central database.
  - 68. The method of claim 32, wherein training the spam filter and populating the spam list is performed by machine learning techniques using data based on user classification feedback and optionally, data generated by one or more additional sources, the one or more sources comprising honeypots, recipient non-user classification feedback, and active learning techniques.
  - 69. The method of claim 68, wherein data generated by the one or more sources is re-weighted proportionately with respect to the type of data generated by the source and relative to the user classification data to facilitate obtaining an unbiased sampling of data.
  - 70. The method of claim 68, wherein honeypots correspond to email addresses disclosed in a restrictive manner such that it is known who is sending them legitimate messages, thereby facilitating immediate identification of spammers, verification of suspect merchants who distribute user subscriber information to spammers, and immediate classification of spam messages without waiting for user classification.
  - 71. The method of claim 70, wherein information generated by the honeypots is down weighted selectively depending at least in part on a number of honeypots in use relative to a number of other sources including user classification feedback.
  - 72. The method of claim 70, wherein data generated by the honeypots is integrated in real time into a central database, where information relating to user classifications and polling messages are also stored for later use in connection with training a spam filter and populating a spam list.
  - 73. The method of claim 67, wherein the messages are selected using active learning techniques, i.e. techniques that select messages based on their estimated value to learning new or updated filters.
  - 74. The method of claim 32 further comprising:
    - monitoring incoming messages for their respective one or more positive features;
      
      determining a frequency of positive features received;
      
      determining whether one or more positive features received exceeds a threshold frequency based at least in part upon historical data; and
      
      quarantining suspicious messages, which correspond to the one or more positive features that exceed the threshold frequency, until further classification data is available to determine whether suspicious messages are spam.
  - 75. The method of claim 74, wherein the feature used is information about the sender comprising at least one of the sender's IP address and domain.
  - 76. The method of claim 74, wherein quarantining suspicious messages is performed by at least one of the following acts:
    - provisionally labeling the suspicious messages as spam and moving them to a spam folder;
      
      delaying delivery of the suspicious messages to the user(s) until further classification data is available; and
      
      storing the suspicious messages in a folder not visible to the user(s).
  - 77. The method of claim 32, further comprising determining false positive and catch rates of the spam filter to facilitate optimization of the spam filter, wherein determining false positive and catch rates comprises:
    - training the spam filter using a training data set, the training data set comprising a first set of polling results;
      
      classifying a second set of polling messages using user feedback to yield a second set of polling results;
      
      running the second set of polling messages through the trained spam filter;
      
      comparing the second set of polling results to the trained spam filter results to determine false positive and catch rates of the filter to thereby evaluate and tune filter parameters according to optimal filter performance.
  - 78. The method of claim 77, wherein more than one spam filter is built, each having different parameters and each being trained on the same training data set, such that the false positive and catch rates of each spam filter is compared to at least one other spam filter to determine optimal parameters for spam filtering.
  - 79. The method of claim 32, further comprising building an improved spam filter using additional sets of incoming messages, subsets of which are subjected to polling to yield new information in connection with training the improved spam filter, wherein previously obtained information is re-weighted based at least in part upon how long ago it was obtained.
  - 80. The method of claim 32, further comprising employing the information to build a legitimate sender list.
  - 81. The method of claim 80, wherein the legitimate sender list comprises any one of IP addresses, domain names, and URLs that are substantially classified as sources of good mail according to a percentage of messages classified as good.
  - 82. The method of claim 32, wherein the spam lists are used to generate a black hole list of addresses from whom no mail would be accepted.
  - 83. The method of claim 32, further comprising employing the information to facilitate terminating accounts of spammers.
  - 84. The method of claim 83, further comprising identifying a spammer who is using an ISP and automatically notifying the ISP of the spamming.
  - 85. The method of claim 83, further comprising identifying a domain responsible for sending spam, and automatically notifying at least one of the domain's email provider and the domain's ISP of the spamming.
  - 86. The method of claim 32, further comprising distributing at least one of the spam filter and the spam list to any one of mail servers, email servers, and client email software, wherein distributing comprises at least one of the following:
    - posting a notification on a website notifying that the spam filter and spam list are available for downloading;
      
      automatically pushing the spam filter and spam list out to mail servers, email servers and client email software; and
      
      manually pushing the spam filter and spam list out to mail servers, email servers, and client email software.
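
The evaluation step in claim 77 compares a trained filter's verdicts with user votes on a second, held-out set of polled messages. A minimal Python sketch of that comparison follows; the function name, the dictionary inputs, and the `"spam"`/`"good"` labels are assumptions made for this example, and the rate definitions used are the conventional ones (false positive rate over good mail, catch rate over spam).

```python
def false_positive_and_catch_rates(user_labels, filter_labels):
    """Compare filter verdicts with user votes on held-out polled messages.

    Both arguments map message_id -> "spam" or "good".
    Returns (false_positive_rate, catch_rate), where
      false positive rate = good messages the filter called spam / all good
      catch rate          = spam messages the filter called spam / all spam
    """
    good = [m for m, label in user_labels.items() if label == "good"]
    spam = [m for m, label in user_labels.items() if label == "spam"]
    false_positives = sum(1 for m in good if filter_labels[m] == "spam")
    caught = sum(1 for m in spam if filter_labels[m] == "spam")
    # Guard against an empty class so the function never divides by zero.
    fp_rate = false_positives / len(good) if good else 0.0
    catch_rate = caught / len(spam) if spam else 0.0
    return fp_rate, catch_rate
```

Running the same function over several candidate filters trained on the same data, as in claim 78, would let their rates be compared side by side.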

87. A cross-validation method that facilitates verifying reliability and trustworthiness of user classifications comprising:
- excluding one or more suspected user's classifications from data employed to train the spam filter;
  
  training the spam filter using all other available user classifications; and
  
  running the suspected user's polling messages through the trained spam filter to determine how it would have classified the messages compared to the suspected user's classifications.
- - 88. The method of claim 87, further comprising performing at least one of the following:
    - discounting existing and future classifications provided by users who are determined to be untrustworthy until the users are determined to be trustworthy;
      
      discarding existing classifications provided by users determined to be untrustworthy; and
      
      removing the untrustworthy users from future polling.
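
The cross-validation check in claim 87 can be sketched as: hold out the suspected user's votes, train on everyone else's, then measure how often the resulting filter agrees with the suspect. In the Python sketch below, the token-count scorer is a deliberately simple stand-in for a real machine-learning filter, and every function name and the `(user_id, text, label)` vote format are assumptions made for this example.

```python
from collections import Counter


def train_token_filter(examples):
    """Toy stand-in for a real learner: score tokens by spam-minus-good counts."""
    scores = Counter()
    for text, label in examples:
        delta = 1 if label == "spam" else -1
        for token in set(text.lower().split()):
            scores[token] += delta
    return scores


def classify(scores, text):
    """Label text as spam when its tokens' combined score is positive."""
    total = sum(scores[token] for token in set(text.lower().split()))
    return "spam" if total > 0 else "good"


def agreement_without_user(votes, suspect):
    """Fraction of the suspect's votes matching a filter trained without them.

    votes: list of (user_id, text, label) tuples.
    Returns None when the suspect has cast no votes.
    """
    others = [(text, label) for user, text, label in votes if user != suspect]
    held_out = [(text, label) for user, text, label in votes if user == suspect]
    if not held_out:
        return None
    scores = train_token_filter(others)
    matches = sum(1 for text, label in held_out
                  if classify(scores, text) == label)
    return matches / len(held_out)
```

A low agreement value does not prove bad faith on its own; it only marks the user as a candidate for the discounting or removal steps listed in claim 88.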

89. A method that facilitates verifying reliability and trustworthiness in user classifications for training a spam filter via a feedback loop system comprising:
- identifying a subset of spam-fighting users as suspect users;
  
  providing one or more messages having a known result to the suspect users for polling; and
  
  determining whether the suspected users' classification of the one or more test messages matches the known classification to ascertain the reliability of the users' classifications.
- - 90. The method of claim 89, wherein the subset of spam-fighting users identified as suspect users comprises all users.
  - 91. The method of claim 89, wherein the message is a test message that is known to be at least one of spam and good mail and that is injected into a stream of incoming mail by the feedback loop system and delivered to the suspect users.
  - 92. The method of claim 89, wherein the message received by the suspected users for polling is hand-classified by a system administrator to train the spam filter with a correct classification to identify untrustworthy users.
  - 93. The method of claim 89, further comprising at least one of the following acts:
    - discounting existing and future classifications provided by users who are determined to be untrustworthy until the users are determined to be trustworthy;
      
      discarding existing classifications provided by users determined to be untrustworthy; and
      
      removing the untrustworthy users from future polling.
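
The known-result test in claim 89 amounts to scoring each suspect user against messages whose correct classification is already known. The Python sketch below shows one way this could look; the function name, the input shapes, and the 0.8 accuracy threshold are assumptions made for this example and are not specified in the claims.

```python
def find_untrustworthy_users(known_labels, responses, min_accuracy=0.8):
    """Flag users whose votes on known-result test messages are often wrong.

    known_labels: dict mapping test message_id -> known label
    responses: list of (user_id, message_id, label) votes
    min_accuracy: assumed cutoff below which a user is flagged
    Returns a dict mapping each flagged user_id to their accuracy.
    """
    correct = {}
    total = {}
    for user_id, message_id, label in responses:
        if message_id not in known_labels:
            continue  # ordinary polled message, not a seeded test message
        total[user_id] = total.get(user_id, 0) + 1
        if label == known_labels[message_id]:
            correct[user_id] = correct.get(user_id, 0) + 1

    flagged = {}
    for user_id, count in total.items():
        accuracy = correct.get(user_id, 0) / count
        if accuracy < min_accuracy:
            flagged[user_id] = accuracy
    return flagged
```

Users returned by this function would then be candidates for the discounting, discarding, or removal acts listed in claim 93.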

94. A computer-readable medium having stored thereon the following computer executable components:
- a component that receives a set of messages;
  
  a component that identifies intended recipients of the messages, and tags a subset of the messages to be polled, the subset of messages corresponding to a subset of recipients that are known spam fighting users;
  
  a message modification component that modifies the tagged messages to identify them as polling messages to users; and
  
  a feedback component that receives information relating to the user's classification of the polled messages, and employs the information in connection with training a spam filter and populating a spam list.

95. A system that facilitates classifying messages in connection with spam prevention comprising:
- means for receiving a set of the messages;
  
  means for identifying intended recipients of the messages;
  
  means for tagging a subset of the messages to be polled, the subset of messages corresponding to a subset of the recipients that are known spam fighting users;
  
  means for receiving information relating to the user's classification of the polling messages; and
  
  means for employing the information in connection with training a spam filter and populating a spam list.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Rounthwaite, Robert L.; Goodman, Joshua T.; Mehr, John D.; Heckerman, David E.; Slawson, Dean A.; Rupersburg, Micah C.; Howell, Nathan D.

Granted Patent

US 7,219,148 B2
US Class Current

709/202
CPC Class Codes

G06Q 10/107 Computer-aided management o...

H04L 51/212 using filtering or selectiv...
