Large scale machine learning systems and methods

US 8,364,618 B1
Filed: 06/04/2012
Issued: 01/29/2013
Est. Priority Date: 11/14/2003
Status: Active Grant

Abstract

A system for generating a model is provided. The system generates, or selects, candidate conditions and generates, or otherwise obtains, statistics regarding the candidate conditions. The system also forms rules based, at least in part, on the statistics and the candidate conditions and selectively adds the rules to the model.
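
The flow described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented method: representing a condition as a set of co-occurring features, using a smoothed log-odds value as the statistic-derived weight, and the `min_support` and `min_abs_weight` selection thresholds are all hypothetical choices made for this example.

```python
import math
from itertools import combinations

def candidate_conditions(docs, size=2):
    """Yield every set of `size` features that co-occur in some document."""
    seen = set()
    for features, _label in docs:
        for combo in combinations(sorted(features), size):
            if combo not in seen:
                seen.add(combo)
                yield frozenset(combo)

def condition_statistics(condition, docs):
    """Count documents matching the condition and how many are labeled spam."""
    matches = [label for features, label in docs if condition <= features]
    return len(matches), sum(matches)

def build_model(docs, min_support=2, min_abs_weight=0.5):
    """Form a rule (condition -> weight) per candidate and selectively keep it."""
    model = {}
    for condition in candidate_conditions(docs):
        n, spam = condition_statistics(condition, docs)
        # Smoothed log-odds of spam among matching documents (assumed statistic).
        weight = math.log((spam + 1) / (n - spam + 1))
        # Selective addition: keep only well-supported, informative rules.
        if n >= min_support and abs(weight) >= min_abs_weight:
            model[condition] = weight
    return model

# Toy corpus: (set of features, label) with label 1 = spam, 0 = non-spam.
docs = [
    ({"free", "winner", "click"}, 1),
    ({"free", "winner"}, 1),
    ({"meeting", "agenda"}, 0),
    ({"meeting", "agenda", "free"}, 0),
]
model = build_model(docs)
```

With this toy corpus only two conditions have enough support to be kept: `{"free", "winner"}` receives a positive weight and `{"agenda", "meeting"}` a negative one.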

20 Claims

1. A method comprising:
- generating, by one or more processors, a model based on a plurality of features associated with documents that include spam documents and non-spam documents, the generating of the model including:
  - identifying, by the one or more processors, a condition associated with two or more features of the plurality of features,
  - receiving, by the one or more processors and from a plurality of devices associated with the documents, statistics associated with the identified condition, a particular statistic, of the received statistics, being received from a particular device, of the plurality of devices, and the particular statistic indicating a particular weight, associated with the identified condition, for the particular device,
  - generating a candidate rule for the model based on the condition and the received statistics,
  - determining whether to add the candidate rule to the model,
  - upon determining that the candidate rule should not be added to the model, setting a weight, for the candidate rule, to a value that indicates that the candidate rule should not be added to the model, and
  - generating, by the one or more processors and based on the received statistics, a composite weight associated with the condition, the composite weight indicating how relevant the condition is, with respect to other conditions, in determining whether a document is to be classified as spam, the other conditions being associated with respective subsets of the plurality of features that differ from the condition;
- receiving, by the one or more processors, a particular document, the particular document being associated with one or more features of the plurality of features;
- determining, by the one or more processors and based on applying the model to the one or more features, to classify the particular document as a spam document; and
- storing, by the one or more processors, information regarding the particular document based on the particular document being classified as the spam document.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, the particular document including an e-mail.
  - 3. The method of claim 1, the particular statistic further identifying a particular subset of the documents associated with the particular device.
  - 4. The method of claim 3, the particular statistic further identifying one or more documents, of the particular subset of the documents associated with the particular device, that are classified, by the particular device, as spam documents.
  - 5. The method of claim 1, the generating of the model further including:
    - replacing, in the model, a previous weight associated with the condition with the composite weight.
  - 6. The method of claim 5, the generating of the model further including:
    - determining a cost associated with replacing, in the model, the previous weight with the composite weight; and
    - determining that the cost does not exceed a threshold cost, the previous weight being replaced with the composite weight based on the cost not exceeding the threshold cost.
  - 7. The method of claim 1, the generating of the model further including:
    - determining that the composite weight satisfies a threshold weight,
    - adding the candidate rule to the model based on the composite weight satisfying the threshold weight, and
    - notifying the plurality of devices that the candidate rule was added to the model.
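
The weight handling recited in claims 1 and 3 through 7 can be sketched as follows. The claims do not say how the per-device weights are combined or how the cost is measured, so everything specific here is an assumption made for illustration: the `DeviceStatistic` record, a document-count-weighted mean as the composite weight, the absolute change in weight as the cost, the two threshold values, the `notify` callback, and `0.0` as the value that marks a rejected candidate rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Condition = FrozenSet[str]

@dataclass
class DeviceStatistic:
    device_id: str
    weight: float   # weight this device computed for the condition
    num_docs: int   # size of the device's subset of documents (claim 3)
    num_spam: int   # documents the device classified as spam (claim 4)

def composite_weight(stats: List[DeviceStatistic]) -> float:
    """Combine per-device weights; a document-count-weighted mean is assumed."""
    total = sum(s.num_docs for s in stats)
    if total == 0:
        return 0.0
    return sum(s.weight * s.num_docs for s in stats) / total

def update_model(
    model: Dict[Condition, float],
    condition: Condition,
    stats: List[DeviceStatistic],
    notify: Callable[[str, Condition], None],
    threshold_weight: float = 0.5,
    threshold_cost: float = 2.0,
) -> float:
    """Return the weight associated with `condition` after this update."""
    new_weight = composite_weight(stats)
    if condition in model:
        # Claims 5-6: replace the previous weight only if the cost is acceptable.
        cost = abs(new_weight - model[condition])  # assumed cost measure
        if cost <= threshold_cost:
            model[condition] = new_weight
        return model[condition]
    if abs(new_weight) >= threshold_weight:
        # Claim 7: add the candidate rule, then notify the contributing devices.
        model[condition] = new_weight
        for s in stats:
            notify(s.device_id, condition)
        return new_weight
    # Rejected candidate: 0.0 is the assumed "do not add" value from claim 1.
    return 0.0

condition = frozenset({"free", "winner"})
model: Dict[Condition, float] = {}
notified: List[str] = []

stats = [
    DeviceStatistic("device-a", weight=2.0, num_docs=100, num_spam=60),
    DeviceStatistic("device-b", weight=1.0, num_docs=300, num_spam=90),
]
first = update_model(model, condition, stats, lambda d, c: notified.append(d))

# A later update whose composite weight differs too much is not applied.
drastic = [DeviceStatistic("device-a", weight=5.0, num_docs=10, num_spam=9)]
second = update_model(model, condition, drastic, lambda d, c: notified.append(d))
```

In this run the first call adds the rule with a composite weight of 1.25 and notifies both devices; the second call computes a cost of 3.75, which exceeds the assumed threshold, so the stored weight is left unchanged.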

8. A system comprising:
- one or more processors to:
  - generate a model based on a plurality of features associated with documents that include spam documents and non-spam documents;
  - the one or more processors, when generating the model, being further to:
    - identify a condition associated with two or more features of the plurality of features,
    - receive, from a plurality of devices associated with the documents, statistics associated with the identified condition, a particular statistic, of the received statistics, being received from a particular device, of the plurality of devices, and the particular statistic indicating a particular weight, associated with the identified condition, for the particular device,
    - generate a candidate rule for the model based on the condition and the received statistics,
    - determine whether to add the candidate rule to the model, and
    - upon determining that the candidate rule should not be added to the model, set a weight, for the candidate rule, to a value that indicates that the candidate rule should not be added to the model, and
    - generate, based on the received statistics, a composite weight associated with the condition, the composite weight indicating how relevant the condition is, with respect to other conditions, in determining whether a document is to be classified as spam, the other conditions being associated with respective subsets of the plurality of features that differ from the condition;
  - receive a particular document, the particular document being associated with one or more features of the plurality of features;
  - classify, based on applying the model to the one or more features, the particular document as a spam document; and
  - processing the particular document based on classifying the particular document as the spam document.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, the particular document including an e-mail.
  - 10. The system of claim 8, the particular statistic further identifying a particular subset of the documents associated with the particular device.
  - 11. The system of claim 10, the particular statistic further identifying one or more documents, of the particular subset of the documents associated with the particular device, that are classified, by the particular device, as spam documents.
  - 12. The system of claim 8, the one or more processors, when generating the model, being further to:
    - replace, in the model, a previous weight, associated with the condition, with the composite weight.
  - 13. The system of claim 12, the one or more processors, when generating the model, being further to:
    - determine a cost associated with the replacing, in the model, the previous weight with the composite weight; and
    - determine that the cost does not exceed a threshold cost, the one or more processors replacing the previous weight with the composite weight based on the cost not exceeding the threshold cost.
  - 14. The system of claim 8, the one or more processors, when generating the model, being further to:
    - determine that the composite weight satisfies a threshold weight,
    - add the candidate rule to the model based on the composite weight satisfying the threshold weight, and
    - notify the plurality of devices that the candidate rule was added to the model.

15. A non-transitory memory device, comprising:
- one or more instructions which, when executed by one or more processors, cause the one or more processors to:
  - identify a condition associated with two or more features, of a plurality of features associated with documents that include spam documents and non-spam documents;
  - receive, from a plurality of devices associated with the documents, statistics associated with the identified condition, a particular statistic, of the received statistics, being received from a particular device, of the plurality of devices, and the particular statistic indicating a particular weight, associated with the identified condition, for the particular device;
  - generate a candidate rule for the model based on the condition and the received statistics;
  - determine whether to add the candidate rule to the model;
  - upon determining that the candidate rule should not be added to the model, set a weight, for the candidate rule, to a value that indicates that the candidate rule should not be added to the model;
  - generate, based on the received statistics, a composite weight associated with the condition, the composite weight indicating how relevant the condition is, with respect to other conditions, in determining whether a document is to be classified as spam, the other conditions being associated with respective subsets of the plurality of features that differ from the condition;
  - receive a particular document, the particular document being associated with one or more features of the plurality of features; and
  - classify, based on applying the composite weight to the one or more features, the particular document as a spam document.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The non-transitory memory device of claim 15, the particular statistic further identifying a particular subset of the documents associated with the particular device.
  - 17. The non-transitory memory device of claim 16, the particular statistic further identifying one or more documents, of the particular subset of the documents associated with the particular device, that are classified, by the particular device, as spam documents.
  - 18. The non-transitory memory device of claim 15, the one or more instructions further causing the one or more processors to:
    - replace a previous weight, associated with the condition, with the composite weight.
  - 19. The non-transitory memory device of claim 18, the one or more instructions, when causing the one or more processors to replace the previous weight with the composite weight, further causing the one or more processors to:
    - determine a cost associated with replacing the previous weight with the composite weight;
    - determine that the cost does not exceed a threshold cost; and
    - replace the previous weight with the composite weight based on the cost not exceeding the threshold cost.
  - 20. The non-transitory memory device of claim 15, the one or more instructions further causing the one or more processors to:
    - determine that the composite weight satisfies a threshold weight,
    - store the candidate rule based on the composite weight satisfying the threshold weight, and
    - notify the plurality of devices that the candidate rule was stored.
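
The final classification step shared by the independent claims (receive a document, apply the weights to its features, classify, and store information about a spam document) might look like the sketch below. The claims do not fix a scoring function, so summing the weights of matching conditions, the logistic squashing, the `0.5` decision threshold, and the list used as the store are assumptions for illustration only.

```python
import math

def spam_probability(model, features):
    """Sum the weights of rules whose condition holds, then squash to (0, 1)."""
    score = sum(w for cond, w in model.items() if cond <= features)
    return 1.0 / (1.0 + math.exp(-score))

def classify_and_store(model, doc_id, features, spam_log, threshold=0.5):
    """Classify one document; record information about it if it is spam."""
    p = spam_probability(model, features)
    is_spam = p > threshold
    if is_spam:
        spam_log.append({"doc_id": doc_id, "probability": p})
    return is_spam

# Hypothetical model: condition (set of features) -> composite weight.
model = {
    frozenset({"free", "winner"}): 2.0,
    frozenset({"meeting", "agenda"}): -1.5,
    frozenset({"click", "here"}): 0.0,  # assumed "not added" value: no effect
}
spam_log = []
first = classify_and_store(
    model, "msg-1", {"free", "winner", "click", "here"}, spam_log
)
second = classify_and_store(model, "msg-2", {"meeting", "agenda"}, spam_log)
```

Here `msg-1` matches a positively weighted condition and is logged as spam, while `msg-2` matches only a negatively weighted one and is not.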

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Bem, Jeremy; Harik, Georges R.; Levenberg, Joshua L.; Shazeer, Noam; Tong, Simon
Primary Examiner(s): Casanova, Jorge A
Application Number: US13/487,873
Time in Patent Office: 239 Days
Field of Search: 706/12, 706/20, 707/999.002, 707/999.102, 707/603, 707/737, 707/749
US Class Current: 706/20
CPC Class Codes:
- G06F 16/24575   using context
- G06F 16/24578   using ranking
- G06F 16/3346   using probabilistic model
- G06F 16/355   Class or cluster creation o...
- G06F 16/951   Indexing; Web crawling tech...
- G06F 16/9535   Search customisation based ...
- G06N 20/00   Machine learning
- G06N 7/01   Probabilistic graphical mod...
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99943   Generating database or data...
