Scaling machine learning using approximate counting

US 8,019,704 B1
Filed: 05/12/2010
Issued: 09/13/2011
Est. Priority Date: 05/16/2007
Status: Active Grant

Abstract

A system may track statistics for a number of features using an approximate counting technique by: subjecting each feature to multiple, different hash functions to generate multiple, different hash values, where each of the hash values may identify a particular location in a memory, and storing statistics for each feature at the particular locations identified by the hash values. The system may generate rules for a model based on the tracked statistics.
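The hashing and storage step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `NUM_HASHES`, `NUM_SLOTS`, `slot_indices`, `record`, and `estimate` are hypothetical, salting a single BLAKE2b hash is assumed as a stand-in for "multiple, different hash functions", and reading back the minimum of the stored values is an assumed read rule that the abstract does not specify.

```python
import hashlib

NUM_HASHES = 3        # assumed number of hash functions
NUM_SLOTS = 1 << 16   # assumed number of memory locations

# Shared memory: far fewer slots than there may be distinct features.
memory = [0] * NUM_SLOTS


def slot_indices(feature: str) -> list[int]:
    """Map a feature string to one memory location per hash function."""
    indices = []
    for seed in range(NUM_HASHES):
        # A different salt per seed stands in for a different hash function.
        digest = hashlib.blake2b(
            feature.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        indices.append(int.from_bytes(digest, "little") % NUM_SLOTS)
    return indices


def record(feature: str, amount: int = 1) -> None:
    """Add `amount` to the statistic stored at each of the feature's locations."""
    # set() avoids double counting if two hash functions pick the same slot.
    for i in set(slot_indices(feature)):
        memory[i] += amount


def estimate(feature: str) -> int:
    """Approximate statistic: the smallest value among the feature's locations.

    Collisions with other features can only inflate a slot, so the minimum
    is the least-inflated reading (an assumed read rule).
    """
    return min(memory[i] for i in slot_indices(feature))
```

Because several features may share a slot, `estimate` can overstate a count but, under these assumptions, never understates it.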

22 Claims

1. A method performed by one or more computer devices, comprising:
- storing, in a repository, information regarding a plurality of features;
  
  storing, in a plurality of memory locations in a memory, values relating to the plurality of features;
  
  identifying a particular feature of the plurality of features in the repository;
  
  subjecting a string, associated with the particular feature, to multiple, different hash functions to generate multiple, different hash values;
  
  identifying, for each of the multiple, different hash values, a respective memory location, of the plurality of memory locations in the memory;
  
  reading the values stored at the respective memory locations;
  
  performing an operation on the read values from the respective memory locations to obtain updated values;
  
  writing the updated values into the respective memory locations; and
  
  using the values, including the updated values, to make a prediction regarding particular data.
  - 2. The method of claim 1, where performing the operation on the read values includes incrementing each of the read values by a particular amount.
  - 3. The method of claim 1, where performing the operation on the read values includes:
    - identifying a minimum value from the read values, updating the minimum value, and replacing each of the read values with the updated minimum value.
  - 4. The method of claim 3, where writing the updated values into the respective memory locations includes writing the updated minimum value into each of the respective memory locations.
  - 5. The method of claim 1, where performing the operation on the read values includes:
    - determining a mean or median value from the read values, updating the mean or median value, and replacing each of the read values with the updated mean or median value.
  - 6. The method of claim 5, where writing the updated values into the respective memory locations includes writing the updated mean or median value into each of the respective memory locations.
  - 7. The method of claim 1, further comprising:
    - determining a single value from the read values; and
      
      replacing each of the read values with the single value.
  - 8. The method of claim 7, where determining the single value includes:
    - using one of the read values as the single value.
  - 9. The method of claim 7, where determining the single value includes:
    - determining a mean or median value from the read values, and using the mean or median value as the single value.
  - 10. The method of claim 1, where using the values, including the updated values, to make the prediction includes generating rules for a model based on the values including the updated values.
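The update variants in claims 2 through 6 differ only in how the values read from the hashed locations are combined before being written back. The sketch below puts them side by side. The function name `update_slots`, the `mode` strings, and the choice to "update" the minimum, mean, or median by adding `amount` are assumptions made for illustration; the claims leave the exact update operation open.

```python
from statistics import mean, median


def update_slots(memory, indices, amount=1, mode="increment"):
    """Read the values at `indices`, update them, and write them back.

    mode="increment": add `amount` to each read value (claim 2).
    mode="minimum":   update the minimum read value and write it to every
                      location (claims 3 and 4).
    mode="mean" / "median": same idea with the mean or median of the read
                      values (claims 5 and 6).
    """
    read = [memory[i] for i in indices]

    if mode == "increment":
        updated = [v + amount for v in read]
    elif mode == "minimum":
        updated = [min(read) + amount] * len(read)
    elif mode == "mean":
        updated = [mean(read) + amount] * len(read)
    elif mode == "median":
        updated = [median(read) + amount] * len(read)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    for i, v in zip(indices, updated):
        memory[i] = v
    return updated


# Three hashed locations currently hold 4, 9, and 2; slot 3 is untouched.
memory = [4, 9, 2, 7]
update_slots(memory, [0, 1, 2], mode="minimum")
print(memory)  # [3, 3, 3, 7]
```

In the minimum variant, the larger readings (4 and 9 here) are treated as inflated by collisions with other features, so all locations are reset to the updated smallest value.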

11. A method performed by one or more computer devices, comprising:
- processing, by one or more processors of the one or more computer devices, each particular feature of a plurality of features in a repository, where processing each particular feature includes:
  
  performing, by one or more processors of the one or more computer devices, a plurality of different hash functions on a string, associated with the particular feature, to generate a corresponding plurality of different hash values,
  
  identifying, by one or more processors of the one or more computer devices, buckets, of a plurality of buckets in a memory, based on the plurality of different hash values,
  
  reading, by one or more processors of the one or more computer devices, a statistical value from each of the identified buckets,
  
  updating, by one or more processors of the one or more computer devices, each of the statistical values by subjecting each of the statistical values to a particular function to generate updated statistical values, and
  
  writing, by one or more processors of the one or more computer devices, each of the updated statistical values into a corresponding one of the identified buckets;
  
  identifying a group of features, of the plurality of features, based on the statistical values, including the updated statistical values, in the plurality of buckets; and
  
  using, by one or more processors of the one or more computer devices, the identified group of features and the statistical values associated with the identified group of features to make a prediction regarding particular data.
  - 12. The method of claim 11, where the particular data includes particular e-mail data, and where using the identified group of features and the statistical values associated with the identified group of features to make a prediction regarding particular data includes using the identified group of features and the statistical values associated with the identified group of features to predict whether the particular e-mail data includes spam.
  - 13. The method of claim 11, where the particular data includes particular advertisement data, and where using the identified group of features and the statistical values associated with the identified group of features to make a prediction regarding particular data includes using the identified group of features and the statistical values associated with the identified group of features to predict whether the particular advertisement data will be selected by a user.
  - 14. The method of claim 11, where updating each of the statistical values includes:
    - identifying a minimum value from the statistical values, updating the minimum value, and replacing each of the statistical values with the updated minimum value.
  - 15. The method of claim 14, where writing each of the updated statistical values into the corresponding one of the identified buckets includes writing the updated minimum value into each of the identified buckets.
  - 16. The method of claim 11, where updating each of the statistical values includes:
    - determining a mean or median value from the statistical values, updating the mean or median value, and replacing each of the statistical values with the updated mean or median value.
  - 17. The method of claim 16, where writing each of the updated statistical values into the corresponding one of the identified buckets includes writing the updated mean or median value into each of the identified buckets.
  - 18. The method of claim 11, where updating each of the statistical values includes replacing each of the statistical values with one of the statistical values.
  - 19. The method of claim 11, further comprising:
    - determining a mean or median value from the statistical values, and replacing each of the statistical values with the mean or median value.
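Claims 11 and 12 describe a pipeline: update bucketed statistics for every feature, pick a group of features based on those statistics, and use the group to predict whether e-mail data includes spam. The sketch below is one hypothetical way to wire those steps together. The `ApproxCounter` class, the seeded SHA-256 hashing, the `min_count` selection threshold, and the smoothed log-ratio score are all assumptions; the claims do not prescribe a particular selection rule or model.

```python
import hashlib
import math


class ApproxCounter:
    """Fixed-size table of buckets addressed by several hash functions."""

    def __init__(self, num_buckets=4096, num_hashes=3):
        self.buckets = [0] * num_buckets
        self.num_hashes = num_hashes

    def _bucket_ids(self, key):
        # Prefixing a seed stands in for using different hash functions.
        ids = set()
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            ids.add(int.from_bytes(digest[:8], "big") % len(self.buckets))
        return ids

    def add(self, key, amount=1):
        for b in self._bucket_ids(key):
            self.buckets[b] += amount

    def estimate(self, key):
        return min(self.buckets[b] for b in self._bucket_ids(key))


def train(examples):
    """examples: iterable of (list_of_words, is_spam) pairs."""
    spam_counts, ham_counts = ApproxCounter(), ApproxCounter()
    vocabulary = set()  # stands in for the feature repository
    for words, is_spam in examples:
        counter = spam_counts if is_spam else ham_counts
        for word in set(words):
            counter.add(word)
            vocabulary.add(word)
    return spam_counts, ham_counts, vocabulary


def select_features(vocabulary, spam_counts, ham_counts, min_count=2):
    """Keep features whose estimated total count reaches the threshold."""
    return {
        w for w in vocabulary
        if spam_counts.estimate(w) + ham_counts.estimate(w) >= min_count
    }


def spam_score(words, selected, spam_counts, ham_counts):
    """Sum of smoothed log count ratios; a positive score leans toward spam."""
    score = 0.0
    for w in set(words) & selected:
        score += math.log(
            (spam_counts.estimate(w) + 1) / (ham_counts.estimate(w) + 1)
        )
    return score


examples = [
    (["free", "prize", "now"], True),
    (["free", "prize", "click"], True),
    (["meeting", "now", "agenda"], False),
    (["meeting", "notes", "agenda"], False),
]
spam_counts, ham_counts, vocabulary = train(examples)
selected = select_features(vocabulary, spam_counts, ham_counts)
```

With this toy data, `spam_score(["free", "prize"], selected, spam_counts, ham_counts)` should come out positive and the score for `["meeting", "agenda"]` negative, barring bucket collisions.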

20. A system, comprising:
- one or more first memory devices to store information regarding a plurality of features;
  
  one or more second memory devices to store, in a plurality of memory locations, statistical values relating to the plurality of features; and
  
  one or more computer devices to:
  
  identify a particular feature of the plurality of features in the one or more first memory devices,
  
  subject a string, associated with the particular feature, to multiple, different hash functions to generate multiple, different hash values,
  
  identify, for each of the multiple, different hash values, a respective memory location, of the plurality of memory locations in the one or more second memory devices,
  
  read the statistical values stored in the respective memory locations,
  
  perform an operation on the read statistical values to obtain updated statistical values,
  
  write the updated statistical values into the respective memory locations, and
  
  use the statistical values, including the updated statistical values, to predict whether a particular e-mail includes spam or to predict whether a particular advertisement will be selected by a user.
  - 21. The system of claim 20, where, when performing the operation on the read statistical values, the one or more computer devices are to:
    - identify a minimum value from the read statistical values, update the minimum value, and replace each of the read statistical values with the updated minimum value; and
      
      where, when writing the updated statistical values into the respective memory locations, the one or more computer devices are to write the updated minimum value into each of the respective memory locations.
  - 22. The system of claim 20, where, when performing the operation on the read statistical values, the one or more computer devices are to:
    - determine a mean or median value from the read statistical values, and update the mean or median value; and
      
      where, when writing the updated statistical values into the respective memory locations, the one or more computer devices are to write the updated mean or median value into each of the respective memory locations.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Tong, Simon; Shazeer, Noam
Primary Examiner(s): Gaffin; Jeffrey A
Assistant Examiner(s): CHANG, LI WU
Application Number: US12/778,877
Time in Patent Office: 489 Days
Field of Search: 706/12
US Class Current: 706/12
CPC Class Codes: G06N 20/00 Machine learning
