Scaling machine learning using approximate counting that uses feature hashing

US 7,743,003 B1
Filed: 05/16/2007
Issued: 06/22/2010
Est. Priority Date: 05/16/2007
Status: Active Grant

Abstract

A system may track statistics for a number of features using an approximate counting technique by: subjecting each feature to multiple, different hash functions to generate multiple, different hash values, where each of the hash values may identify a particular location in a memory, and storing statistics for each feature at the particular locations identified by the hash values. The system may generate rules for a model based on the tracked statistics.
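
As a rough illustration of the technique the abstract describes, the following Python sketch hashes a feature name with several different hash functions and stores a count at each resulting memory location. The bucket count, the number of hash functions, the salted BLAKE2b hashing, and the example feature names are all assumptions made for illustration; the abstract does not specify them.

```python
import hashlib

NUM_BUCKETS = 1024   # assumed size of the bucket memory
NUM_HASHES = 3       # assumed number of different hash functions


def bucket_indices(feature_name, num_hashes=NUM_HASHES, num_buckets=NUM_BUCKETS):
    """Map a feature name to one bucket index per hash function."""
    indices = []
    for i in range(num_hashes):
        # A different salt per i gives a different hash function (assumed scheme).
        digest = hashlib.blake2b(
            feature_name.encode("utf-8"), digest_size=8, salt=bytes([i])
        ).digest()
        indices.append(int.from_bytes(digest, "big") % num_buckets)
    return indices


def record(buckets, feature_name, amount=1):
    """Add `amount` to the statistic stored at every location for the feature."""
    for idx in bucket_indices(feature_name, num_buckets=len(buckets)):
        buckets[idx] += amount


buckets = [0] * NUM_BUCKETS
for name in ["word=cheap", "word=cheap", "word=hello"]:  # hypothetical features
    record(buckets, name)
```

The memory holds a fixed number of buckets regardless of how many distinct features are seen, which is what makes the counts approximate: two features may share a bucket.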

29 Claims

1. A method, performed by one or more computer devices, for approximate counting, comprising:
- identifying, by one or more processors of the one or more computer devices, a feature of a plurality of features in a repository;
  
  performing, by one or more processors of the one or more computer devices, a plurality of different hash functions on a feature name associated with the feature to generate a corresponding plurality of different hash values;
  
  identifying, by one or more processors of the one or more computer devices, buckets, of a plurality of buckets in a memory, based on the plurality of different hash values;
  
  reading, by one or more processors of the one or more computer devices, a statistical value from each of the identified buckets;
  
  updating, by one or more processors of the one or more computer devices, each of the statistical values by subjecting each of the statistical values to a particular function to generate updated statistical values;
  
  writing, by one or more processors of the one or more computer devices, each of the updated statistical values into a corresponding one of the identified buckets; and
  
  generating, by one or more processors of the one or more computer devices, rules for a model based on the statistical values, including the updated statistical values, in the plurality of buckets.
- - 2. The method of claim 1, wherein the feature is one of a combination of the features, and where performing the plurality of different hash functions includes performing the plurality of different hash functions on a feature name associated with the combination of features.
  - 3. The method of claim 1, wherein the plurality of features includes more than one-hundred thousand features.
  - 4. The method of claim 1, wherein updating each of the statistical values includes incrementing each of the statistical values by a particular amount.
  - 5. The method of claim 1, wherein updating each of the statistical values includes:
    - identifying a minimum value from the statistical values, updating the minimum value, and replacing each of the statistical values with the updated minimum value.
  - 6. The method of claim 5, wherein writing each of the updated statistical values into the corresponding one of the identified buckets includes writing the updated minimum value into each of the identified buckets.
  - 7. The method of claim 1, wherein updating each of the statistical values includes:
    - determining a mean or median value from the statistical values, updating the mean or median value, and replacing each of the statistical values with the updated mean or median value.
  - 8. The method of claim 7, wherein writing each of the updated statistical values into the corresponding one of the identified buckets includes writing the updated mean or median value into each of the identified buckets.
  - 9. The method of claim 1, further comprising:
    - determining a single value from the statistical values; and
      
      outputting the single value.
  - 10. The method of claim 9, wherein determining the single value includes:
    - using one of the statistical values as the single value.
  - 11. The method of claim 9, wherein determining the single value includes:
    - determining a mean or median value from the statistical values, and using the mean or median value as the single value.
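
The update and read-out variants in claims 4 through 11 can be sketched as small functions over the values read from the identified buckets. This is a minimal sketch, not the patented implementation: "updating" is assumed to mean adding an amount, the median is used for the mean-or-median variant, and the minimum is assumed as the choice of "one of the statistical values".

```python
import statistics


def update_increment(values, amount=1):
    """Claim 4 style: increment every value read from the buckets."""
    return [v + amount for v in values]


def update_minimum(values, amount=1):
    """Claims 5-6 style: update the minimum and write it to every bucket."""
    new_min = min(values) + amount
    return [new_min] * len(values)


def update_median(values, amount=1):
    """Claims 7-8 style (median variant): update it and write it to every bucket."""
    new_median = statistics.median(values) + amount
    return [new_median] * len(values)


def single_value(values, how="min"):
    """Claims 9-11 style: reduce the bucket values to one output value."""
    if how == "min":
        return min(values)  # assumed choice of "one of the statistical values"
    if how == "mean":
        return statistics.mean(values)
    return statistics.median(values)


read_values = [7, 4, 9]  # hypothetical values read from three buckets
print(update_increment(read_values))  # [8, 5, 10]
print(update_minimum(read_values))    # [5, 5, 5]
print(update_median(read_values))     # [8, 8, 8]
print(single_value(read_values))      # 4
```

Each function returns the list that would be written back to the same buckets, in the same order they were read.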

12. A device for performing approximate counting, comprising:
- a memory to store statistics regarding a plurality of features in buckets; and
  
  a processor to:
  
  identify a feature of the plurality of features,
  
  subject a feature name, associated with the feature, to a plurality of different hash functions to generate a plurality of different hash values, where the plurality of hash functions includes at least three different hash functions,
  
  identify a plurality of the buckets in the memory based on the plurality of different hash values,
  
  read the statistics from each of the identified buckets,
  
  update each of the statistics by subjecting each of the statistics to a particular function to generate updated statistics,
  
  write each of the updated statistics into a corresponding one of the identified buckets, and
  
  generate rules for a model based on the statistics, including the updated statistics, in the buckets.
- - 13. The device of claim 12, wherein the feature is one of a combination of the features, and the feature name is associated with the combination of features.
  - 14. The device of claim 12, wherein the plurality of features includes more than one-hundred thousand features.
  - 15. The device of claim 12, wherein, when updating each of the statistics, the processor is configured to increment each of the statistics by a particular amount.
  - 16. The device of claim 12, wherein, when updating each of the statistics, the processor is configured to:
    - identify a minimum value from the statistics, update the minimum value, and replace each of the statistics with the updated minimum value.
  - 17. The device of claim 16, wherein, when writing each of the updated statistics into the corresponding one of the identified buckets, the processor is configured to write the updated minimum value into each of the identified buckets.
  - 18. The device of claim 12, wherein, when updating each of the statistics, the processor is configured to:
    - determine a mean or median value from the statistics, and update the mean or median value, and replace each of the statistics with the updated mean or median value.
  - 19. The device of claim 18, wherein, when writing each of the updated statistics into the corresponding one of the identified buckets, the processor is configured to write the updated mean or median value into each of the identified buckets.
  - 20. The device of claim 12, wherein the processor is further configured to:
    - determine a single value from the statistics, and output the single value.
  - 21. The device of claim 20, wherein when determining the single value, the processor is configured to use the statistics from one of the identified buckets as the single value.
  - 22. The device of claim 20, wherein when determining the single value, the processor is configured to:
    - determine a mean or median value from the statistics, and use the mean or median value as the single value.
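
Claims 12 and 13 call for at least three different hash functions and allow the hashed name to stand for a combination of features. One way this might look is sketched below. The naming scheme for a combination (sorted names joined with `&`), the seeded SHA-256 construction, and the example feature names are hypothetical; the claims do not prescribe any of them.

```python
import hashlib


def combined_name(feature_names):
    """Build one name for a combination of features (assumed: sorted, '&'-joined)."""
    return "&".join(sorted(feature_names))


def make_hash_functions(count, num_buckets):
    """Return `count` different hash functions, each mapping a name to a bucket index."""
    if count < 3:
        raise ValueError("this sketch follows claim 12 and uses at least three")

    def make(seed):
        def h(name):
            # Prefixing a distinct seed makes each function hash differently.
            digest = hashlib.sha256(f"{seed}:{name}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big") % num_buckets
        return h

    return [make(seed) for seed in range(count)]


hash_functions = make_hash_functions(3, num_buckets=4096)
name = combined_name(["word=free", "sender_domain=example.com"])
bucket_ids = [h(name) for h in hash_functions]
```

Sorting the names before joining means the same combination maps to the same buckets regardless of the order in which its features are listed.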

23. A method, performed by one or more computer devices, for approximate counting, comprising:
- identifying, by one or more processors of the one or more computer devices, a feature of a plurality of features in a repository;
  
  performing, by one or more processors of the one or more computer devices, a plurality of different hash functions on a feature name associated with the feature to generate a corresponding plurality of different hash values;
  
  identifying, by one or more processors of the one or more computer devices, a plurality of buckets in a memory based on the plurality of different hash values;
  
  reading, by one or more processors of the one or more computer devices, a statistical value from each of the identified buckets;
  
  determining, by one or more processors of the one or more computer devices, a single statistical value from the statistical values by subjecting the statistical values to a particular function; and
  
  generating, by one or more processors of the one or more computer devices, rules for a model based on the single statistical values for a group of the features in the repository.
- - 24. The method of claim 23, wherein determining the single statistical value includes:
    - using one of the statistical values as the single statistical value.
  - 25. The method of claim 23, wherein determining the single statistical value includes:
    - determining a mean or median value from the statistical values, and using the mean or median value as the single statistical value.
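
The read-only path of claims 23 through 25 can be sketched as follows: read the identified buckets, reduce them to a single statistical value, and use those values for a group of features to derive rules. The claims do not say what a rule looks like, so the count threshold used here is a hypothetical stand-in, as are the bucket count, the seeded SHA-256 hashing, and the feature names.

```python
import hashlib

NUM_BUCKETS = 64  # assumed, kept small for illustration


def indices(name, num_hashes=3):
    """One bucket index per seeded SHA-256 hash function (assumed scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{name}".encode()).digest()[:8], "big")
        % NUM_BUCKETS
        for i in range(num_hashes)
    ]


def estimate(buckets, name, reduce=min):
    """Read each identified bucket and reduce the values to a single statistic."""
    return reduce(buckets[i] for i in indices(name))


def candidate_rules(buckets, names, min_count):
    """Keep features whose estimate reaches a threshold (hypothetical rule form)."""
    return [n for n in names if estimate(buckets, n) >= min_count]


# Populate the buckets with simple increments so there is something to read.
buckets = [0] * NUM_BUCKETS
observations = ["word=cheap"] * 5 + ["word=hello"]
for name in observations:
    for i in indices(name):
        buckets[i] += 1

rules = candidate_rules(buckets, ["word=cheap", "word=hello"], min_count=3)
```

With increment-only updates, a collision can only raise a bucket's value, so taking the minimum never underestimates a feature's true count. Passing `statistics.mean` or `statistics.median` as `reduce` gives the mean-or-median variant of claim 25.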

26. A device for approximate counting, comprising:
- a memory to store statistics regarding a plurality of features in buckets; and
  
  a processor to:
  
  identify a feature of the plurality of features,
  
  subject a feature name, associated with the feature, to a plurality of different hash functions to generate a corresponding plurality of different hash values,
  
  identify a plurality of the buckets in the memory based on the plurality of different hash values,
  
  read the statistics from each of the identified buckets,
  
  determine a single statistical value from the statistics by subjecting the statistics to a particular function, and
  
  generate rules for a model based on the single statistical values for a group of the features.
- - 27. The device of claim 26, wherein, when determining the single statistical value, the processor is configured to use the statistics from one of the identified buckets as the single statistical value.
  - 28. The device of claim 26, wherein, when determining the single statistical value, the processor is configured to:
    - determine a mean or median value from the statistics, and use the mean or median value as the single statistical value.

29. A system for approximate counting, comprising:
- one or more memory devices to store a plurality of distinct features; and
  
  one or more computer devices comprising:
  
  means for tracking statistics for a set of features, of the plurality of features in the one or more memory devices, using an approximate counting technique including:
  
  means for subjecting a feature name, associated with a respective feature of the set of features, to multiple, different hash functions to generate multiple, different hash values, each of the hash values identifying a particular location in a memory,
  
  means for generating statistics for each feature of the set of features, and
  
  means for storing the statistics, for each feature of the set of features, at the particular locations identified by the hash values; and
  
  means for generating rules for a model based on the tracked statistics.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Shazeer, Noam; Tong, Simon
Primary Examiner(s)
Sparks, Donald
Assistant Examiner(s)
Chang, Li Wu

Application Number

US11/749,588
Time in Patent Office

1,133 Days
Field of Search

706/12
US Class Current

706/12
CPC Class Codes

G06N 20/00 Machine learning
