Distributed grouping of large-scale data sets

US 10,394,913 B1
Filed: 07/14/2016
Issued: 08/27/2019
Est. Priority Date: 07/14/2016
Status: Active Grant

Abstract

Features are provided for the analysis of collections of data and automatic grouping of data having certain similarities. A collection of data regarding user interactions with item-specific content can be analyzed. The analysis can be used to identify groups of items that are of interest to groups of similar users and/or to identify groups of users with demonstrated interests in groups of similar items. Data may be analyzed in a “bottom-up” manner in which correlations within the data are discovered in an iterative manner, or in a “top-down” manner in which desired top-level groups are specified at the beginning of the process. A bottom-up process may also be distributed among multiple devices or processors to more efficiently discover groups when using large collections of data.

20 Claims

1. A system comprising:
- a first computing device, wherein the first computing device comprises a processor programmed by executable instructions to at least:
  
  determine a first cosine distance between a first data vector, represented by a first temporary probabilistic data structure, and a center of a first cluster of data vectors;
  
  determine a second cosine distance between the first data vector, represented by the first temporary probabilistic data structure, and a center of a second cluster of data vectors;
  
  determine that the first cosine distance is smaller than the second cosine distance;
  
  modify a first probabilistic data structure using the first data vector, wherein the first probabilistic data structure comprises data, regarding the first cluster of data vectors, from which the center of the first cluster of data vectors is determined; and
  
  transmit the first probabilistic data structure to a second computing device; and
  
  the second computing device, wherein the second computing device comprises a processor programmed by executable instructions to at least:
  
  determine a third cosine distance between a second data vector, represented by a second temporary probabilistic data structure, and the center of the first cluster of data vectors;
  
  determine a fourth cosine distance between the second data vector, represented by the second temporary probabilistic data structure, and the center of the second cluster of data vectors;
  
  determine that the third cosine distance is smaller than the fourth cosine distance;
  
  modify a second probabilistic data structure using the second data vector, wherein the second probabilistic data structure comprises data, regarding the first cluster of data vectors, from which the center of the first cluster of data vectors is determined;
  
  receive the first probabilistic data structure from the first computing device; and
  
  generate a third probabilistic data structure using the first probabilistic data structure and the second probabilistic data structure, wherein the third probabilistic data structure comprises data, regarding the first cluster of data vectors, from which an updated center of the first cluster of data vectors is determined.
  - 2. The system of claim 1, wherein the first probabilistic data structure comprises a count sketch, and wherein the count sketch is received by the first computing device from the second computing device prior to the first computing device determining the first cosine distance.
  - 3. The system of claim 1, wherein the first data vector comprises a plurality of dimensions, wherein a first dimension of the plurality of dimensions comprises data regarding interactions of a first user with content regarding an item, and wherein a second dimension of the plurality of dimensions comprises data regarding interactions of a second user with content regarding the item.
  - 4. The system of claim 1, wherein the executable instructions that program the first computing device to determine the first cosine distance comprise instructions to at least:
    - compute a first product using (1) a first dimension value of the first data vector and (2) a first corresponding value of data regarding the center of the first cluster of data vectors;
      
      compute a second product using (3) a second dimension value of the first data vector and (4) a second corresponding value of the data regarding the center of the first cluster of data vectors; and
      
      subtract a sum of the first product and the second product from a constant value.
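
As an illustrative sketch only, not the patented implementation, the distance computation recited in claim 4 and the comparison recited in claim 1 could look like the following Python. It assumes the vectors are scaled to unit length so that the constant value is 1. The function names `normalize`, `cosine_distance`, and `nearest_cluster`, and the sample interaction counts, are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def cosine_distance(vec, center):
    """Subtract the sum of per-dimension products from the constant 1.

    Assumption: both inputs are normalized first, so the sum of products
    is the cosine similarity.
    """
    unit_vec = normalize(vec)
    unit_center = normalize(center)
    return 1.0 - sum(a * b for a, b in zip(unit_vec, unit_center))


def nearest_cluster(vec, centers):
    """Return the index of the center with the smallest cosine distance."""
    return min(range(len(centers)), key=lambda i: cosine_distance(vec, centers[i]))


# Hypothetical item vector: one dimension per user, holding that user's
# interaction count with content regarding the item (compare claim 3).
item_vector = [5.0, 4.0, 0.0]
centers = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assigned = nearest_cluster(item_vector, centers)
```

Here `assigned` is the index of the cluster whose center is closest to the item, which is the cluster whose probabilistic data structure would then be modified.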

5. A computer-implemented method comprising:
- as performed by a first computing system configured to execute specific instructions,
  
  determining a cosine distance between a data vector and a representation of a center of a first data vector group;
  
  determining, based at least partly on the cosine distance, to add the data vector to the first data vector group instead of a second data vector group;
  
  modifying a plurality of values of a first probabilistic data structure using the data vector, wherein the first probabilistic data structure comprises data, regarding the first data vector group, from which the representation of the center of the first data vector group is determined;
  
  transmitting the first probabilistic data structure to a second computing system; and
  
  receiving, from the second computing system, a second probabilistic data structure, wherein the second probabilistic data structure comprises data regarding the first data vector group, and wherein the second probabilistic data structure is based at least partly on data from the first probabilistic data structure.
  - 6. The computer-implemented method of claim 5, wherein the first probabilistic data structure comprises a first count sketch, wherein the second probabilistic data structure comprises a second count sketch generated by the second computing system using a plurality of count sketches, and wherein the plurality of count sketches includes the first count sketch.
  - 7. The computer-implemented method of claim 5, wherein the second probabilistic data structure comprises a weighted average of a plurality of probabilistic data structures, the plurality of probability data structures including the first probabilistic data structure.
  - 8. The computer-implemented method of claim 5, wherein modifying the plurality of values comprises:
    - determining a value of a hash function using a first dimension value of the data vector; and
      
      adding the value of the hash function to a value of the plurality of values.
  - 9. The computer-implemented method of claim 5, wherein determining the cosine distance comprises:
    - computing a first product using (1) a first dimension value of the data vector and (2) a first corresponding value of the representation of the center of the first data vector group;
      
      computing a second product using (3) a second dimension value of the data vector and (4) a second corresponding value of the representation of the center of the first data vector group; and
      
      summing the first product and the second product.
  - 10. The computer-implemented method of claim 5, further comprising:
    - determining a second representation of the center of the first data vector group using the second probabilistic data structure;
      
      determining a second cosine distance between the data vector and the second representation of the center of the first data vector group;
      
      determining, based at least partly on the second cosine distance, to add the data vector to the second data vector group instead of the first data vector group;
      
      modifying a plurality of values of a third probabilistic data structure using the data vector, wherein the third probabilistic data structure comprises data regarding the second data vector group.
  - 11. The computer-implemented method of claim 5, wherein the data vector comprises a plurality of dimensions, wherein a first dimension of the plurality of dimensions comprises data regarding interactions of a first user with content regarding an item, and wherein a second dimension of the plurality of dimensions comprises data regarding interactions of a second user with content regarding the item.
  - 12. The computer-implemented method of claim 5, wherein a first dimension of the data vector comprises data regarding an attribute of an item.
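
Claims 6 and 8 refer to a count sketch whose values are modified using hashes of the data vector's dimensions. The Python below is a minimal sketch of a conventional count sketch, which may differ in detail from the update recited in the claims. The class name `CountSketch`, its `depth` and `width` parameters, the use of BLAKE2b as the hash function, and the helper `add_vector` are all assumptions made for illustration.

```python
import hashlib
from statistics import median


class CountSketch:
    """Conventional count sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=64):
        self.depth = depth
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, key):
        # Assumption: one keyed hash per row supplies both bucket and sign.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        sign = 1 if h & 1 else -1
        return (h >> 1) % self.width, sign

    def add(self, key, value):
        """Fold one (dimension, value) pair into every row."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            self.table[row][bucket] += sign * value

    def estimate(self, key):
        """Median of the per-row signed counters for `key`."""
        readings = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            readings.append(sign * self.table[row][bucket])
        return median(readings)


def add_vector(sketch, vector):
    """Modify the sketch using every non-zero dimension of a sparse vector."""
    for dimension, value in vector.items():
        if value:
            sketch.add(dimension, value)


# Hypothetical usage: a cluster's sketch absorbs an item vector keyed by user.
cluster_sketch = CountSketch()
add_vector(cluster_sketch, {"user_1": 3.0})
```

Because the sketch has a fixed size regardless of how many dimensions the vectors have, it can be transmitted between computing systems more cheaply than the full set of vectors in the group.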

13. A non-transitory computer storage medium storing an executable module, wherein the executable module configures a first computing system to perform a process comprising:
- determining, based at least partly on a distance between a data vector and a representation of a center of a first data vector group, to add the data vector to the first data vector group instead of a second data vector group;
  
  modifying a value of a first probabilistic data structure using the data vector, wherein the first probabilistic data structure comprises data, regarding the first data vector group, from which the representation of the center of the first data vector group is determined;
  
  transmitting the first probabilistic data structure to a second computing system; and
  
  receiving, from the second computing system, a second probabilistic data structure, wherein the second probabilistic data structure comprises data regarding the first data vector group, and wherein the second probabilistic data structure is based at least partly on data from the first probabilistic data structure.
  - 14. The non-transitory computer storage medium of claim 13, wherein the first probabilistic data structure comprises a first count sketch, wherein the second probabilistic data structure comprises a second count sketch generated by the second computing system using a plurality of count sketches, and wherein the plurality of count sketches includes the first count sketch.
  - 15. The non-transitory computer storage medium of claim 13, wherein the second probabilistic data structure comprises a weighted average of a plurality of probabilistic data structures, the plurality of probability data structures including the first probabilistic data structure.
  - 16. The non-transitory computer storage medium of claim 13, wherein modifying the value comprises:
    - determining a value of a hash function using a first dimension value of the data vector; and
      
      adding the value of the hash function to a value of the plurality of values.
  - 17. The non-transitory computer storage medium of claim 13, the process further comprising:
    - computing a first product using (1) a first dimension value of the data vector and (2) a first corresponding value of the representation of the center of the first data vector group; and
      
      computing a second product using (3) a second dimension value of the data vector and (4) a second corresponding value of the representation of the center of the first data vector group;
      
      computing a sum of the first product and the second product, wherein the distance between the first vector and the representation of the center of the first data vector group is based at least partly on the sum.
  - 18. The non-transitory computer storage medium of claim 13, the process further comprising:
    - determining a second representation of the center of the first data vector group using the second probabilistic data structure;
      
      determining, based at least partly on a second distance between the data vector and the second representation of the center of the first data vector group, to add the data vector to the second data vector group instead of the first data vector group;
      
      modifying a value of a third probabilistic data structure using the data vector, wherein the third probabilistic data structure comprises data regarding the second data vector group.
  - 19. The non-transitory computer storage medium of claim 13, wherein the data vector comprises a plurality of dimensions, wherein a first dimension of the plurality of dimensions comprises data regarding interactions of a first user with content regarding an item, and wherein a second dimension of the plurality of dimensions comprises data regarding interactions of a second user with content regarding the item.
  - 20. The non-transitory computer storage medium of claim 13, the process further comprising receiving a plurality of data vectors from the second computing system, wherein the plurality of data vectors comprises the data vector.
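
Claims 7 and 15 describe the second probabilistic data structure as a weighted average of several probabilistic data structures. One way this combination step might be expressed, assuming each structure is a table of counters with identical shape, is shown below in Python. The function name `merge_tables` and the choice to weight each table by the number of vectors its sender absorbed are assumptions, not details taken from the claims.

```python
def merge_tables(tables, weights):
    """Element-wise weighted average of equally shaped 2-D counter tables."""
    if not tables or len(tables) != len(weights):
        raise ValueError("need one weight per table")
    total = float(sum(weights))
    if total == 0:
        raise ValueError("weights must not sum to zero")
    rows, cols = len(tables[0]), len(tables[0][0])
    return [
        [
            sum(w * t[r][c] for t, w in zip(tables, weights)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical exchange: two systems each send the table for the same group,
# and the receiver weights them by how many vectors each one absorbed.
table_from_first = [[2.0, 0.0]]
table_from_second = [[4.0, 2.0]]
merged = merge_tables([table_from_first, table_from_second], weights=[1, 3])
```

The merged table would then be returned to the sending systems, which could derive an updated representation of the group's center from it before the next round of assignments.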

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Chaoji, Vineet Shashikant; Kaveri, Sivaramakrishnan; Khare, Vineet; Roy, Gourav; Sohoney, Saurabh; Willingham, Andrew Dennis
Primary Examiner(s)
Wong, Leslie

Application Number

US15/210,841
Time in Patent Office

1,139 Days
CPC Class Codes

G06F 16/24575   using context

G06F 16/9024   Graphs; Linked lists G06F16...

G06F 16/906   Clustering; Classification

G06F 16/9535   Search customisation based ...

G06F 17/16   Matrix or vector computatio...

G06F 17/18   for evaluating statistical ...

G06F 18/23213   with fixed number of cluste...

G06F 7/24   Sorting, i.e. extracting da...
