Distributed grouping of large-scale data sets
First Claim
1. A system comprising:
- a first computing device, wherein the first computing device comprises a processor programmed by executable instructions to at least:
determine a first cosine distance between a first data vector, represented by a first temporary probabilistic data structure, and a center of a first cluster of data vectors;
determine a second cosine distance between the first data vector, represented by the first temporary probabilistic data structure, and a center of a second cluster of data vectors;
determine that the first cosine distance is smaller than the second cosine distance;
modify a first probabilistic data structure using the first data vector, wherein the first probabilistic data structure comprises data, regarding the first cluster of data vectors, from which the center of the first cluster of data vectors is determined; and
transmit the first probabilistic data structure to a second computing device; and
- the second computing device, wherein the second computing device comprises a processor programmed by executable instructions to at least:
determine a third cosine distance between a second data vector, represented by a second temporary probabilistic data structure, and the center of the first cluster of data vectors;
determine a fourth cosine distance between the second data vector, represented by the second temporary probabilistic data structure, and the center of the second cluster of data vectors;
determine that the third cosine distance is smaller than the fourth cosine distance;
modify a second probabilistic data structure using the second data vector, wherein the second probabilistic data structure comprises data, regarding the first cluster of data vectors, from which the center of the first cluster of data vectors is determined;
receive the first probabilistic data structure from the first computing device; and
generate a third probabilistic data structure using the first probabilistic data structure and the second probabilistic data structure, wherein the third probabilistic data structure comprises data, regarding the first cluster of data vectors, from which an updated center of the first cluster of data vectors is determined.
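The flow recited in claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not name a specific probabilistic data structure, so the sketch assumes a single-row hashed counter array (the hypothetical `HashedSketch` class), a hypothetical width of 256 cells, sparse data vectors given as `{key: weight}` dicts, and merging by elementwise addition. All function and variable names are illustrative.

```python
import math
import zlib

WIDTH = 256  # hypothetical sketch width; the claim does not specify one


class HashedSketch:
    """Assumed structure: a fixed-width array of hashed sums plus a vector count."""

    def __init__(self, width=WIDTH):
        self.cells = [0.0] * width
        self.count = 0

    def add_vector(self, vector):
        # crc32 is deterministic across processes, so every device
        # maps the same key to the same cell.
        for key, weight in vector.items():
            self.cells[zlib.crc32(key.encode("utf-8")) % len(self.cells)] += weight
        self.count += 1

    def center(self):
        # The cluster center is derived from the sketch: mean of the added vectors.
        n = max(self.count, 1)
        return [c / n for c in self.cells]

    def merge(self, other):
        # Produces a new (third) sketch from two sketches of the same cluster.
        if len(self.cells) != len(other.cells):
            raise ValueError("sketch widths differ")
        merged = HashedSketch(len(self.cells))
        merged.cells = [a + b for a, b in zip(self.cells, other.cells)]
        merged.count = self.count + other.count
        return merged


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def assign_and_update(vector, centers, local_sketches):
    """Sketch the vector, pick the nearest center, update that cluster's local sketch."""
    temp = HashedSketch(len(centers[0]))  # temporary structure for the data vector
    temp.add_vector(vector)
    distances = [cosine_distance(temp.cells, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    local_sketches[best].add_vector(vector)
    return best


# Centers from a previous iteration, built here from two seed vectors.
seed0, seed1 = HashedSketch(), HashedSketch()
seed0.add_vector({"book": 2.0, "novel": 1.0})
seed1.add_vector({"drill": 1.0, "saw": 1.0})
centers = [seed0.center(), seed1.center()]

# Each device keeps its own per-cluster sketches.
device_a = [HashedSketch(), HashedSketch()]
device_b = [HashedSketch(), HashedSketch()]
chosen_a = assign_and_update({"book": 2.0, "novel": 1.0}, centers, device_a)
chosen_b = assign_and_update({"book": 1.0}, centers, device_b)

# The second device receives the first device's sketch and merges it with its own.
merged = device_b[0].merge(device_a[0])
updated_center = merged.center()
```

Elementwise addition is used for the merge because hashed sums built with the same hash function are additive, so the merged structure matches what a single device would have built from both vectors. Other mergeable structures would serve the same role.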
Abstract
Features are provided for the analysis of collections of data and automatic grouping of data having certain similarities. A collection of data regarding user interactions with item-specific content can be analyzed. The analysis can be used to identify groups of items that are of interest to groups of similar users and/or to identify groups of users with demonstrated interests in groups of similar items. Data may be analyzed in a “bottom-up” manner in which correlations within the data are discovered in an iterative manner, or in a “top-down” manner in which desired top-level groups are specified at the beginning of the process. A bottom-up process may also be distributed among multiple devices or processors to more efficiently discover groups when using large collections of data.
20 Claims
5. A computer-implemented method comprising:
as performed by a first computing system configured to execute specific instructions,
determining a cosine distance between a data vector and a representation of a center of a first data vector group;
determining, based at least partly on the cosine distance, to add the data vector to the first data vector group instead of a second data vector group;
modifying a plurality of values of a first probabilistic data structure using the data vector, wherein the first probabilistic data structure comprises data, regarding the first data vector group, from which the representation of the center of the first data vector group is determined;
transmitting the first probabilistic data structure to a second computing system; and
receiving, from the second computing system, a second probabilistic data structure, wherein the second probabilistic data structure comprises data regarding the first data vector group, and wherein the second probabilistic data structure is based at least partly on data from the first probabilistic data structure.
- View Dependent Claims (6, 7, 8, 9, 10, 11, 12)
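The transmit and receive steps of claim 5 amount to a round trip seen from the first computing system. The sketch below is one possible reading, with several assumptions that the claim does not state: the structure is a plain dict with hypothetical `cells` and `count` fields, it is serialized as JSON, the second system merges by elementwise addition, and an in-process function call (`second_system_handle`) stands in for the network.

```python
import json


def to_wire(sketch):
    """Serialize a sketch for transmission (JSON is an assumed wire format)."""
    return json.dumps(sketch).encode("utf-8")


def from_wire(payload):
    return json.loads(payload.decode("utf-8"))


def merge(a, b):
    """Combine two sketches of the same group by elementwise addition."""
    if len(a["cells"]) != len(b["cells"]):
        raise ValueError("sketch widths differ")
    return {
        "cells": [x + y for x, y in zip(a["cells"], b["cells"])],
        "count": a["count"] + b["count"],
    }


def second_system_handle(payload, own_sketch):
    """Hypothetical handler on the second system: merge and send the result back."""
    return to_wire(merge(from_wire(payload), own_sketch))


def center(sketch):
    n = max(sketch["count"], 1)
    return [c / n for c in sketch["cells"]]


# First system: local sketch for the first group after adding one data vector.
first_local = {"cells": [2.0, 0.0, 1.0, 0.0], "count": 1}
# Second system: its own sketch for the same group.
second_local = {"cells": [0.0, 0.0, 1.0, 0.0], "count": 1}

reply = second_system_handle(to_wire(first_local), second_local)
received = from_wire(reply)  # based partly on data from the first sketch
print(center(received))  # [1.0, 0.0, 1.0, 0.0]
```

After the round trip, the first system can derive the group's center from the received structure rather than from its local one, so later assignments reflect vectors seen by both systems.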
13. A non-transitory computer storage medium storing an executable module, wherein the executable module configures a first computing system to perform a process comprising:
determining, based at least partly on a distance between a data vector and a representation of a center of a first data vector group, to add the data vector to the first data vector group instead of a second data vector group;
modifying a value of a first probabilistic data structure using the data vector, wherein the first probabilistic data structure comprises data, regarding the first data vector group, from which the representation of the center of the first data vector group is determined;
transmitting the first probabilistic data structure to a second computing system; and
receiving, from the second computing system, a second probabilistic data structure, wherein the second probabilistic data structure comprises data regarding the first data vector group, and wherein the second probabilistic data structure is based at least partly on data from the first probabilistic data structure.
- View Dependent Claims (14, 15, 16, 17, 18, 19, 20)
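Unlike claims 1 and 5, claim 13 recites "a distance" without naming the metric. The short sketch below illustrates why the choice matters, using a hypothetical `nearest_group` helper that accepts any distance function. Euclidean distance is used here only as a contrasting example and is an assumption, not something the claim mentions.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.dist(u, v)  # available in Python 3.8+


def nearest_group(vector, centers, distance):
    """Index of the group whose center is closest under the given distance."""
    return min(range(len(centers)), key=lambda i: distance(vector, centers[i]))


vector = [10.0, 0.0]
centers = [[1.0, 0.1], [8.0, 6.0]]

by_cosine = nearest_group(vector, centers, cosine_distance)
by_euclidean = nearest_group(vector, centers, euclidean_distance)
```

With these values the two metrics disagree: cosine distance selects group 0, whose center points in nearly the same direction as the vector, while Euclidean distance selects group 1, whose center is closer in magnitude.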
Specification