REAL-TIME ANALYTICS FOR LARGE DATA SETS
First Claim
1. A cloud computing system comprising:
a plurality of processor nodes, each processor node having access to a local memory device and persistent storage devices;
a blob storage service implemented on the persistent storage devices;
the blob storage service operable to store a plurality of value blobs, each value blob of the plurality of value blobs containing one or more serialized data items;
a key management service operable to store a plurality of keys, each key of the plurality of keys associated with a collection descriptor comprising one or more unique value blob identifiers.
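The relationship between keys, collection descriptors, and value blobs in the claim above can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the class names, the method names, the use of JSON as the serialization format, and the use of random hex strings as blob identifiers are all assumptions made for the example.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class BlobStorageService:
    """In-memory stand-in for the blob storage service (hypothetical)."""
    _blobs: dict = field(default_factory=dict)

    def put(self, items):
        # Serialize one or more data items into a single value blob
        # and return the blob's unique identifier.
        blob_id = uuid.uuid4().hex
        self._blobs[blob_id] = json.dumps(items).encode("utf-8")
        return blob_id

    def get(self, blob_id):
        # Deserialize a value blob back into its list of data items.
        return json.loads(self._blobs[blob_id].decode("utf-8"))


@dataclass
class KeyManagementService:
    """Maps each key to a collection descriptor (a list of blob ids)."""
    _descriptors: dict = field(default_factory=dict)

    def append(self, key, blob_id):
        # Add one more value blob identifier to the key's descriptor.
        self._descriptors.setdefault(key, []).append(blob_id)

    def descriptor(self, key):
        # Return a copy so callers cannot mutate the stored descriptor.
        return list(self._descriptors.get(key, []))


def load_collection(keys, blobs, key):
    # Follow the key's collection descriptor to every data item it names.
    items = []
    for blob_id in keys.descriptor(key):
        items.extend(blobs.get(blob_id))
    return items
```

In this sketch a key never holds data items directly; it only holds blob identifiers, so a collection can grow by appending blobs without rewriting existing ones.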
Abstract
A cloud computing system is described herein that enables fast processing of queries over massive amounts of stored data. The system can scan tens of billions of data items and perform aggregate calculations such as counts, sums, and averages in real time (less than three seconds). Ad hoc queries, including grouping, sorting, and filtering, are supported without the need to predefine queries, because data items are loaded and processed highly efficiently across an arbitrarily large number of processors. The system does not require a fixed schema and therefore supports any type of data. Calculations made to satisfy a query may be distributed across a large number of processors to parallelize the work. In addition, an optimal blob size for storing multiple serialized data items is determined, and existing blobs that are too large or too small are proactively redistributed or coalesced to increase performance.
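The redistribution and coalescing of blobs mentioned in the abstract might look like the sketch below. It is illustrative only and rests on stated assumptions: blob size is measured as an item count rather than in bytes, the function name `rebalance` and its parameters are hypothetical, and the abstract does not say how the actual system chooses which blobs to merge.

```python
def rebalance(blobs, min_items, max_items):
    """Split oversized blobs and coalesce undersized ones.

    blobs is a list of lists of data items. Size is counted in items,
    which is an assumption made to keep the example short.
    """
    result = []
    pending = []  # items taken from undersized blobs, waiting to be merged

    for blob in blobs:
        if len(blob) > max_items:
            # Redistribute an oversized blob into chunks of at most max_items.
            for start in range(0, len(blob), max_items):
                chunk = blob[start:start + max_items]
                if len(chunk) < min_items:
                    pending.extend(chunk)  # the tail is too small on its own
                else:
                    result.append(chunk)
        elif len(blob) < min_items:
            pending.extend(blob)
        else:
            result.append(blob)  # already within bounds, left untouched

    # Coalesce the small leftovers. The final coalesced blob may still be
    # below min_items when too few items remain to fill it.
    for start in range(0, len(pending), max_items):
        result.append(pending[start:start + max_items])
    return result
```

Blobs that are already within bounds are passed through unchanged, so a rebalancing pass only rewrites the blobs that need it.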
45 Citations
22 Claims
1. A cloud computing system comprising:
a plurality of processor nodes, each processor node having access to a local memory device and persistent storage devices;
a blob storage service implemented on the persistent storage devices;
the blob storage service operable to store a plurality of value blobs, each value blob of the plurality of value blobs containing one or more serialized data items;
a key management service operable to store a plurality of keys, each key of the plurality of keys associated with a collection descriptor comprising one or more unique value blob identifiers.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A cloud computing system comprising:
a plurality of processor nodes, each processor node having access to a local memory device and persistent storage devices;
a blob storage service storing data items having a plurality of dimensions;
a plurality of keys, each key of the plurality of keys corresponding to a particular dimension of the plurality of dimensions and identifying a collection of data items having the particular dimension; and
a query manager operable to:
a) determine a primary dimension referenced by a query expression;
b) identify a set of keys corresponding to the primary dimension;
c) select a set of processor nodes to evaluate a subquery of the incoming query;
d) assign each processor node of the set of processor nodes a set of keys corresponding to the primary dimension, wherein said each processor node evaluates the query expression over the collection of data items identified by each key in the set of keys assigned to said each processor node.
Dependent claims: 11, 12, 13
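Steps a) through d) of the query manager can be sketched as below. This is a simplified illustration under several assumptions: worker threads stand in for processor nodes, keys are modeled as `(dimension, value)` tuples, the primary dimension is passed in rather than parsed from a query expression, and keys are assigned round-robin. The names `run_query`, `evaluate`, and `combine` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(collections, primary_dimension, evaluate, combine, node_count):
    """Evaluate a query over every collection keyed by the primary dimension.

    collections: dict mapping (dimension, value) keys to lists of data items.
    evaluate:    computes a partial result for one collection of data items.
    combine:     merges an iterable of partial results into one result.
    """
    # (a) The primary dimension is supplied by the caller in this sketch.
    # (b) Identify the keys that correspond to the primary dimension.
    keys = sorted(k for k in collections if k[0] == primary_dimension)
    if not keys:
        return combine([])

    # (c) Select the processor nodes; never more nodes than keys.
    node_count = max(1, min(node_count, len(keys)))

    # (d) Assign each node its own set of keys (round-robin here).
    assignments = [keys[i::node_count] for i in range(node_count)]

    def node_work(assigned_keys):
        # Each node evaluates the expression over its assigned collections.
        return combine(evaluate(collections[k]) for k in assigned_keys)

    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_work, assignments))
    return combine(partials)
```

For example, passing `evaluate=lambda items: sum(i["amount"] for i in items)` and `combine=sum` computes a total across all keys of the chosen dimension. Combining per-node partial results this way is only valid for aggregates that can be merged, such as counts and sums; an average would need to carry a sum and a count separately.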
14. A cloud computing system comprising a blob storage manager that stores data items in a blob of a minimum size and a maximum size, wherein the minimum size and the maximum size are determined based at least on a size of local memory or performance data for inflating blobs of a plurality of sizes.
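One way the minimum and maximum blob sizes might be derived from local memory size and inflation measurements is sketched below. The claim does not specify a formula, so everything here is an assumption: the function name `blob_size_bounds`, the `memory_fraction` cap, and the rule of keeping every measured size whose inflation throughput is within `tolerance` of the best one.

```python
def blob_size_bounds(inflate_seconds, local_memory_bytes,
                     memory_fraction=0.25, tolerance=0.9):
    """Pick (min_size, max_size) in bytes from measured inflation times.

    inflate_seconds maps a blob size in bytes to the seconds it took to
    inflate a blob of that size. Both tuning parameters are hypothetical.
    """
    # A blob may use at most this share of a node's local memory.
    cap = local_memory_bytes * memory_fraction

    # Inflation throughput (bytes per second) for each size that fits.
    throughput = {
        size: size / seconds
        for size, seconds in inflate_seconds.items()
        if size <= cap and seconds > 0
    }
    if not throughput:
        raise ValueError("no measured blob size fits in local memory")

    # Keep the sizes that inflate nearly as fast as the best one.
    best = max(throughput.values())
    good = [size for size, rate in throughput.items()
            if rate >= tolerance * best]
    return min(good), max(good)
```

The memory cap bounds the range from above, while the throughput threshold removes sizes so small that per-blob overhead dominates.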
22. A cloud computing system comprising:
a plurality of processor nodes, each processor node having access to a memory device and persistent storage devices;
the persistent storage devices storing a plurality of value blobs, each value blob of the plurality of value blobs containing one or more serialized data items;
a data de-serialization library generator operable to:
a) receive a structure definition for a specific type of data item;
b) based on the structure definition, generate machine-executable instructions comprising a de-serialization library corresponding to the specific type of data item, which when executed, de-serializes a data item of the specific type; and
c) store the de-serialization library in a value blob; and
an object inflation module that, in response to a request to copy a particular data item from persistent storage into a memory device, is operable to retrieve from a value blob a de-serialization library for a particular data item type and execute the library to create and load a data item object into the memory device.
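The generator and inflation module described in this claim can be illustrated with the sketch below. It makes loud simplifications: generated Python source stands in for the claim's machine-executable instructions, a plain dict stands in for blob storage, and the structure definition is a list of `(field name, struct format code)` pairs. The names `generate_library` and `inflate` are hypothetical.

```python
import struct


def generate_library(fields):
    """Generate a de-serialization library from a structure definition.

    fields is a list of (name, struct format code) pairs, for example
    [("id", "I"), ("score", "d")]. Returns the library as bytes so that
    it can be stored in a value blob like any other payload.
    """
    fmt = "<" + "".join(code for _, code in fields)  # little-endian, packed
    names = [name for name, _ in fields]
    source = (
        "import struct\n"
        f"_FMT = struct.Struct({fmt!r})\n"
        f"_NAMES = {names!r}\n"
        "def deserialize(data):\n"
        "    return dict(zip(_NAMES, _FMT.unpack(data)))\n"
    )
    return source.encode("utf-8")


def inflate(blobs, library_blob_id, item_blob_id):
    """Load the stored library, execute it, and build the data item object."""
    namespace = {}
    # Executing stored code is only acceptable when the blob is trusted.
    exec(blobs[library_blob_id].decode("utf-8"), namespace)
    return namespace["deserialize"](blobs[item_blob_id])
```

Storing the library next to the data means a node needs no compiled-in knowledge of the item type: it can inflate any item whose library blob it is able to retrieve.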
Specification