System and method for querying a distributed dwarf cube
First Claim
1. A method for querying a distributed dwarf cube comprising a plurality of dwarf cuboids, wherein the distributed dwarf cube is built using a mapreduce technique, the method comprising:
receiving, by a processor, a query for retrieving data from a distributed dwarf cube, wherein the distributed dwarf cube is built of the data, wherein the data comprises cube values, wherein the distributed dwarf cube is built by:
processing the data, at a first mapreduce job of a series of mapreduce jobs, to generate indexes for the data, wherein the indexes are generated for each dimension in the data, wherein the cube values are replaced with a corresponding index for each dimension of the data;
sorting the cube values in one or more dimensions based on a cardinality of the cube values and index associated with each cube value, wherein the cube values are sorted in an order of highest cardinality to lowest cardinality at a second mapreduce job of the series of mapreduce jobs, wherein the cardinality indicates distinctiveness of the cube values in the one or more dimensions;
partitioning the sorted data into data blocks based on a predefined size, wherein each data block is associated with a range, wherein the range corresponds to a start cube value and an end cube value of a highest cardinality dimension in the data block;
building a distributed dwarf cube, comprising dwarf cuboids, at a third mapreduce job of the series of mapreduce jobs, wherein each dwarf cuboid is generated, from a data block, based on the range associated with the data block by:
processing the data block using a dwarf algorithm;
eliminating the dimensions with the highest cardinality from the data;
processing the data recursively based on the series of mapreduce jobs till all the dimensions in the data block are eliminated; and
storing the generated cuboid on a Distributed File System;
querying, by the processor, the distributed dwarf cube, wherein a cluster of query engines is utilized for querying by:
checking, by the processor, the one or more ranges of the cube values based upon the query, wherein the one or more ranges comprise complete cube values and non-complete cube values, wherein the non-complete cube values indicate the cube values present at a start or an end of a range of the one or more ranges;
creating, by the processor, a list of the cube values comprising the complete cube values and/or the non-complete cube values; and
transmitting, by the processor, the list of the cube values from the distributed dwarf cube corresponding to the query.
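The build-side steps recited in claim 1 (per-dimension indexing, ordering dimensions from highest to lowest cardinality, sorting, and partitioning into fixed-size blocks that each carry a range) can be illustrated with a minimal single-process sketch. It is not the patented mapreduce implementation: the function names, the in-memory lists, and the dictionary layout of a block are all assumptions made for illustration, and the dwarf construction of the third job is deliberately left out.

```python
def build_indexes(rows):
    """Step 1 (assumed form): give every distinct value in each dimension an
    integer index and replace the values in the rows with those indexes."""
    n_dims = len(rows[0])
    indexes = [{} for _ in range(n_dims)]
    encoded = []
    for row in rows:
        enc = []
        for d, value in enumerate(row):
            # setdefault assigns the next free index the first time a value is seen
            enc.append(indexes[d].setdefault(value, len(indexes[d])))
        encoded.append(tuple(enc))
    return indexes, encoded


def order_dimensions_by_cardinality(indexes):
    """Return dimension positions ordered from highest to lowest cardinality."""
    return sorted(range(len(indexes)), key=lambda d: len(indexes[d]), reverse=True)


def sort_rows(encoded, dim_order):
    """Step 2 (assumed form): put the highest-cardinality dimension first,
    then sort the rows by their index tuples."""
    reordered = [tuple(row[d] for d in dim_order) for row in encoded]
    return sorted(reordered)


def partition(sorted_rows, block_size):
    """Step 3 (assumed form): cut the sorted rows into blocks of a predefined
    size; each block records the start and end value of its leading
    (highest-cardinality) dimension as its range."""
    blocks = []
    for i in range(0, len(sorted_rows), block_size):
        chunk = sorted_rows[i:i + block_size]
        blocks.append({"range": (chunk[0][0], chunk[-1][0]), "rows": chunk})
    return blocks


# Hypothetical two-dimensional input: the second dimension has more distinct values.
rows = [("x", "a"), ("x", "b"), ("y", "c"), ("y", "a"), ("y", "b")]
indexes, encoded = build_indexes(rows)
dim_order = order_dimensions_by_cardinality(indexes)
blocks = partition(sort_rows(encoded, dim_order), block_size=3)
```

With a block size of 3, the index value 1 of the leading dimension ends the first block and starts the second, which is the situation the claim appears to address with its "non-complete" cube values.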
Abstract
Systems and methods for querying a distributed dwarf cube are disclosed. A query for retrieving data from a distributed dwarf cube is received. The distributed dwarf cube is built of the data. The data comprises cube values. The distributed dwarf cube is built by processing the data to generate indexes for the data. The cube values in one or more dimensions are sorted based on a cardinality of the cube values. The data is partitioned into data blocks to build the distributed dwarf cube from each data block based upon the cardinality of the cube values. The distributed dwarf cube comprises one or more ranges defined for the cube values. The one or more ranges of the cube values are checked based upon the query. Using the cube values, a list is created. The list of the cube values is transmitted from the distributed dwarf cube corresponding to the query.
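The query-side range check summarized in the abstract can be sketched as follows. This is one possible reading, not the disclosed implementation: it assumes each stored range is a (start, end) pair of integer indexes on the highest-cardinality dimension, that every index between start and end occurs, and that a value is "non-complete" when it sits at the start or end of a range (so its rows may continue in a neighbouring block). The names `classify_values` and `build_result_list` are hypothetical.

```python
def classify_values(block_ranges, lo, hi):
    """Split the queried values lo..hi into complete values (strictly inside
    one block's range) and non-complete values (at a range's start or end)."""
    complete, non_complete = [], []
    for start, end in block_ranges:
        if end < lo or start > hi:
            continue  # this block's range does not overlap the query
        for v in range(max(start, lo), min(end, hi) + 1):
            target = non_complete if v in (start, end) else complete
            if v not in target:  # a boundary value can appear in two blocks
                target.append(v)
    return complete, non_complete


def build_result_list(complete, non_complete):
    """Merge both kinds of values into the list returned for the query."""
    return sorted(complete + non_complete)


# Hypothetical ranges for three blocks and a query over values 2..8.
ranges = [(0, 3), (3, 7), (8, 9)]
complete, non_complete = classify_values(ranges, lo=2, hi=8)
result = build_result_list(complete, non_complete)
```

In a real deployment the non-complete values would presumably need their partial aggregates merged across the adjacent cuboids before being returned; that merge is outside this sketch.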
17 Claims
9. A system for querying a distributed dwarf cube comprising a plurality of dwarf cuboids, wherein the distributed dwarf cube is built using a mapreduce technique, the system comprising:
a processor;
a memory coupled to the processor, wherein the processor executes program instructions stored in the memory, to:
receive a query for retrieving data from a distributed dwarf cube, wherein the distributed dwarf cube is built of the data, wherein the data comprises cube values, wherein the distributed dwarf cube is built by:
processing the data, at a first mapreduce job of a series of mapreduce jobs, to generate indexes for the data, wherein the indexes are generated for each dimension in the data, wherein the cube values are replaced with a corresponding index for each dimension of the data;
sorting the cube values in one or more dimensions based on a cardinality of the cube values and index associated with each cube value, wherein the cube values are sorted in an order of highest cardinality to lowest cardinality at a second mapreduce job of the series of mapreduce jobs, wherein the cardinality indicates distinctiveness of the cube values in the one or more dimensions;
partitioning the sorted data into data blocks based on a predefined size, wherein each data block is associated with a range, wherein the range corresponds to a start cube value and an end cube value of a highest cardinality dimension in the data block;
building a distributed dwarf cube, comprising dwarf cuboids, at a third mapreduce job of the series of mapreduce jobs, wherein each dwarf cuboid is generated, from a data block, based on the range associated with the data block by:
processing the data block using a dwarf algorithm;
eliminating the dimensions with the highest cardinality from the data;
processing the data recursively based on the series of mapreduce jobs till all the dimensions in the data block are eliminated; and
storing the generated cuboid on a Distributed File System;
query the distributed dwarf cube, wherein a cluster of query engines is utilized to query the distributed dwarf cube to:
check the one or more ranges of the cube values based upon the query, wherein the one or more ranges comprise complete cube values and non-complete cube values, wherein the non-complete cube values indicate the cube values present at a start or an end of a range of the one or more ranges;
create a list of the cube values comprising the complete cube values and/or the non-complete cube values; and
transmit the list of the cube values from the distributed dwarf cube corresponding to the query.
17. A non-transitory computer readable medium embodying a program executable in a computing device for querying a distributed dwarf cube comprising a plurality of dwarf cuboids, wherein the distributed dwarf cube is built using a mapreduce technique, the program comprising:
a program code for receiving a query for retrieving data from a distributed dwarf cube, wherein the distributed dwarf cube is built of the data, wherein the data comprises cube values, wherein the distributed dwarf cube is built by:
processing the data at a first mapreduce job of a series of mapreduce jobs, to generate indexes for the data, wherein the indexes are generated for each dimension in the data, wherein the cube values are replaced with a corresponding index for each dimension of the data;
sorting the cube values in one or more dimensions based on a cardinality of the cube values and index associated with each cube value, wherein the cube values are sorted in an order of highest cardinality to lowest cardinality at a second mapreduce job of the series of mapreduce jobs, wherein the cardinality indicates distinctiveness of the cube values in the one or more dimensions;
partitioning the sorted data into data blocks based on a predefined size, wherein each data block is associated with a range, wherein the range corresponds to a start cube value and an end cube value of a highest cardinality dimension in the data block;
building a distributed dwarf cube, comprising dwarf cuboids, at a third mapreduce job of the series of mapreduce jobs, wherein each dwarf cuboid is generated, from a data block, based on the range associated with the data block by:
processing the data block using a dwarf algorithm;
eliminating the dimensions with the highest cardinality from the data;
processing the data recursively based on the series of mapreduce jobs till all the dimensions in the data block are eliminated; and
storing the generated cuboid on a Distributed File System;
a program code for querying the distributed dwarf cube, wherein a cluster of query engines is utilized for querying by:
checking the one or more ranges of the cube values based upon the query, wherein the one or more ranges comprise complete cube values and non-complete cube values, wherein the non-complete cube values indicate the cube values present at a start or an end of a range of the one or more ranges;
creating a list of the cube values comprising the complete cube values and/or the non-complete cube values; and
transmitting the list of the cube values from the distributed dwarf cube corresponding to the query.
Specification