×

Custom data warehouse on top of mapreduce

  • US 8,918,388 B1
  • Filed: 03/31/2010
  • Issued: 12/23/2014
  • Est. Priority Date: 02/26/2010
  • Status: Active Grant
First Claim
Patent Images

1. A computer-implemented method for using a data warehouse comprising a cluster of nodes, comprising:

  • receiving, by a query processor on a first node of said data warehouse, a particular query, wherein said particular query comprises a DATES clause that specifies date ranges and a FROM clause that specifies at least one virtual view, wherein said at least one virtual view has a nested relational structure and is constructed on the fly by performing a per-node in-memory hash join between one or more partitions of a nested relational fact table and one or more dimension tables;

    translating, by said query processor, said particular query into a corresponding map phase and a corresponding reduce phase, wherein said translating further comprises the steps of;

    determining, by said query processor, one or more fact table partitions based in part on partitioning by the date ranges specified in said particular query; and

    determining and applying particular optimization techniques, based on said particular query;

    at each of said one or more other nodes of said data warehouse, each node having one or more of said fact table partitions and having said one or more dimensions tables, creating the at least one virtual view by joining said one or more fact tables with attributes of any of said one or more dimension tables, said joining based on said particular query, wherein when there are two or more virtual views in the FROM clause, performing a union on all the rows from each individual virtual view and when a specific column referred to in the query does not exist in the view, treat the specific column as a NULL value;

    performing, by a scan processor at said each node of said one or more nodes, a scan over said at least one virtual views based on said particular query and any of said particular optimization techniques;

    performing, by an aggregation processor at said each node of said one or more nodes, local aggregation of said scanned results;

    at a third node of said one or more nodes, performing global aggregation of said local aggregations to generate search results of said particular query for further post-processing; and

    performing a SELECT on particular columns from said one or more fact table partitions and not from any of said dimension tables,creating a materialized view by joining results of said SELECT on particular columns with said dimension tables; and

    using said materialized view at run-time when;

    said particular query refers to a virtual view that corresponds to said one or more fact table partitions from which said particular columns were selected;

    non-aggregate columns referenced in SELECT, WHERE, GROUP BY clauses in said particular query are a subset of group by columns of said materialized view; and

    aggregate columns are computable from aggregate columns of said materialized view.

View all claims
  • 4 Assignments
Timeline View
Assignment View
    ×
    ×