Populating a data warehouse using a pipeline approach
First Claim
1. A data collection and warehousing system comprising one or more computers that implement a processing pipeline, wherein the pipeline receives individual log files from a plurality of different servers on a periodic basis and passes the received files through a sequence of operations, the operations comprising:
- parsing the log files to generate, for each data file, (a) a fact file containing one or more primary key IDs and metrics for eventual use in a data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
parsing the fact files to generate, for each fact file, a plurality of fact tables corresponding to different fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
parsing the dimension files to generate, for each dimension file, a plurality of dimension tables corresponding to different dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging tables corresponding to the same data warehouse table to generate fact and dimension tables that each correspond to a single one of the data warehouse tables; and
loading the merged tables into the data warehouse tables.
2 Assignments
0 Petitions
Accused Products
Abstract
Described herein is a data collection and warehousing system that collects log files from a plurality of servers, and that transforms the entries of the log files prior to using them to populate a database. Nightly, the system provides a pre-processing program to each of the servers. The servers executing the pre-processing component to filter the logged data before sending it on to the data collection and warehousing system. The data collection and warehousing system pre-parses each received log file to create dimension and fact tables. During this process, strings are replaced by keys. These keys are calculated on-the-fly by taking a hash of each string. The resulting tables are then further parsed to create further dimension and fact tables that correspond to each of the dimension and fact tables of a relational database schema. After further processing, these tables are loaded into a data warehousing system.
-
Citations
40 Claims
-
1. A data collection and warehousing system comprising one or more computers that implement a processing pipeline, wherein the pipeline receives individual log files from a plurality of different servers on a periodic basis and passes the received files through a sequence of operations, the operations comprising:
-
parsing the log files to generate, for each data file, (a) a fact file containing one or more primary key IDs and metrics for eventual use in a data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
parsing the fact files to generate, for each fact file, a plurality of fact tables corresponding to different fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
parsing the dimension files to generate, for each dimension file, a plurality of dimension tables corresponding to different dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging tables corresponding to the same data warehouse table to generate fact and dimension tables that each correspond to a single one of the data warehouse tables; and
loading the merged tables into the data warehouse tables. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
the data warehouse is a relational database;
parsing the log files is performed with one or more special-purpose programs other than a relational database program;
parsing the fact files is performed with one or more programs other than a relational database program; and
parsing the dimension files is performed with one or more programs other than a relational database program.
-
-
10. A system as recited in claim 1, wherein:
-
the data collection and warehousing system periodically provides a pre-processor component to the plurality of servers for execution by each of the servers to pre-process one or more log files on that server;
the data collection and warehousing system receives the pre-processed log files from the individual servers after they have executed the pre-processor component.
-
-
11. A method of populating a data warehouse with logged data from a plurality of servers, wherein the data warehouse comprises a relational database having a plurality of fact and dimension tables, the method comprising the following operations:
-
periodically receiving the logged data from the individual servers;
for the logged data from each server, parsing the logged data to generate (a) a fact file containing one or more primary key IDs and metrics for eventual use in the data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
from each of the fact files, creating non-relational fact tables corresponding to fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
from each of the dimension files, creating non-relational dimension tables corresponding to dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging the non-relational fact tables;
merging the non-relational dimension tables;
loading the non-relational fact tables into corresponding relational fact tables of the relational database;
loading the non-relational dimension tables into corresponding relational dimension tables of the relational database. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
parsing the logged data is performed with one or more special-purpose programs other than a relational database program;
parsing the non-relational fact tables is performed with one or more programs other than a relational database program; and
parsing the non-relational dimension tables is performed with one or more programs other than a relational database program.
-
-
21. A method as recited in claim 11, further comprising:
-
periodically providing a pre-processor component to the plurality of servers for execution by each of the servers to pre-process one or more log files on that server;
receiving the pre-processed log files from the individual servers after they have executed the pre-processor component.
-
-
22. One or more computer-readable media containing instructions for populating a data warehouse with logged data from a plurality of servers, wherein the data warehouse comprises a relational database having a plurality of fact and dimension tables, the instructions performing operations comprising:
-
periodically providing a pre-processor component to the plurality of servers for execution by each of the servers to pre-process one or more log files on that server;
receiving the pre-processed log files from the individual servers after they have executed the pre-processor component;
for each received log file, parsing the log file to generate (a) a fact file containing one or more primary key IDs and metrics for eventual use in the data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
from each of the fact files, creating non-relational fact tables corresponding to fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
from each of the dimension files, creating non-relational dimension tables corresponding to dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
loading the non-relational fact tables into corresponding relational fact tables of the relational database;
loading the non-relational dimension tables into corresponding relational dimension tables of the relational database. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31)
parsing the log files is performed with one or more special-purpose programs other than a relational database program;
parsing the non-relational fact tables is performed with one or more programs other than a relational database program; and
parsing the non-relational dimension tables is performed with one or more programs other than a relational database program.
-
-
32. A method of populating a data warehouse with logged data from a plurality of servers, comprising:
-
periodically providing a data processing program to the plurality of logging servers for execution by each of the servers to pre-process data on that server;
executing the data processing program on each of the plurality of servers to produce pre-processed data;
providing the pre-processed data to a central collection facility;
further processing the data at the central collection facility; and
loading the further processed data into the data warehouse. - View Dependent Claims (33)
for the data from each server, parsing the data to generate (a) a fact file containing one or more primary key IDs and metrics for eventual use in the data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
from each of the fact files, creating non-relational fact tables corresponding to fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
from each of the dimension files, creating non-relational dimension tables corresponding to dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging the non-relational fact tables;
merging the non-relational dimension tables;
loading the non-relational fact tables into corresponding relational fact tables of the data warehouse;
loading the non-relational dimension tables into corresponding relational dimension tables of the data warehouse.
-
-
34. One or more computer-readable media containing computer-executable instructions for performing operations comprising:
-
daily sending a data processing program to a plurality of logging servers for execution by each of the servers to pre-process data on that server;
receiving pre-processed data from the servers;
transforming the data for use in a data warehouse; and
loading the transformed data into the data warehouse. - View Dependent Claims (35)
for the pre-processed data from each server, parsing the data to generate (a) a fact file containing one or more primary key IDs and metrics for eventual use in the data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
from each of the fact files, creating non-relational fact tables corresponding to fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
from each of the dimension files, creating non-relational dimension tables corresponding to dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging the non-relational fact tables;
merging the non-relational dimension tables;
loading the non-relational fact tables into corresponding relational fact tables of the data warehouse;
loading the non-relational dimension tables into corresponding relational dimension tables of the data warehouse.
-
-
36. A data collection and warehousing facility comprising one or more computers programmed to perform operations comprising:
-
daily sending a data processing program to a plurality of servers for execution by each of the servers to pre-process log files on that server;
receiving pre-processed log files from the servers;
transforming the log files to produce transformed data; and
loading the transformed data into a data warehouse. - View Dependent Claims (37)
for the data from each server, parsing the data to generate (a) a fact file containing one or more primary key IDs and metrics for eventual use in the data warehouse, and (b) a dimension file containing one or more primary key IDs and strings for eventual use in the data warehouse;
from each of the fact files, creating non-relational fact tables corresponding to fact tables of the data warehouse, each fact table containing one or more primary key IDs and corresponding metrics;
from each of the dimension files, creating non-relational dimension tables corresponding to dimension tables of the data warehouse, each dimension table containing one or more primary key IDs and dimension strings;
merging the non-relational fact tables;
merging the non-relational dimension tables;
loading the non-relational fact tables into corresponding relational fact tables of the data warehouse;
loading the non-relational dimension tables into corresponding relational dimension tables of the data warehouse.
-
-
38. A method of transforming data records, wherein certain strings are repeated within the data records, comprising:
-
upon encountering one of the strings in a data record, calculating a hash value from said one of the strings;
replacing said one of the strings in the data record with the calculated hash value;
upon encountering the same string in a subsequent data record, recalculating the hash value from the string and replacing the string in the subsequent data record with the calculated hash value, without looking up the previously calculated hash value;
using the hash value as a key to identify the string during subsequent processing of the data records.
-
-
39. One or more computer-readable media containing computer-executable instructions for performing operations comprising:
-
sequentially transforming data records;
upon detecting a particular string in a data record, calculating a hash value from said particular string;
replacing the string in the data record with the calculated hash value;
upon encountering the same string in a subsequent data record, recalculating the hash value from the string and replacing the string in the subsequent data record with the calculated hash value, without looking up the previously calculated hash value;
using the hash value as a key to identify the string during subsequent processing of the data records.
-
-
40. A data collection and warehousing facility that receives log entries from a plurality of servers and transforms the log entries to produce transformed data for use in a data warehouse, the transformation comprising:
-
determining a plurality of strings to be represented by keys;
when encountering one of said strings in a log entry, calculating a hash value from the string;
replacing the string in the log entry with the hash value;
upon encountering the same string in a subsequent log entry, recalculating the hash value from the string and replacing the string in the subsequent log entry with the calculated hash value, without looking up the previously calculated hash value;
using the hash value as a key to identify the string during subsequent processing of the data records.
-
Specification