Data stream splitting for low-latency data access
First Claim
1. A method comprising:
producing, at a plurality of front end servers, log data based on real-time user activities;
transmitting the log data to an aggregating server;
aggregating the log data at the aggregating server;
splitting the log data into a plurality of log data streams based on bucket numbers by, for each entry of log data:
calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
feeding the log data streams to at least one back end server in parallel;
staging the log data at the aggregating server and providing the at least one back end server with access to the log data in real time;
staging the log data at a specified data staging area in one of the front end servers in an event the aggregating server is not available, and providing the at least one back end server with access to the log data from the specified data staging area in real time;
sending the log data to a data warehouse; and
providing, by the data warehouse, the at least one back end server with access to the log data for offline data analysis;
wherein the log data staged at the aggregating server includes a plurality of log data entries, and each individual log data entry includes the category field and the application identification that identifies the data consuming application.
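The splitting step recited above (hash the category field together with the application identification, then take the hash value modulo the total number of buckets) can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim does not name a hash function, so the use of MD5, the NUL separator between the two fields, and the dictionary keys `category` and `app_id` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def bucket_number(category: str, app_id: str, total_buckets: int) -> int:
    """Return the bucket for one log entry: hash(category, app_id) mod total_buckets."""
    # Assumption: the two fields are joined with a NUL byte so that
    # ("ab", "c") and ("a", "bc") do not produce the same hash input.
    key = f"{category}\x00{app_id}".encode("utf-8")
    # Assumption: MD5 is used only as a stable, well-mixed hash (not for security).
    hash_value = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return hash_value % total_buckets


def split_into_streams(entries, total_buckets: int):
    """Group log entries into per-bucket streams keyed by bucket number."""
    streams = defaultdict(list)
    for entry in entries:
        b = bucket_number(entry["category"], entry["app_id"], total_buckets)
        streams[b].append(entry)
    return dict(streams)


if __name__ == "__main__":
    log = [
        {"category": "click", "app_id": "newsfeed", "payload": "..."},
        {"category": "view", "app_id": "ads", "payload": "..."},
    ]
    for bucket, stream in split_into_streams(log, total_buckets=8).items():
        print(bucket, len(stream))
```

Because the bucket depends only on the category and the application identification, every entry intended for the same consuming application and category lands in the same stream, which is what lets a back end server read one stream without scanning the others.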
Abstract
Techniques for facilitating and accelerating log data processing by splitting data streams are disclosed herein. Front-end clusters generate large amounts of log data in real time and transfer the log data to an aggregating cluster. The aggregating cluster aggregates incoming log data streams from different front-end servers and clusters. It further splits the log data into a plurality of data streams so that the data streams can be sent to a receiving application in parallel. In one embodiment, the log data are split randomly so that the entries are evenly distributed across the split data streams. In another embodiment, the application that receives the split data streams determines how the log data are split.
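The two splitting embodiments mentioned in the abstract can be contrasted with a short sketch. It assumes a splitter is any function that maps an entry and a stream count to a stream index; the names `random_splitter`, `by_user_id`, and the `user_id` field are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Dict, Iterable, List

Entry = Dict[str, int]
Splitter = Callable[[Entry, int], int]


def random_splitter(rng: random.Random) -> Splitter:
    """First embodiment: pick a stream uniformly at random for each entry."""
    def choose(entry: Entry, total_streams: int) -> int:
        return rng.randrange(total_streams)
    return choose


def by_user_id(entry: Entry, total_streams: int) -> int:
    """Second embodiment (hypothetical rule supplied by the receiving application)."""
    return entry["user_id"] % total_streams


def split(entries: Iterable[Entry], total_streams: int, splitter: Splitter) -> List[List[Entry]]:
    """Distribute entries over total_streams lists using the given splitter."""
    streams: List[List[Entry]] = [[] for _ in range(total_streams)]
    for entry in entries:
        index = splitter(entry, total_streams)
        if not 0 <= index < total_streams:
            raise ValueError(f"splitter returned invalid stream index {index}")
        streams[index].append(entry)
    return streams


if __name__ == "__main__":
    data = [{"user_id": i} for i in range(1000)]
    print([len(s) for s in split(data, 4, random_splitter(random.Random()))])
    print([len(s) for s in split(data, 4, by_user_id)])
```

Random splitting balances load without looking at entry contents, while an application-supplied rule trades some balance for the guarantee that related entries arrive on the same stream.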
78 Citations
18 Claims
15. An aggregating server, comprising:
a processor;
a network interface, coupled to the processor, through which the aggregating server can communicate with a plurality of front end servers;
a data storage including a data staging area; and
a memory storing instructions which, when executed by the processor, cause the aggregating server to perform a process including:
receiving log data from the front end servers, wherein the front end servers produce the log data based on real-time user activities,
aggregating the log data,
staging the log data at the data staging area,
splitting the log data into a plurality of log data streams so that one or more back end servers can retrieve the log data streams in parallel based on bucket numbers by, for each entry of log data:
calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
staging the log data at the aggregating server and providing the at least one of the back end servers with access to the log data in real time;
sending the log data to a data warehouse; and
wherein the log data staged at the aggregating server includes a plurality of log data entries, and each individual log data entry includes the category field and the application identification that identifies the data consuming application.
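Claim 15 describes an aggregating server that stages entries and splits them so that back end servers can retrieve the streams in parallel. A minimal in-memory sketch of that arrangement follows. The `StagingArea` class, the use of CRC32 as the hash, and the thread pool standing in for multiple back end servers are all assumptions for illustration; the claim does not prescribe any of them.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class StagingArea:
    """Hypothetical in-memory staging area that keeps one list per bucket."""

    def __init__(self, total_buckets: int):
        self.total_buckets = total_buckets
        self._buckets = [[] for _ in range(total_buckets)]

    def bucket_for(self, entry: dict) -> int:
        # Assumption: CRC32 over "category NUL app_id" serves as the hash value.
        key = f"{entry['category']}\x00{entry['app_id']}".encode("utf-8")
        return zlib.crc32(key) % self.total_buckets

    def stage(self, entry: dict) -> int:
        """Stage one entry in its bucket and return the bucket number."""
        bucket = self.bucket_for(entry)
        self._buckets[bucket].append(entry)
        return bucket

    def read_bucket(self, bucket: int) -> list:
        """Return a snapshot of one bucket's stream."""
        return list(self._buckets[bucket])


def retrieve_in_parallel(staging: StagingArea, workers: int = 4) -> list:
    """Read every bucket concurrently, one task per bucket."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(staging.read_bucket, range(staging.total_buckets)))


if __name__ == "__main__":
    area = StagingArea(total_buckets=4)
    for i in range(20):
        area.stage({"category": f"cat{i % 5}", "app_id": "analytics", "seq": i})
    print([len(stream) for stream in retrieve_in_parallel(area)])
```

In a real deployment the readers would be separate back end processes and the staging area would be durable storage; the sketch only shows that bucketed staging lets each reader fetch its stream independently of the others.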
17. A computer-implemented system, comprising:
a plurality of front end servers configured for producing log data based on real-time user activities; and
multiple aggregating servers configured for aggregating the log data received from at least some of the front end servers, the aggregating servers being connected with at least some of the front end servers via a network;
wherein at least one of the aggregating servers includes a data staging area configured for staging the log data and providing one or more back end servers with access to the log data in real time, and at least one of the aggregating servers is configured for splitting the log data into a plurality of log data streams so that the one or more back end servers can retrieve the log data streams in parallel;
wherein at least one of the front end servers is configured to:
include a specified data staging area for staging the log data in an event the at least one of the aggregating servers is not available, and
provide the one or more back end servers with access to the log data from the specified data staging area in real time;
wherein the log data is split by, for each entry of log data:
calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
at least one second level aggregating server configured for further aggregating the log data received from the multiple aggregating servers, the second level aggregating server being connected with the multiple aggregating servers, wherein the second level aggregating server includes a second level data staging area configured for staging the log data so that the back end server can access the log data in real time; and
a data warehouse for receiving the log data from the at least one aggregating server, the data warehouse providing the one or more back end servers with access to the log data for offline data analysis.
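The fallback recited in the claims (a front end server stages log data locally when the aggregating server is not available, and back end servers can still read it) can be sketched as follows. The class names, the `available` flag, and the use of `ConnectionError` to signal an unreachable aggregating server are hypothetical choices for this example.

```python
class AggregatingServer:
    """Hypothetical aggregating server with an in-memory staging area."""

    def __init__(self):
        self.available = True
        self.staged = []

    def receive(self, entry: dict) -> None:
        if not self.available:
            # Assumption: unavailability surfaces to the sender as a ConnectionError.
            raise ConnectionError("aggregating server unavailable")
        self.staged.append(entry)


class FrontEndServer:
    """Hypothetical front end server with its own fallback staging area."""

    def __init__(self, aggregator: AggregatingServer):
        self.aggregator = aggregator
        self.local_staging = []

    def emit(self, entry: dict) -> str:
        """Send an entry to the aggregator, or stage it locally if that fails."""
        try:
            self.aggregator.receive(entry)
            return "aggregator"
        except ConnectionError:
            self.local_staging.append(entry)
            return "front_end"


def read_for_back_end(aggregator: AggregatingServer, front_ends: list) -> list:
    """Collect entries a back end server can access from both staging locations."""
    entries = list(aggregator.staged)
    for front_end in front_ends:
        entries.extend(front_end.local_staging)
    return entries


if __name__ == "__main__":
    agg = AggregatingServer()
    fe = FrontEndServer(agg)
    print(fe.emit({"seq": 1}))
    agg.available = False
    print(fe.emit({"seq": 2}))
    print(len(read_for_back_end(agg, [fe])))
```

The sketch omits the second level aggregating server and the data warehouse; it only illustrates that entries remain readable in real time regardless of which staging area holds them.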
Specification