Data stream splitting for low-latency data access

US 10,223,431 B2
Filed: 01/31/2013
Issued: 03/05/2019
Est. Priority Date: 01/31/2013
Status: Active Grant


Abstract

Techniques for facilitating and accelerating log data processing by splitting data streams are disclosed herein. The front-end clusters generate large amounts of log data in real time and transfer the log data to an aggregating cluster. The aggregating cluster aggregates incoming log data streams from different front-end servers and clusters, then splits the log data into a plurality of data streams so that the data streams are sent to a receiving application in parallel. In one embodiment, the log data are split randomly so that they are evenly distributed across the split data streams. In another embodiment, the application that receives the split data streams determines how to split the log data.
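The random-split embodiment mentioned in the abstract can be illustrated with a minimal sketch. The function name `split_randomly`, the seed parameter, and the sample entries are hypothetical and do not come from the patent; the sketch only assumes that each entry is assigned to a stream uniformly at random.

```python
import random

def split_randomly(entries, num_streams, seed=None):
    """Assign each entry to one of num_streams streams uniformly at random.

    Hypothetical helper: with many entries, the streams end up
    roughly (not exactly) the same size.
    """
    rng = random.Random(seed)
    streams = [[] for _ in range(num_streams)]
    for entry in entries:
        streams[rng.randrange(num_streams)].append(entry)
    return streams

# Made-up sample data for illustration only.
entries = [f"entry-{i}" for i in range(10_000)]
streams = split_randomly(entries, num_streams=4, seed=0)
sizes = [len(s) for s in streams]
print(sizes)  # four counts, each close to 2,500
```

Random assignment balances load but does not keep related entries together, which is the trade-off the hash-based split in the claims addresses.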

18 Claims

1. A method comprising:
- producing, at a plurality of front end servers, log data based on real-time user activities;
  
  transmitting the log data to an aggregating server;
  
  aggregating the log data at the aggregating server;
  
  splitting the log data into a plurality of log data streams based on bucket numbers by, for each entry of log data:
  
  calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
  
  determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
  
  feeding the log data streams to at least one back end server in parallel;
  
  staging the log data at the aggregating server and providing the at least one back end server with access to the log data in real time;
  
  staging the log data at a specified data staging area in one of the front end servers in an event the aggregating server is not available, and providing the at least one back end server with access to the log data from the specified data staging area in real time;
  
  sending the log data to a data warehouse; and
  
  providing, by the data warehouse, the at least one back end server with access to the log data for offline data analysis;
  
  wherein the log data staged at the aggregating server includes a plurality of log data entries, and each individual log data entry includes the category field and the application identification that identifies the data consuming application.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 2. The method of claim 1, wherein the step of splitting comprises:
    - splitting the log data randomly so that the log data are evenly distributed to the plurality of log data streams.
  - 3. The method of claim 1, further comprising:
    - receiving an instruction from the at least one back end server regarding how to split the log data into the plurality of log data streams.
  - 4. The method of claim 1, wherein the total number of buckets is a total number of the plurality of log data streams; and
    - the method further comprising assigning the respective entry of the log data to a log data stream identified by the bucket number.
  - 5. The method of claim 4, wherein the category field includes a high level description of an intended destination of the log data entry.
  - 6. The method of claim 1, wherein the total number of buckets is determined by a number of back end servers that are available to receive the log data streams and a number of connections that each back end server is capable of handling.
  - 7. The method of claim 1, wherein the total number of buckets is instructed by a data consuming application running on the at least one back end server.
  - 8. The method of claim 6, wherein the back end servers are equally loaded when the back end servers receive and process the log data streams.
  - 9. The method of claim 1, further comprising:
    - examining prefixes of entries of the log data to determine the log data stream that the entries are assigned to.
  - 10. The method of claim 1, further comprising:
    - sending the log data to a data warehouse; and
      
      processing the log data at the data warehouse so that the data warehouse can respond to data queries based on the processed log data.
  - 11. The method of claim 1, wherein the total number of buckets is instructed by a data consuming application running on one or more back end servers, and the total number of buckets is determined by a number of the back end servers that are available to receive the log data streams and a number of connections that each back end server of the back end servers is capable of handling.
  - 12. The method of claim 1, further comprising sending the log data from the aggregating server to the data warehouse.
  - 13. The method of claim 1, further comprising:
    - further aggregating the log data received from the aggregating server by at least one second level aggregating server, the second level aggregating server being connected with the aggregating server, wherein the second level aggregating server includes a second level data staging area configured for staging the log data so that the at least one back end server can access the log data in real time.
  - 14. The method of claim 13, wherein the back end server can select an individual aggregating server from either the aggregating server or the second level aggregating server, and request to retrieve the log data in real time from the selected individual aggregating server, depending on which of the aggregating server or the second level aggregating server is closer to the at least one back end server in a network topology.
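The splitting step of claim 1, together with claims 4 and 6, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims do not name a hash function, so SHA-256 is used here only because it is stable across runs; the field names `category` and `app_id` are hypothetical; and multiplying servers by connections is just one plausible reading of how claim 6 "determines" the bucket count.

```python
import hashlib

def total_buckets_for(num_backend_servers, connections_per_server):
    # Assumption: one bucket per available back end connection.
    return num_backend_servers * connections_per_server

def bucket_number(category, app_id, total_buckets):
    """Hash the category field and application ID, then take the modulo."""
    key = f"{category}\x00{app_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()  # assumed hash, not from the claims
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % total_buckets

def split_by_bucket(entries, total_buckets):
    """Place each entry in the stream identified by its bucket number."""
    streams = [[] for _ in range(total_buckets)]
    for entry in entries:
        b = bucket_number(entry["category"], entry["app_id"], total_buckets)
        streams[b].append(entry)
    return streams

# Made-up sample entries for illustration only.
sample_entries = [
    {"category": f"cat{i % 5}", "app_id": f"app{i % 3}", "payload": i}
    for i in range(60)
]
num_buckets = total_buckets_for(num_backend_servers=3, connections_per_server=4)
bucket_streams = split_by_bucket(sample_entries, num_buckets)
```

Because the bucket depends only on the category field and the application identification, every entry with the same pair lands in the same stream, so a consuming application can read one stream and see all entries for that pair.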

15. An aggregating server, comprising:
- a processor;
  
  a network interface, coupled to the processor, through which the aggregating server can communicate with a plurality of front end servers;
  
  a data storage including a data staging area; and
  
  a memory storing instructions which, when executed by the processor, cause the aggregating server to perform a process including:
  
  receiving log data from the front end servers, wherein the front end servers produce the log data based on real-time user activities,
  
  aggregating the log data,
  
  staging the log data at the data staging area,
  
  splitting the log data into a plurality of log data streams so that one or more back end servers can retrieve the log data streams in parallel based on bucket numbers by, for each entry of log data:
  
  calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
  
  determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
  
  staging the log data at the aggregating server and providing the at least one of the back end servers with access to the log data in real time;
  
  sending the log data to a data warehouse; and
  
  wherein the log data staged at the aggregating server includes a plurality of log data entries, and each individual log data entry includes the category field and the application identification that identifies the data consuming application.
- Dependent claims (16)
- - 16. The aggregating server of claim 15, wherein the total number of buckets is a total number of the plurality of log data streams, and the aggregating server is further configured to assign the respective entry of the log data to a log data stream identified by the bucket number.
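The process recited in claims 15 and 16 (receive, stage, split by bucket, let back end servers retrieve streams in parallel) might look roughly like the sketch below. The class `AggregatingServer`, its methods, the in-memory staging dictionary, and the use of `zlib.crc32` as the hash are all assumptions made for illustration; the claims do not prescribe any of them.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

class AggregatingServer:
    """Hypothetical in-memory model of an aggregating server."""

    def __init__(self, total_buckets):
        self.total_buckets = total_buckets
        # Staging area: one list of staged entries per bucket (log data stream).
        self.staging_area = {b: [] for b in range(total_buckets)}

    def bucket_for(self, entry):
        # Assumed hash (CRC-32) over category field and application ID.
        key = f"{entry['category']}|{entry['app_id']}".encode("utf-8")
        return zlib.crc32(key) % self.total_buckets

    def receive(self, entry):
        """Stage an incoming entry in the stream identified by its bucket."""
        bucket = self.bucket_for(entry)
        self.staging_area[bucket].append(entry)
        return bucket

    def read_stream(self, bucket):
        """Return a snapshot of one staged stream for a back end reader."""
        return list(self.staging_area[bucket])

agg = AggregatingServer(total_buckets=4)
for i in range(100):
    agg.receive({"category": f"cat{i % 7}", "app_id": f"app{i % 3}", "payload": i})

# Back end side: fetch all four streams in parallel, one worker per stream.
with ThreadPoolExecutor(max_workers=4) as pool:
    fetched = list(pool.map(agg.read_stream, range(4)))
```

A real deployment would stage to durable storage and serve streams over the network; the thread pool here only stands in for back end servers pulling streams concurrently.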

17. A computer-implemented system, comprising:
- a plurality of front end servers configured for producing log data based on real-time user activities; and
  
  multiple aggregating servers configured for aggregating the log data received from at least some of the front end servers, the aggregating servers being connected with at least some of the front end servers via a network;
  
  wherein at least one of the aggregating servers includes a data staging area configured for staging the log data and providing one or more back end servers with access to the log data in real time, and at least one of the aggregating servers is configured for splitting the log data into a plurality of log data streams so that the one or more back end servers can retrieve the log data streams in parallel;
  
  wherein at least one of the front end servers is configured to:
  
  include a specified data staging area for staging the log data in an event the at least one of the aggregating servers is not available, and provide the one or more back end servers with access to the log data from the specified data staging area in real time;
  
  wherein the log data is split by, for each entry of log data:
  
  calculating a hash value of a category field and an application identification that identifies a data consuming application for processing the entry; and
  
  determining a bucket number for the entry by calculating the hash value modulo a total number of buckets;
  
  at least one second level aggregating server configured for further aggregating the log data received from the multiple aggregating servers, the second level aggregating server being connected with the multiple aggregating servers, wherein the second level aggregating server includes a second level data staging area configured for staging the log data so that the back end server can access the log data in real time; and
  
  a data warehouse for receiving the log data from the at least one aggregating server, the data warehouse providing the one or more back end servers with access to the log data for offline data analysis.
- Dependent claims (18)
- - 18. The computer-implemented system of claim 17, wherein the total number of buckets is a total number of the plurality of log data streams, and the computer-implemented system further assigns the respective entry of the log data to a log data stream identified by the bucket number.
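The fallback behavior in claim 17, where a front end server stages log data itself when an aggregating server is unavailable, can be sketched as below. The `StagingArea` class, the `available` flag, and both helper functions are hypothetical; the claim does not say how availability is detected or how the back end locates the fallback staging area.

```python
class StagingArea:
    """Hypothetical staging area on either an aggregating or front end server."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.entries = []

def stage_entry(entry, aggregator, front_end_staging):
    """Stage at the aggregator if reachable, else at the front end's own area."""
    target = aggregator if aggregator.available else front_end_staging
    target.entries.append(entry)
    return target.name

def read_in_real_time(aggregator, front_end_staging):
    """Back end read path: follow the same availability rule as staging."""
    source = aggregator if aggregator.available else front_end_staging
    return list(source.entries)

aggregator = StagingArea("aggregating-server")
fallback = StagingArea("front-end-staging")

first = stage_entry({"payload": 1}, aggregator, fallback)   # aggregator is up
aggregator.available = False                                # simulate an outage
second = stage_entry({"payload": 2}, aggregator, fallback)  # falls back
visible = read_in_real_time(aggregator, fallback)
```

During the simulated outage the back end sees only what was staged at the front end; reconciling the two staging areas after recovery is outside what this sketch covers.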

Current Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Original Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Inventors: Rash, Samuel; Borthakur, Dhruba; Shao, Zheng; Hwang, Eric
Primary Examiner(s): Tiv, Backhean
Assistant Examiner(s): Nguyen, Linh T.
Application Number: US13/756,340
Publication Number: US 20140214752A1
Time in Patent Office: 2,224 Days
Field of Search: 705 37, 705 38, 705 729, 705 733, 705 11, 705 731, 705 261, 705 35, 707927, 70799901, 707999103, 707E17107, 707705, 707E17001, 707E17008, 707600, 707737, 707E17117, 707E17, 707610, 709217, 709218, 709219, 709203, 709201, 709224, 709238, 709215, 709227, 714E11204, 711118, 711216
CPC Class Codes: G06F 16/254 Extract, transform and load...
