System and method for operating a big-data platform

US 9,582,528 B2
Filed: 05/05/2016
Issued: 02/28/2017
Est. Priority Date: 11/10/2011
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A system and method for operating a big-data platform that includes: at a data analysis platform, receiving discrete client data; storing the client data in a network-accessible distributed storage system, which includes storing the client data in a real-time storage system and merging the client data into a columnar-based distributed archive storage system; receiving a data query request through a query interface; and selectively interfacing with the client data from the real-time storage system and the archive storage system according to the query.
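The two storage layouts named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the class names `RealTimeStore` and `ColumnarArchive`, the in-memory lists, and the sample records are all assumptions chosen for illustration. It shows records kept whole in a row-oriented store and later merged into a column-oriented archive, with missing fields back-filled so that columns stay aligned.

```python
from dataclasses import dataclass, field


@dataclass
class RealTimeStore:
    """Hypothetical row-oriented store: each record is kept whole, in arrival order."""
    rows: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        self.rows.append(dict(record))


@dataclass
class ColumnarArchive:
    """Hypothetical column-oriented store: one list of values per field name."""
    columns: dict = field(default_factory=dict)
    length: int = 0

    def merge(self, rows: list) -> None:
        for row in rows:
            # Create any column not seen before, padded with None for
            # earlier records, so every column has the same length.
            for key in row:
                self.columns.setdefault(key, [None] * self.length)
            # Append this record's value (or None) to every column.
            for key, values in self.columns.items():
                values.append(row.get(key))
            self.length += 1


# Illustrative data only.
rt = RealTimeStore()
archive = ColumnarArchive()
rt.append({"time": 1, "path": "/a"})
rt.append({"time": 2, "path": "/b", "status": 200})
archive.merge(rt.rows)
```

After the merge, `archive.columns` holds one aligned list per field, which is the property that makes column scans cheap for analytical queries.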

21 Citations

24 Claims

1. A method for operating a big-data platform comprising:
- at a data analysis platform, receiving discrete client data;
  
  storing the client data in a network accessible distributed storage system that includes:
  
  storing the client data in a real-time storage system in a row format;
  
  merging the client data into a columnar-based distributed archive storage system;
  
  identifying a merge status for the client data merged into the archive storage system, wherein the merge status indicates a redundancy of client data between the real-time storage system and the archive storage system;
  
  receiving a data query through a query interface; and
  
  processing the data query by selectively interfacing with the client data from the real-time storage system and archive storage system, according to a data mapping and reduction process, wherein the real-time storage system and the archive storage system are different, wherein processing the data query comprises:
  
  (i) converting the single data query from a relational database-type query format to a first converted query format compatible with the real-time storage system,
  
  (ii) converting the single data query from the relational database-type query format to a second converted query format compatible with the archive storage system,
  
  (iii) cooperatively querying the real-time storage system and the archive storage system by distributing, in parallel, the first converted query over the real-time storage system and the second converted query over the archive storage system,
  
  (iv) using the merge status and timestamps of the client data in the real-time storage system and the archive storage system to skip client data from either the real-time storage system or the archive storage system if the skipped data is accounted for in the other of the real-time storage system or the archive storage system, and
  
  (v) retrieving a single cohesive query result that incorporates real-time data and archive data returned from the first converted query and the second converted query, respectively,
  
  wherein merging the client data into a columnar-based distributed archive storage system comprises storing the client data in the archive storage system in a columnar format, and
  
  wherein interfacing with the client data from the archive storage system comprises:
  
  converting, by using a query processing cluster, at least a portion of the data query to the mapping process and the reduction process; and
  
  executing the mapping process and the reduction process by using the query processing cluster.
- Dependent claims: 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
  - 2. The method of claim 1, wherein discrete client data is received and stored with dynamic schema.
  - 3. The method of claim 2, wherein the data query includes a schema definition and wherein selectively interfacing with the client data includes applying the schema definition to the dynamic schema.
  - 4. The method of claim 1, further comprising at a client data agent collecting client data and transmitting the client data to the data analysis platform.
  - 5. The method of claim 4, wherein the client data agent is integrated into an event channel from which client data is collected.
  - 7. The method of claim 4, further comprising at the client data agent serializing data into a binary serialization data-interchange that is transmitted to the data analysis platform.
  - 8. The method of claim 4, wherein collecting client data is collected through a client agent data-input plugin.
  - 9. The method of claim 1, wherein the columnar-based distributed archive storage system stores client data in time series order, and wherein selectively interfacing with client data includes querying data from distributed storage system.
  - 10. The method of claim 1, wherein receiving a data query includes converting a relational database-styled query to a data-intensive cluster query process.
  - 11. The method of claim 1, wherein the data query is received through an infographics interface and further comprising returning an infographic from the selectively interfaced client data.
  - 12. The method of claim 1, wherein receiving a data query includes receiving the data query through a business intelligence tool driver and further comprising returning data analytics results to the business intelligence tool driver.
  - 13. The method of claim 1, wherein client data is associated with a user account through unique identifier.
  - 14. The method of claim 13, wherein client data merged into the archive data storage system is isolated according to the user account associated with the client data and the query processing cluster interfaces with the distributed storage system, and the query processing cluster is shared by a plurality of user accounts.
  - 15. The method of claim 1, further comprising at a client data agent collecting client data and transmitting the client data to the data analysis platform;
    - wherein the columnar-based distributed archive storage system stores client data in time series order with a dynamic schema, and wherein selectively interfacing with client data includes cooperatively querying data from the real-time storage system and the archive storage system for a cohesive query result.
  - 16. The method of claim 15, wherein distributed storage system includes over one petabyte of data.
  - 17. The method of claim 1, wherein the mapping process and the reduction process are MapReduce processes.
  - 18. The method of claim 1, wherein the query processing cluster is constructed to execute MapReduce processes, and the mapping process and the reduction process are MapReduce processes.
  - 19. The method of claim 1, wherein the query processing cluster includes a cluster that is constructed to execute MapReduce processes, and the mapping process and the reduction process are MapReduce processes.
  - 20. The method of claim 1, wherein the data analysis platform is a multi-tenant data analysis platform.
  - 21. The method of claim 1, wherein the query result includes structured data.
  - 22. The method of claim 2, wherein the data analysis platform is a multi-tenant data analysis platform.
  - 23. The method of claim 3, wherein the data analysis platform is a multi-tenant data analysis platform.
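Steps (iii) to (v) of claim 1 can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the claimed implementation: the merge status is modeled as a single hypothetical watermark timestamp `merged_up_to` (records at or before it are assumed to be present in the archive), the stores are plain in-memory structures, and the function names are invented for illustration. Both stores are queried in parallel, real-time rows already accounted for in the archive are skipped, and the two partial results are combined into one result.

```python
from concurrent.futures import ThreadPoolExecutor


def query_realtime(rows, predicate, merged_up_to):
    # Skip rows whose timestamp shows they were already merged into the
    # archive (assumed merge status: a single watermark timestamp).
    return [r for r in rows if r["time"] > merged_up_to and predicate(r)]


def query_archive(columns, predicate):
    # Rebuild records from aligned columns, then filter them.
    names = list(columns)
    records = (dict(zip(names, values)) for values in zip(*columns.values()))
    return [r for r in records if predicate(r)]


def cohesive_query(rows, columns, predicate, merged_up_to):
    # Distribute the query over both stores in parallel, then combine.
    with ThreadPoolExecutor(max_workers=2) as pool:
        rt_future = pool.submit(query_realtime, rows, predicate, merged_up_to)
        ar_future = pool.submit(query_archive, columns, predicate)
        combined = ar_future.result() + rt_future.result()
    return sorted(combined, key=lambda r: r["time"])


# Illustrative data: records with time 1 and 2 exist in both stores.
rows = [
    {"time": 1, "status": 200},
    {"time": 2, "status": 500},
    {"time": 3, "status": 500},
]
columns = {"time": [1, 2], "status": [200, 500]}

result = cohesive_query(rows, columns, lambda r: r["status"] == 500, merged_up_to=2)
```

Without the watermark check, the record with `time` 2 would be returned twice, once from each store; the skip step is what keeps the combined result free of duplicates.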

6. A method for operating a big-data platform comprising:
- at a client data agent, collecting discrete client data and transmitting the discrete client data to the data analysis platform, wherein the client data agent is integrated into an event channel from which client data is collected, wherein the event channel is selected from a list comprising syslog, a relational database, cloud data, and sensor data; and
  
  at a data analysis platform, receiving the discrete client data;
  
  storing the client data in a network accessible distributed storage system that includes:
  
  storing the client data in a real-time storage system in a row format;
  
  merging the client data into a columnar-based distributed archive storage system, wherein merging the client data into a columnar-based distributed archive storage system comprises storing the client data in the archive storage system in a columnar format;
  
  identifying a merge status for the client data merged into the archive storage system, wherein the merge status indicates a redundancy of client data between the real-time storage system and the archive storage system;
  
  receiving a data query through a query interface; and
  
  processing the data query by selectively interfacing with the client data from the real-time storage system and archive storage system, according to a data mapping and reduction process, wherein processing the data query comprises cooperatively querying the real-time storage system and the archive storage system and distributing the data query over the real-time storage system and the archive storage system to retrieve a single cohesive query result,
  
  using the merge status and timestamps of the client data in the real-time storage system and the archive storage system to skip client data from either the real-time storage system or the archive storage system if the skipped data is accounted for in the other of the real-time storage system or the archive storage system, and
  
  wherein interfacing with the client data from the archive storage system comprises:
  
  converting, by using a query processing cluster, at least a portion of the data query to the mapping process and the reduction process; and
  
  executing the mapping process and the reduction process by using the query processing cluster.
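The client data agent of claim 6 can be sketched as follows. This is only an illustration under assumptions: the class `ClientDataAgent`, its `send` callback, the batch size, and the record fields are hypothetical, and UTF-8 encoded JSON stands in for whatever serialization a real agent would use. The sketch reads lines from an event channel (here, any iterable of strings, such as syslog lines), turns each into a discrete timestamped record, and transmits the records in batches.

```python
import json
import time
from typing import Callable, Iterable


class ClientDataAgent:
    """Hypothetical agent that batches events from a channel and sends them on."""

    def __init__(self, send: Callable[[bytes], None], batch_size: int = 2):
        self.send = send              # assumed transport to the analysis platform
        self.batch_size = batch_size
        self.buffer = []

    def collect(self, channel: Iterable[str], source: str, clock=time.time) -> None:
        for line in channel:
            # Each line becomes one discrete, timestamped record.
            self.buffer.append(
                {"time": clock(), "source": source, "message": line.rstrip("\n")}
            )
            if len(self.buffer) >= self.batch_size:
                self.flush()
        self.flush()  # send whatever is left at end of input

    def flush(self) -> None:
        if self.buffer:
            # JSON is a stand-in; the wire format here is an assumption.
            self.send(json.dumps(self.buffer).encode("utf-8"))
            self.buffer = []


# Illustrative run: payloads are captured in a list instead of a network call.
sent = []
agent = ClientDataAgent(sent.append, batch_size=2)
agent.collect(["a\n", "b\n", "c\n"], source="syslog", clock=lambda: 0.0)
```

Batching keeps the number of transmissions low while the final `flush` ensures that a partial batch is not lost when the channel ends.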

24. A method comprising:
- at a multi-tenant data analysis platform;
  
  receiving discrete client data, the client data being associated with a user account of the multi-tenant data analysis platform through a unique identifier;
  
  storing the client data in a network accessible distributed storage system that includes a real-time storage system and a columnar-based distributed archive storage system, the storing of the client data comprising:
  
  storing the client data in the real-time storage system in a row format;
  
  merging the client data into the archive storage system in a columnar format, the client data merged into the archive data storage system being isolated according to the user account associated with the client data;
  
  identifying a merge status for the client data merged into the archive storage system, wherein the merge status indicates a redundancy of client data between the real-time storage system and the archive storage system;
  
  receiving a data query through a query interface; and
  
  processing the data query by selectively interfacing with the client data from the real-time storage system and archive storage system, wherein processing the data query comprises:
  
  converting the data query from a relational database-type query format to a first converted query format compatible with the real-time storage system and a second converted query format compatible with the archive system, wherein the converting the data query comprises converting, by using a query processing cluster, the data query to a MapReduce mapping process and a MapReduce reduction process,
  
  cooperatively querying the real-time storage system and the archive storage system by distributing, in parallel, the first converted data query over the real-time storage system and the second converted query over archive storage system,
  
  using the merge status and timestamps of the client data in the real-time storage system and the archive storage system to skip client data from either the real-time storage system or the archive storage system if the skipped data is accounted for in the other of the real-time storage system or the archive storage system, and
  
  retrieving a single cohesive query result based on results from both the real-time storage system and the archive storage system,
  
  wherein interfacing with the client data from the archive storage system comprises:
  
  executing the MapReduce mapping process and the MapReduce reduction process by using the query processing cluster, and
  
  wherein the query processing cluster includes a cluster that is constructed to execute MapReduce processes.
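The conversion of a relational-style query into a mapping process and a reduction process, together with the per-account isolation recited in claim 24, can be illustrated with a small single-process Python sketch. Everything named here is an assumption: `to_map_reduce` handles only a query of the form `SELECT <field>, COUNT(*) ... GROUP BY <field>`, `run` imitates a cluster in one process, and the account identifiers and records are invented.

```python
from collections import defaultdict


def to_map_reduce(group_by):
    """Hypothetical conversion of SELECT <group_by>, COUNT(*) ... GROUP BY <group_by>."""

    def mapper(record):
        # Emit one (key, 1) pair per record.
        yield record[group_by], 1

    def reducer(key, values):
        # Sum the counts emitted for one key.
        return key, sum(values)

    return mapper, reducer


def run(records, mapper, reducer):
    # Single-process stand-in for a query processing cluster:
    # map every record, group values by key, then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Illustrative archive, partitioned by a hypothetical account identifier
# so that a query only ever touches one tenant's data.
archive_by_account = {
    "acct-1": [{"status": 200}, {"status": 500}, {"status": 200}],
    "acct-2": [{"status": 404}],
}

mapper, reducer = to_map_reduce("status")
counts = run(archive_by_account["acct-1"], mapper, reducer)
```

Here the same `run` function could serve every account, which mirrors the idea of a shared query processing cluster operating over isolated per-account data.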

Current Assignee: Treasure Data, Inc. (SoftBank Group Corp.)
Original Assignee: Treasure Data, Inc. (SoftBank Group Corp.)
Inventors: Furuhashi, Sadayuki; Yoshikawa, Hironobu; Ota, Kazuki
Primary Examiner(s): Saeed, Usmaan
Assistant Examiner(s): Perez-Arroyo, Raquel
Application Number: US15/147,790
Publication Number: US 20160246824A1
Time in Patent Office: 299 Days
Field of Search: 707/661, 707/204, 707/609
US Class Current: 1/1
CPC Class Codes:
G06F 16/22   Indexing; Data structures t...
G06F 16/221   Column-oriented storage; Ma...
G06F 16/2365   Ensuring data consistency a...
G06F 16/2471   Distributed queries
G06F 16/258   Data format conversion from...
