Systems and methods for rapid data analysis

US 10,713,240 B2
Filed: 07/10/2017
Issued: 07/14/2020
Est. Priority Date: 03/10/2014
Status: Active Grant

4 Assignments

0 Petitions

Abstract

A method for rapid data analysis includes receiving and interpreting a first query operating on a first dataset partitioned into shards by a first field; collecting a first data sample from a first set of data shards; calculating a first result to the first query based on analysis of the first data sample; and partitioning a second dataset into shards by a second field based on the first result.
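
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the CRC32-based shard assignment, the count-based "first result", and the `rules` table mapping a first field to a second field are all assumptions made for the example. The abstract does not say how the first result informs the choice of partitioning, so the sketch simply looks the second field up in the rule table.

```python
import random
import zlib
from collections import Counter


def shard_index(value, num_shards):
    """Map a field value to a shard using a stable hash (CRC32 of its string form)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


def partition(events, field, num_shards):
    """Partition events into num_shards shards keyed on `field`."""
    shards = [[] for _ in range(num_shards)]
    for event in events:
        shards[shard_index(event[field], num_shards)].append(event)
    return shards


def sample_each_shard(shards, per_shard, rng):
    """Collect up to `per_shard` randomly chosen events from every shard."""
    sample = []
    for shard in shards:
        sample.extend(rng.sample(shard, min(per_shard, len(shard))))
    return sample


def analyze_and_repartition(events, first_field, rules, count_field,
                            num_shards=4, per_shard=25, seed=0):
    """Hypothetical end-to-end pass: sample, estimate, then repartition."""
    rng = random.Random(seed)

    # The first dataset is partitioned by the first field.
    first_dataset = partition(events, first_field, num_shards)

    # First query pass: sample every shard and estimate counts per value
    # of `count_field`, scaled from the sample up to the full dataset.
    sample = sample_each_shard(first_dataset, per_shard, rng)
    first_result = {}
    if sample:
        scale = len(events) / len(sample)
        counts = Counter(e[count_field] for e in sample)
        first_result = {value: round(n * scale) for value, n in counts.items()}

    # The second dataset holds the same data, partitioned by a different
    # field chosen by the (assumed) rule table.
    second_field = rules[first_field]
    second_dataset = partition(list(events), second_field, num_shards)
    return first_result, second_field, second_dataset
```

Because the sample is drawn from every shard, the first result is an estimate rather than an exact answer; the trade-off is that only a bounded number of rows per shard has to be read.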

92 Citations

20 Claims

1. A method for rapid data analysis comprising:
- receiving and interpreting a first query, wherein interpreting the first query comprises identifying a first set of data shards of a first dataset containing data relevant to the first query;
  
  wherein the first dataset is partitioned by a first field;
  
  for a first query pass of the first query, collecting a first data sample from the first set of data shards, wherein collecting the first data sample comprises collecting data from each of the first set of data shards;
  
  for the first query pass, calculating a first result to the first query based on analysis of the first data sample; and
  
  for a second query pass of the first query that uses the first result as input, partitioning a second dataset based on a second field, wherein the second data set contains data identical to the first dataset, wherein the second field is identified by a set of shard partitioning rules, based on the first field;
  
  wherein the second field is non-identical to the first field.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein the second dataset is the first dataset;
    - wherein partitioning the second dataset comprises re-partitioning the first dataset.
  - 3. The method of claim 1, wherein the second dataset is distinct from the first dataset.
  - 4. The method of claim 1, wherein partitioning the second dataset comprises automatically partitioning the second dataset to improve query performance for queries similar to the first query.
  - 5. The method of claim 4, wherein automatically partitioning the second dataset to improve query performance comprises automatically partitioning the second dataset only after queries similar to the first query have been identified as common queries.
  - 6. The method of claim 4, further comprising generating a data aggregate of the first dataset to improve query performance for the queries similar to the first query.
  - 7. The method of claim 1, wherein partitioning the second dataset comprises identifying the second field as containing data relevant to the first query.
  - 8. The method of claim 1, further comprising detecting that the first dataset is used less than a use threshold and, in response, removing the first dataset.
  - 9. The method of claim 1, further comprising:
    - analyzing the first result to identify a set of query-relevant data sources;
      
      identifying a second set of data shards from the set of query-relevant data sources;
      
      collecting a second data sample from the second set of data shards, wherein collecting the second data sample comprises collecting data from each of the second set of data shards; and
      
      calculating a final result to the first query based on analysis of the second data sample.
  - 10. The method of claim 9, wherein the second set of data shards contains data not contained in the first set of data shards.
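
The two-pass refinement recited in claims 9 and 10 might look something like the sketch below. It is only an illustration under stated assumptions: each event is assumed to carry a `source` key naming its data source, the query is assumed to be a filtered mean, and `two_pass_mean`, `collect_sample`, and the sample sizes are hypothetical names and values that do not come from the patent.

```python
import random
from statistics import mean


def collect_sample(shards, per_shard, rng):
    """Collect up to `per_shard` randomly chosen events from each shard."""
    sample = []
    for shard in shards:
        sample.extend(rng.sample(shard, min(per_shard, len(shard))))
    return sample


def two_pass_mean(sources, first_shards, predicate, value_field, rng,
                  first_n=5, second_n=50):
    """Estimate mean(value_field) over events matching `predicate` in two passes.

    `sources` maps a source name to its list of shards (assumed layout).
    """
    # Pass 1: a small sample from each shard in the first set.
    first_sample = collect_sample(first_shards, first_n, rng)
    first_matches = [e for e in first_sample if predicate(e)]
    first_result = (mean(e[value_field] for e in first_matches)
                    if first_matches else None)

    # Analyze the first result: which sources produced matching events?
    relevant = sorted({e["source"] for e in first_matches})

    # Pass 2: a larger sample from every shard of the relevant sources.
    second_shards = [shard for name in relevant for shard in sources[name]]
    second_sample = collect_sample(second_shards, second_n, rng)
    second_matches = [e for e in second_sample if predicate(e)]
    final_result = (mean(e[value_field] for e in second_matches)
                    if second_matches else first_result)
    return first_result, relevant, final_result
```

When the first set holds only one shard per source, the second set can include shards that the first pass never read, which corresponds to the situation described in claim 10.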

11. A system for rapid data analysis comprising:
- an event database, comprising first and second datasets;
  
  wherein the first and second datasets contain identical data;
  
  wherein the first dataset is partitioned by a first field;
  
  a string lookup database that stores information linking strings to integers that uniquely identify the strings;
  
  a string translator that converts strings in incoming data to integer identifiers using the string lookup database;
  
  a query engine that processes queries on the event database and returns at least a first query result for a first query pass of a first query, and a second query result for a second query pass of the first query, wherein the second query pass uses the first query result as input; and
  
  a data manager that, based on a second field, partitions the second dataset;
  
  wherein the second field is identified by a set of shard partitioning rules, based on the first field;
  
  wherein the second field is non-identical to the first field.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The system of claim 11, wherein the data manager also repartitions the first data set.
  - 13. The system of claim 12, wherein the data manager repartitions the first data set by the first field.
  - 14. The system of claim 11, wherein the data manager automatically partitions the second dataset to improve query performance for future queries similar to past queries.
  - 15. The system of claim 11, wherein the data manager automatically partitions the second dataset to improve query performance for queries identified as common queries.
  - 16. The system of claim 11, wherein the data manager further generates a data aggregate of the first dataset to improve query performance for future queries similar to past queries.
  - 17. The system of claim 11, wherein the data manager identifies the second field as containing data relevant to the first query prior to partitioning the second dataset by the second field.
  - 18. The system of claim 11, wherein the data manager identifies and removes datasets used less than a use threshold.
  - 19. The system of claim 11, wherein the query engine processes an incoming query by:
    - identifying a first set of data shards of the first dataset containing data relevant to the incoming query;
      
      collecting a first data sample from the first set of data shards;
      
      calculating a first result to the incoming query based on analysis of the first data sample;
      
      analyzing the first result to identify a set of query-relevant data sources;
      
      identifying a second set of data shards from the set of query-relevant data sources;
      
      collecting a second data sample from the second set of data shards, wherein collecting the second data sample comprises collecting data from each of the second set of data shards; and
      
      calculating a final result to the incoming query based on analysis of the second data sample.
  - 20. The system of claim 19, wherein the second set of data shards contains data not contained in the first set of data shards.
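
The string lookup database and string translator of claim 11 resemble a dictionary-encoding scheme. The sketch below shows one way such a pair could behave. The class name `StringLookup`, the in-memory dictionary storage, and the sequential integer IDs are assumptions for illustration; the claim only requires that each string map to an integer that uniquely identifies it.

```python
class StringLookup:
    """In-memory stand-in for a string lookup database (assumed design)."""

    def __init__(self):
        self._ids = {}       # string -> integer identifier
        self._strings = []   # integer identifier -> string

    def id_for(self, text):
        """Return the identifier for `text`, assigning a new one if unseen."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def string_for(self, identifier):
        """Return the string that `identifier` stands for."""
        return self._strings[identifier]


def translate(event, lookup):
    """Replace every string value in an incoming event with its integer ID."""
    return {
        key: lookup.id_for(value) if isinstance(value, str) else value
        for key, value in event.items()
    }
```

Storing integers in place of repeated strings keeps event rows at a fixed, compact width, which tends to make scans and comparisons cheaper.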

Current Assignee: Scuba Analytics, Inc.
Original Assignee: Interana, Inc.
Inventors: Johnson, Robert; Abraham, Lior; Johnson, Ann; Dimitrov, Boris; Fossgreen, Don
Primary Examiner(s): Spieler, William
Application Number: US15/645,698
Publication Number: US 20170308570A1
Time in Patent Office: 1,100 Days
CPC Class Codes

G06F 16/2425   Iterative querying; Query f...

G06F 16/24545   Selectivity estimation or d...

G06F 16/24554   Unary operations; Data part...

G06F 16/2462   Approximate or statistical ...

G06F 16/2471   Distributed queries

G06F 16/278   Data partitioning, e.g. hor...
