Systems and methods for facilitating analytics on remotely stored data sets
First Claim
1. A computer-implemented method for facilitating analytics on remotely stored data sets, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
detecting a replication job that makes, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system;
in response to detecting the replication job, identifying, within the secondary storage system, the secondary copy of the data set duplicated from the primary copy of the data set stored in the primary storage system;
generating a set of virtual objects that represent at least a portion of the secondary copy of the data set identified within the secondary storage system;
providing an exposure module that resides in an Input/Output (I/O) path of a remote analytics engine, wherein the remote analytics engine comprises a computer cluster that implements a distributed file system that stores data across various nodes;
interfacing, via the exposure module, the distributed file system with the secondary storage system to expose the set of virtual objects to the remote analytics engine via a network by way of a file system plug-in that:
interfaces with the remote analytics engine such that the portion of the secondary copy of the data set appears to the remote analytics engine to be stored locally on the remote analytics engine;
extends a native functionality of the distributed file system implemented on the remote analytics engine;
enabling the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set via the network by way of:
the set of virtual objects exposed to the remote analytics engine via the network;
the file system plug-in that interfaces with the file system of the remote analytics engine;
throttling, by the file system plug-in that interfaces with the remote analytics engine, one or more data transfers in connection with the analytics job by:
determining a computing load on the remote analytics engine;
controlling, based at least in part on the computing load on the remote analytics engine, a block size of data accessible by the remote analytics engine in connection with the analytics job;
controlling, based at least in part on the computing load on the remote analytics engine, a number of map-reduce processes performed by the remote analytics engine in connection with the analytics job;
identifying another secondary copy of the data set within another secondary storage system;
receiving, by the file system plug-in, at least one request to perform an I/O operation on the portion of the secondary copy of the data set in connection with the analytics job;
optimizing at least a portion of the analytics job by:
providing parallel access to the secondary copy of the data set and the other secondary copy of the data set in connection with the analytics job;
causing the secondary storage system to prefetch some of the portion of the secondary copy of the data set by:
generating, based at least in part on the request to perform the I/O operation, a notification of an anticipated future I/O operation likely to be performed on the portion of the secondary copy of the data set in connection with the analytics job;
forwarding the notification of the anticipated future I/O operation from the file system plug-in to the secondary storage system to cause the secondary storage system to:
prefetch the some of the portion of the secondary copy of the data set;
perform at least one read-ahead using at least one read-ahead buffer to accelerate the analytics job.
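The throttling element of the claim (determine a computing load, then control a block size and a number of map-reduce processes based on it) can be illustrated with a minimal sketch. The claim does not say how load maps to either value, so the linear scaling below, the function name `throttle`, the `ThrottleSettings` type, and every default limit are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrottleSettings:
    block_size_bytes: int   # size of each block handed to the analytics engine
    map_reduce_tasks: int   # number of map-reduce processes allowed to run


def throttle(load: float,
             min_block: int = 1 << 20,     # 1 MiB floor (hypothetical)
             max_block: int = 128 << 20,   # 128 MiB ceiling (hypothetical)
             max_tasks: int = 16) -> ThrottleSettings:
    """Scale block size and task count down linearly as load rises.

    `load` is a utilization figure in [0.0, 1.0]; out-of-range values are clamped.
    The linear policy is an assumption, not something the claim specifies.
    """
    load = min(max(load, 0.0), 1.0)
    headroom = 1.0 - load
    block = min_block + int((max_block - min_block) * headroom)
    tasks = max(1, round(max_tasks * headroom))  # always allow at least one task
    return ThrottleSettings(block, tasks)
```

Under these assumptions an idle engine receives the largest blocks and the full task count, while a saturated engine is cut back to the minimum block size and a single task.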
Abstract
The disclosed computer-implemented method for facilitating analytics on remotely stored data sets may include (1) identifying, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system, (2) generating a set of virtual objects that represent at least a portion of the secondary copy of the data set, (3) exposing the set of virtual objects to a remote analytics engine via a network such that the portion of the secondary copy of the data set appears to be stored locally on the remote analytics engine, and then (4) enabling the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set by way of the set of virtual objects via the network. Various other methods, systems, and computer-readable media are also disclosed.
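Steps (2) and (3) of the abstract can be pictured with a small sketch: virtual objects hold only metadata about byte ranges of the secondary copy, and an exposure layer resolves reads on those objects so the caller never addresses the secondary storage system directly. Everything named here (`VirtualObject`, `generate_virtual_objects`, `ExposureModule`, the `/virtual/part-N` paths) is hypothetical, and an in-memory `bytes` value stands in for the remote secondary copy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VirtualObject:
    """Metadata-only stand-in for a byte range of the secondary copy."""
    path: str     # name the analytics engine sees (hypothetical naming scheme)
    offset: int   # start of the range within the secondary copy
    length: int   # number of bytes the object covers


def generate_virtual_objects(copy_size: int, chunk: int) -> List[VirtualObject]:
    """Split a secondary copy of `copy_size` bytes into chunk-sized objects."""
    return [
        VirtualObject(f"/virtual/part-{i}", start, min(chunk, copy_size - start))
        for i, start in enumerate(range(0, copy_size, chunk))
    ]


class ExposureModule:
    """Resolves reads on virtual paths against the secondary copy."""

    def __init__(self, secondary_copy: bytes, objects: List[VirtualObject]):
        self._copy = secondary_copy  # stands in for a fetch over the network
        self._by_path: Dict[str, VirtualObject] = {o.path: o for o in objects}

    def listdir(self) -> List[str]:
        return sorted(self._by_path)

    def read(self, path: str) -> bytes:
        obj = self._by_path[path]
        return self._copy[obj.offset:obj.offset + obj.length]
```

A caller lists and reads the virtual paths as if they were local files; no data is copied until a read is issued.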
15 Claims
1. A computer-implemented method for facilitating analytics on remotely stored data sets, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
detecting a replication job that makes, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system;
in response to detecting the replication job, identifying, within the secondary storage system, the secondary copy of the data set duplicated from the primary copy of the data set stored in the primary storage system;
generating a set of virtual objects that represent at least a portion of the secondary copy of the data set identified within the secondary storage system;
providing an exposure module that resides in an Input/Output (I/O) path of a remote analytics engine, wherein the remote analytics engine comprises a computer cluster that implements a distributed file system that stores data across various nodes;
interfacing, via the exposure module, the distributed file system with the secondary storage system to expose the set of virtual objects to the remote analytics engine via a network by way of a file system plug-in that:
interfaces with the remote analytics engine such that the portion of the secondary copy of the data set appears to the remote analytics engine to be stored locally on the remote analytics engine;
extends a native functionality of the distributed file system implemented on the remote analytics engine;
enabling the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set via the network by way of:
the set of virtual objects exposed to the remote analytics engine via the network;
the file system plug-in that interfaces with the file system of the remote analytics engine;
throttling, by the file system plug-in that interfaces with the remote analytics engine, one or more data transfers in connection with the analytics job by:
determining a computing load on the remote analytics engine;
controlling, based at least in part on the computing load on the remote analytics engine, a block size of data accessible by the remote analytics engine in connection with the analytics job;
controlling, based at least in part on the computing load on the remote analytics engine, a number of map-reduce processes performed by the remote analytics engine in connection with the analytics job;
identifying another secondary copy of the data set within another secondary storage system;
receiving, by the file system plug-in, at least one request to perform an I/O operation on the portion of the secondary copy of the data set in connection with the analytics job;
optimizing at least a portion of the analytics job by:
providing parallel access to the secondary copy of the data set and the other secondary copy of the data set in connection with the analytics job;
causing the secondary storage system to prefetch some of the portion of the secondary copy of the data set by:
generating, based at least in part on the request to perform the I/O operation, a notification of an anticipated future I/O operation likely to be performed on the portion of the secondary copy of the data set in connection with the analytics job;
forwarding the notification of the anticipated future I/O operation from the file system plug-in to the secondary storage system to cause the secondary storage system to:
prefetch the some of the portion of the secondary copy of the data set;
perform at least one read-ahead using at least one read-ahead buffer to accelerate the analytics job.
Dependent Claims: 2, 3, 4, 5, 6, 7
8. A system for facilitating analytics on remotely stored data sets, the system comprising:
an identification module, stored in memory, that:
detects a replication job that makes, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system;
in response to detecting the replication job, identifies, within the secondary storage system, the secondary copy of the data set duplicated from the primary copy of the data set stored in the primary storage system;
a generation module, stored in memory, that generates a set of virtual objects that represent at least a portion of the secondary copy of the data set within the secondary storage system;
an exposure module, stored in memory, that:
resides in an Input/Output (I/O) path of a remote analytics engine, wherein the remote analytics engine comprises a computer cluster that implements a distributed file system that stores data across various nodes; and
interfaces the distributed file system with the secondary storage system to expose the set of virtual objects to the remote analytics engine via a network by way of a file system plug-in that:
interfaces with the remote analytics engine such that the portion of the secondary copy of the data set appears to the remote analytics engine to be stored locally on the remote analytics engine;
extends a native functionality of the distributed file system implemented on the remote analytics engine;
enables the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set via the network by way of:
the set of virtual objects exposed to the remote analytics engine via the network;
the file system plug-in that interfaces with the remote analytics engine;
wherein:
the file system plug-in throttles one or more data transfers in connection with the analytics job by:
determining a computing load on the remote analytics engine;
controlling, based at least in part on the computing load on the remote analytics engine, a block size of data accessible by the remote analytics engine in connection with the analytics job;
controlling, based at least in part on the computing load on the remote analytics engine, a number of map-reduce processes performed by the remote analytics engine in connection with the analytics job;
the identification module identifies another secondary copy of the data set within another secondary storage system;
the exposure module optimizes at least a portion of the analytics job by:
providing parallel access to the secondary copy of the data set and the other secondary copy of the data set in connection with the analytics job;
causing the secondary storage system to prefetch some of the portion of the secondary copy of the data set by performing at least one read-ahead using at least one read-ahead buffer to accelerate the analytics job;
a receiving module, stored in memory, that receives at least one request to perform an I/O operation on the portion of the secondary copy of the data set in connection with the analytics job;
wherein the generation module generates, based at least in part on the request to perform the I/O operation, a notification of an anticipated future I/O operation likely to be performed on the portion of the secondary copy of the data set in connection with the analytics job;
a providing module, stored in memory, that forwards the notification of the anticipated future I/O operation from the file system plug-in to the secondary storage system to cause the secondary storage system to prefetch the some of the portion of the secondary copy of the data set;
at least one physical processor that executes the identification module, the generation module, and the exposure module.
Dependent Claims: 9, 10, 11, 12, 13, 14
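The receiving, generation, and providing modules of claim 8 describe a request-driven prefetch path: an I/O request arrives, a notification of a likely future I/O operation is derived from it, and that notification is forwarded so the secondary storage system can fill a read-ahead buffer. The sketch below is one possible reading, assuming a purely sequential access pattern; the class names, the `PrefetchNotification` type, and the "next range follows the current one" predictor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PrefetchNotification:
    offset: int   # start of the range expected to be read next
    length: int   # size of that range in bytes


class SecondaryStorage:
    """Toy secondary storage system with a read-ahead buffer."""

    def __init__(self, data: bytes):
        self._data = data
        self._read_ahead: Dict[Tuple[int, int], bytes] = {}
        self.buffer_hits = 0

    def prefetch(self, note: PrefetchNotification) -> None:
        # Stage the anticipated range in the read-ahead buffer.
        end = note.offset + note.length
        self._read_ahead[(note.offset, note.length)] = self._data[note.offset:end]

    def read(self, offset: int, length: int) -> bytes:
        cached = self._read_ahead.pop((offset, length), None)
        if cached is not None:
            self.buffer_hits += 1   # served from the read-ahead buffer
            return cached
        return self._data[offset:offset + length]


class FileSystemPlugin:
    """Receives I/O requests and forwards prefetch notifications."""

    def __init__(self, storage: SecondaryStorage):
        self._storage = storage

    def read(self, offset: int, length: int) -> bytes:
        data = self._storage.read(offset, length)
        # Assumption: the job scans sequentially, so the next range follows this one.
        self._storage.prefetch(PrefetchNotification(offset + length, length))
        return data
```

With this predictor, a sequential scan hits the buffer on every read after the first, while a jump to an unrelated offset falls back to an ordinary read.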
15. A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
detect a replication job that makes, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system;
in response to detecting the replication job, identify, within the secondary storage system, the secondary copy of the data set duplicated from the primary copy of the data set stored in the primary storage system;
generate a set of virtual objects that represent at least a portion of the secondary copy of the data set within the secondary storage system;
provide an exposure module that resides in an Input/Output (I/O) path of a remote analytics engine, wherein the remote analytics engine comprises a computer cluster that implements a distributed file system that stores data across various nodes;
interface, via the exposure module, the distributed file system with the secondary storage system to expose the set of virtual objects to the remote analytics engine via a network by way of a file system plug-in that:
interfaces with the remote analytics engine such that the portion of the secondary copy of the data set appears to the remote analytics engine to be stored locally on the remote analytics engine;
extends a native functionality of the distributed file system implemented on the remote analytics engine;
enable the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set via the network by way of:
the set of virtual objects exposed to the remote analytics engine via the network;
the file system plug-in that interfaces with the file system of the remote analytics engine;
throttle, by the file system plug-in that interfaces with the remote analytics engine, one or more data transfers in connection with the analytics job by:
determining a computing load on the remote analytics engine;
controlling, based at least in part on the computing load on the remote analytics engine, a block size of data accessible by the remote analytics engine in connection with the analytics job;
controlling, based at least in part on the computing load on the remote analytics engine, a number of map-reduce processes performed by the remote analytics engine in connection with the analytics job;
identify another secondary copy of the data set within another secondary storage system;
receive, by the file system plug-in, at least one request to perform an I/O operation on the portion of the secondary copy of the data set in connection with the analytics job;
optimize at least a portion of the analytics job by:
providing parallel access to the secondary copy of the data set and the other secondary copy of the data set in connection with the analytics job;
causing the secondary storage system to prefetch some of the portion of the secondary copy of the data set by:
generating, based at least in part on the request to perform the I/O operation, a notification of an anticipated future I/O operation likely to be performed on the portion of the secondary copy of the data set in connection with the analytics job;
forwarding the notification of the anticipated future I/O operation from the file system plug-in to the secondary storage system to cause the secondary storage system to:
prefetch the some of the portion of the secondary copy of the data set;
perform at least one read-ahead using at least one read-ahead buffer to accelerate the analytics job.
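The "parallel access" element, in which the analytics job draws on the secondary copy and another secondary copy at the same time, might look like the following sketch. The claims do not specify how work is divided between copies; the round-robin assignment of byte ranges, the function name `parallel_read`, and the use of in-memory `bytes` values as stand-ins for the two secondary storage systems are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def parallel_read(replicas: Sequence[bytes], total: int, chunk: int) -> bytes:
    """Fetch `total` bytes in `chunk`-sized ranges, alternating between replicas.

    Each element of `replicas` stands in for one secondary copy of the same data set.
    """
    ranges: List[Tuple[int, int]] = [
        (start, min(start + chunk, total)) for start in range(0, total, chunk)
    ]

    def fetch(indexed_range: Tuple[int, Tuple[int, int]]) -> bytes:
        i, (start, end) = indexed_range
        replica = replicas[i % len(replicas)]   # round-robin assignment (assumed policy)
        return replica[start:end]

    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # Executor.map returns results in submission order, so the parts
        # can be concatenated directly.
        parts = list(pool.map(fetch, enumerate(ranges)))
    return b"".join(parts)
```

Because both copies hold the same data, any range can be served by either one; spreading ranges across them lets the reads overlap instead of queuing behind a single storage system.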
Specification