Systems and methods for facilitating analytics on remotely stored data sets

  • US 10,528,602 B1
  • Filed: 12/26/2014
  • Issued: 01/07/2020
  • Est. Priority Date: 12/26/2014
  • Status: Active Grant
First Claim

1. A computer-implemented method for facilitating analytics on remotely stored data sets, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:

  • detecting a replication job that makes, within a secondary storage system, a secondary copy of a data set duplicated from a primary copy of the data set stored in a primary storage system;

    in response to detecting the replication job, identifying, within the secondary storage system, the secondary copy of the data set duplicated from the primary copy of the data set stored in the primary storage system;

    generating a set of virtual objects that represent at least a portion of the secondary copy of the data set identified within the secondary storage system;

    providing an exposure module that resides in an Input/Output (I/O) path of a remote analytics engine, wherein the remote analytics engine comprises a computer cluster that implements a distributed file system that stores data across various nodes;

    interfacing, via the exposure module, the distributed file system with the secondary storage system to expose the set of virtual objects to the remote analytics engine via a network by way of a file system plug-in that:

    interfaces with the remote analytics engine such that the portion of the secondary copy of the data set appears to the remote analytics engine to be stored locally on the remote analytics engine;

    extends a native functionality of the distributed file system implemented on the remote analytics engine;

    enabling the remote analytics engine to perform at least one analytics job on the portion of the secondary copy of the data set via the network by way of:

    the set of virtual objects exposed to the remote analytics engine via the network;

    the file system plug-in that interfaces with the file system of the remote analytics engine;

    throttling, by the file system plug-in that interfaces with the remote analytics engine, one or more data transfers in connection with the analytics job by:

    determining a computing load on the remote analytics engine;

    controlling, based at least in part on the computing load on the remote analytics engine, a block size of data accessible by the remote analytics engine in connection with the analytics job;

    controlling, based at least in part on the computing load on the remote analytics engine, a number of map-reduce processes performed by the remote analytics engine in connection with the analytics job;

    identifying another secondary copy of the data set within another secondary storage system;

    receiving, by the file system plug-in, at least one request to perform an I/O operation on the portion of the secondary copy of the data set in connection with the analytics job;

    optimizing at least a portion of the analytics job by:

    providing parallel access to the secondary copy of the data set and the other secondary copy of the data set in connection with the analytics job;

    causing the secondary storage system to prefetch some of the portion of the secondary copy of the data set by:

    generating, based at least in part on the request to perform the I/O operation, a notification of an anticipated future I/O operation likely to be performed on the portion of the secondary copy of the data set in connection with the analytics job;

    forwarding the notification of the anticipated future I/O operation from the file system plug-in to the secondary storage system to cause the secondary storage system to:

    prefetch the some of the portion of the secondary copy of the data set;

    perform at least one read-ahead using at least one read-ahead buffer to accelerate the analytics job.
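
The throttling and prefetch-notification steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. The claim says only that block size and the number of map-reduce processes are controlled "based at least in part on" the computing load, so the linear scaling, the direction of that scaling, the size limits, the sequential-access prediction, and every name below (plan_throttle, anticipate_next_read, ThrottlePlan, ReadRequest) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical limits; the claim does not specify any values.
MIN_BLOCK = 4 * 1024 * 1024      # 4 MiB floor for the exposed block size
MAX_BLOCK = 128 * 1024 * 1024    # 128 MiB ceiling for the exposed block size
MAX_TASKS = 16                   # cap on concurrent map-reduce processes


@dataclass(frozen=True)
class ThrottlePlan:
    block_size: int   # bytes per block made accessible to the analytics engine
    max_tasks: int    # map-reduce processes allowed for the analytics job


def plan_throttle(load: float) -> ThrottlePlan:
    """Shrink block size and task count linearly as load rises from 0.0 to 1.0.

    Assumption: higher load on the analytics engine means less data per
    block and fewer parallel processes. The claim does not fix a direction.
    """
    load = min(max(load, 0.0), 1.0)          # clamp out-of-range readings
    headroom = 1.0 - load
    block_size = MIN_BLOCK + int((MAX_BLOCK - MIN_BLOCK) * headroom)
    max_tasks = max(1, round(MAX_TASKS * headroom))  # always allow one task
    return ThrottlePlan(block_size, max_tasks)


@dataclass(frozen=True)
class ReadRequest:
    path: str     # virtual object being read
    offset: int   # starting byte
    length: int   # number of bytes requested


def anticipate_next_read(req: ReadRequest) -> ReadRequest:
    """Build the 'anticipated future I/O' notification from the current request.

    Assumption: access is sequential, so the next read starts where this one
    ends. A plug-in would forward this hint to the secondary storage system,
    which could then prefetch that range into a read-ahead buffer.
    """
    return ReadRequest(req.path, req.offset + req.length, req.length)
```

Under these assumptions, an idle engine (load 0.0) gets the largest block size and the full task count, while a saturated engine (load 1.0) is reduced to the smallest block size and a single task.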
