Identification of virtual machines using a distributed job scheduler
Abstract
Methods and systems for managing, storing, and serving data within a virtualized environment are described. In some embodiments, a data management system may manage the extraction and storage of virtual machine snapshots, provide near instantaneous restoration of a virtual machine or one or more files located on the virtual machine, and enable secondary workloads to directly use the data management system as a primary storage target to read or modify past versions of data. The data management system may allow a virtual machine snapshot of a virtual machine stored within the system to be directly mounted to enable substantially instantaneous virtual machine recovery of the virtual machine.
56 Citations
20 Claims
1. A method for operating a data management system, comprising:
storing a first set of snapshots of a first virtual machine as a first set of files using a distributed file system, the distributed file system replicates the first set of files among a plurality of nodes within a cluster, the first set of snapshots includes a first base image for the first virtual machine;
storing a second set of snapshots of a second virtual machine different from the first virtual machine as a second set of files using the distributed file system, the distributed file system replicates the second set of files among the plurality of nodes within the cluster, the second set of snapshots includes a second base image for the second virtual machine;
determining a first job associated with the first virtual machine to be performed using a distributed job scheduler, the distributed job scheduler comprises a plurality of job scheduling processes running on the plurality of nodes, each node of the plurality of nodes runs one of the plurality of job scheduling processes;
determining that a first node of the plurality of nodes stores the first set of files; and
running the first job on the first node in response to determining that the first node stores the first set of files, the first job comprising:
generating a plurality of hash values corresponding with a plurality of data blocks within the first base image for the first virtual machine, the plurality of data blocks is arranged such that data blocks within a first portion of the first base image are spaced at a fixed distance from each other and other data blocks within a second portion of the first base image are spaced at monotonically increasing distances from each other, the first portion of the first base image does not overlap with the second portion of the first base image;
comparing the plurality of hash values with another plurality of hash values corresponding with a plurality of other data blocks within the second base image for the second virtual machine different from the first virtual machine;
identifying the second base image for the second virtual machine as a candidate base image from which a dependent base file for the first virtual machine is generated;
generating the dependent base file using the first base image for the first virtual machine and the second base image for the second virtual machine; and
storing the dependent base file for the first virtual machine using the distributed file system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
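The block-sampling and hash-comparison steps recited in claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the function names, the use of SHA-256, the doubling rule that makes the gaps grow, and the 0.8 match threshold used to call an image a candidate.

```python
import hashlib

def sample_offsets(image_size, block_size, split, fixed_stride, first_gap, growth=2):
    """Return byte offsets of the blocks to hash (hypothetical helper).

    Offsets below `split` are a fixed distance apart; offsets from `split`
    onward are separated by gaps that grow on every step.
    """
    offsets = []
    # First portion: blocks spaced at a fixed distance from each other.
    pos = 0
    while pos + block_size <= min(split, image_size):
        offsets.append(pos)
        pos += fixed_stride
    # Second portion (does not overlap the first): gaps increase each step.
    pos, gap = split, first_gap
    while pos + block_size <= image_size:
        offsets.append(pos)
        pos += gap
        gap *= growth  # assumed growth rule; any increasing sequence would do
    return offsets

def block_hashes(image, offsets, block_size):
    """Hash the sampled blocks of one base image (SHA-256 is an assumption)."""
    return [hashlib.sha256(image[o:o + block_size]).hexdigest() for o in offsets]

def match_fraction(hashes_a, hashes_b):
    """Fraction of sampled positions whose hashes agree."""
    if not hashes_a:
        return 0.0
    same = sum(1 for a, b in zip(hashes_a, hashes_b) if a == b)
    return same / len(hashes_a)

# Two toy 64-byte "base images" that differ in one sampled block.
image_a = bytes(range(64))
image_b = bytearray(image_a)
image_b[8] ^= 0xFF  # corrupt one byte inside the block at offset 8

offsets = sample_offsets(image_size=64, block_size=4, split=32,
                         fixed_stride=8, first_gap=4)
score = match_fraction(block_hashes(image_a, offsets, 4),
                       block_hashes(bytes(image_b), offsets, 4))
is_candidate = score >= 0.8  # hypothetical threshold
```

Sampling only a subset of blocks keeps the comparison cheap relative to hashing the whole image; the trade-off is that differences in unsampled regions go unnoticed at this stage.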
12. A data management system, comprising:
a distributed file system configured to store a first set of snapshots of a first virtual machine as a first set of files, the distributed file system configured to replicate the first set of files among a plurality of nodes within a cluster, the first set of snapshots includes a first base image for the first virtual machine, the distributed file system configured to store a second set of snapshots of a second virtual machine different from the first virtual machine as a second set of files, the distributed file system configured to replicate the second set of files among the plurality of nodes within the cluster, the second set of snapshots includes a second base image for the second virtual machine; and
a distributed job scheduler configured to determine a first job associated with the first virtual machine to be performed, the distributed job scheduler comprises a plurality of job scheduling processes running on the plurality of nodes, each node of the plurality of nodes runs one of the plurality of job scheduling processes, the distributed job scheduler configured to determine that a first node of the plurality of nodes stores the first set of files and configured to run the first job on the first node in response to the determination that the first node stores the first set of files, the first job configured to generate a plurality of hash values corresponding with a plurality of data blocks within the first base image for the first virtual machine, the plurality of data blocks is arranged such that data blocks within a first portion of the first base image are spaced at a fixed distance from each other and other data blocks within a second portion of the first base image are spaced at monotonically increasing distances from each other, the first portion of the first base image does not overlap with the second portion of the first base image, the first job configured to compare the plurality of hash values with another plurality of hash values corresponding with a plurality of other data blocks within the second base image for the second virtual machine different from the virtual machine and configured to identify the second base image for the second virtual machine as a candidate base image from which a dependent base file for the first virtual machine is generated, the first job configured to generate the dependent base file using the first base image for the first virtual machine and the second base image for the second virtual machine, the dependent base file comprises data differences between the first base image for the first virtual machine and the second base image for the second virtual machine.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
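Claim 12 describes the dependent base file as comprising data differences between the two base images. One way to picture that is a block-level delta, sketched below. The dictionary layout, the function names, and the zero-padding rule for a shorter candidate image are assumptions for illustration only, not the format described in the patent.

```python
def make_dependent_base(first_image, candidate_image, block_size):
    """Record only the blocks of first_image that differ from the candidate.

    Returns a hypothetical in-memory structure: the original length, the
    block size, and a map from block index to the differing block bytes.
    """
    diffs = {}
    for start in range(0, len(first_image), block_size):
        block = first_image[start:start + block_size]
        if block != candidate_image[start:start + block_size]:
            diffs[start // block_size] = block
    return {"length": len(first_image), "block_size": block_size, "blocks": diffs}

def rebuild(dependent, candidate_image):
    """Reconstruct the first image from the candidate plus the stored differences."""
    size, bs = dependent["length"], dependent["block_size"]
    # Start from the candidate, truncated or zero-padded to the target length.
    out = bytearray(candidate_image[:size].ljust(size, b"\0"))
    for index, block in dependent["blocks"].items():
        out[index * bs:index * bs + len(block)] = block
    return bytes(out)

first = b"AAAABBBBCCCCDD"
cand = b"AAAAXXXXCCCC"
dep = make_dependent_base(first, cand, block_size=4)
restored = rebuild(dep, cand)  # equals `first`
```

In this toy case only two of the four blocks are stored, which is the space saving that motivates picking a similar candidate base image in the first place.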
20. One or more storage devices containing processor readable code for programming one or more processors to perform a method for operating a data management system, the processor readable code comprising:
processor readable code configured to store a first set of snapshots of a first virtual machine as a first set of files using a distributed file system, the distributed file system replicates the first set of files among a plurality of nodes within a cluster, the first set of snapshots includes a first base image for the first virtual machine;
processor readable code configured to store a second set of snapshots of a second virtual machine different from the first virtual machine as a second set of files using the distributed file system, the distributed file system replicates the second set of files among the plurality of nodes within the cluster, the second set of snapshots includes a second base image for the second virtual machine;
processor readable code configured to determine a first job associated with the first virtual machine to be performed using a distributed job scheduler, the distributed job scheduler comprises a plurality of job scheduling processes running on the plurality of nodes, each node of the plurality of nodes runs one of the plurality of job scheduling processes;
processor readable code configured to determine that a first node of the plurality of nodes stores the first set of files; and
processor readable code configured to run the first job on the first node in response to determining that the first node stores the first set of files, the first job generates a plurality of hash values corresponding with a plurality of data blocks within the first base image for the first virtual machine and compares the plurality of hash values with another plurality of hash values corresponding with a plurality of other data blocks within the second base image for the second virtual machine different from the first virtual machine, the plurality of data blocks is arranged such that data blocks within a first portion of the first base image are spaced at a fixed distance from each other and other data blocks within a second portion of the first base image are spaced at monotonically increasing distances from each other, the first portion of the first base image does not overlap with the second portion of the first base image, the first job identifies the second base image for the second virtual machine as a candidate base image from which a dependent base file for the first virtual machine is generated and generates the dependent base file using the first base image for the first virtual machine and the second base image for the second virtual machine, the dependent base file comprises data differences between the first base image for the first virtual machine and the second base image for the second virtual machine.
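Each independent claim recites determining that a node stores the first set of files and then running the job on that node. A minimal sketch of that placement decision follows. The replica map, the node and file names, and the tie-breaking rule (first match in sorted node order) are hypothetical; the claims do not say how a node is chosen when several hold the files.

```python
from typing import Dict, Optional, Set

def pick_node(job_files: Set[str], replicas: Dict[str, Set[str]]) -> Optional[str]:
    """Return a node that stores every file the job needs, or None.

    `replicas` maps a node name to the set of files it holds locally.
    Sorted iteration is an assumed tie-break to keep the choice deterministic.
    """
    for node in sorted(replicas):
        if job_files <= replicas[node]:  # subset test: all files are local
            return node
    return None

# Hypothetical cluster state: which snapshot files each node holds.
replicas = {
    "node-a": {"vm2.base"},
    "node-b": {"vm1.base", "vm1.snap1"},
    "node-c": {"vm1.base", "vm1.snap1", "vm2.base"},
}
chosen = pick_node({"vm1.base", "vm1.snap1"}, replicas)
missing = pick_node({"vm3.base"}, replicas)
```

Running the job where the files already reside avoids copying a base image across the cluster before hashing it.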
Specification