AUTOMATED DATA MANAGEMENT VIA MACHINE-READABLE DATA DEFINITION FILES
1. A method, comprising:
maintaining a set of machine-readable data definition files corresponding to a set of data assets, wherein at least one machine-readable data definition file specifies code for implementing at least one goal state associated with at least one corresponding data asset; and
executing the at least one machine-readable data definition file to effectuate the at least one goal state;
wherein the maintaining and executing steps are implemented via at least one processing device comprising a processor and a memory.
Techniques are disclosed for automated data management. In one example, a method maintains a set of machine-readable data definition files corresponding to a set of data assets. At least one machine-readable data definition file specifies code for implementing at least one goal state associated with at least one corresponding data asset. The at least one machine-readable data definition file is executed to effectuate the at least one goal state.
- 14. An article of manufacture comprising a processor-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by one or more processing devices implement the steps of:
executing the at least one machine-readable data definition file to effectuate the at least one goal state.
- 15. A system comprising:
one or more processors operatively coupled to one or more memories configured to: maintain a set of machine-readable data definition files corresponding to a set of data assets, wherein at least one machine-readable data definition file specifies code for implementing at least one goal state associated with at least one corresponding data asset; and execute the at least one machine-readable data definition file to effectuate the at least one goal state.
The field relates generally to automated data management and, more particularly, to automated data management via machine-readable data definition files.
Utility computing and second-generation web frameworks brought about widespread problems for humans managing ever-expanding information technology (IT) infrastructure. The development of infrastructure-as-code (IaC) tools and processes allowed fewer humans to manage a massive collection of infrastructure. IaC is the process of managing and provisioning IT infrastructure associated with, for example, data centers through machine-readable definitions (scripts or declarative definitions), rather than through physical hardware configuration tools. Machine-readable definitions are data (or metadata) in a format that can be easily processed by a computer. Some computer languages for creating machine-readable data have features to improve human readability of the machine-readable data.
The IT infrastructure managed and provisioned by IaC tools and processes may comprise physical processing devices such as bare-metal servers and/or logical processing devices such as virtual machines, as well as associated configuration resources. Thus, the expanding scale of infrastructure has become manageable through IaC by modeling the infrastructure with code and then executing that code.
However, such larger automated data centers attract massive amounts of data sets that do not have the same level of automated management as the infrastructure that the data sets reside on. It has been estimated that in less than ten years (e.g., about 2025), the projected size of the so-called datasphere will be about 163 ZB (zettabytes), and the creation of data will shift from consumer-driven to enterprise-driven. Enterprise IT departments must therefore transition from the management of petabytes of storage to zettabytes of storage. The lack of automation for such massive amounts of data will result in significant challenges for enterprises and other entities.
Embodiments of the invention provide techniques for automated data management.
For example, in one embodiment, a method comprises the following steps. The method maintains a set of machine-readable data definition files corresponding to a set of data assets. At least one machine-readable data definition file specifies code for implementing at least one goal state associated with at least one corresponding data asset. The at least one machine-readable data definition file is executed to effectuate the at least one goal state.
Non-limiting examples of goal states may comprise a data provisioning goal state, a data protection goal state, a data availability goal state, a data regulation goal state, a data quality goal state, a data analytics goal state, a data valuation goal state, and a capacity planning goal state.
Advantageously, illustrative embodiments provide for automated data management of data assets associated with an enterprise. While applicable to data repositories of any size, techniques described herein are particularly well suited for management of large scale data repositories, e.g., zettabytes of storage.
These and other features and advantages of the invention will become more readily apparent from the accompanying drawings and the following detailed description.
Illustrative embodiments may be described herein with reference to exemplary cloud infrastructure, data repositories, data centers, data processing systems, computing systems, data storage systems and associated servers, computers, storage units and devices and other processing devices. It is to be appreciated, however, that embodiments of the invention are not restricted to use with the particular illustrative system and device configurations shown. Moreover, the phrases “cloud infrastructure,” “data repository,” “data center,” “data processing system,” “information processing system,” “computing environment,” “computing system,” “data storage system,” “data lake,” and the like as used herein are intended to be broadly construed, so as to encompass, for example, private and/or public cloud computing or storage systems, as well as other types of systems comprising distributed virtual infrastructure. However, a given embodiment may more generally comprise any arrangement of one or more processing devices.
As used herein, the following terms and phrases have the following illustrative meanings:
“metadata” as used herein is intended to be broadly construed, and may comprise, for example, data that describes or defines data;
“valuation” illustratively refers to a computation and/or estimation of something's worth or value; in this case, data valuation is a computation and/or estimation of the value of a data set for a given context;
“context” illustratively refers to surroundings, circumstances, environment, background, settings, characteristics, qualities, attributes, descriptions, and/or the like, that determine, specify, and/or clarify something; in this case, for example, context is used to determine a value of data;
“data asset” as used herein is intended to be broadly construed, and may comprise, for example, one or more data items, units, elements, blocks, objects, sets, fields, and the like, combinations thereof, and otherwise any information that is obtained and/or generated by an enterprise;
“enterprise” illustratively refers to an organization, a business, a company, a venture, an entity, or the like; and
“entity” illustratively refers to one or more persons, one or more systems, or combinations thereof.
As mentioned above, it is realized that enterprise IT departments, and any entities that have data management responsibilities, will soon need to transition from the management of petabytes of storage to zettabytes of storage. This transition will be a challenge for many reasons, examples of which are as follows:
Manual and Siloed Storage Management Tasks
Storage administrators currently managing large amounts (e.g., petabytes) of storage spend their time manually running management tools and/or creating scripts to do the following tasks:
- (i) Data provisioning: allocating and expanding storage;
- (ii) Data protection: managing number of copies, availability, restore/repair;
- (iii) Data availability: managing permissions, encryption, access, searchability, etc.;
- (iv) Data regulation: ensuring data compliance with the ever-shifting compliance environment;
- (v) Data quality and analytics: ensuring artificial intelligence (AI) algorithms are effectively leveraging the most appropriate, highest quality data;
- (vi) Data valuation: creating a portfolio of data assets with known value; and
- (vii) Capacity planning: predicting data growth and cost for the enterprise.
Given that the tasks described above are often performed manually by people in different departments, the time required to manually manage zettabytes of enterprise data will outpace the ability of an enterprise to effectively manage that data.
Given the breadth of the data expansion problem described above, enterprises will lack the ability to control storage costs. For example, if an enterprise owns hundreds of thousands of data sets, and the value of those data sets is unknown, it will be impractical if not impossible to control the number of copies that are appropriate for each individual data set based on its value. This will result in an enterprise over-paying for storage capacity.
Execution Time of Storage Administration Tasks
Today's administrative teams, at their current size, will not be able to manually manage hundreds of thousands of data sets. Scaling these teams to effectively manage zettabytes of data is unrealistic from a budget perspective as well as error-prone from a scale perspective.
Zettabyte-size data sets introduce the inevitability of manual error, resulting in violation of corporate, federal, and global data regulations. This exposure can result in heavy fines being paid by organizations that are not capable of managing massive data set capacities. In addition, attacks against zettabyte-size enterprise data sets can result in additional revenue loss.
Enterprises that are unable to manage zettabyte-size data sets will frequently miss windows of opportunity to monetize data. The inability to scale enterprise data valuation algorithms to know which data sets are “hot” (frequently used or accessed) and which data sets are “cold” (not frequently used or accessed) will mean that enterprises will be unable to maximize revenue opportunities enabled by data.
Data Tracking and Auditing
Data sets are altered and moved around all the time. It is difficult to track them while they exist, and it is next to impossible to find any audit data after they have been deleted.
Illustrative embodiments address the above and other challenges associated with data management of such large-scale data by adapting IaC concepts. As mentioned above, IaC is a method of writing and deploying machine-readable data definition files. The files generate service components that, when executed, support the delivery of business systems and IT-enabled processes. IaC enables IT operations teams to manage and provision IT infrastructure automatically through code without relying on manual processes. IaC concepts result in what is referred to as programmable infrastructure.
More particularly, illustrative embodiments provide for creation and maintenance of a catalog of machine-readable data definition files (DDFs) that describe a goal state of data assets in the enterprise, a mapping of those DDFs to actual data assets, and an execution engine for effectuating these goals via integration with one or more data management application programming interfaces (APIs).
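By way of illustration only, a DDF of the kind described above might be sketched as a small JSON document that is parsed into an in-memory object. The JSON layout, the field names (`ddf_id`, `goal_states`, `max_copies`, `max_monthly_budget_usd`), and the class `DataDefinitionFile` are hypothetical assumptions made for this sketch and are not taken from the disclosure. The sketch is purely declarative; a DDF may equally specify executable code for implementing a goal state.

```python
import json
from dataclasses import dataclass, field

# Hypothetical DDF content; every field name here is an illustrative assumption.
DDF_TEXT = """
{
  "ddf_id": "ddf-customer-records",
  "goal_states": {
    "data_protection": {"max_copies": 10},
    "data_provisioning": {"max_monthly_budget_usd": 500.0}
  }
}
"""


@dataclass
class DataDefinitionFile:
    """In-memory form of one machine-readable data definition file."""
    ddf_id: str
    goal_states: dict = field(default_factory=dict)

    @classmethod
    def from_json(cls, text: str) -> "DataDefinitionFile":
        # Parse the machine-readable text into a structured object.
        raw = json.loads(text)
        return cls(ddf_id=raw["ddf_id"], goal_states=raw.get("goal_states", {}))

    def goal(self, category: str, key: str, default=None):
        # Look up one goal-state parameter, e.g. ("data_protection", "max_copies").
        return self.goal_states.get(category, {}).get(key, default)


ddf = DataDefinitionFile.from_json(DDF_TEXT)
print(ddf.goal("data_protection", "max_copies"))  # prints 10
```

Keeping each goal state under its own category key lets one file carry several independent goals for the same data asset.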
DDF editor 102 allows a developer to create data definition files that comprise goal states for data. A “goal state,” as used herein, is a state of a given data set with regard to a certain goal or goals. That is, the developer can create/edit a DDF for a given data set to include one or more goal states such as, by way of example only, goals that address data provisioning (allocating and expanding storage, etc.), data protection (managing number of copies, availability, restore/repair, etc.), data availability (managing permissions, encryption, access, searchability, etc.), data regulation (ensuring data compliance with the ever-shifting compliance environment, etc.), data quality and/or data analytics (ensuring AI algorithms are effectively leveraging the most appropriate, highest quality data, etc.), data valuation (creating a portfolio of data assets with known value, etc.), and capacity planning. For example, a data protection goal state may specify how many protection copies of the given data set are allowed, while a data provisioning goal state may specify what are the maximum budgets for storing the data set in a public cloud platform. One of ordinary skill in the art will realize a wide variety of additional and alternative goal states that can be included in a DDF using DDF editor 102. A non-limiting example of a DDF is further described below in the context of
Further, as shown in
DDF mapping layer 106 enables and manages the mapping of DDFs (0-to-many) to actual enterprise data sets in the repository 120. Changes to DDFs and/or new mappings result in the notification of the scalable DDF execution engine 108. It is to be appreciated that DDF mapping layer 106 can be implemented in a variety of ways including, but not limited to, a linked list, a key-value store, etc.
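As a minimal sketch of the key-value variant mentioned above, the mapping layer might keep a dictionary from data set identifiers to sets of DDF identifiers and invoke a callback whenever a new mapping is recorded. The class `DDFMappingLayer`, its method names, and the callback signature are hypothetical assumptions for illustration; the disclosure does not prescribe them.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class DDFMappingLayer:
    """Key-value mapping of data sets to zero or more DDFs (illustrative only)."""

    def __init__(self, notify_engine: Callable[[str, str], None]) -> None:
        # data set id -> ids of the DDFs mapped to that data set
        self._ddfs_by_data_set: Dict[str, Set[str]] = defaultdict(set)
        # Assumed callback standing in for notification of the execution engine.
        self._notify_engine = notify_engine

    def map_ddf(self, data_set_id: str, ddf_id: str) -> None:
        # Record a new mapping and notify the engine; duplicates are ignored.
        if ddf_id in self._ddfs_by_data_set[data_set_id]:
            return
        self._ddfs_by_data_set[data_set_id].add(ddf_id)
        self._notify_engine(data_set_id, ddf_id)

    def ddfs_for(self, data_set_id: str) -> List[str]:
        # A data set with no mapping yields an empty list (the "0" in 0-to-many).
        return sorted(self._ddfs_by_data_set.get(data_set_id, set()))


notifications = []
layer = DDFMappingLayer(lambda ds, ddf: notifications.append((ds, ddf)))
layer.map_ddf("sales-2024", "ddf-protection")
layer.map_ddf("sales-2024", "ddf-budget")
layer.map_ddf("sales-2024", "ddf-budget")  # duplicate mapping, no second notification
print(layer.ddfs_for("sales-2024"))
print(notifications)
```

A dictionary of sets is used here because it gives constant-time lookup per data set; a linked list, as also mentioned above, would serve the same purpose with different trade-offs.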
Scalable DDF execution engine 108, in one illustrative embodiment, is a de-centralized engine that is distributed across multiple geographies. The engine 108 monitors all data sets in the repository 120 and executes the necessary code to bring the data sets to the goal state defined in the DDFs. For example, assuming the goal state in a DDF for a given data set is to store no more than 10 copies of the data set for data protection purposes, the DDF execution engine 108 executes code that ensures that such copy limit is enacted and enforced within whatever storage platform or platforms the data set copies are stored. Likewise, if the goal state in the DDF of the given data set is a maximum budget for storing the data set in a public cloud platform, then the DDF execution engine 108 executes code that ensures that such maximum budget is not exceeded, i.e., by monitoring costs of the storage platform currently storing the given data set and, if needed, migrating the given data set to one or more other public cloud platforms that meet the maximum budget (goal state). Monitoring and provisioning/managing of the data assets occurs via the DDF execution engine 108 calling the data management API 110. That is, the API 110 serves as the interface between the engine 108, the data sets in the repository 120, and whatever system is involved in the effectuation of the goal states in the DDFs.
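The two examples above (a copy limit and a maximum storage budget) can be sketched as small reconciliation routines that compare the observed state with the goal state and act only on the difference. The class `InMemoryDataManagementAPI` is a hypothetical in-memory stand-in for the data management API, and the function names, the cost table, and the rule of migrating to the cheapest platform within budget are assumptions made for this sketch rather than details of the disclosure.

```python
class InMemoryDataManagementAPI:
    """Hypothetical stand-in for the data management API (illustration only)."""

    def __init__(self, copies):
        self.copies = dict(copies)  # data set id -> current number of copies

    def count_copies(self, data_set_id):
        return self.copies[data_set_id]

    def delete_copy(self, data_set_id):
        self.copies[data_set_id] -= 1


def reconcile_copy_limit(api, data_set_id, max_copies):
    """Delete surplus copies until the copy-limit goal state holds.

    Returns the number of copies removed (0 if already compliant).
    """
    removed = 0
    while api.count_copies(data_set_id) > max_copies:
        api.delete_copy(data_set_id)
        removed += 1
    return removed


def choose_platform(monthly_costs, current, max_budget):
    """Pick a storage platform that satisfies a maximum-budget goal state.

    Keeps the current platform if it is within budget; otherwise returns the
    cheapest platform within budget, or None if no platform qualifies.
    """
    if monthly_costs[current] <= max_budget:
        return current
    affordable = {p: c for p, c in monthly_costs.items() if c <= max_budget}
    if not affordable:
        return None
    return min(affordable, key=affordable.get)


api = InMemoryDataManagementAPI({"ds1": 13})
removed = reconcile_copy_limit(api, "ds1", max_copies=10)
print(removed, api.count_copies("ds1"))  # prints: 3 10

target = choose_platform({"cloud-a": 120.0, "cloud-b": 80.0, "cloud-c": 95.0},
                         current="cloud-a", max_budget=100.0)
print(target)  # prints: cloud-b
```

Both routines are idempotent: running them again on a data set that already meets its goal state changes nothing, which suits an engine that monitors data sets continuously.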
Note that the data set repository 120 may be, in one or more illustrative embodiments, distributed geographically depending on where the given enterprise stores the various data sets that constitute the repository. In fact, one or more other components shown in system environment 100 may be geographically distributed in some embodiments.
Furthermore, as new data assets arrive into the enterprise data portfolio, the DDF mapping layer 106 may use any number of approaches to associate the data asset with a DDF. In one illustrative embodiment, the approach may be inheritance-based (e.g., the incoming data asset inherits a DDF from other assets that are also being generated by a specific application), semantic-based (e.g., the incoming data asset is associated with similar data assets and assumes their DDF), or default-based. Default-based data assets may trigger a review and/or the creation of a new DDF via the DDF editor 102.
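One possible way to combine the three approaches above is a simple fallback chain, sketched below. The order of precedence (inheritance, then semantic, then default), the use of shared tags as a measure of semantic similarity, and all names (`associate_ddf`, `source_app`, `tags`, `ddf-default`) are assumptions made for this sketch; the disclosure does not specify how similarity is computed or how the approaches are prioritized.

```python
def associate_ddf(asset, ddf_by_app, catalogued_assets, default_ddf="ddf-default"):
    """Return (ddf_id, approach) for an incoming data asset."""
    # Inheritance-based: reuse the DDF of assets generated by the same application.
    app = asset.get("source_app")
    if app in ddf_by_app:
        return ddf_by_app[app], "inheritance"

    # Semantic-based (assumed metric): adopt the DDF of the catalogued asset
    # sharing the most tags with the incoming asset.
    tags = set(asset.get("tags", ()))
    best_ddf, best_overlap = None, 0
    for other in catalogued_assets:
        overlap = len(tags & set(other["tags"]))
        if overlap > best_overlap:
            best_ddf, best_overlap = other["ddf_id"], overlap
    if best_ddf is not None:
        return best_ddf, "semantic"

    # Default-based: the caller may use this result to trigger a review.
    return default_ddf, "default"


ddf_by_app = {"billing": "ddf-finance"}
catalogued = [
    {"tags": ["pii", "customer"], "ddf_id": "ddf-privacy"},
    {"tags": ["logs"], "ddf_id": "ddf-logs"},
]

print(associate_ddf({"source_app": "billing", "tags": []}, ddf_by_app, catalogued))
print(associate_ddf({"source_app": "crm", "tags": ["customer", "pii"]}, ddf_by_app, catalogued))
print(associate_ddf({"source_app": "crm", "tags": ["video"]}, ddf_by_app, catalogued))
```

Returning the approach alongside the DDF identifier makes it straightforward for a caller to flag default-based assignments for review or for creation of a new DDF.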
Turning now to
Thus, as shown, the primary (production) data management system 210 resides on a cluster of N processing nodes 212-1, 212-2, . . . 212-N (respectively Node 1, Node 2, . . . Node N). All the nodes or subsets thereof can be used by the scalable DDF execution engine 214 (corresponding to DDF execution engine 108 in
Furthermore, as shown in
At least portions of the automated data management system environment shown in
As is apparent from the above, one or more of the processing modules or other components of the automated data management system environment shown in
The processing platform 400 in this embodiment comprises a plurality of processing devices, denoted 402-1, 402-2, 402-3, . . . 402-N, which communicate with one another over a network 404.
The network 404 may comprise any type of network, including by way of example a global computer network such as the Internet, a WAN, a LAN, a satellite network, a telephone or cable network, a cellular network, a wireless network such as a WiFi or WiMAX network, or various portions or combinations of these and other types of networks.
Some networks utilized in a given embodiment may comprise high-speed local networks in which associated processing devices communicate with one another utilizing Peripheral Component Interconnect Express (PCIe) cards of those devices, and networking protocols such as InfiniBand, Gigabit Ethernet or Fibre Channel.
The processing device 402-1 in the processing platform 400 comprises a processor 410 coupled to a memory 412.
The processor 410 may comprise a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other type of processing circuitry, as well as portions or combinations of such circuitry elements.
The memory 412 may comprise random access memory (RAM), read-only memory (ROM) or other types of memory, in any combination. The memory 412 and other memories disclosed herein should be viewed as illustrative examples of what are more generally referred to as “processor-readable storage media” storing executable program code of one or more software programs.
Articles of manufacture comprising such processor-readable storage media are considered embodiments of the present disclosure. A given such article of manufacture may comprise, for example, a storage array, a storage disk or an integrated circuit containing RAM, ROM or other electronic memory, or any of a wide variety of other types of computer program products. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Numerous other types of computer program products comprising processor-readable storage media can be used.
Also included in the processing device 402-1 of the example embodiment of
The other processing devices 402 of the processing platform 400 are assumed to be configured in a manner similar to that shown for processing device 402-1 in the figure.
Again, this particular processing platform is presented by way of example only, and other embodiments may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, servers, storage devices or other processing devices.
For example, other processing platforms used to implement embodiments of the disclosure can comprise different types of virtualization infrastructure, in place of or in addition to virtualization infrastructure comprising virtual machines. Such virtualization infrastructure illustratively includes container-based virtualization infrastructure configured to provide Docker containers or other types of Linux containers (LXCs).
The containers may be associated with respective tenants of a multi-tenant environment, although in other embodiments a given tenant can have multiple containers. The containers may be utilized to implement a variety of different types of functionality within the system. For example, containers can be used to implement respective cloud compute nodes or cloud storage nodes of a cloud computing and storage system. The compute nodes or storage nodes may be associated with respective cloud tenants of a multi-tenant environment. Containers may be used in combination with other virtualization infrastructure such as virtual machines implemented using a hypervisor.
As another example, portions of a given processing platform in some embodiments can comprise converged infrastructure such as VxRail™, VxRack™ or Vblock™ converged infrastructure commercially available from VCE, the Virtual Computing Environment Company, now the Converged Platform and Solutions Division of Dell EMC. For example, portions of an automated data management system environment of the type disclosed herein can be implemented utilizing converged infrastructure.
It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. In many embodiments, at least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.
Also, in other embodiments, numerous other arrangements of computers, servers, storage devices or other components are possible in the system and methods described herein. Such components can communicate with other elements of the system over any type of network or other communication media.
As indicated previously, in some embodiments, components of the automated data management system environment as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, at least portions of the environment or other system components are illustratively implemented in one or more embodiments in the form of software running on a processing platform comprising one or more processing devices.
It should again be emphasized that the above-described embodiments of the disclosure are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types of systems and assets. Also, the particular configurations of system and device elements, associated processing operations and other functionality illustrated in the drawings can be varied in other embodiments. Moreover, the various assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the embodiments. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.