Integration of distributed data processing platform with one or more distinct supporting platforms
First Claim
1. A method comprising:
configuring a plurality of distributed processing nodes, each comprising a processor coupled to a memory, to communicate over a network;
obtaining metadata characterizing data locally accessible in respective data zones of respective ones of the distributed processing nodes;
populating catalog instances of a distributed catalog service for respective ones of the data zones utilizing the obtained metadata; and
performing distributed data analytics for a given analytics job in the distributed processing nodes utilizing the populated catalog instances of the distributed catalog service and the locally accessible data of the respective data zones;
wherein the given analytics job utilizes one or more first data items locally accessible within a first one of the data zones of a first one of the distributed processing nodes and one or more second data items locally accessible within at least a second one of the data zones of at least a second one of the distributed processing nodes;
wherein the obtained metadata comprises one or more metadata tags identifying at least one of the one or more first data items and the one or more second data items, a first entrance key permitting access to a first internal network of the first data zone, and a second entrance key permitting access to a second internal network of the second data zone;
wherein populating the catalog instances of the distributed catalog service further comprises provisioning a first populated catalog instance associated with the first data zone of the first distributed processing node with the first entrance key and provisioning a second populated catalog instance associated with the second data zone of the second distributed processing node with the second entrance key, and
wherein performing distributed data analytics for the given analytics job comprises at least one of:
the first populated catalog instance associated with the first data zone of the first distributed processing node utilizing at least one of the one or more metadata tags to map to one or more first physical storage locations of one or more of the first data items in one or more first data storage devices of the first distributed processing node and utilizing the first entrance key to access one or more of the first data items at the one or more first physical storage locations in the one or more first data storage devices of the first distributed processing node; and
the second populated catalog instance associated with the second data zone of the second distributed processing node utilizing at least one of the one or more metadata tags to map to one or more second physical storage locations of one or more of the second data items in one or more second data storage devices of the second distributed processing node and utilizing the second entrance key to access one or more of the second data items at the one or more second physical storage locations in the one or more second data storage devices of the second distributed processing node.
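The data flow recited in claim 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the class names `DataZone` and `CatalogInstance`, the function `run_job`, the storage paths, the key strings, and the choice of a mean as the analytics job are all hypothetical. Each catalog instance is provisioned with the entrance key of its own zone, maps a metadata tag to physical storage locations, and uses the key to read the items locally.

```python
from dataclasses import dataclass, field


@dataclass
class DataZone:
    """Hypothetical data zone: items sit at physical locations behind an entrance key."""
    name: str
    entrance_key: str
    storage: dict[str, list[float]]  # physical location -> data item

    def read(self, location: str, key: str) -> list[float]:
        # Access to the zone's internal storage requires the zone's entrance key.
        if key != self.entrance_key:
            raise PermissionError(f"invalid entrance key for zone {self.name}")
        return self.storage[location]


@dataclass
class CatalogInstance:
    """Hypothetical catalog instance serving exactly one data zone."""
    zone: DataZone
    entrance_key: str = ""
    tag_to_locations: dict[str, list[str]] = field(default_factory=dict)

    def populate(self, metadata: dict) -> None:
        # Provision the instance with the entrance key and the tag mapping.
        self.entrance_key = metadata["entrance_key"]
        self.tag_to_locations = dict(metadata["tags"])

    def fetch(self, tag: str) -> list[list[float]]:
        # Map the metadata tag to physical locations, then read each item with the key.
        locations = self.tag_to_locations.get(tag, [])
        return [self.zone.read(loc, self.entrance_key) for loc in locations]


def run_job(catalogs: list[CatalogInstance], tag: str) -> float:
    """Toy analytics job: mean over all items carrying `tag`, across zones."""
    total, count = 0.0, 0
    for catalog in catalogs:
        for item in catalog.fetch(tag):
            total += sum(item)
            count += len(item)
    return total / count if count else 0.0


# Two zones, each with its own entrance key and locally accessible data.
zone_a = DataZone("zone-a", "key-a", {"/zone-a/disk0/blk0": [1.0, 2.0, 3.0]})
zone_b = DataZone("zone-b", "key-b", {"/zone-b/disk3/blk7": [5.0]})

catalog_a = CatalogInstance(zone_a)
catalog_a.populate({"entrance_key": "key-a", "tags": {"sales": ["/zone-a/disk0/blk0"]}})
catalog_b = CatalogInstance(zone_b)
catalog_b.populate({"entrance_key": "key-b", "tags": {"sales": ["/zone-b/disk3/blk7"]}})

mean_sales = run_job([catalog_a, catalog_b], "sales")
```

In this sketch the job never addresses a storage device directly; it only names a tag, and each catalog instance resolves that tag inside its own zone.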
Abstract
An apparatus in one embodiment comprises at least one processing device having a processor coupled to a memory. The one or more processing devices are operative to configure a plurality of distributed processing nodes to communicate over a network, to obtain metadata characterizing data locally accessible in respective data zones of respective ones of the distributed processing nodes, and to populate catalog instances of a distributed catalog service for respective ones of the data zones utilizing the obtained metadata. Distributed data analytics are performed in the distributed processing nodes utilizing the populated catalog instances of the distributed catalog service and the locally accessible data of the respective data zones. The metadata characterizing the locally accessible data is illustratively obtained in a metadata repository from at least one of a master data management platform and a governance, risk and compliance platform.
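The last sentence of the abstract describes metadata arriving in a metadata repository from a master data management (MDM) platform, a governance, risk and compliance (GRC) platform, or both. A minimal sketch of that step follows, assuming hypothetical record shapes: the field names `zone`, `item`, `tags`, and `restricted`, and the functions `build_repository` and `populate_catalogs`, are illustrative and do not come from the patent text.

```python
from collections import defaultdict


def build_repository(mdm_records: list[dict], grc_records: list[dict]) -> dict:
    """Merge records from two hypothetical feeds, keyed by (zone, item)."""
    repo: dict[tuple[str, str], dict] = {}
    for rec in mdm_records:
        # Assumption: the MDM feed contributes descriptive tags.
        repo[(rec["zone"], rec["item"])] = {"tags": set(rec["tags"]), "restricted": False}
    for rec in grc_records:
        # Assumption: the GRC feed contributes a compliance flag.
        entry = repo.setdefault(
            (rec["zone"], rec["item"]), {"tags": set(), "restricted": False}
        )
        entry["restricted"] = rec["restricted"]
    return repo


def populate_catalogs(repo: dict) -> dict:
    """Group repository entries by zone, giving one catalog instance per data zone."""
    catalogs: dict[str, dict] = defaultdict(dict)
    for (zone, item), entry in repo.items():
        catalogs[zone][item] = entry
    return dict(catalogs)


mdm = [
    {"zone": "zone-a", "item": "sales.csv", "tags": ["sales"]},
    {"zone": "zone-b", "item": "claims.csv", "tags": ["claims"]},
]
grc = [{"zone": "zone-b", "item": "claims.csv", "restricted": True}]

catalogs = populate_catalogs(build_repository(mdm, grc))
```

Keeping the repository keyed by zone and item means each catalog instance receives only the metadata for data that is locally accessible in its own zone.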
404 Citations
20 Claims
1. A method comprising:
configuring a plurality of distributed processing nodes, each comprising a processor coupled to a memory, to communicate over a network;
obtaining metadata characterizing data locally accessible in respective data zones of respective ones of the distributed processing nodes;
populating catalog instances of a distributed catalog service for respective ones of the data zones utilizing the obtained metadata; and
performing distributed data analytics for a given analytics job in the distributed processing nodes utilizing the populated catalog instances of the distributed catalog service and the locally accessible data of the respective data zones;
wherein the given analytics job utilizes one or more first data items locally accessible within a first one of the data zones of a first one of the distributed processing nodes and one or more second data items locally accessible within at least a second one of the data zones of at least a second one of the distributed processing nodes;
wherein the obtained metadata comprises one or more metadata tags identifying at least one of the one or more first data items and the one or more second data items, a first entrance key permitting access to a first internal network of the first data zone, and a second entrance key permitting access to a second internal network of the second data zone;
wherein populating the catalog instances of the distributed catalog service further comprises provisioning a first populated catalog instance associated with the first data zone of the first distributed processing node with the first entrance key and provisioning a second populated catalog instance associated with the second data zone of the second distributed processing node with the second entrance key, and
wherein performing distributed data analytics for the given analytics job comprises at least one of:
the first populated catalog instance associated with the first data zone of the first distributed processing node utilizing at least one of the one or more metadata tags to map to one or more first physical storage locations of one or more of the first data items in one or more first data storage devices of the first distributed processing node and utilizing the first entrance key to access one or more of the first data items at the one or more first physical storage locations in the one or more first data storage devices of the first distributed processing node; and
the second populated catalog instance associated with the second data zone of the second distributed processing node utilizing at least one of the one or more metadata tags to map to one or more second physical storage locations of one or more of the second data items in one or more second data storage devices of the second distributed processing node and utilizing the second entrance key to access one or more of the second data items at the one or more second physical storage locations in the one or more second data storage devices of the second distributed processing node.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device:
to configure a plurality of distributed processing nodes to communicate over a network;
to obtain metadata characterizing data locally accessible in respective data zones of respective ones of the distributed processing nodes;
to populate catalog instances of a distributed catalog service for respective ones of the data zones utilizing the obtained metadata; and
to perform distributed data analytics for a given analytics job in the distributed processing nodes utilizing the populated catalog instances of the distributed catalog service and the locally accessible data of the respective data zones;
wherein the given analytics job utilizes one or more first data items locally accessible within a first one of the data zones of a first one of the distributed processing nodes and one or more second data items locally accessible within at least a second one of the data zones of at least a second one of the distributed processing nodes;
wherein the obtained metadata comprises one or more metadata tags identifying at least one of the one or more first data items and the one or more second data items, a first entrance key permitting access to a first internal network of the first data zone, and a second entrance key permitting access to a second internal network of the second data zone;
wherein populating the catalog instances of the distributed catalog service further comprises provisioning a first populated catalog instance associated with the first data zone of the first distributed processing node with the first entrance key and provisioning a second populated catalog instance associated with the second data zone of the second distributed processing node with the second entrance key; and
wherein performing distributed data analytics for the given analytics job comprises at least one of:
the first populated catalog instance associated with the first data zone of the first distributed processing node utilizing at least one of the one or more metadata tags to map to one or more first physical storage locations of one or more of the first data items in one or more first data storage devices of the first distributed processing node and utilizing the first entrance key to access one or more of the first data items at the one or more first physical storage locations in the one or more first data storage devices of the first distributed processing node; and
the second populated catalog instance associated with the second data zone of the second distributed processing node utilizing at least one of the one or more metadata tags to map to one or more second physical storage locations of one or more of the second data items in one or more second data storage devices of the second distributed processing node and utilizing the second entrance key to access one or more of the second data items at the one or more second physical storage locations in the one or more second data storage devices of the second distributed processing node.
Dependent claims: 16, 17
18. An apparatus comprising:
at least one processing device having a processor coupled to a memory;
wherein said at least one processor is operative:
to configure a plurality of distributed processing nodes to communicate over a network;
to obtain metadata characterizing data locally accessible in respective data zones of respective ones of the distributed processing nodes;
to populate catalog instances of a distributed catalog service for respective ones of the data zones utilizing the obtained metadata; and
to perform distributed data analytics for a given analytics job in the distributed processing nodes utilizing the populated catalog instances of the distributed catalog service and the locally accessible data of the respective data zones;
wherein the given analytics job utilizes one or more first data items locally accessible within a first one of the data zones of a first one of the distributed processing nodes and one or more second data items locally accessible within at least a second one of the data zones of at least a second one of the distributed processing nodes;
wherein the obtained metadata comprises one or more metadata tags identifying at least one of the one or more first data items and the one or more second data items, a first entrance key permitting access to a first internal network of the first data zone, and a second entrance key permitting access to a second internal network of the second data zone;
wherein populating the catalog instances of the distributed catalog service further comprises provisioning a first populated catalog instance associated with the first data zone of the first distributed processing node with the first entrance key and provisioning a second populated catalog instance associated with the second data zone of the second distributed processing node with the second entrance key, and
wherein performing distributed data analytics for the given analytics job comprises at least one of:
the first populated catalog instance associated with the first data zone of the first distributed processing node utilizing at least one of the one or more metadata tags to map to one or more first physical storage locations of one or more of the first data items in one or more first data storage devices of the first distributed processing node and utilizing the first entrance key to access one or more of the first data items at the one or more first physical storage locations in the one or more first data storage devices of the first distributed processing node; and
the second populated catalog instance associated with the second data zone of the second distributed processing node utilizing at least one of the one or more metadata tags to map to one or more second physical storage locations of one or more of the second data items in one or more second data storage devices of the second distributed processing node and utilizing the second entrance key to access one or more of the second data items at the one or more second physical storage locations in the one or more second data storage devices of the second distributed processing node.
Dependent claims: 19, 20
Specification