Pluggable storage system for parallel query engines across non-native file systems
First Claim
1. A method for managing data, comprising:
receiving, by one or more processors, a query from a client via one or more networks;
based on the received query, analyzing a catalog, which stores mappings of file names and file locations, for location information, wherein the catalog is associated with a universal namenode that provides a single namespace for accessing a plurality of files stored across a plurality of storage systems, and wherein the location information stored in connection with the catalog indicates a storage system on which a file is located among the plurality of storage systems;
based on the analysis, determining, by one or more processors, a first storage system of the plurality of storage systems, an associated first file system, an associated first protocol translator to use in connection with communication with the first storage system, a second storage system of the plurality of storage systems, an associated second file system, and an associated second protocol translator to use in connection with communication with the second storage system;
identifying, by one or more processors, a first data and a second data, wherein the first data is stored on the first storage system, and the second data is stored on the second storage system, and wherein a first portion of the query is performed on the first storage system and a second portion of the query is performed on the second storage system, wherein the first storage system is different from the second storage system, and wherein a first protocol used in connection with communication with the first storage system is different from a second protocol used in connection with communication with the second storage system;
running, by one or more processors, a first job on the first data using the associated first protocol translator, wherein the first job is not a native job of the first file system; and
running, by one or more processors, a second job on the second data using the associated second protocol translator, wherein the second job is not a native job of the second file system.
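The flow recited in the claim (consult a catalog for location information, pick the protocol translator associated with each storage system, then run a job through that translator) can be sketched as follows. This is a minimal illustration and not the patented implementation: the catalog contents, the names `CatalogEntry`, `CATALOG`, `TRANSLATORS`, `translate_a`, `translate_b`, and `run_query`, and the placeholder systems `cluster-a` and `cluster-b` are all assumptions made for the example, and the translators only build request strings instead of performing I/O.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    storage_system: str  # storage system on which the file is located
    file_system: str     # file system associated with that storage system
    protocol: str        # protocol used to communicate with that system
    location: str        # file location within that storage system


# Hypothetical catalog: one namespace of file names mapped to locations
# on two different storage systems that speak different protocols.
CATALOG: Dict[str, CatalogEntry] = {
    "/sales/2012": CatalogEntry("cluster-a", "fs-a", "proto-a", "/vol1/sales/2012"),
    "/sales/2013": CatalogEntry("cluster-b", "fs-b", "proto-b", "/export/sales/2013"),
}


def translate_a(location: str, job: str) -> str:
    # Stand-in for a protocol translator; a real one would issue I/O requests.
    return f"proto-a request: run {job} on {location}"


def translate_b(location: str, job: str) -> str:
    # Second stand-in translator for the second, different protocol.
    return f"proto-b request: run {job} on {location}"


# Hypothetical registry holding one translator per protocol.
TRANSLATORS: Dict[str, Callable[[str, str], str]] = {
    "proto-a": translate_a,
    "proto-b": translate_b,
}


def run_query(file_names: List[str], job: str) -> List[str]:
    """Look up each file in the catalog and run the job via its translator."""
    results = []
    for name in file_names:
        entry = CATALOG[name]                     # analyze the catalog
        translator = TRANSLATORS[entry.protocol]  # select the protocol translator
        results.append(translator(entry.location, job))
    return results
```

Calling `run_query(["/sales/2012", "/sales/2013"], "count_rows")` in this sketch sends the same job through two different translators, one per storage system, while the caller only ever uses names from the single namespace.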
Abstract
A method, article of manufacture, and apparatus for managing data. In some embodiments, this includes: receiving a query from a client; based on the received query, analyzing a catalog for location information; based on the analysis, determining a first storage system, an associated first file system, an associated first protocol translator, a second storage system, an associated second file system, and an associated second protocol translator; identifying a first data and a second data, wherein the first data is stored on the first storage system, and the second data is stored on the second storage system; running a first job on the first data using the associated first protocol translator, wherein the first job is not a native job of the first file system; and running a second job on the second data using the associated second protocol translator, wherein the second job is not a native job of the second file system.
29 Claims
1. A method for managing data, comprising:
receiving, by one or more processors, a query from a client via one or more networks;
based on the received query, analyzing a catalog, which stores mappings of file names and file locations, for location information, wherein the catalog is associated with a universal namenode that provides a single namespace for accessing a plurality of files stored across a plurality of storage systems, and wherein the location information stored in connection with the catalog indicates a storage system on which a file is located among the plurality of storage systems;
based on the analysis, determining, by one or more processors, a first storage system of the plurality of storage systems, an associated first file system, an associated first protocol translator to use in connection with communication with the first storage system, a second storage system of the plurality of storage systems, an associated second file system, and an associated second protocol translator to use in connection with communication with the second storage system;
identifying, by one or more processors, a first data and a second data, wherein the first data is stored on the first storage system, and the second data is stored on the second storage system, and wherein a first portion of the query is performed on the first storage system and a second portion of the query is performed on the second storage system, wherein the first storage system is different from the second storage system, and wherein a first protocol used in connection with communication with the first storage system is different from a second protocol used in connection with communication with the second storage system;
running, by one or more processors, a first job on the first data using the associated first protocol translator, wherein the first job is not a native job of the first file system; and
running, by one or more processors, a second job on the second data using the associated second protocol translator, wherein the second job is not a native job of the second file system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A system for managing data, comprising a processor configured to:
receive a query from a client via one or more networks;
based on the received query, analyze a catalog, which stores mappings of file names and file locations, for location information, wherein the catalog is associated with a universal namenode that provides a single namespace for accessing a plurality of files stored across a plurality of storage systems, and wherein the location information stored in connection with the catalog indicates a storage system on which a file is located among the plurality of storage systems;
based on the analysis, determine a first storage system of the plurality of storage systems, an associated first file system, an associated first protocol translator to use in connection with communication with the first storage system, a second storage system of the plurality of storage systems, an associated second file system, and an associated second protocol translator to use in connection with communication with the second storage system;
identify a first data and a second data, wherein the first data is stored on the first storage system, and the second data is stored on the second storage system, and wherein a first portion of the query is performed on the first storage system and a second portion of the query is performed on the second storage system, wherein the first storage system is different from the second storage system;
run a first job on the first data using the associated first protocol translator, wherein the first job is not a native job of the first file system; and
run a second job on the second data using the associated second protocol translator, wherein the second job is not a native job of the second file system.
Dependent claims: 18, 19, 20, 21, 22, 23
24. A computer program product for processing data, comprising a non-transitory computer readable medium having program instructions embodied therein for:
receiving, by one or more processors, a query from a client via one or more networks;
based on the received query, analyzing a catalog, which stores mappings of file names and file locations, for location information, wherein the catalog is associated with a universal namenode that provides a single namespace for accessing a plurality of files stored across a plurality of storage systems, and wherein the location information stored in connection with the catalog indicates a storage system on which a file is located among the plurality of storage systems;
based on the analysis, determining, by one or more processors, a first storage system of the plurality of storage systems, an associated first file system, an associated first protocol translator to use in connection with communication with the first storage system, a second storage system of the plurality of storage systems, an associated second file system, and an associated second protocol translator to use in connection with communication with the second storage system;
identifying, by one or more processors, a first data and a second data, wherein the first data is stored on the first storage system, and the second data is stored on the second storage system, and wherein a first portion of the query is performed on the first storage system and a second portion of the query is performed on the second storage system, wherein the first storage system is different from the second storage system, and wherein a first protocol used in connection with communication with the first storage system is different from a second protocol used in connection with communication with the second storage system;
running, by one or more processors, a first job on the first data using the associated first protocol translator, wherein the first job is not a native job of the first file system; and
running, by one or more processors, a second job on the second data using the associated second protocol translator, wherein the second job is not a native job of the second file system.
Dependent claims: 25, 26, 27, 28, 29
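Each independent claim recites that a first portion of the query is performed on the first storage system and a second portion on the second. One way to picture that split is to group the files a query touches by the storage system holding them and hand each group to its own worker. The sketch below is only an illustration under stated assumptions: the `LOCATIONS` table and the names `split_by_storage_system`, `run_portion`, and `run_in_parallel` are hypothetical, and running the portions concurrently in a thread pool is a choice made here in keeping with the title's parallel query engines, not something the claims require.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical location information: file name -> storage system.
LOCATIONS: Dict[str, str] = {
    "/logs/jan": "system-1",
    "/logs/feb": "system-1",
    "/logs/mar": "system-2",
}


def split_by_storage_system(file_names: List[str]) -> Dict[str, List[str]]:
    """Group the files a query touches by the storage system holding them."""
    portions: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        portions[LOCATIONS[name]].append(name)
    return dict(portions)


def run_portion(storage_system: str, files: List[str]) -> str:
    # Stand-in for running a job on one storage system through its translator.
    return f"{storage_system}: processed {len(files)} file(s)"


def run_in_parallel(file_names: List[str]) -> List[str]:
    """Run one portion of the query per storage system, concurrently."""
    portions = split_by_storage_system(file_names)
    with ThreadPoolExecutor(max_workers=max(1, len(portions))) as pool:
        futures = [pool.submit(run_portion, s, f) for s, f in portions.items()]
        # Sort so the result order does not depend on thread scheduling.
        return sorted(fut.result() for fut in futures)
```

In this sketch each storage system receives exactly one portion, so adding a third system to `LOCATIONS` would add a third worker without changing the calling code.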
Specification