Object metadata query with distributed processing systems
First Claim
1. A computer-implemented method comprising:
providing one or more computer processors configured to perform:
providing access, from a distributed processing system, to a distributed object store configured for storing and retrieving object data and associated metadata, where the distributed object store uniquely identifies objects contained therein using an object key comprising one or more namespace identifiers and a unique object identifier (id) within the identified namespace, wherein the namespace identifiers comprise a tenant id uniquely identifying a tenant within the distributed object store, and a bucket id uniquely identifying a bucket that comprises a plurality of objects, the bucket defined by and belonging to the tenant, wherein the distributed object store is configured as part of a distributed key-value store, the distributed key-value store comprising:
a set of data and object metadata for the plurality of objects;
a primary index configured for storing a mapping of object ids to storage locations, for the plurality of objects; and
one or more secondary indexes each configured for storing information relating to properties of the plurality of objects different than the object ids and different than the set of data and metadata for the plurality of objects, wherein the one or more secondary indexes are each defined as part of a respective specified bucket, wherein each of the one or more secondary indexes comprises:
information based on the properties of the object metadata itself and information based on object access patterns for one or more applications that query the distributed object store;
a secondary index table maintaining a mapping between properties of the object metadata and cached object metadata properties for each stored object; and
a plurality of secondary index definitions, the secondary index definitions comprising information about one or more secondary indexes that have been created in the distributed key-value store, the information about the one or more secondary indexes comprising, for each respective secondary index, index name, indexed metadata keys, and cached metadata keys, the cached metadata keys corresponding to duplicates of one or more of the cached object metadata properties;
wherein providing duplicates of one or more of the cached object metadata properties is configured to improve data request performance by reducing time to access information about object metadata properties; and
wherein at least one of the one or more secondary indexes is configured to improve the efficiency of responding to data requests;
receiving a data request for object metadata from the distributed processing system, the data request associated with a first bucket within the distributed object store, wherein the first bucket comprises at least a first respective secondary index, the data request comprising an object metadata query, identifying one or more objects within the first bucket that satisfy the object metadata query;
wherein the object metadata query includes at least one query predicate involving an object metadata key, wherein identifying the one or more objects within the first bucket that satisfy the object metadata query comprises:
parsing the object metadata query into a query parse tree;
generating a plurality of candidate query plans, each of the candidate query plans being semantically equivalent to the object metadata query;
selecting one of the candidate query plans; and
identifying one or more objects that satisfy the query predicate by retrieving object ids from the first respective secondary index using a first bucket id associated with the first bucket and the object metadata key involved in the query predicate;
for each object identified as satisfying the object metadata query:
determining a location of corresponding object metadata stored within the distributed object store;
retrieving the corresponding object metadata using the determined location; and
generating a metadata record from the corresponding object metadata;
combining the metadata records from the identified objects into a metadata collection having a format compatible with the distributed processing system; and
returning the metadata collection to the distributed processing system in connection with the response to the data request.
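The steps recited in claim 1 can be illustrated with a minimal in-memory sketch. Everything named here is an assumption made for illustration, not the patented implementation: the dictionaries standing in for the distributed key-value store, the `ObjectKey` class, the `key = value` query syntax, the two candidate plans, and the cost estimates are all hypothetical. The sketch also assumes that bucket ids are unique across tenants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectKey:
    """Object key: namespace identifiers plus an object id (hypothetical layout)."""
    tenant_id: str
    bucket_id: str
    object_id: str


# Assumed for illustration: which metadata keys the bucket's secondary index covers.
INDEXED_KEYS = ("content-type",)

# Plain dicts standing in for the distributed key-value store.
primary_index = {}    # ObjectKey -> storage location
storage = {}          # storage location -> object metadata
secondary_index = {}  # (bucket_id, metadata key, value) -> set of object ids


def put_object(key, metadata):
    """Store metadata and update the primary and secondary indexes."""
    location = f"node0/{key.tenant_id}/{key.bucket_id}/{key.object_id}"
    primary_index[key] = location
    storage[location] = dict(metadata)
    for mkey in INDEXED_KEYS:
        if mkey in metadata:
            entry = (key.bucket_id, mkey, metadata[mkey])
            secondary_index.setdefault(entry, set()).add(key.object_id)


def parse_predicate(query):
    """Parse 'key = value' into a one-node parse tree: ('=', key, value)."""
    mkey, value = (part.strip() for part in query.split("=", 1))
    return ("=", mkey, value)


def select_plan(bucket_id, mkey):
    """Build equivalent candidate plans and pick the one with the lowest cost."""
    bucket_size = sum(1 for k in primary_index if k.bucket_id == bucket_id)
    plans = [("full_scan", bucket_size)]
    if any(b == bucket_id and m == mkey for (b, m, _v) in secondary_index):
        plans.insert(0, ("index_lookup", 1))  # listed first so it wins ties
    return min(plans, key=lambda plan: plan[1])[0]


def query_metadata(tenant_id, bucket_id, query):
    """Return a metadata collection (list of records) for matching objects."""
    _op, mkey, value = parse_predicate(query)
    if select_plan(bucket_id, mkey) == "index_lookup":
        object_ids = secondary_index.get((bucket_id, mkey, value), set())
    else:
        object_ids = {
            k.object_id
            for k, loc in primary_index.items()
            if k.tenant_id == tenant_id
            and k.bucket_id == bucket_id
            and storage[loc].get(mkey) == value
        }
    collection = []
    for object_id in sorted(object_ids):
        # Determine the location, retrieve the metadata, build one record.
        location = primary_index[ObjectKey(tenant_id, bucket_id, object_id)]
        collection.append({"object_id": object_id, **storage[location]})
    return collection
```

In this sketch a predicate on an indexed key is answered from the secondary index, while a predicate on any other key falls back to scanning the bucket; both plans return the same records, which is what makes them semantically equivalent.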
9 Assignments
0 Petitions
Abstract
A distributed object store can expose object metadata, in addition to object data, to distributed processing systems, such as Hadoop and Apache Spark. The distributed object store may act as a Hadoop Compatible File System (HCFS), exposing object metadata as a collection of records that can be efficiently processed by MapReduce (MR) and other distributed processing frameworks. A distributed processing job can specify a metadata query to narrow the set of objects returned. Related methods are also described.
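To make "object metadata as a collection of records" concrete, the following is a small sketch in plain Python rather than Hadoop or Spark. The one-JSON-object-per-line record format and the function names `to_record_lines`, `map_phase`, and `reduce_phase` are assumptions chosen for illustration; the abstract does not specify a record format.

```python
import json
from collections import Counter


def to_record_lines(metadata_by_object):
    """Serialize each object's metadata as one JSON record per line (assumed format)."""
    return [
        json.dumps({"object_id": object_id, **metadata}, sort_keys=True)
        for object_id, metadata in sorted(metadata_by_object.items())
    ]


def map_phase(lines, key):
    """Emit (value, 1) for the chosen metadata key of every record."""
    for line in lines:
        record = json.loads(line)
        yield record.get(key, "unknown"), 1


def reduce_phase(pairs):
    """Sum the counts emitted by the map phase, grouped by value."""
    counts = Counter()
    for value, count in pairs:
        counts[value] += count
    return dict(counts)
```

A job in this style would first apply a metadata query to narrow the objects, then feed only the resulting record lines to `map_phase`, so the processing framework never touches object data it does not need.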
139 Citations
20 Claims
1. A computer-implemented method, recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A system comprising:
one or more processors;
a distributed key/value store in operable communication with the one or more processors, the distributed key value store comprising:
a distributed object store configured for storing and retrieving object data and associated metadata, where the distributed object store uniquely identifies objects contained therein using an object key comprising one or more namespace identifiers and a unique object identifier (id) within the identified namespace, wherein the namespace identifiers comprise a tenant id uniquely identifying a tenant within the distributed object store, and a bucket id uniquely identifying a bucket that comprises a collection of objects, the bucket defined by and belonging to the tenant;
a set of data and object metadata for the plurality of objects;
a primary index configured for storing a mapping of object ids to storage locations, for the plurality of objects; and
one or more secondary indexes each configured for storing information relating to properties of the plurality of objects different than the object ids and different than the set of data and metadata for the plurality of objects, wherein the one or more secondary indexes are each defined as part of a respective specified bucket, wherein each of the one or more secondary indexes comprises:
information based on the properties of the object metadata itself and information based on object access patterns for one or more applications that query the distributed object store;
a secondary index table maintaining a mapping between properties of the object metadata and cached object metadata properties for each stored object; and
a plurality of secondary index definitions, the secondary index definitions comprising information about one or more secondary indexes that have been created in the distributed key-value store, the information about the one or more secondary indexes comprising, for each respective secondary index, index name, indexed metadata keys, and cached metadata keys, the cached metadata keys corresponding to duplicates of one or more of the cached object metadata properties;
wherein providing duplicates of one or more of the cached object metadata properties is configured to improve data request performance by reducing time to access information about object metadata properties; and
wherein at least one of the one or more secondary indexes is configured to improve the efficiency of responding to data requests;
a plurality of storage devices in operable communication with the one or more processors, the plurality of storage devices configured to store the object data and the object metadata; and
a plurality of data service nodes in operable communication with the one or more processors, each of the data service nodes coupled to the distributed key/value store and corresponding ones of the storage devices, wherein a first one of the plurality of data service nodes comprises:
an object storage engine to determine the location of object metadata stored within the plurality of storage devices using the distributed key/value store;
a storage controller to retrieve the object metadata from the plurality of storage devices;
a metadata formatting module to generate metadata records from object metadata and to combine metadata records into a metadata collection having a format compatible with a distributed processing system; and
an interface configured to receive a data request for object metadata from the distributed processing system and, in cooperation with the controller, to return a metadata collection to the distributed processing system in response to the data request, wherein the data request is associated with a first bucket, wherein the first bucket comprises at least a first respective secondary index, the data request comprising an object metadata query that includes at least one predicate involving an object metadata key, wherein the controller is configured for identifying an object within the first bucket that satisfies the object metadata query by:
parsing the object metadata query into a query parse tree;
generating a plurality of candidate query plans, each of the candidate query plans being semantically equivalent to the object metadata query;
selecting one of the candidate query plans; and
identifying one or more objects that satisfy the query predicate by retrieving object ids from the first respective secondary index using a first bucket id associated with the first bucket and the object metadata key involved in the query predicate;
for each object identified as satisfying the object metadata query:
determining a location of corresponding object metadata stored within the distributed object store;
retrieving the corresponding object metadata using the determined location; and
generating a metadata record from the corresponding object metadata;
combining the metadata records from the identified objects into a metadata collection having a format compatible with the distributed processing system; and
returning the metadata collection to the distributed processing system in connection with the response to the data request.
Dependent claims: 18, 19, 20
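The secondary index definition recited in the claims (index name, indexed metadata keys, cached metadata keys) can be sketched as follows. The class names, the table layout, and the `covers` helper are hypothetical; the sketch only illustrates how duplicating selected metadata properties in the index could let some requests be answered without reading the primary copy, which resembles what databases call a covering index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDefinition:
    """Per-index definition: name, indexed keys, and cached (duplicated) keys."""
    name: str
    indexed_keys: tuple
    cached_keys: tuple


class SecondaryIndex:
    """Hypothetical secondary index table for one bucket."""

    def __init__(self, definition):
        self.definition = definition
        # Maps a tuple of indexed values to {object_id: cached properties}.
        self.table = {}

    def add(self, object_id, metadata):
        """Index one object and duplicate its cached metadata properties."""
        entry_key = tuple(metadata.get(k) for k in self.definition.indexed_keys)
        cached = {k: metadata.get(k) for k in self.definition.cached_keys}
        self.table.setdefault(entry_key, {})[object_id] = cached

    def lookup(self, *values):
        """Return cached properties of every object matching the indexed values."""
        return self.table.get(tuple(values), {})

    def covers(self, requested_keys):
        """True if the index alone can supply all requested metadata keys."""
        available = set(self.definition.indexed_keys) | set(self.definition.cached_keys)
        return set(requested_keys) <= available
```

When `covers` returns true for the keys a request needs, this sketch can serve the request from `lookup` alone; otherwise the caller would still resolve each object id through the primary index and read the stored metadata.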
Specification