Object metadata query with distributed processing systems

US 10,318,491 B1
Filed: 03/31/2015
Issued: 06/11/2019
Est. Priority Date: 03/31/2015
Status: Active Grant

9 Assignments

0 Petitions

Abstract

A distributed object store can expose object metadata, in addition to object data, to distributed processing systems, such as Hadoop and Apache Spark. The distributed object store may act as a Hadoop Compatible File System (HCFS), exposing object metadata as a collection of records that can be efficiently processed by MapReduce (MR) and other distributed processing frameworks. A distributed processing job can specify a metadata query to narrow the set of objects returned. Related methods are also described.
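
As a rough illustration of the abstract's idea of exposing object metadata as a collection of records, the Python sketch below serializes per-object metadata as JSON lines (JSON is one of the record formats listed in the claims). The function names, the record layout, and the sample fields are hypothetical and are not taken from the patent.

```python
import json


def make_record(object_id, metadata):
    """Build one metadata record for a stored object (hypothetical layout)."""
    record = {"object_id": object_id}
    record.update(metadata)
    return record


def make_collection(objects, predicate=None):
    """Combine records for matching objects into JSON-lines text.

    `predicate` plays the role of the metadata query: only objects whose
    metadata satisfies it contribute a record.
    """
    lines = []
    for object_id, metadata in sorted(objects.items()):
        if predicate is None or predicate(metadata):
            record = make_record(object_id, metadata)
            lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


# Illustrative data only.
objects = {
    "obj-1": {"content_type": "image/png", "size": 2048},
    "obj-2": {"content_type": "text/plain", "size": 100},
}
collection = make_collection(objects, lambda m: m["size"] > 1000)
```

A line-oriented format is used here because each record can be processed independently, which is the property a MapReduce-style job would rely on when splitting the collection.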

139 Citations

20 Claims

1. A computer-implemented method comprising:
- providing one or more computer processors configured to perform:
  
  providing access, from a distributed processing system, to a distributed object store configured for storing and retrieving object data and associated metadata, where the distributed object store uniquely identifies objects contained therein using an object key comprising one or more namespace identifiers and a unique object identifier (id) within the identified namespace, wherein the namespace identifiers comprise a tenant id uniquely identifying a tenant within the distributed object store, and a bucket id uniquely identifying a bucket that comprises a plurality of objects, the bucket defined by and belonging to the tenant, wherein the distributed object store is configured as part of a distributed key-value store, the distributed key-value store comprising:
  
  a set of data and object metadata for the plurality of objects;
  
  a primary index configured for storing a mapping of object ids to storage locations, for the plurality of objects; and
  
  one or more secondary indexes each configured for storing information relating to properties of the plurality of objects different than the object ids and different than the set of data and metadata for the plurality of objects, wherein the one or more secondary indexes are each defined as part of a respective specified bucket, wherein each of the one or more secondary indexes comprises:
  
  information based on the properties of the object metadata itself and information based on object access patterns for one or more applications that query the distributed object store;
  
  a secondary index table maintaining a mapping between properties of the object metadata and cached object metadata properties for each stored object; and
  
  a plurality of secondary index definitions, the secondary index definitions comprising information about one or more secondary indexes that have been created in the distributed key-value store, the information about the one or more secondary indexes comprising, for each respective secondary index, index name, indexed metadata keys, and cached metadata keys, the cached metadata keys corresponding to duplicates of one or more of the cached object metadata properties;
  
  wherein providing duplicates of one or more of the cached object metadata properties is configured to improve data request performance by reducing time to access information about object metadata properties; and
  
  wherein at least one of the one or more secondary indexes is configured to improve the efficiency of responding to data requests;
  
  receiving a data request for object metadata from the distributed processing system, the data request associated with a first bucket within the distributed object store, wherein the first bucket comprises at least a first respective secondary index, the data request comprising an object metadata query, identifying one or more objects within the first bucket that satisfy the object metadata query;
  
  wherein the object metadata query includes at least one query predicate involving an object metadata key, wherein identifying the one or more objects within the first bucket that satisfy the object metadata query comprises:
  
  parsing the object metadata query into a query parse tree;
  
  generating a plurality of candidate query plans, each of the candidate query plans being semantically equivalent to the object metadata query;
  
  selecting one of the candidate query plans; and
  
  identifying one or more objects that satisfy the query predicate by retrieving object ids from the first respective secondary index using a first bucket id associated with the first bucket and the object metadata key involved in the query predicate;
  
  for each object identified as satisfying the object metadata query:
  
  determining a location of corresponding object metadata stored within the distributed object store;
  
  retrieving the corresponding object metadata using the determined location; and
  
  generating a metadata record from the corresponding object metadata;
  
  combining the metadata records from the identified objects into a metadata collection having a format compatible with the distributed processing system; and
  
  returning the metadata collection to the distributed processing system in connection with the response to the data request.
- Dependent claims (2-16):
  - 2. The method of claim 1 wherein the data request further identifies a partition, wherein identifying one or more objects as objects associated with the first bucket comprises identifying one or more objects as objects associated with the first bucket and the partition.
  - 3. The method of claim 1 wherein receiving the data request for object metadata from a distributed processing system comprises receiving a data request from a Hadoop cluster.
  - 4. The method of claim 3 wherein receiving the data request for object metadata comprises receiving a Hadoop Distributed File System (HDFS) DataNode request.
  - 5. The method of claim 3 further comprising receiving a Hadoop Distributed File System (HDFS) NameNode request from the distributed processing system, the HDFS NameNode request identifying a bucket within the distributed object store.
  - 6. The method of claim 3 wherein generating the metadata record from the corresponding object metadata comprises generating a record in Apache Avro format, Apache Thrift format, Apache Parquet format, Simple Key/Value format, JSON format, Hadoop SequenceFile format, or Google Protocol Buffer format.
  - 7. The method of claim 1 wherein receiving the data request for object metadata from a distributed processing system comprises receiving a data request from an Apache Spark cluster.
  - 8. The method of claim 7 wherein combining the metadata records into the metadata collection comprises forming a Resilient Distributed Dataset (RDD).
  - 9. The method of claim 1 where determining the location of corresponding object metadata stored within the distributed object store comprises using the distributed key/value store.
  - 10. The method of claim 1 wherein identifying the one or more objects in the first bucket comprises issuing a PREFIX-GET command to the distributed key/value store, the PREFIX-GET command identifying a tenant and the bucket for the one or more objects.
  - 11. The method of claim 1 wherein selecting one of candidate query plans comprises:
    - evaluating the candidate query plans based upon a cost model, the cost model based on usage of at least one of time and processing resources for the given candidate query plan; and
      
      selecting one of candidate query plans based upon the cost model evaluation, wherein the selected candidate query plan has the lowest cost.
  - 12. The method of claim 11 wherein generating candidate query plans includes:
    - generating at least one logical query plan according to the received query; and
      
      generating a plurality of physical query plans according to the logical query plan, wherein the selected query plan corresponds to one of the plurality of physical query plans.
  - 13. The method of claim 12 wherein generating a plurality of physical query plans comprises generating a tree representation, wherein nodes of the tree representation correspond to operations, the method further comprising traversing the nodes of the tree representation and executing the corresponding operations.
  - 14. The method of claim 11 wherein evaluating the candidate query plans based upon a cost model comprises utilizing statistical information about the first respective secondary index computed from the distributed key-value store.
  - 15. The method of claim 1 wherein retrieving object ids from the first respective secondary index comprises retrieving rows from the distributed key-value store using the bucket id associated with the query and the object metadata keys involved in the query predicate.
  - 16. The method of claim 1, wherein configuring at least one or more secondary indexes to improve the efficiency of responding to data requests comprises at least one of:
    - configuring a secondary index to be responsive to one or more query predicates to reduce the number of objects retrieved in response to the data request; and
      
      configuring at least one secondary index to organize entries within at least one secondary index table to improve efficiency of identifying entries that are responsive to a data request.
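
The key layout and index lookup recited in claims 1, 10, and 15 can be sketched roughly as follows. This is a minimal single-process Python illustration, assuming an ordered key-value store that supports prefix scans; the class `KeyValueStore`, the tuple key layout, the helper names, and the sample locations are all hypothetical and are not the patent's implementation.

```python
from bisect import bisect_left


class KeyValueStore:
    """Toy ordered key-value store standing in for the distributed one."""

    def __init__(self):
        self._keys = []    # tuple keys, kept sorted
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def prefix_get(self, prefix):
        """Return (key, value) pairs whose key starts with the tuple `prefix`."""
        start = bisect_left(self._keys, prefix)
        rows = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            rows.append((key, self._values[key]))
        return rows


def put_object(store, tenant, bucket, object_id, location, metadata, indexed_key):
    # Primary index: object id -> storage location, namespaced by tenant and bucket.
    store.put(("primary", tenant, bucket, object_id), location)
    # Secondary index row: keyed by the indexed metadata value, with the
    # metadata cached in the row (assumed layout).
    index_key = ("index", tenant, bucket, indexed_key, metadata[indexed_key], object_id)
    store.put(index_key, metadata)


def query_equals(store, tenant, bucket, key, value):
    """Find objects in one bucket where metadata[key] == value via the index."""
    rows = store.prefix_get(("index", tenant, bucket, key, value))
    object_ids = [index_key[-1] for index_key, _ in rows]
    return [(oid, store.get(("primary", tenant, bucket, oid))) for oid in object_ids]


store = KeyValueStore()
put_object(store, "acme", "photos", "obj-1", "node3:/d1/0042",
           {"content_type": "image/png"}, "content_type")
put_object(store, "acme", "photos", "obj-2", "node1:/d7/0007",
           {"content_type": "text/plain"}, "content_type")
put_object(store, "acme", "logs", "obj-3", "node2:/d2/0001",
           {"content_type": "image/png"}, "content_type")

matches = query_equals(store, "acme", "photos", "content_type", "image/png")
```

Because the tenant id and bucket id lead every key, a prefix scan stays inside one bucket, and the index scan returns object ids that are then resolved to storage locations through the primary index.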

17. A system comprising:
- one or more processors;
  
  a distributed key/value store in operable communication with the one or more processors, the distributed key/value store comprising:
  
  a distributed object store configured for storing and retrieving object data and associated metadata, where the distributed object store uniquely identifies objects contained therein using an object key comprising one or more namespace identifiers and a unique object identifier (id) within the identified namespace, wherein the namespace identifiers comprise a tenant id uniquely identifying a tenant within the distributed object store, and a bucket id uniquely identifying a bucket that comprises a collection of objects, the bucket defined by and belonging to the tenant;
  
  a set of data and object metadata for the plurality of objects;
  
  a primary index configured for storing a mapping of object ids to storage locations, for the plurality of objects; and
  
  one or more secondary indexes each configured for storing information relating to properties of the plurality of objects different than the object ids and different than the set of data and metadata for the plurality of objects, wherein the one or more secondary indexes are each defined as part of a respective specified bucket, wherein each of the one or more secondary indexes comprises:
  
  information based on the properties of the object metadata itself and information based on object access patterns for one or more applications that query the distributed object store;
  
  a secondary index table maintaining a mapping between properties of the object metadata and cached object metadata properties for each stored object; and
  
  a plurality of secondary index definitions, the secondary index definitions comprising information about one or more secondary indexes that have been created in the distributed key-value store, the information about the one or more secondary indexes comprising, for each respective secondary index, index name, indexed metadata keys, and cached metadata keys, the cached metadata keys corresponding to duplicates of one or more of the cached object metadata properties;
  
  wherein providing duplicates of one or more of the cached object metadata properties is configured to improve data request performance by reducing time to access information about object metadata properties; and
  
  wherein at least one of the one or more secondary indexes is configured to improve the efficiency of responding to data requests;
  
  a plurality of storage devices in operable communication with the one or more processors, the plurality of storage devices configured to store the object data and the object metadata; and
  
  a plurality of data service nodes in operable communication with the one or more processors, each of the data service nodes coupled to the distributed key/value store and corresponding ones of the storage devices, wherein a first one of the plurality of data service nodes comprises:
  
  an object storage engine to determine the location of object metadata stored within the plurality of storage devices using the distributed key/value store;
  
  a storage controller to retrieve the object metadata from the plurality of storage devices;
  
  a metadata formatting module to generate metadata records from object metadata and to combine metadata records into a metadata collection having a format compatible with a distributed processing system; and
  
  an interface configured to receive a data request for object metadata from the distributed processing system and, in cooperation with the controller, to return a metadata collection to the distributed processing system in response to the data request, wherein the data request is associated with a first bucket, wherein the first bucket comprises at least a first respective secondary index, the data request comprising an object metadata query that includes at least one predicate involving an object metadata key, wherein the controller is configured for identifying an object within the first bucket that satisfies the object metadata query by:
  
  parsing the object metadata query into a query parse tree;
  
  generating a plurality of candidate query plans, each of the candidate query plans being semantically equivalent to the object metadata query;
  
  selecting one of the candidate query plans; and
  
  identifying one or more objects that satisfy the query predicate by retrieving object ids from the first respective secondary index using a first bucket id associated with the first bucket and the object metadata key involved in the query predicate;
  
  for each object identified as satisfying the object metadata query:
  
  determining a location of corresponding object metadata stored within the distributed object store;
  
  retrieving the corresponding object metadata using the determined location; and
  
  generating a metadata record from the corresponding object metadata;
  
  combining the metadata records from the identified objects into a metadata collection having a format compatible with the distributed processing system; and
  
  returning the metadata collection to the distributed processing system in connection with the response to the data request.
- Dependent claims (18-20):
  - 18. The system of claim 17 wherein the distributed processing system comprises a Hadoop cluster.
  - 19. The system of claim 18 wherein the metadata formatting module generates metadata records in Apache Avro format, Apache Thrift format, Apache Parquet format, Simple Key/Value format, JSON format, Hadoop SequenceFile format, or Google Protocol Buffer format.
  - 20. The system of claim 17 wherein the distributed processing system comprises an Apache Spark cluster.
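
Both independent claims recite generating several semantically equivalent candidate query plans and selecting one, and claim 11 adds selection by lowest cost under a cost model. The Python sketch below shows one way such a selection could look. The two plan shapes, the `Plan` fields, and the cost weights are assumptions chosen for illustration; the claims do not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_scanned: int   # estimated rows read sequentially from the store
    lookups: int        # estimated point lookups to fetch full metadata


def cost(plan, scan_cost=1.0, lookup_cost=5.0):
    """Hypothetical cost model: weighted sum of scanned rows and lookups."""
    return plan.rows_scanned * scan_cost + plan.lookups * lookup_cost


def candidate_plans(bucket_size, matching_rows, index_covers_query):
    """Two equivalent physical plans for one single-predicate query."""
    # Plan A: read every object's metadata in the bucket and filter.
    full_scan = Plan("full_bucket_scan", bucket_size, 0)
    # Plan B: scan only matching index rows; if the index does not cache
    # every requested metadata key, each match needs an extra lookup.
    extra_lookups = 0 if index_covers_query else matching_rows
    index_scan = Plan("index_scan", matching_rows, extra_lookups)
    return [full_scan, index_scan]


def select_plan(plans):
    """Pick the candidate with the lowest estimated cost."""
    return min(plans, key=cost)


selective = select_plan(candidate_plans(10000, 100, False))
unselective = select_plan(candidate_plans(10000, 5000, False))
covered = select_plan(candidate_plans(10000, 5000, True))
```

Under these assumed weights, the index wins when the predicate is selective or when the cached metadata keys cover the query, and the full scan wins otherwise. The `matching_rows` estimate corresponds loosely to the index statistics mentioned in claim 14.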

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Graham, Stephen G.; Wright, Eron D.
Primary Examiner(s): Vital, Pierre M
Assistant Examiner(s): Sultana, Nargis
Application Number: US14/674,324
Time in Patent Office: 1,533 Days
Field of Search: 707718
US Class Current:
CPC Class Codes:
G06F 16/182   Distributed file systems
G06F 16/2465   Query processing support fo...
G06F 16/2471   Distributed queries
G06F 16/951   Indexing; Web crawling tech...
