Scalable distributed processing of RDF data

US 8,756,237 B2
Filed: 10/12/2012
Issued: 06/17/2014
Est. Priority Date: 10/12/2012
Status: Active Grant

Abstract

In general, techniques are described for an RDF (Resource Description Framework) database system which can scale to huge size for realistic data sets of practical interest. In some examples, a database system includes a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples of the RDF database. The database system also includes a working memory, a query interface that receives a query for the RDF database, a SPARQL engine that identifies a subset of the data chunks relevant to the query, and an index interface that includes one or more bulk loaders that load the subset of the data chunks to the working memory. The SPARQL engine executes the query only against triples included within the loaded subset of the data chunks to obtain a query result.
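
The flow in the abstract (identify relevant chunks through an index, bulk-load only those chunks, then query only the loaded triples) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `CHUNKS` store stands in for chunks on storage drives, the index is assumed to be keyed by triple subject, and `match` is a single-pattern matcher standing in for a SPARQL engine.

```python
from collections import defaultdict

# Hypothetical stand-in for data chunks persisted on storage drives.
# Each chunk is a list of (subject, predicate, object) triples.
CHUNKS = {
    "chunk-a": [("alice", "knows", "bob"), ("alice", "age", "30")],
    "chunk-b": [("bob", "knows", "carol")],
    "chunk-c": [("dave", "knows", "erin")],
}


def build_subject_index(chunks):
    """Map each subject to the ids of the chunks that mention it (assumed key choice)."""
    index = defaultdict(set)
    for chunk_id, triples in chunks.items():
        for subject, _, _ in triples:
            index[subject].add(chunk_id)
    return index


def load_chunks(chunks, chunk_ids):
    """Bulk-load the selected chunks into one working set of triples."""
    working = set()
    for chunk_id in sorted(chunk_ids):
        working.update(chunks[chunk_id])
    return working


def match(working, pattern):
    """Return the loaded triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in working
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


index = build_subject_index(CHUNKS)
working_memory = load_chunks(CHUNKS, index["alice"])      # only chunk-a is loaded
result = match(working_memory, ("alice", "knows", None))  # query sees loaded triples only
```

In this sketch the query never touches `chunk-b` or `chunk-c`, which is the property the abstract emphasizes. Note that the loaded chunk also carries a triple (`alice age 30`) that does not match the query.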

20 Claims

1. A method comprising:
- receiving, with a database system, a first query and a second query for a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
  
  accessing a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
  
  loading the first subset of the data chunks to a main memory associated with the database system;
  
  executing the first query only against triples included within the subset of the data chunks loaded to the main memory to obtain a query result for the first query;
  
  based on the query result for the first query, accessing a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks;
  
  loading the second subset of the data chunks to the main memory; and
  
  executing the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.
  - 2. The method of claim 1, wherein a first one of the first plurality of data chunks includes a first plurality of triples of the RDF database, wherein a second one of the first plurality of data chunks includes a second plurality of triples of the RDF database, wherein the first plurality of triples comprises a first RDF graph, wherein the second plurality of triples comprises a second RDF graph, wherein loading the first subset of the data chunks to a main memory comprises merging the first RDF graph and the second RDF graph to generate a combined RDF graph in the main memory, and wherein executing the first query comprises executing the first query only against the combined RDF graph.
  - 3. The method of claim 1, wherein the second query depends on the query result for the first query.
  - 4. The method of claim 1, wherein the first query comprises a graph pattern, and wherein the first query result comprises a subgraph of the labeled directed graph that matches the graph pattern.
  - 5. The method of claim 1, wherein at least one of the triples included within the first subset of the data chunks loaded to the main memory does not match any part of the first query.
  - 6. The method of claim 1, wherein a first one of the plurality of the data chunks and a second one of the plurality of the data chunks include a common one of the plurality of triples.
  - 7. The method of claim 1, wherein the first index is distributed across multiple hosts.
  - 8. The method of claim 1, wherein the first index comprises one or more key-value pairs that each comprises a key and a value that references a corresponding one of the plurality of data chunks.
  - 9. The method of claim 1, further comprising:
    - using the query result of the first query to select at least one data chunk of the first subset of data chunks to be cleared from main memory prior to loading the second subset of the data chunks.
  - 10. The method of claim 1, wherein a third index comprises keys defined by a third characteristic of the first subset of the data chunks, wherein identifying the first subset of the data chunks relevant to the first query comprises accessing the third index.
  - 11. The method of claim 1, wherein a number of triples included in a first one of the plurality of data chunks is different than a number of triples included in a second one of the plurality of data chunks.
  - 12. The method of claim 1, further comprising:
    - selecting fewer than all of the triples included within the first subset of the data chunks loaded to the main memory to be cleared; and
      
      freeing the main memory of the selected triples.
  - 13. The method of claim 1, wherein the first query comprises a script comprising:
    - a gather step for identifying the first subset of the data chunks relevant to the first query and loading the first subset of the data chunks to the main memory;
      
      a sift step for executing the first query only against triples included within the first subset of the data chunks loaded to the main memory; and
      
      a clear step for freeing the main memory of one or more of the triples included within the first subset of the data chunks loaded to the main memory.
  - 14. The method of claim 13, wherein the sift step comprises executing a SPARQL Protocol and RDF Query Language (SPARQL) query included in the script against triples included within the first subset of the data chunks loaded to the main memory.
  - 15. The method of claim 14, wherein the SPARQL query comprises a first SPARQL query, wherein the gather step comprises a first gather step and a second gather step, the method further comprising:
    - executing a second gather step for identifying the second subset of the data chunks and loading the second subset of the data chunks to the main memory; and
      
      executing a second SPARQL query included in the script against triples included within the second subset of the data chunks.
  - 16. The method of claim 13, further comprising:
    - fragmenting the script into a plurality of script fragments; and
      
      sending one of the plurality of script fragments to a remote instance of the database system for execution, wherein the remote instance stores a data chunk relevant to executing the script fragment sent for execution.
  - 17. The method of claim 16, further comprising:
    - receiving, with the database system, a query result fragment from the remote instance of the database system, wherein executing the first query only against triples included within the first subset of the data chunks loaded to the main memory to obtain the query result comprises merging the query result fragment with the first query result.
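
Claims 1, 9 and 13 together describe a chained gather, sift and clear cycle: a first index selects chunks for the first query, the first result drives a lookup in a second index, and the second query runs against both loaded subsets. The following Python sketch illustrates that sequence under stated assumptions. The `WorkingMemory` class, the chunk contents, and the choice of index keys (predicate for the first index, subject for the second) are all hypothetical, and the `sift` method is a single-pattern matcher rather than a SPARQL engine.

```python
# Hypothetical chunk store: chunk id -> list of (subject, predicate, object).
CHUNKS = {
    "people-1": [("alice", "worksFor", "acme"), ("alice", "age", "30")],
    "people-2": [("bob", "worksFor", "initech")],
    "orgs-1": [("acme", "locatedIn", "ithaca")],
    "orgs-2": [("initech", "locatedIn", "austin")],
}

# Two indexes keyed by different chunk characteristics (both assumed here).
PREDICATE_INDEX = {"worksFor": {"people-1", "people-2"}}
SUBJECT_INDEX = {"acme": {"orgs-1"}, "initech": {"orgs-2"}}


class WorkingMemory:
    """Tracks which chunks are loaded so they can be cleared selectively."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.loaded = {}  # chunk id -> set of triples currently in memory

    def gather(self, chunk_ids):
        """Load the given chunks into memory."""
        for chunk_id in chunk_ids:
            self.loaded[chunk_id] = set(self.chunks[chunk_id])

    def sift(self, pattern):
        """Match a pattern against loaded triples only; None is a wildcard."""
        triples = set().union(*self.loaded.values())
        return sorted(
            t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )

    def clear(self, chunk_ids):
        """Free the given chunks from memory."""
        for chunk_id in chunk_ids:
            self.loaded.pop(chunk_id, None)


memory = WorkingMemory(CHUNKS)

# First query: gather via the first index, then sift.
memory.gather(PREDICATE_INDEX["worksFor"])
first_result = memory.sift(("alice", "worksFor", None))

# Use the first result to clear chunks that contributed nothing to it.
unused = [cid for cid, triples in memory.loaded.items()
          if not triples & set(first_result)]
memory.clear(unused)

# Second query: the first result selects keys for the second index.
second_ids = set()
for _, _, employer in first_result:
    second_ids |= SUBJECT_INDEX.get(employer, set())
memory.gather(second_ids)
second_result = memory.sift(("acme", "locatedIn", None))
```

Clearing by whole chunk is a simplification chosen for the sketch; claim 12 also contemplates freeing individual triples, which would need finer-grained bookkeeping than shown here.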

18. A database system comprising:
- a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
  
  a main memory;
  
  a query interface that receives a first query and a second query for the RDF database;
  
  a query parser/evaluator that accesses a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
  
  an index interface that includes one or more bulk loaders that load the first subset of the data chunks to the main memory; and
  
  a SPARQL Protocol and RDF Query Language (SPARQL) engine that executes the first query only against triples included within the first subset of the data chunks loaded to the main memory to obtain a query result for the first query, wherein, based on the query result for the first query, the query parser/evaluator accesses a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks, wherein the one or more bulk loaders load the second subset of the data chunks to the main memory, wherein the SPARQL Protocol and RDF Query Language engine executes the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.
  - 19. The database system of claim 18, wherein the database system is distributed among a plurality of instances.
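
Claims 16, 17 and 19 describe distributing work across instances: a script is fragmented, each fragment goes to the instance that stores the relevant chunk, and the returned result fragments are merged. A rough Python sketch of that idea follows. The `INSTANCES` mapping, the function names, and the in-process call standing in for remote execution are all assumptions made for illustration, and merging by set union is only adequate for the single-pattern queries used here.

```python
# Hypothetical instances, each holding its own chunks:
# host name -> {chunk id -> list of (subject, predicate, object)}.
INSTANCES = {
    "host-1": {"chunk-a": [("alice", "knows", "bob")]},
    "host-2": {
        "chunk-b": [("bob", "knows", "carol")],
        "chunk-c": [("dave", "likes", "tea")],
    },
}


def fragment_script(chunk_ids, instances):
    """Split the requested chunk ids into one fragment per storing instance."""
    fragments = {}
    for host, chunks in instances.items():
        local = sorted(cid for cid in chunk_ids if cid in chunks)
        if local:
            fragments[host] = local
    return fragments


def run_fragment(host, chunk_ids, pattern, instances):
    """Stand-in for remote execution: gather and sift on a single instance."""
    triples = set()
    for cid in chunk_ids:
        triples.update(instances[host][cid])
    return {
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    }


def run_distributed(chunk_ids, pattern, instances):
    """Send each fragment to its instance and merge the result fragments."""
    merged = set()
    for host, local_ids in fragment_script(chunk_ids, instances).items():
        merged |= run_fragment(host, local_ids, pattern, instances)
    return sorted(merged)


fragments = fragment_script({"chunk-a", "chunk-b"}, INSTANCES)
results = run_distributed({"chunk-a", "chunk-b"}, (None, "knows", None), INSTANCES)
```

A query whose graph pattern joins triples stored on different instances would need more than a union of fragments; the sketch does not attempt that case.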

20. A non-transitory computer-readable storage device storing instructions for causing one or more programmable processors to:
- receive, with a database system, a first query and a second query for a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
  
  access a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
  
  load the first subset of the data chunks to a main memory associated with the database system;
  
  execute the first query only against triples included within the subset of the data chunks loaded to the main memory to obtain a query result for the first query;
  
  based on the query result for the first query, access a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks;
  
  load the second subset of the data chunks to the main memory; and
  
  execute the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.

Current Assignee: Architecture Technology Corporation
Original Assignee: Architecture Technology Corporation
Inventors: Stillerman, Matthew A.; Joyce, Robert A.
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Biskeborn, Kristofer M
Application Number: US13/651,235
Publication Number: US 20140108414A1
Time in Patent Office: 613 Days

Field of Search
US Class Current: 707/741
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 16/22   Indexing; Data structures t...
G06F 16/24552   Database cache management
