Scalable distributed processing of RDF data
Abstract
In general, techniques are described for an RDF (Resource Description Framework) database system which can scale to huge size for realistic data sets of practical interest. In some examples, a database system includes a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples of the RDF database. The database system also includes a working memory, a query interface that receives a query for the RDF database, a SPARQL engine that identifies a subset of the data chunks relevant to the query, and an index interface that includes one or more bulk loaders that load the subset of the data chunks to the working memory. The SPARQL engine executes the query only against triples included within the loaded subset of the data chunks to obtain a query result.
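The flow in the abstract (identify relevant chunks through an index, bulk-load only those chunks to working memory, then evaluate the query against the loaded triples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a simple triple pattern stands in for a SPARQL query, the index is assumed to be keyed by predicate, and every name here (`CHUNKS_ON_DISK`, `build_predicate_index`, `bulk_load`, `match`, `run_query`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical on-disk store: chunk id -> list of (subject, predicate, object)
# triples. Each chunk stands in for one RDF subgraph.
CHUNKS_ON_DISK = {
    "chunk-a": [("alice", "knows", "bob"), ("alice", "age", "30")],
    "chunk-b": [("bob", "knows", "carol")],
    "chunk-c": [("paris", "capitalOf", "france")],
}


def build_predicate_index(chunks):
    """Map each predicate to the ids of the chunks that contain it."""
    index = defaultdict(set)
    for chunk_id, triples in chunks.items():
        for _, predicate, _ in triples:
            index[predicate].add(chunk_id)
    return index


def bulk_load(chunks, chunk_ids, working_memory):
    """Copy the selected chunks into working memory (a plain dict here)."""
    for chunk_id in chunk_ids:
        working_memory.setdefault(chunk_id, chunks[chunk_id])


def match(pattern, working_memory):
    """Return loaded triples matching the pattern; None is a wildcard."""
    results = []
    for chunk_id in sorted(working_memory):
        for triple in working_memory[chunk_id]:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                results.append(triple)
    return results


def run_query(pattern, chunks, index, working_memory):
    """Assumes the pattern binds its predicate, which is the index key."""
    relevant = index.get(pattern[1], set())
    bulk_load(chunks, relevant, working_memory)
    # The query sees only triples already loaded to working memory.
    return match(pattern, working_memory)


index = build_predicate_index(CHUNKS_ON_DISK)
memory = {}
result = run_query((None, "knows", None), CHUNKS_ON_DISK, index, memory)
```

In this sketch `chunk-c` never leaves storage, because the predicate index does not list it for `knows`; only `chunk-a` and `chunk-b` are loaded and scanned.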
38 Citations
20 Claims
1. A method comprising:
- receiving, with a database system, a first query and a second query for a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
- accessing a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
- loading the first subset of the data chunks to a main memory associated with the database system;
- executing the first query only against triples included within the subset of the data chunks loaded to the main memory to obtain a query result for the first query;
- based on the query result for the first query, accessing a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks;
- loading the second subset of the data chunks to the main memory; and
- executing the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. A database system comprising:
- a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
- a main memory;
- a query interface that receives a first query and a second query for the RDF database;
- a query parser/evaluator that accesses a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
- an index interface that includes one or more bulk loaders that load the first subset of the data chunks to the main memory; and
- a SPARQL Protocol and RDF Query Language (SPARQL) engine that executes the first query only against triples included within the first subset of the data chunks loaded to the main memory to obtain a query result for the first query,
- wherein, based on the query result for the first query, the query parser/evaluator accesses a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks,
- wherein the one or more bulk loaders load the second subset of the data chunks to the main memory,
- wherein the SPARQL Protocol and RDF Query Language engine executes the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.
Dependent claims: 19
20. A non-transitory computer-readable storage device storing instructions for causing one or more programmable processors to:
- receive, with a database system, a first query and a second query for a Resource Description Framework (RDF) database that stores a plurality of data chunks to one or more storage drives, wherein each of the plurality of data chunks includes a plurality of triples that represent an RDF subgraph of the RDF database;
- access a first index that indexes one or more of the data chunks to identify a first subset of the data chunks relevant to the first query, wherein the first index comprises keys defined by a first characteristic of the first subset of the data chunks;
- load the first subset of the data chunks to a main memory associated with the database system;
- execute the first query only against triples included within the subset of the data chunks loaded to the main memory to obtain a query result for the first query;
- based on the query result for the first query, access a second index to identify a second subset of the data chunks relevant to the second query, wherein the second index comprises keys defined by a second characteristic of the second subset of the data chunks;
- load the second subset of the data chunks to the main memory; and
- execute the second query only against triples included within the first subset of the data chunks and the second subset of the data chunks loaded to the main memory to obtain a query result for the second query.
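The chained two-query sequence recited in the independent claims can be illustrated with a short sketch. It is one possible reading under loudly stated assumptions: the "first characteristic" is taken to be the predicate and the "second characteristic" the subject, triple patterns stand in for SPARQL queries, and all names (`DISK`, `build_index`, `execute`, `run_chained`) are hypothetical rather than drawn from the patent.

```python
from collections import defaultdict

# Hypothetical storage: chunk id -> (subject, predicate, object) triples.
DISK = {
    "c1": [("alice", "knows", "bob"), ("alice", "knows", "dave")],
    "c2": [("bob", "worksAt", "acme")],
    "c3": [("dave", "worksAt", "initech")],
    "c4": [("erin", "worksAt", "globex")],
}


def build_index(disk, position):
    """Index chunks by the term at `position` (0 = subject, 1 = predicate)."""
    index = defaultdict(set)
    for chunk_id, triples in disk.items():
        for triple in triples:
            index[triple[position]].add(chunk_id)
    return index


def execute(pattern, memory):
    """Match a pattern against loaded triples only; None is a wildcard."""
    return [
        triple
        for chunk_id in sorted(memory)
        for triple in memory[chunk_id]
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def run_chained(disk):
    predicate_index = build_index(disk, 1)  # assumed first characteristic
    subject_index = build_index(disk, 0)    # assumed second characteristic
    memory = {}

    # First query: the first index selects the first subset of chunks.
    first_ids = predicate_index.get("knows", set())
    memory.update({cid: disk[cid] for cid in first_ids})
    first_result = execute(("alice", "knows", None), memory)

    # The first result drives the lookup in the second index.
    second_ids = set()
    for _, _, friend in first_result:
        second_ids |= subject_index.get(friend, set())
    memory.update({cid: disk[cid] for cid in second_ids})

    # Second query runs against the first and second subsets together.
    second_result = execute((None, "worksAt", None), memory)
    return first_result, second_result, sorted(memory)


first_result, second_result, loaded = run_chained(DISK)
```

Chunk `c4` also contains a `worksAt` triple, but it is never loaded because the first result does not mention `erin`; the second query therefore sees only the triples in `c1`, `c2`, and `c3`.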
Specification