EFFICIENT LARGE-SCALE JOINING FOR QUERYING OF COLUMN BASED DATA ENCODED STRUCTURES

US 20100088309A1
Filed: 12/15/2008
Published: 04/08/2010
Est. Priority Date: 10/05/2008
Status: Abandoned Application

Abstract

The subject disclosure relates to querying column-based data encoded structures, enabling efficient query processing over large-scale data storage, and more specifically to join operations. Initially, a compact structure is received that represents the data according to a column-based organization and to various compression and data packing techniques, which already enable highly efficient, fast query responses in real time. On top of the fast querying already enabled by the compact column-oriented structure, a scalable, fast algorithm is provided for in-memory query processing. The algorithm constructs an auxiliary data structure, also column-oriented, for use in join operations; this structure further leverages the characteristics of in-memory data processing and access, as well as the column-oriented characteristics of the compact data structure.
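
As a rough illustration of what "integer encoded and compressed sequences of values" for a single column could look like, the sketch below dictionary-encodes a column and then run-length encodes the resulting IDs. The abstract does not name the specific encoding or compression techniques, so dictionary and run-length encoding are assumptions chosen for illustration, and the function names and the sample `city_column` are hypothetical.

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer ID, in first-seen order."""
    dictionary = {}
    ids = []
    for v in values:
        # setdefault assigns the next free ID only when v is new.
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, ids

def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    return [(key, len(list(group))) for key, group in groupby(ids)]

def run_length_decode(runs):
    """Expand (id, run_length) pairs back into the flat ID sequence."""
    return [key for key, count in runs for _ in range(count)]

# Hypothetical column of a table, stored on its own rather than row by row.
city_column = ["Oslo", "Oslo", "Oslo", "Lima", "Lima", "Oslo"]
dictionary, ids = dictionary_encode(city_column)
runs = run_length_encode(ids)
```

Here `ids` is the integer-encoded sequence and `runs` is its compressed form; a query engine working on such a structure can compare small integers instead of the original values.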

158 Citations

20 Claims

1. A method for processing data, comprising:
- in response to a query implicating at least one join operation over data in at least one data store, receiving a subset of data as integer encoded and compressed sequences of values corresponding to different columns of the data in the at least one data store;
  
  determining at least one result set for the at least one join operation including determining if a local cache includes any non-default values corresponding to columns implicated by the at least one join operation; and
  
  where the local cache includes any non-default values corresponding to columns implicated by the at least one join operation, substituting the non-default values when determining the at least one result set.
- - 2. The method of claim 1, further comprising:
    - storing at least one result of the at least one result set in the local cache for substitution in connection with a second query.
  - 3. The method of claim 2, wherein the storing includes lockless storing of the at least one result in memory.
  - 4. The method of claim 1, wherein the determining includes parallelizing the operations defined by the query with multiple processors and a corresponding number of segments divided from the sequences, each segment handled by at least one different processor.
  - 5. The method of claim 1, further comprising:
    - setting the local cache to default values prior to initiating query processing.
  - 6. The method of claim 5, wherein the setting includes setting the local cache to values of negative one (“−1”) prior to initiating query processing.
  - 7. The method of claim 1, wherein the substituting includes substituting the non-default values when determining the at least one result set instead of scanning the corresponding column in the sequence of values.
  - 8. The method of claim 1, further comprising, where the local cache includes default values corresponding to columns implicated by the at least one join operation, processing the corresponding column in the sequence of values to retrieve at least one result for the at least one result set.
  - 9. The method of claim 1, wherein the receiving includes receiving the subset of data from a relational database and wherein the different columns of the data correspond to columns of the relational database.
  - 10. A computer readable medium comprising computer executable instructions for performing the method of claim 1.
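
One possible reading of claims 1, 2, 5, 6, 7 and 8 is sketched below: a cache is set to −1 before query processing, a non-default entry is substituted instead of scanning the column, and a default entry triggers the scan and a store for later reuse. This is a minimal sketch rather than the claimed implementation; `scan_dimension`, `join_with_cache` and the sample data are hypothetical, and the sketch assumes encoded values are never negative so that −1 can serve as the default.

```python
DEFAULT = -1  # sentinel; assumes real encoded values are never negative

def scan_dimension(dim_keys, dim_values, fk_id):
    """Slow path: scan the key column for fk_id and return its value."""
    for row, key in enumerate(dim_keys):
        if key == fk_id:
            return dim_values[row]
    raise KeyError(fk_id)

def join_with_cache(fact_fk_ids, dim_keys, dim_values, cache):
    """Resolve each foreign key ID, scanning only on a default cache entry."""
    result, scans = [], 0
    for fk_id in fact_fk_ids:
        value = cache[fk_id]
        if value == DEFAULT:
            # Default entry: process the column, then store the result.
            value = scan_dimension(dim_keys, dim_values, fk_id)
            cache[fk_id] = value
            scans += 1
        # Non-default entry: the cached value is substituted directly.
        result.append(value)
    return result, scans

# Hypothetical integer-encoded dimension columns (key -> value).
dim_keys = [2, 0, 1]
dim_values = [20, 5, 11]

# Cache set to -1 before query processing starts.
cache = [DEFAULT] * len(dim_keys)

first, first_scans = join_with_cache([0, 2, 0, 1, 2, 2], dim_keys, dim_values, cache)
# A second query reuses whatever the first one left in the cache.
second, second_scans = join_with_cache([1, 1, 0], dim_keys, dim_values, cache)
```

In this sketch the first query scans three times for six rows, and the second query needs no scans at all because every key it touches is already cached.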

11. A method for query processing, including:
- generating a lazy cache shared by segments of compacted data retrieved in response to a query as integer encoded and compressed sequences of values corresponding to different columns of the data in at least one data store representing a set of tables; and
  
  in response to a query implicating at least one join operation over data in at least one data store, processing the query with reference to the lazy cache implicating at least one join operation over the at least one data store;
  
  wherein the processing includes populating the lazy cache with at least one data value from at least one table of the set of tables according to a predetermined algorithm for potential re-use of the at least one data value over the lifetime of the query processing.
- - 12. The method of claim 11, wherein the generating includes organizing the lazy cache according to at least one vector with values corresponding to the sequences of values corresponding to the different columns of data.
  - 13. The method of claim 11, wherein the processing further includes scanning the sequences of values wherein the processing includes populating the lazy cache with at least one data value from at least one table of the set of tables according to a predetermined algorithm for potential re-use of the at least one data value over the lifetime of the query processing.
  - 14. The method of claim 11, wherein the processing includes using foreign key data identifications (IDs) from the sequences of values as an index to the lazy cache.
  - 15. The method of claim 14, wherein the processing includes determining if a value of the lazy cache corresponding to a foreign key data ID is a default value.
  - 16. The method of claim 15, wherein if the value of the lazy cache is the default value, performing the at least one join operation over the sequences of values.
  - 17. The method of claim 14, wherein if the value of the lazy cache is not the default value, skipping the at least one join operation over the sequences of values, and using the value of the lazy cache corresponding to the foreign key data ID instead.
  - 18. The method of claim 11, wherein the processing includes receiving a result set and further including writing at least one result of the result set to the lazy cache as an atomic operation of a core processor data type that does not require a lock for consistency.
  - 19. A computing device comprising means for performing the method of claim 11.
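
Claims 11 to 18 can be pictured as a single vector shared by all segments of a column, indexed by foreign key ID and filled only for the IDs the query actually encounters. The sketch below is one hedged interpretation: `resolve`, `process_segment`, `run_query` and the sample `dim_table` are hypothetical, the dictionary lookup merely stands in for the real join work, and keys are assumed to be dense IDs starting at 0. Writes are made without a lock because every writer would store the same value in a given slot; a native implementation would presumably rely on an atomic machine-word store, whereas this Python version relies on CPython's single-slot list assignment.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT = -1  # sentinel; assumes real values are never negative

def resolve(fk_id, dim_table, lazy_cache):
    """Use the foreign key ID as an index into the shared lazy cache."""
    value = lazy_cache[fk_id]
    if value == DEFAULT:
        value = dim_table[fk_id]   # stands in for the actual join work
        lazy_cache[fk_id] = value  # lock-free: all writers store the same value
    return value

def process_segment(segment, dim_table, lazy_cache):
    """Aggregate one segment of the foreign key column."""
    return sum(resolve(fk_id, dim_table, lazy_cache) for fk_id in segment)

def run_query(fk_column, dim_table, segment_size, workers=4):
    """Split the column into segments and process them in parallel."""
    lazy_cache = [DEFAULT] * len(dim_table)  # one slot per dimension ID
    segments = [fk_column[i:i + segment_size]
                for i in range(0, len(fk_column), segment_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda seg: process_segment(seg, dim_table, lazy_cache), segments))
    return sum(partials), lazy_cache

# Hypothetical dimension table: dense integer ID -> integer-encoded value.
dim_table = {0: 10, 1: 20, 2: 30, 3: 40}
fk_column = [0, 1, 1, 2, 0, 2, 2, 1]

total, lazy_cache = run_query(fk_column, dim_table, segment_size=3)
```

The slot for ID 3 stays at the default because no row references it, which is the "lazy" aspect: the cache is populated only with values that turn out to be needed during the lifetime of the query.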

20. A device for processing data, comprising:
- high speed in memory storage for storing a subset of data received as integer encoded and compressed sequences of values corresponding to different columns of the data and for storing a vector of values corresponding to the different columns; and
  
  at least one query processor that processes the query over the subset of the data and that skips at least one join operation implicated by the query over the subset of data where a default value is found in the vector for a given column and substitutes a value of the vector for the at least one join operation instead.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Petculescu, Cristian; Netz, Amir
Application Number: US12/335,341
Publication Number: US 20100088309A1
US Class Current: 707/714
CPC Class Codes:
G06F 16/221   Column-oriented storage; Ma...
G06F 16/24552   Database cache management
G06F 16/2456   Join operations
