High-Performance Streaming Dictionary
First Claim
1. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
- providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module;
one or more data files being maintained by the file-maintenance module;
where a data file comprises a header, and one or more blocks;
where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes;
where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks;
where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry;
where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block;
where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node;
where a nonleaf node comprises(a) a fingerprint seed,(b) a fingerprint,(c) a counter c,(d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number;
where a leaf node comprises(a) a fingerprint seed,(b) a fingerprint,(c) a log sequence number,(d) a sorted array of zero or more leaf entries, and(e) a checksum, and(f) a descriptor version number;
the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node;
where a value comprises a length and a sequence of zero or more bytes;
where a pivot comprises a key;
where a key comprises a length and a sequence of zero or more bytes;
where a key-value pair comprises a key and a value;
where a message queue comprises a FIFO of zero or more messages;
where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f;
where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message;
where a message comprises(a) a message operation,(b) a key,(c) a value,(d) a transaction-identifier-sequence length, and(e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length;
where a leaf entry comprises(a) a transaction-record count, and(b) a key;
where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value;
where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count;
where a transaction record comprises a transaction record type, a transaction identifier, and a value;
where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation;
where a transaction identifier comprises a log sequence number;
where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation;
where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file;
inserting a message into nonleaf node comprising(a) identifying which children the key of the message is to be delivered to,(b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children,(c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children;
identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message;
flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block;
inserting a message into a leaf node, the insertion comprising(a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry,(b) if the message operation comprises an insert operation or a delete operation theni. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message,ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry,iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, andiv. storing the resulting leaf entry into the leaf node,(c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, theni. removing the innermost transaction record of the leaf entry,ii. while the innermost transaction record is a placeholder, removing said innermost transaction record,iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and(d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, theni. if the innermost transaction record of the leaf entry is a delete operation thenA. removing the innermost transaction record of the leaf entry,B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node,C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, andii. if the innermost transaction record of the leaf entry is an insert operation thenA. removing the innermost transaction record of the leaf entry,B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message,C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes;
where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes;
where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes;
where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes;
where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key;
where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node;
where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one;
where the logging module maintains a log, the log comprising a sequence of zero or more log entries;
a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again;
where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log;
where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file;
the front-end module establishing for each file, when the file is opened, a file identifier;
where a file identifier comprises a number;
where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum;
where a data file header further comprises a descriptor;
where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length;
where a fingerprint seed comprises an integer;
where a fingerprint comprises an integer;
where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry;
where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message;
where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs;
where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i;
where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state;
where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
4 Assignments
0 Petitions
Accused Products
Abstract
A method, apparatus and computer program product for storing data in a disk storage system is presented. A high-performance dictionary data structure is defined. The dictionary data structure is stored on a disk storage system. Key-value pairs can be inserted and deleted into the dictionary data structure. Updates run faster than one insertion per disk-head movement. The structure can also be stored on any system with two or more levels of memory. The dictionary is high performance and supports with full transactional semantics, concurrent access from multiple transactions, and logging and recovery. Keys can be looked up with only a logarithmic number of transfers, even for keys that have been recently inserted or deleted. Queries can be performed on ranges of key-value pairs, including recently inserted or deleted pairs, at a constant fraction of the bandwidth of the disk.
-
Citations
136 Claims
-
1. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks; where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry; where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a log sequence number, (d) a sorted array of zero or more leaf entries, and (e) a checksum, and (f) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, (c) a value, (d) a transaction-identifier-sequence length, and (e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length; where a leaf entry comprises (a) a transaction-record count, and (b) a key; where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value; where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count; where a transaction record comprises a transaction record type, a transaction identifier, and a value; where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation; where a transaction identifier comprises a log sequence number; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation or a delete operation then i. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message, ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry, iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, and iv. storing the resulting leaf entry into the leaf node, (c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. removing the innermost transaction record of the leaf entry, ii. while the innermost transaction record is a placeholder, removing said innermost transaction record, iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and (d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. if the innermost transaction record of the leaf entry is a delete operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node, C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and ii. if the innermost transaction record of the leaf entry is an insert operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message, C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one; where the logging module maintains a log, the log comprising a sequence of zero or more log entries; a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again; where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log; where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file; the front-end module establishing for each file, when the file is opened, a file identifier; where a file identifier comprises a number; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs; where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i; where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
2. (canceled)
-
3. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks; where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry; where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a sorted array of zero or more leaf entries, and (d) a checksum, and (e) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a key comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a value comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
4. (canceled)
-
5. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a log sequence number, (d) a sorted array of zero or more leaf entries, and (e) a checksum, and (f) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, (c) a value, (d) a transaction-identifier-sequence length, and (e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length; where a leaf entry comprises (a) a transaction-record count, and (b) a key; where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value; where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count; where a transaction record comprises a transaction record type, a transaction identifier, and a value; where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation; where a transaction identifier comprises a log sequence number; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation or a delete operation then i. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message, ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry, iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, and iv. storing the resulting leaf entry into the leaf node, (c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. removing the innermost transaction record of the leaf entry, ii. while the innermost transaction record is a placeholder, removing said innermost transaction record, iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and (d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. if the innermost transaction record of the leaf entry is a delete operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node, C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and ii. if the innermost transaction record of the leaf entry is an insert operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message, C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one; where the logging module maintains a log, the log comprising a sequence of zero or more log entries; a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again; where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log; where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file; the front-end module establishing for each file, when the file is opened, a file identifier; where a file identifier comprises a number; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
6. (canceled)
-
7. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a sorted array of zero or more leaf entries, and (b) a checksum, and (c) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a key comprises a length and a sequence of zero or more bytes; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
8. (canceled)
-
9. (canceled)
-
10. (canceled)
-
11. (canceled)
-
12. (canceled)
-
13. (canceled)
-
14. (canceled)
-
15. (canceled)
-
16. (canceled)
-
17. (canceled)
-
18. (canceled)
-
19. (canceled)
-
20. (canceled)
-
21. (canceled)
-
22. (canceled)
-
23. (canceled)
-
24. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a sorted array of zero or more leaf entries, and (b) a checksum, and (c) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
25. (canceled)
-
26. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a sorted array of zero or more leaf entries, and (d) a checksum, and (e) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs; where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
27. (canceled)
-
28. (canceled)
-
29. (canceled)
-
30. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a sorted array of zero or more leaf entries, and (b) a checksum, and (c) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a value comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs; where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
31. (canceled)
-
32. (canceled)
-
33. (canceled)
-
34. (canceled)
-
35. (canceled)
-
36. (canceled)
-
37. (canceled)
-
38. (canceled)
-
39. (canceled)
-
40. (canceled)
-
41. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a sorted array of zero or more leaf entries, and (b) a checksum, and (c) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a value comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs; where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
42. (canceled)
-
43. (canceled)
-
44. (canceled)
-
45. (canceled)
-
46. (canceled)
-
47. (canceled)
-
48. (canceled)
-
49. (canceled)
-
50. (canceled)
-
51. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks; where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry; where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a sorted array of zero or more leaf entries, and (b) a checksum, and (c) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a pivot comprises a key; where a value comprises a length and a sequence of zero or more bytes; where a key comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, and (c) a value; where a leaf entry comprises (a) a value, and (b) a key; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insertion operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
52. (canceled)
-
53. (canceled)
-
54. (canceled)
-
55. (canceled)
-
56. (canceled)
-
57. (canceled)
-
58. (canceled)
-
59. (canceled)
-
60. (canceled)
-
61. (canceled)
-
62. (canceled)
-
63. (canceled)
-
64. (canceled)
-
65. (canceled)
-
66. (canceled)
-
67. (canceled)
-
68. (canceled)
-
69. (canceled)
-
70. (canceled)
-
71. (canceled)
-
72. (canceled)
-
73. (canceled)
-
74. (canceled)
-
75. (canceled)
-
76. (canceled)
-
77. (canceled)
-
78. (canceled)
-
79. (canceled)
-
80. (canceled)
-
81. (canceled)
-
82. (canceled)
-
83. (canceled)
-
84. (canceled)
-
85. (canceled)
-
86. (canceled)
-
87. (canceled)
-
88. (canceled)
-
89. (canceled)
-
90. (canceled)
-
91. (canceled)
-
92. (canceled)
-
93. (canceled)
-
94. (canceled)
-
95. (canceled)
-
96. (canceled)
-
97. (canceled)
-
98. (canceled)
-
99. (canceled)
-
100. (canceled)
-
101. (canceled)
-
102. (canceled)
-
103. (canceled)
-
104. (canceled)
-
105. (canceled)
-
106. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a counter c, (b) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(c) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(d) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(e) a descriptor version number; where a leaf node comprises (a) a log sequence number, (b) a sorted array of zero or more leaf entries, and (c) a checksum, and (d) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a value comprises a length and a sequence of zero or more bytes; where a key comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, (c) a value, (d) a transaction-identifier-sequence length, and (e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length; where a leaf entry comprises (a) a transaction-record count, and (b) a key; where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value; where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count; where a transaction record comprises a transaction record type, a transaction identifier, and a value; where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation; where a transaction identifier comprises a log sequence number; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation or a delete operation then i. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message, ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry, iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, and iv. storing the resulting leaf entry into the leaf node, (c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. removing the innermost transaction record of the leaf entry, ii. while the innermost transaction record is a placeholder, removing said innermost transaction record, iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and (d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. if the innermost transaction record of the leaf entry is a delete operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node, C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and ii. if the innermost transaction record of the leaf entry is an insert operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message, C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one; where the logging module maintains a log, the log comprising a sequence of zero or more log entries; a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again; where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log; where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file; the front-end module establishing for each file, when the file is opened, a file identifier; where a file identifier comprises a number; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
107. (canceled)
-
108. (canceled)
-
109. (canceled)
-
110. (canceled)
-
111. (canceled)
-
112. (canceled)
-
113. (canceled)
-
114. (canceled)
-
115. (canceled)
-
116. (canceled)
-
117. (canceled)
-
118. (canceled)
-
119. A computer system comprising a processor;
- a main memory;
a secondary memory;
where one or both of the memories contains an encoded application for storing data on the secondary memory, that when the application is performed on the processor, provides a process for processing information, the process causing the computer system to perform the operations of organizing data in a data storage system, said process comprising;providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks; where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry; where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a log sequence number, (d) a sorted array of zero or more leaf entries, and (e) a checksum, and (f) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a pivot comprises a key; where a key comprises a length and a sequence of zero or more bytes; where a value comprises a length and a sequence of zero or more bytes; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, (c) a value, (d) a transaction-identifier-sequence length, and (e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length; where a leaf entry comprises (a) a transaction-record count, and (b) a key; where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value; where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count; where a transaction record comprises a transaction record type, a transaction identifier, and a value; where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation; where a transaction identifier comprises a log sequence number; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation or a delete operation then i. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message, ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry, iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, and iv. storing the resulting leaf entry into the leaf node, (c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. removing the innermost transaction record of the leaf entry, ii. while the innermost transaction record is a placeholder, removing said innermost transaction record, iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and (d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. if the innermost transaction record of the leaf entry is a delete operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node, C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and ii. if the innermost transaction record of the leaf entry is an insert operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message, C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one; where the logging module maintains a log, the log comprising a sequence of zero or more log entries; a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again; where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log; where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file; the front-end module establishing for each file, when the file is opened, a file identifier; where a file identifier comprises a number; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs; where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i; where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
- a main memory;
-
120. (canceled)
-
121. (canceled)
-
122. (canceled)
-
123. (canceled)
-
124. (canceled)
-
125. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:
-
providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, and a file-maintenance module; one or more data files being maintained by the file-maintenance module; where a data file comprises a header, and one or more blocks; where zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes; where a header comprises a root block number identifying a root block; where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node; where a nonleaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a counter c, (d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−
1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−
2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−
1, and(g) a descriptor version number; where a leaf node comprises (a) a fingerprint seed, (b) a fingerprint, (c) a log sequence number, (d) a sorted array of zero or more leaf entries, and (e) a checksum, and (f) a descriptor version number; the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node; where a key comprises a length and a sequence of zero or more bytes; where a value comprises a length and a sequence of zero or more bytes; where a pivot comprises a key; where a key-value pair comprises a key and a value; where a message queue comprises a FIFO of zero or more messages; where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f; where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message; where a message comprises (a) a message operation, (b) a key, (c) a value, (d) a transaction-identifier-sequence length, and (e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length; where a leaf entry comprises (a) a transaction-record count, and (b) a key; where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value; where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count; where a transaction record comprises a transaction record type, a transaction identifier, and a value; where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation; where a transaction identifier comprises a log sequence number; where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation; where, when the front-end module executes a insert of a key-value pair into a file in a transaction, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file; inserting a message into nonleaf node comprising (a) identifying which children the key of the message is to be delivered to, (b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children, (c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children; identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation or a delete operation then i. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message, ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry, iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, and iv. storing the resulting leaf entry into the leaf node, (c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. removing the innermost transaction record of the leaf entry, ii. while the innermost transaction record is a placeholder, removing said innermost transaction record, iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and (d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, then i. if the innermost transaction record of the leaf entry is a delete operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node, C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and ii. if the innermost transaction record of the leaf entry is an insert operation then A. removing the innermost transaction record of the leaf entry, B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message, C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes; where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one; where the logging module maintains a log, the log comprising a sequence of zero or more log entries; a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again; where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log; where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file; the front-end module establishing for each file, when the file is opened, a file identifier; where a file identifier comprises a number; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where a fingerprint seed comprises an integer; where a fingerprint comprises an integer; where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry; where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message; where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.
-
-
126. (canceled)
-
127. (canceled)
-
128. (canceled)
-
129. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;a non-leaf node further comprising pivot keys from its child nodes and optionally further comprising pointers to some grandchild nodes; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a grandchild node by comparing the key to the pivot keys of the child nodes, and (b) if the pointer to the grandchild node is present, then searching the grandchild node for the key and returning the result of the search, (c) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered after the last key, (d) searching the ith child node for the key.
-
-
130. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered after the last key, (b) searching the ith message queue for messages matching the key, where if a message matching the key is a delete message, then returning no key, and if the message matching the key is an insert message then returning the pair from the message, and if no key is found then (c) searching the ith child node for the key.
-
-
131. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordereed after the last key, (b) searching the ith child node for the key.
-
-
132. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;a non-leaf node further comprising pivot keys from its child nodes and optionally further comprising pointers to some grandchild nodes; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a grandchild node by comparing the key to the pivot keys of the child nodes, and (b) if the pointer to the grandchild node is present, then searching the grandchild node for the key and returning the result of the search, (c) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordereed after the last key, (d) searching the ith child node for the key.
-
-
133. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered after the last key, (b) searching the ith message queue for messages matching the key, where if a message matching the key is a delete message, then returning no key, and if the message matching the key is an insert message then returning the pair from the message, and if no key is found then (c) searching the ith child node for the key.
-
-
134. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes;
-
a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−
1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child;non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key, the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys keys in the node the key of message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that match the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered after the last key, (b) searching the ith child node for the key.
-
-
135. A system comprising a loader, the loader comprising distinct software modules, where the distinct software modules comprise comprise a data source module, a buffer module, one or more primary extractor modules, one or more sorter modules, one or more blocker modules, and are or more compressor/writer modules,
where the data source module is connected to the buffer module, the buffer module is connected to the primary extractor modules, each extractor module is connected to a sorter module, each sorter module is connected to a blocker module, and each blocker module is connected to one or more compressor/writer modules; -
where the loader module accepts data comprising rows from an external source and passes the data to the buffer module, where the buffer module buffers the data, and passes a copy of the data to each extractor module, where an extractor module transforms each row into a new row, and passes the new row to a sorter module, where a sorter module sorts all of the rows it receives, and passes the sorted rows to a blocker module, where a blocker module organizes the rows into blocks, the blocks comprising a tree-structured dictionary, and passes the blocks to a compressor/writer module, where a compressor/writer module compresses each block it receives and writes the compressed block to a file. - View Dependent Claims (136)
-
Specification