×

High-Performance Streaming Dictionary

  • US 20110246503A1
  • Filed: 04/06/2010
  • Published: 10/06/2011
  • Est. Priority Date: 04/06/2010
  • Status: Active Grant
First Claim
Patent Images

1. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method for storing data, said method comprising:

  • providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise a front-end module, a buffer-pool module, a recovery module, a logging module, a locking module, and a file-maintenance module;

    one or more data files being maintained by the file-maintenance module;

    where a data file comprises a header, and one or more blocks;

    where one or more of the blocks comprise bytes encoding a block translation table, and zero or more of the blocks comprise bytes encoding leaf nodes, and zero or more of the blocks comprise bytes encoding nonleaf nodes;

    where a header comprises a root block number identifying a root block and a set of one or more block numbers identifying block translation tables blocks;

    where a block translation table comprises a number and an array, the number specifying the length of the array, the array indexed by a block number, each element of the array comprising a translation entry;

    where a translation entry comprises a file offset and a size, the file offset and size identifying a block of zero or more bytes in the file, the block number indexing the entry thus identifying the block;

    where the block identified by the root block number comprises an encoding of a leaf node or a nonleaf node;

    where a nonleaf node comprises(a) a fingerprint seed,(b) a fingerprint,(c) a counter c,(d) one or more child block numbers, where the number of child block numbers equals c, each child block number identifying a block comprising a leaf node or nonleaf node, the children being numbered consecutively from 0 to c−

    1,(e) zero or more pivots, where the number of pivots equals one less than the number of child block numbers, the pivots being numbered consecutively from 0 to c−

    2, and(f) one or more message queues, where the number of message queues equals c, the message queues being numbered consecutively from 0 to c−

    1, and(g) a descriptor version number;

    where a leaf node comprises(a) a fingerprint seed,(b) a fingerprint,(c) a log sequence number,(d) a sorted array of zero or more leaf entries, and(e) a checksum, and(f) a descriptor version number;

    the leaf and nonleaf nodes providing a tree structure, in which the leaves of the tree comprise the leaf nodes and the internal nodes of the tree comprise the nonleaf nodes, where the root block number identifies the root of the tree, and for each nonleaf node, the child block numbers of the node identify the respective children of the node;

    where a value comprises a length and a sequence of zero or more bytes;

    where a pivot comprises a key;

    where a key comprises a length and a sequence of zero or more bytes;

    where a key-value pair comprises a key and a value;

    where a message queue comprises a FIFO of zero or more messages;

    where the FIFO of messages comprises a counter f and a sequence of zero or messages where the number of messages in the sequence equals f;

    where the FIFO of messages supports operations comprising enqueueing a message and dequeueing the least recently enqueued message;

    where a message comprises(a) a message operation,(b) a key,(c) a value,(d) a transaction-identifier-sequence length, and(e) a sequence of zero or more transaction identifiers, where the number of transaction identifiers equals the transaction-identifier-sequence length;

    where a leaf entry comprises(a) a transaction-record count, and(b) a key;

    where for a leaf entry in which the transaction-record count equals one, the leaf entry further comprises a value;

    where for a leaf entry in which the transaction-record count is greater than one, the leaf entry further comprises a sequence of transaction records equal in length to the transaction-record count;

    where a transaction record comprises a transaction record type, a transaction identifier, and a value;

    where a transaction record type comprises an operation comprising an insert operation, a delete operation, and a placeholder operation;

    where a transaction identifier comprises a log sequence number;

    where a message operation comprises an insert operation, a delete operation, an abort operation, and a commit operation;

    where, when the front-end module executes a insert of a key-value pair into a file in a transaction, the front-end module causes the locking module to acquire a exclusive lock on the key, and the front-end module creates a message in which the message operation comprises an insert operation, the key length and key comprise the key of the key-value pair, the value length and value comprise the value of the key-value pair, the transaction-identifier-sequence length comprises the depth of the transaction, where a transaction with no parents has depth one, and a transaction with a parent has depth equal to one plus the depth of the parent, and the sequence of transaction identifiers comprises the transaction identifier of the transaction, and of each of the transaction'"'"'s ancestors, the front-end module then causing the file-maintenance module to insert the message into the root block identified in the header of the file;

    inserting a message into nonleaf node comprising(a) identifying which children the key of the message is to be delivered to,(b) inserting the message into the FIFOs of the message queues respectively corresponding to the identified children,(c) determining if the total size of the node is larger than a specified value, and if so then flushing message queues of one or more children;

    identifying which child a message to is be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message;

    flushing message queues of one or more children comprising selecting a nonempty message queue, fetching the corresponding child block into main memory using the buffer-pool module, and removing one or more messages from the message queue and inserting said messages into the child block;

    inserting a message into a leaf node, the insertion comprising(a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry,(b) if the message operation comprises an insert operation or a delete operation theni. discarding any existing transaction record in the leaf entry that has the same innermost transaction identifier as the message,ii. adding placeholder transaction records for every ancestor transaction identifier in the message that is not listed in the leaf entry,iii. adding an insert or delete transaction record, depending on the operation of the message, for the innermost transaction of the message, the value length and value of the transaction record being copied from the message, andiv. storing the resulting leaf entry into the leaf node,(c) if the message operation comprises an abort operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, theni. removing the innermost transaction record of the leaf entry,ii. while the innermost transaction record is a placeholder, removing said innermost transaction record,iii. if the resulting leaf entry is empty, then deleting the original leaf entry, if any, from the leaf node, otherwise storing the new leaf entry into the leaf node, and(d) if the message operation comprises a commit operation and if the innermost transaction identifier of the leaf entry equals the innermost transaction identifier of the message, theni. if the innermost transaction record of the leaf entry is a delete operation thenA. removing the innermost transaction record of the leaf entry,B. if the resulting leaf entry is empty, then deleting the leaf entry from the leaf node,C. otherwise replacing the new innermost transaction record with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, andii. if the innermost transaction record of the leaf entry is an insert operation thenA. removing the innermost transaction record of the leaf entry,B. if the resulting leaf entry is empty, then replacing the old leaf entry with a new leaf entry with transaction-record count equal to one and with the key and value copied from the message,C. otherwise replacing the new innermost transaction record of the leaf entry with the contents of the removed innermost transaction record without changing the transaction identifier of the new innermost transaction record, and where when a leaf node is modified, if it becomes larger than a specified size, the leaf node is split by the file-maintenance module into two leaf nodes, and the parent of the leaf node, if it exists, is modified to include a new pivot key that separates the two leaf nodes, and to include pointers to the two leaf nodes, and if there is no parent of the leaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes;

    where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot keys and the pointers to the leaf nodes;

    where when a leaf node is modified, it becomes smaller than a specified size, and if there is a left or right sibling to the leaf, then the leaf and the sibling are merged into one leaf node, and the parent of the two leaf nodes is modified to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes;

    where when a nonleaf node is modified to remove a pivot and pointer, if the fanout of the nonleaf node therefore becomes smaller than a specified value, and if there is a sibling to the nonleaf node, the nonleaf node and its siblings are merged, and their parent is updated to remove the pivot that separates the two nodes and to point at the one remaining node instead of the two original leaf nodes;

    where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the locking module to acquire a shared lock on the key and causes the file-maintenance module to traverse the tree, searching the root node for the key;

    where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, flushing the message queue of that child, and then searching the child node;

    where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key as the message, if such key exists, and returning the associated value from the innermost transaction record of the leaf entry, if the transaction record count is greater than one, or returning the value of the leaf entry if the transaction record count equals one;

    where the logging module maintains a log, the log comprising a sequence of zero or more log entries;

    a log entry comprising a sequence of bytes encoding, in order, a log sequence number, a length, log entry data, a checksum, and the length again;

    where a log sequence number comprises a number, where the log sequence number in one log entry of the log is one less than the log sequence number in the following log entry in the log;

    where when the front-end module executes an insert of a key-value pair into a file in a transaction, the front-end module causes the logging module to create a log entry in which the log data comprises bytes that encode an insert operation, the key, the value, the transaction identifier of the transaction, and a file identifier identifying the file;

    the front-end module establishing for each file, when the file is opened, a file identifier;

    where a file identifier comprises a number;

    where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum;

    where a data file header further comprises a descriptor;

    where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length;

    where a fingerprint seed comprises an integer;

    where a fingerprint comprises an integer;

    where the fingerprint of a leaf node is calculated as the sum over all the leaf entries of the node of the product of the fingerprint seed of the node with the checksum of the leaf entry;

    where the fingerprint of a nonleaf node is calculated as the sum of the fingerprints of the children nodes of the nonleaf node further summed with the sum over all the messages in the message queues in the node of the product of the of the fingerprint seed of the node with the checksum of the message;

    where when the front-end module creates or opens a file, the front-end module provides a comparison function that, given two keys and the descriptor from the file header, compares the two keys to determine whether the two keys are to be considered equal or else which one is to be ordered ahead of the other, thus ordering key-value pairs;

    where the file-maintenance module maintains the tree of each file so that in a nonleaf node, for each given a pivot key numbered i the node, the pivot key is ordered to be greater than or equal to any of the key-value pairs stored in child subtrees or message queues numbered less than or equal to i, and the pivot key is ordered to be less than or equal to any of the key-value pairs stored in child subtrees or message queues numbered greater than i, and the pivot key numbered i is ordered to be before any pivot keys numbered greater than i;

    where when recovering from a crash, the recovery module reads the log and acting on each log entry to restore the data stored into a consistent state;

    where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.

View all claims
  • 4 Assignments
Timeline View
Assignment View
    ×
    ×