System and method for consistent timestamping in distributed computer databases
First Claim
1. A transaction management method for use in a distributed database system having a plurality of interconnected nodes,the steps of the method comprising:
- (a) transmitting a prepare-to-commit message to each node in said system which is a cohort of a distributed transaction;
(b) each cohort of said distributed transaction receiving said prepare-to-commit message and then voting on a disposition of said distributed transaction, said disposition being selected from the set consisting of aborting said transaction and committing said transaction;
(c) each cohort voting to commit said transaction also voting a commit time range including an earliest time acceptable to said cohort for committing said transaction and a latest time acceptable to said cohort for committing said transaction;
(d) whenever said cohorts all vote to commit said transaction and said time ranges voted by said cohorts have a non-empty intersection, committing said transaction and selecting a transaction time for said transaction from the intersection of said time ranges voted by said cohorts;
(e) aborting said transaction whenever any of said cohorts vote to abort said transaction;
(f) aborting said transaction whenever said time ranges voted by said cohorts do not intersect; and
(g) repeating said steps (a) through (f) for a multiplicity of subsequent distributed transactions; and
in a first distributed transaction whose cohorts include a read-only cohort that updates no data values during said first distributed transaction, setting a read lock in said read-only cohort for each datum accessed by said read-only cohort while performing said first distributed transaction, storing in said read-only cohort data indicating that each said read lock was set by said first distributed transaction, and releasing each said read lock no later than the latest time voted by said read-only cohort for said first distributed transaction.
4 Assignments
0 Petitions
Accused Products
Abstract
A distributed database system has a plurality of databases located at distinct nodes, at least one of the databases comprising a timestamping database. Distributed transactions are committed using a two phase protocol. During the first phase, each cohort to the transaction votes to commit or abort the transaction, and also votes an earliest time and a latest time at which the transaction is to be committed. If all the cohorts vote to commit the transaction and the intersection of the voted time ranges is not empty, then the transaction is committed during the second phase of the protocol. A transaction time is selected from the intersection of the voted time ranges and is used to timestamp all updated data that is durably stored when the transaction is committed. Before the first phase of the two phase commit protocol, each transaction read or write locks data at each node for which it needs read or write access. Whenever a transaction enters the first phase of the commit protocol, read locks for that transaction can be converted into delay locks. Any transaction which obtains a write lock on delay locked data is a "delayed transaction". The delayed transaction votes a time range which guarantees that it will commit at a time which is later than the time at which the transactions with the delay locks commit. This combination of time range voting and delay locking ensures that the timestamp order of transactions is consistent throughout the distributed database and is consistent with a valid serialization order of the transactions.
176 Citations
11 Claims
-
1. A transaction management method for use in a distributed database system having a plurality of interconnected nodes,
the steps of the method comprising: -
(a) transmitting a prepare-to-commit message to each node in said system which is a cohort of a distributed transaction; (b) each cohort of said distributed transaction receiving said prepare-to-commit message and then voting on a disposition of said distributed transaction, said disposition being selected from the set consisting of aborting said transaction and committing said transaction; (c) each cohort voting to commit said transaction also voting a commit time range including an earliest time acceptable to said cohort for committing said transaction and a latest time acceptable to said cohort for committing said transaction; (d) whenever said cohorts all vote to commit said transaction and said time ranges voted by said cohorts have a non-empty intersection, committing said transaction and selecting a transaction time for said transaction from the intersection of said time ranges voted by said cohorts; (e) aborting said transaction whenever any of said cohorts vote to abort said transaction; (f) aborting said transaction whenever said time ranges voted by said cohorts do not intersect; and (g) repeating said steps (a) through (f) for a multiplicity of subsequent distributed transactions; and in a first distributed transaction whose cohorts include a read-only cohort that updates no data values during said first distributed transaction, setting a read lock in said read-only cohort for each datum accessed by said read-only cohort while performing said first distributed transaction, storing in said read-only cohort data indicating that each said read lock was set by said first distributed transaction, and releasing each said read lock no later than the latest time voted by said read-only cohort for said first distributed transaction. - View Dependent Claims (2, 9)
-
-
3. A distributed database system, comprising:
-
a plurality of interconnected nodes, each said node including a database, a transaction manager that coordinates commitment of transactions in which said node is a cohort, and a lock manager that governs access to the database at said node; said transaction manager on each node including means for; preparing to commit transactions in which said node is a cohort and voting to commit each such transaction within a time range specified by an earliest acceptable time and a latest acceptable time; selecting a transaction time for each transaction from the intersection of said time ranges voted by nodes that are cohorts to said each transaction; and committing each said transaction at the transaction time selected for that transaction; said lock manager including means for; for each transaction, setting a read lock on each datum to which said each transaction has obtained read access; allowing each transaction write access to data that has been read locked by another transaction, wherein said write access is allowed prior to termination of the other transaction but only after said transaction manager has prepared to commit said other transaction; marking as a delayed transaction each transaction allowed write access to data that is read locked by another transaction; and storing, with respect to each said delayed transaction, data representing a set of delaying transactions comprising transactions that have read locked data to which said delayed transaction has been allowed write access; said transaction manager further including means for selecting said time range such that, for each said delayed transaction, said earliest acceptable time is later than (A) said selected transaction time of each said delaying transaction that has committed, and (B) said latest acceptable time voted by said transaction manager for each delaying transaction that has prepared but not yet committed. - View Dependent Claims (4, 5, 6, 7, 8)
-
-
10. A distributed database system, comprising:
-
a plurality of interconnected nodes, each said node including a database, a transaction manager that coordinates commitment of distributed transactions in which said node is a cohort, and lock manager that governs access to the database at said node; said transaction manager on each node including means for; preparing to commit distributed transactions in which said node is a cohort and voting on a disposition of each such distributed transaction, said disposition being selected from the set consisting of aborting said each distributed transaction and committing said each distributed transaction;
each vote to commit including an time range comprising an earliest time acceptable to said cohort for committing said each distributed transaction and a latest time acceptable to said cohort for committing said each distributed transaction;selecting a transaction time for said each distributed transaction from the intersection of said time ranges voted by nodes that are cohorts to said each distributed transaction; committing said each distributed transaction at the transaction time selected for that distributed transaction when all cohorts of that distributed transaction vote to commit that distributed transaction; aborting each distributed transaction for which any cohort voted to abort said each distributed transaction; and aborting said each distributed transaction whenever said time ranges voted by said cohorts to said each distributed transaction do not intersect; and said lock manager on each node including means for; setting a read lock on each datum located at said node and to which said each distributed transaction has obtained read access; for each distributed transaction, setting a read lock on each datum located at said node and to which said each distributed transaction has obtained read access, and storing data in said node indicating which distributed transaction said read lock was set by; and when said node is a read-only cohort of a first distributed transaction, releasing each said read lock in said read-only cohort which was set by said first distributed transaction no later than the latest time voted by said read-only cohort for said first distributed transaction.
-
-
11. A transaction management method for use in a distributed database system having a plurality of interconnected nodes,
the steps of the method comprising: -
(a) transmitting a prepare-to-commit message to each node in said system which is a cohort of a distributed transaction; (b) each cohort of said distributed transaction receiving said prepare-to-commit message and then voting on a disposition of said distributed transaction, said disposition being selected from the set consisting of aborting said distributed transaction and committing said distributed transaction; (c) each cohort voting to commit said distributed transaction also voting a commit time range including an earliest time acceptable to said cohort for committing said distributed transaction and a latest time acceptable to said cohort for committing said distributed transaction; (d) whenever said cohorts all vote to commit said distributed transaction and said time ranges voted by said cohorts have a non-empty intersection, committing said distributed transaction and selecting a transaction time for said distributed transaction from the intersection of said time ranges voted by said cohorts; (e) aborting said distributed transaction whenever any of said cohorts vote to abort said distributed transaction; (f) aborting said distributed transaction whenever said time ranges voted by said cohorts do not intersect; and (g) repeating said steps (a) through (f) for a multiplicity of subsequent distributed transactions; and setting a read lock on each datum to which said distributed transaction has obtained read access; allowing each distributed transaction write access to data that has been read locked by another distributed transaction, wherein said write access is allowed prior to termination of the other distributed transaction but only after said node has voted to commit said other distributed transaction; marking as a delayed transaction each distributed transaction allowed write access to data that is read locked by another distributed transaction; and storing, with respect to each said delayed transaction, data representing a set of delaying transactions comprising transactions that have read locked data to which said delayed transaction has been allowed write access; said step (c) including selecting said commit time range such that, for each said delayed transaction, said earliest acceptable time is later than (A) said selected transaction time of each respective delaying transaction that has committed, and (B) said latest acceptable time voted by said cohort for each delaying transaction that has not yet committed.
-
Specification