Continuous flow checkpointing data processing
First Claim
1. A method for continuous flow checkpointing in a data processing system having at least one process stage comprising a data flow and at least two processes linked by the data flow, the method including:
- (a) propagating at least one command message through the process stage as part of the data flow;
(b) checkpointing each process within the process stage in response to receipt by each process of at least one command message.
4 Assignments
0 Petitions
Accused Products
Abstract
A data processing system and method that provides checkpointing and permits a continuous flow of data processing by allowing each process to return to operation after checkpointing, independently of the time required by other processes to checkpoint their state. Checkpointing in accordance with the invention makes use of a command message from a checkpoint processor that sequentially propagates through a process stage from data sources through processes to data sinks, triggering each process to checkpoint its state and then pass on a checkpointing message to connected “downstream” processes. This approach provides checkpointing and permits a continuous flow of data processing by allowing each process to return to normal operation after checkpointing, independently of the time required by other processes to checkpoint their state.
-
Citations
48 Claims
-
1. A method for continuous flow checkpointing in a data processing system having at least one process stage comprising a data flow and at least two processes linked by the data flow, the method including:
-
(a) propagating at least one command message through the process stage as part of the data flow;
(b) checkpointing each process within the process stage in response to receipt by each process of at least one command message. - View Dependent Claims (2, 3, 4)
(a) determining that the state of the data processing system needs to be restored;
(b) restoring each process to a corresponding saved state.
-
-
3. The method of claim 1, wherein at least one of such linked processes is a source or a sink.
-
4. The method of claim 1, wherein checkpointing includes suspending normal processing, saving a corresponding state, and returning to normal processing.
-
5. A method for continuous flow checkpointing in a data processing system having one or more sources for receiving and storing input data, one or more processes for receiving and processing data from one or more sources or prior processes, and one or more sinks for receiving processed data from one or more processes or sources and for publishing processed data, the method including:
-
(a) transmitting a checkpoint request message to every source;
(b) suspending normal data processing in each source in response to receipt of such checkpoint request message, saving a current checkpoint record sufficient to reconstruct the state of such source, propagating a checkpoint message from such source to any process that consumes data from such source, and resuming normal data processing in each source;
(c) suspending normal data processing in each process in response to receiving checkpoint messages from every source or prior process from which such process consumes data, saving a current checkpoint record sufficient to reconstruct the state of such process, propagating the checkpoint message from such process to any process or sink that consumes data from such process, and resuming normal data processing in such process;
(d) suspending normal data processing in each sink in response to receiving checkpoint messages from every process from which such sink consumes data, saving a current checkpoint record sufficient to reconstruct the state of such sink, saving any unpublished data, and resuming normal data processing in such sink. - View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
(a) determining that the state of the data processing system needs to be restored;
(b) restoring the state of each source, process, and sink from a corresponding current checkpoint record.
-
-
12. The method of claim 5, further including generating the checkpoint request message in response to detecting a checkpoint trigger event.
-
13. The method of claim 12, wherein the checkpoint trigger event occurs periodically.
-
14. The method of claim 12, wherein the checkpoint trigger event is based on an external stimulus.
-
15. The method of claim 12, wherein the checkpoint trigger event is based on occurrence of selected data values within or derived from incoming data records being processed.
-
16. The method of claim 12, further including:
-
(a) scanning incoming data records within each source for a selected data value;
(b) upon detecting the selected data value within each source, transmitting a control message to any process that consumes data from such source, the control message indicating that an end of data has occurred, and requesting checkpointing;
(c) determining that a checkpoint trigger event has occurred once a control message is transmitted by every source.
-
-
17. The method of claim 12, further including:
-
(a) examining incoming data records within each source and determining a selected data value based on such examination;
(b) providing the selected data value to each source;
(c) scanning incoming data records within each source for the selected data value;
(d) upon detecting the selected data value within each source, transmitting a control message to any process that consumes data from such source, the control message indicating that an end of data has occurred, and requesting checkpointing;
(e) determining that a checkpoint trigger event has occurred once a control message is transmitted by every source.
-
-
18. The method of claim 5, further including coordinating checkpointing with periodic production of output from the sinks.
-
19. The method of claim 5, further including terminating data processing by:
-
(a) propagating an end of job indication through each source, process, and sink;
(b) exiting data processing in each source, process, and sink in response to the end of job indication instead of resuming normal data processing.
-
-
20. The method of claim 5, further including publishing such data values essentially immediately before resuming normal data processing.
-
21. The method of claim 5, further including determining that unpublished data values are deterministic, and publishing such data values essentially immediately after saving such unpublished data.
-
22. The method of claim 5, further including determining that unpublished data values are deterministic and ordered, and publishing such data values at any time after receiving checkpoint messages from every process from which such sink consumes data and before resuming normal data processing.
-
23. The method of claim 5, further including determining that republishing data values is acceptable, and publishing such data values at any time after receiving checkpoint messages from every process from which such sink consumes data and before resuming normal data processing.
-
24. A method for continuous flow checkpointing in a data processing system having one or more sources for receiving and storing input data, one or more processes for receiving and processing data from one or more sources or prior processes, and one or more sinks for receiving processed data from one or more processes or sources and for publishing processed data, the method including:
-
(a) transmitting a checkpoint request message to every source;
(b) suspending normal data processing in each source in response to receipt of such checkpoint request message, saving a current checkpoint record sufficient to reconstruct the state of such source, propagating a checkpoint message from such source to any process that consumes data from such source, and resuming normal data processing in each source;
(c) suspending normal data processing in each process in response to receiving checkpoint messages from every source or prior process from which such process consumes data, saving a current checkpoint record sufficient to reconstruct the state of such process, propagating the checkpoint message from such process to any process or sink that consumes data from such process, and resuming normal data processing in such process;
(d) suspending normal data processing in each sink in response to receiving checkpoint messages from every process from which such sink consumes data, saving a current checkpoint record sufficient to reconstruct the state of such sink, saving any unpublished data, and propagating the checkpoint message from each sink to a checkpoint processor;
(e) receiving the checkpoint messages from all sinks, and in response to such receipt, updating a stored value indicating completion of checkpointing in all sources, processes, and sinks, and transmitting the stored value to each sink; and
(f) receiving the stored value in each sink and, in response to such receipt, publishing any unpublished data associated with such sink and resuming normal data processing in such sink.
-
-
25. A computer program, stored on a computer-readable medium, for continuous flow checkpointing in a data processing system having at least one process stage comprising a data flow and at least two processes linked by the data flow, the computer program comprising instructions for causing a computer to:
-
(a) propagate at least one command message through the process stage as part of the data flow;
(b) checkpoint each process within the process stage in response to receipt by each process of at least one command message. - View Dependent Claims (26, 27, 28)
(a) determine that the state of the data processing system needs to be restored;
(b) restore each process to a corresponding saved state.
-
-
27. The computer program of claim 25, wherein at least one of such linked processes is a source or a sink.
-
28. The computer program of claim 25, wherein the instructions for causing the computer to checkpoint include instructions for causing the computer to suspend normal processing, save a corresponding state, and return to normal processing.
-
29. A computer program, stored on a computer-readable medium, for continuous flow checkpointing in a data processing system having one or more sources for receiving and storing input data, one or more processes for receiving and processing data from one or more sources or prior processes, and one or more sinks for receiving processed data from one or more processes or sources and for publishing processed data, the computer program comprising instructions for causing a computer to:
-
(a) transmit a checkpoint request message to every source;
(b) suspend normal data processing in each source in response to receipt of such checkpoint request message, save a current checkpoint record sufficient to reconstruct the state of such source, propagate a checkpoint message from such source to any process that consumes data from such source, and resume normal data processing in each source;
(c) suspend normal data processing in each process in response to receiving checkpoint messages from every source or prior process from which such process consumes data, save a current checkpoint record sufficient to reconstruct the state of such process, propagate the checkpoint message from such process to any process or sink that consumes data from such process, and resume normal data processing in such process;
(d) suspend normal data processing in each sink in response to receiving checkpoint messages from every process from which such sink consumes data, save a current checkpoint record sufficient to reconstruct the state of such sink, save any unpublished data, and resume normal data processing in such sink. - View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
(a) determine that the state of the data processing system needs to be restored;
(b) restore the state of each source, process, and sink from a corresponding current checkpoint record.
-
-
36. The computer program of claim 29, further including instructions for causing the computer to generate the checkpoint request message in response to detecting a checkpoint trigger event.
-
37. The computer program of claim 36, wherein the checkpoint trigger event occurs periodically.
-
38. The computer program of claim 36, wherein the checkpoint trigger event is based on an external stimulus.
-
39. The computer program of claim 36, wherein the checkpoint trigger event is based on occurrence of selected data values within or derived from incoming data records being processed.
-
40. The computer program of claim 36, further including instructions for causing the computer to:
-
(a) scan incoming data records within each source for a selected data value;
(b) upon detecting the selected data value within each source, transmit a control message to any process that consumes data from such source, the control message indicating that an end of data has occurred, and requesting checkpointing;
(c) determine that a checkpoint trigger event has occurred once a control message is transmitted by every source.
-
-
41. The computer program of claim 36, further including instructions for causing the computer to:
-
(a) examine incoming data records within each source and determining a selected data value based on such examination;
(b) provide the selected data value to each source;
(c) scan incoming data records within each source for the selected data value;
(d) upon detecting the selected data value within each source, transmit a control message to any process that consumes data from such source, the control message indicating that an end of data has occurred, and requesting checkpointing;
(e) determine that a checkpoint trigger event has occurred once a control message is transmitted by every source.
-
-
42. The computer program of claim 29, further including instructions for causing the computer to coordinate checkpointing with periodic production of output from the sinks.
-
43. The computer program of claim 29, further including instructions for causing the computer to terminate data processing by:
-
(a) propagating an end of job indication through each source, process, and sink;
(b) exiting data processing in each source, process, and sink in response to the end of job indication instead of resuming normal data processing.
-
-
44. The computer program of claim 29, further including instructions for causing the computer to publish such data values essentially immediately before resuming normal data processing.
-
45. The computer program of claim 29, further including instructions for causing the computer to determine that unpublished data values are deterministic, and to publish such data values essentially immediately after saving such unpublished data.
-
46. The computer program of claim 29, further including instructions for causing the computer to determine that unpublished data values are deterministic and ordered, and to publish such data values at any time after receiving checkpoint messages from every process from which such sink consumes data and before resuming normal data processing.
-
47. The computer program of claim 29, further including instructions for causing the computer to determine that republishing data values is acceptable, and to publish such data values at any time after receiving checkpoint messages from every process from which such sink consumes data and before resuming normal data processing.
-
48. A computer program, stored on a computer-readable medium, for continuous flow checkpointing in a data processing system having one or more sources for receiving and storing input data, one or more processes for receiving and processing data from one or more sources or prior processes, and one or more sinks for receiving processed data from one or more processes or sources and for publishing processed data, the computer program comprising instructions for causing a computer to:
-
(a) transmit a checkpoint request message to every source;
(b) suspend normal data processing in each source in response to receipt of such checkpoint request message, save a current checkpoint record sufficient to reconstruct the state of such source, propagate a checkpoint message from such source to any process that consumes data from such source, and resume normal data processing in each source;
(c) suspend normal data processing in each process in response to receiving checkpoint messages from every source or prior process from which such process consumes data, save a current checkpoint record sufficient to reconstruct the state of such process, propagate the checkpoint message from such process to any process or sink that consumes data from such process, and resume normal data processing in such process;
(d) suspend normal data processing in each sink in response to receiving checkpoint messages from every process from which such sink consumes data, save a current checkpoint record sufficient to reconstruct the state of such sink, save any unpublished data, and propagate the checkpoint message from each sink to a checkpoint processor;
(e) receive the checkpoint messages from all sinks, and in response to such receipt, update a stored value indicating completion of checkpointing in all sources, processes, and sinks, and transmit the stored value to each sink; and
(f) receive the stored value in each sink and, in response to such receipt, publish any unpublished data associated with such sink and resume normal data processing in such sink.
-
Specification