RESILIENCY TO MEMORY FAILURES IN COMPUTER SYSTEMS
First Claim
1. A method in a computing system for correcting memory errors reported by a memory system, the method comprising:
- executing a load instruction of a processor of the computing system to load a data word from an address of the memory system;
- when a memory error does not occur during executing of the load instruction, providing the data word loaded from the memory as the result of the load instruction;
- when a memory error does occur during execution of the load instruction, retrieving error correction information associated with the address;
- re-creating from the error correction information for the data word for the address of the memory; and
- providing the re-created data word as a result of the load instruction wherein the executing of the load instruction, when a memory error does not occur, needs no additional overhead.
Abstract
A resiliency system detects and corrects memory errors reported by a memory system of a computing system using previously stored error correction information. When a program stores data into a memory location, the resiliency system executing on the computing system generates and stores error correction information. When the program then executes a load instruction to retrieve the data from the memory location, the load instruction completes normally if there is no memory error. If, however, there is a memory error, the computing system passes control to the resiliency system (e.g., via a trap) to handle the memory error. The resiliency system retrieves the error correction information for the memory location and re-creates the data of the memory location. The resiliency system stores the data as if the load instruction had completed normally and passes control to the next instruction of the program.
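The control flow described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: `MemoryFault`, `ToyMemory`, and `ResiliencySystem` are hypothetical names, the hardware trap is modeled as a Python exception, and the error correction information is assumed to be a plain copy of each stored word (the source does not specify the encoding here).

```python
class MemoryFault(Exception):
    """Hypothetical stand-in for the trap raised when a load fails."""

    def __init__(self, address):
        super().__init__(f"memory error at address {address}")
        self.address = address


class ToyMemory:
    """Word-addressed memory whose loads can be made to fail on demand."""

    def __init__(self, size):
        self.words = [0] * size
        self.failed = set()  # addresses whose loads currently report an error

    def load(self, address):
        if address in self.failed:
            raise MemoryFault(address)
        return self.words[address]

    def store(self, address, value):
        self.words[address] = value


class ResiliencySystem:
    def __init__(self, memory):
        self.memory = memory
        # Assumption: the error correction information is a full copy of the word.
        self.correction_info = {}

    def store(self, address, value):
        # Generate and keep error correction information, then store the data.
        self.correction_info[address] = value
        self.memory.store(address, value)

    def load(self, address):
        try:
            # Normal path: the load completes and no correction data is touched.
            return self.memory.load(address)
        except MemoryFault as fault:
            # Error path: re-create the word and hand it back as the load result.
            return self.correction_info[fault.address]
```

The caller sees the same value whether or not the load faulted, which mirrors the abstract's statement that the program continues as if the load had completed normally.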
34 Citations
39 Claims
1. A method in a computing system for correcting memory errors reported by a memory system, the method comprising:
- executing a load instruction of a processor of the computing system to load a data word from an address of the memory system;
- when a memory error does not occur during executing of the load instruction, providing the data word loaded from the memory as the result of the load instruction;
- when a memory error does occur during execution of the load instruction, retrieving error correction information associated with the address;
- re-creating from the error correction information for the data word for the address of the memory; and
- providing the re-created data word as a result of the load instruction wherein the executing of the load instruction, when a memory error does not occur, needs no additional overhead.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computing system that provides resilient access to memory, the computing system comprising:
- a memory system having a memory storing data words at addresses and providing load and store access to the memory, wherein the memory system signals a memory error when a load access fails;
- a processor with an instruction set that includes a load instruction; and
- a storage medium that includes instructions of;
- a re-create data word component that re-creates the data word of the memory using error correction information that includes at least one other data word of the memory and at least one correction word;
- such that the processor executes a load instruction of an application program for loading a data word by issuing to the memory system a load request specifying the address of the data word to load;
- when a memory error is not signaled, providing the data word provided by the memory system as the loaded data word; and
- when a memory error is signaled, executing the re-create data word component and providing the re-created data word as the loaded data word.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
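One way to picture a re-create data word component that uses "at least one other data word" and "at least one correction word" is single XOR parity. The claim does not name a specific code, so XOR parity is an assumption here, and `make_check_word` and `recreate_data_word` are hypothetical names.

```python
from functools import reduce
from operator import xor


def make_check_word(data_words):
    """Correction word for a group: XOR of all its data words (assumed scheme)."""
    return reduce(xor, data_words, 0)


def recreate_data_word(data_words, failed_index, check_word):
    """Rebuild the word at failed_index from the other words and the check word."""
    others = (w for i, w in enumerate(data_words) if i != failed_index)
    return reduce(xor, others, check_word)
```

With this scheme a group tolerates the loss of one word at a time; the value stored at `failed_index` is never read during re-creation.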
19. A computer-readable storage medium containing computer-executable instructions of an application program interface for providing resiliency to memory accesses of an application program, the instructions comprising:
- a segment register component that registers a segment of memory that is to be resilient for the application program, the registered segment having a segment descriptor indicating number of data words, number of check words, size of a check group, location of the data words, and location of the check words;
- a segment reference component that maps the registered segment into the address space of the application program and registers a re-create data word component to process memory errors that occur when the application program accesses the registered segment; and
- a segment write component that stores a data word in the registered segment by generating a check word for the data word, storing the generated check word, and storing the data word in the registered segment.
Dependent claims: 20, 21, 22
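The three API components in this claim can be sketched as follows. All names (`SegmentDescriptor`, `segment_register`, `segment_reference`, `segment_write`) are hypothetical, the "locations" are modeled as Python lists, mapping into the address space is reduced to returning the descriptor, and the check word is assumed to be XOR parity over a check group.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentDescriptor:
    num_data_words: int
    num_check_words: int
    check_group_size: int
    data_words: list = field(default_factory=list)   # stands in for the data location
    check_words: list = field(default_factory=list)  # stands in for the check location


_segments = {}   # registered segments, by name
_handlers = {}   # re-create components registered per segment


def segment_register(name, num_data_words, check_group_size):
    """Register a segment that is to be resilient and build its descriptor."""
    num_check_words = -(-num_data_words // check_group_size)  # ceiling division
    desc = SegmentDescriptor(
        num_data_words, num_check_words, check_group_size,
        [0] * num_data_words, [0] * num_check_words,
    )
    _segments[name] = desc
    return desc


def segment_reference(name, recreate_handler):
    """'Map' the segment (here: return it) and register the error handler."""
    _handlers[name] = recreate_handler
    return _segments[name]


def segment_write(desc, index, value):
    """Generate and store the check word, then store the data word."""
    group = index // desc.check_group_size
    # XOR out the old word and XOR in the new one (assumed parity scheme).
    desc.check_words[group] ^= desc.data_words[index] ^ value
    desc.data_words[index] = value
```

Updating the check word incrementally keeps each write to two word updates regardless of the check group size.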
23. A method in a replacement node for reconstructing a resilient segment of a first node of a network of nodes with a distributed memory, the first node hosting a portion of the resilient segment, the method comprising:
- receiving by the replacement node an indication that the first node is to be replaced;
- for each data word of the resilient segment, collecting by the replacement node from nodes other than the first node error correction information sufficient to re-create that data word, the error correction information for a data word of the first node being stored at nodes other than the first node;
- re-creating that data word based on the collected error correction information; and
- storing the re-created data word in a replacement resilient segment of the replacement node;
- for each check word stored in memory of the first node, re-creating and storing by the replacement node that check word; and
- notifying the nodes of the network that the replacement node has replaced the first node.
Dependent claims: 24, 25
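The reconstruction steps above can be sketched with a toy cluster. The layout is an assumption, not something the claim prescribes: each stripe holds one word per node, one of them an XOR parity word whose position rotates across nodes, so every word of a lost node (data or check) is the XOR of the surviving words in its stripe. `layout`, `rebuild`, and `replace_node` are hypothetical names.

```python
from functools import reduce
from operator import xor


def layout(stripes, num_nodes):
    """Spread stripes over nodes; each stripe has num_nodes - 1 data words."""
    nodes = [[0] * len(stripes) for _ in range(num_nodes)]
    for s, data in enumerate(stripes):
        parity_node = s % num_nodes      # rotate the check word across nodes
        words = iter(data)
        for n in range(num_nodes):
            nodes[n][s] = reduce(xor, data, 0) if n == parity_node else next(words)
    return nodes


def rebuild(nodes, failed):
    """Re-create every word of the failed node from the surviving nodes only."""
    survivors = [node for i, node in enumerate(nodes) if i != failed]
    num_stripes = len(survivors[0])
    return [reduce(xor, (node[s] for node in survivors), 0)
            for s in range(num_stripes)]


def replace_node(nodes, failed):
    """Install the rebuilt segment and return the nodes that must be notified."""
    nodes[failed] = rebuild(nodes, failed)
    return [i for i in range(len(nodes)) if i != failed]
```

Because the XOR of all words in a stripe is zero, the same loop restores both data words and check words of the replaced node.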
26. A method in a computing system for storing data words and check words for providing resiliency in a memory, the method comprising:
- storing data words in the memory, each data word being stored at an address within the memory; and
- for each check unit of data words, generating a check word for that check unit;
- generating a check word address for storing the generated check word for that check unit such that the same check word address can be regenerated from each of the addresses of the data words of that check unit and cannot be regenerated from any other addresses of data words; and
- storing the generated check word at the generated check word address within the memory.
Dependent claims: 27, 28, 29, 30
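A simple address mapping with the property described above is integer division by the check unit size. The constants and the function name are hypothetical, and the sketch assumes word-aligned data addresses starting at 0 with the check words kept in a separate region at `CHECK_BASE`.

```python
WORD_BYTES = 8            # assumed word size
CHECK_UNIT_WORDS = 4      # assumed number of data words per check unit
CHECK_BASE = 0x1000_0000  # assumed start of the check word region


def check_word_address(data_address):
    """Map a data word address to the address of its unit's check word."""
    unit = data_address // (WORD_BYTES * CHECK_UNIT_WORDS)
    return CHECK_BASE + unit * WORD_BYTES
```

Every data address inside one check unit yields the same check word address, and addresses from any other unit yield a different one, so no lookup table is needed on the error path.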
31. A method in a computing system for storing data words and check words to support resiliency of a memory, the memory providing a cache line of data words per access and a non-specific memory error signal covering multiple data words of the cache line, the method comprising:
- storing the data words in the memory; and
- for each check unit of error correction information, generating a check word for that check unit such that each of the multiple data words covered by the same non-specific memory error signal is in a different check unit; and
- storing the generated check word.
Dependent claims: 32, 33
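To see why the words of one cache line should land in different check units, consider a sketch in which a check unit is made of the words at the same offset in several consecutive lines. The sizes, the XOR parity scheme, and all function names are assumptions for illustration. If an error signal cannot say which word of a line failed, the whole line is treated as lost, yet each check unit loses only one member and can still be recovered.

```python
from functools import reduce
from operator import xor

WORDS_PER_LINE = 8   # assumed cache line size, in data words
LINES_PER_UNIT = 4   # assumed number of data words per check unit


def check_unit_of(word_index):
    """Words at the same offset in LINES_PER_UNIT consecutive lines share a unit."""
    line, offset = divmod(word_index, WORDS_PER_LINE)
    return (line // LINES_PER_UNIT) * WORDS_PER_LINE + offset


def members(unit):
    """Indices of the data words that belong to a check unit."""
    group, offset = divmod(unit, WORDS_PER_LINE)
    first_line = group * LINES_PER_UNIT
    return [(first_line + k) * WORDS_PER_LINE + offset
            for k in range(LINES_PER_UNIT)]


def make_check_words(memory):
    """One XOR check word per check unit (memory length: a multiple of 32 here)."""
    units = {check_unit_of(i) for i in range(len(memory))}
    return {u: reduce(xor, (memory[i] for i in members(u)), 0) for u in units}


def recover_line(memory, checks, line):
    """Re-create every word of a failed line without reading that line."""
    recovered = []
    for i in range(line * WORDS_PER_LINE, (line + 1) * WORDS_PER_LINE):
        unit = check_unit_of(i)
        others = (memory[j] for j in members(unit) if j != i)
        recovered.append(reduce(xor, others, checks[unit]))
    return recovered
```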
34. A method in a computing system for storing data words and check words to support resiliency of a memory, the memory being distributed across multiple nodes, the method comprising:
- storing data words in the memory; and
- for each check unit of error correction information, generating a check word for that check unit such that at least two of the data words of that check unit are stored at different nodes; and
- storing the generated check word.
Dependent claims: 35, 36
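A placement rule that satisfies the "different nodes" condition can be checked mechanically. The sketch below assumes a round-robin distribution of data words over nodes and check units made of consecutive word indices; both choices and all names are hypothetical.

```python
def node_of(word_index, num_nodes):
    """Assumed round-robin placement of data words over nodes."""
    return word_index % num_nodes


def check_unit_members(unit, unit_size):
    """Assumed check unit: unit_size consecutive data word indices."""
    start = unit * unit_size
    return list(range(start, start + unit_size))


def nodes_spanned(unit, unit_size, num_nodes):
    """Set of nodes that hold at least one data word of the check unit."""
    return {node_of(i, num_nodes) for i in check_unit_members(unit, unit_size)}
```

With a unit size no larger than the node count, every data word of a check unit lands on a different node, which is stronger than the "at least two" the claim requires.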
37. A method in a computing system for adapting an application program to access a resilient segment of the application program, the method comprising:
- modifying the application program code to invoke a register component of a resiliency system to register a data structure of the application program as a resilient segment; and
- modifying the application program to invoke a store data word component of the resiliency system to store data words in the resilient segment, rather than directly by the application program executing a store instruction,
Dependent claims: 38, 39
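The claim does not say whether the program is modified by hand or by a tool. As one possible illustration, the sketch below rewrites Python source with the standard `ast` module (Python 3.9 or later): it adds a call to a hypothetical `register_segment` after the data structure is created and replaces direct element stores with calls to a hypothetical `store_data_word`. Both function names are assumptions and would have to be supplied by the resiliency system.

```python
import ast


class ResilienceAdapter(ast.NodeTransformer):
    """Rewrite stores into one named data structure to go through the
    (hypothetical) resiliency system entry points."""

    def __init__(self, name):
        self.name = name

    def _call(self, func, args):
        # Build the statement `func(*args)`.
        return ast.Expr(value=ast.Call(
            func=ast.Name(id=func, ctx=ast.Load()), args=args, keywords=[]))

    def _segment(self):
        return ast.Name(id=self.name, ctx=ast.Load())

    def visit_Assign(self, node):
        if len(node.targets) != 1:
            return node
        target = node.targets[0]
        # `name = ...` creates the structure: register it right afterwards.
        if isinstance(target, ast.Name) and target.id == self.name:
            return [node, self._call("register_segment", [self._segment()])]
        # `name[i] = v` is a direct store: route it through store_data_word.
        if (isinstance(target, ast.Subscript)
                and isinstance(target.value, ast.Name)
                and target.value.id == self.name):
            return self._call("store_data_word",
                              [self._segment(), target.slice, node.value])
        return node


def adapt(source, name):
    """Return the adapted syntax tree for the given program source."""
    tree = ResilienceAdapter(name).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return tree
```

Calling `ast.unparse(adapt(source, "table"))` shows the adapted program text; loads are left untouched, consistent with error handling happening only when a load fails.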
Specification