TOLERATING MEMORY STACK FAILURES IN MULTI-STACK SYSTEMS

US 20200133518A1
Filed: 10/31/2018
Published: 04/30/2020
Est. Priority Date: 10/31/2018
Status: Active Grant

Abstract

Memory management circuitry and processes operate to improve reliability of a group of memory stacks, providing that if a memory stack or a portion thereof fails during the product's lifetime, the system may still recover with no errors or data loss. A front-end controller receives a block of data requested to be written to memory, divides the block into sub-blocks, and creates a new redundant reliability sub-block. The sub-blocks are then written to different memory stacks. When reading data from the memory stacks, the front-end controller detects errors indicating a failure within one of the memory stacks, and recovers corrected data using the reliability sub-block. The front-end controller may monitor errors for signs of a stack failure and disable the failed stack.
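
The write and recovery path described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reliability sub-block is computed; the sketch below assumes a bytewise XOR parity over equal-sized sub-blocks, which is one common choice and not necessarily the one the patent uses. All function names (`split_block`, `make_reliability_sub_block`, `recover_sub_block`) are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def split_block(block: bytes, n_sub: int) -> List[bytes]:
    """Divide a block into n_sub equal-sized sub-blocks."""
    if n_sub <= 0 or len(block) % n_sub != 0:
        raise ValueError("block length must be a positive multiple of n_sub")
    size = len(block) // n_sub
    return [block[i * size:(i + 1) * size] for i in range(n_sub)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_reliability_sub_block(sub_blocks: List[bytes]) -> bytes:
    """Assumed scheme: the reliability sub-block is the XOR of all sub-blocks."""
    return reduce(xor_bytes, sub_blocks)


def recover_sub_block(sub_blocks: List[Optional[bytes]],
                      failed_index: int,
                      reliability: bytes) -> bytes:
    """Rebuild the sub-block at failed_index from the survivors and the parity."""
    survivors = [sb for i, sb in enumerate(sub_blocks) if i != failed_index]
    return reduce(xor_bytes, survivors, reliability)


# Example: a 64-byte block split across four data stacks plus one parity stack.
block = bytes(range(64))
subs = split_block(block, 4)
parity = make_reliability_sub_block(subs)

# Simulate losing the stack that holds sub-block 2.
damaged: List[Optional[bytes]] = list(subs)
damaged[2] = None
recovered = recover_sub_block(damaged, 2, parity)
```

With XOR parity, any single lost sub-block can be rebuilt as long as the failed position is known, which is why the scheme pairs naturally with a per-sub-block error check that identifies the failing stack.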

20 Claims

1. A memory system, comprising:
- a random-access memory including a plurality of memory stacks, each including a plurality of stacked random-access memory integrated circuit dies;
  
  a memory controller coupled to said random-access memory and operable to:
  
  receive a block of data for writing to the memory stacks;
  
  divide the block of data into a plurality of sub-blocks;
  
  create a reliability sub-block based on the plurality of sub-blocks;
  
  cause the plurality of sub-blocks and the reliability sub-block each to be written to a different one of the memory stacks;
  
  cause the plurality of sub-blocks to be read from the plurality of memory stacks and detect an error therein indicating a failure within one of the memory stacks; and
  
  in response to detecting the error, recover correct data based on the reliability sub-block.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The memory system of claim 1, wherein:
    - the memory controller includes a front-end controller and a plurality of memory channel controllers coupled between the front-end controller and the memory stacks; and
      
      the front-end controller is operable to produce a plurality of different addresses respectively for the plurality of sub-blocks and the reliability sub-block based on a single address for the block of data.
  - 3. The memory system of claim 2, wherein the front-end controller is further operable to respond to a designated set of detected errors by disabling access for a designated one of the memory stacks causing the designated memory stack to not be accessed for read or write and making a record that the designated memory stack is disabled.
  - 4. The memory system of claim 2, wherein the front-end controller is further operable to, in response to detecting the error, cause the reliability sub-block to be read.
  - 5. The memory system of claim 2, wherein the memory controller detecting the error includes determining the presence of an uncorrectable error from error correction code data co-located with the plurality of sub-blocks.
  - 6. The memory system of claim 1, wherein the plurality of memory stacks are mounted on a carrier substrate, and the memory controller is part of a microprocessor integrated circuit mounted on the carrier substrate.
  - 7. The memory system of claim 1, wherein the plurality of memory stacks are part of a multi-chip module including the memory controller.
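
Claim 2 recites producing a different address for each sub-block and for the reliability sub-block from a single block address. The claims do not specify the mapping; the sketch below assumes one data sub-block per stack, the same local offset in every stack, and a reliability sub-block whose stack rotates with the block index (a RAID-5-style placement chosen purely for illustration). `StackAddress` and `derive_addresses` are hypothetical names.

```python
from typing import List, NamedTuple


class StackAddress(NamedTuple):
    stack: int            # index of the target memory stack
    offset: int           # byte offset inside that stack
    is_reliability: bool  # True for the reliability sub-block


def derive_addresses(block_addr: int, block_size: int,
                     n_stacks: int) -> List[StackAddress]:
    """Map one block address to n_stacks per-stack addresses (assumed layout)."""
    n_data = n_stacks - 1  # one stack per block holds the reliability sub-block
    if n_data < 1 or block_size % n_data != 0 or block_addr % block_size != 0:
        raise ValueError("unsupported geometry or unaligned block address")
    sub_size = block_size // n_data
    block_index = block_addr // block_size
    offset = block_index * sub_size
    # Assumption: rotate the reliability stack so no single stack holds all parity.
    parity_stack = block_index % n_stacks
    data_stacks = [s for s in range(n_stacks) if s != parity_stack]
    addrs = [StackAddress(s, offset, False) for s in data_stacks]
    addrs.append(StackAddress(parity_stack, offset, True))
    return addrs


# Example: block at address 128, 64-byte blocks, five stacks.
addrs = derive_addresses(128, 64, 5)
```

Every returned address targets a distinct stack, so a failure confined to one stack affects at most one of the sub-blocks belonging to a given block.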

8. A method of managing memory access, comprising:
- receiving a block of data for writing to a random-access memory;
  
  dividing the block of data into a plurality of sub-blocks;
  
  creating a reliability sub-block based on the plurality of sub-blocks;
  
  causing the plurality of sub-blocks and the reliability sub-block each to be written to different ones of a plurality of memory stacks, each memory stack comprising a plurality of stacked random-access memory integrated circuits;
  
  causing the plurality of sub-blocks to be read from the plurality of memory stacks and detecting an error therein indicating a failure within one of the memory stacks; and
  
  in response to detecting the error, recovering correct data based on the reliability sub-block.
- Dependent claims (9, 10, 11, 12, 13, 14, 15):
  - 9. The method of claim 8, further comprising, in response to detecting the error, causing the reliability sub-block to be read.
  - 10. The method of claim 8, wherein detecting the error includes determining the presence of an uncorrectable error from error correction code data co-located with the plurality of sub-blocks.
  - 11. The method of claim 8, wherein detecting the error is based on error detection code data co-located with the plurality of sub-blocks.
  - 12. The method of claim 8, wherein causing the plurality of sub-blocks and the reliability sub-block to be written further comprises producing a plurality of different addresses respectively for the plurality of sub-blocks and the reliability sub-block based on a single address received for the block of data.
  - 13. The method of claim 8, wherein causing the plurality of sub-blocks and the reliability sub-block to be written further comprises supplying each of the sub-blocks to a different memory channel controller configured for managing a respective memory channel of the memory stacks.
  - 14. The method of claim 8, further comprising responding to a designated set of detected errors by disabling access for a designated one of the memory stacks, causing the designated memory stack to not be accessed for read or write, and making a record that the designated memory stack is disabled.
  - 15. The method of claim 8, wherein the reliability sub-block is created by a front-end controller coupled to receive a memory request from a system cache and send the memory request to a plurality of memory channel controllers adapted to control respective memory channels of the memory stacks.
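
Claims 10 and 11 base error detection on code data co-located with the sub-blocks. As a rough illustration of claim 11 only, the sketch below stores a CRC-32 alongside each sub-block and reports which stored sub-blocks fail the check. The choice of CRC-32, the 4-byte trailer layout, and the names `attach_check`, `verify`, and `find_failed` are assumptions; the claims do not name a particular code.

```python
import zlib
from typing import List, Tuple

CODE_LEN = 4  # bytes used by the CRC-32 trailer (assumed layout)


def attach_check(sub_block: bytes) -> bytes:
    """Append a CRC-32 of the payload so the code is stored with the data."""
    return sub_block + zlib.crc32(sub_block).to_bytes(CODE_LEN, "big")


def verify(stored: bytes) -> Tuple[bytes, bool]:
    """Return the payload and whether it matches its stored CRC-32."""
    payload, code = stored[:-CODE_LEN], stored[-CODE_LEN:]
    return payload, zlib.crc32(payload).to_bytes(CODE_LEN, "big") == code


def find_failed(stored_sub_blocks: List[bytes]) -> List[int]:
    """Indices of sub-blocks whose payload no longer matches its code."""
    return [i for i, s in enumerate(stored_sub_blocks) if not verify(s)[1]]


# Example: four stored sub-blocks, one of which is corrupted by a single bit flip.
stored = [attach_check(bytes([i]) * 16) for i in range(4)]
corrupted = bytearray(stored[1])
corrupted[3] ^= 0x01
stored[1] = bytes(corrupted)
failed = find_failed(stored)
```

Knowing the index of the failing sub-block is what lets the controller treat the fault as an erasure and rebuild that sub-block from the reliability sub-block.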

16. A memory controller circuit for interfacing with a plurality of random-access memory stacks, comprising:
- a plurality of memory channel controllers coupled to the random-access memory stacks; and
  
  a front-end controller coupled to the plurality of memory channel controllers and operable to:
  
  receive a block of data for writing to the random-access memory stacks;
  
  divide the block of data into a plurality of sub-blocks;
  
  create a reliability sub-block based on the plurality of sub-blocks;
  
  direct selected ones of the memory channel controllers to cause the plurality of sub-blocks and the reliability sub-block each to be written to a different one of the random-access memory stacks;
  
  direct selected ones of the memory channel controllers to cause the plurality of sub-blocks to be read from the random-access memory stacks;
  
  detect an error in the plurality of sub-blocks indicating a failure within one of the memory stacks; and
  
  in response to detecting the error, recover correct data based on the reliability sub-block.
- Dependent claims (17, 18, 19, 20):
  - 17. The memory controller circuit of claim 16, wherein the front-end controller is further operable to, in response to detecting the error, cause the reliability sub-block to be read.
  - 18. The memory controller circuit of claim 16, wherein the front-end controller is further operable to produce a plurality of different addresses respectively for the plurality of sub-blocks and the reliability sub-block based on a single address for the block of data.
  - 19. The memory controller circuit of claim 16, wherein the front-end controller is further operable to respond to a designated set of detected errors by disabling a designated one of the random-access memory stacks causing the designated memory stack to not be accessed for read or write and making a record that the designated memory stack is disabled.
  - 20. The memory controller circuit of claim 16, wherein detecting the error includes determining the presence of an uncorrectable error from error correction code data co-located with the plurality of sub-blocks.
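
Claims 3, 14, and 19 describe disabling a stack in response to a designated set of detected errors and recording that it is disabled. The claims leave the triggering condition open; the sketch below assumes a simple per-stack error-count threshold. `StackHealthMonitor` and its methods are hypothetical.

```python
from typing import List, Set


class StackHealthMonitor:
    """Tracks detected errors per stack and records disabled stacks."""

    def __init__(self, n_stacks: int, threshold: int = 3) -> None:
        self.error_counts: List[int] = [0] * n_stacks
        self.disabled: Set[int] = set()  # the record of disabled stacks
        self.threshold = threshold       # assumed trigger: errors per stack

    def report_error(self, stack: int) -> bool:
        """Log one detected error; return True if the stack is now disabled."""
        if stack in self.disabled:
            return True
        self.error_counts[stack] += 1
        if self.error_counts[stack] >= self.threshold:
            self.disabled.add(stack)
        return stack in self.disabled

    def is_accessible(self, stack: int) -> bool:
        """Disabled stacks must not be accessed for read or write."""
        return stack not in self.disabled


# Example: stack 1 is disabled after its second reported error.
monitor = StackHealthMonitor(n_stacks=4, threshold=2)
first = monitor.report_error(1)
second = monitor.report_error(1)
```

Once a stack is recorded as disabled, reads that would have targeted it can go straight to reconstruction from the remaining sub-blocks and the reliability sub-block.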

Current Assignee: Advanced Micro Devices, Inc.
Original Assignee: Advanced Micro Devices, Inc.
Inventors: Mappouras, Georgios; Farahani, Amin Farmahini; Ignatowski, Michael

Granted Patent: US 11,494,087 B2
CPC Class Codes

G06F 11/0751   Error or fault detection no...

G06F 11/0793   Remedial or corrective acti...

G06F 11/108   Parity data distribution in...

G06F 3/0619   in relation to data integri...

G06F 3/0653   Monitoring storage devices ...

G06F 3/0673   Single storage device
