In-situ computing system failure avoidance

US 9,058,250 B2
Filed: 07/23/2013
Issued: 06/16/2015
Est. Priority Date: 07/23/2013
Status: Expired due to Fees


Abstract

A remaining time to replace can be updated taking into account time variation of a failure mechanism of a device. Starting with an initial remaining time to replace, an effective operating time can be determined periodically based on an operating parameter measured at a tracking interval, and remaining time to replace can be updated by subtracting the effective operating time. The technique can be applied to multiple failure mechanisms and to multiple devices and/or components each having multiple failure mechanisms.
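The update described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `update_remaining_time`, the `acceleration` callable, and the two-level stress model are all hypothetical, chosen only to show how an effective operating time is derived per tracking interval and subtracted from the remaining time to replace.

```python
from typing import Callable, Iterable


def update_remaining_time(
    remaining_hours: float,
    readings: Iterable[float],
    tracking_interval_hours: float,
    acceleration: Callable[[float], float],
) -> float:
    """Subtract one effective operating time per tracking interval."""
    for reading in readings:
        # Effective time is the wall-clock interval scaled by a stress factor
        # (hypothetical model; a factor of 1.0 means nominal stress).
        effective_hours = tracking_interval_hours * acceleration(reading)
        remaining_hours -= effective_hours
    return remaining_hours


# Hypothetical stress model: 1x at or below 60 C, 2x above it.
remaining = update_remaining_time(
    remaining_hours=1000.0,
    readings=[55.0, 70.0, 58.0],  # one temperature reading per interval
    tracking_interval_hours=1.0,
    acceleration=lambda temp_c: 2.0 if temp_c > 60.0 else 1.0,
)
print(remaining)  # 996.0
```

Three one-hour intervals consume four effective hours here because the middle interval ran hot and therefore counts double under the assumed model.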

20 Claims

1. A computing system device failure avoidance method for a computing system including at least one device, the method comprising:
- identifying at least one failure mechanism of each device affected by time variation of an operating parameter of the device;
  
  assigning a respective time to replace for each failure mechanism;
  
  assigning a respective remaining time to replace initially equal to the respective time to replace for each failure mechanism,
  
  tracking the operating parameter periodically at a tracking interval;
  
  tracking a respective remaining time to replace for each failure mechanism periodically at the tracking interval, including,
  
  determining an effective operating time of the respective device during each tracking interval based on at least one value of the operating parameter tracked during a respective tracking interval, and
  
  subtracting the effective operating time from the respective remaining time to replace; and
  
  replacing a respective device responsive to one of the respective estimated time to replace reaching a respective threshold value.
  - 2. The method of claim 1, wherein the identifying of at least one failure mechanism includes identifying a performance metric impacted by time variation of an operating parameter of the device and identifying at least one failure mechanism of the device based on the performance metric.
  - 3. The method of claim 1, wherein the assigning of the respective time to replace and a respective remaining time to replace initially equal to the respective time to replace for each failure mechanism includes assigning a respective expected number of failures for each of the at least one failure mechanism and assigning a respective time to replace for each of the at least one failure mechanism based at least in part on a respective expected number of fails.
  - 4. The method of claim 1, further comprising:
    - monitoring for a fail;
      
      responsive to a fail, determining whether a failure is recoverable,
      
      responsive to the failure being recoverable, performing a recovery action and returning to monitoring for a failure, tracking the operating parameter, and tracking the respective remaining time to replace,
      
      responsive to the failure not being recoverable, notifying a user that action is necessary, and
      
      responsive to a failure that is not recoverable, notifying a user of the failure.
  - 5. The method of claim 1, wherein the at least one failure mechanism includes electromigration.
  - 6. The method of claim 1, wherein the operating parameter is temperature and the tracking of the operating parameter includes using at least one device temperature sensor to measure a respective temperature during a measurement period at each tracking interval.
  - 7. The method of claim 6, further comprising determining a space domain estimated failure rate based on at least a number of temperature sensors used and the respective measured temperatures for each sensor and at each tracking interval.
  - 8. The method of claim 7, wherein each of the at least one temperature sensor tracks a respective number of elements, and the space domain estimated failure rate employs the relationship:
  - 9. The method of claim 6, further comprising determining a time domain estimated failure rate based on at least design life of the device, a respective measured temperature for each tracking interval, an operation time of the device, and a relationship between operation time, measured temperature, and tracking interval.
  - 10. The method of claim 9, wherein each of the at least one temperature sensor tracks a respective number of elements, the failure mechanism is electromigration, and the relationship includes:
  - 11. The method of claim 1, wherein a respective remaining time to replace for each of the at least one device is based on smallest remaining time to replace of the respective at least one failure mechanism of the respective at least one device.
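Claims 5, 6, and 11 combine naturally: temperature is the tracked parameter, electromigration is one failure mechanism, and the device-level remaining time is the smallest value across mechanisms. The sketch below assumes an Arrhenius acceleration factor to turn a measured temperature into an effective operating time. That model is a common reliability-engineering choice and an assumption here; the relationships recited in claims 8 and 10 are not reproduced in this text. The class `FailureMechanism`, the activation energies, the reference temperature, and the second mechanism are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


@dataclass
class FailureMechanism:
    name: str
    time_to_replace_h: float       # assigned time to replace
    activation_energy_ev: float    # hypothetical per-mechanism parameter
    remaining_h: float = field(init=False)

    def __post_init__(self) -> None:
        # Remaining time starts equal to the assigned time to replace.
        self.remaining_h = self.time_to_replace_h


def arrhenius_factor(temp_k: float, ref_temp_k: float, ea_ev: float) -> float:
    """Assumed acceleration model: 1.0 at the reference temperature, >1 when hotter."""
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_temp_k - 1.0 / temp_k))


def track_interval(mechanisms: List[FailureMechanism], temp_k: float,
                   interval_h: float, ref_temp_k: float = 358.15) -> None:
    """Subtract each mechanism's effective operating time for one interval."""
    for m in mechanisms:
        factor = arrhenius_factor(temp_k, ref_temp_k, m.activation_energy_ev)
        m.remaining_h -= interval_h * factor


def device_remaining(mechanisms: List[FailureMechanism]) -> float:
    """Device-level value is the smallest remaining time of any mechanism."""
    return min(m.remaining_h for m in mechanisms)


mechs = [
    FailureMechanism("electromigration", 100000.0, 0.9),
    FailureMechanism("hypothetical_mechanism_b", 80000.0, 0.7),
]
track_interval(mechs, temp_k=358.15, interval_h=1.0)  # at reference temperature
print(device_remaining(mechs))  # 79999.0
```

A replacement decision would then compare `device_remaining(mechs)` against a chosen threshold value.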

12. An in-situ computing system device failure avoidance method for a computing system including at least one device, the method comprising:
- assigning a respective time to replace for each device of the at least one device;
  
  assigning a respective remaining time to replace initially equal to the respective time to replace;
  
  tracking at least one operating parameter of each device, including measuring an operating parameter of the device at least once for each device during a tracking interval;
  
  determining a respective effective operation time for each device based on a respective measured value of the operating parameter and the tracking interval;
  
  subtracting the respective effective operation time from the respective remaining time to replace;
  
  monitoring for a failure of the at least one device;
  
  responsive to a failure of the at least one device, maintaining the respective device and continuing operation in response to the failure being recoverable, and notifying a user in response to the failure not being recoverable; and
  
  responsive to a respective remaining time to replace having reached a threshold value, replacing the respective device.
  - 13. The method of claim 12, wherein the determining of the respective time to replace includes identifying at least one respective failure mechanism of the device and determining a respective time to replace for each failure mechanism, and the determining of the respective remaining time to replace for each device includes determining a respective effective operation time based on at least one measured value of the operating parameter at a respective interval, and subtracting the respective effective operation time from the respective remaining time to replace.
  - 14. The method of claim 13, wherein the operating parameter is temperature, the failure mechanism is electromigration, each device includes at least one temperature sensor each tracking a number of elements, and the determining of the respective time to replace includes determining a time domain estimated failure rate for each device using the relationship:
  - 15. The method of claim 12, wherein the operating parameter is temperature, each device includes at least one temperature sensor each tracking a number of elements, and the determining of the respective time to replace includes determining a space domain estimated failure rate for each device using the relationship:
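The decision logic at the end of claim 12 (maintain on a recoverable failure, notify on an unrecoverable one, replace once the remaining time reaches its threshold) can be sketched as a single function. The `Action` enum, the function name `decide_action`, and the choice to check for a failure before checking the threshold are assumptions made for illustration; the claim does not fix that ordering.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue operation"
    MAINTAIN = "maintain device and continue"
    NOTIFY_USER = "notify user"
    REPLACE = "replace device"


def decide_action(remaining_h: float, threshold_h: float,
                  failed: bool, recoverable: bool = False) -> Action:
    """Pick the next step for one device after a tracking interval."""
    if failed:
        # A recoverable failure is handled in place; otherwise a user is told.
        return Action.MAINTAIN if recoverable else Action.NOTIFY_USER
    if remaining_h <= threshold_h:
        # Remaining time to replace has reached the threshold.
        return Action.REPLACE
    return Action.CONTINUE


print(decide_action(remaining_h=120.0, threshold_h=100.0, failed=False).value)
print(decide_action(remaining_h=90.0, threshold_h=100.0, failed=False).value)
print(decide_action(remaining_h=500.0, threshold_h=100.0, failed=True,
                    recoverable=True).value)
```

Returning an enum rather than acting directly keeps the sketch free of side effects, so the monitoring loop that calls it can decide how to carry out each action.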

16. A computing system failure avoidance computer program product for a computing system including at least one device and at least one processing unit in communication with at least one non-transitory computer readable storage medium, the computer program product being stored on the at least one non-transitory computer readable storage medium and including instructions in the form of computer executable code that when loaded and executed by the processing unit cause the processing unit to perform a method comprising:
- identifying at least one failure mechanism of each of the at least one device affected by time variation of an operating parameter of the respective device;
  
  assigning a respective time to replace for each failure mechanism;
  
  assigning a respective remaining time to replace for each failure mechanism initially equal to the respective time to replace,
  
  tracking the operating parameter periodically at a tracking interval;
  
  tracking the respective remaining time to replace for each failure mechanism, including,
  
  determining an effective operating time of each device during each tracking interval based on at least one value of the operating parameter tracked during a respective tracking interval, and
  
  subtracting the effective operating time from the respective remaining time to replace; and
  
  replacing a respective device in response to one of the respective remaining time to replace reaching a respective threshold value.
  - 17. The computer program product of claim 16, wherein the method further comprises monitoring for a failure, responsive to a failure, determining whether the failure is recoverable, responsive to the failure being recoverable, performing a recovery action and returning to monitoring for a failure, and responsive to the failure not being recoverable, notifying a user of the failure.
  - 18. The computer program product of claim 16, wherein the operating parameter is temperature, the tracking of the operating parameter includes using at least one temperature sensor of the computing system to measure a respective temperature of each device during each tracking interval, each of the at least one sensor tracks a respective number of elements, and the determining of the time to replace includes determining a space domain estimated failure rate based on at least a number of temperature sensors used and the respective measured temperatures for each sensor and at each tracking interval using the relationship:
  - 19. The computer program product of claim 16, wherein the operating parameter is temperature, the tracking of the operating parameter includes using at least one temperature sensor of the computing system to measure a respective temperature of each device during each tracking interval, each of the at least one sensor tracks a respective number of elements, the at least one failure mechanism includes electromigration, and the determining of the time to replace includes determining a time domain estimated failure rate based on at least design life of the device, the respective measured temperature for each tracking interval, and an operation time of the device using the relationship:
  - 20. The computer program product of claim 16, wherein the computing system includes at least one component, the method is applied recursively to each component, and a respective time to replace is tracked for each component and triggers replacing one of the device or the component when the respective time to replace has reached a respective threshold value.
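The recursive application in claim 20 can be pictured as a walk over a device-and-component tree. The sketch below is an assumption-laden illustration: the `Unit` class and its field names are hypothetical, and for brevity every unit is charged the same effective operating time per interval, whereas a real system would presumably derive a separate value for each component from its own sensors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """A device or component with its own remaining time to replace."""
    name: str
    remaining_h: float
    threshold_h: float = 0.0
    components: List["Unit"] = field(default_factory=list)


def track(unit: Unit, effective_h: float) -> None:
    """Subtract the effective operating time from a unit and all components."""
    unit.remaining_h -= effective_h
    for component in unit.components:
        track(component, effective_h)


def due_for_replacement(unit: Unit) -> List[str]:
    """Collect names of every unit whose remaining time reached its threshold."""
    due = [unit.name] if unit.remaining_h <= unit.threshold_h else []
    for component in unit.components:
        due.extend(due_for_replacement(component))
    return due


board = Unit("board", 10.0, components=[Unit("fan", 2.0), Unit("psu", 5.0)])
track(board, 3.0)
print(due_for_replacement(board))  # ['fan']
```

Because each unit carries its own threshold, a worn component can be flagged for replacement while its parent device stays in service.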

Current Assignee: GlobalFoundries, Inc.
Original Assignee: International Business Machines Corporation
Inventors: Bickford, Jeanne P. S.; Habib, Nazmul; Li, Baozhen; Nsame, Pascal A.
Primary Examiner(s): LE, DIEU MINH T
Application Number: US13/948,811
Publication Number: US 20150033081A1
Time in Patent Office: 693 Days
Field of Search: 714/39, 714/47.1, 714/47.2, 714/47.3, 714/48, 714/1, 714/2
US Class Current: 1/1
CPC Class Codes:
G06F 11/004   Error avoidance G06F11/07 a...
G06F 11/30   Monitoring
G06F 11/3058   Monitoring arrangements for...
G06F 11/3065   Monitoring arrangements det...
