Pay close attention to any errors shown on screen.

rasdaemon - memory, PCIe

RHEL 7 (2014.06) introduced a new HERM (Hardware Event Report Mechanism).
rasdaemon replaces edac-utils and mcelog.

...

sudo systemctl status rasdaemon
sudo journalctl -f -u rasdaemon
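
Beyond the service status and journal, the rasdaemon package also provides the `ras-mc-ctl` utility for querying recorded events. The following is a minimal sketch, assuming rasdaemon is installed and is recording events to its database (this depends on how the package was built). The function name `ras_report` is hypothetical.

```shell
# Hypothetical helper: show the EDAC driver status and the errors rasdaemon has logged.
# Assumes ras-mc-ctl (shipped with rasdaemon) is on PATH; run it as root.
ras_report() {
    if ! command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "ras-mc-ctl not found; install the rasdaemon package first" >&2
        return 1
    fi
    ras-mc-ctl --status     # are the EDAC drivers loaded?
    ras-mc-ctl --summary    # summary of the logged errors
    ras-mc-ctl --errors     # individual events stored in the error database
}
```

Call `ras_report` from a root shell, since the event database is typically readable only by root.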


Testing (suspected faulty) memory

Boot from a memtest86+ USB drive and run the memory test.


Disk / SMART

List physical disks
$ lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  1.8T  0 disk
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0  1.8T  0 part
  ├─centos-root 253:0    0  128G  0 lvm  /
  ├─centos-swap 253:1    0 31.4G  0 lvm  [SWAP]
  └─centos-home 253:2    0  1.7T  0 lvm  /home
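
The full `lsblk` tree above also includes partitions and LVM volumes. To see only the physical disks, the output can be restricted to top-level devices. This is a minimal sketch; the function name `list_disks` is hypothetical, and the column choice is just one reasonable selection.

```shell
# Hypothetical helper: list whole disks only.
# -d skips partitions and LVM volumes, -n drops the header line.
# ROTA is 1 for rotational drives (HDD) and 0 for non-rotational ones (SSD/NVMe).
list_disks() {
    lsblk -d -n -o NAME,SIZE,ROTA,TYPE,MODEL | awk '$4 == "disk"'
}
```

The `awk` filter drops non-disk devices such as optical drives, so the remaining names can be passed to `smartctl` one by one.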

...

SMART health check and attributes
$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


$ sudo smartctl -a /dev/sda
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       1/129842826
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   078   078   000    Old_age   Always       -       19838h+41m+20.760s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       29
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   042   061   000    Old_age   Always       -       42 (Min/Max 9/61)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       1/129842826
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       1/129842826
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       1/129842826
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   097   097   010    Pre-fail  Always       -       25769803777
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5993
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5192
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       5192
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       173
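
In the attribute table, an attribute is considered failed when its normalized VALUE drops to or below a non-zero THRESH. Scanning a long table by eye is error-prone, so a small filter can help. This is a sketch only: the function name `smart_failing_attrs` is hypothetical, and it assumes the ten-column ATA attribute layout shown above.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print attributes
# whose normalized VALUE is at or below a non-zero THRESH.
# Columns: 1=ID 2=NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH 7=TYPE 8=UPDATED 9=WHEN_FAILED 10=RAW
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
        printf "%s %s value=%s thresh=%s type=%s\n", $1, $2, $4, $6, $7
    }'
}
```

Typical use would be `sudo smartctl -A /dev/sda | smart_failing_attrs`. For the sample output above it prints nothing, because no attribute has reached its threshold.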


Disk / NVMe

Viewing the NVMe SMART log


sudo yum -y install nvme-cli

...

sudo nvme smart-log /dev/nvme0n1
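
The smart-log output is fairly long; a handful of fields carry most of the health signal. The sketch below keeps only those lines. The function name `nvme_health_fields` is hypothetical, and the field names are assumed to match the human-readable output of nvme-cli, whose exact layout varies between versions.

```shell
# Hypothetical helper: read `nvme smart-log` text output on stdin and keep the
# fields most relevant to drive health.
# Assumption: field names as printed by nvme-cli in its default text format.
nvme_health_fields() {
    grep -E '^(critical_warning|available_spare|available_spare_threshold|percentage_used|media_errors|num_err_log_entries)[[:space:]]*:'
}
```

For example, `sudo nvme smart-log /dev/nvme0n1 | nvme_health_fields`. A non-zero `critical_warning` or `media_errors` value is worth investigating further.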


Checking temperature

sudo yum -y install lm_sensors hddtemp

...

$ sudo hddtemp /dev/sda
/dev/sda: ADATA SP900: 42°C
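
When several drives are checked at once, it can help to print only the ones running hot. This is a minimal sketch, assuming hddtemp's "device: model: temperature" line format shown above. The function name `hot_drives` is hypothetical, and the default limit of 50 is an arbitrary example value, not a vendor specification.

```shell
# Hypothetical helper: read hddtemp-style lines ("/dev/sda: MODEL: 42°C") on
# stdin and print drives at or above a limit in degrees Celsius.
# Usage: hot_drives [limit]   (limit defaults to 50, an arbitrary example)
hot_drives() {
    awk -F': ' -v limit="${1:-50}" '{
        temp = $NF
        sub(/[^0-9].*$/, "", temp)     # strip the trailing unit, e.g. "°C"
        if (temp != "" && temp + 0 >= limit + 0) print $1, temp
    }'
}
```

For example, `sudo hddtemp /dev/sda | hot_drives 45`. Lines without a numeric reading are skipped. CPU and motherboard temperatures come from the `sensors` command in lm_sensors, usually after running `sudo sensors-detect` once.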


References

20.6. CHECKING FOR HARDWARE ERRORS
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors

...

Diagnose Hardware Failures
https://support.system76.com/articles/hardware-failure/

How to Identify Which Hardware Component is Failing in Your Computer
https://www.howtogeek.com/174068

RHEL V5 - How to collect mcelog
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02655435

...