Server Management / Checking for Hardware Failures

Pay close attention to any errors that appear on the screen.

rasdaemon - Memory, PCIe

RHEL 7 (June 2014) introduced a new Hardware Event Report Mechanism (HERM).
rasdaemon replaces edac-utils and mcelog.

sudo yum -y install rasdaemon
sudo systemctl enable rasdaemon
sudo systemctl start rasdaemon

$ ras-mc-ctl
Usage: ras-mc-ctl [OPTIONS...]
 --quiet            Quiet operation.
 --mainboard        Print mainboard vendor and model for this hardware.
 --status           Print status of EDAC drivers.
 --print-labels     Print Motherboard DIMM labels to stdout.
 --guess-labels     Print DMI labels, when bank locator is available.
 --register-labels  Load Motherboard DIMM labels into EDAC driver.
 --delay=N          Delay N seconds before writing DIMM labels.
 --labeldb=DB       Load label database from file DB.
 --layout           Display the memory layout.
 --summary          Presents a summary of the logged errors.
 --errors           Shows the errors stored at the error database.
 --help             This help message.

$ ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
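
The summary can also be checked from a script, for example from cron. The following is a minimal sketch, not a definitive check: the function name is hypothetical, and it assumes that a clean report consists only of lines beginning with "No ", as in the output above. The wording of a report that does contain errors may differ between rasdaemon versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `ras-mc-ctl --summary` output on stdin.
# Assumption: a clean report contains only lines that start with "No "
# (as in the sample output above). Any other non-empty line is treated
# as a sign that something was logged.
ras_summary_is_clean() {
    # grep -v keeps lines that are neither "No ..." lines nor blank;
    # grep -q . succeeds if at least one such line exists.
    # The leading "!" inverts that, so exit 0 means "clean".
    ! grep -v -e '^No ' -e '^[[:space:]]*$' | grep -q .
}
```

Usage: ras-mc-ctl --summary | ras_summary_is_clean || echo "hardware events logged, check ras-mc-ctl --errors"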

sudo systemctl status rasdaemon
sudo journalctl -f -u rasdaemon

Memory test (when a fault is suspected)

Boot from a memtest86+ USB stick and run the memory test.

Disk / SMART

List the physical disks

$ lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0  1.8T  0 disk
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0  1.8T  0 part
  ├─centos-root 253:0    0  128G  0 lvm  /
  ├─centos-swap 253:1    0 31.4G  0 lvm  [SWAP]
  └─centos-home 253:2    0  1.7T  0 lvm  /home
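
To feed the disk list into later checks, the partitions and LVM volumes can be filtered out. This is a small sketch with a hypothetical helper name; it assumes the two-column output of lsblk -d -n -o NAME,TYPE.

```shell
# Hypothetical helper: reads `lsblk -d -n -o NAME,TYPE` output on stdin
# and prints the /dev path of every entry whose TYPE is "disk"
# (this skips entries such as optical drives, which report "rom").
list_disks() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}
```

Usage: lsblk -d -n -o NAME,TYPE | list_disks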

When SMART is unavailable - Unavailable - device lacks SMART capability.

$ sudo smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               DELL
Product:              PERC H710P
Revision:             3.13
User Capacity:        1,999,307,276,288 bytes [1.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6b083fe0e3f372001e34e2bc2229c3e6
Serial number:        00e6c32922bce2341e0072f3e3e03f08
Device type:          disk
Local Time is:        Mon Apr 15 15:53:55 2019 KST
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

When SMART is unavailable (condensed) - -H still prints a health status, but -a shows no SMART data.

$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK


$ sudo smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
...
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
...
Error Counter logging not supported
...
Device does not support Self Test logging
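
In the output above, /dev/sda is a virtual disk presented by a Dell PERC H710P RAID controller, which is why smartctl sees no SMART capability. smartmontools can often reach the physical drives behind MegaRAID-type controllers with -d megaraid,N. The sketch below only prints the commands to try. The helper name is hypothetical, and the assumption that physical drive numbers run contiguously from 0 should be checked against the controller's own management tool.

```shell
# Hypothetical helper: print smartctl commands for physical drives
# 0..(count-1) behind a MegaRAID-type controller.
# It only prints the commands; nothing is executed.
# Assumption: device IDs are contiguous from 0, which is not guaranteed.
megaraid_smart_cmds() {
    dev=$1
    count=$2
    n=0
    while [ "$n" -lt "$count" ]; do
        echo "smartctl -H -d megaraid,$n $dev"
        n=$((n + 1))
    done
}
```

Usage: megaraid_smart_cmds /dev/sda 4, then run the printed commands with sudo.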

When SMART is available

$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


$ sudo smartctl -a /dev/sda
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       1/129842826
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   078   078   000    Old_age   Always       -       19838h+41m+20.760s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       29
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   042   061   000    Old_age   Always       -       42 (Min/Max 9/61)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       1/129842826
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       1/129842826
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       1/129842826
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   097   097   010    Pre-fail  Always       -       25769803777
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5993
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5192
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       5192
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       173
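
Reading the attribute table by eye is error-prone, so a filter can help. The following is a minimal sketch with a hypothetical helper name. It reports an attribute when WHEN_FAILED is not "-" or when the normalized VALUE is at or below a non-zero THRESH, and it assumes the ten-column layout shown above.

```shell
# Hypothetical helper: reads `smartctl -A` (or -a) output on stdin and
# prints the ID and name of every attribute that looks failed.
# Columns assumed: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
smart_failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0      # normalized value, e.g. "097" -> 97
        thresh = $6 + 0     # threshold, 0 means "no threshold"
        if ($9 != "-" || (thresh > 0 && value <= thresh))
            print $1, $2
    }'
}
```

Usage: sudo smartctl -A /dev/sda | smart_failed_attrs

For the table above this prints nothing, since every WHEN_FAILED column is "-" and no VALUE has reached its THRESH.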

Disk / NVMe

View the NVMe SMART log

sudo yum -y install nvme-cli

sudo nvme smart-log /dev/nvme0n1
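
The first thing to look at in the log is usually the critical warning field. The sketch below is an assumption-laden example: the helper name is hypothetical, and it assumes the plain-text output prints the field as "critical_warning : <value>", with the value in decimal or hex. The exact layout varies between nvme-cli versions.

```shell
# Hypothetical helper: reads `nvme smart-log` text output on stdin and
# succeeds only if a critical_warning field is present and zero.
# Assumption: the field is printed as "critical_warning : <value>",
# where a healthy value is 0, 0x0 or 0x00.
nvme_no_critical_warning() {
    awk -F: '$1 ~ /^[ \t]*critical_warning[ \t]*$/ {
        v = $2
        gsub(/[ \t\r]/, "", v)          # strip padding around the value
        found = 1
        ok = (v ~ /^(0|0x0+)$/)
    }
    END { exit (found && ok) ? 0 : 1 }'
}
```

Usage: sudo nvme smart-log /dev/nvme0n1 | nvme_no_critical_warning || echo "check the NVMe device"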

Checking temperatures

sudo yum -y install lm_sensors hddtemp

Detect and register system sensors

sudo sensors-detect
...
Do you want to scan for Super I/O sensors? (YES/no): YES
Do you want to probe the I2C/SMBus adapters now? (YES/no): YES
...

Read the sensors

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +39.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:         +35.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:         +36.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:         +35.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:         +35.0°C  (high = +84.0°C, crit = +100.0°C)

nct6776-isa-0290
Adapter: ISA adapter
Vcore:          +0.39 V  (min =  +0.00 V, max =  +1.74 V)
AVCC:           +3.17 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.17 V  (min =  +2.98 V, max =  +3.63 V)
3VSB:           +3.26 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.12 V  (min =  +2.70 V, max =  +3.63 V)
fan1:             0 RPM  (min =    0 RPM)
fan2:          2339 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
SYSTIN:         +37.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
AUXTIN:         -16.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PECI Agent 0:   +39.0°C  (high = +80.0°C, hyst = +75.0°C)
                         (crit = +100.0°C)
PCH_CHIP_TEMP:   +0.0°C
PCH_CPU_TEMP:    +0.0°C
PCH_MCH_TEMP:    +0.0°C
cpu0_vid:      +0.000 V
intrusion0:    OK
intrusion1:    OK
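
Lines that sensors marks with ALARM are easy to miss in long output. This is a small sketch with a hypothetical helper name that lists only the flagged labels.

```shell
# Hypothetical helper: reads `sensors` output on stdin and prints the
# label (text before the first colon) of every line flagged with ALARM.
sensors_alarms() {
    awk -F: '/(^|[ \t])ALARM([ \t]|$)/ { print $1 }'
}
```

Usage: sensors | sensors_alarms

For the output above this prints SYSTIN. That alarm is probably caused by the high limit being reported as +0.0°C rather than by real overheating, so the limits are worth checking before acting on an alarm.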

$ sudo hddtemp
/dev/sda: ADATA SP900: 42°C

References

20.6. CHECKING FOR HARDWARE ERRORS
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors

RHEL 7 kernel release notes / Hardware Error Reporting Mechanism
https://access.redhat.com/documentation/ko-kr/red_hat_enterprise_linux/7/html/7.0_release_notes/chap-kernel

How do I get notified of ECC errors in Linux?
https://serverfault.com/questions/643542

Diagnose Hardware Failures
https://support.system76.com/articles/hardware-failure/

RHEL V5 - How to collect mcelog
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02655435

Checking disks with smartctl and hdparm - Smileserv
https://idchowto.com/?p=41487
