Pay close attention to any errors shown on screen.
rasdaemon - memory, PCIe
RHEL 7 (June 2014) introduced the new HERM (Hardware Event Report Mechanism).
rasdaemon replaces edac-utils and mcelog.
sudo yum -y install rasdaemon
sudo systemctl enable rasdaemon
sudo systemctl start rasdaemon
$ ras-mc-ctl
Usage: ras-mc-ctl [OPTIONS...]
--quiet Quiet operation.
--mainboard Print mainboard vendor and model for this hardware.
--status Print status of EDAC drivers.
--print-labels Print Motherboard DIMM labels to stdout.
--guess-labels Print DMI labels, when bank locator is available.
--register-labels Load Motherboard DIMM labels into EDAC driver.
--delay=N Delay N seconds before writing DIMM labels.
--labeldb=DB Load label database from file DB.
--layout Display the memory layout.
--summary Presents a summary of the logged errors.
--errors Shows the errors stored at the error database.
--help This help message.
$ ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
sudo systemctl status rasdaemon
sudo journalctl -f -u rasdaemon
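A minimal sketch for scripting the summary check, assuming a clean system prints only lines of the form "No ... errors." as in the sample output above. The function name check_ras_summary is hypothetical, and nothing is assumed about the wording of non-clean lines: anything that does not match is simply echoed for a person to read.

```shell
# check_ras_summary: read `ras-mc-ctl --summary` output on stdin.
# Prints every non-blank line that is NOT of the form "No ... errors."
# and returns 1 if at least one such line was seen, 0 otherwise.
check_ras_summary() {
    awk '
        NF == 0            { next }          # skip blank lines
        /^No .* errors\.$/ { next }          # category reported as clean
                           { print; bad = 1 } # anything else deserves a look
        END                { exit bad ? 1 : 0 }
    '
}

# Hypothetical wiring:
#   sudo ras-mc-ctl --summary | check_ras_summary \
#       || echo "details: sudo ras-mc-ctl --errors"
```

Reading from stdin keeps the function independent of the hardware, so it can be exercised with saved output before being wired into cron or a monitoring agent.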
Testing memory (when a fault is suspected)
Boot from a memtest86+ USB drive and run the memory test.
Disk / SMART
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 1.8T 0 part
├─centos-root 253:0 0 128G 0 lvm /
├─centos-swap 253:1 0 31.4G 0 lvm [SWAP]
└─centos-home 253:2 0 1.7T 0 lvm /home
$ sudo smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: DELL
Product: PERC H710P
Revision: 3.13
User Capacity: 1,999,307,276,288 bytes [1.99 TB]
Logical block size: 512 bytes
Logical Unit id: 0x6b083fe0e3f372001e34e2bc2229c3e6
Serial number: 00e6c32922bce2341e0072f3e3e03f08
Device type: disk
Local Time is: Mon Apr 15 15:53:55 2019 KST
SMART support is: Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
Device does not support Self Test logging
$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
$ sudo smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
...
SMART support is: Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
...
Error Counter logging not supported
...
Device does not support Self Test logging
$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
$ sudo smartctl -a /dev/sda
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 095 095 050 Old_age Always - 1/129842826
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours_and_Msec 0x0032 078 078 000 Old_age Always - 19838h+41m+20.760s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
171 Program_Fail_Count 0x000a 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 29
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline - 0
181 Program_Fail_Count 0x000a 100 100 000 Old_age Always - 0
182 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0012 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 042 061 000 Old_age Always - 42 (Min/Max 9/61)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline - 1/129842826
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline - 1/129842826
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline - 1/129842826
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always - 100
231 SSD_Life_Left 0x0013 097 097 010 Pre-fail Always - 25769803777
233 SandForce_Internal 0x0032 000 000 000 Old_age Always - 5993
234 SandForce_Internal 0x0032 000 000 000 Old_age Always - 5192
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 5192
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always - 173
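The attribute table above can be screened automatically. The sketch below assumes the ten-column layout shown (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value). The function name smart_flag_attrs and the list of watched IDs (5, 171, 172, 181, 182, 187, 196) are illustrative choices, not a smartmontools feature. A row is reported when its normalized VALUE has dropped to or below a nonzero THRESH, or when a watched counter has a nonzero integer raw value.

```shell
# smart_flag_attrs: read `smartctl -a` (or `-A`) output on stdin and print
#   BELOW_THRESH <id> <name> <value>/<thresh>   when VALUE <= THRESH (THRESH > 0)
#   NONZERO      <id> <name> <raw>              when a watched counter is nonzero
smart_flag_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1; name = $2; raw = $10
            # Watched IDs are an assumption made for this sketch.
            watched = (id == 5 || id == 171 || id == 172 || id == 181 ||
                       id == 182 || id == 187 || id == 196)
            if (($6 + 0) > 0 && ($4 + 0) <= ($6 + 0))
                print "BELOW_THRESH", id, name, $4 "/" $6
            else if (watched && raw ~ /^[0-9]+$/ && (raw + 0) > 0)
                print "NONZERO", id, name, raw
        }
    '
}

# Hypothetical wiring:
#   sudo smartctl -a /dev/sda | smart_flag_attrs
```

Fed the sample table above, the function prints nothing, since every watched counter is 0 and no VALUE has reached its THRESH.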
Disk / NVMe
View the NVMe SMART log
sudo yum -y install nvme-cli
sudo nvme smart-log /dev/nvme0n1
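A rough sketch for picking out the two smart-log fields that most directly indicate trouble. It assumes the human-readable output uses "name : value" lines with the field names critical_warning and media_errors. That layout is not shown in these notes and can vary between nvme-cli versions, so compare it against real output first. The function name nvme_flag_fields is hypothetical.

```shell
# nvme_flag_fields: read `nvme smart-log` text output on stdin and print
# "<field>=<value>" for critical_warning / media_errors when the value is
# anything other than 0 (or 0x0). Field names are assumed, see the note above.
nvme_flag_fields() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/[ \t]+/, "", key)   # drop padding around the field name
            gsub(/[ \t]+/, "", val)   # drop padding around the value
            if ((key == "critical_warning" || key == "media_errors") &&
                val != "" && val != "0" && val != "0x0")
                print key "=" val
        }
    '
}

# Hypothetical wiring:
#   sudo nvme smart-log /dev/nvme0n1 | nvme_flag_fields
```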
Checking temperatures
sudo yum -y install lm_sensors hddtemp
sudo sensors-detect
...
Do you want to scan for Super I/O sensors? (YES/no): YES
Do you want to probe the I2C/SMBus adapters now? (YES/no): YES
...
$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +39.0°C (high = +84.0°C, crit = +100.0°C)
Core 0: +35.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +36.0°C (high = +84.0°C, crit = +100.0°C)
Core 2: +35.0°C (high = +84.0°C, crit = +100.0°C)
Core 3: +35.0°C (high = +84.0°C, crit = +100.0°C)
nct6776-isa-0290
Adapter: ISA adapter
Vcore: +0.39 V (min = +0.00 V, max = +1.74 V)
AVCC: +3.17 V (min = +2.98 V, max = +3.63 V)
+3.3V: +3.17 V (min = +2.98 V, max = +3.63 V)
3VSB: +3.26 V (min = +2.98 V, max = +3.63 V)
Vbat: +3.12 V (min = +2.70 V, max = +3.63 V)
fan1: 0 RPM (min = 0 RPM)
fan2: 2339 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
SYSTIN: +37.0°C (high = +0.0°C, hyst = +0.0°C) ALARM sensor = thermistor
AUXTIN: -16.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
PECI Agent 0: +39.0°C (high = +80.0°C, hyst = +75.0°C)
(crit = +100.0°C)
PCH_CHIP_TEMP: +0.0°C
PCH_CPU_TEMP: +0.0°C
PCH_MCH_TEMP: +0.0°C
cpu0_vid: +0.000 V
intrusion0: OK
intrusion1: OK
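The sensors output above mixes usable limits with placeholder ones (SYSTIN reports high = +0.0°C and raises ALARM at a normal 37°C). A small sketch that reports only temperature readings at or above a positive "high" limit, assuming the "label: +N°C (high = +M°C, ...)" line shape shown above; the function name sensors_over_high is hypothetical.

```shell
# sensors_over_high: read `sensors` output on stdin and print
# "<label> <current> <high>" for readings at or above their high limit.
# Limits of 0 or below are treated as unset and ignored.
sensors_over_high() {
    awk '
        /high = / {
            label = $0; sub(/:.*/, "", label)            # text before the colon
            rest = $0;  sub(/^[^:]*:[ \t]*/, "", rest)   # text after the colon
            cur = rest + 0                               # leading "+39.0" -> 39
            hi = $0;    sub(/.*high = /, "", hi)
            hi = hi + 0                                  # leading "+84.0" -> 84
            if (hi > 0 && cur >= hi)
                printf "%s %.1f %.1f\n", label, cur, hi
        }
    '
}

# Hypothetical wiring:
#   sensors | sensors_over_high
```

Voltage and fan lines carry "min"/"max" rather than "high", so they are skipped without extra filtering.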
$ sudo hddtemp
/dev/sda: ADATA SP900: 42°C
References
20.6. CHECKING FOR HARDWARE ERRORS
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors
RHEL 7 kernel release notes / Hardware Event Report Mechanism
https://access.redhat.com/documentation/ko-kr/red_hat_enterprise_linux/7/html/7.0_release_notes/chap-kernel
How do I get notified of ECC errors in Linux?
https://serverfault.com/questions/643542
Diagnose Hardware Failures
https://support.system76.com/articles/hardware-failure/
How to Identify Which Hardware Component is Failing in Your Computer
https://www.howtogeek.com/174068
RHEL V5 - How to collect mcelog
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02655435
Checking disks with smartctl and hdparm - Smileserv (스마일서브)
https://idchowto.com/?p=41487