Pay close attention to any errors shown on screen.
rasdaemon - memory, PCIe
RHEL 7 (June 2014) introduced a new HERM (Hardware Event Report Mechanism)
Replaces edac-utils and mcelog
...
```
sudo systemctl status rasdaemon
sudo journalctl -f -u rasdaemon
```
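Once rasdaemon is running, the events it has recorded can be queried with `ras-mc-ctl`, which ships in the same package. The following is a minimal sketch, assuming the rasdaemon package is installed; `ras_report` is a hypothetical helper name, not part of rasdaemon.

```shell
# Hypothetical helper: show what rasdaemon has recorded so far.
ras_report() {
  # Bail out early if the tool is not installed.
  if ! command -v ras-mc-ctl >/dev/null 2>&1; then
    echo "ras-mc-ctl not found; install the rasdaemon package first" >&2
    return 1
  fi
  sudo ras-mc-ctl --status    # are the EDAC drivers loaded?
  sudo ras-mc-ctl --summary   # summary of the logged errors
  sudo ras-mc-ctl --errors    # individual events stored in the error database
}
```

Calling `ras_report` on a healthy machine should show the driver status and empty error lists.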
Testing memory (suspected to be faulty)
Boot from a memtest86+ USB stick and run the memory test
Disk / SMART
```
$ lsblk
NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                 8:0    0  1.8T  0 disk
├─sda1              8:1    0    1G  0 part /boot
└─sda2              8:2    0  1.8T  0 part
  ├─centos-root   253:0    0  128G  0 lvm  /
  ├─centos-swap   253:1    0 31.4G  0 lvm  [SWAP]
  └─centos-home   253:2    0  1.7T  0 lvm  /home
```
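On a machine with several drives, the `lsblk` listing can be narrowed to whole disks so that each one can be health-checked in turn. This is a rough sketch; `whole_disks` is a hypothetical helper name, and the loop in the comment assumes smartmontools is installed.

```shell
# Hypothetical helper: read `lsblk -dn -o NAME,TYPE` output on stdin and
# print whole disks (TYPE == disk) as /dev paths, skipping rom, loop, etc.
whole_disks() {
  awk '$2 == "disk" { print "/dev/" $1 }'
}

# Usage:
#   for dev in $(lsblk -dn -o NAME,TYPE | whole_disks); do
#     sudo smartctl -H "$dev"
#   done
```

Here `-d` hides partitions and LVM volumes and `-n` drops the header line, so the filter only has to look at the TYPE column.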
...
```
$ sudo smartctl -H /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl -a /dev/sda
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       1/129842826
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   078   078   000    Old_age   Always       -       19838h+41m+20.760s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       29
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   042   061   000    Old_age   Always       -       42 (Min/Max 9/61)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       1/129842826
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       1/129842826
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       1/129842826
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   097   097   010    Pre-fail  Always       -       25769803777
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5993
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       5192
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       5192
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       173
```
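In the attribute table, an attribute counts as failed when its normalized VALUE is at or below THRESH. Attributes with a THRESH of 000 (such as Unexpect_Power_Loss_Ct above) are informational and cannot fail. The sketch below filters the table accordingly; `smart_failing_attrs` is a hypothetical helper name, and it assumes the ten-column ATA layout shown above.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# "ID NAME VALUE THRESH" for every attribute whose normalized VALUE
# is at or below a non-zero THRESH.
smart_failing_attrs() {
  awk '
    $1 ~ /^[0-9]+$/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
      print $1, $2, $4, $6
    }
  '
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_failing_attrs
```

For the drive listed above this prints nothing, since every attribute with a non-zero threshold is still well above it.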
Disk / NVMe
Viewing the NVMe SMART log
```
sudo yum -y install nvme-cli
```
...
```
sudo nvme smart-log /dev/nvme0n1
```
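The two fields usually worth checking first in the smart-log output are `critical_warning` and `media_errors`; both should be 0 on a healthy drive. The sketch below flags non-zero values. It assumes the human-readable `key : value` layout printed by nvme-cli, which may differ between versions, and `nvme_health_flags` is a hypothetical helper name.

```shell
# Hypothetical helper: read `nvme smart-log` text output on stdin, print any
# non-zero critical_warning / media_errors values, and return 1 if one is found.
# Assumes lines of the form "critical_warning    : 0".
nvme_health_flags() {
  awk -F':' '
    BEGIN { bad = 0 }
    {
      key = $1; val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the key
      gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim padding around the value
    }
    key == "critical_warning" && val != "0" { print "critical_warning=" val; bad = 1 }
    key == "media_errors"     && val != "0" { print "media_errors=" val; bad = 1 }
    END { exit bad }
  '
}

# Usage:
#   sudo nvme smart-log /dev/nvme0n1 | nvme_health_flags || echo "check this drive"
```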
Checking temperature
```
sudo yum -y install lm_sensors hddtemp
```
...
```
$ sudo hddtemp /dev/sda
/dev/sda: ADATA SP900: 42°C
```
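To spot a drive that is running hot, the hddtemp output can be compared against a limit. This is a minimal sketch, assuming the `device: model: temperature` line format shown above; `hot_drives` is a hypothetical helper name and the default limit of 50 is an arbitrary example, not a vendor figure.

```shell
# Hypothetical helper: read hddtemp output on stdin and print "device temp"
# for every drive hotter than the limit in degrees Celsius (default 50).
hot_drives() {
  awk -F': ' -v limit="${1:-50}" '
    NF >= 3 {
      temp = $NF + 0            # "42°C" converts to 42
      if (temp > limit) print $1, temp
    }
  '
}

# Usage:
#   sudo hddtemp /dev/sda | hot_drives 45
```

For CPU and motherboard sensors, the `sensors` command from lm_sensors prints readings once `sudo sensors-detect` has been run.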
References
20.6. CHECKING FOR HARDWARE ERRORS
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors
...
Diagnose Hardware Failures
https://support.system76.com/articles/hardware-failure/
How to Identify Which Hardware Component is Failing in Your Computer
https://www.howtogeek.com/174068
RHEL V5 - How to collect mcelog
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02655435
...