Listing the devices
# nvme list
Node          Generic     SN              Model                      Namespace  Usage                   Format           FW Rev
------------- ----------- --------------- -------------------------- ---------- ----------------------- ---------------- --------
/dev/nvme0n1  /dev/ng0n1  S463NF0M905327F Samsung SSD 970 PRO 512GB  0x1        6.94 GB / 512.11 GB     512 B + 0 B      1B2QEXP7
/dev/nvme10n1 /dev/ng10n1 S4..........26  SAMSUNG MZQLB7T6HMLA-00007 0x1        30.86 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme11n1 /dev/ng11n1 S4..........19  SAMSUNG MZQLB7T6HMLA-00007 0x1        927.35 GB / 7.68 TB     512 B + 0 B      EDB5502Q
/dev/nvme12n1 /dev/ng12n1 S4..........80  SAMSUNG MZQLB7T6HMLA-00007 0x1        30.90 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme13n1 /dev/ng13n1 S4..........79  SAMSUNG MZQLB7T6HMLA-00007 0x1        927.71 GB / 7.68 TB     512 B + 0 B      EDB5502Q
/dev/nvme14n1 /dev/ng14n1 S4..........87  SAMSUNG MZQLB7T6HMLA-00007 0x1        38.29 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme15n1 /dev/ng15n1 S4..........83  SAMSUNG MZQLB7T6HMLA-00007 0x1        30.91 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme1n1  /dev/ng1n1  S4..........76  SAMSUNG MZQLB7T6HMLA-00007 0x1        1.07 MB / 7.68 TB       512 B + 0 B      EDB5502Q
/dev/nvme2n1  /dev/ng2n1  S4..........73  SAMSUNG MZQLB7T6HMLA-00007 0x1        947.71 MB / 7.68 TB     512 B + 0 B      EDB5502Q
/dev/nvme3n1  /dev/ng3n1  S4..........43  SAMSUNG MZQLB7T6HMLA-00007 0x1        26.84 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme4n1  /dev/ng4n1  S4..........90  SAMSUNG MZQLB7T6HMLA-00007 0x1        7.68 TB / 7.68 TB       512 B + 0 B      EDB5502Q
/dev/nvme5n1  /dev/ng5n1  S4..........91  SAMSUNG MZQLB7T6HMLA-00007 0x1        61.12 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme6n1  /dev/ng6n1  S4..........92  SAMSUNG MZQLB7T6HMLA-00007 0x1        61.09 GB / 7.68 TB      512 B + 0 B      EDB5502Q
/dev/nvme7n1  /dev/ng7n1  S4..........75  SAMSUNG MZQLB7T6HMLA-00007 0x1        908.11 GB / 7.68 TB     512 B + 0 B      EDB5502Q
/dev/nvme8n1  /dev/ng8n1  S4..........82  SAMSUNG MZQLB7T6HMLA-00007 0x1        908.14 GB / 7.68 TB     512 B + 0 B      EDB5502Q
/dev/nvme9n1  /dev/ng9n1  S4..........85  SAMSUNG MZQLB7T6HMLA-00007 0x1        7.68 TB / 7.68 TB       512 B + 0 B      EDB5502Q
Kernel error log
# cat /var/log/messages
Feb 24 11:57:41 stor1 kernel: md/raid:md125: device nvme14n1 operational as raid disk 0
Feb 24 11:57:41 stor1 kernel: md/raid:md125: device nvme4n1 operational as raid disk 2
...
Feb 24 11:57:49 stor1 kernel: nvme14n1: Read(0x2) @ LBA 3112952, 1024 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR
Feb 24 11:57:49 stor1 kernel: critical target error, dev nvme14n1, sector 3112952 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Feb 24 12:24:34 stor1 kernel: nvme4n1: Read(0x2) @ LBA 642804224, 1024 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR
Feb 24 12:24:34 stor1 kernel: critical target error, dev nvme4n1, sector 642804224 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
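To see at a glance which drives are throwing read errors, the log can be filtered with a small script. This is only a sketch: the regular expression is my own and assumes the kernel messages look exactly like the excerpt above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   ... kernel: nvme14n1: Read(0x2) @ LBA 3112952, 1024 blocks, Unrecovered Read Error ...
# Assumption: the message layout is the one seen in the excerpt above.
READ_ERROR = re.compile(
    r"kernel: (?P<dev>nvme\d+n\d+): Read\(0x2\) @ LBA (?P<lba>\d+), "
    r"(?P<blocks>\d+) blocks, Unrecovered Read Error"
)


def unrecovered_read_errors(lines):
    """Yield (device, lba, blocks) for every Unrecovered Read Error line."""
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            yield match["dev"], int(match["lba"]), int(match["blocks"])


def errors_per_device(lines):
    """Count Unrecovered Read Error lines per device."""
    return Counter(dev for dev, _, _ in unrecovered_read_errors(lines))


if __name__ == "__main__":
    # Reading /var/log/messages usually requires root.
    with open("/var/log/messages", errors="replace") as log:
        for dev, count in errors_per_device(log).most_common():
            print(dev, count)
```

The "critical target error" lines are skipped on purpose, since each one repeats the device and sector of the line before it.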
# nvme smart-log /dev/nvme14n1
Smart Log for NVME device:nvme14n1 namespace-id:ffffffff
critical_warning : 0
temperature : 31 °C (304 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0
Data Units Read : 679451540 (347.88 TB)
Data Units Written : 15996237 (8.19 TB)
host_read_commands : 4162753349
host_write_commands : 451206689
controller_busy_time : 2872
power_cycles : 27
power_on_hours : 26610
unsafe_shutdowns : 16
media_errors : 76
num_err_log_entries : 267
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 31 °C (304 K)
Temperature Sensor 2 : 36 °C (309 K)
Temperature Sensor 3 : 41 °C (314 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
/dev/nvme{i}n1         0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
unsafe_shutdowns      28   10   10   10   16   16   16   17   17   15   17   17   14   18   16   10
num_err_log_entries   35  168  168  168  196  191  191  181  181  181  191  181  186  181  267  168
critical_warning       0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
media_errors           0    0    0    0    5    0    0    0    0    0    0    0    0    0   76    0
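A comparison like the one above can be collected with a script instead of by hand. The sketch below parses the plain-text output of `nvme smart-log` and assumes it has the "key : value" layout shown earlier, with plain decimal values; `parse_smart_log`, `summary_rows` and `read_smart_log` are names I made up, and the device range 0..15 is specific to this server.

```python
import re
import subprocess

FIELDS = ["unsafe_shutdowns", "num_err_log_entries", "critical_warning", "media_errors"]


def parse_smart_log(text):
    """Turn the 'key : value' lines of `nvme smart-log` text output into {key: int}.

    Only the leading integer of each value is kept (e.g. '31 °C (304 K)' gives 31).
    Lines whose value does not start with a digit, such as the header, are skipped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        number = re.match(r"\s*(\d[\d,]*)", value)
        if sep and number:
            values[key.strip()] = int(number.group(1).replace(",", ""))
    return values


def summary_rows(logs, fields=FIELDS):
    """logs maps device index -> smart-log text; returns a header row plus one row per field."""
    indices = sorted(logs)
    parsed = {i: parse_smart_log(logs[i]) for i in indices}
    rows = [["/dev/nvme{i}n1"] + [str(i) for i in indices]]
    for field in fields:
        rows.append([field] + [str(parsed[i].get(field, "?")) for i in indices])
    return rows


def read_smart_log(index):
    """Run `nvme smart-log` for /dev/nvme<index>n1 (needs root and nvme-cli)."""
    result = subprocess.run(
        ["nvme", "smart-log", f"/dev/nvme{index}n1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    logs = {i: read_smart_log(i) for i in range(16)}  # 16 drives on this server
    for row in summary_rows(logs):
        print(" ".join(row))
```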
https://santander.co.kr/122
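1. available_spare: danger once available_spare < available_spare_threshold. It is the percentage of spare blocks remaining, which the drive uses to replace worn-out or failed blocks.
2. percentage_used: danger once it goes past 100%. It is the vendor's estimate of how much of the drive's rated life (the endurance the warranty is based on) has been consumed.
3. controller_busy_time: measured in minutes. It is the time the controller spent busy with I/O commands (work waiting in the I/O queues), so it goes up when there is a lot of pending work. That seems normal to me, though I am not certain. I could not find a server where it was 0.
4. unsafe_shutdowns: exactly what the name says. Do not force power-off servers.
5. media_errors: once it reaches 1, a bad sector has been detected, so the drive should be replaced. On this server the two drives with a nonzero count, nvme4n1 (5) and nvme14n1 (76), are the same two that logged Unrecovered Read Errors in the kernel log.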
What to monitor on NVMe drives:
1. available_spare < available_spare_threshold
2. percentage_used > 100
3. media_errors > 0
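The three conditions above can be turned into a simple check. This is a minimal sketch, not a finished monitoring plugin: `smart_alerts` is a name I made up, and it assumes the SMART values have already been read into a dict keyed by the field names from the `nvme smart-log` text output, as plain integers.

```python
def smart_alerts(log):
    """Return a list of alert strings for one drive's SMART values.

    `log` is assumed to be a dict of integers keyed by the field names
    shown in the `nvme smart-log` text output.
    """
    alerts = []
    spare = log["available_spare"]
    threshold = log["available_spare_threshold"]
    # 1. Spare blocks have dropped below the drive's own threshold.
    if spare < threshold:
        alerts.append(f"available_spare {spare}% is below threshold {threshold}%")
    # 2. The vendor's life estimate has been used up.
    if log["percentage_used"] > 100:
        alerts.append(f"percentage_used is {log['percentage_used']}%")
    # 3. Any media error at all.
    if log["media_errors"] > 0:
        alerts.append(f"media_errors is {log['media_errors']}")
    return alerts


# Values taken from the nvme14n1 smart-log output earlier in these notes.
nvme14 = {
    "available_spare": 100,
    "available_spare_threshold": 10,
    "percentage_used": 0,
    "media_errors": 76,
}
print(smart_alerts(nvme14))  # ['media_errors is 76']
```

An empty list means none of the three conditions fired, so the result can be used directly as an alert/no-alert signal.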