Author |
Message |
Adel Ahmed Veteran
Joined: 21 Sep 2012 Posts: 1607
|
Posted: Fri Sep 02, 2016 4:44 pm Post subject: disk problems |
|
|
I'm using ZFS on Linux in a raidz configuration and exporting the filesystem via NFS. I get freezes at times. zpool status is not showing any errors, but I get the following in journalctl:
Code: | Sep 02 18:30:27 pc.home kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
Sep 02 18:30:27 pc.home kernel: ata4: irq_stat 0x00400000, PHY RDY changed
Sep 02 18:30:27 pc.home kernel: ata4: hard resetting link |
and:
Code: | Sep 02 18:35:01 pc.home kernel: sd 3:0:0:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 02 18:35:01 pc.home kernel: sd 3:0:0:0: [sdd] tag#1 Sense Key : 0x5 [current] [descriptor]
Sep 02 18:35:01 pc.home kernel: sd 3:0:0:0: [sdd] tag#1 ASC=0x21 ASCQ=0x4
Sep 02 18:35:01 pc.home kernel: sd 3:0:0:0: [sdd] tag#1 CDB: opcode=0x28 28 00 00 03 38 90 00 00 58 00
Sep 02 18:35:01 pc.home kernel: blk_update_request: I/O error, dev sdd, sector 211088
Sep 02 18:35:01 pc.home kernel: ata4: EH complete
|
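As a side note on reading the log above: opcode 0x28 is the SCSI READ(10) command, whose bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the transfer length, both big-endian. A minimal Python sketch (the function name `decode_read10` is only illustrative) decodes the CDB line and shows that the failed read starts at the same sector the kernel reports:

```python
def decode_read10(cdb_hex):
    """Decode the LBA and block count from a 10-byte SCSI READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


# CDB bytes copied from the journalctl output above.
info = decode_read10("28 00 00 03 38 90 00 00 58 00")
print(info)  # {'lba': 211088, 'blocks': 88}
```

The decoded LBA, 211088, matches the sector in the blk_update_request line, so both messages describe the same failed read.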
The problem stops showing up (although the pool then runs in degraded mode) when one of the disks is disconnected. Everything points at this disk being the problem, but how can I be 100% sure before I go back to the vendor asking for a replacement?
Here's my smartctl output:
Code: |
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always   -           143113960
  3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always   -           1468
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always   -           963051
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always   -           15475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           415
183 Runtime_Bad_Block       0x0032   092   092   000    Old_age   Always   -           8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always   -           0 0 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   056   041   045    Old_age   Always   In_the_past 44 (Min/Max 44/44 #781)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always   -           3784
193 Load_Cycle_Count        0x0032   065   065   000    Old_age   Always   -           71391
194 Temperature_Celsius     0x0022   044   059   000    Old_age   Always   -           44 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   194   000    Old_age   Always   -           229
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           10279h+26m+11.068s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           21132281959
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           17697336676 |
I see no reallocated sectors, but I see In_the_past on Airflow_Temperature_Cel. Does that mean the disk overheated at some point?
thanks
Code tags added for easy reading -- NeddySeagoon |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54838 Location: 56N 3W
|
Posted: Fri Sep 02, 2016 7:51 pm Post subject: |
|
|
Adel Ahmed,
Use smartctl to run the long self test. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Sat Sep 03, 2016 12:12 am Post subject: |
|
|
Your read error rates look pretty high for a disk that age. That could be a symptom of a failing drive head, or excessive vibrations throwing it off frequently enough to make the drive self-reset, or just a bad cable (not necessarily the external one though). It's important to know how fast the numbers are increasing too. |
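One way to see how fast the numbers are increasing is to save the raw values from two smartctl runs some time apart and diff them. A rough Python sketch follows; the helper name `raw_deltas` and the second snapshot are made up for illustration, and only the first snapshot comes from the output posted in this thread:

```python
def raw_deltas(before, after):
    """Return {attribute: change} for raw counters that moved between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


# First snapshot: raw values copied from the smartctl output in this thread.
snapshot_1 = {
    "Reallocated_Sector_Ct": 0,
    "Runtime_Bad_Block": 8,
    "UDMA_CRC_Error_Count": 229,
}

# Second snapshot: HYPOTHETICAL numbers, invented only to show the comparison.
snapshot_2 = {
    "Reallocated_Sector_Ct": 0,
    "Runtime_Bad_Block": 8,
    "UDMA_CRC_Error_Count": 240,
}

print(raw_deltas(snapshot_1, snapshot_2))  # {'UDMA_CRC_Error_Count': 11}
```

Counters that stay flat drop out of the result, so whatever is left is what is still growing.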
|
Buffoon Veteran
Joined: 17 Jun 2015 Posts: 1369 Location: EU or US
|
Posted: Sat Sep 03, 2016 12:52 am Post subject: |
|
|
I've seen read error rate high with Seagate, seems to mean nothing. Is it a Seagate? |
|
Adel Ahmed Veteran
Joined: 21 Sep 2012 Posts: 1607
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54838 Location: 56N 3W
|
Posted: Sat Sep 03, 2016 7:07 pm Post subject: |
|
|
Adel Ahmed,
If that's correct, you stopped the test just before it finished, in the last 1% of the drive.
Code: | SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 00% 15490 - |
It would be better if the completion message was not Code: | Interrupted (host reset) | , as that casts doubt over the rest of the information.
Try the long test again, this time allow it to complete or report an error.
It really is a long test. The drive does something similar to reading the entire drive to /dev/null but no data is passed over the SATA interface.
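To check the result without reading the log by eye, something like the following Python sketch could parse one row of the self-test log. The regular expression and the name `parse_selftest_row` are assumptions for illustration; the sketch only handles the "Short offline" and "Extended offline" descriptions and treats "Completed without error" as the only passing status:

```python
import re

# One row of the self-test log: number, description, status, remaining %, hours, LBA.
ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+"
    r"(?P<test>Short offline|Extended offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)"
)


def parse_selftest_row(line):
    """Return the fields of a self-test log row as a dict, or None if no match."""
    match = ROW.search(line)
    if match is None:
        return None
    row = match.groupdict()
    # Anything other than a clean completion means the test did not pass.
    row["passed"] = row["status"] == "Completed without error"
    return row


# Row copied from the self-test log quoted above.
row = parse_selftest_row("# 1 Extended offline Interrupted (host reset) 00% 15490 -")
print(row["status"], row["passed"])  # Interrupted (host reset) False
```

An interrupted run comes out as not passed, which is why the test needs to be repeated and left alone until it finishes.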
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 143113960 |
Is nothing to worry about. Raw values are often packed bit fields, so the Raw_Read_Error_Rate might be, say, just the top 8 bits of a 32-bit value.
Unless you know how the 32-bit value is packed, it's not useful.
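To illustrate what packing means, here is a Python sketch that splits a raw value using the purely hypothetical layout from the sentence above (top 8 bits as one field, low 24 bits as another). This is not the vendor's documented layout; the split and the name `split_top8` are assumptions for illustration only:

```python
def split_top8(raw):
    """Split a 32-bit raw value into (top 8 bits, low 24 bits).

    HYPOTHETICAL layout: the real packing is vendor specific and unknown here.
    """
    top = (raw >> 24) & 0xFF
    low = raw & 0xFFFFFF
    return top, low


# Raw_Read_Error_Rate raw value from the smartctl output in this thread.
print(hex(143113960))         # 0x887bee8
print(split_top8(143113960))  # (8, 8896232)
```

Under that assumed layout a frightening-looking 143113960 turns into a single-digit field plus a large counter, which is why the undecoded number says little on its own.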
The VALUE WORST THRESH values are all normalised. A value of VALUE or WORST, that is less than or equal to THRESH, is a cause for concern.
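That rule is easy to apply mechanically. A small Python sketch, assuming the whitespace-separated smartctl attribute rows shown in this thread (the name `at_or_below_threshold` is illustrative):

```python
def at_or_below_threshold(rows):
    """Return (name, value, worst, thresh) for rows where VALUE or WORST <= THRESH."""
    flagged = []
    for line in rows:
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip the header and anything that is not an attribute row
        name = parts[1]
        value, worst, thresh = int(parts[3]), int(parts[4]), int(parts[5])
        if value <= thresh or worst <= thresh:
            flagged.append((name, value, worst, thresh))
    return flagged


# A few rows copied from the smartctl output in this thread.
rows = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 143113960",
    "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0",
    "190 Airflow_Temperature_Cel 0x0022 056 041 045 Old_age Always In_the_past 44 (Min/Max 44/44 #781)",
    "199 UDMA_CRC_Error_Count 0x003e 200 194 000 Old_age Always - 229",
]

print(at_or_below_threshold(rows))  # [('Airflow_Temperature_Cel', 56, 41, 45)]
```

On these rows only Airflow_Temperature_Cel is flagged, because its WORST of 041 is below the THRESH of 045. That appears consistent with the In_the_past marker in the WHEN_FAILED column.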
If the long test passes and you still get I/O errors, the problem is in the data path from the HDD to the CPU.
Replace the SATA data cable. That's easy.
Try another SATA port, if you have one. |
|