tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Sat Jun 18, 2011 11:57 pm Post subject: SSD drive and occasional failed command: WRITE FPDMA QUEUED
I've bought an SSD as a replacement for the SATA drive that served as / on my home workstation. Everything seems to work fine, except that I get some strange NCQ errors about failed commands. It's either "READ FPDMA QUEUED" or "WRITE FPDMA QUEUED", and it looks like this:
Code: | Jun 19 01:05:43 rimmer kernel: ata6: EH in SWNCQ mode,QC:qc_active 0x1F sactive 0x1F
Jun 19 01:05:43 rimmer kernel: ata6: SWNCQ:qc_active 0x1B defer_bits 0x4 last_issue_tag 0x1
Jun 19 01:05:43 rimmer kernel: dhfis 0x19 dmafis 0x19 sdbfis 0x0
Jun 19 01:05:43 rimmer kernel: ata6: ATA_REG 0x40 ERR_REG 0x0
Jun 19 01:05:43 rimmer kernel: ata6: tag : dhfis dmafis sdbfis sacitve
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x0: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x1: 0 0 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x3: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x4: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x6 frozen
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/10:00:14:d5:c8/00:00:0a:00:00/40 tag 0 ncq 8192 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/38:08:34:d5:c8/00:00:0a:00:00/40 tag 1 ncq 28672 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:10:74:d5:c8/00:00:0a:00:00/40 tag 2 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:18:54:d4:c8/00:00:0a:00:00/40 tag 3 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:20:f4:d4:c8/00:00:0a:00:00/40 tag 4 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6: hard resetting link
Jun 19 01:05:43 rimmer kernel: ata6: nv: skipping hardreset on occupied port
Jun 19 01:05:43 rimmer kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 19 01:05:43 rimmer kernel: ata6.00: configured for UDMA/133
Jun 19 01:05:43 rimmer kernel: ata6: EH complete
|
The ata6 device is the SSD. When the failed command is 'READ FPDMA QUEUED' it looks like this: http://pastebin.com/r1EedyuP - in particular, it always complains about an invalid CHS sector:
Code: |
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
|
I have no idea why this happens - the device stalls for a few seconds and then everything works just fine. I haven't had time to run fsck on the drive yet, but the smartctl output looks OK (the full output is here: http://pastebin.com/0Zx64tRs):
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0
4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 5
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 3883
226 Load-in_Time 0x0032 100 100 000 Old_age Always - 21
227 Torq-amp_Count 0x0032 100 100 000 Old_age Always - 0
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 2362
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 3883
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 3452
|
The only recommendations I've found were to use the 'libata.force=noncq' kernel parameter and to run 'hdparm -Q 1' on the drive. I've tried both; nothing changed, except that the number of failed commands logged matches the queue depth set with hdparm - e.g. with '-Q 5' the log contains 5 failed 'WRITE FPDMA QUEUED' commands.
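That correlation makes sense: the SActive value in the log is a bitmask of outstanding NCQ tags, so the queue depth caps how many commands can be in flight (and thus fail) at once. A quick sketch of counting the bits in the mask from the log above (0x1f):

```shell
# Count the set bits in the SActive mask from the log (0x1f);
# each set bit is one outstanding NCQ tag, i.e. one queued command.
mask=$((0x1f))
count=0
while [ "$mask" -ne 0 ]; do
    count=$((count + (mask & 1)))
    mask=$((mask >> 1))
done
echo "$count outstanding NCQ commands"   # prints: 5 outstanding NCQ commands
```

Five tags in flight, five timed-out commands, matching the five 'WRITE FPDMA QUEUED' entries above.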
How can I fix this, and why is it happening? Basic system info:
- kernel: 2.6.36.1 (vanilla, but I've initially seen exactly the same problems with the current livecd)
- motherboard: Asus M2N-E (nvidia nforce-570 chipset)
- SSD: Intel 320 (120GB version)
- filesystem: reiserfs 3.6
I've checked that all the SATA cables are OK, and the original SATA drive worked just fine on the very same cable for several years; I've simply replaced it with the SSD.
I originally copied the data to the SSD using dd (the drives are exactly the same size), and IIRC there was no such error back then. Might be a coincidence, but it's kind of suspicious.
Any ideas what causes this and how to fix it?
tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Mon Jun 20, 2011 11:35 am
I'm just wondering - when moving the data from the old HDD to the SSD, I copied the whole device using "dd" (the drives are exactly the same size). Could this be the problem? I know SSDs need properly aligned partitions for optimal results (because of the 512 kB erase blocks), but I doubt that alone could cause such errors. Or might this be the real cause?
Hu Administrator
Joined: 06 Mar 2007 Posts: 23101
Posted: Tue Jun 21, 2011 2:23 am
tv007 wrote: | I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem? | That was probably bad for your drive, whether or not it caused the problem you reported. SSDs work much better when they know which areas contain useful data and which do not. By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector. If your drive supports the TRIM command, you may be able to mitigate the damage that way. |
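For reference, whether a drive supports TRIM can be read from its identify data. A minimal sketch, assuming hdparm is installed and the SSD is /dev/sdb as in the logs above (capable drives print a "Data Set Management TRIM supported" line):

```shell
# Check whether the SSD advertises TRIM support via its identify data.
dev=/dev/sdb   # the SSD from the logs above; adjust as needed
if [ -b "$dev" ] && command -v hdparm >/dev/null 2>&1; then
    hdparm -I "$dev" | grep -i 'TRIM supported' || echo "no TRIM support reported"
else
    echo "skipping: $dev or hdparm not available here"
fi
```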
tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Tue Jun 21, 2011 12:12 pm
Hu wrote: | tv007 wrote: | I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem? | That was probably bad for your drive, whether or not it caused the problem you reported. SSDs work much better when they know which areas contain useful data and which do not. By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector. If your drive supports the TRIM command, you may be able to mitigate the damage that way. |
Yes, I'm used to copying partitions like this, and I realized too late that it might be a problem for an SSD. Anyway, I don't think that should cause the I/O errors I've described. What should I do to fix it? I plan to add 'discard' to the mount options and rewrite the free space with zeroes (cat /dev/zero > file.tmp && rm file.tmp). That should do the trick, I guess?
I also plan to repartition the drive to get proper alignment, and I'm thinking about a fresh install (I'm still on 32 bits and considering a switch to 64 bits).
Anyway, I haven't seen the I/O errors for about two days - I'm not sure what changed. Yesterday I flashed the BIOS on the motherboard, replaced the SATA cable, moved the drive to a separate power line (all the other drives are on the other one), changed the elevator to noop, etc. So far everything seems fine (and I hope it stays that way).
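The free-space rewrite mentioned above can be sketched as follows. Note the count=4 bound is only there to keep the example small and safe; drop it to actually write until the filesystem is full, which is the point of the trick:

```shell
# Overwrite free space with zeroes, flush, then delete the temp file
# so the blocks become free (and trimmable) again.
# count=4 bounds the write for illustration only; omit it to fill
# all free space, and expect dd to stop with "No space left on device".
dd if=/dev/zero of=zero.tmp bs=1M count=4 2>/dev/null
sync
rm zero.tmp
```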
Hu Administrator
Joined: 06 Mar 2007 Posts: 23101
Posted: Tue Jun 21, 2011 10:49 pm
Some filesystems will automatically issue a TRIM when they are created. If you use one of those, explicit clearing should not be necessary.
tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Tue Jun 21, 2011 11:57 pm
Hu wrote: | Some filesystems will automatically issue a TRIM when they are created. If you use one of those, explicit clearing should not be necessary. |
Yes, I know (now). I've found a nice article describing how to partition an SSD. I copied the data to another drive, repartitioned the SSD to get proper partition alignment, and then created an ext4 filesystem, so now I've got this:
Code: | $ fdisk -S 32 -H 32 /dev/sdb
Command (m for help): p
Disk /dev/sdb: 120.0 GB, 120034123776 bytes
32 heads, 32 sectors/track, 228946 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x7c257c25
Device Boot Start End Blocks Id System
/dev/sdb1 2048 133119 65536 83 Linux
/dev/sdb2 133120 8521727 4194304 82 Linux swap / Solaris
/dev/sdb3 8521728 234441647 112959960 83 Linux |
I'm not sure why the "boot" partition (sdb1) starts at sector 2048 (I guess 512 would be just as fine), but otherwise the partitions are nicely aligned to 512 kB.
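As a sanity check: 512 kB is 1024 sectors of 512 B, so each start sector in the fdisk listing above should be divisible by 1024. A quick sketch:

```shell
# Verify 512 kB alignment of the partition start sectors shown above:
# 512 kB / 512 B per sector = 1024 sectors, so start % 1024 must be 0.
for start in 2048 133120 8521728; do
    echo "$start: $([ $((start % 1024)) -eq 0 ] && echo aligned || echo misaligned)"
done
```

All three starts print "aligned".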
The ext4 filesystem was created like this:
Code: | mke2fs -t ext4 -E stripe-size=128 /dev/sdb3 |
so it should be nicely aligned too (128 x 4 kB blocks = 512 kB). AFAIK mke2fs discards all the blocks when the filesystem is created, and I've mounted it with 'discard', so this should be covered too.
Hopefully this will make all those strange I/O errors go away...
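For what it's worth, assuming the intended mke2fs option is the stripe width in filesystem blocks (the mke2fs man page documents it as stripe_width / stripe-width under -E, rather than stripe-size), the arithmetic above checks out:

```shell
# A stripe of 128 filesystem blocks of 4 kB each should match the
# 512 kB erase-block size the partitions were aligned to.
stripe_bytes=$((128 * 4096))
echo "$stripe_bytes bytes = $((stripe_bytes / 1024)) kB"   # 524288 bytes = 512 kB
```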
tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Wed Jun 22, 2011 9:36 pm
So, no luck - I just got a bunch of "WRITE FPDMA QUEUED" errors.
The full dmesg output (including the I/O errors) is available here: http://pastebin.com/7pkreUCA
I really wonder how this can happen: I've set the I/O scheduler to noop for the SSD, yet the errors are still related to SWNCQ:
Code: | EXT4-fs (sdb3): re-mounted. Opts: discard,commit=0
ata6: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
ata6: SWNCQ:qc_active 0x1E031 defer_bits 0x7FFE1FCE last_issue_tag 0x10
dhfis 0xE031 dmafis 0x6010 sdbfis 0x0
ata6: ATA_REG 0x40 ERR_REG 0x0
ata6: tag : dhfis dmafis sdbfis sacitve
ata6: tag 0x0: 1 0 0 1
ata6: tag 0x4: 1 1 0 1
ata6: tag 0x5: 1 0 0 1
ata6: tag 0xd: 1 1 0 1
ata6: tag 0xe: 1 1 0 1
ata6: tag 0xf: 1 0 0 1
ata6: tag 0x10: 0 0 0 1
ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/10:00:10:d7:f0/00:00:05:00:00/40 tag 0 ncq 8192 out
... |
Code: | rimmer ~ # cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq |
I have no idea what's wrong. It could be a hardware problem (e.g. a motherboard issue), but the board was very reliable until today.
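As an aside, the brackets in that sysfs output mark the currently active scheduler; a small sketch of extracting it from the exact line shown above:

```shell
# The active I/O scheduler is the bracketed entry in
# /sys/block/<dev>/queue/scheduler; parse it out of the line above.
line='[noop] deadline cfq'
active=$(printf '%s\n' "$line" | sed 's/.*\[\([^]]*\)\].*/\1/')
echo "active scheduler: $active"   # prints: active scheduler: noop
```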
gorkypl Guru
Joined: 04 Oct 2010 Posts: 444 Location: Kraków, PL
Posted: Wed Jun 22, 2011 10:55 pm
Could you try with the latest kernel?
tv007 n00b
Joined: 06 Aug 2006 Posts: 22
Posted: Thu Jun 23, 2011 1:39 pm
gorkypl wrote: | Could you try with the latest kernel? |
I already upgraded to 2.6.38-gentoo-r6 (the latest stable version) two days ago; the problem is still there. Some additional info:
dmesg : http://pastebin.com/uHvTVmss
.config : http://pastebin.com/PYeLKaBL
lspci : http://pastebin.com/nQPS0rxU
smartctl : http://pastebin.com/DwJfxdTK
I've also started a new thread on the lkml mailing list (https://lkml.org/lkml/2011/6/22/476); no reply yet.
It seems this might be a SATA chipset glitch (not sure why it never failed with the traditional HDD - maybe the SSD is so fast that it triggers a race condition). I do have an unused Promise FastTrak TX4 controller, so I'll try using it instead of the onboard Nvidia MCP55 chipset.