Gentoo Forums
SSD drive and occasional failed command: WRITE FPDMA QUEUED
tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Sat Jun 18, 2011 11:57 pm    Post subject: SSD drive and occasional failed command: WRITE FPDMA QUEUED

I bought an SSD as a replacement for the SATA drive that served as / on my home workstation. Everything seems to work fine, except that I get some strange NCQ errors about failed commands. It's either "READ FPDMA QUEUED" or "WRITE FPDMA QUEUED", and it looks like this:

Code:
Jun 19 01:05:43 rimmer kernel: ata6: EH in SWNCQ mode,QC:qc_active 0x1F sactive 0x1F
Jun 19 01:05:43 rimmer kernel: ata6: SWNCQ:qc_active 0x1B defer_bits 0x4 last_issue_tag 0x1
Jun 19 01:05:43 rimmer kernel: dhfis 0x19 dmafis 0x19 sdbfis 0x0
Jun 19 01:05:43 rimmer kernel: ata6: ATA_REG 0x40 ERR_REG 0x0
Jun 19 01:05:43 rimmer kernel: ata6: tag : dhfis dmafis sdbfis sacitve
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x0: 1 1 0 1 
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x1: 0 0 0 1 
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x3: 1 1 0 1 
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x4: 1 1 0 1 
Jun 19 01:05:43 rimmer kernel: ata6.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x6 frozen
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/10:00:14:d5:c8/00:00:0a:00:00/40 tag 0 ncq 8192 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/38:08:34:d5:c8/00:00:0a:00:00/40 tag 1 ncq 28672 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:10:74:d5:c8/00:00:0a:00:00/40 tag 2 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:18:54:d4:c8/00:00:0a:00:00/40 tag 3 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:20:f4:d4:c8/00:00:0a:00:00/40 tag 4 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6: hard resetting link
Jun 19 01:05:43 rimmer kernel: ata6: nv: skipping hardreset on occupied port
Jun 19 01:05:43 rimmer kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 19 01:05:43 rimmer kernel: ata6.00: configured for UDMA/133
Jun 19 01:05:43 rimmer kernel: ata6: EH complete
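
For anyone decoding these lines: the cmd field is the ATA taskfile as libata prints it. The sketch below (Python, written for illustration and not part of the original logs) assumes the layout command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and that for the FPDMA commands the sector count sits in the feature bytes and the tag in bits 7:3 of nsect. The name decode_ncq_cmd is made up for this example.

```python
import re

# Taskfile layout assumed from the log lines above:
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(line, sector_size=512):
    """Decode one 'cmd ...' log line of an NCQ read/write command."""
    match = CMD_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    command = int(match.group(1), 16)
    if command not in NCQ_OPCODES:
        raise ValueError("not an FPDMA QUEUED command")
    feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in match.group(2).split(":"))
    hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    # For NCQ the 16-bit sector count lives in the feature bytes; 0 means 65536.
    sectors = ((hob_feat << 8) | feat) or 65536
    # 48-bit LBA, most significant byte first.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": NCQ_OPCODES[command],
        "tag": nsect >> 3,  # the tag occupies bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }
```

For the "tag 1" line above this gives 56 sectors (28672 bytes, matching "ncq 28672") at LBA 0x0ac8d534.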


ata6 is the SSD. When it's a 'READ FPDMA QUEUED', it looks like this: http://pastebin.com/r1EedyuP. In particular, the log always complains about an invalid CHS sector:

Code:

Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0


I have no idea why this happens - the device halts for a few seconds and then everything works just fine. I haven't had time to run fsck on the drive, but the smartctl output looks OK (the full output is here: http://pastebin.com/0Zx64tRs):

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3883
226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       21
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2362
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       3883
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       3452
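
One rough way to double-check "looks OK" is to compare VALUE against THRESH for every row; smartctl treats an attribute as failed when the normalized value drops to or below a non-zero threshold. A minimal Python sketch (the function name is hypothetical, and it only understands the table layout shown above):

```python
def failed_attributes(table_text):
    """Return names of SMART attributes whose VALUE is at or below a non-zero THRESH."""
    failed = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # this skips the header line and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value = int(fields[3])
        thresh = int(fields[5])
        if thresh > 0 and value <= thresh:
            failed.append(name)
    return failed
```

Fed the table above, it returns an empty list, since every VALUE is 100 and no THRESH is higher than 090. It says nothing about the raw values, which still need reading by hand.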


The only recommendations I've found were to use the 'libata.force=noncq' kernel parameter and to run 'hdparm -Q 1' on the drive. I've tried both and nothing changed, except that the number of failed commands logged reflects the queue depth set with hdparm - e.g. when I use '-Q 5', the log contains 5 'WRITE FPDMA QUEUED' commands.
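
The link between queue depth and the number of failed commands can be read off the bitmasks in the log: SAct, qc_active and defer_bits carry one bit per tag. A small Python sketch (the helper name is made up for this example):

```python
def active_tags(mask):
    """List the NCQ tags (0..31) whose bit is set in a SAct/qc_active style mask."""
    return [tag for tag in range(32) if mask & (1 << tag)]


print(active_tags(0x1F))  # SAct from the first log
print(active_tags(0x1B))  # SWNCQ qc_active from the first log
print(active_tags(0x04))  # defer_bits from the first log
```

SAct 0x1f decodes to tags 0 through 4, i.e. five outstanding commands, which is consistent with five failed commands being logged. qc_active 0x1B decodes to tags 0, 1, 3 and 4 (the ones in the per-tag table), and defer_bits 0x4 is tag 2.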

How to fix this? Why is it happening? The basic system info:


  • kernel: 2.6.36.1 (vanilla, but I initially saw exactly the same problems with the current livecd)
  • motherboard: Asus M2N-E (nvidia nforce-570 chipset)
  • SSD: Intel 320 (120GB version)
  • filesystem: reiserfs 3.6


I've checked that all the SATA cables are OK, and the original SATA drive was working just fine on the very same cable for several years. I've just replaced it with the SSD.

I originally copied the data to the SSD using dd (the drives are of exactly the same size), and IIRC there was no such error at that point. Might be a coincidence, but it's kinda suspicious.

Any ideas what causes this and how to fix it?

tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Mon Jun 20, 2011 11:35 am

I'm just wondering - when moving the data from the old HDD to the SSD, I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem? I know partitions on SSD drives need to be aligned to get optimal results (due to the 512k blocks), but I doubt misalignment could cause such problems. Or might this be the real cause?

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23101

Posted: Tue Jun 21, 2011 2:23 am

tv007 wrote:
I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem?
That was probably bad for your drive, whether or not it caused the problem you reported. SSDs work much better when they know which areas contain useful data and which do not. By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector. If your drive supports the TRIM command, you may be able to mitigate the damage that way.
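
As a side note on checking TRIM support: on kernels that expose discard information in sysfs, a non-zero queue/discard_max_bytes for the device indicates that the block layer will pass discards to it. A minimal Python sketch under that assumption (the function name and the sysfs_root parameter are made up here; "sdb" is only an example device name):

```python
from pathlib import Path


def supports_discard(device, sysfs_root="/sys/block"):
    """Return True if <sysfs_root>/<device>/queue/discard_max_bytes is non-zero."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        # Attribute missing (older kernel, unknown device) or unreadable:
        # treat as "no discard support reported".
        return False
```

Calling supports_discard("sdb") on the machine in question would show whether the kernel considers the SSD discard-capable; it does not prove that the filesystem actually issues discards.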

tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Tue Jun 21, 2011 12:12 pm

Hu wrote:
tv007 wrote:
I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem?
That was probably bad for your drive, whether or not it caused the problem you reported. SSDs work much better when they know which areas contain useful data and which do not. By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector. If your drive supports the TRIM command, you may be able to mitigate the damage that way.


Yes, I'm used to copying partitions like this, and I realized too late that this might be a problem for an SSD. Anyway, I don't think that should cause the I/O errors I've described. What should I do to fix that? I plan to add 'discard' to the mount options and overwrite the free space with zeroes (cat /dev/zero > file.tmp; rm file.tmp). That should do the trick, I guess?

I plan to repartition the drive to get proper alignment, and I'm thinking about a fresh install (I'm still on 32 bits and I'm considering switching to 64 bits).

Anyway, I haven't seen the I/O errors for about two days - not sure what changed. Yesterday I flashed the BIOS on the MB, replaced the SATA cable, moved the drive to a separate power line (all the other drives are on the other one), changed the elevator to noop, etc. So far everything seems fine (and I hope it'll stay like that).

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23101

Posted: Tue Jun 21, 2011 10:49 pm

Some filesystems will automatically issue a TRIM when they are created. If you use one of those, explicit clearing should not be necessary.

tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Tue Jun 21, 2011 11:57 pm

Hu wrote:
Some filesystems will automatically issue a TRIM when they are created. If you use one of those, explicit clearing should not be necessary.


Yes, I know (now). I've found a nice article describing how to partition an SSD. I copied the data to another drive, repartitioned the SSD to get proper partition alignment, and then created an ext4 partition, so now I've got this:

Code:
$ fdisk -S 32 -H 32 /dev/sdb

Command (m for help): p

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
32 heads, 32 sectors/track, 228946 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x7c257c25

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048      133119       65536   83  Linux
/dev/sdb2          133120     8521727     4194304   82  Linux swap / Solaris
/dev/sdb3         8521728   234441647   112959960   83  Linux


Not sure why the "boot" partition (sdb1) starts at sector 2048 (I guess 1024 would do just as well, since 1024 x 512 B = 512kB), but either way the partitions are nicely aligned to 512kB.
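
The start sectors from the fdisk listing can be checked against the 512kB boundary with a few lines of Python. This is only a sketch: the 512kB figure is the block size assumed in this thread, not something read from the drive, and the names are made up for the example.

```python
SECTOR_BYTES = 512
BOUNDARY_BYTES = 512 * 1024  # block size assumed in this thread


def is_aligned(start_sector, boundary=BOUNDARY_BYTES, sector=SECTOR_BYTES):
    """True if the partition's first byte falls on a multiple of `boundary`."""
    return (start_sector * sector) % boundary == 0


# Start sectors of sdb1, sdb2 and sdb3 from the listing above.
for start in (2048, 133120, 8521728):
    print(start, is_aligned(start))
```

All three starts pass. Sector 1024 would also pass, whereas sector 512 sits at 256kB and would not.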

The ext4 was created like this

Code:
mke2fs -t ext4 -E stripe-size=128 /dev/sdb3


so it should be nicely aligned too (128 x 4kB blocks = 512kB). AFAIK ext4 clears all the blocks when it's created, and I've mounted it with 'discard' so this should be fixed too.

Hopefully this will make all those strange I/O errors go away ...

tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Wed Jun 22, 2011 9:36 pm

So no luck - I just got a bunch of "WRITE FPDMA QUEUED" errors :-(

The full dmesg output (including the I/O errors) is available here: http://pastebin.com/7pkreUCA

I really wonder how this can happen, because I've set the I/O scheduler to noop for the SSD, yet the errors are somehow related to SWNCQ:

Code:
EXT4-fs (sdb3): re-mounted. Opts: discard,commit=0
ata6: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
ata6: SWNCQ:qc_active 0x1E031 defer_bits 0x7FFE1FCE last_issue_tag 0x10
  dhfis 0xE031 dmafis 0x6010 sdbfis 0x0
ata6: ATA_REG 0x40 ERR_REG 0x0
ata6: tag : dhfis dmafis sdbfis sacitve
ata6: tag 0x0: 1 0 0 1 
ata6: tag 0x4: 1 1 0 1 
ata6: tag 0x5: 1 0 0 1 
ata6: tag 0xd: 1 1 0 1 
ata6: tag 0xe: 1 1 0 1 
ata6: tag 0xf: 1 0 0 1 
ata6: tag 0x10: 0 0 0 1 
ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/10:00:10:d7:f0/00:00:05:00:00/40 tag 0 ncq 8192 out
...


Code:
rimmer ~ # cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq
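
For reference, the bracketed entry in that file is the active scheduler. The elevator only orders requests in the block layer; NCQ queuing happens below it, in the driver and the drive, so noop by itself would not be expected to switch SWNCQ off. A tiny Python sketch of picking out the active entry (the helper name is hypothetical):

```python
def active_scheduler(scheduler_line):
    """Return the bracketed (active) entry of a queue/scheduler line, or None."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


print(active_scheduler("[noop] deadline cfq"))
```

For the output shown above it prints noop.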


I have no idea what's wrong. It could be a hw problem (e.g. a motherboard issue), but it was very reliable till today.

gorkypl
Guru
Joined: 04 Oct 2010
Posts: 444
Location: Kraków, PL

Posted: Wed Jun 22, 2011 10:55 pm

Could you try with the latest kernel?

tv007
n00b
Joined: 06 Aug 2006
Posts: 22

Posted: Thu Jun 23, 2011 1:39 pm

gorkypl wrote:
Could you try with the latest kernel?

I already upgraded to 2.6.38-gentoo-r6, i.e. the latest stable version, two days ago. The problem is still there. Some additional info:

dmesg : http://pastebin.com/uHvTVmss
.config : http://pastebin.com/PYeLKaBL
lspci : http://pastebin.com/nQPS0rxU
smartctl : http://pastebin.com/DwJfxdTK

I've started a new thread on the lkml mailing list, https://lkml.org/lkml/2011/6/22/476, no reply yet.

It seems this might be a SATA chipset glitch (not sure why it did not fail before with a traditional HDD; maybe the SSD is so fast that it triggers a race condition). I do have an unused Promise FastTrak TX4 controller, so I'll try using it instead of the onboard Nvidia MCP55 chipset.