Gentoo Forums
BTRFS raid0 filesystem corrupt (not caused by hardware)
JohnTheCoolingFan
n00b


Joined: 24 Jan 2024
Posts: 23

Posted: Mon Jul 15, 2024 5:46 pm    Post subject: BTRFS raid0 filesystem corrupt (not caused by hardware)

I have a 2x2TB raid0 HDD BTRFS array holding a lot of not very important data. Recently I started getting I/O errors. btrfs scrub pointed to a file, which I deleted, but when I re-download it the errors come back.

My first instinct was that one of the drives was failing. Although they are relatively new, they've mostly just been read in the background. One drive has 11911 hours and the other has 9655. However, SMART did not report any reallocated sectors, and the short self-tests and conveyance tests completed without errors. This must mean there is a problem in the filesystem itself.

So I used the btrfs-progs tools to gather information, along with dmesg messages.

First I ran a scrub again, with the file deleted, but it seems I'm now getting more errors. Right after the scrub start message, dmesg shows the following:

Code:

[   51.316924] BTRFS info (device sdd1): scrub: started on devid 1
[   51.338541] BTRFS info (device sdd1): scrub: started on devid 2
[   51.988225] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 1 wanted 155271 found 155112
[   52.000646] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 2 wanted 155271 found 155112
[   52.029561] BTRFS info (device sdd1): scrub: not finished on devid 2 with status: -5
[   77.231973] BTRFS error (device sdd1): unable to fixup (regular) error at logical 6784041680896 on dev /dev/sdd1 physical 1151664128
[   77.387821] BTRFS warning (device sdd1): checksum error at logical 6784041680896 on dev /dev/sdd1, physical 1151664128, root 256, inode 38877, offset 1681453056, length 4096, links 1 (path: <redacted>)
[   77.387836] BTRFS error (device sdd1): unable to fixup (regular) error at logical 6784041680896 on dev /dev/sdd1 physical 1151664128
[   77.387887] BTRFS warning (device sdd1): checksum error at logical 6784041680896 on dev /dev/sdd1, physical 1151664128, root 256, inode 38877, offset 1681453056, length 4096, links 1 (path: <redacted>)
<more similar messages>
[   85.307015] BTRFS error (device sdd1): unable to fixup (regular) error at logical 6784935657472 on dev /dev/sdd1 physical 2045640704
[   85.307046] BTRFS warning (device sdd1): checksum error at logical 6784935657472 on dev /dev/sdd1, physical 2045640704, root 256, inode 38778, offset 281182208, length 4096, links 1 (path: CB2_Debian11_minimal_kernel4.19_20240619.img.xz)
[   92.400513] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 2 wanted 155271 found 155112
[   92.400672] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 1 wanted 155271 found 155112
[   92.400874] BTRFS info (device sdd1): scrub: not finished on devid 1 with status: -5
[  120.379669] zsh[2893]: segfault at 28 ip 000055f9bd1b8fc2 sp 00007ffe4814d9d0 error 4 in zsh[55f9bd177000+99000] likely on CPU 8 (core 0, socket 0)
[  120.379684] Code: fd 53 eb 16 0f 1f 40 00 45 85 e4 74 3b be 10 00 00 00 48 89 df e8 6e 17 01 00 48 89 ef e8 d6 9c 00 00 48 89 c3 48 85 c0 74 3e <8b> 43 08 85 c0 75 d7 45 85 e4 74 22 48 8b 3b e8 5a 17 01 00 eb cd
[  184.695659] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 2 wanted 155271 found 155170
[  184.708977] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 1 wanted 155271 found 155170
[  184.709031] BTRFS error (device sdd1): failed to run delayed ref for logical 9670149029888 num_bytes 16384 type 176 action 1 ref_mod 1: -5
[  184.709043] BTRFS error (device sdd1: state A): Transaction aborted (error -5)
[  184.709048] BTRFS: error (device sdd1: state A) in btrfs_run_delayed_refs:2168: errno=-5 IO failure
[  184.709053] BTRFS info (device sdd1: state EA): forced readonly


At this point I was in panic mode and tried to mount the filesystem read-write to delete the other files that had been reported as corrupted. But the filesystem will only mount read-only now: "[ 616.204062] BTRFS error (device sdd1: state EMA): Remounting read-write after error is not allowed". I've tried the usebackuproot mount option, but it didn't seem to do anything useful. My goal is to get the filesystem back to being mountable read-write with the files on it intact, even if the files that have been corrupted have to be deleted, and to make sure this won't spread and get any worse.

I've tried btrfs check /dev/sdd1 to perform a simple check on the filesystem. Without --force it reported that the other device was busy, even though the filesystem was unmounted, which was weird. The end of the command's output looks like this:

Code:

extent back ref already exists for 9669272862720 parent 0 root 7
extent back ref already exists for 9669272879104 parent 0 root 7
extent back ref already exists for 9669272895488 parent 0 root 7
parent transid verify failed on 9669460180992 wanted 155271 found 155148
Ignoring transid failure
Segmentation fault (core dumped)


Yes, a segfault! Somehow the tool segfaulted while checking my filesystem, and that was also reported in dmesg:

Code:

[ 1066.654234] btrfs[6560]: segfault at 10 ip 0000563880606e32 sp 00007ffdf28495c0 error 4 in btrfs[563880588000+ba000] likely on CPU 2 (core 2, socket 0)
[ 1066.654253] Code: ec 30 48 8b 76 30 49 89 cd 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 e8 9a ed fb ff 48 85 c0 0f 84 c6 00 00 00 48 89 c3 <49> 8b 44 24 10 48 3d ff 00 00 00 76 4c 49 8b 4c 24 18 49 39 4e 30
[ 1309.350830] btrfs[7029]: segfault at 10 ip 00005621adddfe32 sp 00007ffd9a2978a0 error 4 in btrfs[5621add61000+ba000] likely on CPU 14 (core 6, socket 0)
[ 1309.350844] Code: ec 30 48 8b 76 30 49 89 cd 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 e8 9a ed fb ff 48 85 c0 0f 84 c6 00 00 00 48 89 c3 <49> 8b 44 24 10 48 3d ff 00 00 00 76 4c 49 8b 4c 24 18 49 39 4e 30
[ 1361.175220] btrfs[7243]: segfault at 10 ip 000055aa61925e32 sp 00007ffd9a351340 error 4 in btrfs[55aa618a7000+ba000] likely on CPU 12 (core 4, socket 0)
[ 1361.175241] Code: ec 30 48 8b 76 30 49 89 cd 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 e8 9a ed fb ff 48 85 c0 0f 84 c6 00 00 00 48 89 c3 <49> 8b 44 24 10 48 3d ff 00 00 00 76 4c 49 8b 4c 24 18 49 39 4e 30
[ 1777.170155] btrfs[8656]: segfault at 10 ip 0000556d62cc2e32 sp 00007ffe7d1d9d20 error 4 in btrfs[556d62c44000+ba000] likely on CPU 6 (core 6, socket 0)
[ 1777.170174] Code: ec 30 48 8b 76 30 49 89 cd 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 e8 9a ed fb ff 48 85 c0 0f 84 c6 00 00 00 48 89 c3 <49> 8b 44 24 10 48 3d ff 00 00 00 76 4c 49 8b 4c 24 18 49 39 4e 30


Some other commands, like btrfs rescue clear-ino-cache /dev/sdd1, also failed:

Code:

parent transid verify failed on 9669451923456 wanted 155271 found 155112
parent transid verify failed on 9669449138176 wanted 155271 found 155112
parent transid verify failed on 9669449138176 wanted 155271 found 155112
kernel-shared/extent-tree.c:1300: btrfs_inc_extent_ref: BUG_ON `err` triggered, value -5
btrfs(+0x27ae7)[0x559fc0b74ae7]
btrfs(btrfs_inc_extent_ref+0x12f)[0x559fc0b7717f]
btrfs(+0x2835d)[0x559fc0b7535d]
btrfs(btrfs_cow_block+0x294)[0x559fc0b660d4]
btrfs(btrfs_search_slot+0x91c)[0x559fc0b6968c]
btrfs(truncate_free_ino_items+0xbd)[0x559fc0c1616d]
btrfs(clear_ino_cache_items+0x1fd)[0x559fc0c1661d]
btrfs(+0xa6b6e)[0x559fc0bf3b6e]
btrfs(main+0x93)[0x559fc0b5df63]
/usr/lib64/libc.so.6(+0x262e0)[0x7f135fd972e0]
/usr/lib64/libc.so.6(__libc_start_main+0x89)[0x7f135fd97399]
btrfs(_start+0x25)[0x559fc0b5f4a5]
Aborted (core dumped)


I don't know what I expected from the command, as my search for solutions didn't yield anything conclusive. And the segfaulting? What is so broken about my filesystem that the checking tools segfault all of a sudden?
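
On the segfault question, one thing that can be checked from the dmesg lines alone is whether the tool dies at the same spot every time. Below is a minimal Python sketch, assuming that the `ip` and `name[base+size]` fields in the kernel's segfault lines are hexadecimal and that `base` is the start of the mapping that contains the faulting instruction. Subtracting the two gives an offset that can be compared across runs even though the absolute addresses differ. The helper name `segfault_offsets` is made up for illustration.

```python
import re

# Kernel segfault lines copied from the dmesg output in this post.
LOG = """\
[ 1066.654234] btrfs[6560]: segfault at 10 ip 0000563880606e32 sp 00007ffdf28495c0 error 4 in btrfs[563880588000+ba000] likely on CPU 2 (core 2, socket 0)
[ 1309.350830] btrfs[7029]: segfault at 10 ip 00005621adddfe32 sp 00007ffd9a2978a0 error 4 in btrfs[5621add61000+ba000] likely on CPU 14 (core 6, socket 0)
[ 1361.175220] btrfs[7243]: segfault at 10 ip 000055aa61925e32 sp 00007ffd9a351340 error 4 in btrfs[55aa618a7000+ba000] likely on CPU 12 (core 4, socket 0)
[ 1777.170155] btrfs[8656]: segfault at 10 ip 0000556d62cc2e32 sp 00007ffe7d1d9d20 error 4 in btrfs[556d62c44000+ba000] likely on CPU 6 (core 6, socket 0)
"""

# Assumed line layout: "<comm>[<pid>]: segfault at <addr> ip <ip> ... in <obj>[<base>+<size>]"
SEGV = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) .* in (?P<obj>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def segfault_offsets(log):
    """Return (process name, pid, ip minus mapping base) for each segfault line."""
    results = []
    for m in SEGV.finditer(log):
        offset = int(m["ip"], 16) - int(m["base"], 16)
        results.append((m["comm"], int(m["pid"]), offset))
    return results


for comm, pid, offset in segfault_offsets(LOG):
    print(f"{comm}[{pid}]: fault at mapping offset {offset:#x}")
```

For the four btrfs lines above the offset comes out as 0x7ee32 every time. That suggests the tool crashes at one and the same instruction on each run, which looks more like it tripping over the same on-disk structure than like random memory faults. This is an interpretation of the log, not a diagnosis.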

Currently I'm running btrfs rescue chunk-recover /dev/sdd1, which is a lengthy process, in the hope of recovering the filesystem to a working state, even at the cost of losing some files. It hasn't segfaulted yet, so that's a positive.

Please help me restore/recover my filesystem. I don't know what to do or which tools to use, so I hope the error messages and other information I've provided will let someone experienced tell what's wrong and how to fix it.

UPDATE: btrfs rescue chunk-recover finished. Here's the output:

Code:

Scanning: DONE in dev0, DONE in dev1                                 
corrupt leaf: root=1 block=9669179670528 slot=0, unexpected item end, have 16283 expect 0
leaf free space ret -4716, leaf data size 0, used 4716 nritems 17
leaf 9669179670528 items 17 free space -4716 generation 155275 owner ROOT_TREE
leaf 9669179670528 flags 0x1(WRITTEN) backref revision 1
fs uuid ec2c243e-5f1e-4d7b-a3c8-e3f6a73a140c
chunk uuid f3cb3cee-f352-4c55-94c2-b59b9a5629e8
ERROR: leaf 9669179670528 slot 0 pointer invalid, offset 15844 size 439 leaf data limit 0
ERROR: skip remaining slots
corrupt leaf: root=1 block=9669179670528 slot=0, unexpected item end, have 16283 expect 0
leaf free space ret -4716, leaf data size 0, used 4716 nritems 17
leaf 9669179670528 items 17 free space -4716 generation 155275 owner ROOT_TREE
leaf 9669179670528 flags 0x1(WRITTEN) backref revision 1
fs uuid ec2c243e-5f1e-4d7b-a3c8-e3f6a73a140c
chunk uuid f3cb3cee-f352-4c55-94c2-b59b9a5629e8
ERROR: leaf 9669179670528 slot 0 pointer invalid, offset 15844 size 439 leaf data limit 0
ERROR: skip remaining slots
Couldn't read tree root
open with broken chunk error


Update 2024-07-17: Tried btrfs check --repair --force /dev/sdd1 on fresh boot (force flag because it thought /dev/sdc1 was busy):
Code:

enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/sdd1
UUID: ec2c243e-5f1e-4d7b-a3c8-e3f6a73a140c
[1/7] checking root items
parent transid verify failed on 9669451923456 wanted 155271 found 155112
parent transid verify failed on 9669451923456 wanted 155271 found 155112
parent transid verify failed on 9669451923456 wanted 155271 found 155112
Ignoring transid failure
parent transid verify failed on 9669460180992 wanted 155271 found 155148
parent transid verify failed on 9669460180992 wanted 155271 found 155148
parent transid verify failed on 9669460180992 wanted 155271 found 155148
Ignoring transid failure
parent transid verify failed on 9669460443136 wanted 155271 found 155117
parent transid verify failed on 9669460443136 wanted 155271 found 155117
parent transid verify failed on 9669460443136 wanted 155271 found 155117
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=9669179932672 item=123 parent level=1 child bytenr=9669460443136 child level=1
ERROR: failed to repair root items: Input/output error


To be clear: the filesystem is mountable, but it goes read-only when it encounters an error with one of the files, and the btrfs commands seemingly can't fix the problem.


Last edited by JohnTheCoolingFan on Wed Jul 17, 2024 6:08 pm; edited 3 times in total
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9823
Location: almost Mile High in the USA

Posted: Mon Jul 15, 2024 10:47 pm

Your zsh segfaulted in the first dmesg dump, unrelated to btrfs as far as I can tell. I'd double check to make sure your computer is really running in tip top shape, or try the disks on other hardware, before continuing further with the current cpu/mb/ram...

BTW a filesystem driver segfaulting for whatever reason, and I mean whatever reason, is a bad sign. It better be because of hardware issues -- if not, this is not a filesystem I would trust storing data on. Filesystems should be resilient to corrupt data structures on disk. I would hope btrfs qualifies for this.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
JohnTheCoolingFan
n00b


Joined: 24 Jan 2024
Posts: 23

Posted: Wed Jul 17, 2024 4:02 pm

I changed the SATA data cables, as I was previously seeing ata errors in dmesg; now they are gone. I still have the btrfs scrub and btrfs check problems, with the same errors.

The errors reported are quite consistent, which is what makes me believe that the problem is in the data on the drives and how it's handled, not in the hardware.
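
One way to check that consistency instead of eyeballing it: a minimal Python sketch that groups the "parent transid verify failed" kernel messages by logical address and collects which mirrors reported them and which wanted/found pairs they carried. The sample lines are copied from the first post; the function name `summarize_transid` is made up for illustration, and the regex assumes the kernel message layout seen in this thread.

```python
import re
from collections import defaultdict

# Kernel messages copied from the scrub dmesg output in the first post.
LOG = """\
[   51.988225] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 1 wanted 155271 found 155112
[   52.000646] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 2 wanted 155271 found 155112
[   92.400513] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 2 wanted 155271 found 155112
[   92.400672] BTRFS error (device sdd1): parent transid verify failed on logical 9669465276416 mirror 1 wanted 155271 found 155112
[  184.695659] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 2 wanted 155271 found 155170
[  184.708977] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 1 wanted 155271 found 155170
"""

TRANSID = re.compile(
    r"parent transid verify failed on logical (?P<logical>\d+) "
    r"mirror (?P<mirror>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def summarize_transid(log):
    """Map each logical address to the mirrors and (wanted, found) pairs seen for it."""
    blocks = defaultdict(lambda: {"mirrors": set(), "pairs": set()})
    for m in TRANSID.finditer(log):
        entry = blocks[int(m["logical"])]
        entry["mirrors"].add(int(m["mirror"]))
        entry["pairs"].add((int(m["wanted"]), int(m["found"])))
    return dict(blocks)


for logical, entry in summarize_transid(LOG).items():
    for wanted, found in sorted(entry["pairs"]):
        print(f"logical {logical}: mirrors {sorted(entry['mirrors'])}, "
              f"wanted {wanted}, found {found} "
              f"({wanted - found} generations behind)")
```

In these lines each block shows exactly one wanted/found pair, reported by both mirrors, and the found generation is older than the wanted one (159 and 101 generations behind). As far as I understand, "wanted" is the generation the parent node expects and "found" is what the block on disk actually carries, so both copies of these blocks appear to hold the same stale version.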

Edit: when I now tried to delete one of the errored files, the file stayed, yet its metadata got garbled (it shows as all question marks in ls -l) and the filesystem went read-only. dmesg:

Code:

[   61.955139] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 1 wanted 155271 found 155170
[   61.974838] BTRFS error (device sdd1): parent transid verify failed on logical 9669291474944 mirror 2 wanted 155271 found 155170
[   61.974889] BTRFS error (device sdd1): failed to run delayed ref for logical 9670149029888 num_bytes 16384 type 176 action 1 ref_mod 1: -5
[   61.974898] BTRFS error (device sdd1: state A): Transaction aborted (error -5)
[   61.974902] BTRFS: error (device sdd1: state A) in btrfs_run_delayed_refs:2168: errno=-5 IO failure
[   61.974905] BTRFS info (device sdd1: state EA): forced readonly


I don't know how to tell btrfs to just discard the erroneous data and the file it belongs to, and to overwrite the values associated with it. As you can see, simply deleting the file already causes an error. Is this error basically unrecoverable?
JohnTheCoolingFan
n00b


Joined: 24 Jan 2024
Posts: 23

Posted: Wed Jul 17, 2024 5:32 pm

eccerr0r wrote:
Your zsh segfaulted in the first dmesg dump, unrelated to btrfs as far as I can tell. I'd double check to make sure your computer is really running in tip top shape, or try the disks on other hardware, before continuing further with the current cpu/mb/ram...

BTW a filesystem driver segfaulting for whatever reason, and I mean whatever reason, is a bad sign. It better be because of hardware issues -- if not, this is not a filesystem I would trust storing data on. Filesystems should be resilient to corrupt data structures on disk. I would hope btrfs qualifies for this.


So I tested the array in another PC with completely different hardware, and it still gives these exact results. I also ran a pass of memtest just in case, and it passed.

Also, this is not the btrfs filesystem driver you see segfaulting there. It's the tool from sys-fs/btrfs-progs, and it's up to date. zsh segfaulted shortly after the btrfs command did: I opened a new terminal tab and zsh was stuck.

So, this is not caused by other PC hardware, it's not really an HDD problem since the self-tests report no problems, and the official filesystem tools (not drivers) segfault. Great...
JohnTheCoolingFan
n00b


Joined: 24 Jan 2024
Posts: 23

Posted: Wed Jul 17, 2024 7:05 pm

Status update: roughly 105 out of 165 items (dirs and files) in one of the directories on the drive are getting their metadata corrupted. The metadata shows as all question marks in ls -l. These identical lines are repeated all throughout the kernel log:
Code:

[ 5787.836014] BTRFS error (device sdd1: state EA): level verify failed on logical 9669178376192 mirror 2 wanted 0 found 1
[ 5787.836082] BTRFS error (device sdd1: state EA): level verify failed on logical 9669178376192 mirror 1 wanted 0 found 1
[ 5787.836159] BTRFS error (device sdd1: state EA): level verify failed on logical 9669178376192 mirror 2 wanted 0 found 1
[ 5787.836227] BTRFS error (device sdd1: state EA): level verify failed on logical 9669178376192 mirror 1 wanted 0 found 1

Oh, and the SATA errors are not actually gone. From dmesg, shortly after boot:
Code:

[    3.726695] BTRFS info (device nvme0n1p2): first mount of filesystem 39759457-c51d-4c2c-979b-402238802b1d
[    3.727364] BTRFS info (device nvme0n1p2): using crc32c (crc32c-intel) checksum algorithm
[    3.727951] BTRFS info (device nvme0n1p2): enabling ssd optimizations
[    3.728480] BTRFS info (device nvme0n1p2): turning on async discard
[    3.729012] BTRFS info (device nvme0n1p2): using free space tree
[    3.970530] ata4.00: exception Emask 0x10 SAct 0x2 SErr 0x680100 action 0x6 frozen
[    3.971124] ata4.00: irq_stat 0x08000000, interface fatal error
[    3.971702] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    3.972193] ata4.00: failed command: READ FPDMA QUEUED
[    3.972666] ata4.00: cmd 60/f8:08:08:02:00/01:00:00:00:00/40 tag 1 ncq dma 258048 in
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[    3.973620] ata4.00: status: { DRDY }
[    3.974100] ata4: hard resetting link
[    4.500906] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    4.519700] ata4.00: configured for UDMA/133
[    4.520307] sd 3:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[    4.520881] sd 3:0:0:0: [sdd] tag#1 Sense Key : Illegal Request [current]
[    4.521360] sd 3:0:0:0: [sdd] tag#1 Add. Sense: Unaligned write command
[    4.521835] sd 3:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 00 00 02 08 00 01 f8 00
[    4.522312] I/O error, dev sdd, sector 520 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
[    4.522809] ata4: EH complete
[    4.834192] ata4.00: exception Emask 0x10 SAct 0x1000000 SErr 0x680100 action 0x6 frozen
[    4.834793] ata4.00: irq_stat 0x08000000, interface fatal error
[    4.835298] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    4.835800] ata4.00: failed command: READ FPDMA QUEUED
[    4.836288] ata4.00: cmd 60/08:c0:78:02:00/00:00:00:00:00/40 tag 24 ncq dma 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[    4.837280] ata4.00: status: { DRDY }
[    4.837772] ata4: hard resetting link
[    5.367517] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    5.386178] ata4.00: configured for UDMA/133
[    5.386803] ata4: EH complete
[    5.690921] ata4.00: exception Emask 0x10 SAct 0x20000 SErr 0x680100 action 0x6 frozen
[    5.691527] ata4.00: irq_stat 0x08000000, interface fatal error
[    5.692062] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    5.692565] ata4.00: failed command: READ FPDMA QUEUED
[    5.693058] ata4.00: cmd 60/08:88:78:02:00/00:00:00:00:00/40 tag 17 ncq dma 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[    5.694045] ata4.00: status: { DRDY }
[    5.694538] ata4: hard resetting link
[    6.220927] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    6.239700] ata4.00: configured for UDMA/133
[    6.240322] ata4: EH complete
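
To see at a glance what these ata messages have in common, here is a minimal Python sketch that counts SError flags, link resets and failed commands per port. The sample lines are copied from the log above; the function name `summarize_ata` and the three regexes are assumptions about the libata message layout as it appears in this thread, not an official format description.

```python
import re
from collections import Counter

# A subset of the ata4 lines from the dmesg output above.
LOG = """\
[    3.970530] ata4.00: exception Emask 0x10 SAct 0x2 SErr 0x680100 action 0x6 frozen
[    3.971702] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    3.972193] ata4.00: failed command: READ FPDMA QUEUED
[    3.974100] ata4: hard resetting link
[    4.834192] ata4.00: exception Emask 0x10 SAct 0x1000000 SErr 0x680100 action 0x6 frozen
[    4.835298] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    4.835800] ata4.00: failed command: READ FPDMA QUEUED
[    4.837772] ata4: hard resetting link
[    5.690921] ata4.00: exception Emask 0x10 SAct 0x20000 SErr 0x680100 action 0x6 frozen
[    5.692062] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
[    5.692565] ata4.00: failed command: READ FPDMA QUEUED
[    5.694538] ata4: hard resetting link
"""

SERROR = re.compile(r"(?P<port>ata\d+): SError: \{(?P<flags>[^}]*)\}")
RESET = re.compile(r"(?P<port>ata\d+): hard resetting link")
FAILED = re.compile(r"(?P<port>ata\d+)\.\d+: failed command: (?P<cmd>.+)")


def summarize_ata(log):
    """Count SError flags, hard link resets and failed commands per ata port."""
    flags, resets, commands = Counter(), Counter(), Counter()
    for m in SERROR.finditer(log):
        for flag in m["flags"].split():
            flags[(m["port"], flag)] += 1
    for m in RESET.finditer(log):
        resets[m["port"]] += 1
    for m in FAILED.finditer(log):
        commands[(m["port"], m["cmd"].strip())] += 1
    return flags, resets, commands


flags, resets, commands = summarize_ata(LOG)
for (port, flag), count in sorted(flags.items()):
    print(f"{port}: SError flag {flag} seen {count}x")
for port, count in sorted(resets.items()):
    print(f"{port}: {count} hard link resets")
for (port, cmd), count in sorted(commands.items()):
    print(f"{port}: failed command {cmd} seen {count}x")
```

All three exceptions here are on ata4, all on reads, and all carry the BadCRC flag followed by a hard link reset. As far as I know, BadCRC in SError is usually read as a CRC error on the SATA link itself (cable, connector, port or power) and not as a bad sector, and the drive's self-tests run inside the drive, so they would not necessarily catch it. SMART attribute 199 (UDMA_CRC_Error_Count) might be worth a look as well.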


Update: the metadata corruption is apparently gone, as the metadata is back in that folder, although I still can't delete files. Trying to do so makes the filesystem go read-only:
Code:

[   59.859209] BTRFS error (device sdc1): parent transid verify failed on logical 9669466734592 mirror 2 wanted 155271 found 155117
[   59.888620] BTRFS error (device sdc1): parent transid verify failed on logical 9669466734592 mirror 1 wanted 155271 found 155117
[   59.888643] BTRFS error (device sdc1: state A): Transaction aborted (error -5)
[   59.888648] BTRFS: error (device sdc1: state A) in __btrfs_free_extent:3099: errno=-5 IO failure
[   59.888652] BTRFS info (device sdc1: state EA): forced readonly
[   59.888654] BTRFS error (device sdc1: state EA): failed to run delayed ref for logical 9671959871488 num_bytes 16384 type 176 action 2 ref_mod 1: -5
[   59.888658] BTRFS: error (device sdc1: state EA) in btrfs_run_delayed_refs:2168: errno=-5 IO failure
[   59.888661] BTRFS warning (device sdc1: state EA): Skipping commit of aborted transaction.
[   59.888663] BTRFS: error (device sdc1: state EA) in cleanup_transaction:2002: errno=-5 IO failure