Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Fri Jun 14, 2019 11:50 am Post subject: [solved] e2fsck skips block checks and fails |
Hey there,
I am currently trying to fix an ext4 FS using e2fsck. The FS is on a 24 TB RAID6, which is currently missing one disk. Unfortunately, scanning the FS takes days, fills up RAM and swap space, and gets killed in the end. I also get a huge number of lines like the following:
German original: "Block %$b von Inode %$i steht in Konflikt mit kritischen Metadaten, Blockprüfungen werden übersprungen."
Translated: "Inode %$i block %$b conflicts with critical metadata, skipping block checks."
What else can I try to scan the FS?
I use e2fsprogs-1.44.5. The system has 9.43 GB of RAM and over 100 GB of swap space (I added two swap partitions on SSDs just for this scan).
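One thing I am considering, to keep e2fsck from eating all RAM, is pointing it at on-disk scratch files via e2fsck.conf. This is only a sketch based on e2fsck.conf(5) - it assumes my e2fsprogs build has scratch_files (tdb) support and that the chosen directory has plenty of free space:
Code: | # sketch: let e2fsck spill its tables to disk instead of RAM
mkdir -p /var/cache/e2fsck
cat >> /etc/e2fsck.conf <<'EOF'
[scratch_files]
directory = /var/cache/e2fsck
EOF
e2fsck -f /dev/mapper/share |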
Code: | share ~ # tune2fs -l /dev/mapper/share
tune2fs 1.44.5 (15-Dec-2018)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: c5f0559d-e3bd-473f-abc0-7c42b3115897
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 366268416
Block count: 5860269056
Reserved block count: 0
Free blocks: 5836914127
Free inodes: 366268405
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
RAID stride: 128
RAID stripe width: 1024
Flex block group size: 16
Filesystem created: Sat Mar 17 14:36:16 2018
Last mount time: n/a
Last write time: Fri Jun 7 10:11:34 2019
Mount count: 0
Maximum mount count: -1
Last checked: Sat Mar 17 14:36:16 2018
Check interval: 0 (<none>)
Lifetime writes: 457 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 4c37872e-3207-4ff4-8939-a428feaeb49f
Journal backup: inode blocks |
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu)
Last edited by Jimini on Sat Jun 22, 2019 8:20 am; edited 3 times in total |
389292 Guru
Joined: 26 Mar 2019 Posts: 504
Posted: Fri Jun 14, 2019 1:45 pm Post subject: |
As I was told once, you should never stop fsck partway, at least not if the partition was not mounted read-only during the check. fsck is generally considered a risky operation that should not be taken lightly. What exactly happened to your FS? |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Fri Jun 14, 2019 6:17 pm Post subject: |
Jimini,
Don't run fsck unless you have a backup or an image of the filesystem. fsck guesses what should be on the filesystem and often makes a bad situation worse.
It works by making the filesystem metadata self-consistent, but says nothing about any user data on the filesystem.
It's one of the last things to try.
Tell us what happened to your RAID6 and how it came to be down a drive.
Are the underlying drives OK?
Can you post the underlying SMART data for all the drives in the raid set?
Code: | smartctl -a /dev/sda |
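Something along these lines will collect it for all members in one go; the device letters are only an example, adjust them to your raid set:
Code: | # example only - adjust the device list to your actual raid members
for d in /dev/sd[a-j]; do
    echo "===== $d ====="
    smartctl -a "$d"
done > smart_before.txt |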
I assume the raid set assembles, but does it?
What does Code: | cat /proc/mdstat | show?
What about Code: | mdadm -E /dev/[block_device] | for each member of the raid set?
Have you tried mounting the filesystem read-only, using an alternate superblock?
Do not expect to do in-place data recovery. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Mon Jun 17, 2019 5:41 am Post subject: |
etnull & NeddySeagoon, thank you for your replies.
First of all: of course I have a backup :)
It just takes a looooooong time to copy everything back over gigabit Ethernet, and I'd like to dig a bit deeper into the problem first.
The disks are connected to two Dell PERC H200 controller cards (they are reflashed, since I wanted to use SW RAID):
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
02:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
These disks are assembled into /dev/md2. This array contains the LUKS container named "share".
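For reference, the storage stack is brought up roughly like this - a sketch from memory, assuming the array is defined in mdadm.conf:
Code: | mdadm --assemble /dev/md2            # RAID6 from the ten member disks
cryptsetup open /dev/md2 share       # unlock LUKS -> /dev/mapper/share (dm-1 in the kernel logs)
mount /dev/mapper/share /home/share  # the ext4 filesystem in question |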
One of the disks in this RAID6 got kicked out:
Quote: | May 19 01:13:47 backup kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 19 01:13:47 backup kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 19 01:13:47 backup kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 19 01:13:47 backup kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 19 01:13:47 backup kernel: sd 1:0:4:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
May 19 01:13:47 backup kernel: sd 1:0:4:0: [sdh] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 07 fd 88 00 00 00 00 28 00 00
May 19 01:13:47 backup kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 19 01:13:47 backup kernel: print_req_error: I/O error, dev sdh, sector 134055936
[...]
May 19 01:13:54 backup kernel: print_req_error: I/O error, dev sdh, sector 16
May 19 01:13:54 backup kernel: md: super_written gets error=10
May 19 01:13:54 backup kernel: md/raid:md2: Disk failure on sdh, disabling device.
May 19 01:13:54 backup kernel: md/raid:md2: Operation continuing on 9 devices.
May 19 01:13:54 share mdadm[2912]: Fail event detected on md device /dev/md2, component device /dev/sdh |
(The system itself is called "share"; it runs an LXC container with a system called "backup" on it.)
Since I was not at home, I could not replace the disk (but that's why I use RAID6 instead of RAID5). A few days later, logging in via SSH or locally was no longer possible.
After over two weeks of debugging and trying to get the problem fixed, I needed to use the system again, so I used magic SysRq "S" and "B" to write all cached data to the disks and reboot the system.
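(For anyone wanting to reproduce this: the key combination is roughly equivalent to writing to /proc/sysrq-trigger, assuming kernel.sysrq permits those functions:)
Code: | echo s > /proc/sysrq-trigger   # "S": sync all mounted filesystems
echo b > /proc/sysrq-trigger   # "B": reboot immediately, without unmounting |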
Afterwards, I noticed many errors in my syslog:
Quote: | /var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212912
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212955
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212956
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212957
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212958
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212959
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212960
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212961
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212962
/var/log/messages.2:Jun 7 07:34:36 share kernel: EXT4-fs error (device dm-1): ext4_lookup:1578: inode #11: comm du: deleted inode referenced: 3212963 |
That's why I wanted to check the file system.
The state of the array is as follows:
/proc/mdstat
Code: | Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md2 : active raid6 sdd[0] sdb[9] sda[8] sdh[7] sdi[6] sdc[10] sdf[3] sdg[2] sde[1]
23441080320 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9] [UUUUU_UUUU]
bitmap: 12/22 pages [48KB], 65536KB chunk |
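In case it helps, the per-slot view can also be pulled up like this (I have left the full output out of this post):
Code: | mdadm --detail /dev/md2   # per-device state, shows the removed slot and the event count |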
Output of mdadm -E:
https://pastebin.com/rXthGAAH
Output of smartctl -a:
https://pastebin.com/FGdLFkCh
I did not try to mount the FS using an alternate superblock, since I am unable to locate one:
Quote: | dumpe2fs 1.44.5 (15-Dec-2018)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: c5f0559d-e3bd-473f-abc0-7c42b3115897
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 366268416
Block count: 5860269056
Reserved block count: 0
Free blocks: 5836914127
Free inodes: 366268405
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
RAID stride: 128
RAID stripe width: 1024
Flex block group size: 16
Filesystem created: Sat Mar 17 14:36:16 2018
Last mount time: n/a
Last write time: Fri Jun 7 10:11:34 2019
Mount count: 0
Maximum mount count: -1
Last checked: Sat Mar 17 14:36:16 2018
Check interval: 0 (<none>)
Lifetime writes: 457 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 4c37872e-3207-4ff4-8939-a428feaeb49f
Journal backup: inode blocks |
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu) |
Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Thu Jun 20, 2019 7:10 am Post subject: |
I have an update: I replaced the failed disk in the array and started e2fsck again. Now it seems to be fixing a bunch of errors - I am curious whether it can finish its work this time.
Before I replaced the disk, e2fsck only complained about "Inode %$i block %$b conflicts with critical metadata, skipping block checks." - now it names actual blocks and inodes.
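The swap itself was nothing special, roughly the following - the device names are placeholders, the new disk got a different letter here:
Code: | # rough sketch of the disk replacement (placeholder device names)
mdadm --manage /dev/md2 --remove /dev/sdh   # drop the failed member
mdadm --manage /dev/md2 --add /dev/sdk      # add the replacement, rebuild starts
watch cat /proc/mdstat                      # follow the recovery |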
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu) |
Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Thu Jun 20, 2019 8:33 am Post subject: |
I was now able to fix the file system. Since e2fsck could fix all errors in ~2 hours, I assume that the degraded (but clean!) RAID6 was the reason for all the problems.
For me, one big question remains unanswered: how redundant is a RAID6 when the FS on it throws errors as long as one disk is missing?
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu) |
Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Fri Jun 21, 2019 4:58 am Post subject: |
...the problem is NOT solved.
I tried to simulate the problem and marked one of the disks in the array as faulty. Afterwards, I replaced it and rebuilt the array.
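(The "simulation" was simply the standard mdadm way of failing a member, something like this with a placeholder device name; the re-add and rebuild were done as described above:)
Code: | # mark a healthy member as faulty to simulate the failure (placeholder device)
mdadm --manage /dev/md2 --fail /dev/sdf
mdadm --manage /dev/md2 --remove /dev/sdf |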
Unfortunately, the ext4 errors occurred again:
kernel: EXT4-fs error (device dm-1): ext4_find_dest_de:1802: inode #3833864: block 61343924: comm nfsd: bad entry in directory: rec_len % 4 != 0 - offset=1000, inode=2620025549, rec_len=30675, name_len=223, size=4096
kernel: EXT4-fs error (device dm-1): ext4_lookup:1577: inode #172824586: comm tvh:tasklet: iget: bad extra_isize 13022 (inode size 256)
kernel: EXT4-fs error (device dm-1): htree_dirblock_to_tree:1010: inode #7372807: block 117967811: comm tar: bad entry in directory: rec_len % 4 != 0 - offset=104440, inode=1855122647, rec_len=12017, name_len=209, size=4096
...and so on.
Sorry if I repeat myself, but IMHO a degraded RAID6 should not lead to filesystem corruption.
dumpe2fs:
Code: | Filesystem volume name: <none>
Last mounted on: /home/share
Filesystem UUID: c5f0559d-e3bd-473f-abc0-7c42b3115897
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 366268416
Block count: 5860269056
Reserved block count: 0
Free blocks: 755383351
Free inodes: 363816793
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
RAID stride: 128
RAID stripe width: 1024
Flex block group size: 16
Filesystem created: Sat Mar 17 14:36:16 2018
Last mount time: Fri Jun 21 05:25:34 2019
Last write time: Fri Jun 21 05:30:27 2019
Mount count: 3
Maximum mount count: -1
Last checked: Thu Jun 20 08:55:17 2019
Check interval: 0 (<none>)
Lifetime writes: 139 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 4c37872e-3207-4ff4-8939-a428feaeb49f
Journal backup: inode blocks
FS Error count: 20776
First error time: Thu Jun 20 14:18:47 2019
First error function: ext4_lookup
First error line #: 1577
First error inode #: 172824586
First error block #: 0
Last error time: Fri Jun 21 05:53:24 2019
Last error function: ext4_lookup
Last error line #: 1577
Last error inode #: 172824586
Last error block #: 0 |
And "of course", e2fsck throws dozens of "Inode %$i block %$b conflicts with critical metadata, skipping block checks" lines. Seems like I am unable to fix the FS errors, again.
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu) |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Fri Jun 21, 2019 9:56 pm Post subject: |
Jimini,
Try Code: | mount -o ro,sb=131072 /dev/dm-1 /mnt/<someplace> |
131072 will be the first alternate superblock: the sb= option is given in 1 k units, so the backup at filesystem block 32768 on a 4 k-block filesystem becomes sb=131072.
By default mount uses the primary superblock.
I think you can pass alternate superblocks to fsck too but I've not needed to for a while.
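From memory it would look roughly like this; mke2fs -n is a dry run and should not write anything, but check the man pages before trusting my syntax. 32768 is where the first backup normally sits on a 4k-block filesystem:
Code: | # list where the backup superblocks should live (-n = dry run, writes nothing)
mke2fs -n -b 4096 /dev/dm-1
# point e2fsck at the first backup superblock (4k block size -> block 32768)
e2fsck -b 32768 -B 4096 /dev/dm-1 |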
What happens to a raid set when a component fails depends on the failure mode.
Lots of bits in computer systems these days have built-in test. However, it has its limitations.
There is an inherent flaw in the reasoning that a faulty item can detect that it is faulty.
In the face of most failure modes, raid works as intended.
Further, it's not clear that the faulty element is the cause of the filesystem corruption.
Correlation does not prove cause and effect. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Jimini l33t
Joined: 31 Oct 2006 Posts: 605 Location: Germany
Posted: Sat Jun 22, 2019 8:19 am Post subject: |
NeddySeagoon, thank you for your support - due to the misleading output of e2fsck I filed a bug report (https://bugzilla.kernel.org/show_bug.cgi?id=203943).
After setting $LANG to en_GB, e2fsck provided some helpful output, and I was able to clear the erroneous superblocks with debugfs. Afterwards, e2fsck fixed a huge number of errors.
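(For reference, forcing untranslated messages is just a matter of prefixing the locale variables, roughly like this:)
Code: | # run e2fsck with untranslated (English) messages
LC_ALL=C LANG=C e2fsck -f /dev/mapper/share |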
Although some data loss occurred, it fortunately only affects directories the system was writing to while the ext4 FS was corrupted: local backup data and TV recordings.
The system is in clean shape now, but I will have a detailed look at the logs and the monitoring over the next few weeks.
Kind regards,
Jimini _________________ "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu) |
MPW n00b
Joined: 07 Jun 2020 Posts: 1
Posted: Sun Jun 07, 2020 11:14 pm Post subject: |
Hello Jimini,
I have pretty much the same problem as you had. Could you explain in more detail what you did to fix the filesystem?
I can't get a full list of corrupted inodes, as badblocks -b 4096 doesn't run on my system for reasons I don't understand. I have an 11x 4TB RAID6 (36 TB net) with ext4.
In the syslog I saw FS errors, and I deleted the broken files with debugfs clri. But I still see lots of metadata conflicts, just like you had.
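(For clarity, what I ran was along these lines - the inode number and the device are placeholders:)
Code: | # clear one corrupted inode by number (placeholders!), then re-run e2fsck
debugfs -w -R "clri <1234567>" /dev/md0
e2fsck -f /dev/md0 |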
What I don't understand: what does this have to do with the superblock - do I need to work with the backup superblock as well? My RAID is still mountable, but I don't want to destroy anything.
Best,
Matthias |
fturco Veteran
Joined: 08 Dec 2010 Posts: 1181
Posted: Mon Jun 08, 2020 4:23 pm Post subject: |
@MPW: welcome to the Gentoo forums.
MPW wrote: | I can't get a full list of corrupted inodes, as badblocks -b 4096 doesn't run on my system for reasons I don't understand. I have an 11x 4TB RAID6 (36 TB net) with ext4. |
Did you get a specific error message from badblocks? |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Mon Jun 08, 2020 5:41 pm Post subject: |
MPW,
Why are you using badblocks?
In general, it's not useful on any HDD over 4 GB, as they do dynamic bad-block remapping.
If you suspect faulty sectors on your raid set and want to test, proceed as follows ...
Run Code: | smartctl -a /dev/... | on each drive and save the output. Post it here; it may already point to a problem drive.
Run the long test on each drive.
This is much faster than badblocks, as it's a single command to the drive and the drive tests itself, so they can all run at the same time.
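Starting them looks like this; the estimated duration is listed in the smartctl -a output, and the results land in the self-test log:
Code: | smartctl -t long /dev/sda      # kick off the long self-test (repeat per drive)
smartctl -l selftest /dev/sda  # later: check progress and results |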
Wait for the tests to complete, then run Code: | smartctl -a /dev/... | again.
Post the output again. Now we can compare before and after. The changes, not just the raw output, may be useful.
With 11 drives, put the results onto a pastebin as it will be too much for a post.
For each raid member block device, run Code: | mdadm -E /dev/... | and post the result. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |