View previous topic :: View next topic |
Author |
Message |
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Tue May 17, 2022 9:45 am Post subject: [Solved] Possible XFS bug in kernel 5.15.32-r1 - how to dbg? |
|
|
I was doing a benchmark of linux file systems and I'm running into some very strange behavior with XFS and kernel 5.15.32-r1. When I try to extract an uncompressed archive on this newly created (empty) XFS file system, after first low tens of gigabytes extracted the process hangs in iowait indefinitely. One CPU core is 100% occupied with iowait, the other CPU core is idle (on 2-core Intel Celeron G1610T).
No other filesystem tested before or after this (ext2, ext3, ext4, reiserfs3, jfs, nilfs2, f2fs, btrfs, zfs) on the same drive has shown this behavior. When I downgraded the kernel to 5.10.109, the XFS started working again. Kernel versions higher than 5.15 seem to be affected, I tried 5.17.1 and 5.17.6, they both hang up after a few minutes.
No error is reported to the system log or to dmesg when the process hangs. No error shows on stdout or stderr of the tar process either.
Next I want to address the elephant in the (other) room. This is not a SMR problem. None of the disks present in the test setup are SMR. All are CMR, and while they certainly are not brand new, they are all in good working condition.
I searched recent kernel bugs related to XFS and only found ones about mounting a corrupted XFS file system. Mine is not corrupted, it is in fact newly created.
Here are some screenshots when it hangs up: dstat, top
Is there any way I can be useful to the community? How can I trace the bug and where can I submit my findings?
UPDATE 2022-05-26: should be solved in future upstream kernel versions.
Last edited by araxon on Thu May 26, 2022 11:02 am; edited 1 time in total |
|
Back to top |
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Tue May 17, 2022 11:17 am Post subject: |
|
|
Which command do you run to extract the umcompressed archive?
What's the size of the uncompressed archive?
Do I understand correctly that the uncompressed archive is on another volume? And that your computer hangs while the extraction program writes files to the newly created XFS volume?
Which kernels do you use? Vanilla kernels? Or Gentoo kernels? |
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Tue May 17, 2022 11:42 am Post subject: |
|
|
mike155 wrote: | Which command do you run to extract the umcompressed archive? |
Code: | tar -x -f /var/backup/test.tar -C /mnt/test |
mike155 wrote: | What's the size of the uncompressed archive? |
The archive file size is 760.3 GiB. It hangs up when it has extracted somewhere between 15GiB and 32GiB (it varies from one run to another).
mike155 wrote: | Do I understand correctly that the uncompressed archive is on another volume? And that your computer hangs while the extraction program writes files to the newly created XFS volume? |
Yes, the archive is on SW RAID1 (/dev/sda3 and /dev/sdb3) formatted to ext4 mounted as /
while the XFS is on single disk (/dev/sdc1) mounted as /mnt/test
Yes, the XFS volume is pristine, newly created.
The extracting process hangs in iowait, but the computer remains perfectly usable, I can even read and write the XFS volume itself from another terminal window.
mike155 wrote: | Which kernels do you use? Vanilla kernels? Or Gentoo kernels? |
I'm using Gentoo kernels (sys-kernel/gentoo-sources). |
|
Back to top |
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Tue May 17, 2022 12:05 pm Post subject: |
|
|
Thank you for your advice. I will try vanilla kernel and then see where to go from there.
Is sys-kernel/vanilla-kernel sufficient, or should I build it from git to be sure? |
|
Back to top |
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Tue May 17, 2022 12:29 pm Post subject: |
|
|
I think it's a good idea to use a kernel from http://www.kernel.org - or, even better, to clone the stable kernel git repository from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Code: | mkdir /usr/src/linux-stable-git
cd /usr/src/linux-stable-git
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
git checkout linux-5.15.y |
Make sure that you're not chasing a bug that was introduced by Gentoo. |
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Tue May 17, 2022 3:05 pm Post subject: |
|
|
I compiled the latest kernel from the kernel.org (5.18.0-rc7), rebooted and tried the test again. Shortly after it is stuck in iowait, just 22GiB into the archive. So, the bug is not exclusive to Gentoo. |
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Thu May 19, 2022 8:54 am Post subject: |
|
|
I compiled a more mainstream version of sys-kernel/gentoo-sources-5.15.32-r1 (removed my .config file and let genkernel to fill it with default options) and lo and behold, in this kernel I could not make it go stuck anymore.
The only difference in XFS options were ("my" config on the left, the genkernel default config on the right):
Code: | CONFIG_XFS_FS=y | CONFIG_XFS_FS=m
# CONFIG_XFS_QUOTA is not set | CONFIG_XFS_QUOTA=y
# CONFIG_XFS_POSIX_ACL is not set | CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set | CONFIG_XFS_RT=y
# CONFIG_XFS_ONLINE_SCRUB is not set | CONFIG_XFS_ONLINE_SCRUB=y
> # CONFIG_XFS_ONLINE_REPAIR is not set |
However, after I altered my old kernel config to contain these values and rebooting, I'm still triggering the bug. It may not be a XFS issue after all. |
|
Back to top |
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
Posted: Thu May 19, 2022 12:09 pm Post subject: |
|
|
Certainly!
Code: | [10271.222656] sysrq: Show Blocked State
[10271.222747] task:xfsaild/sdc1 state:D stack: 0 pid: 7298 ppid: 2 flags:0x00004000
[10271.222754] Call Trace:
[10271.222756] <TASK>
[10271.222758] __schedule+0x1c9/0x510
[10271.222767] schedule+0x3f/0xa0
[10271.222771] schedule_timeout+0x7c/0xf0
[10271.222777] ? init_timer_key+0x30/0x30
[10271.222782] xfsaild+0xcd/0x640
[10271.222788] ? xfs_trans_ail_cursor_first+0x70/0x70
[10271.222792] kthread+0x122/0x140
[10271.222798] ? set_kthread_struct+0x30/0x30
[10271.222802] ret_from_fork+0x22/0x30
[10271.222806] </TASK>
[10271.222808] task:tar state:D stack: 0 pid: 7303 ppid: 7302 flags:0x00004000
[10271.222812] Call Trace:
[10271.222812] <TASK>
[10271.222814] __schedule+0x1c9/0x510
[10271.222818] ? lock_timer_base+0x5c/0x80
[10271.222822] schedule+0x3f/0xa0
[10271.222826] schedule_timeout+0x7c/0xf0
[10271.222830] ? init_timer_key+0x30/0x30
[10271.222834] io_schedule_timeout+0x47/0x70
[10271.222838] congestion_wait+0x79/0xd0
[10271.222844] ? wait_woken+0x60/0x60
[10271.222848] xfs_buf_alloc_pages+0xd0/0x1b0
[10271.222853] xfs_buf_get_map+0x259/0x300
[10271.222858] xfs_trans_get_buf_map+0xa9/0x120
[10271.222863] xfs_ialloc_inode_init+0x129/0x2d0
[10271.222867] ? xfs_ialloc_ag_alloc+0x1df/0x630
[10271.222870] xfs_ialloc_ag_alloc+0x1df/0x630
[10271.222874] xfs_dialloc+0x1b4/0x720
[10271.222878] xfs_create+0x1d7/0x450
[10271.222883] xfs_generic_create+0x114/0x2d0
[10271.222888] path_openat+0x510/0xe10
[10271.222892] do_filp_open+0xad/0x150
[10271.222895] ? xfs_blockgc_clear_iflag+0x93/0xb0
[10271.222898] ? xfs_iunlock+0x52/0x90
[10271.222903] do_sys_openat2+0x91/0x150
[10271.222908] __x64_sys_openat+0x4e/0x90
[10271.222912] do_syscall_64+0x43/0x90
[10271.222918] entry_SYSCALL_64_after_hwframe+0x44/0xae
[10271.222924] RIP: 0033:0x7fbe40f5a572
[10271.222928] RSP: 002b:00007ffc1ca6f2e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[10271.222931] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe40f5a572
[10271.222934] RDX: 00000000000809c1 RSI: 00005606893c4270 RDI: 0000000000000004
[10271.222936] RBP: 0000000000000180 R08: 000000000000c0c0 R09: 00005606893bf5f0
[10271.222938] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000000
[10271.222940] R13: 00000000000809c1 R14: 00005606893c4270 R15: 00005606881e4248
[10271.222942] </TASK> |
There is a slight variance when I run your command again after a while:
Code: | [10504.513978] sysrq: Show Blocked State
[10504.514069] task:xfsaild/sdc1 state:D stack: 0 pid: 7298 ppid: 2 flags:0x00004000
[10504.514076] Call Trace:
[10504.514078] <TASK>
[10504.514080] __schedule+0x1c9/0x510
[10504.514089] schedule+0x3f/0xa0
[10504.514093] schedule_timeout+0x7c/0xf0
[10504.514098] ? init_timer_key+0x30/0x30
[10504.514103] xfsaild+0xcd/0x640
[10504.514109] ? xfs_trans_ail_cursor_first+0x70/0x70
[10504.514114] kthread+0x122/0x140
[10504.514119] ? set_kthread_struct+0x30/0x30
[10504.514124] ret_from_fork+0x22/0x30
[10504.514128] </TASK>
[10504.514129] task:tar state:D stack: 0 pid: 7303 ppid: 7302 flags:0x00004000
[10504.514133] Call Trace:
[10504.514134] <TASK>
[10504.514135] __schedule+0x1c9/0x510
[10504.514138] ? lock_timer_base+0x5c/0x80
[10504.514142] schedule+0x3f/0xa0
[10504.514146] schedule_timeout+0x7c/0xf0
[10504.514150] ? init_timer_key+0x30/0x30
[10504.514154] io_schedule_timeout+0x47/0x70
[10504.514157] congestion_wait+0x79/0xd0
[10504.514164] ? wait_woken+0x60/0x60
[10504.514168] xfs_buf_alloc_pages+0xd0/0x1b0
[10504.514173] xfs_buf_get_map+0x259/0x300
[10504.514177] ? xfs_buf_item_init+0x150/0x160 <==== this line is new
[10504.514184] xfs_trans_get_buf_map+0xa9/0x120
[10504.514189] xfs_ialloc_inode_init+0x129/0x2d0
[10504.514193] ? xfs_ialloc_ag_alloc+0x1df/0x630
[10504.514196] xfs_ialloc_ag_alloc+0x1df/0x630
[10504.514200] xfs_dialloc+0x1b4/0x720
[10504.514203] xfs_create+0x1d7/0x450
[10504.514208] xfs_generic_create+0x114/0x2d0
[10504.514213] path_openat+0x510/0xe10
[10504.514217] do_filp_open+0xad/0x150
[10504.514220] ? xfs_blockgc_clear_iflag+0x93/0xb0
[10504.514224] ? xfs_iunlock+0x52/0x90
[10504.514229] do_sys_openat2+0x91/0x150
[10504.514234] __x64_sys_openat+0x4e/0x90
[10504.514238] do_syscall_64+0x43/0x90
[10504.514244] entry_SYSCALL_64_after_hwframe+0x44/0xae
[10504.514251] RIP: 0033:0x7fbe40f5a572
[10504.514254] RSP: 002b:00007ffc1ca6f2e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[10504.514258] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe40f5a572
[10504.514260] RDX: 00000000000809c1 RSI: 00005606893c4270 RDI: 0000000000000004
[10504.514262] RBP: 0000000000000180 R08: 000000000000c0c0 R09: 00005606893bf5f0
[10504.514264] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000000
[10504.514266] R13: 00000000000809c1 R14: 00005606893c4270 R15: 00005606881e4248
[10504.514269] </TASK> |
|
|
Back to top |
|
|
araxon Tux's lil' helper
Joined: 25 May 2011 Posts: 85
|
|
Back to top |
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
|
Back to top |
|
|
|