BitJam Advocate
Joined: 12 Aug 2003 Posts: 2513 Location: Silver City, NM
|
Posted: Wed Oct 24, 2012 5:00 pm Post subject: [SOLVED!]Ext-4 Data Corruption Bug Hits Stable Linux Kernels |
|
|
link Quote: | As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series. |
Forum member szczerb posted this news in a thread but I think it deserves a thread of its own.
TL;DR: In recent kernels ext-4 journal playback can in some cases bork your file system.
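A quick way to see whether a machine is running one of the series named in the quote is to compare `uname -r` against them. This is only a minimal sketch: the function name `in_reported_series` is made up for illustration, and it matches the 3.4, 3.5 and 3.6 series as a whole, not the specific point releases that carried the problematic patch.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the given kernel release string belongs to
# one of the series named in the report (3.4, 3.5 or 3.6).
# This is an assumption-level check on the series only; it does not know
# which point releases were actually affected or later fixed.
in_reported_series() {
    case "$1" in
        3.[456]|3.[456].*|3.[456]-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the running kernel.
if in_reported_series "$(uname -r)"; then
    echo "kernel $(uname -r) is in a reported series; look for a patched release"
else
    echo "kernel $(uname -r) is not in the 3.4/3.5/3.6 series"
fi
```

A match only means the kernel is worth a closer look; it says nothing about whether the corruption can actually be triggered on that system.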
Edit: fixed
Last edited by BitJam on Wed Oct 31, 2012 6:18 pm; edited 1 time in total |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Wed Oct 24, 2012 6:05 pm Post subject: |
|
|
BitJam ...
Note that you'll only get hit if the journal hasn't been wrapped, so give the journal something to work on or don't reboot so often ;) ... hehe.
I applied Ted's patch to 3.6.3 earlier today but haven't rebooted yet. Anyhow, as I've been running an affected kernel (3.6.2) for a week or so without issues, I'm not inclined to panic.
best ... khay |
|
|
|
e3k Guru
Joined: 01 Oct 2007 Posts: 515 Location: Quantum Flux
|
|
|
|
szczerb Veteran
Joined: 24 Feb 2007 Posts: 1709 Location: Poland => Lodz
|
Posted: Wed Oct 24, 2012 9:17 pm Post subject: |
|
|
This comment https://bugs.gentoo.org/show_bug.cgi?id=439502#c0 seems to suggest that 3.5.x < 3.5.7 should be safe. I just booted the 3.5.7 at work yesterday so I'm waiting things out without rebooting or shutting down for now. Patches seem to be flowing around fast.
EDIT: BitJam, you're right - I should've made it a separate thread. I was rather swarmed at work, so didn't think of it. |
|
|
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 23070
|
Posted: Wed Oct 24, 2012 10:23 pm Post subject: |
|
|
Linux Weekly News has a free to read article following this. The situation is evolving. Ted now believes that journal wrapping may not be involved. Additionally, nix has now stated that the affected system has some rather unusual shutdown behavior that may cause it to halt without all filesystems finishing their unmount. If that is what happened to him and if the corruption occurs only on multiple journal replays, then standard systems that gracefully unmount (or remount readonly) all their filesystems are at much lower risk than suggested by early reports. However, those are substantial qualifiers and there is insufficient evidence to determine whether they are met in the reported cases. |
|
|
|
jimmij Tux's lil' helper
Joined: 02 Dec 2008 Posts: 143
|
Posted: Thu Oct 25, 2012 6:33 am Post subject: |
|
|
Can someone advise me on the safest way to switch/downgrade to 3.3.8 if I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot? _________________ Vanitas vanitatum et omnia vanitas.
Libera temet ex inferis. |
|
|
|
szczerb Veteran
Joined: 24 Feb 2007 Posts: 1709 Location: Poland => Lodz
|
Posted: Thu Oct 25, 2012 9:41 am Post subject: |
|
|
jimmij wrote: | Can someone advise me on the safest way to switch/downgrade to 3.3.8 if I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot? | I'm doing just that - waiting with my system on. |
|
|
|
depontius Advocate
Joined: 05 May 2004 Posts: 3526
|
Posted: Thu Oct 25, 2012 12:48 pm Post subject: |
|
|
So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes? _________________ .sigs waste space and bandwidth |
|
|
|
NoDataFound n00b
Joined: 01 Aug 2011 Posts: 34
|
Posted: Thu Oct 25, 2012 2:14 pm Post subject: |
|
|
I'd like to know what kind of corruption it produces.
Having a bug is bad in itself, although not the end of the world, but it's better if it's recoverable... |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Thu Oct 25, 2012 3:20 pm Post subject: |
|
|
depontius wrote: | So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes? |
depontius ... the situation seems to have moved on (as Hu noted above); it's no longer thought to be related to wrapping.
Note: "Update: It now looks like the reproduction involved something very esoteric indeed, involving using umount -l and shutdowns while the file system was still being unmounted --- and the user had nobarrier specified in the mount options as well." Ted Ts'o
So, I don't think there is much reason to panic. If this weren't a corner case there would be hundreds of reports of data loss, and the actual reported cases so far are few.
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.
best ... khay |
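Since Ted's update singles out the nobarrier mount option, one way to check whether a box is even close to that configuration is to look for ext4 filesystems mounted without barriers. The following is a sketch under stated assumptions: `ext4_without_barriers` is a made-up name, and it assumes the disabled-barrier state shows up in the mount options as either `nobarrier` or `barrier=0`.

```sh
#!/bin/sh
# Hypothetical helper: print the mount points of ext4 filesystems that are
# mounted with write barriers disabled.
# Input is a file in /proc/mounts format:
#   device mountpoint fstype options dump pass
# It defaults to /proc/mounts when no argument is given.
ext4_without_barriers() {
    awk '$3 == "ext4" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "nobarrier" || opts[i] == "barrier=0") {
                print $2
                break
            }
    }' "${1:-/proc/mounts}"
}

# Example: inspect the running system, if /proc/mounts is available.
if [ -r /proc/mounts ]; then
    ext4_without_barriers
fi
```

No output means no ext4 filesystem in the list has barriers turned off, which is the common case.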
|
|
|
leifbk Guru
Joined: 05 Jan 2004 Posts: 425 Location: Bærum, Norway
|
Posted: Thu Oct 25, 2012 4:18 pm Post subject: |
|
|
khayyam wrote: |
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane ... but for the rest of us it's best not to blow this out of proportion.
best ... khay |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?
I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable. _________________ Grumpy old man |
|
|
|
bandreabis Advocate
Joined: 18 Feb 2005 Posts: 2495 Location: イタリアのロディで
|
Posted: Thu Oct 25, 2012 4:39 pm Post subject: |
|
|
I can't see any visible difference between 3.3.8 and 3.5.7 (freshly compiled) so I remain with the "not hard masked" one. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Thu Oct 25, 2012 6:02 pm Post subject: |
|
|
leifbk wrote: | khayyam wrote: | For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion. |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance? |
leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jollarity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is that it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?
leifbk wrote: | I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable. |
A serious bug in the Linux kernel has caused users to believe that there is a serious bug in the Linux kernel. In a post made to the LKML, Linus Torvalds stated "we're not really sure if this is a bug or not, but we can assure everyone we're reading all of the hullabaloo on Slashdot and we'll know more as and when news hits critical mass". The bug, code named "worse than Y2K, Stuxnet, and Windows 98 combined (WTY2KSTUXNET&W98)", is thought to affect at least three users, and more than ten million blogs and news sites. Users, who until recently had thought that the designation "stable" was an acronym for "no need for backups any mo", are lining up to throw themselves under the wheels of this runaway train. As one commentator noted, "it's worse than Fukushima Daiichi and that other thing ... didn't you read my blog post?" :)
best of the bwaaaaa ... khay |
|
|
|
John R. Graham Administrator
Joined: 08 Mar 2005 Posts: 10727 Location: Somewhere over Atlanta, Georgia
|
Posted: Thu Oct 25, 2012 6:33 pm Post subject: |
|
|
Goodness. That's more sarcastic than me!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Thu Oct 25, 2012 7:07 pm Post subject: |
|
|
John R. Graham wrote: | Goodness. That's more sarcastic than me! |
John ... the intention was to deflate the rise in panic with some humor. It seems that this serious bug, though no doubt an annoyance to those hit, is most likely a corner case, and so all the "hullabaloo" needs to step down a gear or three. It's already been said that this is reflecting badly on ext4, and some of the reporting has been out of proportion to the actual severity, so I guess my sarcasm reflects this.
best ... khay |
|
|
|
John R. Graham Administrator
Joined: 08 Mar 2005 Posts: 10727 Location: Somewhere over Atlanta, Georgia
|
Posted: Thu Oct 25, 2012 7:24 pm Post subject: |
|
|
Never explain sarcasm; it just ruins it.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
|
|
energyman76b Advocate
Joined: 26 Mar 2003 Posts: 2048 Location: Germany
|
Posted: Thu Oct 25, 2012 9:21 pm Post subject: |
|
|
short: don't do anything stupid and you won't hit the bug.
It is really that simple. Phoronix in the meantime is working hard to earn that Moronix moniker. _________________ Study finds stunning lack of racial, gender, and economic diversity among middle-class white males
I identify as a dirty penismensch. |
|
|
|
Jaglover Watchman
Joined: 29 May 2005 Posts: 8291 Location: Saint Amant, Acadiana
|
|
|
|
leifbk Guru
Joined: 05 Jan 2004 Posts: 425 Location: Bærum, Norway
|
Posted: Thu Oct 25, 2012 9:34 pm Post subject: |
|
|
khayyam wrote: | leifbk wrote: | khayyam wrote: | For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane ... but for the rest of us it's best not to blow this out of proportion. |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance? |
leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jollarity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is that it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?
|
Not quite, but I got carried away by the implications. BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now. _________________ Grumpy old man |
|
|
|
Jaglover Watchman
Joined: 29 May 2005 Posts: 8291 Location: Saint Amant, Acadiana
|
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Fri Oct 26, 2012 1:25 am Post subject: |
|
|
leifbk wrote: | BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now. |
leifbk ... what? and miss the opportunity to discover another bug, no ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?
best ... khay |
|
|
|
anyNiXwilldo Apprentice
Joined: 20 Feb 2004 Posts: 176 Location: US
|
Posted: Fri Oct 26, 2012 2:26 am Post subject: |
|
|
Well I hadn't rebooted in several days, but I noticed this morning 3.6.2 was masked. I knew why, from yesterday's articles. The info making the rounds today was saying it's a rather esoteric (hard to reproduce) bug, which probably meant I had nothing to worry about. However, given I run almost 100% stable, except for things like qpdfview, nomacs and the kernel, I felt it best to back the kernel back down to stable from ~amd64. I umounted my data partition after building 3.5.4-hardened-r1-gnu, prior to rebooting with that kernel. Everything seems to be fine. I just know I don't have the nerves to deal with these newer kernels and whatever very scary bugs they might have. _________________ Of course you can have my root password. I'm on Hardened! |
|
|
|
platojones Veteran
Joined: 23 Oct 2002 Posts: 1602 Location: Just over the horizon
|
Posted: Fri Oct 26, 2012 2:30 am Post subject: |
|
|
Reading the latest updates at the thread below, and considering the fact that this is showing up on only 2 machines that anybody knows of so far (nobody has been able to independently reproduce it yet), I'd say it's looking very anti-climactic:
http://thread.gmane.org/gmane.linux.kernel/1379725/focus=1381772 |
|
|
|
leifbk Guru
Joined: 05 Jan 2004 Posts: 425 Location: Bærum, Norway
|
Posted: Fri Oct 26, 2012 5:22 am Post subject: |
|
|
khayyam wrote: | ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities? |
We love to burn cheap Swedish furniture
We still haven't forgiven the Swedes for Karl XII, who was shot through the head during his Norwegian campaign in 1718. Nobody knows for certain if the bullet was Norwegian or Swedish, but we love to claim the credit. This tends to make the Swedes irate. _________________ Grumpy old man |
|
|
|
ulenrich Veteran
Joined: 10 Oct 2010 Posts: 1483
|
Posted: Fri Oct 26, 2012 11:44 am Post subject: |
|
|
platojones wrote: | Reading the latest updates at the thread below, and considering the fact that this is showing up on only 2 machines that anybody knows of so far (nobody has been able to independently reproduce it yet), |
Yes, but it was not hardware related; it was the setup:
cascading mounts of mixed ext4 and network devices, where it was forcibly configured to allow very fast reboots: lazy "umount -l" was used so as not to wait for net devices. And a local ext4 partition was mounted on top of a net mount??? And the "nobarrier" mount option??
In this special case, _and_ if additionally some crash induced reboots, then
there was data loss after the second reboot!
A clean bit was set when there hadn't been a journal cleanup yet (writeback?). A workaround for this setup would have been forcefsck in the boot cmdline. This would have played to the strengths of a journaled filesystem: the missing data would have been written back. But with the additional forcefsck the system wouldn't boot up quickly ...
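To see whether a boot already carries that workaround, the kernel command line can be checked for the parameter. This is a small sketch only: `cmdline_has_forcefsck` is a hypothetical name, and whether the init scripts actually act on a `forcefsck` parameter depends on the distribution and init system, which is assumed here rather than verified.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the word "forcefsck" appears as a separate
# parameter in the given kernel command line string.
# It only detects the parameter; it does not run or schedule any fsck.
cmdline_has_forcefsck() {
    # Unquoted on purpose so the command line is split on whitespace.
    for param in $1; do
        [ "$param" = "forcefsck" ] && return 0
    done
    return 1
}

# Example: inspect the command line of the running kernel.
if [ -r /proc/cmdline ] && cmdline_has_forcefsck "$(cat /proc/cmdline)"; then
    echo "forcefsck is present on the kernel command line"
else
    echo "forcefsck is not present on the kernel command line"
fi
```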
Not a very commonly used setup.
This is why Greg Kroah-Hartman doesn't quickly spin a release to fix the issue. At first all of us who follow the stable patchlevel releases felt a panic attack because we knew there had been an ext4 feature backport for linux-3.6.2. But the jbd2 patch which obviously caused the data loss would have been applied anyway: it was (thought to be) a fix. Greg Kroah-Hartman should serialize such feature backports to reduce our psycho panics.
[edit]Don't take the last sentence as a serious suggestion, but as a tool to self audit (for me at least).
[edit2]Because Greg does it already when possible.
Last edited by ulenrich on Fri Oct 26, 2012 4:19 pm; edited 2 times in total |
|
|
|
|