Message ID | 1394150467-5990-1-git-send-email-jbacik@fb.com (mailing list archive) |
---|---|
State | Accepted |
On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
> Zach found this deadlock that would happen like this
>
> btrfs_end_transaction <- reduce trans->use_count to 0
>   btrfs_run_delayed_refs
>     btrfs_cow_block
>       find_free_extent
>         btrfs_start_transaction <- increase trans->use_count to 1
>         allocate chunk
>         btrfs_end_transaction <- decrease trans->use_count to 0
>           btrfs_run_delayed_refs
>             lock tree block we are cowing above ^^

Indeed, I stumbled across this while trying to reproduce reported
problems with iozone.  This deadlock would consistently hit during
random 1k reads in a 2gig file.

> We need to only decrease trans->use_count if it is above 1, otherwise leave it
> alone. This will make nested trans be the only ones who decrease their added
> ref, and will let us get rid of the trans->use_count++ hack if we have to commit
> the transaction. Thanks,

And this fixes it.  It's run through a few times successfully.

> cc: stable@vger.kernel.org
> Reported-by: Zach Brown <zab@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Tested-by: Zach Brown <zab@redhat.com>

- z
On Thu, Mar 6, 2014 at 7:25 PM, Zach Brown <zab@redhat.com> wrote:
> On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
>> Zach found this deadlock that would happen like this
>>
>
> And this fixes it.  It's run through a few times successfully.

I'm not sure if my issue is related to this or not - happy to start a
new thread if not.  I applied this patch as I was running into locks,
but I am still having them.

See: http://picpaste.com/IMG_20140312_072458-KPH35pQ6.jpg

After a number of reboots the system became stable, presumably whatever
race condition btrfs was hitting followed a favorable path.

I do have a 2GB btrfs-image pre-dating my application of this patch
that was causing the issue last week.

Rich
On 03/12/2014 08:56 AM, Rich Freeman wrote:
> On Thu, Mar 6, 2014 at 7:25 PM, Zach Brown <zab@redhat.com> wrote:
>> On Thu, Mar 06, 2014 at 07:01:07PM -0500, Josef Bacik wrote:
>>> Zach found this deadlock that would happen like this
>>>
>>
>> And this fixes it.  It's run through a few times successfully.
>
> I'm not sure if my issue is related to this or not - happy to start
> a new thread if not.  I applied this patch as I was running into
> locks, but I am still having them.
>
> See: http://picpaste.com/IMG_20140312_072458-KPH35pQ6.jpg
>
> After a number of reboots the system became stable, presumably
> whatever race condition btrfs was hitting followed a favorable
> path.
>
> I do have a 2GB btrfs-image pre-dating my application of this
> patch that was causing the issue last week.
>

Uhm wow that's pretty epic.  I will talk to chris and figure out how
we want to deal with that and send you a patch shortly.  Thanks,

Josef
On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik <jbacik@fb.com> wrote:
> On 03/12/2014 08:56 AM, Rich Freeman wrote:
>>
>> After a number of reboots the system became stable, presumably
>> whatever race condition btrfs was hitting followed a favorable
>> path.
>>
>> I do have a 2GB btrfs-image pre-dating my application of this
>> patch that was causing the issue last week.
>>
>
> Uhm wow that's pretty epic.  I will talk to chris and figure out how
> we want to deal with that and send you a patch shortly.  Thanks,

If you need any info from me at all beyond the capture let me know.

A tiny bit more background.  The system would boot normally, but panic
after about 30-90 seconds (usually long enough to log into KDE, perhaps
even fire up a browser/etc).  In single-user mode I could mount the
filesystem read-only without issue.  If I mounted it read-write (in
recovery mode or normally) I'd get the panic after about 30-60 seconds.
On one occasion it seemed stable, but panicked when I unmounted it.

I have to say that I'm impressed that it recovers at all.  I'd rather
have the file system not write anything if it isn't sure it can write
it correctly, and that seems to be the effect here.  Just about all the
issues I've run into with btrfs have tended to be lockup/etc type
issues, and not silent corruption.

Rich
On Wed, Mar 12, 2014 at 12:34 PM, Rich Freeman
<r-btrfs@thefreemanclan.net> wrote:
> On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik <jbacik@fb.com> wrote:
>> On 03/12/2014 08:56 AM, Rich Freeman wrote:
>>>
>>> After a number of reboots the system became stable, presumably
>>> whatever race condition btrfs was hitting followed a favorable
>>> path.
>>>
>>> I do have a 2GB btrfs-image pre-dating my application of this
>>> patch that was causing the issue last week.
>>>
>>
>> Uhm wow that's pretty epic.  I will talk to chris and figure out how
>> we want to deal with that and send you a patch shortly.  Thanks,
>
> A tiny bit more background.

And some more background.  I had more reboots over the next two days at
the same time each day, just after my crontab successfully completed.
One of the last things it does is run the snapper cleanups, which
delete a bunch of snapshots.  During a reboot I checked and there were
a bunch of deleted snapshots, which disappeared over the next 30-60
seconds before the panic, and then they would re-appear on the next
reboot.

I disabled the snapper cron job and this morning had no issues at all.
One day isn't much to establish a trend, but I suspect that this is the
cause.  Obviously getting rid of snapshots would be desirable at some
point, but I can wait for a patch.  Snapper would be deleting about 48
snapshots at the same time, since I create them hourly and the cleanup
occurs daily on two different subvolumes on the same filesystem.

Rich
Rich Freeman posted on Fri, 14 Mar 2014 18:40:25 -0400 as excerpted:

> And some more background.  I had more reboots over the next two days at
> the same time each day, just after my crontab successfully completed.
> One of the last things it does is run the snapper cleanups, which delete
> a bunch of snapshots.  During a reboot I checked and there were a bunch
> of deleted snapshots, which disappeared over the next 30-60 seconds
> before the panic, and then they would re-appear on the next reboot.
>
> I disabled the snapper cron job and this morning had no issues at all.
> One day isn't much to establish a trend, but I suspect that this is
> the cause.  Obviously getting rid of snapshots would be desirable at
> some point, but I can wait for a patch.  Snapper would be deleting about
> 48 snapshots at the same time, since I create them hourly and the
> cleanup occurs daily on two different subvolumes on the same filesystem.

Hi, Rich.  Imagine seeing you here! =:^)

(Note to others, I run gentoo and he's a gentoo dev, so we normally see
each other on the gentoo lists.  But btrfs comes up occasionally there
too, so we knew we were both running it, I'd just not noticed any of his
posts here, previously.)

Three things:

1) Does running the snapper cleanup command from that cron job manually
trigger the problem as well?  Presumably if you run it manually, you'll
do so at a different time of day, thus eliminating the possibility that
it's a combination of that and something else occurring at that specific
time, as well as confirming that it is indeed the snapper cleanup itself.

2) What about modifying the cron job to run hourly, or perhaps every six
hours, so it's deleting only 2 or 12 instead of 48 at a time?  Does that
help?

If so then it's a thundering herd problem.  While definitely still a bug,
you'll at least have a workaround until it's fixed.

3) I'd be wary of letting too many snapshots build up.  A couple hundred
shouldn't be a huge issue, but particularly when the snapshot-aware-
defrag was still enabled, people were reporting problems with thousands
of snapshots, so I'd recommend trying to keep it under 500 or so, at
least of the same subvol (so under 1000 total since you're snapshotting
two different subvols).

So an hourly cron job deleting or at least thinning down snapshots over
say 2 days old, possibly in the same cron job that creates the new snaps,
might be a good idea.  That'd only do two at a time, the same rate
they're created, but with a 48 hour set of snaps before deletion.
On 03/14/2014 06:40 PM, Rich Freeman wrote:
> On Wed, Mar 12, 2014 at 12:34 PM, Rich Freeman
> <r-btrfs@thefreemanclan.net> wrote:
>> On Wed, Mar 12, 2014 at 11:24 AM, Josef Bacik <jbacik@fb.com> wrote:
>>> On 03/12/2014 08:56 AM, Rich Freeman wrote:
>>>>
>>>> After a number of reboots the system became stable, presumably
>>>> whatever race condition btrfs was hitting followed a favorable
>>>> path.
>>>>
>>>> I do have a 2GB btrfs-image pre-dating my application of this
>>>> patch that was causing the issue last week.
>>>>
>>>
>>> Uhm wow that's pretty epic.  I will talk to chris and figure out how
>>> we want to deal with that and send you a patch shortly.  Thanks,
>>
>> A tiny bit more background.
>
> And some more background.  I had more reboots over the next two days
> at the same time each day, just after my crontab successfully
> completed.  One of the last things it does is run the snapper cleanups,
> which delete a bunch of snapshots.  During a reboot I checked and
> there were a bunch of deleted snapshots, which disappeared over the
> next 30-60 seconds before the panic, and then they would re-appear on
> the next reboot.
>
> I disabled the snapper cron job and this morning had no issues at all.
> One day isn't much to establish a trend, but I suspect that this is
> the cause.  Obviously getting rid of snapshots would be desirable at
> some point, but I can wait for a patch.  Snapper would be deleting
> about 48 snapshots at the same time, since I create them hourly and
> the cleanup occurs daily on two different subvolumes on the same
> filesystem.

Ok that's helpful, I'm no longer positive I know what's causing this,
I'll try to reproduce once I've nailed down these backref problems and
balance corruption.  Thanks,

Josef
On Sat, Mar 15, 2014 at 7:51 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 1) Does running the snapper cleanup command from that cron job manually
> trigger the problem as well?

As you can imagine I'm not too keen to trigger this often.  But yes, I
just gave it a shot on my SSD and cleaning a few days of timelines
triggered a panic.

> 2) What about modifying the cron job to run hourly, or perhaps every six
> hours, so it's deleting only 2 or 12 instead of 48 at a time?  Does that
> help?
>
> If so then it's a thundering herd problem.  While definitely still a bug,
> you'll at least have a workaround until it's fixed.

Definitely looks like a thundering herd problem.  I stopped the cron
jobs (including the creation of snapshots based on your later warning).
However, I am deleting my snapshots one at a time at a rate of one every
5-30 minutes, and while that is creating surprisingly high disk loads on
my ssd and hard drives, I don't get any panics.  I figured that having
only one deletion pending per checkpoint would eliminate locking risk.

I did get some blocked task messages in dmesg, like:

[105538.121239] INFO: task mysqld:3006 blocked for more than 120 seconds.
[105538.121251]       Not tainted 3.13.6-gentoo #1
[105538.121256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[105538.121262] mysqld          D ffff880395f63e80  3432  3006      1 0x00000000
[105538.121273]  ffff88028b623d38 0000000000000086 ffff88028b623dc8 ffffffff81c10440
[105538.121283]  0000000000000200 ffff88028b623fd8 ffff880395f63b80 0000000000012c40
[105538.121291]  0000000000012c40 ffff880395f63b80 00000000532b7877 ffff880410e7e578
[105538.121299] Call Trace:
[105538.121316]  [<ffffffff81623d73>] schedule+0x6a/0x6c
[105538.121327]  [<ffffffff81623f52>] schedule_preempt_disabled+0x9/0xb
[105538.121337]  [<ffffffff816251af>] __mutex_lock_slowpath+0x155/0x1af
[105538.121347]  [<ffffffff812b9db0>] ? radix_tree_tag_set+0x71/0xd4
[105538.121356]  [<ffffffff81625225>] mutex_lock+0x1c/0x2e
[105538.121365]  [<ffffffff8123c168>] btrfs_log_inode_parent+0x161/0x308
[105538.121373]  [<ffffffff8162466d>] ? mutex_unlock+0x11/0x13
[105538.121382]  [<ffffffff8123cd37>] btrfs_log_dentry_safe+0x39/0x52
[105538.121390]  [<ffffffff8121a0c9>] btrfs_sync_file+0x1bc/0x280
[105538.121401]  [<ffffffff811339a3>] vfs_fsync_range+0x13/0x1d
[105538.121409]  [<ffffffff811339c4>] vfs_fsync+0x17/0x19
[105538.121416]  [<ffffffff81133c3c>] do_fsync+0x30/0x55
[105538.121423]  [<ffffffff81133e40>] SyS_fsync+0xb/0xf
[105538.121432]  [<ffffffff8162c2e2>] system_call_fastpath+0x16/0x1b

I suspect that this may not be terribly helpful - it probably reflects
tasks waiting for a lock rather than whatever is holding it.  It was
more of a problem when I was trying to delete a snapshot per minute on
my ssd, or one every 5 min on hdd.

Rich
Rich Freeman posted on Thu, 20 Mar 2014 22:13:51 -0400 as excerpted:

> However, I am deleting my snapshots one at a time at a rate of one every
> 5-30 minutes, and while that is creating surprisingly high disk loads on
> my ssd and hard drives, I don't get any panics.  I figured that having
> only one deletion pending per checkpoint would eliminate locking risk.
>
> I did get some blocked task messages in dmesg, like:
> [105538.121239] INFO: task mysqld:3006 blocked for more than 120
> seconds.

These... are a continuing issue.  The devs are working on it, but...

The people that seem to have it the worst are doing both scripted
snapshotting and large (gig+) constantly internal-rewritten files such
as VM images (the most commonly reported case) or databases.  Properly
setting NOCOW on the files[1] helps, but...

* The key thing to realize about snapshotting continually rewritten
NOCOW files is that the first change to a block after a snapshot by
definition MUST be COWed anyway, since the file content has changed from
that of the snapshot.  Further writes to the same block (until the next
snapshot) will be rewritten in-place (the existing NOCOW attribute is
maintained thru that mandatory COW), but next snapshot and following
write, BAM! gotta COW again!

So while NOCOW helps, in scenarios such as hourly snapshotting of active
VM-image data loads its ability to control actual fragmentation is
unfortunately rather limited.  And it's precisely this fragmentation
that appears to be the problem! =:^(

It's almost certainly that fragmentation that's triggering your blocked
for X seconds issues.  But the interesting thing here is the reports
even from people with fast SSDs where seek-time and even IOPS shouldn't
be a huge issue.  In at least some cases, the problem has been CPU time,
not physical media access.  Which is one reason the snapshot-aware-
defrag was disabled again recently, because it simply wasn't scaling.
(To answer the question, yes, defrag still works; it's only the
snapshot-awareness that was disabled.  Defrag is back to dumbly ignoring
other snapshots and simply defragging the working file-extent-mapping
the defrag is being run on, with other snapshots staying untouched.)

They're reworking the whole feature now in order to scale better.  But
while that considerably reduces the pain point (people were seeing
little or no defrag/balance/restripe progress in /hours/ if they had
enough snapshots, and that problem has been bypassed for the moment),
we're still left with these nasty N-second stalls at times, especially
when doing anything else involving those snapshots and the corresponding
fragmentation they cover, including deleting them.

Hopefully tweaking the algorithms and eventually optimizing can do away
with much of this problem eventually, but I've a feeling it'll be around
to some degree for some years.  Meanwhile, for data that fits that known
problematic profile, the current recommendation is, preferably, to
isolate it to a subvolume that has only very limited or no snapshotting
done.
The other alternative, of course, since NOCOW already turns off many of
the features a lot of people are using btrfs for in the first place
(checksumming and compression are disabled with NOCOW as well, tho it
turns out they're not so well suited to VM images in the first place),
is that given the subvolume isolation already, just stick it on an
entirely different filesystem, either btrfs with the nocow mount option,
or arguably something a bit more traditional and mature such as ext4 or
xfs, where xfs of course is actually targeted at large to huge file
use-cases so multi-gig VMs should be an ideal fit.

Of course you lose the benefits of btrfs doing that, but given its COW
nature, btrfs arguably isn't the ideal solution for such huge
internal-write files in the first place, and even when fully mature will
likely only have /acceptable/ performance with them as suitable for use
as a general purpose filesystem, with xfs or similar still likely being
a better dedicated filesystem for such use-cases.

Meanwhile, I think everyone agrees that getting that locking down to
avoid the deadlocks, etc, really must be priority one, at least now that
the huge scaling blocker of snapshot-aware-defrag is (hopefully
temporarily) disabled.  Blocking for a couple minutes at a time
certainly isn't ideal, but since the triggering jobs such as snapshot
deletion, etc, can be rescheduled to otherwise idle time, that's
certainly less critical than crashes if people accidentally or in
ignorance queue up too many snapshot deletions at a time!

---
[1] NOCOW: chattr +C .  With btrfs, this should be done while the file
is zero-size, before it has content.  The easiest way to do that is to
create a dedicated directory for these files and set the attribute on
the directory, such that the files inherit it at file creation.
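For readers curious what chattr +C actually does at the ioctl level: it
sets the FS_NOCOW_FL inode flag via FS_IOC_SETFLAGS.  Below is a minimal
userspace sketch (an assumption-laden illustration, not part of the
thread) of doing the same thing on a directory, so that files created
inside it inherit NOCOW, as the footnote recommends.  It assumes
FS_NOCOW_FL is exposed by your <linux/fs.h>, and error handling is kept
deliberately terse.

```c
/*
 * Sketch of "chattr +C <dir>": set the NOCOW attribute on a directory
 * so new files created inside it inherit it.  Illustration only;
 * assumes FS_NOCOW_FL is available in <linux/fs.h>.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, flags;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror("FS_IOC_GETFLAGS");
		return 1;
	}

	flags |= FS_NOCOW_FL;	/* the same bit chattr +C sets */

	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		perror("FS_IOC_SETFLAGS");
		return 1;
	}

	close(fd);
	return 0;
}
```

In practice plain chattr +C on the directory is all that is needed; the
point of the sketch is only that NOCOW is an ordinary inheritable inode
flag, which is why setting it before any data is written matters.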
Hi Josef,
this problem could not happen when find_free_extent() was receiving a
transaction handle (which was changed in "Btrfs: avoid starting a
transaction in the write path"), correct?  Because it would have used
the passed transaction handle to do the chunk allocation, and thus would
not need to do join_transaction/end_transaction leading to recursive
run_delayed_refs call.

Alex.

On Fri, Mar 7, 2014 at 3:01 AM, Josef Bacik <jbacik@fb.com> wrote:
> Zach found this deadlock that would happen like this
>
> btrfs_end_transaction <- reduce trans->use_count to 0
>   btrfs_run_delayed_refs
>     btrfs_cow_block
>       find_free_extent
>         btrfs_start_transaction <- increase trans->use_count to 1
>         allocate chunk
>         btrfs_end_transaction <- decrease trans->use_count to 0
>           btrfs_run_delayed_refs
>             lock tree block we are cowing above ^^
>
> We need to only decrease trans->use_count if it is above 1, otherwise leave it
> alone. This will make nested trans be the only ones who decrease their added
> ref, and will let us get rid of the trans->use_count++ hack if we have to commit
> the transaction. Thanks,
>
> cc: stable@vger.kernel.org
> Reported-by: Zach Brown <zab@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/transaction.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 34cd831..b05bf58 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -683,7 +683,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>         int lock = (trans->type != TRANS_JOIN_NOLOCK);
>         int err = 0;
>
> -       if (--trans->use_count) {
> +       if (trans->use_count > 1) {
> +               trans->use_count--;
>                 trans->block_rsv = trans->orig_rsv;
>                 return 0;
>         }
> @@ -731,17 +732,10 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>         }
>
>         if (lock && ACCESS_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) {
> -               if (throttle) {
> -                       /*
> -                        * We may race with somebody else here so end up having
> -                        * to call end_transaction on ourselves again, so inc
> -                        * our use_count.
> -                        */
> -                       trans->use_count++;
> +               if (throttle)
>                         return btrfs_commit_transaction(trans, root);
> -               } else {
> +               else
>                         wake_up_process(info->transaction_kthread);
> -               }
>         }
>
>         if (trans->type & __TRANS_FREEZABLE)
> --
> 1.8.3.1
>
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 34cd831..b05bf58 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -683,7 +683,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	int lock = (trans->type != TRANS_JOIN_NOLOCK);
 	int err = 0;
 
-	if (--trans->use_count) {
+	if (trans->use_count > 1) {
+		trans->use_count--;
 		trans->block_rsv = trans->orig_rsv;
 		return 0;
 	}
@@ -731,17 +732,10 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	}
 
 	if (lock && ACCESS_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) {
-		if (throttle) {
-			/*
-			 * We may race with somebody else here so end up having
-			 * to call end_transaction on ourselves again, so inc
-			 * our use_count.
-			 */
-			trans->use_count++;
+		if (throttle)
 			return btrfs_commit_transaction(trans, root);
-		} else {
+		else
 			wake_up_process(info->transaction_kthread);
-		}
 	}
 
 	if (trans->type & __TRANS_FREEZABLE)
Zach found this deadlock that would happen like this

btrfs_end_transaction <- reduce trans->use_count to 0
  btrfs_run_delayed_refs
    btrfs_cow_block
      find_free_extent
        btrfs_start_transaction <- increase trans->use_count to 1
        allocate chunk
        btrfs_end_transaction <- decrease trans->use_count to 0
          btrfs_run_delayed_refs
            lock tree block we are cowing above ^^

We need to only decrease trans->use_count if it is above 1, otherwise
leave it alone.  This will make nested trans be the only ones who
decrease their added ref, and will let us get rid of the
trans->use_count++ hack if we have to commit the transaction.  Thanks,

cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/transaction.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)
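To make the refcount reasoning easier to follow, here is a toy userspace
model of the guarded decrement.  This is not the kernel code: struct
toy_trans and the helper names are invented for illustration, and the
heavyweight "finishing work" stands in for running delayed refs hitting
find_free_extent and allocating a chunk, as in the trace above.

```c
/*
 * Toy model of the trans->use_count fix -- NOT kernel code.
 * It shows why guarding the decrement with "use_count > 1" lets the
 * final end of a transaction handle internally join/end the same
 * handle again without the inner end also taking the "last
 * reference" path (which, in the real code, re-runs delayed refs
 * and deadlocks on a tree block the outer caller already holds).
 */
#include <stdio.h>

struct toy_trans {
	int use_count;
	int finished;
};

/* Nested user: reuse the running transaction by bumping its refcount,
 * roughly what joining an existing transaction does. */
static void toy_join(struct toy_trans *t)
{
	t->use_count++;
}

static void toy_end(struct toy_trans *t)
{
	/* The fix: a nested handle only drops its own reference. */
	if (t->use_count > 1) {
		t->use_count--;
		return;
	}

	/*
	 * Last reference: do the final work, which itself joins and
	 * ends the transaction.  With the guard above, the inner end
	 * just decrements.  With the old "--use_count" test, the outer
	 * call would already have dropped the count to 0, so the inner
	 * end would also land here and redo the finishing work.
	 */
	toy_join(t);
	toy_end(t);	/* inner end: sees use_count == 2, just drops it */
	t->finished = 1;
}

int main(void)
{
	struct toy_trans t = { .use_count = 1, .finished = 0 };

	toy_end(&t);
	printf("use_count=%d finished=%d\n", t.use_count, t.finished);
	return 0;
}
```

Running the sketch prints "use_count=1 finished=1": the nested join/end
pair is absorbed by the guard, and the final work runs exactly once.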