Message ID | 1441242317-16547-1-git-send-email-jmaggard@netgear.com (mailing list archive) |
---|---|
State | Superseded |
On Wed, Sep 02, 2015 at 06:05:17PM -0700, Justin Maggard wrote:
> v2: Fix stupid error while making formatting changes...

I haven't noticed any difference between the patches, what exactly did
you change?

> I was hitting a consistent NULL pointer dereference during shutdown that
> showed the trace running through end_workqueue_bio(). I traced it back to
> the endio_meta_workers workqueue being poked after it had already been
> destroyed.
>
> Eventually I found that the root cause was a qgroup rescan that was still
> in progress while we were stopping all the btrfs workers.
>
> Currently we explicitly pause balance and scrub operations in
> close_ctree(), but we do nothing to stop the qgroup rescan. We should
> probably be doing the same for qgroup rescan, but that's a much larger
> change. This small change is good enough to allow me to unmount without
> crashing.
>
> Signed-off-by: Justin Maggard <jmaggard@netgear.com>

Can you please submit the test you've used to trigger the crash to
fstests?

Reviewed-by: David Sterba <dsterba@suse.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 22, 2015 at 7:45 AM, David Sterba <dsterba@suse.cz> wrote:
> On Wed, Sep 02, 2015 at 06:05:17PM -0700, Justin Maggard wrote:
>> v2: Fix stupid error while making formatting changes...
>
> I haven't noticed any difference between the patches, what exactly did
> you change?
>

I broke compiling while cleaning up some checkpatch.pl feedback.
Here's what changed between v1 and v2:

-       if (!btrfs_fs_closing(fs_info)) {
+       if (!btrfs_fs_closing(fs_info))

>> I was hitting a consistent NULL pointer dereference during shutdown that
>> showed the trace running through end_workqueue_bio(). I traced it back to
>> the endio_meta_workers workqueue being poked after it had already been
>> destroyed.
>>
>> Eventually I found that the root cause was a qgroup rescan that was still
>> in progress while we were stopping all the btrfs workers.
>>
>> Currently we explicitly pause balance and scrub operations in
>> close_ctree(), but we do nothing to stop the qgroup rescan. We should
>> probably be doing the same for qgroup rescan, but that's a much larger
>> change. This small change is good enough to allow me to unmount without
>> crashing.
>>
>> Signed-off-by: Justin Maggard <jmaggard@netgear.com>
>
> Can you please submit the test you've used to trigger the crash to
> fstests?
>

Sure, I've got a reproducer coded up for xfstests now. Should I just
send that to this list, or is there a better place to send it?

-Justin
On Sat, Sep 26, 2015 at 1:25 AM, Justin Maggard <jmaggard10@gmail.com> wrote:
> On Tue, Sep 22, 2015 at 7:45 AM, David Sterba <dsterba@suse.cz> wrote:
>> On Wed, Sep 02, 2015 at 06:05:17PM -0700, Justin Maggard wrote:
>>> v2: Fix stupid error while making formatting changes...
>>
>> I haven't noticed any difference between the patches, what exactly did
>> you change?
>>
>
> I broke compiling while cleaning up some checkpatch.pl feedback.
> Here's what changed between v1 and v2:
>
> -       if (!btrfs_fs_closing(fs_info)) {
> +       if (!btrfs_fs_closing(fs_info))
>
>
>>> I was hitting a consistent NULL pointer dereference during shutdown that
>>> showed the trace running through end_workqueue_bio(). I traced it back to
>>> the endio_meta_workers workqueue being poked after it had already been
>>> destroyed.
>>>
>>> Eventually I found that the root cause was a qgroup rescan that was still
>>> in progress while we were stopping all the btrfs workers.
>>>
>>> Currently we explicitly pause balance and scrub operations in
>>> close_ctree(), but we do nothing to stop the qgroup rescan. We should
>>> probably be doing the same for qgroup rescan, but that's a much larger
>>> change. This small change is good enough to allow me to unmount without
>>> crashing.
>>>
>>> Signed-off-by: Justin Maggard <jmaggard@netgear.com>
>>
>> Can you please submit the test you've used to trigger the crash to
>> fstests?
>>
>
> Sure, I've got a reproducer coded up for xfstests now. Should I just
> send that to this list, or is there a better place to send it?

Just send it to fstests@vger.kernel.org with the btrfs mailing list on cc.
If you take a look at test submission emails in the btrfs mailing
list, you'll see how it's usually done.

thanks

>
> -Justin
On Thu, Sep 3, 2015 at 2:05 AM, Justin Maggard <jmaggard10@gmail.com> wrote:
> v2: Fix stupid error while making formatting changes...
>
> I was hitting a consistent NULL pointer dereference during shutdown that
> showed the trace running through end_workqueue_bio(). I traced it back to
> the endio_meta_workers workqueue being poked after it had already been
> destroyed.
>
> Eventually I found that the root cause was a qgroup rescan that was still
> in progress while we were stopping all the btrfs workers.
>
> Currently we explicitly pause balance and scrub operations in
> close_ctree(), but we do nothing to stop the qgroup rescan. We should
> probably be doing the same for qgroup rescan, but that's a much larger
> change. This small change is good enough to allow me to unmount without
> crashing.
>
> Signed-off-by: Justin Maggard <jmaggard@netgear.com>
> ---
>  fs/btrfs/qgroup.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d904ee1..5bfcee9 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2278,7 +2278,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
>  		goto out;
>
>  	err = 0;
> -	while (!err) {
> +	while (!err && !btrfs_fs_closing(fs_info)) {
>  		trans = btrfs_start_transaction(fs_info->fs_root, 0);
>  		if (IS_ERR(trans)) {
>  			err = PTR_ERR(trans);
> @@ -2301,7 +2301,8 @@ out:
>  	btrfs_free_path(path);
>
>  	mutex_lock(&fs_info->qgroup_rescan_lock);
> -	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
> +	if (!btrfs_fs_closing(fs_info))
> +		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>
>  	if (err > 0 &&
>  	    fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
> @@ -2330,7 +2331,9 @@ out:
>  	}
>  	btrfs_end_transaction(trans, fs_info->quota_root);
>
> -	if (err >= 0) {
> +	if (btrfs_fs_closing(fs_info)) {
> +		btrfs_info(fs_info, "qgroup scan paused");
> +	} else if (err >= 0) {
>  		btrfs_info(fs_info, "qgroup scan completed%s",
>  			err > 0 ? " (inconsistency flag cleared)" : "");
>  	} else {

Justin, this is still racy (however much less racy than before).

Once we leave the loop because of the condition
btrfs_fs_closing(fs_info), we start a transaction and do some write
operation on the quota btree. While or before we do such write
operation, close_ctree() might have completed or be at a point where
such write operation will result in another null pointer dereference,
or accessing some dangling pointer, or leak a transaction that never
gets committed (because close_ctree() already stopped the transaction
kthread), etc, etc.

So in addition to what you did, you need to call
btrfs_qgroup_wait_for_completion(fs_info) at disk-io.c:close_ctree()
right after setting fs_info->closing to 1.

Otherwise it looks good.
Thanks.

> --
> 2.5.1
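The ordering asked for in the review above (set the closing flag, wait for the rescan worker to finish, and only then tear down the shared workers) can be illustrated with a small userspace analogy. This is a minimal sketch using POSIX threads and C11 atomics, not btrfs code: `struct fs_state`, `touch_workers()`, `rescan_worker()`, `fs_start_rescan()` and `fs_close()` are hypothetical names invented for the illustration, and `pthread_join()` only stands in for the role that btrfs_qgroup_wait_for_completion() would play in close_ctree().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for filesystem state; NOT struct btrfs_fs_info. */
struct fs_state {
	atomic_bool closing;        /* plays the role of fs_info->closing */
	atomic_bool workers_alive;  /* false once shared workers are torn down */
	atomic_int use_after_stop;  /* counts accesses after teardown (the bug) */
	atomic_llong steps_done;    /* rescan progress */
	long long total_steps;      /* amount of work the rescan has to do */
	bool rescan_pending;        /* plays the role of the RESCAN status flag */
	pthread_t rescan_thread;
};

/* Record a use-after-teardown instead of crashing, so the sketch is testable. */
static void touch_workers(struct fs_state *fs)
{
	if (!atomic_load(&fs->workers_alive))
		atomic_fetch_add(&fs->use_after_stop, 1);
}

static void *rescan_worker(void *arg)
{
	struct fs_state *fs = arg;

	/* Stop early once shutdown has been requested. */
	while (atomic_load(&fs->steps_done) < fs->total_steps &&
	       !atomic_load(&fs->closing)) {
		touch_workers(fs);
		atomic_fetch_add(&fs->steps_done, 1);
		sched_yield();
	}

	/*
	 * Final bookkeeping still touches shared state after the loop exits,
	 * which is why the closing check alone does not remove the race.
	 */
	touch_workers(fs);
	if (!atomic_load(&fs->closing))
		fs->rescan_pending = false;	/* completed, not merely paused */
	return NULL;
}

static int fs_start_rescan(struct fs_state *fs, long long total_steps)
{
	atomic_init(&fs->closing, false);
	atomic_init(&fs->workers_alive, true);
	atomic_init(&fs->use_after_stop, 0);
	atomic_init(&fs->steps_done, 0);
	fs->total_steps = total_steps;
	fs->rescan_pending = true;
	return pthread_create(&fs->rescan_thread, NULL, rescan_worker, fs);
}

static void fs_close(struct fs_state *fs)
{
	int rc;

	atomic_store(&fs->closing, true);		/* 1. request shutdown */
	rc = pthread_join(fs->rescan_thread, NULL);	/* 2. wait for the worker */
	assert(rc == 0);
	(void)rc;
	atomic_store(&fs->workers_alive, false);	/* 3. only now tear down */
}
```

Calling `fs_start_rescan()` with a very large step count and then `fs_close()` right away should leave `use_after_stop` at zero and `rescan_pending` still set, which mirrors a paused rather than completed scan. Swapping steps 2 and 3 in `fs_close()` would reopen the window in which the worker's final bookkeeping runs against state that has already been torn down.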
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index d904ee1..5bfcee9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2278,7 +2278,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 		goto out;
 
 	err = 0;
-	while (!err) {
+	while (!err && !btrfs_fs_closing(fs_info)) {
 		trans = btrfs_start_transaction(fs_info->fs_root, 0);
 		if (IS_ERR(trans)) {
 			err = PTR_ERR(trans);
@@ -2301,7 +2301,8 @@ out:
 	btrfs_free_path(path);
 
 	mutex_lock(&fs_info->qgroup_rescan_lock);
-	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
+	if (!btrfs_fs_closing(fs_info))
+		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
 
 	if (err > 0 &&
 	    fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
@@ -2330,7 +2331,9 @@ out:
 	}
 	btrfs_end_transaction(trans, fs_info->quota_root);
 
-	if (err >= 0) {
+	if (btrfs_fs_closing(fs_info)) {
+		btrfs_info(fs_info, "qgroup scan paused");
+	} else if (err >= 0) {
 		btrfs_info(fs_info, "qgroup scan completed%s",
 			err > 0 ? " (inconsistency flag cleared)" : "");
 	} else {
v2: Fix stupid error while making formatting changes...

I was hitting a consistent NULL pointer dereference during shutdown that
showed the trace running through end_workqueue_bio(). I traced it back to
the endio_meta_workers workqueue being poked after it had already been
destroyed.

Eventually I found that the root cause was a qgroup rescan that was still
in progress while we were stopping all the btrfs workers.

Currently we explicitly pause balance and scrub operations in
close_ctree(), but we do nothing to stop the qgroup rescan. We should
probably be doing the same for qgroup rescan, but that's a much larger
change. This small change is good enough to allow me to unmount without
crashing.

Signed-off-by: Justin Maggard <jmaggard@netgear.com>
---
 fs/btrfs/qgroup.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)