
[v2] btrfs: clean snapshots one by one

Message ID 1362154654-17655-1-git-send-email-dsterba@suse.cz (mailing list archive)
State New, archived

Commit Message

David Sterba March 1, 2013, 4:17 p.m. UTC
Each time, pick one dead root from the list and let the caller know
whether it needs to be called again. This should improve responsiveness
during umount and balance, which at some point wait for all currently
queued dead roots to be cleaned.

A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.

Snapshot cleaning is now done only from the cleaner thread; other
threads wake it if needed.

Signed-off-by: David Sterba <dsterba@suse.cz>
---

v1->v2:
- added s_umount trylock in cleaner thread
- added exit into drop_snapshot if fs is going down

patch based on cmason/integration

 fs/btrfs/disk-io.c     |   10 ++++++--
 fs/btrfs/extent-tree.c |    8 ++++++
 fs/btrfs/relocation.c  |    3 --
 fs/btrfs/transaction.c |   57 ++++++++++++++++++++++++++++++++----------------
 fs/btrfs/transaction.h |    2 +-
 5 files changed, 54 insertions(+), 26 deletions(-)

Comments

David Sterba March 1, 2013, 4:30 p.m. UTC | #1
The mail did not make it to the list on Monday; anyway, now I have a
few testing results to share.

Umount responsiveness is much improved: it took only a few seconds even
when many roots were scheduled for cleaning.

With

$ echo 'func btrfs_drop_snapshot +pf' > /sys/kernel/debug/dynamic_debug/control
$ echo 'func btrfs_clean_one_deleted_snapshot +pf' > /sys/kernel/debug/dynamic_debug/control

one can watch how it proceeds.

After umount and mount, the cleaning process starts again after about
30 seconds, when a transaction commit triggers.

The next test was to let it run with balance (just 'fi balance', no other
options). That worked well, but I hit an oops from a balance cancel command
that was issued sometime during the balance. According to the log, it happened
right after the balance finished. CCing Ilya; I'm not sure whether this is
somehow induced by the patch.

3407         BUG_ON(fs_info->balance_ctl || atomic_read(&fs_info->balance_running));
3408         atomic_dec(&fs_info->balance_cancel_req);
3409         mutex_unlock(&fs_info->balance_mutex);
3410         return 0;
3411 }

 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/volumes.c:3407!
 invalid opcode: 0000 [#1] SMP
 Modules linked in: btrfs aoe dm_crypt loop [last unloaded: btrfs]
 CPU 0
 Pid: 19117, comm: btrfs Tainted: G        W    3.8.0-default+ #267 Intel
 RIP: 0010:[<ffffffffa0160919>]  [<ffffffffa0160919>] btrfs_cancel_balance+0x179/0x180 [btrfs]
 RSP: 0018:ffff88005fddfd78  EFLAGS: 00010286
 RAX: ffff88005fddffd8 RBX: ffff88005adec000 RCX: 0000000000000006
 RDX: 0000000000000001 RSI: ffff880027cc88c0 RDI: 2222222222222222
 RBP: ffff88005fddfdd8 R08: 2222222222222222 R09: 2222222222222222
 R10: 2222222222222222 R11: 0000000000000000 R12: ffff88005adeeac8
 R13: ffff88005adeeb70 R14: ffff88005fddfd78 R15: ffff88005adeeb88
 FS:  00007fe136daa740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 00007ff04c655000 CR3: 000000005fc4b000 CR4: 00000000000007f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process btrfs (pid: 19117, threadinfo ffff88005fdde000, task ffff880027cc8240)
 Stack:
  0000000000000000 ffff880027cc8240 ffffffff810745c0 ffff88005fddfd90
  ffff88005fddfd90 ffffffff810a6c6d ffff88005fddfdc8 ffff88007724d6c0
  ffff880055e6d000 0000000000000002 ffff880078ccc080 ffff880078ccc538
 Call Trace:
  [<ffffffff810745c0>] ? wake_up_bit+0x40/0x40
  [<ffffffff810a6c6d>] ? lock_release_holdtime+0x3d/0x1c0
  [<ffffffffa016c8a3>] btrfs_ioctl+0x693/0x1d80 [btrfs]
  [<ffffffff81079723>] ? up_read+0x23/0x40
  [<ffffffff8195e268>] ? __do_page_fault+0x238/0x590
  [<ffffffff81087d1f>] ? local_clock+0x6f/0x80
  [<ffffffff810a6909>] ? trace_hardirqs_off_caller+0x29/0xc0
  [<ffffffff811708f8>] do_vfs_ioctl+0x98/0x560
  [<ffffffff8195a906>] ? error_sti+0x5/0x6
  [<ffffffff8195a458>] ? retint_swapgs+0x13/0x1b
  [<ffffffff81170e17>] sys_ioctl+0x57/0x90
  [<ffffffff8139adce>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
 RIP  [<ffffffffa0160919>] btrfs_cancel_balance+0x179/0x180 [btrfs]
  RSP <ffff88005fddfd78>
 ---[ end trace 930320f35566d010 ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas March 2, 2013, 9:32 p.m. UTC | #2
Hi David,

On Fri, Mar 1, 2013 at 6:17 PM, David Sterba <dsterba@suse.cz> wrote:
> Each time pick one dead root from the list and let the caller know if
> it's needed to continue. This should improve responsiveness during
> umount and balance which at some point wait for cleaning all currently
> queued dead roots.
>
> A new dead root is added to the end of the list, so the snapshots
> disappear in the order of deletion.
>
> Process snapshot cleaning is now done only from the cleaner thread and
> the others wake it if needed.
>
> Signed-off-by: David Sterba <dsterba@suse.cz>
> ---
>
> v1->v2:
> - added s_umount trylock in cleaner thread
> - added exit into drop_snapshot if fs is going down
>
> patch based on cmason/integration
>
>  fs/btrfs/disk-io.c     |   10 ++++++--
>  fs/btrfs/extent-tree.c |    8 ++++++
>  fs/btrfs/relocation.c  |    3 --
>  fs/btrfs/transaction.c |   57 ++++++++++++++++++++++++++++++++----------------
>  fs/btrfs/transaction.h |    2 +-
>  5 files changed, 54 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index eb7c143..cc85fc7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1652,15 +1652,19 @@ static int cleaner_kthread(void *arg)
>         struct btrfs_root *root = arg;
>
>         do {
> +               int again = 0;
> +
>                 if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
> +                   down_read_trylock(&root->fs_info->sb->s_umount) &&
>                     mutex_trylock(&root->fs_info->cleaner_mutex)) {
>                         btrfs_run_delayed_iputs(root);
> -                       btrfs_clean_old_snapshots(root);
> +                       again = btrfs_clean_one_deleted_snapshot(root);
>                         mutex_unlock(&root->fs_info->cleaner_mutex);
>                         btrfs_run_defrag_inodes(root->fs_info);
> +                       up_read(&root->fs_info->sb->s_umount);
>                 }
>
> -               if (!try_to_freeze()) {
> +               if (!try_to_freeze() && !again) {
>                         set_current_state(TASK_INTERRUPTIBLE);
>                         if (!kthread_should_stop())
>                                 schedule();
> @@ -3338,8 +3342,8 @@ int btrfs_commit_super(struct btrfs_root *root)
>
>         mutex_lock(&root->fs_info->cleaner_mutex);
>         btrfs_run_delayed_iputs(root);
> -       btrfs_clean_old_snapshots(root);
>         mutex_unlock(&root->fs_info->cleaner_mutex);
> +       wake_up_process(root->fs_info->cleaner_kthread);
>
>         /* wait until ongoing cleanup work done */
>         down_write(&root->fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d2b3a5e..0119ae7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7078,6 +7078,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
>   * reference count by one. if update_ref is true, this function
>   * also make sure backrefs for the shared block and all lower level
>   * blocks are properly updated.
> + *
> + * If called with for_reloc == 0, may exit early with -EAGAIN
>   */
>  int btrfs_drop_snapshot(struct btrfs_root *root,
>                          struct btrfs_block_rsv *block_rsv, int update_ref,
> @@ -7179,6 +7181,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>         wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>
>         while (1) {
> +               if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
> +                       pr_debug("btrfs: drop snapshot early exit\n");
> +                       err = -EAGAIN;
> +                       goto out_end_trans;
> +               }
> +
>                 ret = walk_down_tree(trans, root, path, wc);
>                 if (ret < 0) {
>                         err = ret;
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index ba5a321..ab6a718 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
>
>         while (1) {
>                 mutex_lock(&fs_info->cleaner_mutex);
> -
> -               btrfs_clean_old_snapshots(fs_info->tree_root);
>                 ret = relocate_block_group(rc);
> -
>                 mutex_unlock(&fs_info->cleaner_mutex);
>                 if (ret < 0) {
>                         err = ret;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index a83d486..6b233c15 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -950,7 +950,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
>  int btrfs_add_dead_root(struct btrfs_root *root)
>  {
>         spin_lock(&root->fs_info->trans_lock);
> -       list_add(&root->root_list, &root->fs_info->dead_roots);
> +       list_add_tail(&root->root_list, &root->fs_info->dead_roots);
>         spin_unlock(&root->fs_info->trans_lock);
>         return 0;
>  }
> @@ -1858,31 +1858,50 @@ cleanup_transaction:
>  }
>
>  /*
> - * interface function to delete all the snapshots we have scheduled for deletion
> + * return < 0 if error
> + * 0 if there are no more dead_roots at the time of call
> + * 1 there are more to be processed, call me again
> + *
> + * The return value indicates there are certainly more snapshots to delete, but
> + * if there comes a new one during processing, it may return 0. We don't mind,
> + * because btrfs_commit_super will poke cleaner thread and it will process it a
> + * few seconds later.
>   */
> -int btrfs_clean_old_snapshots(struct btrfs_root *root)
> +int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
>  {
> -       LIST_HEAD(list);
> +       int ret;
> +       int run_again = 1;
>         struct btrfs_fs_info *fs_info = root->fs_info;
>
> +       if (root->fs_info->sb->s_flags & MS_RDONLY) {
> +               pr_debug("btrfs: cleaner called for RO fs!\n");
> +               return 0;
> +       }
> +
>         spin_lock(&fs_info->trans_lock);
> -       list_splice_init(&fs_info->dead_roots, &list);
> +       if (list_empty(&fs_info->dead_roots)) {
> +               spin_unlock(&fs_info->trans_lock);
> +               return 0;
> +       }
> +       root = list_first_entry(&fs_info->dead_roots,
> +                       struct btrfs_root, root_list);
> +       list_del(&root->root_list);
>         spin_unlock(&fs_info->trans_lock);
>
> -       while (!list_empty(&list)) {
> -               int ret;
> -
> -               root = list_entry(list.next, struct btrfs_root, root_list);
> -               list_del(&root->root_list);
> +       pr_debug("btrfs: cleaner removing %llu\n",
> +                       (unsigned long long)root->objectid);
>
> -               btrfs_kill_all_delayed_nodes(root);
> +       btrfs_kill_all_delayed_nodes(root);
>
> -               if (btrfs_header_backref_rev(root->node) <
> -                   BTRFS_MIXED_BACKREF_REV)
> -                       ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> -               else
> -                       ret =btrfs_drop_snapshot(root, NULL, 1, 0);
> -               BUG_ON(ret < 0);
> -       }
> -       return 0;
> +       if (btrfs_header_backref_rev(root->node) <
> +                       BTRFS_MIXED_BACKREF_REV)
> +               ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> +       else
> +               ret = btrfs_drop_snapshot(root, NULL, 1, 0);
> +       /*
> +        * If we encounter a transaction abort during snapshot cleaning, we
> +        * don't want to crash here
> +        */
> +       BUG_ON(ret < 0 && (ret != -EAGAIN || ret != -EROFS));
> +       return run_again || ret == -EAGAIN;
I think that this expression will always evaluate to "true", because
"run_again" is set to 1 and never reset.

So now if btrfs_drop_snapshot() returns -EAGAIN, then
btrfs_clean_one_deleted_snapshot() will return 1, thus telling the
caller "call me again", like your comment says. However, in this case
we are unmounting, so is this ok? Or is this just to avoid the cleaner
going to sleep?



>  }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 5afd7b1..52f6f25 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -121,7 +121,7 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
>
>  int btrfs_add_dead_root(struct btrfs_root *root);
>  int btrfs_defrag_root(struct btrfs_root *root);
> -int btrfs_clean_old_snapshots(struct btrfs_root *root);
> +int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
>  int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>                              struct btrfs_root *root);
>  int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
> --
> 1.7.9
>

Also, I want to ask (hope this is not inappropriate): do you also agree
with Josef that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
transaction, but just to detach from it? Had we committed, we would
have ensured that the ORPHAN_ITEM is in the root tree, thus preventing
the subvolume from re-appearing after a crash.
It seems a little inconsistent with snapshot creation, where not only is
the transaction committed, but a delalloc flush is performed to ensure
that all data is on disk before the snapshot is created.

Thanks,
Alex.
Ilya Dryomov March 6, 2013, 9:08 a.m. UTC | #3
On Fri, Mar 01, 2013 at 05:30:57PM +0100, David Sterba wrote:
> The mail did not make it to the list on Monday, anyway, now I have a
> few testing results to share.
> 
> The umount responsiveness is much improved, it took a few seconds when
> there were many roots to clean scheduled.
> 
> With
> 
> $ echo 'func btrfs_drop_snapshot +pf' > /sys/kernel/debug/dynamic_debug/control
> $ echo 'func btrfs_clean_one_deleted_snapshot +pf' > /sys/kernel/debug/dynamic_debug/control
> 
> one can watch how it proceeds.
> 
> After umount and mount, the cleaning process starts again, after 30
> seconds when transaction commit triggers.
> 
> Next test was to let it run with balance (just 'fi balance', no other
> options). Worked well, but I've hit an oops from balance cancel command that
> was triggered sometime during the balance. According to the log it happened
> right after the balance finished. CCing Ilya, I'm not sure if this is somehow
> induced by the patch.
> 
> 3407         BUG_ON(fs_info->balance_ctl || atomic_read(&fs_info->balance_running));

Were you by any chance running this on top of a for-linus that included
a raid56 merge commit?  If so, this is result of a raid56 mismerge --
fs_info->balance_ctl is supposed to be cleared in __cancel_balance(), a
call to which was accidentally nuked.

Thanks,

		Ilya
David Sterba March 6, 2013, 4:47 p.m. UTC | #4
On Wed, Mar 06, 2013 at 11:08:57AM +0200, Ilya Dryomov wrote:
> On Fri, Mar 01, 2013 at 05:30:57PM +0100, David Sterba wrote:
> > 3407         BUG_ON(fs_info->balance_ctl || atomic_read(&fs_info->balance_running));
> 
> Were you by any chance running this on top of a for-linus that included
> a raid56 merge commit?  If so, this is result of a raid56 mismerge --
> fs_info->balance_ctl is supposed to be cleared in __cancel_balance(), a
> call to which was accidentally nuked.

I've been testing either plain next/master or based on linus' tree, so
yes, this merge was there and the missing cancel explains the BUG_ON.

david
David Sterba March 6, 2013, 6:09 p.m. UTC | #5
On Sat, Mar 02, 2013 at 11:32:07PM +0200, Alex Lyakas wrote:
> On Fri, Mar 1, 2013 at 6:17 PM, David Sterba <dsterba@suse.cz> wrote:
> > -               btrfs_kill_all_delayed_nodes(root);
> > +       btrfs_kill_all_delayed_nodes(root);
> >
> > -               if (btrfs_header_backref_rev(root->node) <
> > -                   BTRFS_MIXED_BACKREF_REV)
> > -                       ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> > -               else
> > -                       ret =btrfs_drop_snapshot(root, NULL, 1, 0);
> > -               BUG_ON(ret < 0);
> > -       }
> > -       return 0;
> > +       if (btrfs_header_backref_rev(root->node) <
> > +                       BTRFS_MIXED_BACKREF_REV)
> > +               ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> > +       else
> > +               ret = btrfs_drop_snapshot(root, NULL, 1, 0);
> > +       /*
> > +        * If we encounter a transaction abort during snapshot cleaning, we
> > +        * don't want to crash here
> > +        */
> > +       BUG_ON(ret < 0 && (ret != -EAGAIN || ret != -EROFS));
> > +       return run_again || ret == -EAGAIN;
> I think that this expression will always evaluate to "true", because
> "run_again" is set to 1 and never reset.

That's right, I probably had some intentions in the past, but now it's
just superfluous.

> So now if btrfs_drop_snapshot() returns -EAGAIN, then
> btrfs_clean_one_deleted_snapshot() will return 1, thus telling the
> caller "call me again", like your comment says. However, in this case
> we are unmounting, so is this ok? Or this is just to avoid the cleaner
> going to sleep?

close_ctree calls kthread_stop(cleaner) which would wake cleaner anyway.

The cases when we want to send the cleaner to sleep are when there's
nothing to do, or the read-only case, which is just a safety check.

Given that, the last statement should be just 'return 1', with
run_again removed.

> Also, I want to ask, hope this is not inappropriate. Do you also agree
> with Josef, that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
> transaction, but just to detach from it? Had we committed, we would
> have ensured that ORPHAN_ITEM is in the root tree, thus preventing
> from subvol to re-appear after crash.
> It seems a little inconsistent with snap creation, where not only the
> transaction is committed, but delalloc flush is performed to ensure
> that all data is on disk before creating the snap.

That's another question, can you please point me to the thread where
this was discussed?

thanks,
david
Chris Mason March 7, 2013, 3:12 a.m. UTC | #6
On Wed, Mar 06, 2013 at 11:09:15AM -0700, David Sterba wrote:
> On Sat, Mar 02, 2013 at 11:32:07PM +0200, Alex Lyakas wrote:
> > On Fri, Mar 1, 2013 at 6:17 PM, David Sterba <dsterba@suse.cz> wrote:
> > > -               btrfs_kill_all_delayed_nodes(root);
> > > +       btrfs_kill_all_delayed_nodes(root);
> > >
> > > -               if (btrfs_header_backref_rev(root->node) <
> > > -                   BTRFS_MIXED_BACKREF_REV)
> > > -                       ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> > > -               else
> > > -                       ret =btrfs_drop_snapshot(root, NULL, 1, 0);
> > > -               BUG_ON(ret < 0);
> > > -       }
> > > -       return 0;
> > > +       if (btrfs_header_backref_rev(root->node) <
> > > +                       BTRFS_MIXED_BACKREF_REV)
> > > +               ret = btrfs_drop_snapshot(root, NULL, 0, 0);
> > > +       else
> > > +               ret = btrfs_drop_snapshot(root, NULL, 1, 0);
> > > +       /*
> > > +        * If we encounter a transaction abort during snapshot cleaning, we
> > > +        * don't want to crash here
> > > +        */
> > > +       BUG_ON(ret < 0 && (ret != -EAGAIN || ret != -EROFS));
> > > +       return run_again || ret == -EAGAIN;
> > I think that this expression will always evaluate to "true", because
> > "run_again" is set to 1 and never reset.
> 
> That's right, I probably had some intentions in the past, but now it's
> just superflous.
> 
> > So now if btrfs_drop_snapshot() returns -EAGAIN, then
> > btrfs_clean_one_deleted_snapshot() will return 1, thus telling the
> > caller "call me again", like your comment says. However, in this case
> > we are unmounting, so is this ok? Or this is just to avoid the cleaner
> > going to sleep?
> 
> close_ctree calls kthread_stop(cleaner) which would wake cleaner anyway.
> 
> The cases when we want send cleaner to sleep are when there's nothing to
> do or the read-only case which is just a safety check.
> 
> From that point, the last statement should be 'return 1' and run_again
> removed.
> 
> > Also, I want to ask, hope this is not inappropriate. Do you also agree
> > with Josef, that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
> > transaction, but just to detach from it? Had we committed, we would
> > have ensured that ORPHAN_ITEM is in the root tree, thus preventing
> > from subvol to re-appear after crash.
> > It seems a little inconsistent with snap creation, where not only the
> > transaction is committed, but delalloc flush is performed to ensure
> > that all data is on disk before creating the snap.
> 
> That's another question, can you please point me to the thread where
> this was discussed?

That's a really old one.  The original snapshot code expected people to
run sync first, but that's not very user friendly.  The idea is that if
you write a file and then take a snapshot, that file should be in the
snapshot.

-chris

David Sterba March 7, 2013, 11:55 a.m. UTC | #7
On Wed, Mar 06, 2013 at 10:12:11PM -0500, Chris Mason wrote:
> > > Also, I want to ask, hope this is not inappropriate. Do you also agree
> > > with Josef, that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
> > > transaction, but just to detach from it? Had we committed, we would
> > > have ensured that ORPHAN_ITEM is in the root tree, thus preventing
> > > from subvol to re-appear after crash.
> > > It seems a little inconsistent with snap creation, where not only the
> > > transaction is committed, but delalloc flush is performed to ensure
> > > that all data is on disk before creating the snap.
> > 
> > That's another question, can you please point me to the thread where
> > this was discussed?
> 
> That's a really old one.  The original snapshot code expected people to
> run sync first, but that's not very user friendly.  The idea is that if
> you write a file and then take a snapshot, that file should be in the
> snapshot.

The snapshot behaviour sounds ok to me.

That a subvolume/snapshot may re-appear after a crash if the
transaction commit did not happen does not feel so good. We know that
the subvolume is only scheduled for deletion and still needs to be
processed by the cleaner.

From that point I'd rather see the commit happen, to avoid unexpected
surprises. A subvolume that re-appears still holds its data references
and consumes space, although the user does not expect that.

Automated snapshotting and deletion needs some guarantees about the
behaviour and about what to do after a crash. As it stands, it has to
process the backlog of previously deleted snapshots and verify that
they're not there, compared to "deleted -> will never re-appear, can
forget about it".


david
Alex Lyakas March 16, 2013, 7:33 p.m. UTC | #8
Hi David,

On Thu, Mar 7, 2013 at 1:55 PM, David Sterba <dsterba@suse.cz> wrote:
> On Wed, Mar 06, 2013 at 10:12:11PM -0500, Chris Mason wrote:
>> > > Also, I want to ask, hope this is not inappropriate. Do you also agree
>> > > with Josef, that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
>> > > transaction, but just to detach from it? Had we committed, we would
>> > > have ensured that ORPHAN_ITEM is in the root tree, thus preventing
>> > > from subvol to re-appear after crash.
>> > > It seems a little inconsistent with snap creation, where not only the
>> > > transaction is committed, but delalloc flush is performed to ensure
>> > > that all data is on disk before creating the snap.
>> >
>> > That's another question, can you please point me to the thread where
>> > this was discussed?
http://www.spinics.net/lists/linux-btrfs/msg22256.html


>>
>> That's a really old one.  The original snapshot code expected people to
>> run sync first, but that's not very user friendly.  The idea is that if
>> you write a file and then take a snapshot, that file should be in the
>> snapshot.
>
> The snapshot behaviour sounds ok to me.
>
> That a subvol/snapshot may appear after crash if transation commit did
> not happen does not feel so good. We know that the subvol is only
> scheduled for deletion and needs to be processed by cleaner.
>
> From that point I'd rather see the commit to happen to avoid any
> unexpected surprises.  A subvolume that re-appears still holds the data
> references and consumes space although the user does not assume that.
>
> Automated snapshotting and deleting needs some guarantees about the
> behaviour and what to do after a crash. So now it has to process the
> backlog of previously deleted snapshots and verify that they're not
> there, compared to "deleted -> will never appear, can forget about it".

Exactly. Currently, user space has no idea when the deletion will
start or when it is completed (it has to track the ROOT_ITEM, drop
progress, ORPHAN_ITEM, etc.). That's why I was thinking that at least
committing a transaction on snap_destroy could ensure that the
deletion will not be reverted.

Thanks,
Alex.

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index eb7c143..cc85fc7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1652,15 +1652,19 @@  static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
+		int again = 0;
+
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
+		    down_read_trylock(&root->fs_info->sb->s_umount) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
-			btrfs_clean_old_snapshots(root);
+			again = btrfs_clean_one_deleted_snapshot(root);
 			mutex_unlock(&root->fs_info->cleaner_mutex);
 			btrfs_run_defrag_inodes(root->fs_info);
+			up_read(&root->fs_info->sb->s_umount);
 		}
 
-		if (!try_to_freeze()) {
+		if (!try_to_freeze() && !again) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (!kthread_should_stop())
 				schedule();
@@ -3338,8 +3342,8 @@  int btrfs_commit_super(struct btrfs_root *root)
 
 	mutex_lock(&root->fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(root);
-	btrfs_clean_old_snapshots(root);
 	mutex_unlock(&root->fs_info->cleaner_mutex);
+	wake_up_process(root->fs_info->cleaner_kthread);
 
 	/* wait until ongoing cleanup work done */
 	down_write(&root->fs_info->cleanup_work_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d2b3a5e..0119ae7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7078,6 +7078,8 @@  static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
  * reference count by one. if update_ref is true, this function
  * also make sure backrefs for the shared block and all lower level
  * blocks are properly updated.
+ *
+ * If called with for_reloc == 0, may exit early with -EAGAIN
  */
 int btrfs_drop_snapshot(struct btrfs_root *root,
 			 struct btrfs_block_rsv *block_rsv, int update_ref,
@@ -7179,6 +7181,12 @@  int btrfs_drop_snapshot(struct btrfs_root *root,
 	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
 
 	while (1) {
+		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
+			pr_debug("btrfs: drop snapshot early exit\n");
+			err = -EAGAIN;
+			goto out_end_trans;
+		}
+
 		ret = walk_down_tree(trans, root, path, wc);
 		if (ret < 0) {
 			err = ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ba5a321..ab6a718 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4060,10 +4060,7 @@  int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 
 	while (1) {
 		mutex_lock(&fs_info->cleaner_mutex);
-
-		btrfs_clean_old_snapshots(fs_info->tree_root);
 		ret = relocate_block_group(rc);
-
 		mutex_unlock(&fs_info->cleaner_mutex);
 		if (ret < 0) {
 			err = ret;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a83d486..6b233c15 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -950,7 +950,7 @@  static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
 int btrfs_add_dead_root(struct btrfs_root *root)
 {
 	spin_lock(&root->fs_info->trans_lock);
-	list_add(&root->root_list, &root->fs_info->dead_roots);
+	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
 	spin_unlock(&root->fs_info->trans_lock);
 	return 0;
 }
@@ -1858,31 +1858,50 @@  cleanup_transaction:
 }
 
 /*
- * interface function to delete all the snapshots we have scheduled for deletion
+ * return < 0 if error
+ * 0 if there are no more dead_roots at the time of call
+ * 1 there are more to be processed, call me again
+ *
+ * The return value indicates there are certainly more snapshots to delete, but
+ * if there comes a new one during processing, it may return 0. We don't mind,
+ * because btrfs_commit_super will poke cleaner thread and it will process it a
+ * few seconds later.
  */
-int btrfs_clean_old_snapshots(struct btrfs_root *root)
+int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
 {
-	LIST_HEAD(list);
+	int ret;
+	int run_again = 1;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 
+	if (root->fs_info->sb->s_flags & MS_RDONLY) {
+		pr_debug("btrfs: cleaner called for RO fs!\n");
+		return 0;
+	}
+
 	spin_lock(&fs_info->trans_lock);
-	list_splice_init(&fs_info->dead_roots, &list);
+	if (list_empty(&fs_info->dead_roots)) {
+		spin_unlock(&fs_info->trans_lock);
+		return 0;
+	}
+	root = list_first_entry(&fs_info->dead_roots,
+			struct btrfs_root, root_list);
+	list_del(&root->root_list);
 	spin_unlock(&fs_info->trans_lock);
 
-	while (!list_empty(&list)) {
-		int ret;
-
-		root = list_entry(list.next, struct btrfs_root, root_list);
-		list_del(&root->root_list);
+	pr_debug("btrfs: cleaner removing %llu\n",
+			(unsigned long long)root->objectid);
 
-		btrfs_kill_all_delayed_nodes(root);
+	btrfs_kill_all_delayed_nodes(root);
 
-		if (btrfs_header_backref_rev(root->node) <
-		    BTRFS_MIXED_BACKREF_REV)
-			ret = btrfs_drop_snapshot(root, NULL, 0, 0);
-		else
-			ret =btrfs_drop_snapshot(root, NULL, 1, 0);
-		BUG_ON(ret < 0);
-	}
-	return 0;
+	if (btrfs_header_backref_rev(root->node) <
+			BTRFS_MIXED_BACKREF_REV)
+		ret = btrfs_drop_snapshot(root, NULL, 0, 0);
+	else
+		ret = btrfs_drop_snapshot(root, NULL, 1, 0);
+	/*
+	 * If we encounter a transaction abort during snapshot cleaning, we
+	 * don't want to crash here
+	 */
+	BUG_ON(ret < 0 && (ret != -EAGAIN || ret != -EROFS));
+	return run_again || ret == -EAGAIN;
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 5afd7b1..52f6f25 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -121,7 +121,7 @@  int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 
 int btrfs_add_dead_root(struct btrfs_root *root);
 int btrfs_defrag_root(struct btrfs_root *root);
-int btrfs_clean_old_snapshots(struct btrfs_root *root);
+int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root);
 int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,