Message ID | 1359371042-5553-1-git-send-email-bo.li.liu@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Jan 28, 2013 at 07:04:02PM +0800, Liu Bo wrote: > While running snapshot testscript created by Mitch and David, > the race between autodefrag and snapshot deletion can lead to > corruption of dead_root list so that we can get crash on > btrfs_clean_old_snapshots(). > > And besides autodefrag, scrub also do the same thing, ie. read > root first and get inode. > > Here is the story(take autodefrag as an example): > (1) when we delete a snapshot or subvolume, it will set its root's > refs to zero and do a iput() on its own inode, and if this inode happens > to be the only active in-meory one in root's inode rbtree, it will add > itself to the global dead_roots list for later cleanup. > > (2) after (1), the autodefrag thread may read another inode for defrag > and the inode is just in the deleted snapshot/subvolume, but all of these > are without checking if the root is still valid(refs > 0). So the end up > result is adding the deleted snapshot/subvolume's root to the global > dead_roots list AGAIN. > > Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. > > So all we need to do is to take the lock to protect 'read root and get inode', > since we synchronize to wait for the rcu grace period before adding something > to the global dead_roots list. Nice writeup! > Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > --- > fs/btrfs/file.c | 12 ++++++++++++ > fs/btrfs/scrub.c | 25 ++++++++++++++++++++----- > 2 files changed, 32 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index f76b1fd..1cca5c9 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -293,21 +293,33 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, > struct btrfs_key key; > struct btrfs_ioctl_defrag_range_args range; > int num_defrag; > + int index; > > /* get the inode */ > key.objectid = defrag->root; > btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); > key.offset = (u64)-1; > + > + index = srcu_read_lock(&fs_info->subvol_srcu); > + > inode_root = btrfs_read_fs_root_no_name(fs_info, &key); > if (IS_ERR(inode_root)) { > + srcu_read_unlock(&fs_info->subvol_srcu, index); > kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > return PTR_ERR(inode_root); > } > + if (btrfs_root_refs(&inode_root->root_item) == 0) { > + srcu_read_unlock(&fs_info->subvol_srcu, index); > + kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > + return -ENOENT; Both error conditions share fair amount of code, I suggest to put it into the exit block. > + } > > key.objectid = defrag->ino; > btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); > key.offset = 0; > inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); > + > + srcu_read_unlock(&fs_info->subvol_srcu, index); > if (IS_ERR(inode)) { > kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > return PTR_ERR(inode); > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index bdbb94f..e487d54 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -580,20 +580,29 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) > int corrected = 0; > struct btrfs_key key; > struct inode *inode = NULL; > + struct btrfs_fs_info *fs_info; > u64 end = offset + PAGE_SIZE - 1; > struct btrfs_root *local_root; > + int i_srcu; This looks like it's someting related to the inode, while it's the opaque SRCU index and I suggest to name it like this. > > key.objectid = root; > key.type = BTRFS_ROOT_ITEM_KEY; > key.offset = (u64)-1; > - local_root = btrfs_read_fs_root_no_name(fixup->root->fs_info, &key); > - if (IS_ERR(local_root)) > + > + fs_info = fixup->root->fs_info; > + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); > + > + local_root = btrfs_read_fs_root_no_name(fs_info, &key); > + if (IS_ERR(local_root)) { > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > return PTR_ERR(local_root); > + } > > key.type = BTRFS_INODE_ITEM_KEY; > key.objectid = inum; > key.offset = 0; > - inode = btrfs_iget(fixup->root->fs_info->sb, &key, local_root, NULL); > + inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > if (IS_ERR(inode)) > return PTR_ERR(inode); > > @@ -606,7 +615,6 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) > } > > if (PageUptodate(page)) { > - struct btrfs_fs_info *fs_info; Unrelated change, and afaics fs_info is used a few lines below. > if (PageDirty(page)) { > /* > * we need to write the data to the defect sector. the > @@ -3180,18 +3188,25 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root, void *ctx) > u64 physical_for_dev_replace; > u64 len; > struct btrfs_fs_info *fs_info = nocow_ctx->sctx->dev_root->fs_info; > + int i_srcu; int index; > > key.objectid = root; > key.type = BTRFS_ROOT_ITEM_KEY; > key.offset = (u64)-1; > + > + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); > + > local_root = btrfs_read_fs_root_no_name(fs_info, &key); > - if (IS_ERR(local_root)) > + if (IS_ERR(local_root)) { > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > return PTR_ERR(local_root); > + } > > key.type = BTRFS_INODE_ITEM_KEY; > key.objectid = inum; > key.offset = 0; > inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > if (IS_ERR(inode)) > return PTR_ERR(inode); > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 28, 2013 at 5:04 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > While running snapshot testscript created by Mitch and David, > the race between autodefrag and snapshot deletion can lead to > corruption of dead_root list so that we can get crash on > btrfs_clean_old_snapshots(). > > And besides autodefrag, scrub also do the same thing, ie. read > root first and get inode. > > Here is the story(take autodefrag as an example): > (1) when we delete a snapshot or subvolume, it will set its root's > refs to zero and do a iput() on its own inode, and if this inode happens > to be the only active in-meory one in root's inode rbtree, it will add > itself to the global dead_roots list for later cleanup. > > (2) after (1), the autodefrag thread may read another inode for defrag > and the inode is just in the deleted snapshot/subvolume, but all of these > are without checking if the root is still valid(refs > 0). So the end up > result is adding the deleted snapshot/subvolume's root to the global > dead_roots list AGAIN. > > Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. > > So all we need to do is to take the lock to protect 'read root and get inode', > since we synchronize to wait for the rcu grace period before adding something > to the global dead_roots list. > > Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> I'm still seeing seeing issues with duplications in the dead_roots list. I'm using a 3.7.4 kernel merged with the for-linus branch with the following four patches: [PATCH V5] Btrfs: snapshot-aware defrag [PATCH] Btrfs: List Debugging for cleaning deleted Non-functional patch to issue some trace_printk debugging. [PATCH] [RFC] Btrfs: Check for duplicate dead root list This is the patch discussed in the snapshot-aware defrag thread. It checks for duplicate list entries, and dumps a backtrace if it finds one. Btrfs: fix race between snapshot deletion and getting inode I've run into several backtraces similar to the following: [ 3129.368196] btrfs: Duplicate dead root entry. [ 3129.368199] ------------[ cut here ]------------ [ 3129.368220] WARNING: at fs/btrfs/transaction.c:893 btrfs_add_dead_root+0x73/0xbc [btrfs]() [ 3129.368223] Hardware name: OptiPlex 745 [ 3129.368224] Modules linked in: ipv6 snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc iTCO_wdt ppdev iTCO_vendor_support i2c_i801 parport_pc floppy tg3 sr_mod microcode snd_timer snd lpc_ich serio_raw pcspkr parport ablk_helper cryptd lrw xts gf128mul aes_x86_64 sha256_generic fuse xfs nfs lockd sunrpc reiserfs btrfs zlib_deflate ext4 jbd2 ext3 jbd ext2 mbcache sl811_hcd hid_generic xhci_hcd ohci_hcd uhci_hcd ehci_hcd [ 3129.368268] Pid: 4309, comm: btrfs-endio-wri Tainted: G W 3.7.4-sad-v2+ #1 [ 3129.368271] Call Trace: [ 3129.368278] [<ffffffff81030586>] warn_slowpath_common+0x83/0x9b [ 3129.368282] [<ffffffff810305b8>] warn_slowpath_null+0x1a/0x1c [ 3129.368297] [<ffffffffa0179e0b>] btrfs_add_dead_root+0x73/0xbc [btrfs] [ 3129.368313] [<ffffffffa0187bef>] btrfs_destroy_inode+0x227/0x25b [btrfs] [ 3129.368319] [<ffffffff8111393a>] destroy_inode+0x3b/0x54 [ 3129.368322] [<ffffffff81113a9c>] evict+0x149/0x151 [ 3129.368327] [<ffffffff81114322>] iput+0x12c/0x135 [ 3129.368342] [<ffffffffa01845e7>] relink_extent_backref+0x669/0x6af [btrfs] [ 3129.368346] [<ffffffff815e9849>] ? __slab_free+0x17c/0x21b [ 3129.368362] [<ffffffffa017c33d>] ? record_extent_backrefs+0xa3/0xa3 [btrfs] [ 3129.368377] [<ffffffffa0184d9d>] ? btrfs_finish_ordered_io+0x770/0x827 [btrfs] [ 3129.368393] [<ffffffffa0184d6d>] btrfs_finish_ordered_io+0x740/0x827 [btrfs] [ 3129.368409] [<ffffffffa0184e69>] finish_ordered_fn+0x15/0x17 [btrfs] [ 3129.368424] [<ffffffffa019e7fd>] worker_loop+0x14c/0x493 [btrfs] [ 3129.368439] [<ffffffffa019e6b1>] ? btrfs_queue_worker+0x258/0x258 [btrfs] [ 3129.368443] [<ffffffff8104c750>] kthread+0xba/0xc2 [ 3129.368447] [<ffffffff8104c696>] ? kthread_freezable_should_stop+0x52/0x52 [ 3129.368451] [<ffffffff815f301c>] ret_from_fork+0x7c/0xb0 [ 3129.368455] [<ffffffff8104c696>] ? kthread_freezable_should_stop+0x52/0x52 [ 3129.368458] ---[ end trace 46705ba72c45db88 ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 28, 2013 at 06:48:24PM -0600, Mitch Harder wrote: > On Mon, Jan 28, 2013 at 5:04 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > > While running snapshot testscript created by Mitch and David, > > the race between autodefrag and snapshot deletion can lead to > > corruption of dead_root list so that we can get crash on > > btrfs_clean_old_snapshots(). > > > > And besides autodefrag, scrub also do the same thing, ie. read > > root first and get inode. > > > > Here is the story(take autodefrag as an example): > > (1) when we delete a snapshot or subvolume, it will set its root's > > refs to zero and do a iput() on its own inode, and if this inode happens > > to be the only active in-meory one in root's inode rbtree, it will add > > itself to the global dead_roots list for later cleanup. > > > > (2) after (1), the autodefrag thread may read another inode for defrag > > and the inode is just in the deleted snapshot/subvolume, but all of these > > are without checking if the root is still valid(refs > 0). So the end up > > result is adding the deleted snapshot/subvolume's root to the global > > dead_roots list AGAIN. > > > > Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. > > > > So all we need to do is to take the lock to protect 'read root and get inode', > > since we synchronize to wait for the rcu grace period before adding something > > to the global dead_roots list. > > > > Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > > I'm still seeing seeing issues with duplications in the dead_roots list. Hah, don't worry, surely this will happen, since I don't yet send you and the ML the updated snapshot-aware defrag patch with srcu protection, it'll be a V6 version ;) Thanks for debugging and testing it, Mitch! thanks, liubo > > I'm using a 3.7.4 kernel merged with the for-linus branch with the > following four patches: > [PATCH V5] Btrfs: snapshot-aware defrag > [PATCH] Btrfs: List Debugging for cleaning deleted > Non-functional patch to issue some trace_printk debugging. > [PATCH] [RFC] Btrfs: Check for duplicate dead root list > This is the patch discussed in the snapshot-aware defrag thread. > It checks for duplicate list entries, and dumps a backtrace > if it finds one. > Btrfs: fix race between snapshot deletion and getting inode > > I've run into several backtraces similar to the following: > > [ 3129.368196] btrfs: Duplicate dead root entry. > [ 3129.368199] ------------[ cut here ]------------ > [ 3129.368220] WARNING: at fs/btrfs/transaction.c:893 > btrfs_add_dead_root+0x73/0xbc [btrfs]() > [ 3129.368223] Hardware name: OptiPlex 745 > [ 3129.368224] Modules linked in: ipv6 snd_hda_codec_analog > snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc iTCO_wdt > ppdev iTCO_vendor_support i2c_i801 parport_pc floppy tg3 sr_mod > microcode snd_timer snd lpc_ich serio_raw pcspkr parport ablk_helper > cryptd lrw xts gf128mul aes_x86_64 sha256_generic fuse xfs nfs lockd > sunrpc reiserfs btrfs zlib_deflate ext4 jbd2 ext3 jbd ext2 mbcache > sl811_hcd hid_generic xhci_hcd ohci_hcd uhci_hcd ehci_hcd > [ 3129.368268] Pid: 4309, comm: btrfs-endio-wri Tainted: G W > 3.7.4-sad-v2+ #1 > [ 3129.368271] Call Trace: > [ 3129.368278] [<ffffffff81030586>] warn_slowpath_common+0x83/0x9b > [ 3129.368282] [<ffffffff810305b8>] warn_slowpath_null+0x1a/0x1c > [ 3129.368297] [<ffffffffa0179e0b>] btrfs_add_dead_root+0x73/0xbc [btrfs] > [ 3129.368313] [<ffffffffa0187bef>] btrfs_destroy_inode+0x227/0x25b [btrfs] > [ 3129.368319] [<ffffffff8111393a>] destroy_inode+0x3b/0x54 > [ 3129.368322] [<ffffffff81113a9c>] evict+0x149/0x151 > [ 3129.368327] [<ffffffff81114322>] iput+0x12c/0x135 > [ 3129.368342] [<ffffffffa01845e7>] relink_extent_backref+0x669/0x6af [btrfs] > [ 3129.368346] [<ffffffff815e9849>] ? __slab_free+0x17c/0x21b > [ 3129.368362] [<ffffffffa017c33d>] ? record_extent_backrefs+0xa3/0xa3 [btrfs] > [ 3129.368377] [<ffffffffa0184d9d>] ? > btrfs_finish_ordered_io+0x770/0x827 [btrfs] > [ 3129.368393] [<ffffffffa0184d6d>] btrfs_finish_ordered_io+0x740/0x827 [btrfs] > [ 3129.368409] [<ffffffffa0184e69>] finish_ordered_fn+0x15/0x17 [btrfs] > [ 3129.368424] [<ffffffffa019e7fd>] worker_loop+0x14c/0x493 [btrfs] > [ 3129.368439] [<ffffffffa019e6b1>] ? btrfs_queue_worker+0x258/0x258 [btrfs] > [ 3129.368443] [<ffffffff8104c750>] kthread+0xba/0xc2 > [ 3129.368447] [<ffffffff8104c696>] ? kthread_freezable_should_stop+0x52/0x52 > [ 3129.368451] [<ffffffff815f301c>] ret_from_fork+0x7c/0xb0 > [ 3129.368455] [<ffffffff8104c696>] ? kthread_freezable_should_stop+0x52/0x52 > [ 3129.368458] ---[ end trace 46705ba72c45db88 ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 28, 2013 at 11:21:27PM +0100, David Sterba wrote: > On Mon, Jan 28, 2013 at 07:04:02PM +0800, Liu Bo wrote: > > While running snapshot testscript created by Mitch and David, > > the race between autodefrag and snapshot deletion can lead to > > corruption of dead_root list so that we can get crash on > > btrfs_clean_old_snapshots(). > > > > And besides autodefrag, scrub also do the same thing, ie. read > > root first and get inode. > > > > Here is the story(take autodefrag as an example): > > (1) when we delete a snapshot or subvolume, it will set its root's > > refs to zero and do a iput() on its own inode, and if this inode happens > > to be the only active in-meory one in root's inode rbtree, it will add > > itself to the global dead_roots list for later cleanup. > > > > (2) after (1), the autodefrag thread may read another inode for defrag > > and the inode is just in the deleted snapshot/subvolume, but all of these > > are without checking if the root is still valid(refs > 0). So the end up > > result is adding the deleted snapshot/subvolume's root to the global > > dead_roots list AGAIN. > > > > Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. > > > > So all we need to do is to take the lock to protect 'read root and get inode', > > since we synchronize to wait for the rcu grace period before adding something > > to the global dead_roots list. > > Nice writeup! > > > Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > > --- > > fs/btrfs/file.c | 12 ++++++++++++ > > fs/btrfs/scrub.c | 25 ++++++++++++++++++++----- > > 2 files changed, 32 insertions(+), 5 deletions(-) > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > > index f76b1fd..1cca5c9 100644 > > --- a/fs/btrfs/file.c > > +++ b/fs/btrfs/file.c > > @@ -293,21 +293,33 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, > > struct btrfs_key key; > > struct btrfs_ioctl_defrag_range_args range; > > int num_defrag; > > + int index; > > > > /* get the inode */ > > key.objectid = defrag->root; > > btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); > > key.offset = (u64)-1; > > + > > + index = srcu_read_lock(&fs_info->subvol_srcu); > > + > > inode_root = btrfs_read_fs_root_no_name(fs_info, &key); > > if (IS_ERR(inode_root)) { > > + srcu_read_unlock(&fs_info->subvol_srcu, index); > > kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > > return PTR_ERR(inode_root); > > } > > + if (btrfs_root_refs(&inode_root->root_item) == 0) { > > + srcu_read_unlock(&fs_info->subvol_srcu, index); > > + kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > > + return -ENOENT; > > Both error conditions share fair amount of code, I suggest to put it > into the exit block. Good catch. > > > + } > > > > key.objectid = defrag->ino; > > btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); > > key.offset = 0; > > inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); > > + > > + srcu_read_unlock(&fs_info->subvol_srcu, index); > > if (IS_ERR(inode)) { > > kmem_cache_free(btrfs_inode_defrag_cachep, defrag); > > return PTR_ERR(inode); > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > > index bdbb94f..e487d54 100644 > > --- a/fs/btrfs/scrub.c > > +++ b/fs/btrfs/scrub.c > > @@ -580,20 +580,29 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) > > int corrected = 0; > > struct btrfs_key key; > > struct inode *inode = NULL; > > + struct btrfs_fs_info *fs_info; > > u64 end = offset + PAGE_SIZE - 1; > > struct btrfs_root *local_root; > > + int i_srcu; > > This looks like it's someting related to the inode, while it's the > opaque SRCU index and I suggest to name it like this. > Since there is already a 'unsigned long index;', what about a direct name 'srcu_index'? > > > > key.objectid = root; > > key.type = BTRFS_ROOT_ITEM_KEY; > > key.offset = (u64)-1; > > - local_root = btrfs_read_fs_root_no_name(fixup->root->fs_info, &key); > > - if (IS_ERR(local_root)) > > + > > + fs_info = fixup->root->fs_info; > > + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); > > + > > + local_root = btrfs_read_fs_root_no_name(fs_info, &key); > > + if (IS_ERR(local_root)) { > > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > > return PTR_ERR(local_root); > > + } > > > > key.type = BTRFS_INODE_ITEM_KEY; > > key.objectid = inum; > > key.offset = 0; > > - inode = btrfs_iget(fixup->root->fs_info->sb, &key, local_root, NULL); > > + inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); > > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > > if (IS_ERR(inode)) > > return PTR_ERR(inode); > > > > @@ -606,7 +615,6 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) > > } > > > > if (PageUptodate(page)) { > > - struct btrfs_fs_info *fs_info; > > Unrelated change, and afaics fs_info is used a few lines below. To use fs_info earlier, I just move this to the head of this function. > > > if (PageDirty(page)) { > > /* > > * we need to write the data to the defect sector. the > > @@ -3180,18 +3188,25 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root, void *ctx) > > u64 physical_for_dev_replace; > > u64 len; > > struct btrfs_fs_info *fs_info = nocow_ctx->sctx->dev_root->fs_info; > > + int i_srcu; > > int index; See 'srcu_index' above. Thanks for reviewing. thanks, liubo > > > > key.objectid = root; > > key.type = BTRFS_ROOT_ITEM_KEY; > > key.offset = (u64)-1; > > + > > + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); > > + > > local_root = btrfs_read_fs_root_no_name(fs_info, &key); > > - if (IS_ERR(local_root)) > > + if (IS_ERR(local_root)) { > > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > > return PTR_ERR(local_root); > > + } > > > > key.type = BTRFS_INODE_ITEM_KEY; > > key.objectid = inum; > > key.offset = 0; > > inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); > > + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); > > if (IS_ERR(inode)) > > return PTR_ERR(inode); > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index f76b1fd..1cca5c9 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -293,21 +293,33 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, struct btrfs_key key; struct btrfs_ioctl_defrag_range_args range; int num_defrag; + int index; /* get the inode */ key.objectid = defrag->root; btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); key.offset = (u64)-1; + + index = srcu_read_lock(&fs_info->subvol_srcu); + inode_root = btrfs_read_fs_root_no_name(fs_info, &key); if (IS_ERR(inode_root)) { + srcu_read_unlock(&fs_info->subvol_srcu, index); kmem_cache_free(btrfs_inode_defrag_cachep, defrag); return PTR_ERR(inode_root); } + if (btrfs_root_refs(&inode_root->root_item) == 0) { + srcu_read_unlock(&fs_info->subvol_srcu, index); + kmem_cache_free(btrfs_inode_defrag_cachep, defrag); + return -ENOENT; + } key.objectid = defrag->ino; btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); key.offset = 0; inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); + + srcu_read_unlock(&fs_info->subvol_srcu, index); if (IS_ERR(inode)) { kmem_cache_free(btrfs_inode_defrag_cachep, defrag); return PTR_ERR(inode); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index bdbb94f..e487d54 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -580,20 +580,29 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) int corrected = 0; struct btrfs_key key; struct inode *inode = NULL; + struct btrfs_fs_info *fs_info; u64 end = offset + PAGE_SIZE - 1; struct btrfs_root *local_root; + int i_srcu; key.objectid = root; key.type = BTRFS_ROOT_ITEM_KEY; key.offset = (u64)-1; - local_root = btrfs_read_fs_root_no_name(fixup->root->fs_info, &key); - if (IS_ERR(local_root)) + + fs_info = fixup->root->fs_info; + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); + + local_root = btrfs_read_fs_root_no_name(fs_info, &key); + if (IS_ERR(local_root)) { + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); return PTR_ERR(local_root); + } key.type = BTRFS_INODE_ITEM_KEY; key.objectid = inum; key.offset = 0; - inode = btrfs_iget(fixup->root->fs_info->sb, &key, local_root, NULL); + inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); if (IS_ERR(inode)) return PTR_ERR(inode); @@ -606,7 +615,6 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) } if (PageUptodate(page)) { - struct btrfs_fs_info *fs_info; if (PageDirty(page)) { /* * we need to write the data to the defect sector. the @@ -3180,18 +3188,25 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root, void *ctx) u64 physical_for_dev_replace; u64 len; struct btrfs_fs_info *fs_info = nocow_ctx->sctx->dev_root->fs_info; + int i_srcu; key.objectid = root; key.type = BTRFS_ROOT_ITEM_KEY; key.offset = (u64)-1; + + i_srcu = srcu_read_lock(&fs_info->subvol_srcu); + local_root = btrfs_read_fs_root_no_name(fs_info, &key); - if (IS_ERR(local_root)) + if (IS_ERR(local_root)) { + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); return PTR_ERR(local_root); + } key.type = BTRFS_INODE_ITEM_KEY; key.objectid = inum; key.offset = 0; inode = btrfs_iget(fs_info->sb, &key, local_root, NULL); + srcu_read_unlock(&fs_info->subvol_srcu, i_srcu); if (IS_ERR(inode)) return PTR_ERR(inode);
While running snapshot testscript created by Mitch and David, the race between autodefrag and snapshot deletion can lead to corruption of dead_root list so that we can get crash on btrfs_clean_old_snapshots(). And besides autodefrag, scrub also do the same thing, ie. read root first and get inode. Here is the story(take autodefrag as an example): (1) when we delete a snapshot or subvolume, it will set its root's refs to zero and do a iput() on its own inode, and if this inode happens to be the only active in-meory one in root's inode rbtree, it will add itself to the global dead_roots list for later cleanup. (2) after (1), the autodefrag thread may read another inode for defrag and the inode is just in the deleted snapshot/subvolume, but all of these are without checking if the root is still valid(refs > 0). So the end up result is adding the deleted snapshot/subvolume's root to the global dead_roots list AGAIN. Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. So all we need to do is to take the lock to protect 'read root and get inode', since we synchronize to wait for the rcu grace period before adding something to the global dead_roots list. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/file.c | 12 ++++++++++++ fs/btrfs/scrub.c | 25 ++++++++++++++++++++----- 2 files changed, 32 insertions(+), 5 deletions(-)