Message ID | 1473870467-18721-1-git-send-email-bo.li.liu@oracle.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On 09/14/2016 12:27 PM, Liu Bo wrote: > While updating btree, we try to push items between sibling > nodes/leaves in order to keep height as low as possible. > But we don't memset the original places with zero when > pushing items so that we could end up leaving stale content > in nodes/leaves. One may read the above stale content by > increasing btree blocks' @nritems. > Ok this sounds really bad. Is this as bad as I think it sounds? We should probably fix this like right now right? > One case I've come across is that in fs tree, a leaf has two > parent nodes, hence running balance ends up with processing > this leaf with two parent nodes, but it can only reach the > valid parent node through btrfs_search_slot, so it'd be like, > > do_relocation > for P in all parent nodes of block A: > if !P->eb: > btrfs_search_slot(key); --> get path from P to A. > if lowest: > BUG_ON(A->bytenr != bytenr of A recorded in P); > btrfs_cow_block(P, A); --> change A's bytenr in P. > > After btrfs_cow_block, P has the new bytenr of A, but with the > same @key, we get the same path again, and get panic by BUG_ON. > Ok so this happens because of the problem you described above right? Because this shouldn't actually happen. We should go down the next parent and still get to the original block where A->bytenr == node->bytenr. So I'm all for killing this BUG_ON(), but the problem description is wrong. We need to keep this scenario from happening in the first place. And then we kill this BUG_ON() because it can happen if there is FS corruption or something. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/14/2016 01:13 PM, Josef Bacik wrote: > On 09/14/2016 12:27 PM, Liu Bo wrote: >> While updating btree, we try to push items between sibling >> nodes/leaves in order to keep height as low as possible. >> But we don't memset the original places with zero when >> pushing items so that we could end up leaving stale content >> in nodes/leaves. One may read the above stale content by >> increasing btree blocks' @nritems. >> > > Ok this sounds really bad. Is this as bad as I think it sounds? We > should probably fix this like right now right? He's bumping @nritems with a fuzzer I think? As in this happens when someone forces it (or via some other bug) but not in normal operations. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/14/2016 01:29 PM, Chris Mason wrote: > > > On 09/14/2016 01:13 PM, Josef Bacik wrote: >> On 09/14/2016 12:27 PM, Liu Bo wrote: >>> While updating btree, we try to push items between sibling >>> nodes/leaves in order to keep height as low as possible. >>> But we don't memset the original places with zero when >>> pushing items so that we could end up leaving stale content >>> in nodes/leaves. One may read the above stale content by >>> increasing btree blocks' @nritems. >>> >> >> Ok this sounds really bad. Is this as bad as I think it sounds? We >> should probably fix this like right now right? > > He's bumping @nritems with a fuzzer I think? As in this happens when someone > forces it (or via some other bug) but not in normal operations. > Oh ok if this happens with a fuzzer than this is fine, but I'd rather do -EIO so we know this is something bad with the fs. And change the changelog to make it explicit that this is the result of fs corruption, not normal operation. Then you can add Reviewed-by: Josef Bacik <jbacik@fb.com> Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 14, 2016 at 01:13:34PM -0400, Josef Bacik wrote: > On 09/14/2016 12:27 PM, Liu Bo wrote: > > While updating btree, we try to push items between sibling > > nodes/leaves in order to keep height as low as possible. > > But we don't memset the original places with zero when > > pushing items so that we could end up leaving stale content > > in nodes/leaves. One may read the above stale content by > > increasing btree blocks' @nritems. > > > > Ok this sounds really bad. Is this as bad as I think it sounds? We should > probably fix this like right now right? Yeah, I'm preparing two patches to address it. > > > One case I've come across is that in fs tree, a leaf has two > > parent nodes, hence running balance ends up with processing > > this leaf with two parent nodes, but it can only reach the > > valid parent node through btrfs_search_slot, so it'd be like, > > > > do_relocation > > for P in all parent nodes of block A: > > if !P->eb: > > btrfs_search_slot(key); --> get path from P to A. > > if lowest: > > BUG_ON(A->bytenr != bytenr of A recorded in P); > > btrfs_cow_block(P, A); --> change A's bytenr in P. > > > > After btrfs_cow_block, P has the new bytenr of A, but with the > > same @key, we get the same path again, and get panic by BUG_ON. > > > > Ok so this happens because of the problem you described above right? > Because this shouldn't actually happen. We should go down the next parent > and still get to the original block where A->bytenr == node->bytenr. After bumping block 55279616's @nritems from 320 to 492, fs tree key (FS_TREE ROOT_ITEM 0) node 55230464 level 2 items 4 free 489 generation 11 owner 5 fs uuid 03737dfb-8087-4923-b058-2ec629bf39bd chunk uuid d586e037-9d50-4332-9b3a-fa2344d210e1 key (256 INODE_ITEM 0) block 55279616 (3374) gen 11 nr 0 key (257 DIR_INDEX 24879) block 56410112 (3443) gen 11 nr 1 key (10404 INODE_ITEM 0) block 60391424 (3686) gen 11 nr 2 key (34068 INODE_ITEM 0) block 55246848 (3372) gen 11 nr 3 node 55279616 level 1 items 492 free 1 generation 11 owner 5 ... key (257 DIR_INDEX 24625) block 44335104 (2706) gen 9 nr 319 key (2100 INODE_ITEM 0) block 30425088 (1857) gen 7 nr 320 key (2148 INODE_ITEM 0) block 30441472 (1858) gen 7 nr 321 node 56410112 level 1 items 311 free 182 generation 11 owner 5 fs uuid 03737dfb-8087-4923-b058-2ec629bf39bd chunk uuid d586e037-9d50-4332-9b3a-fa2344d210e1 ... key (2052 INODE_ITEM 0) block 30408704 (1856) gen 7 nr 137 key (2100 INODE_ITEM 0) block 30425088 (1857) gen 7 nr 138 key (2148 INODE_ITEM 0) block 30441472 (1858) gen 7 nr 139 ... If we search fs tree with disk key (2100 INODE_ITEM 0), we always get into node block 56410112, not node block 55279616 after bin_search in the top level, so leaf block 30408704 never gets both parent. > So I'm > all for killing this BUG_ON(), but the problem description is wrong. We > need to keep this scenario from happening in the first place. And then we > kill this BUG_ON() because it can happen if there is FS corruption or > something. Thanks, We really should, we can prevent it from happening by checking btree node's validation although it is not as easy as checking a leaf. The description is what I got from debugging, just wanna show how we come to the BUG_ON since it's not straighforward at all. But yes, I'll update the description that this problem is due to fs corruption. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote: > On 09/14/2016 01:29 PM, Chris Mason wrote: > > > > > > On 09/14/2016 01:13 PM, Josef Bacik wrote: > > > On 09/14/2016 12:27 PM, Liu Bo wrote: > > > > While updating btree, we try to push items between sibling > > > > nodes/leaves in order to keep height as low as possible. > > > > But we don't memset the original places with zero when > > > > pushing items so that we could end up leaving stale content > > > > in nodes/leaves. One may read the above stale content by > > > > increasing btree blocks' @nritems. > > > > > > > > > > Ok this sounds really bad. Is this as bad as I think it sounds? We > > > should probably fix this like right now right? > > > > He's bumping @nritems with a fuzzer I think? As in this happens when someone > > forces it (or via some other bug) but not in normal operations. > > > > Oh ok if this happens with a fuzzer than this is fine, but I'd rather do > -EIO so we know this is something bad with the fs. -EIO may be more appropriate to be given while reading btree blocks and checking their validation? > And change the changelog > to make it explicit that this is the result of fs corruption, not normal > operation. Then you can add > > Reviewed-by: Josef Bacik <jbacik@fb.com> OK, make sense. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/15/2016 03:01 PM, Liu Bo wrote: > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote: >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote: >>> On 09/14/2016 01:29 PM, Chris Mason wrote: >>>> >>>> >>>> On 09/14/2016 01:13 PM, Josef Bacik wrote: >>>>> On 09/14/2016 12:27 PM, Liu Bo wrote: >>>>>> While updating btree, we try to push items between sibling >>>>>> nodes/leaves in order to keep height as low as possible. >>>>>> But we don't memset the original places with zero when >>>>>> pushing items so that we could end up leaving stale content >>>>>> in nodes/leaves. One may read the above stale content by >>>>>> increasing btree blocks' @nritems. >>>>>> >>>>> >>>>> Ok this sounds really bad. Is this as bad as I think it sounds? We >>>>> should probably fix this like right now right? >>>> >>>> He's bumping @nritems with a fuzzer I think? As in this happens when someone >>>> forces it (or via some other bug) but not in normal operations. >>>> >>> >>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do >>> -EIO so we know this is something bad with the fs. >> >> -EIO may be more appropriate to be given while reading btree blocks and >> checking their validation? > > Looks like EIO doesn't fit into this case, either, do we have any errno > representing 'corrupted filesystem'? That's EIO. Sometimes the EIO is big enough we have to abort, but really the abort is just adding bonus. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote: > On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote: > > On 09/14/2016 01:29 PM, Chris Mason wrote: > > > > > > > > > On 09/14/2016 01:13 PM, Josef Bacik wrote: > > > > On 09/14/2016 12:27 PM, Liu Bo wrote: > > > > > While updating btree, we try to push items between sibling > > > > > nodes/leaves in order to keep height as low as possible. > > > > > But we don't memset the original places with zero when > > > > > pushing items so that we could end up leaving stale content > > > > > in nodes/leaves. One may read the above stale content by > > > > > increasing btree blocks' @nritems. > > > > > > > > > > > > > Ok this sounds really bad. Is this as bad as I think it sounds? We > > > > should probably fix this like right now right? > > > > > > He's bumping @nritems with a fuzzer I think? As in this happens when someone > > > forces it (or via some other bug) but not in normal operations. > > > > > > > Oh ok if this happens with a fuzzer than this is fine, but I'd rather do > > -EIO so we know this is something bad with the fs. > > -EIO may be more appropriate to be given while reading btree blocks and > checking their validation? Looks like EIO doesn't fit into this case, either, do we have any errno representing 'corrupted filesystem'? Thanks, -liubo > > > And change the changelog > > to make it explicit that this is the result of fs corruption, not normal > > operation. Then you can add > > > > Reviewed-by: Josef Bacik <jbacik@fb.com> > > OK, make sense. > > Thanks, > > -liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 15, 2016 at 02:58:12PM -0400, Chris Mason wrote: > > > On 09/15/2016 03:01 PM, Liu Bo wrote: > > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote: > >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote: > >>> On 09/14/2016 01:29 PM, Chris Mason wrote: > >>>> > >>>> > >>>> On 09/14/2016 01:13 PM, Josef Bacik wrote: > >>>>> On 09/14/2016 12:27 PM, Liu Bo wrote: > >>>>>> While updating btree, we try to push items between sibling > >>>>>> nodes/leaves in order to keep height as low as possible. > >>>>>> But we don't memset the original places with zero when > >>>>>> pushing items so that we could end up leaving stale content > >>>>>> in nodes/leaves. One may read the above stale content by > >>>>>> increasing btree blocks' @nritems. > >>>>>> > >>>>> > >>>>> Ok this sounds really bad. Is this as bad as I think it sounds? We > >>>>> should probably fix this like right now right? > >>>> > >>>> He's bumping @nritems with a fuzzer I think? As in this happens when someone > >>>> forces it (or via some other bug) but not in normal operations. > >>>> > >>> > >>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do > >>> -EIO so we know this is something bad with the fs. > >> > >> -EIO may be more appropriate to be given while reading btree blocks and > >> checking their validation? > > > > Looks like EIO doesn't fit into this case, either, do we have any errno > > representing 'corrupted filesystem'? > > That's EIO. Sometimes the EIO is big enough we have to abort, but > really the abort is just adding bonus. I think we misuse the EIO where we should really return EFSCORRUPTED that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be really a message that the hardware is bad. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 19, 2016 at 08:01:05PM +0200, David Sterba wrote: > On Thu, Sep 15, 2016 at 02:58:12PM -0400, Chris Mason wrote: > > > > > > On 09/15/2016 03:01 PM, Liu Bo wrote: > > > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote: > > >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote: > > >>> On 09/14/2016 01:29 PM, Chris Mason wrote: > > >>>> > > >>>> > > >>>> On 09/14/2016 01:13 PM, Josef Bacik wrote: > > >>>>> On 09/14/2016 12:27 PM, Liu Bo wrote: > > >>>>>> While updating btree, we try to push items between sibling > > >>>>>> nodes/leaves in order to keep height as low as possible. > > >>>>>> But we don't memset the original places with zero when > > >>>>>> pushing items so that we could end up leaving stale content > > >>>>>> in nodes/leaves. One may read the above stale content by > > >>>>>> increasing btree blocks' @nritems. > > >>>>>> > > >>>>> > > >>>>> Ok this sounds really bad. Is this as bad as I think it sounds? We > > >>>>> should probably fix this like right now right? > > >>>> > > >>>> He's bumping @nritems with a fuzzer I think? As in this happens when someone > > >>>> forces it (or via some other bug) but not in normal operations. > > >>>> > > >>> > > >>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do > > >>> -EIO so we know this is something bad with the fs. > > >> > > >> -EIO may be more appropriate to be given while reading btree blocks and > > >> checking their validation? > > > > > > Looks like EIO doesn't fit into this case, either, do we have any errno > > > representing 'corrupted filesystem'? > > > > That's EIO. Sometimes the EIO is big enough we have to abort, but > > really the abort is just adding bonus. > > I think we misuse the EIO where we should really return EFSCORRUPTED > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be > really a message that the hardware is bad. I love this idea, but one quick question, when returning EUCLEAN, what message do users get? "#define EUCLEAN 117 /* Structure needs cleaning */" Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote: > > > That's EIO. Sometimes the EIO is big enough we have to abort, but > > > really the abort is just adding bonus. > > > > I think we misuse the EIO where we should really return EFSCORRUPTED > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be > > really a message that the hardware is bad. > > I love this idea, but one quick question, when returning EUCLEAN, what > message do users get? > > "#define EUCLEAN 117 /* Structure needs cleaning */" strerror(EUCLEAN) -> "Structure needs cleaning" -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote: > On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote: > > > > That's EIO. Sometimes the EIO is big enough we have to abort, but > > > > really the abort is just adding bonus. > > > > > > I think we misuse the EIO where we should really return EFSCORRUPTED > > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be > > > really a message that the hardware is bad. > > > > I love this idea, but one quick question, when returning EUCLEAN, what > > message do users get? > > > > "#define EUCLEAN 117 /* Structure needs cleaning */" > > strerror(EUCLEAN) -> "Structure needs cleaning" Hmm, if I was the user, I'm not sure how to deal with "Structure needs cleaning", so still need to take a glance at dmesg log. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 20, 2016 at 10:59:59AM -0700, Liu Bo wrote: > On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote: > > On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote: > > > > > That's EIO. Sometimes the EIO is big enough we have to abort, but > > > > > really the abort is just adding bonus. > > > > > > > > I think we misuse the EIO where we should really return EFSCORRUPTED > > > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be > > > > really a message that the hardware is bad. > > > > > > I love this idea, but one quick question, when returning EUCLEAN, what > > > message do users get? > > > > > > "#define EUCLEAN 117 /* Structure needs cleaning */" > > > > strerror(EUCLEAN) -> "Structure needs cleaning" > > Hmm, if I was the user, I'm not sure how to deal with "Structure needs > cleaning", so still need to take a glance at dmesg log. I understand that, it's not descriptive at all. We could remap it to EIO once it goes outside of btrfs module, so it would be only internal error. The commit that introduces it to xfs is 10 years old, da2f4d679c8070ba5b6a920281e495917b293aa0 and mentions something like that. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 680e234..bce8fea 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2723,7 +2723,14 @@ static int do_relocation(struct btrfs_trans_handle *trans, bytenr = btrfs_node_blockptr(upper->eb, slot); if (lowest) { - BUG_ON(bytenr != node->bytenr); + if (bytenr != node->bytenr) { + btrfs_err(root->fs_info, + "lowest leaf/node mismatch: bytenr %llu node->bytenr %llu slot %d upper %llu", + bytenr, node->bytenr, slot, + upper->eb->start); + err = -EEXIST; + goto next; + } } else { if (node->eb->start == bytenr) goto next;
While updating btree, we try to push items between sibling nodes/leaves in order to keep height as low as possible. But we don't memset the original places with zero when pushing items so that we could end up leaving stale content in nodes/leaves. One may read the above stale content by increasing btree blocks' @nritems. One case I've come across is that in fs tree, a leaf has two parent nodes, hence running balance ends up with processing this leaf with two parent nodes, but it can only reach the valid parent node through btrfs_search_slot, so it'd be like, do_relocation for P in all parent nodes of block A: if !P->eb: btrfs_search_slot(key); --> get path from P to A. if lowest: BUG_ON(A->bytenr != bytenr of A recorded in P); btrfs_cow_block(P, A); --> change A's bytenr in P. After btrfs_cow_block, P has the new bytenr of A, but with the same @key, we get the same path again, and get panic by BUG_ON. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/relocation.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)