Message ID | cf69fdbd608338aaa7916736ac50e2cfdc3d4338.1679512207.git.boris@bur.io (mailing list archive)
---|---
State | New, archived
Series | btrfs: fix corruption caused by partial dio writes
This is a bit of a mess. And the root cause of that is that btrfs_extract_ordered_extent, the way it is used right now, does the wrong thing in terms of splitting the ordered_extent. What we want is to allocate a new one for the beginning of the range and leave the rest alone.

I ran into this a while ago during my (not yet submitted) work to keep an ordered_extent pointer in the btrfs_bio, and I have some patches to sort it out.

I've rebased your fix on top of those; can you check if this tree makes sense to you:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio-fix-hch

It passes basic testing so far.
On Thu, Mar 23, 2023 at 01:47:28AM -0700, Christoph Hellwig wrote:
> I've rebased your fix on top of those, can you check if this tree
> makes sense to you;
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio-fix-hch
>
> it passes basic testing so far.

Nice, this is great!

I actually also made the same changes as in your branch while working on my fix, but didn't know enough about the zoned use case to realize that the simpler "extract from beginning" constraint also applied to the zoned case. So what happened in my branch was that I implemented the three-way split as two "split at start" operations, which ultimately felt too messy, so I opted for returning the new split objects from the existing model.

If it's true that we can always do a "split from front", then I'm all aboard and think this is the way forward. Given that I found what I think is a serious bug in the case where pre > 0, I suspect you are right and we aren't hitting that case.

I will check that this passes my testing for the various dio cases (I have one modified xfstests case I haven't sent yet for the meanest version of the deadlock I have come up with so far) and the other tests that I saw races/flakiness on, but from a quick look, your branch looks correct to me. I believe the most non-obvious property my fix relies on is dio_data->ordered having the leftovers from the partial write after submission so that it can be cancelled, which your branch looks to maintain.
Assuming the tests pass, I do want to get this in sooner rather than later, since downstream is still waiting on a fix. Would you be willing to send your stack soon for my fix to land on top? I don't mind if you just send a patch series with my patches mixed in, either. If, OTOH, your patches are still a while out, or depend on something else that's underway, maybe we could land mine first and then gut it for your improvements. I'm fine with it either way.

Thanks,
Boris
On Thu, Mar 23, 2023 at 09:15:29AM -0700, Boris Burkov wrote:
> I will check that this passes my testing for the various dio cases [...]
> but from a quick look, your branch looks correct to me.

Your branch as-is does not pass the existing tests; it's missing a fix from my V5. We need to avoid splitting partial OEs when doing NOCOW dio writes, because iomap_begin() does not create a fresh pinned em in that case, since it reuses the existing extent. E.g.:

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8cb61f4daec0..bbc89a0872e7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7719,7 +7719,7 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
	 * remaining pages is blocked on the outstanding ordered extent.
	 */
-	if (iter->flags & IOMAP_WRITE) {
+	if (iter->flags & IOMAP_WRITE && !test_bit(BTRFS_ORDERED_NOCOW, &dio_data->ordered->flags)) {
		int err;

		err = btrfs_extract_ordered_extent(bbio, dio_data->ordered);

With that patch, I pass 10x of btrfs/250, so I am running the full suite next.
On Thu, Mar 23, 2023 at 10:00:06AM -0700, Boris Burkov wrote:
> With that patch, I pass 10x of btrfs/250, so running the full suite next.

fstests in general passed on my system, so I am happy with this branch plus my above tweak, if Naohiro/Johannes are on board with the simplified ordered_extent/extent_map splitting model that assumes the bio is at the start offset.
On Thu, Mar 23, 2023 at 10:00:06AM -0700, Boris Burkov wrote:
> Your branch as-is does not pass the existing tests; it's missing a fix
> from my V5. We need to avoid splitting partial OEs when doing NOCOW dio
> writes, because iomap_begin() does not create a fresh pinned em in that
> case, since it reuses the existing extent.

Oops, yes, that got lost. I can add this as another patch attributed to you.

That being said, I'm a bit confused about:

 1) if we need this split call at all for the non-zoned case, as we don't
    need to record a different physical disk address

 2) how we clean up this on-disk logical to physical mapping at all on a
    write failure

Maybe we should let those dragons sleep for now and just do the minimal fix, though.

I just woke up on an airplane, so depending on my jetlag I might have a new series ready with the minimal fix for varying definitions of "in a few hours".

> -	if (iter->flags & IOMAP_WRITE) {
> +	if (iter->flags & IOMAP_WRITE && !test_bit(BTRFS_ORDERED_NOCOW, &dio_data->ordered->flags)) {

I think the BTRFS_ORDERED_NOCOW test should be just around the split_extent_map call. That matches your series, and without that we wouldn't split the ordered_extent for nocow writes and would thus only fix the original problem for non-nocow writes.
On Thu, Mar 23, 2023 at 02:29:39PM -0700, Christoph Hellwig wrote:
> That being said, I'm a bit confused about:
>
> 1) if we need this split call at all for the non-zoned case as we don't
>    need to record a different physical disk address

I think I understand this, but maybe I'm missing exactly what you're asking. In finish_ordered_io, we call unpin_extent_cache, which blows up on em->start != oe->file_offset. I believe the rationale is that we are creating a new em which is PINNED when we allocate the extent in btrfs_new_extent_direct (via the call to btrfs_reserve_extent), so we need to unpin it and allow it to be merged, etc. For nocow, we don't allocate that new extent, so we don't need to split/unpin the existing extent_map, which we are just reusing.

> 2) how we clean up this on-disk logical to physical mapping at all on a
>    write failure

This I haven't thought much about, so I will leave it in the "dragons sleep for now" category.

> Maybe we should let those dragons sleep for now and just do the minimal
> fix, though.
>
> I just woke up on an airplane, so depending on my jetlag I might have a
> new series ready with the minimal fix for varying definitions of "in a
> few hours".

Great, that works for me. I just didn't want to wait weeks if you were blocked on other stuff.
> I think the BTRFS_ORDERED_NOCOW should be just around the
> split_extent_map call. That matches your series, and without that we
> wouldn't split the ordered_extent for nocow writes and thus only fix
> the original problem for non-nocow writes.

Oops, my bad. Good catch.
On Thu, Mar 23, 2023 at 03:43:36PM -0700, Boris Burkov wrote:
> In finish_ordered_io, we call unpin_extent_cache, which blows up on
> em->start != oe->file_offset. [...] For nocow, we don't allocate that
> new extent, so we don't need to split/unpin the existing extent_map
> which we are just reusing.

Yeah, I actually just ran into that when testing my idea :)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cf09c6271edb..b849ced40d37 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -653,7 +653,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 	if (use_append) {
 		bio->bi_opf &= ~REQ_OP_WRITE;
 		bio->bi_opf |= REQ_OP_ZONE_APPEND;
-		ret = btrfs_extract_ordered_extent(bbio);
+		ret = btrfs_extract_ordered_extent_bio(bbio, NULL, NULL, NULL);
 		if (ret)
 			goto fail_put_bio;
 	}
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9dc21622806e..e92a09559058 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -407,7 +407,10 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags,
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio);
+blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio,
+					      struct btrfs_ordered_extent *ordered,
+					      struct btrfs_ordered_extent **ret_pre,
+					      struct btrfs_ordered_extent **ret_post);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ab486f448eb..e30390051f15 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2514,10 +2514,14 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 /*
  * Split an extent_map at [start, start + len]
  *
- * This function is intended to be used only for extract_ordered_extent().
+ * This function is intended to be used only for
+ * btrfs_extract_ordered_extent_bio().
+ *
+ * It makes assumptions about the extent map that are only valid in the narrow
+ * situations in which we are extracting a bio from a containing ordered extent,
+ * that are specific to zoned filesystems or partial dio writes.
  */
-static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
-			  u64 pre, u64 post)
+static int split_em(struct btrfs_inode *inode, u64 start, u64 len, u64 pre, u64 post)
 {
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
@@ -2626,22 +2630,36 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 	return ret;
 }

-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
+/*
+ * Extract a bio from an ordered extent to enforce an invariant where the bio
+ * fully matches a single ordered extent.
+ *
+ * @bbio:     the bio to extract.
+ * @ordered:  the ordered extent the bio is in, will be shrunk to fit. If NULL
+ *            we will look it up.
+ * @ret_pre:  out parameter to return the new oe in front of the bio, if needed.
+ * @ret_post: out parameter to return the new oe past the bio, if needed.
+ */
+blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio,
+					      struct btrfs_ordered_extent *ordered,
+					      struct btrfs_ordered_extent **ret_pre,
+					      struct btrfs_ordered_extent **ret_post)
 {
 	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 len = bbio->bio.bi_iter.bi_size;
 	struct btrfs_inode *inode = bbio->inode;
-	struct btrfs_ordered_extent *ordered;
 	u64 file_len;
 	u64 end = start + len;
 	u64 ordered_end;
 	u64 pre, post;
 	int ret = 0;

-	ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
+	if (!ordered)
+		ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset);
 	if (WARN_ON_ONCE(!ordered))
 		return BLK_STS_IOERR;

+	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
 	/* No need to split */
 	if (ordered->disk_num_bytes == len)
 		goto out;
@@ -2658,7 +2676,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 		goto out;
 	}

-	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
 	/* bio must be in one ordered extent */
 	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
 		ret = -EINVAL;
@@ -2675,10 +2692,11 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
 	pre = start - ordered->disk_bytenr;
 	post = ordered_end - end;

-	ret = btrfs_split_ordered_extent(ordered, pre, post);
+	ret = btrfs_split_ordered_extent(ordered, pre, post, ret_pre, ret_post);
 	if (ret)
 		goto out;
-	ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post);
+	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		ret = split_em(inode, bbio->file_offset, file_len, pre, post);

 out:
 	btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1848d0d1a9c4..4bebebb9b434 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1117,8 +1117,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 }

-static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
-				u64 len)
+static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered,
+							 u64 pos, u64 len)
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
@@ -1133,18 +1133,22 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -len,
 				 fs_info->delalloc_batch);
 	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
-	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-					disk_bytenr, len, 0, flags,
-					ordered->compress_type);
+	return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					  disk_bytenr, len, 0, flags,
+					  ordered->compress_type);
 }

-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-			       u64 post)
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
+			       u64 pre, u64 post,
+			       struct btrfs_ordered_extent **ret_pre,
+			       struct btrfs_ordered_extent **ret_post)
+
 {
 	struct inode *inode = ordered->inode;
 	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
 	struct rb_node *node;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ordered_extent *oe;
 	int ret = 0;

 	trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
@@ -1172,12 +1176,18 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 	spin_unlock_irq(&tree->lock);

-	if (pre)
-		ret = clone_ordered_extent(ordered, 0, pre);
-	if (ret == 0 && post)
-		ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes,
-					   post);
-
+	if (pre) {
+		oe = clone_ordered_extent(ordered, 0, pre);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_pre)
+			*ret_pre = oe;
+	}
+	if (!ret && post) {
+		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_post)
+			*ret_post = oe;
+	}
 	return ret;
 }

diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 18007f9c00ad..933f6f0d8c10 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -212,8 +212,10 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 				struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-			       u64 post);
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
+			       u64 pre, u64 post,
+			       struct btrfs_ordered_extent **ret_pre,
+			       struct btrfs_ordered_extent **ret_post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
When extracting a bio from its ordered extent for dio partial writes, we need the "remainder" ordered extent. It would be possible to look it up in that case, but since we can grab the ordered_extent from the new allocation function, we might as well wire it up to be returned to the caller via an out parameter and save that lookup.

Refactor the clone ordered extent function to return the new ordered extent, then refactor the split and extract functions to pass back the new pre and post split ordered extents via output parameters.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/bio.c          |  2 +-
 fs/btrfs/btrfs_inode.h  |  5 ++++-
 fs/btrfs/inode.c        | 36 +++++++++++++++++++++++++++---------
 fs/btrfs/ordered-data.c | 36 +++++++++++++++++++++++------------
 fs/btrfs/ordered-data.h |  6 ++++--
 5 files changed, 59 insertions(+), 26 deletions(-)