Message ID | 20240610144229.26373-1-jth@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] btrfs: scrub: don't call calc_sector_number on failed bios |
On Mon, Jun 10, 2024 at 04:42:29PM +0200, Johannes Thumshirn wrote:
> When running fstests' test-case btrfs/060 with MKFS_OPTIONS="-O rst"
> to force the creation of a RAID stripe tree even on regular devices I
> hit the following ASSERT() in calc_sector_number():
>
>     ASSERT(i < stripe->nr_sectors);
>
> This ASSERT() is triggered because we cannot find the page of the
> passed-in bio_vec in the scrub_stripe's pages[] array.
>
> When taking a closer look at the stripe using drgn on the vmcore dump
> and comparing the stripe's pages to the bio_vec's pages, it is evident
> that we cannot find it, as first_bvec's bv_page is NULL:
>
>     >>> stripe = t.stack_trace()[13]['stripe']
>     >>> print(stripe.pages)
>     (struct page *[16]){ 0xffffea000418b280, 0xffffea00043051c0,
>     0xffffea000430ed00, 0xffffea00040fcc00, 0xffffea000441fc80,
>     0xffffea00040fc980, 0xffffea00040fc9c0, 0xffffea00040fc940,
>     0xffffea0004223040, 0xffffea00043a1940, 0xffffea00040fea80,
>     0xffffea00040a5740, 0xffffea0004490f40, 0xffffea00040f7dc0,
>     0xffffea00044985c0, 0xffffea00040f7d80 }
>     >>> bio = t.stack_trace()[12]['bbio'].bio
>     >>> print(bio.bi_io_vec)
>     *(struct bio_vec *)0xffff88810632bc00 = {
>             .bv_page = (struct page *)0x0,
>             .bv_len = (unsigned int)0,
>             .bv_offset = (unsigned int)0,
>     }
>
> Upon closer inspection of the bio itself we see that bio->bi_status is
> 10, which corresponds to BLK_STS_IOERR:
>
>     >>> print(bio)
>     (struct bio){
>             .bi_next = (struct bio *)0x0,
>             .bi_bdev = (struct block_device *)0x0,
>             .bi_opf = (blk_opf_t)0,
>             .bi_flags = (unsigned short)0,
>             .bi_ioprio = (unsigned short)0,
>             .bi_write_hint = (enum rw_hint)WRITE_LIFE_NOT_SET,
>             .bi_status = (blk_status_t)10,
>             .__bi_remaining = (atomic_t){
>                     .counter = (int)1,
>             },
>             .bi_iter = (struct bvec_iter){
>                     .bi_sector = (sector_t)2173056,
>                     .bi_size = (unsigned int)0,
>                     .bi_idx = (unsigned int)0,
>                     .bi_bvec_done = (unsigned int)0,
>             },
>             .bi_cookie = (blk_qc_t)4294967295,
>             .__bi_nr_segments = (unsigned int)4294967295,
>             .bi_end_io = (bio_end_io_t *)0x0,
>             .bi_private = (void *)0x0,
>             .bi_integrity = (struct bio_integrity_payload *)0x0,
>             .bi_vcnt = (unsigned short)0,
>             .bi_max_vecs = (unsigned short)16,
>             .__bi_cnt = (atomic_t){
>                     .counter = (int)1,
>             },
>             .bi_io_vec = (struct bio_vec *)0xffff88810632bc00,
>             .bi_pool = (struct bio_set *)btrfs_bioset+0x0 = 0xffffffff82c85620,
>             .bi_inline_vecs = (struct bio_vec []){},
>     }
>
> So only call calc_sector_number() when we know the bio completed
> without error.
>
> Cc: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
On 2024/6/11 00:12, Johannes Thumshirn wrote:
> When running fstests' test-case btrfs/060 with MKFS_OPTIONS="-O rst"
> to force the creation of a RAID stripe tree even on regular devices I
> hit the following ASSERT() in calc_sector_number():
>
> [...]
>
>     >>> bio = t.stack_trace()[12]['bbio'].bio
>     >>> print(bio.bi_io_vec)
>     *(struct bio_vec *)0xffff88810632bc00 = {
>             .bv_page = (struct page *)0x0,
>             .bv_len = (unsigned int)0,
>             .bv_offset = (unsigned int)0,
>     }

I'm more interested in why we got a bio_vec with all zeros.

Especially since, if the bv_len is 0, we won't update the error bitmap
at all.

So it's not as simple as ignoring the bio if the IO failed.

To me it looks more like at some stage (the RST layer?) the bio is
reset/modified unexpectedly.

Thanks,
Qu

> [...]
On 10.06.24 23:24, Qu Wenruo wrote:
> On 2024/6/11 00:12, Johannes Thumshirn wrote:
>> When running fstests' test-case btrfs/060 with MKFS_OPTIONS="-O rst"
>> to force the creation of a RAID stripe tree even on regular devices I
>> hit the following ASSERT() in calc_sector_number():
>>
>> [...]
>
> I'm more interested in why we got a bio_vec with all zeros.

Yes, me too, hence the RFC. TBH I was hoping you had an idea :D

> Especially since, if the bv_len is 0, we won't update the error bitmap
> at all.
>
> So it's not as simple as ignoring the bio if the IO failed.
>
> To me it looks more like at some stage (the RST layer?) the bio is
> reset/modified unexpectedly.

Most probably. But btrfs_submit_bio() will only split a bio if it's
needed (due to e.g. an RST stripe crossing, etc.).
On 2024/6/11 15:56, Johannes Thumshirn wrote:
> On 10.06.24 23:24, Qu Wenruo wrote:
>> On 2024/6/11 00:12, Johannes Thumshirn wrote:
> [...]
>> I'm more interested in why we got a bio_vec with all zeros.
>
> Yes, me too, hence the RFC. TBH I was hoping you had an idea :D
>
>> Especially since, if the bv_len is 0, we won't update the error bitmap
>> at all.
>>
>> So it's not as simple as ignoring the bio if the IO failed.
>>
>> To me it looks more like at some stage (the RST layer?) the bio is
>> reset/modified unexpectedly.
>
> Most probably. But btrfs_submit_bio() will only split a bio if it's
> needed (due to e.g. an RST stripe crossing, etc.).

I guess I got the direct cause, but not yet the root cause.

[DIRECT CAUSE]
It turns out it's not the RST layer, but the RST handling inside scrub
that causes the empty bbio.

In `scrub_submit_extent_sector_read()`, every time we allocate a new
bbio we call `btrfs_map_block()` to do a scrub-specific map lookup.

But if that `btrfs_map_block()` fails, the bbio is still empty, yet we
call `btrfs_bio_end_io()` on it, resulting in exactly this problem.

The problem only happens here because there is no other error path
inside scrub that can lead to such an empty bbio.
All other paths go through `btrfs_alloc_bio()` and then immediately add
pages to it (either the whole stripe or a single sector).

[PROPER FIX]
So for the fix, I'd prefer to do a zero-length check first.
If the bio has zero length, we do not need to get the sector number and
can error out immediately.
If it's not zero length, go through the regular routine to mark the
error bitmaps.

Meanwhile the RFC fix has the side effect that, if a non-RST scrub read
bbio ends with an error, we could wrongly update the error bitmap
(since sector_nr is set to 0).

[UNCERTAIN ROOT CAUSE]
Then let's talk about the ugly part: why did `btrfs_map_block()` fail?
The dump shows there is no such RST entry for the target bytenr.
Meanwhile in `fill_one_extent_info()` we fill `extent_sector_bitmap`
correctly using the commit extent root.

This means that somehow the extent commit root is out of sync with the
RST commit root.
My current guess is that between the extent commit root search and the
RST commit root search, a transaction commit happened.

This is not a problem for the non-RST case because we only do the
extent root search once, and in one go, so there is no transaction
commit happening in between.

But for the RST routine, the RST lookup happens much later than
`scrub_find_fill_first_stripe()`, so a transaction could have been
committed in between, updating the RST.

My current plan is to do the RST lookup inside
`scrub_find_fill_first_stripe()` too, so everything happens within the
same transaction.
But I'll need to do extra debugging to confirm this guess.

Thanks,
Qu
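For illustration, the zero-length-first ordering Qu proposes could look roughly like the following user-space mock. The struct definitions are simplified stand-ins, not the kernel's real `struct bio` layout, and the helper name is made up for this sketch; the point is only the ordering: sum the bvec lengths before any sector-number lookup.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures (not the real layouts). */
struct bio_vec {
	void *bv_page;
	unsigned int bv_len;
};

struct bio {
	int bi_status;                /* 0 == BLK_STS_OK */
	unsigned short bi_vcnt;
	struct bio_vec *bi_io_vec;
};

/*
 * Sum the bvec lengths before any sector-number lookup.  A total of 0
 * means the bbio was never filled (e.g. btrfs_map_block() failed before
 * any page was added), so error out immediately instead of calling a
 * calc_sector_number()-style lookup on a bio whose first bv_page is
 * NULL.  Returns -1 for the empty-bio error path, else the size in
 * bytes.
 */
static int scrub_bio_size_or_error(const struct bio *bio)
{
	unsigned int bio_size = 0;
	unsigned short i;

	for (i = 0; i < bio->bi_vcnt; i++)
		bio_size += bio->bi_io_vec[i].bv_len;

	if (bio_size == 0)
		return -1;    /* empty bbio: handle as hard error */
	return (int)bio_size;
}
```

Note that the failing bbio from the dump (bi_vcnt == 0, all-zero bvec) would take the error branch here regardless of bi_status, which avoids the side effect Qu points out in the RFC version.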
On 2024/6/17 07:44, Qu Wenruo wrote:
> On 2024/6/11 15:56, Johannes Thumshirn wrote:
> [...]
>
> [UNCERTAIN ROOT CAUSE]
> Then let's talk about the ugly part: why did `btrfs_map_block()` fail?
> The dump shows there is no such RST entry for the target bytenr.
> Meanwhile in `fill_one_extent_info()` we fill `extent_sector_bitmap`
> correctly using the commit extent root.
>
> This means that somehow the extent commit root is out of sync with the
> RST commit root.
> My current guess is that between the extent commit root search and the
> RST commit root search, a transaction commit happened.

My bad, this is impossible: during a transaction commit, scrub is
intentionally paused exactly to avoid such a commit-root mismatch.

So it is something different.

Thanks,
Qu

> [...]
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 188a9c42c9eb..91590dc509de 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1099,12 +1099,17 @@ static void scrub_read_endio(struct btrfs_bio *bbio)
 {
 	struct scrub_stripe *stripe = bbio->private;
 	struct bio_vec *bvec;
-	int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio));
+	int sector_nr = 0;
 	int num_sectors;
 	u32 bio_size = 0;
 	int i;
 
-	ASSERT(sector_nr < stripe->nr_sectors);
+	if (bbio->bio.bi_status == BLK_STS_OK) {
+		sector_nr = calc_sector_number(stripe,
+					       bio_first_bvec_all(&bbio->bio));
+		ASSERT(sector_nr < stripe->nr_sectors);
+	}
+
 	bio_for_each_bvec_all(bvec, &bbio->bio, i)
 		bio_size += bvec->bv_len;
 	num_sectors = bio_size >> stripe->bg->fs_info->sectorsize_bits;
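The control flow the RFC patch introduces can be mocked in user space roughly as follows. The types are simplified stand-ins, and the sector lookup is faked with a constant, since the real `calc_sector_number()` needs the stripe's page array; the sketch only shows that a failed bio skips the lookup and leaves `sector_nr` at 0.

```c
#include <assert.h>

/* Simplified stand-ins; not the kernel's real struct layouts. */
struct bio_vec { unsigned int bv_len; };

struct bio {
	int bi_status;                 /* 0 == BLK_STS_OK, 10 == BLK_STS_IOERR */
	unsigned short bi_vcnt;
	struct bio_vec *bi_io_vec;
};

enum { SECTORSIZE_BITS = 12 };         /* 4K sectors, as an example */

/*
 * Mirrors the patched flow: skip the first-bvec sector lookup for
 * failed bios (sector_nr stays 0), then derive num_sectors from the
 * summed bvec lengths.  The lookup itself is faked as a constant here.
 */
static void mock_scrub_read_endio(const struct bio *bio,
				  int *sector_nr, int *num_sectors)
{
	unsigned int bio_size = 0;
	unsigned short i;

	*sector_nr = 0;
	if (bio->bi_status == 0)
		*sector_nr = 3;        /* stand-in for calc_sector_number() */

	for (i = 0; i < bio->bi_vcnt; i++)
		bio_size += bio->bi_io_vec[i].bv_len;
	*num_sectors = bio_size >> SECTORSIZE_BITS;
}
```

For a failed, empty bio both `sector_nr` and `num_sectors` end up 0, which is exactly the behavior Qu flags in the thread: no error-bitmap range would be marked, and a failed but non-empty non-RST bio would have its errors attributed starting at sector 0.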