diff mbox series

[md-6.13,5/5] md/md-bitmap: move bitmap_{start, end}write to md upper layer

Message ID 20241118114157.355749-6-yukuai1@huaweicloud.com (mailing list archive)
State New
Headers show
Series md/md-bitmap: move bitmap_{start,end}write to md upper layer | expand

Checks

Context Check Description
mdraidci/vmtest-md-6_13-PR success PR summary
mdraidci/vmtest-md-6_13-VM_Test-0 success Logs for per-patch-testing

Commit Message

Yu Kuai Nov. 18, 2024, 11:41 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

There are two BUG reports that raid5 will hang at
bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
write is unbalanced, and while reviewing raid5 code, it's found that
bitmap operations can be optimized. For example, for a 4 disks raid5, with
chunksize=8k, if user issue a IO (0 + 48k) to the array:

┌────────────────────────────────────────────────────────────┐
│chunk 0                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┼
│  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘

Before this patch, 4 stripe head will be used, and each sh will attach
bio for 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means total 12 times.
 - 3 times (0 + 4k), for (A0, A1 and A2)
 - 3 times (4 + 4k), for (B0, B1 and B2)
 - 3 times (8 + 4k), for (C0, C1 and C3)
 - 3 times (12 + 4k), for (D0, D1 and D3)

After this patch, md upper layer will calculate that IO range (0 + 48k)
is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite()
just once.

Noted that this patch will align bitmap ranges to the chunks, for example,
if user issue a IO (0 + 4k) to array:

- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k) for chunk 0;

Usually, one bitmap bit will represent more than one disk chunk, and this
doesn't have any difference. And even if user really created a array
that one chunk contain multiple bits, the overhead is that more data
will be recovered after power failure.

[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c          | 29 +++++++++++++++++++++++++++++
 drivers/md/md.h          |  2 ++
 drivers/md/raid1.c       |  4 ----
 drivers/md/raid10.c      |  3 ---
 drivers/md/raid5-cache.c |  2 --
 drivers/md/raid5.c       | 24 +-----------------------
 6 files changed, 32 insertions(+), 32 deletions(-)

Comments

Jinpu Wang Nov. 19, 2024, 9:49 a.m. UTC | #1
Hi Kuai,

Thanks for the patchset, as you mentioned both both hung problem regarding raid5
bitmap, just want to get confirmation, will this patchset fix the hung or just a
performance improvement/code cleanup?


In patch4, as you removed the set/clear bit STRIPE_BITMAP_PENDING, I think you
should also remove the test_bit line in stripe_can_batch, also the definition 
enum in raif5.h and the few lines in comments in __add_stripe_bio, same for the
line in break_stripe_batch_list.


Regards!
Jinpu Wang @ IONOS
Yu Kuai Nov. 19, 2024, 10:58 a.m. UTC | #2
Hi,

在 2024/11/19 17:49, Jack Wang 写道:
> Hi Kuai,
> 
> Thanks for the patchset, as you mentioned both both hung problem regarding raid5
> bitmap, just want to get confirmation, will this patchset fix the hung or just a
> performance improvement/code cleanup?

I'm hoping both. :)

After review, I'll CC related folks and see if they can comfirm this
will fix the hang problem.
> 
> 
> In patch4, as you removed the set/clear bit STRIPE_BITMAP_PENDING, I think you
> should also remove the test_bit line in stripe_can_batch, also the definition
> enum in raif5.h and the few lines in comments in __add_stripe_bio, same for the
> line in break_stripe_batch_list.

Yes, and I assume you mean patch 5.

Thanks,
Kuai

> 
> 
> Regards!
> Jinpu Wang @ IONOS
> 
> 
> .
>
Jinpu Wang Nov. 19, 2024, 3:29 p.m. UTC | #3
Hi Kuai,

We will test on our side and report back.

Yes, I meant patch5.

Regards!
Jinpu Wang @ IONOS
Jinpu Wang Nov. 21, 2024, 8:10 a.m. UTC | #4
On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>
> Hi Kuai,
>
> We will test on our side and report back.
Hi Kuai,

Haris tested the new patchset, and it works fine.
Thanks for the work.
>
> Yes, I meant patch5.
>
> Regards!
> Jinpu Wang @ IONOS
>
Yu Kuai Nov. 21, 2024, 8:33 a.m. UTC | #5
Hi,

在 2024/11/21 16:10, Jinpu Wang 写道:
> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>>
>> Hi Kuai,
>>
>> We will test on our side and report back.
> Hi Kuai,
> 
> Haris tested the new patchset, and it works fine.
> Thanks for the work.

Thanks for the test! And just to be sure, the BUG_ON() problem in the
other thread is not triggered as well, right?

+CC Christian

Are you able to test this set for lastest kernel?

Thanks,
Kuai

>>
>> Yes, I meant patch5.
>>
>> Regards!
>> Jinpu Wang @ IONOS
>>
> 
> .
>
Jinpu Wang Nov. 21, 2024, 8:40 a.m. UTC | #6
Hi
On Thu, Nov 21, 2024 at 9:33 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/21 16:10, Jinpu Wang 写道:
> > On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
> >>
> >> Hi Kuai,
> >>
> >> We will test on our side and report back.
> > Hi Kuai,
> >
> > Haris tested the new patchset, and it works fine.
> > Thanks for the work.
>
> Thanks for the test! And just to be sure, the BUG_ON() problem in the
> other thread is not triggered as well, right?
Yes, we tested the patchset on top of md/for-6.13 branch, no hang, no
BUG_ON, it was running fine
>
> +CC Christian
>
> Are you able to test this set for lastest kernel?
see above.
>
> Thanks,
> Kuai
Thx!
>
> >>
> >> Yes, I meant patch5.
> >>
> >> Regards!
> >> Jinpu Wang @ IONOS
> >>
> >
> > .
> >
>
Yu Kuai Nov. 21, 2024, 8:44 a.m. UTC | #7
Hi,

在 2024/11/21 16:40, Jinpu Wang 写道:
> Hi
> On Thu, Nov 21, 2024 at 9:33 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/11/21 16:10, Jinpu Wang 写道:
>>> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>>>>
>>>> Hi Kuai,
>>>>
>>>> We will test on our side and report back.
>>> Hi Kuai,
>>>
>>> Haris tested the new patchset, and it works fine.
>>> Thanks for the work.
>>
>> Thanks for the test! And just to be sure, the BUG_ON() problem in the
>> other thread is not triggered as well, right?
> Yes, we tested the patchset on top of md/for-6.13 branch, no hang, no
> BUG_ON, it was running fine
>>
>> +CC Christian
>>
>> Are you able to test this set for lastest kernel?
> see above.

Are you guys the same team with Christian? Because there is another
thread that he reported the same problem.

Thanks,
Kuai

>>
>> Thanks,
>> Kuai
> Thx!
>>
>>>>
>>>> Yes, I meant patch5.
>>>>
>>>> Regards!
>>>> Jinpu Wang @ IONOS
>>>>
>>>
>>> .
>>>
>>
> 
> .
>
Jinpu Wang Nov. 21, 2024, 8:46 a.m. UTC | #8
On Thu, Nov 21, 2024 at 9:44 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/21 16:40, Jinpu Wang 写道:
> > Hi
> > On Thu, Nov 21, 2024 at 9:33 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2024/11/21 16:10, Jinpu Wang 写道:
> >>> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
> >>>>
> >>>> Hi Kuai,
> >>>>
> >>>> We will test on our side and report back.
> >>> Hi Kuai,
> >>>
> >>> Haris tested the new patchset, and it works fine.
> >>> Thanks for the work.
> >>
> >> Thanks for the test! And just to be sure, the BUG_ON() problem in the
> >> other thread is not triggered as well, right?
> > Yes, we tested the patchset on top of md/for-6.13 branch, no hang, no
> > BUG_ON, it was running fine
> >>
> >> +CC Christian
> >>
> >> Are you able to test this set for lastest kernel?
> > see above.
>
> Are you guys the same team with Christian? Because there is another
> thread that he reported the same problem.
No, we are not the same team with Christian, I guess he will test on his side.
>
> Thanks,
> Kuai
>
> >>
> >> Thanks,
> >> Kuai
> > Thx!
> >>
> >>>>
> >>>> Yes, I meant patch5.
> >>>>
> >>>> Regards!
> >>>> Jinpu Wang @ IONOS
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
Christian Theune Nov. 21, 2024, 9:30 a.m. UTC | #9
Hi,

> On 21. Nov 2024, at 09:33, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> 在 2024/11/21 16:10, Jinpu Wang 写道:
>> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>>> 
>>> Hi Kuai,
>>> 
>>> We will test on our side and report back.
>> Hi Kuai,
>> Haris tested the new patchset, and it works fine.
>> Thanks for the work.
> 
> Thanks for the test! And just to be sure, the BUG_ON() problem in the
> other thread is not triggered as well, right?
> 
> +CC Christian
> 
> Are you able to test this set for lastest kernel?

I have scheduled testing for later today. My current plan was to try Xiao Ni’s fix on 6.6 as that did fix it on 6.11 for me.

Which way forward makes more sense now? Are those two patches independent or amalgamated or might they be stepping on each others’ toes?

Christian
Yu Kuai Nov. 21, 2024, 11:01 a.m. UTC | #10
Hi,

在 2024/11/21 17:30, Christian Theune 写道:
> Hi,
> 
>> On 21. Nov 2024, at 09:33, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/11/21 16:10, Jinpu Wang 写道:
>>> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>>>>
>>>> Hi Kuai,
>>>>
>>>> We will test on our side and report back.
>>> Hi Kuai,
>>> Haris tested the new patchset, and it works fine.
>>> Thanks for the work.
>>
>> Thanks for the test! And just to be sure, the BUG_ON() problem in the
>> other thread is not triggered as well, right?
>>
>> +CC Christian
>>
>> Are you able to test this set for lastest kernel?
> 
> I have scheduled testing for later today. My current plan was to try Xiao Ni’s fix on 6.6 as that did fix it on 6.11 for me.
> 
> Which way forward makes more sense now? Are those two patches independent or amalgamated or might they be stepping on each others’ toes?

Our plan is to apply this set to latest kernel first, and then backport
this to older kernel. Xiao's fix will not be considered. :(

Thanks,
Kuai

> 
> Christian
>
Christian Theune Nov. 21, 2024, 12:05 p.m. UTC | #11
Hi,

ok, good thing I asked, then … :)

I’ll try this out later today.

Christian

> On 21. Nov 2024, at 12:01, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> 在 2024/11/21 17:30, Christian Theune 写道:
>> Hi,
>>> On 21. Nov 2024, at 09:33, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>> 
>>> Hi,
>>> 
>>> 在 2024/11/21 16:10, Jinpu Wang 写道:
>>>> On Tue, Nov 19, 2024 at 4:29 PM Jack Wang <jinpu.wang@ionos.com> wrote:
>>>>> 
>>>>> Hi Kuai,
>>>>> 
>>>>> We will test on our side and report back.
>>>> Hi Kuai,
>>>> Haris tested the new patchset, and it works fine.
>>>> Thanks for the work.
>>> 
>>> Thanks for the test! And just to be sure, the BUG_ON() problem in the
>>> other thread is not triggered as well, right?
>>> 
>>> +CC Christian
>>> 
>>> Are you able to test this set for lastest kernel?
>> I have scheduled testing for later today. My current plan was to try Xiao Ni’s fix on 6.6 as that did fix it on 6.11 for me.
>> Which way forward makes more sense now? Are those two patches independent or amalgamated or might they be stepping on each others’ toes?
> 
> Our plan is to apply this set to latest kernel first, and then backport
> this to older kernel. Xiao's fix will not be considered. :(
> 
> Thanks,
> Kuai
> 
>> Christian


Liebe Grüße,
Christian Theune
Christian Theune Nov. 21, 2024, 12:21 p.m. UTC | #12
Hi,

> On 21. Nov 2024, at 09:33, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Thanks for the test! And just to be sure, the BUG_ON() problem in the
> other thread is not triggered as well, right?
> 
> +CC Christian
> 
> Are you able to test this set for lastest kernel?

Reading through the list - I’m confused. You posted a patchset on 2024-11-18@12:48. There was discussion later around an adjustment but I’m not seeing the adjusted patchset.

Also, which kernel do you want me to test on? 6.12?

Christian
Yu Kuai Nov. 21, 2024, 12:37 p.m. UTC | #13
Hi,

在 2024/11/21 20:21, Christian Theune 写道:
> Hi,
> 
>> On 21. Nov 2024, at 09:33, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Thanks for the test! And just to be sure, the BUG_ON() problem in the
>> other thread is not triggered as well, right?
>>
>> +CC Christian
>>
>> Are you able to test this set for lastest kernel?
> 
> Reading through the list - I’m confused. You posted a patchset on 2024-11-18@12:48. There was discussion later around an adjustment but I’m not seeing the adjusted patchset.

I don't know you mean here, this is the only set. If you didn't
subscribe mail-list and don't know how to find the set:

https://lore.kernel.org/all/20241118114157.355749-1-yukuai1@huaweicloud.com/

> 
> Also, which kernel do you want me to test on? 6.12?

The branch is from the title, md-6.13. If you don't know:

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.13

Thanks,
Kuai

> 
> Christian
>
Christian Theune Nov. 21, 2024, 4:14 p.m. UTC | #14
Hi,

> On 21. Nov 2024, at 13:37, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> 
> I don't know you mean here, this is the only set. If you didn't
> subscribe mail-list and don't know how to find the set:
> 
> https://lore.kernel.org/all/20241118114157.355749-1-yukuai1@huaweicloud.com/

Yes, I saw those and was talking about https://lore.kernel.org/all/adf796b9-2443-d29a-f4ac-fb9b8a657f93@huaweicloud.com/

>> Also, which kernel do you want me to test on? 6.12?
> 
> The branch is from the title, md-6.13. If you don't know:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.13

Ok, I can use that. Had to bail out for today and will try again tomorrow.

Christian
Xiao Ni Nov. 22, 2024, 2:06 a.m. UTC | #15
On Mon, Nov 18, 2024 at 7:44 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> There are two BUG reports that raid5 will hang at
> bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
> write is unbalanced, and while reviewing raid5 code, it's found that

Hi Kuai

It's better to describe more about "unbalanced" in the patch. For
raid5, bitmap is set and cleared based on stripe->dev[] now. It looks
like the set operation matches the clear operation already.

> bitmap operations can be optimized. For example, for a 4 disks raid5, with
> chunksize=8k, if user issue a IO (0 + 48k) to the array:
>
> ┌────────────────────────────────────────────────────────────┐
> │chunk 0                                                     │
> │      ┌────────────┬─────────────┬─────────────┬────────────┼
> │  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
> │      ┼────────────┼─────────────┼─────────────┼────────────┼
> │  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
> ┼──────┴────────────┴─────────────┴─────────────┴────────────┼
> │chunk 1                                                     │
> │      ┌────────────┬─────────────┬─────────────┬────────────┤
> │  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
> │      ┼────────────┼─────────────┼─────────────┼────────────┼
> │  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
> └──────┴────────────┴─────────────┴─────────────┴────────────┘
>
> Before this patch, 4 stripe head will be used, and each sh will attach
> bio for 3 disks, and each attached bio will trigger
> bitmap_startwrite() once, which means total 12 times.
>  - 3 times (0 + 4k), for (A0, A1 and A2)
>  - 3 times (4 + 4k), for (B0, B1 and B2)
>  - 3 times (8 + 4k), for (C0, C1 and C3)
>  - 3 times (12 + 4k), for (D0, D1 and D3)
>
> After this patch, md upper layer will calculate that IO range (0 + 48k)
> is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite()
> just once.
>
> Noted that this patch will align bitmap ranges to the chunks, for example,
> if user issue a IO (0 + 4k) to array:
>
> - Before this patch, 1 time (0 + 4k), for A0;
> - After this patch, 1 time (0 + 8k) for chunk 0;
>
> Usually, one bitmap bit will represent more than one disk chunk, and this
> doesn't have any difference. And even if user really created a array
> that one chunk contain multiple bits, the overhead is that more data
> will be recovered after power failure.
>
> [1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
> [2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c          | 29 +++++++++++++++++++++++++++++
>  drivers/md/md.h          |  2 ++
>  drivers/md/raid1.c       |  4 ----
>  drivers/md/raid10.c      |  3 ---
>  drivers/md/raid5-cache.c |  2 --
>  drivers/md/raid5.c       | 24 +-----------------------
>  6 files changed, 32 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index bbe002ebd584..ab37c9939ac6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8723,12 +8723,32 @@ void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>  }
>  EXPORT_SYMBOL_GPL(md_submit_discard_bio);
>
> +static void md_bitmap_start(struct mddev *mddev,
> +                           struct md_io_clone *md_io_clone)
> +{
> +       if (mddev->pers->bitmap_sector)
> +               mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
> +                                          &md_io_clone->sectors);
> +
> +       mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
> +                                     md_io_clone->sectors);
> +}
> +
> +static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
> +{
> +       mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
> +                                   md_io_clone->sectors);
> +}
> +
>  static void md_end_clone_io(struct bio *bio)
>  {
>         struct md_io_clone *md_io_clone = bio->bi_private;
>         struct bio *orig_bio = md_io_clone->orig_bio;
>         struct mddev *mddev = md_io_clone->mddev;
>
> +       if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap)
> +               md_bitmap_end(mddev, md_io_clone);
> +
>         if (bio->bi_status && !orig_bio->bi_status)
>                 orig_bio->bi_status = bio->bi_status;
>
> @@ -8753,6 +8773,12 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
>         if (blk_queue_io_stat(bdev->bd_disk->queue))
>                 md_io_clone->start_time = bio_start_io_acct(*bio);
>
> +       if (bio_data_dir(*bio) == WRITE && mddev->bitmap) {
> +               md_io_clone->offset = (*bio)->bi_iter.bi_sector;
> +               md_io_clone->sectors = bio_sectors(*bio);
> +               md_bitmap_start(mddev, md_io_clone);
> +       }
> +
>         clone->bi_end_io = md_end_clone_io;
>         clone->bi_private = md_io_clone;
>         *bio = clone;
> @@ -8771,6 +8797,9 @@ void md_free_cloned_bio(struct bio *bio)
>         struct bio *orig_bio = md_io_clone->orig_bio;
>         struct mddev *mddev = md_io_clone->mddev;
>
> +       if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap)
> +               md_bitmap_end(mddev, md_io_clone);
> +
>         if (bio->bi_status && !orig_bio->bi_status)
>                 orig_bio->bi_status = bio->bi_status;
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index de6dadb9a40b..def808064ad8 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -831,6 +831,8 @@ struct md_io_clone {
>         struct mddev    *mddev;
>         struct bio      *orig_bio;
>         unsigned long   start_time;
> +       sector_t        offset;
> +       unsigned long   sectors;
>         struct bio      bio_clone;
>  };
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index b9819f9c8ed2..e2adfeff5ae6 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -422,8 +422,6 @@ static void close_write(struct r1bio *r1_bio)
>
>         if (test_bit(R1BIO_BehindIO, &r1_bio->state))
>                 mddev->bitmap_ops->end_behind_write(mddev);
> -       /* clear the bitmap if all writes complete successfully */
> -       mddev->bitmap_ops->endwrite(mddev, r1_bio->sector, r1_bio->sectors);
>         md_write_end(mddev);
>  }
>
> @@ -1616,8 +1614,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>
>                         if (test_bit(R1BIO_BehindIO, &r1_bio->state))
>                                 mddev->bitmap_ops->start_behind_write(mddev);
> -                       mddev->bitmap_ops->startwrite(mddev, r1_bio->sector,
> -                                                     r1_bio->sectors);
>                         first_clone = 0;
>                 }
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index efbc0bd92479..79a209236c9f 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -428,8 +428,6 @@ static void close_write(struct r10bio *r10_bio)
>  {
>         struct mddev *mddev = r10_bio->mddev;
>
> -       /* clear the bitmap if all writes complete successfully */
> -       mddev->bitmap_ops->endwrite(mddev, r10_bio->sector, r10_bio->sectors);
>         md_write_end(mddev);
>  }
>
> @@ -1487,7 +1485,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
>         md_account_bio(mddev, &bio);
>         r10_bio->master_bio = bio;
>         atomic_set(&r10_bio->remaining, 1);
> -       mddev->bitmap_ops->startwrite(mddev, r10_bio->sector, r10_bio->sectors);
>
>         for (i = 0; i < conf->copies; i++) {
>                 if (r10_bio->devs[i].bio)
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index ba4f9577c737..011246e16a99 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -313,8 +313,6 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
>                 if (sh->dev[i].written) {
>                         set_bit(R5_UPTODATE, &sh->dev[i].flags);
>                         r5c_return_dev_pending_writes(conf, &sh->dev[i]);
> -                       conf->mddev->bitmap_ops->endwrite(conf->mddev,
> -                                       sh->sector, RAID5_STRIPE_SECTORS(conf));
>                 }
>         }
>  }
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 95caed41654c..976788138a6e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3578,12 +3578,6 @@ static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
>                  * is added to a batch, STRIPE_BIT_DELAY cannot be changed
>                  * any more.
>                  */
> -               set_bit(STRIPE_BITMAP_PENDING, &sh->state);
> -               spin_unlock_irq(&sh->stripe_lock);
> -               conf->mddev->bitmap_ops->startwrite(conf->mddev, sh->sector,
> -                                       RAID5_STRIPE_SECTORS(conf));
> -               spin_lock_irq(&sh->stripe_lock);
> -               clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
>                 if (!sh->batch_head) {
>                         sh->bm_seq = conf->seq_flush+1;
>                         set_bit(STRIPE_BIT_DELAY, &sh->state);
> @@ -3638,7 +3632,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>         BUG_ON(sh->batch_head);
>         for (i = disks; i--; ) {
>                 struct bio *bi;
> -               int bitmap_end = 0;
>
>                 if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
>                         struct md_rdev *rdev = conf->disks[i].rdev;
> @@ -3663,8 +3656,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>                 sh->dev[i].towrite = NULL;
>                 sh->overwrite_disks = 0;
>                 spin_unlock_irq(&sh->stripe_lock);
> -               if (bi)
> -                       bitmap_end = 1;
>
>                 log_stripe_write_finished(sh);
>
> @@ -3679,10 +3670,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>                         bio_io_error(bi);
>                         bi = nextbi;
>                 }
> -               if (bitmap_end)
> -                       conf->mddev->bitmap_ops->endwrite(conf->mddev,
> -                                       sh->sector, RAID5_STRIPE_SECTORS(conf));
> -               bitmap_end = 0;
>                 /* and fail all 'written' */
>                 bi = sh->dev[i].written;
>                 sh->dev[i].written = NULL;
> @@ -3691,7 +3678,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>                         sh->dev[i].page = sh->dev[i].orig_page;
>                 }
>
> -               if (bi) bitmap_end = 1;
>                 while (bi && bi->bi_iter.bi_sector <
>                        sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
>                         struct bio *bi2 = r5_next_bio(conf, bi, sh->dev[i].sector);
> @@ -3725,9 +3711,6 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>                                 bi = nextbi;
>                         }
>                 }
> -               if (bitmap_end)
> -                       conf->mddev->bitmap_ops->endwrite(conf->mddev,
> -                                       sh->sector, RAID5_STRIPE_SECTORS(conf));
>                 /* If we were in the middle of a write the parity block might
>                  * still be locked - so just clear all R5_LOCKED flags
>                  */
> @@ -4076,8 +4059,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>                                         bio_endio(wbi);
>                                         wbi = wbi2;
>                                 }
> -                               conf->mddev->bitmap_ops->endwrite(conf->mddev,
> -                                       sh->sector, RAID5_STRIPE_SECTORS(conf));
> +
>                                 if (head_sh->batch_head) {
>                                         sh = list_first_entry(&sh->batch_list,
>                                                               struct stripe_head,
> @@ -5797,10 +5779,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>                 }
>                 spin_unlock_irq(&sh->stripe_lock);
>                 if (conf->mddev->bitmap) {
> -                       for (d = 0; d < conf->raid_disks - conf->max_degraded;
> -                            d++)
> -                               mddev->bitmap_ops->startwrite(mddev, sh->sector,
> -                                       RAID5_STRIPE_SECTORS(conf));
>                         sh->bm_seq = conf->seq_flush + 1;
>                         set_bit(STRIPE_BIT_DELAY, &sh->state);
>                 }
> --
> 2.39.2
>

This patch looks good to me.

Reviewd-by: Xiao Ni <xni@redhat.com>
Yu Kuai Nov. 22, 2024, 2:45 a.m. UTC | #16
Hi,

在 2024/11/22 10:06, Xiao Ni 写道:
> On Mon, Nov 18, 2024 at 7:44 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> There are two BUG reports that raid5 will hang at
>> bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
>> write is unbalanced, and while reviewing raid5 code, it's found that
> 
> Hi Kuai
> 
> It's better to describe more about "unbalanced" in the patch. For
> raid5, bitmap is set and cleared based on stripe->dev[] now. It looks
> like the set operation matches the clear operation already.

Ok, one place that I found is that raid5 can do extra end write while
stripe->dev[].towrite is NULL, the null checking is missing. I'll
mention that in the next version.

...

> 
> This patch looks good to me.
> 
> Reviewd-by: Xiao Ni <xni@redhat.com>
> 

Thanks for the review. I'll also remove the unused STRIPE_BITMAP_PENDING
in v2.

Kuai

> 
> .
>
Xiao Ni Nov. 22, 2024, 3:14 a.m. UTC | #17
On Fri, Nov 22, 2024 at 10:46 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/22 10:06, Xiao Ni 写道:
> > On Mon, Nov 18, 2024 at 7:44 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> There are two BUG reports that raid5 will hang at
> >> bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
> >> write is unbalanced, and while reviewing raid5 code, it's found that
> >
> > Hi Kuai
> >
> > It's better to describe more about "unbalanced" in the patch. For
> > raid5, bitmap is set and cleared based on stripe->dev[] now. It looks
> > like the set operation matches the clear operation already.
>
> Ok, one place that I found is that raid5 can do extra end write while
> stripe->dev[].towrite is NULL, the null checking is missing. I'll
> mention that in the next version.

Does this can cause the deadlock?

Regards
Xiao

>
> ...
>
> >
> > This patch looks good to me.
> >
> > Reviewd-by: Xiao Ni <xni@redhat.com>
> >
>
> Thanks for the review. I'll also remove the unused STRIPE_BITMAP_PENDING
> in v2.
>
> Kuai
>
> >
> > .
> >
>
Yu Kuai Nov. 22, 2024, 8:32 a.m. UTC | #18
Hi,

在 2024/11/22 11:14, Xiao Ni 写道:
>> Ok, one place that I found is that raid5 can do extra end write while
>> stripe->dev[].towrite is NULL, the null checking is missing. I'll
>> mention that in the next version.
> Does this can cause the deadlock?

Not deadlock, the bit counter will underflow and reach COUNTER_MAX, and
bitmap_startwrite() can wait forever for the counter.

Thanks,
Kuai
Xiao Ni Nov. 23, 2024, 3:55 a.m. UTC | #19
On Fri, Nov 22, 2024 at 4:32 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/22 11:14, Xiao Ni 写道:
> >> Ok, one place that I found is that raid5 can do extra end write while
> >> stripe->dev[].towrite is NULL, the null checking is missing. I'll
> >> mention that in the next version.
> > Does this can cause the deadlock?
>
> Not deadlock, the bit counter will underflow and reach COUNTER_MAX, and
> bitmap_startwrite() can wait forever for the counter.

Hi Kuai

For normal io, endwrite is called in function
handle_stripe_clean_event when sh->dev[i].written has value.
For failed io, endwrite is called when bitmap_end has value.
bitmap_end is set when sh->dev[i].to_write is not NULL.
Which place does extra endwrite?

Regards
Xiao
>
> Thanks,
> Kuai
>
Yu Kuai Nov. 25, 2024, 1:15 a.m. UTC | #20
Hi,

在 2024/11/23 11:55, Xiao Ni 写道:
> For normal io, endwrite is called in function
> handle_stripe_clean_event when sh->dev[i].written has value.
> For failed io, endwrite is called when bitmap_end has value.
> bitmap_end is set when sh->dev[i].to_write is not NULL.
> Which place does extra endwrite?

I think it's related to the sh batch list. For example, the
returnbi tag from handle_stripe_clean_event() doesn't check
the written value.

Thanks,
Kuai
diff mbox series

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bbe002ebd584..ab37c9939ac6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8723,12 +8723,32 @@  void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 }
 EXPORT_SYMBOL_GPL(md_submit_discard_bio);
 
+static void md_bitmap_start(struct mddev *mddev,
+			    struct md_io_clone *md_io_clone)
+{
+	if (mddev->pers->bitmap_sector)
+		mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
+					   &md_io_clone->sectors);
+
+	mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
+				      md_io_clone->sectors);
+}
+
+static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
+{
+	mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
+				    md_io_clone->sectors);
+}
+
 static void md_end_clone_io(struct bio *bio)
 {
 	struct md_io_clone *md_io_clone = bio->bi_private;
 	struct bio *orig_bio = md_io_clone->orig_bio;
 	struct mddev *mddev = md_io_clone->mddev;
 
+	if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap)
+		md_bitmap_end(mddev, md_io_clone);
+
 	if (bio->bi_status && !orig_bio->bi_status)
 		orig_bio->bi_status = bio->bi_status;
 
@@ -8753,6 +8773,12 @@  static void md_clone_bio(struct mddev *mddev, struct bio **bio)
 	if (blk_queue_io_stat(bdev->bd_disk->queue))
 		md_io_clone->start_time = bio_start_io_acct(*bio);
 
+	if (bio_data_dir(*bio) == WRITE && mddev->bitmap) {
+		md_io_clone->offset = (*bio)->bi_iter.bi_sector;
+		md_io_clone->sectors = bio_sectors(*bio);
+		md_bitmap_start(mddev, md_io_clone);
+	}
+
 	clone->bi_end_io = md_end_clone_io;
 	clone->bi_private = md_io_clone;
 	*bio = clone;
@@ -8771,6 +8797,9 @@  void md_free_cloned_bio(struct bio *bio)
 	struct bio *orig_bio = md_io_clone->orig_bio;
 	struct mddev *mddev = md_io_clone->mddev;
 
+	if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap)
+		md_bitmap_end(mddev, md_io_clone);
+
 	if (bio->bi_status && !orig_bio->bi_status)
 		orig_bio->bi_status = bio->bi_status;
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index de6dadb9a40b..def808064ad8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -831,6 +831,8 @@  struct md_io_clone {
 	struct mddev	*mddev;
 	struct bio	*orig_bio;
 	unsigned long	start_time;
+	sector_t	offset;
+	unsigned long	sectors;
 	struct bio	bio_clone;
 };
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b9819f9c8ed2..e2adfeff5ae6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -422,8 +422,6 @@  static void close_write(struct r1bio *r1_bio)
 
 	if (test_bit(R1BIO_BehindIO, &r1_bio->state))
 		mddev->bitmap_ops->end_behind_write(mddev);
-	/* clear the bitmap if all writes complete successfully */
-	mddev->bitmap_ops->endwrite(mddev, r1_bio->sector, r1_bio->sectors);
 	md_write_end(mddev);
 }
 
@@ -1616,8 +1614,6 @@  static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 			if (test_bit(R1BIO_BehindIO, &r1_bio->state))
 				mddev->bitmap_ops->start_behind_write(mddev);
-			mddev->bitmap_ops->startwrite(mddev, r1_bio->sector,
-						      r1_bio->sectors);
 			first_clone = 0;
 		}
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index efbc0bd92479..79a209236c9f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -428,8 +428,6 @@  static void close_write(struct r10bio *r10_bio)
 {
 	struct mddev *mddev = r10_bio->mddev;
 
-	/* clear the bitmap if all writes complete successfully */
-	mddev->bitmap_ops->endwrite(mddev, r10_bio->sector, r10_bio->sectors);
 	md_write_end(mddev);
 }
 
@@ -1487,7 +1485,6 @@  static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	md_account_bio(mddev, &bio);
 	r10_bio->master_bio = bio;
 	atomic_set(&r10_bio->remaining, 1);
-	mddev->bitmap_ops->startwrite(mddev, r10_bio->sector, r10_bio->sectors);
 
 	for (i = 0; i < conf->copies; i++) {
 		if (r10_bio->devs[i].bio)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ba4f9577c737..011246e16a99 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -313,8 +313,6 @@  void r5c_handle_cached_data_endio(struct r5conf *conf,
 		if (sh->dev[i].written) {
 			set_bit(R5_UPTODATE, &sh->dev[i].flags);
 			r5c_return_dev_pending_writes(conf, &sh->dev[i]);
-			conf->mddev->bitmap_ops->endwrite(conf->mddev,
-					sh->sector, RAID5_STRIPE_SECTORS(conf));
 		}
 	}
 }
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 95caed41654c..976788138a6e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3578,12 +3578,6 @@  static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
 		 * is added to a batch, STRIPE_BIT_DELAY cannot be changed
 		 * any more.
 		 */
-		set_bit(STRIPE_BITMAP_PENDING, &sh->state);
-		spin_unlock_irq(&sh->stripe_lock);
-		conf->mddev->bitmap_ops->startwrite(conf->mddev, sh->sector,
-					RAID5_STRIPE_SECTORS(conf));
-		spin_lock_irq(&sh->stripe_lock);
-		clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
 		if (!sh->batch_head) {
 			sh->bm_seq = conf->seq_flush+1;
 			set_bit(STRIPE_BIT_DELAY, &sh->state);
@@ -3638,7 +3632,6 @@  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 	BUG_ON(sh->batch_head);
 	for (i = disks; i--; ) {
 		struct bio *bi;
-		int bitmap_end = 0;
 
 		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
 			struct md_rdev *rdev = conf->disks[i].rdev;
@@ -3663,8 +3656,6 @@  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		sh->dev[i].towrite = NULL;
 		sh->overwrite_disks = 0;
 		spin_unlock_irq(&sh->stripe_lock);
-		if (bi)
-			bitmap_end = 1;
 
 		log_stripe_write_finished(sh);
 
@@ -3679,10 +3670,6 @@  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bio_io_error(bi);
 			bi = nextbi;
 		}
-		if (bitmap_end)
-			conf->mddev->bitmap_ops->endwrite(conf->mddev,
-					sh->sector, RAID5_STRIPE_SECTORS(conf));
-		bitmap_end = 0;
 		/* and fail all 'written' */
 		bi = sh->dev[i].written;
 		sh->dev[i].written = NULL;
@@ -3691,7 +3678,6 @@  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			sh->dev[i].page = sh->dev[i].orig_page;
 		}
 
-		if (bi) bitmap_end = 1;
 		while (bi && bi->bi_iter.bi_sector <
 		       sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
 			struct bio *bi2 = r5_next_bio(conf, bi, sh->dev[i].sector);
@@ -3725,9 +3711,6 @@  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 				bi = nextbi;
 			}
 		}
-		if (bitmap_end)
-			conf->mddev->bitmap_ops->endwrite(conf->mddev,
-					sh->sector, RAID5_STRIPE_SECTORS(conf));
 		/* If we were in the middle of a write the parity block might
 		 * still be locked - so just clear all R5_LOCKED flags
 		 */
@@ -4076,8 +4059,7 @@  static void handle_stripe_clean_event(struct r5conf *conf,
 					bio_endio(wbi);
 					wbi = wbi2;
 				}
-				conf->mddev->bitmap_ops->endwrite(conf->mddev,
-					sh->sector, RAID5_STRIPE_SECTORS(conf));
+
 				if (head_sh->batch_head) {
 					sh = list_first_entry(&sh->batch_list,
 							      struct stripe_head,
@@ -5797,10 +5779,6 @@  static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		}
 		spin_unlock_irq(&sh->stripe_lock);
 		if (conf->mddev->bitmap) {
-			for (d = 0; d < conf->raid_disks - conf->max_degraded;
-			     d++)
-				mddev->bitmap_ops->startwrite(mddev, sh->sector,
-					RAID5_STRIPE_SECTORS(conf));
 			sh->bm_seq = conf->seq_flush + 1;
 			set_bit(STRIPE_BIT_DELAY, &sh->state);
 		}