Message ID | 20231204031703.3102254-1-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | Mike Snitzer |
Headers | show |
Series | [-next] md: split MD_RECOVERY_NEEDED out of mddev_resume | expand |
Dear Yu, Thank you for your patch. Am 04.12.23 um 04:17 schrieb Yu Kuai: > From: Yu Kuai <yukuai3@huawei.com> > > New mddev_resume() calls are added to synchroniza IO with array synchronize > reconfiguration, however, this introduce a regression while adding it in 1. Maybe: … performance regression … 2. introduce*s* > md_start_sync(): > > 1) someone set MD_RECOVERY_NEEDED first; set*s* > 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and > queue a new sync work; grab*s*, clear*s*, queue*s* > 3) daemon thread release reconfig_mutex; release*s* > 4) in md_start_sync > a) check that there are spares that can be added/removed, then suspend > the array; > b) remove_and_add_spares may not be called, or called without really > add/remove spares; > c) resume the array, then set MD_RECOVERY_NEEDED again! > > Loop between 2 - 4, then mddev_suspend() will be called quite often, for > consequence, normal IO will be quite slow. It’d be great if you could document the exact “test case”, and numbers. > Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so split*t*ing > that md_start_sync() won't set such flag and hence the loop will be broken. 
> > Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") > Reported-and-tested-by: Janpieter Sollie <janpieter.sollie@edpnet.be> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200 > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > drivers/md/dm-raid.c | 1 + > drivers/md/md-bitmap.c | 2 ++ > drivers/md/md.c | 6 +++++- > drivers/md/raid5.c | 4 ++++ > 4 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c > index eb009d6bb03a..e9c0d70f7fe5 100644 > --- a/drivers/md/dm-raid.c > +++ b/drivers/md/dm-raid.c > @@ -4059,6 +4059,7 @@ static void raid_resume(struct dm_target *ti) > clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); > mddev->ro = 0; > mddev->in_sync = 0; > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > } > } > diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c > index 9672f75c3050..16112750ee64 100644 > --- a/drivers/md/md-bitmap.c > +++ b/drivers/md/md-bitmap.c > @@ -2428,6 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len) > } > rv = 0; > out: > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > if (rv) > return rv; > @@ -2571,6 +2572,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len) > if (old_mwb != backlog) > md_bitmap_update_sb(mddev->bitmap); > > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > return len; > } > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 4b1e8007dd15..48a1b12f3c2c 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev) > percpu_ref_resurrect(&mddev->active_io); > wake_up(&mddev->sb_wait); > > - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > md_wakeup_thread(mddev->thread); > md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ > > @@ -4146,6 +4145,7 @@ level_store(struct mddev *mddev, 
const char *buf, size_t len) > md_new_event(); > rv = len; > out_unlock: > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > return rv; > } > @@ -4652,6 +4652,8 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len) > out: > if (err) > export_rdev(rdev, mddev); > + else > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > if (!err) > md_new_event(); > @@ -5533,6 +5535,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len) > mddev_destroy_serial_pool(mddev, NULL); > mddev->serialize_policy = value; > unlock: > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > return err ?: len; > } > @@ -6593,6 +6596,7 @@ static void autorun_devices(int part) > export_rdev(rdev, mddev); > } > autorun_array(mddev); > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > } > /* on success, candidates will be empty, on error > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index 42ba3581cfea..f88f92517a18 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -6989,6 +6989,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len) > mutex_unlock(&conf->cache_size_mutex); > > out_unlock: > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > return err ?: len; > } > @@ -7090,6 +7091,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len) > else > blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q); > } > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > return err ?: len; > } > @@ -7169,6 +7171,7 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len) > kfree(old_groups); > } > } > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > > return err ?: len; > @@ -8920,6 +8923,7 @@ static int raid5_change_consistency_policy(struct mddev 
*mddev, const char *buf) > if (!err) > md_update_sb(mddev, 1); > > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > mddev_unlock_and_resume(mddev); > > return err; Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Kind regards, Paul
On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > From: Yu Kuai <yukuai3@huawei.com> > > New mddev_resume() calls are added to synchroniza IO with array > reconfiguration, however, this introduce a regression while adding it in > md_start_sync(): > > 1) someone set MD_RECOVERY_NEEDED first; > 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and > queue a new sync work; > 3) daemon thread release reconfig_mutex; > 4) in md_start_sync > a) check that there are spares that can be added/removed, then suspend > the array; > b) remove_and_add_spares may not be called, or called without really > add/remove spares; > c) resume the array, then set MD_RECOVERY_NEEDED again! > > Loop between 2 - 4, then mddev_suspend() will be called quite often, for > consequence, normal IO will be quite slow. > > Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so > that md_start_sync() won't set such flag and hence the loop will be broken. I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call sites of mddev_resume(). How about something like the following instead? Please also incorporate feedback from Paul in the next version. 
Thanks, Song diff --git i/drivers/md/md.c w/drivers/md/md.c index c94373d64f2c..2d53e1b57070 100644 --- i/drivers/md/md.c +++ w/drivers/md/md.c @@ -490,7 +490,7 @@ int mddev_suspend(struct mddev *mddev, bool interruptible) } EXPORT_SYMBOL_GPL(mddev_suspend); -void mddev_resume(struct mddev *mddev) +static void __mddev_resume(struct mddev *mddev, bool recovery_needed) { lockdep_assert_not_held(&mddev->reconfig_mutex); @@ -507,12 +507,18 @@ void mddev_resume(struct mddev *mddev) percpu_ref_resurrect(&mddev->active_io); wake_up(&mddev->sb_wait); - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + if (recovery_needed) + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ mutex_unlock(&mddev->suspend_mutex); } + +void mddev_resume(struct mddev *mddev) +{ + __mddev_resume(mddev, true); +} EXPORT_SYMBOL_GPL(mddev_resume); /* @@ -9403,7 +9409,9 @@ static void md_start_sync(struct work_struct *ws) goto not_running; } - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); + mddev_unlock(mddev); + if (suspend) + __mddev_resume(mddev, false); md_wakeup_thread(mddev->sync_thread); sysfs_notify_dirent_safe(mddev->sysfs_action); md_new_event(); @@ -9415,7 +9423,9 @@ static void md_start_sync(struct work_struct *ws) clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); + mddev_unlock(mddev); + if (suspend) + __mddev_resume(mddev, false); wake_up(&resync_wait); if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
Hi, 在 2023/12/06 16:30, Song Liu 写道: > On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: >> >> From: Yu Kuai <yukuai3@huawei.com> >> >> New mddev_resume() calls are added to synchroniza IO with array >> reconfiguration, however, this introduce a regression while adding it in >> md_start_sync(): >> >> 1) someone set MD_RECOVERY_NEEDED first; >> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and >> queue a new sync work; >> 3) daemon thread release reconfig_mutex; >> 4) in md_start_sync >> a) check that there are spares that can be added/removed, then suspend >> the array; >> b) remove_and_add_spares may not be called, or called without really >> add/remove spares; >> c) resume the array, then set MD_RECOVERY_NEEDED again! >> >> Loop between 2 - 4, then mddev_suspend() will be called quite often, for >> consequence, normal IO will be quite slow. >> >> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so >> that md_start_sync() won't set such flag and hence the loop will be broken. > > I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call > sites of mddev_resume(). There are also some other mddev_resume() that is added later and don't need recovery, so md_start_sync() is not the only place: - md_setup_drive - rdev_attr_store - suspend_lo_store - suspend_hi_store - autorun_devices - md_ioctl - r5c_disable_writeback_async - error path from new_dev_store(), ... I'm not sure add a new helper is a good idea, because all above apis should use new helper as well. > > How about something like the following instead? > > Please also incorporate feedback from Paul in the next version. Of course. 
Thanks, Kuai > > Thanks, > Song > > diff --git i/drivers/md/md.c w/drivers/md/md.c > index c94373d64f2c..2d53e1b57070 100644 > --- i/drivers/md/md.c > +++ w/drivers/md/md.c > @@ -490,7 +490,7 @@ int mddev_suspend(struct mddev *mddev, bool interruptible) > } > EXPORT_SYMBOL_GPL(mddev_suspend); > > -void mddev_resume(struct mddev *mddev) > +static void __mddev_resume(struct mddev *mddev, bool recovery_needed) > { > lockdep_assert_not_held(&mddev->reconfig_mutex); > > @@ -507,12 +507,18 @@ void mddev_resume(struct mddev *mddev) > percpu_ref_resurrect(&mddev->active_io); > wake_up(&mddev->sb_wait); > > - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > + if (recovery_needed) > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > md_wakeup_thread(mddev->thread); > md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ > > mutex_unlock(&mddev->suspend_mutex); > } > + > +void mddev_resume(struct mddev *mddev) > +{ > + __mddev_resume(mddev, true); > +} > EXPORT_SYMBOL_GPL(mddev_resume); > > /* > @@ -9403,7 +9409,9 @@ static void md_start_sync(struct work_struct *ws) > goto not_running; > } > > - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); > + mddev_unlock(mddev); > + if (suspend) > + __mddev_resume(mddev, false); > md_wakeup_thread(mddev->sync_thread); > sysfs_notify_dirent_safe(mddev->sysfs_action); > md_new_event(); > @@ -9415,7 +9423,9 @@ static void md_start_sync(struct work_struct *ws) > clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); > clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); > clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); > - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); > + mddev_unlock(mddev); > + if (suspend) > + __mddev_resume(mddev, false); > > wake_up(&resync_wait); > if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) && > > . >
On Wed, Dec 6, 2023 at 3:36 AM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > Hi, > > 在 2023/12/06 16:30, Song Liu 写道: > > On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > >> > >> From: Yu Kuai <yukuai3@huawei.com> > >> > >> New mddev_resume() calls are added to synchroniza IO with array > >> reconfiguration, however, this introduce a regression while adding it in > >> md_start_sync(): > >> > >> 1) someone set MD_RECOVERY_NEEDED first; > >> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and > >> queue a new sync work; > >> 3) daemon thread release reconfig_mutex; > >> 4) in md_start_sync > >> a) check that there are spares that can be added/removed, then suspend > >> the array; > >> b) remove_and_add_spares may not be called, or called without really > >> add/remove spares; > >> c) resume the array, then set MD_RECOVERY_NEEDED again! > >> > >> Loop between 2 - 4, then mddev_suspend() will be called quite often, for > >> consequence, normal IO will be quite slow. > >> > >> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so > >> that md_start_sync() won't set such flag and hence the loop will be broken. > > > > I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call > > sites of mddev_resume(). > > There are also some other mddev_resume() that is added later and don't > need recovery, so md_start_sync() is not the only place: > > - md_setup_drive > - rdev_attr_store > - suspend_lo_store > - suspend_hi_store > - autorun_devices > - md_ioct > - r5c_disable_writeback_async > - error path from new_dev_store(), ... > > I'm not sure add a new helper is a good idea, because all above apis > should use new helper as well. I think for most of these call sites, it is OK to set MD_RECOVERY_NEEDED (although it is not needed), and md_start_sync() is the only one that may trigger "loop between 2 - 4" scenario. Did I miss something? It is already rc4, so we need to send the fix soon. Thanks, Song
Hi, 在 2023/12/07 1:24, Song Liu 写道: > On Wed, Dec 6, 2023 at 3:36 AM Yu Kuai <yukuai1@huaweicloud.com> wrote: >> >> Hi, >> >> 在 2023/12/06 16:30, Song Liu 写道: >>> On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: >>>> >>>> From: Yu Kuai <yukuai3@huawei.com> >>>> >>>> New mddev_resume() calls are added to synchroniza IO with array >>>> reconfiguration, however, this introduce a regression while adding it in >>>> md_start_sync(): >>>> >>>> 1) someone set MD_RECOVERY_NEEDED first; >>>> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and >>>> queue a new sync work; >>>> 3) daemon thread release reconfig_mutex; >>>> 4) in md_start_sync >>>> a) check that there are spares that can be added/removed, then suspend >>>> the array; >>>> b) remove_and_add_spares may not be called, or called without really >>>> add/remove spares; >>>> c) resume the array, then set MD_RECOVERY_NEEDED again! >>>> >>>> Loop between 2 - 4, then mddev_suspend() will be called quite often, for >>>> consequence, normal IO will be quite slow. >>>> >>>> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so >>>> that md_start_sync() won't set such flag and hence the loop will be broken. >>> >>> I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call >>> sites of mddev_resume(). >> >> There are also some other mddev_resume() that is added later and don't >> need recovery, so md_start_sync() is not the only place: >> >> - md_setup_drive >> - rdev_attr_store >> - suspend_lo_store >> - suspend_hi_store >> - autorun_devices >> - md_ioct >> - r5c_disable_writeback_async >> - error path from new_dev_store(), ... >> >> I'm not sure add a new helper is a good idea, because all above apis >> should use new helper as well. > > I think for most of these call sites, it is OK to set MD_RECOVERY_NEEDED > (although it is not needed), and md_start_sync() is the only one that may > trigger "loop between 2 - 4" scenario. Did I miss something? 
Yes, it's the only problematic one. I'll send v2. Thanks, Kuai > > It is already rc4, so we need to send the fix soon. > > Thanks, > Song > . >
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index eb009d6bb03a..e9c0d70f7fe5 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -4059,6 +4059,7 @@ static void raid_resume(struct dm_target *ti) clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); mddev->ro = 0; mddev->in_sync = 0; + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); } } diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 9672f75c3050..16112750ee64 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2428,6 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len) } rv = 0; out: + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); if (rv) return rv; @@ -2571,6 +2572,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len) if (old_mwb != backlog) md_bitmap_update_sb(mddev->bitmap); + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return len; } diff --git a/drivers/md/md.c b/drivers/md/md.c index 4b1e8007dd15..48a1b12f3c2c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev) percpu_ref_resurrect(&mddev->active_io); wake_up(&mddev->sb_wait); - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ @@ -4146,6 +4145,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len) md_new_event(); rv = len; out_unlock: + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return rv; } @@ -4652,6 +4652,8 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len) out: if (err) export_rdev(rdev, mddev); + else + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); if (!err) md_new_event(); @@ -5533,6 +5535,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len) mddev_destroy_serial_pool(mddev, NULL); 
mddev->serialize_policy = value; unlock: + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return err ?: len; } @@ -6593,6 +6596,7 @@ static void autorun_devices(int part) export_rdev(rdev, mddev); } autorun_array(mddev); + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); } /* on success, candidates will be empty, on error diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 42ba3581cfea..f88f92517a18 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6989,6 +6989,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len) mutex_unlock(&conf->cache_size_mutex); out_unlock: + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return err ?: len; } @@ -7090,6 +7091,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len) else blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q); } + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return err ?: len; } @@ -7169,6 +7171,7 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len) kfree(old_groups); } } + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return err ?: len; @@ -8920,6 +8923,7 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf) if (!err) md_update_sb(mddev, 1); + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); mddev_unlock_and_resume(mddev); return err;