
[v2,3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d

Message ID 20220908161516.4361-4-logang@deltatee.com (mailing list archive)
State New, archived
Series A couple more bug fixes

Commit Message

Logan Gunthorpe Sept. 8, 2022, 4:15 p.m. UTC
A complicated deadlock exists when using the journal and an elevated
group_thread_cnt. It was found with loop devices, but it's not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
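
The exact fio job is not preserved here; the following is only a
sketch of the kind of random-write workload meant, with the device
path and all parameters as illustrative placeholders (assuming
/dev/md0 is the journaled RAID5 array, with group_thread_cnt raised
via sysfs beforehand):

    ; hypothetical reproducer -- device and parameters are placeholders
    [global]
    filename=/dev/md0
    direct=1
    ioengine=libaio
    rw=randwrite
    bs=4k
    iodepth=64
    runtime=300
    time_based

    [writers]
    numjobs=4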

When the deadlock occurs, multiple threads will hang in different ways:

 1) The group threads will hang in the blk-wbt code with bios waiting to
    be submitted to the block layer:

        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        ops_run_io+0x46b/0x1a30
        handle_stripe+0xcd3/0x36b0
        handle_active_stripes.constprop.0+0x6f6/0xa60
        raid5_do_work+0x177/0x330

    Or:
        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        flush_deferred_bios+0x136/0x170
        raid5_do_work+0x262/0x330

 2) The r5l_reclaim thread will hang in the same way, submitting a
    bio to the block layer:

        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        submit_bio+0x3f/0xf0
        md_super_write+0x12f/0x1b0
        md_update_sb.part.0+0x7c6/0xff0
        md_update_sb+0x30/0x60
        r5l_do_reclaim+0x4f9/0x5e0
        r5l_reclaim_thread+0x69/0x30b

    However, before hanging, the MD_SB_CHANGE_PENDING flag will be
    set for sb_flags in r5l_write_super_and_discard_space(). This
    flag will never be cleared because the submit_bio() call never
    returns.

 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
    will do no processing on any pending stripes and re-set
    STRIPE_HANDLE. This will cause the raid5d thread to enter an
    infinite loop, constantly trying to handle the same stripes
    stuck in the queue.

    The raid5d thread has a blk_plug that holds a number of bios
    that are also stuck waiting, because the thread is in a loop
    that never schedules. These bios have been accounted for by
    blk-wbt, thus preventing the other threads above from
    continuing when they try to submit bios. --Deadlock. (The
    flag handling behind steps 2 and 3 is sketched just below.)
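
A sketch of the flag handling behind steps 2 and 3, paraphrased from
drivers/md/raid5-cache.c and drivers/md/raid5.c of roughly this era
(surrounding context is elided and may differ between kernel versions):

    /* r5l_write_super_and_discard_space(): the superblock must be
     * written out before log space is discarded, so the update is
     * requested before the (never-returning) submit_bio() above.
     * MD_SB_CHANGE_PENDING stays set until md_update_sb() finishes.
     */
    set_mask_bits(&mddev->sb_flags, 0,
                  BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));

    /* handle_stripe(): while a superblock update is pending, do no
     * processing; just mark the stripe to be retried later. This is
     * what turns raid5d() into a busy loop once the flag is stuck.
     */
    if (s.handle_bad_blocks ||
        test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
            set_bit(STRIPE_HANDLE, &sh->state);
            goto finish;
    }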

To fix this, add the same wait_event_lock_irq() call that is used in
raid5_do_work() to raid5d() so that, if MD_SB_CHANGE_PENDING is set,
the thread will schedule and wait until the flag is cleared. Going to
sleep flushes the thread's plug, which allows the r5l_reclaim thread
to continue, thus preventing the deadlock.
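
For reference, the existing wait in raid5_do_work() that this change
mirrors looks like the following (paraphrased; see the commit cited
below). wait_event_lock_irq() must be entered with conf->device_lock
held: it releases the lock while sleeping and reacquires it before
every re-check of the condition, and putting the task to sleep is what
lets the scheduler flush its block plug:

    wait_event_lock_irq(mddev->sb_wait,
                        !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
                        conf->device_lock);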

It's not clear when the deadlock was introduced, but the similar
wait_event_lock_irq() call in raid5_do_work() was added in 2017 by
this commit:

    16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
                   is updated.")

Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Logan Gunthorpe Sept. 15, 2022, 3:15 p.m. UTC | #1
Hi Song,

On 2022-09-08 10:15, Logan Gunthorpe wrote:
> A complicated deadlock exists when using the journal and an elevated
> group_thread_cnt. It was found with loop devices, but it's not clear
> whether it can be seen with real disks. The deadlock can occur simply
> by writing data with an fio script.
> 
> When the deadlock occurs, multiple threads will hang in different ways:
> 
>  1) The group threads will hang in the blk-wbt code with bios waiting to
>     be submitted to the block layer:
> 
>         io_schedule+0x70/0xb0
>         rq_qos_wait+0x153/0x210
>         wbt_wait+0x115/0x1b0
>         __rq_qos_throttle+0x38/0x60
>         blk_mq_submit_bio+0x589/0xcd0
>         __submit_bio+0xe6/0x100
>         submit_bio_noacct_nocheck+0x42e/0x470
>         submit_bio_noacct+0x4c2/0xbb0
>         ops_run_io+0x46b/0x1a30
>         handle_stripe+0xcd3/0x36b0
>         handle_active_stripes.constprop.0+0x6f6/0xa60
>         raid5_do_work+0x177/0x330
> 
>     Or:
>         io_schedule+0x70/0xb0
>         rq_qos_wait+0x153/0x210
>         wbt_wait+0x115/0x1b0
>         __rq_qos_throttle+0x38/0x60
>         blk_mq_submit_bio+0x589/0xcd0
>         __submit_bio+0xe6/0x100
>         submit_bio_noacct_nocheck+0x42e/0x470
>         submit_bio_noacct+0x4c2/0xbb0
>         flush_deferred_bios+0x136/0x170
>         raid5_do_work+0x262/0x330
> 
>  2) The r5l_reclaim thread will hang in the same way, submitting a
>     bio to the block layer:
> 
>         io_schedule+0x70/0xb0
>         rq_qos_wait+0x153/0x210
>         wbt_wait+0x115/0x1b0
>         __rq_qos_throttle+0x38/0x60
>         blk_mq_submit_bio+0x589/0xcd0
>         __submit_bio+0xe6/0x100
>         submit_bio_noacct_nocheck+0x42e/0x470
>         submit_bio_noacct+0x4c2/0xbb0
>         submit_bio+0x3f/0xf0
>         md_super_write+0x12f/0x1b0
>         md_update_sb.part.0+0x7c6/0xff0
>         md_update_sb+0x30/0x60
>         r5l_do_reclaim+0x4f9/0x5e0
>         r5l_reclaim_thread+0x69/0x30b
> 
>     However, before hanging, the MD_SB_CHANGE_PENDING flag will be
>     set for sb_flags in r5l_write_super_and_discard_space(). This
>     flag will never be cleared because the submit_bio() call never
>     returns.
> 
>  3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
>     will do no processing on any pending stripes and re-set
>     STRIPE_HANDLE. This will cause the raid5d thread to enter an
>     infinite loop, constantly trying to handle the same stripes
>     stuck in the queue.
> 
>     The raid5d thread has a blk_plug that holds a number of bios
>     that are also stuck waiting, because the thread is in a loop
>     that never schedules. These bios have been accounted for by
>     blk-wbt, thus preventing the other threads above from
>     continuing when they try to submit bios. --Deadlock.
> 
> To fix this, add the same wait_event_lock_irq() call that is used in
> raid5_do_work() to raid5d() so that, if MD_SB_CHANGE_PENDING is set,
> the thread will schedule and wait until the flag is cleared. Going to
> sleep flushes the thread's plug, which allows the r5l_reclaim thread
> to continue, thus preventing the deadlock.
> 
> It's not clear when the deadlock was introduced, but the similar
> wait_event_lock_irq() call in raid5_do_work() was added in 2017 by
> this commit:
> 
>     16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
>                    is updated.")
> 
> Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

It seems this patch of mine, which is in md-next, is causing a 1 in ~30
failure of the mdadm test 13imsm-r0_r5_3d-grow-r0_r5_4d.

I'm looking into it and will try to send an updated patch when I have a fix.

Logan
Song Liu Sept. 19, 2022, 6:31 p.m. UTC | #2
On Thu, Sep 15, 2022 at 8:15 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> Hi Song,
>
> On 2022-09-08 10:15, Logan Gunthorpe wrote:
> > A complicated deadlock exists when using the journal and an elevated
> > group_thread_cnt. It was found with loop devices, but it's not clear
> > whether it can be seen with real disks. The deadlock can occur simply
> > by writing data with an fio script.
> >
> > When the deadlock occurs, multiple threads will hang in different ways:
> >
> >  1) The group threads will hang in the blk-wbt code with bios waiting to
> >     be submitted to the block layer:
> >
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         ops_run_io+0x46b/0x1a30
> >         handle_stripe+0xcd3/0x36b0
> >         handle_active_stripes.constprop.0+0x6f6/0xa60
> >         raid5_do_work+0x177/0x330
> >
> >     Or:
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         flush_deferred_bios+0x136/0x170
> >         raid5_do_work+0x262/0x330
> >
> >  2) The r5l_reclaim thread will hang in the same way, submitting a
> >     bio to the block layer:
> >
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         submit_bio+0x3f/0xf0
> >         md_super_write+0x12f/0x1b0
> >         md_update_sb.part.0+0x7c6/0xff0
> >         md_update_sb+0x30/0x60
> >         r5l_do_reclaim+0x4f9/0x5e0
> >         r5l_reclaim_thread+0x69/0x30b
> >
> >     However, before hanging, the MD_SB_CHANGE_PENDING flag will be
> >     set for sb_flags in r5l_write_super_and_discard_space(). This
> >     flag will never be cleared because the submit_bio() call never
> >     returns.
> >
> >  3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
> >     will do no processing on any pending stripes and re-set
> >     STRIPE_HANDLE. This will cause the raid5d thread to enter an
> >     infinite loop, constantly trying to handle the same stripes
> >     stuck in the queue.
> >
> >     The raid5d thread has a blk_plug that holds a number of bios
> >     that are also stuck waiting, because the thread is in a loop
> >     that never schedules. These bios have been accounted for by
> >     blk-wbt, thus preventing the other threads above from
> >     continuing when they try to submit bios. --Deadlock.
> >
> > To fix this, add the same wait_event_lock_irq() call that is used in
> > raid5_do_work() to raid5d() so that, if MD_SB_CHANGE_PENDING is set,
> > the thread will schedule and wait until the flag is cleared. Going to
> > sleep flushes the thread's plug, which allows the r5l_reclaim thread
> > to continue, thus preventing the deadlock.
> >
> > It's not clear when the deadlock was introduced, but the similar
> > wait_event_lock_irq() call in raid5_do_work() was added in 2017 by
> > this commit:
> >
> >     16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
> >                    is updated.")
> >
> > Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
> > Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>
> It seems this patch of mine, which is in md-next, is causing a 1 in ~30
> failure of the mdadm test 13imsm-r0_r5_3d-grow-r0_r5_4d.
>
> I'm looking into it and will try to send an updated patch when I have a fix.

Thanks for the heads-up! I will remove this one from md-next for now.

Song

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 734f92e75f85..7d1b7e361ef2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6783,6 +6783,10 @@ static void raid5d(struct md_thread *thread)
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
 		}
+
+		wait_event_lock_irq(mddev->sb_wait,
+			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
+			conf->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);