diff mbox series

Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")

Message ID c34e6c38-ca47-439a-baf1-3489c05a65a8@kernel.dk (mailing list archive)
State New
Headers show
Series Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59") | expand

Commit Message

Jens Axboe Nov. 4, 2024, 2:38 a.m. UTC
On 11/3/24 5:06 PM, Jens Axboe wrote:
> On 11/3/24 5:01 PM, Keith Busch wrote:
>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>> problematic commit simply by browsing git log. As indicated above;
>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>> minimal reproducer, but can provide my large one (it is testing a
>>>> larger build process in a VM) if needed?there are some additional
>>>> details in the above-linked downstream bug report, though. I hope that
>>>> having identified the problematic commit is enough for someone with
>>>> more context to go off of. Happy to provide more information if
>>>> needed.
>>>
>>> Don't worry about not having a reproducer, having the backport commit
>>> pin pointed will do just fine. I'll take a look at this.
>>
>> I think stable is missing:
>>
>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
> 
> I think you need to go back further than that, this one already
> unconditionally holds ->uring_lock around overflow flushing...

Took a look, it's this one:

commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Wed Apr 10 02:26:54 2024 +0100

    io_uring: always lock __io_cqring_overflow_flush

Greg/stable, can you pick this one for 6.6-stable? It picks
cleanly.

For 6.1, which is the other stable of that age that has the backport,
the attached patch will do the trick.

With that, I believe it should be sorted. Hopefully that can make
6.6.60 and 6.1.116.

Comments

Andrew Marshall Nov. 4, 2024, 4:25 a.m. UTC | #1
On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
> On 11/3/24 5:06 PM, Jens Axboe wrote:
>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>> larger build process in a VM) if needed?there are some additional
>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>> having identified the problematic commit is enough for someone with
>>>>> more context to go off of. Happy to provide more information if
>>>>> needed.
>>>>
>>>> Don't worry about not having a reproducer, having the backport commit
>>>> pin pointed will do just fine. I'll take a look at this.
>>>
>>> I think stable is missing:
>>>
>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>> 
>> I think you need to go back further than that, this one already
>> unconditionally holds ->uring_lock around overflow flushing...
>
> Took a look, it's this one:
>
> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
> Author: Pavel Begunkov <asml.silence@gmail.com>
> Date:   Wed Apr 10 02:26:54 2024 +0100
>
>     io_uring: always lock __io_cqring_overflow_flush
>
> Greg/stable, can you pick this one for 6.6-stable? It picks
> cleanly.
>
> For 6.1, which is the other stable of that age that has the backport,
> the attached patch will do the trick.
>
> With that, I believe it should be sorted. Hopefully that can make
> 6.6.60 and 6.1.116.
>
> -- 
> Jens Axboe
> Attachments:
> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch

Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my reproducer (run a few times). Your first quick patch also passed, for what it’s worth. Thanks for the quick responses!
Andrew Marshall Nov. 4, 2024, 1:17 p.m. UTC | #2
On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote:
> On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>> larger build process in a VM) if needed?there are some additional
>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>> having identified the problematic commit is enough for someone with
>>>>>> more context to go off of. Happy to provide more information if
>>>>>> needed.
>>>>>
>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>
>>>> I think stable is missing:
>>>>
>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>> 
>>> I think you need to go back further than that, this one already
>>> unconditionally holds ->uring_lock around overflow flushing...
>>
>> Took a look, it's this one:
>>
>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>> Author: Pavel Begunkov <asml.silence@gmail.com>
>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>
>>     io_uring: always lock __io_cqring_overflow_flush
>>
>> Greg/stable, can you pick this one for 6.6-stable? It picks
>> cleanly.
>>
>> For 6.1, which is the other stable of that age that has the backport,
>> the attached patch will do the trick.
>>
>> With that, I believe it should be sorted. Hopefully that can make
>> 6.6.60 and 6.1.116.
>>
>> -- 
>> Jens Axboe
>> Attachments:
>> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch
>
> Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my 
> reproducer (run a few times). Your first quick patch also passed, for 
> what it’s worth. Thanks for the quick responses!

Correction: I cherry-picked and tested 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you identified), not 6b231248e97fc3. Apologies for any confusion.
Jens Axboe Nov. 4, 2024, 3:58 p.m. UTC | #3
On 11/4/24 6:17 AM, Andrew Marshall wrote:
> On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote:
>> On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
>>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>>> larger build process in a VM) if needed?there are some additional
>>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>>> having identified the problematic commit is enough for someone with
>>>>>>> more context to go off of. Happy to provide more information if
>>>>>>> needed.
>>>>>>
>>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>>
>>>>> I think stable is missing:
>>>>>
>>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>>>
>>>> I think you need to go back further than that, this one already
>>>> unconditionally holds ->uring_lock around overflow flushing...
>>>
>>> Took a look, it's this one:
>>>
>>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>>> Author: Pavel Begunkov <asml.silence@gmail.com>
>>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>>
>>>     io_uring: always lock __io_cqring_overflow_flush
>>>
>>> Greg/stable, can you pick this one for 6.6-stable? It picks
>>> cleanly.
>>>
>>> For 6.1, which is the other stable of that age that has the backport,
>>> the attached patch will do the trick.
>>>
>>> With that, I believe it should be sorted. Hopefully that can make
>>> 6.6.60 and 6.1.116.
>>>
>>> -- 
>>> Jens Axboe
>>> Attachments:
>>> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch
>>
>> Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my 
>> reproducer (run a few times). Your first quick patch also passed, for 
>> what it?s worth. Thanks for the quick responses!
> 
> Correction: I cherry-picked and tested
> 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you
> identified), not 6b231248e97fc3. Apologies for any confusion.

Thanks for clarifying, so it's as expected. Hopefully -stable can pick
this backport up soonish, so the next stable release will be sorted.
Thanks for reporting the issue!
Greg KH Nov. 6, 2024, 6:05 a.m. UTC | #4
On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote:
> On 11/3/24 5:06 PM, Jens Axboe wrote:
> > On 11/3/24 5:01 PM, Keith Busch wrote:
> >> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
> >>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
> >>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
> >>>> problematic commit simply by browsing git log. As indicated above;
> >>>> reverting that atop 6.6.59 results in success. Since it is passing on
> >>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
> >>>> other semantic merge conflict. Unfortunately I do not have a compact,
> >>>> minimal reproducer, but can provide my large one (it is testing a
> >>>> larger build process in a VM) if needed?there are some additional
> >>>> details in the above-linked downstream bug report, though. I hope that
> >>>> having identified the problematic commit is enough for someone with
> >>>> more context to go off of. Happy to provide more information if
> >>>> needed.
> >>>
> >>> Don't worry about not having a reproducer, having the backport commit
> >>> pin pointed will do just fine. I'll take a look at this.
> >>
> >> I think stable is missing:
> >>
> >>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
> > 
> > I think you need to go back further than that, this one already
> > unconditionally holds ->uring_lock around overflow flushing...
> 
> Took a look, it's this one:
> 
> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
> Author: Pavel Begunkov <asml.silence@gmail.com>
> Date:   Wed Apr 10 02:26:54 2024 +0100
> 
>     io_uring: always lock __io_cqring_overflow_flush
> 
> Greg/stable, can you pick this one for 6.6-stable? It picks
> cleanly.
> 
> For 6.1, which is the other stable of that age that has the backport,
> the attached patch will do the trick.
> 
> With that, I believe it should be sorted. Hopefully that can make
> 6.6.60 and 6.1.116.

Now queued up, thanks.

greg k-h
Jens Axboe Nov. 6, 2024, 2:11 p.m. UTC | #5
On 11/5/24 11:05 PM, Greg Kroah-Hartman wrote:
> On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote:
>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>> larger build process in a VM) if needed?there are some additional
>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>> having identified the problematic commit is enough for someone with
>>>>>> more context to go off of. Happy to provide more information if
>>>>>> needed.
>>>>>
>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>
>>>> I think stable is missing:
>>>>
>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>>
>>> I think you need to go back further than that, this one already
>>> unconditionally holds ->uring_lock around overflow flushing...
>>
>> Took a look, it's this one:
>>
>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>> Author: Pavel Begunkov <asml.silence@gmail.com>
>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>
>>     io_uring: always lock __io_cqring_overflow_flush
>>
>> Greg/stable, can you pick this one for 6.6-stable? It picks
>> cleanly.
>>
>> For 6.1, which is the other stable of that age that has the backport,
>> the attached patch will do the trick.
>>
>> With that, I believe it should be sorted. Hopefully that can make
>> 6.6.60 and 6.1.116.
> 
> Now queued up, thanks.

Thanks Greg!
diff mbox series

Patch

From 3f1c33f03386c481caf2044a836f3ca611094098 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence@gmail.com>
Date: Wed, 10 Apr 2024 02:26:54 +0100
Subject: [PATCH] io_uring: always lock __io_cqring_overflow_flush

Commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 upstream.

Conditional locking is never great, in case of
__io_cqring_overflow_flush(), which is a slow path, it's not justified.
Don't handle IOPOLL separately, always grab uring_lock for overflow
flushing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f902b161f02c..92c1aa8f3501 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -593,6 +593,8 @@  static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 	bool all_flushed;
 	size_t cqe_size = sizeof(struct io_uring_cqe);
 
+	lockdep_assert_held(&ctx->uring_lock);
+
 	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
 		return false;
 
@@ -647,12 +649,9 @@  static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 	bool ret = true;
 
 	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) {
-		/* iopoll syncs against uring_lock, not completion_lock */
-		if (ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_lock(&ctx->uring_lock);
+		mutex_lock(&ctx->uring_lock);
 		ret = __io_cqring_overflow_flush(ctx, false);
-		if (ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_unlock(&ctx->uring_lock);
+		mutex_unlock(&ctx->uring_lock);
 	}
 
 	return ret;
@@ -1405,6 +1404,8 @@  static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 	int ret = 0;
 	unsigned long check_cq;
 
+	lockdep_assert_held(&ctx->uring_lock);
+
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
 
-- 
2.45.2