Message ID | c34e6c38-ca47-439a-baf1-3489c05a65a8@kernel.dk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59") | expand |
On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote: > On 11/3/24 5:06 PM, Jens Axboe wrote: >> On 11/3/24 5:01 PM, Keith Busch wrote: >>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote: >>>> On 11/3/24 4:47 PM, Andrew Marshall wrote: >>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely >>>>> problematic commit simply by browsing git log. As indicated above; >>>>> reverting that atop 6.6.59 results in success. Since it is passing on >>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some >>>>> other semantic merge conflict. Unfortunately I do not have a compact, >>>>> minimal reproducer, but can provide my large one (it is testing a >>>>> larger build process in a VM) if needed?there are some additional >>>>> details in the above-linked downstream bug report, though. I hope that >>>>> having identified the problematic commit is enough for someone with >>>>> more context to go off of. Happy to provide more information if >>>>> needed. >>>> >>>> Don't worry about not having a reproducer, having the backport commit >>>> pin pointed will do just fine. I'll take a look at this. >>> >>> I think stable is missing: >>> >>> 6b231248e97fc3 ("io_uring: consolidate overflow flushing") >> >> I think you need to go back further than that, this one already >> unconditionally holds ->uring_lock around overflow flushing... > > Took a look, it's this one: > > commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 > Author: Pavel Begunkov <asml.silence@gmail.com> > Date: Wed Apr 10 02:26:54 2024 +0100 > > io_uring: always lock __io_cqring_overflow_flush > > Greg/stable, can you pick this one for 6.6-stable? It picks > cleanly. > > For 6.1, which is the other stable of that age that has the backport, > the attached patch will do the trick. > > With that, I believe it should be sorted. Hopefully that can make > 6.6.60 and 6.1.116. > > -- > Jens Axboe > Attachments: > * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my reproducer (run a few times). Your first quick patch also passed, for what it’s worth. Thanks for the quick responses!
On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote: > On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote: >> On 11/3/24 5:06 PM, Jens Axboe wrote: >>> On 11/3/24 5:01 PM, Keith Busch wrote: >>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote: >>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote: >>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely >>>>>> problematic commit simply by browsing git log. As indicated above; >>>>>> reverting that atop 6.6.59 results in success. Since it is passing on >>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some >>>>>> other semantic merge conflict. Unfortunately I do not have a compact, >>>>>> minimal reproducer, but can provide my large one (it is testing a >>>>>> larger build process in a VM) if needed?there are some additional >>>>>> details in the above-linked downstream bug report, though. I hope that >>>>>> having identified the problematic commit is enough for someone with >>>>>> more context to go off of. Happy to provide more information if >>>>>> needed. >>>>> >>>>> Don't worry about not having a reproducer, having the backport commit >>>>> pin pointed will do just fine. I'll take a look at this. >>>> >>>> I think stable is missing: >>>> >>>> 6b231248e97fc3 ("io_uring: consolidate overflow flushing") >>> >>> I think you need to go back further than that, this one already >>> unconditionally holds ->uring_lock around overflow flushing... >> >> Took a look, it's this one: >> >> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 >> Author: Pavel Begunkov <asml.silence@gmail.com> >> Date: Wed Apr 10 02:26:54 2024 +0100 >> >> io_uring: always lock __io_cqring_overflow_flush >> >> Greg/stable, can you pick this one for 6.6-stable? It picks >> cleanly. >> >> For 6.1, which is the other stable of that age that has the backport, >> the attached patch will do the trick. >> >> With that, I believe it should be sorted. Hopefully that can make >> 6.6.60 and 6.1.116. >> >> -- >> Jens Axboe >> Attachments: >> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch > > Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my > reproducer (run a few times). Your first quick patch also passed, for > what it’s worth. Thanks for the quick responses! Correction: I cherry-picked and tested 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you identified), not 6b231248e97fc3. Apologies for any confusion.
On 11/4/24 6:17 AM, Andrew Marshall wrote: > On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote: >> On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote: >>> On 11/3/24 5:06 PM, Jens Axboe wrote: >>>> On 11/3/24 5:01 PM, Keith Busch wrote: >>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote: >>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote: >>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely >>>>>>> problematic commit simply by browsing git log. As indicated above; >>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on >>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some >>>>>>> other semantic merge conflict. Unfortunately I do not have a compact, >>>>>>> minimal reproducer, but can provide my large one (it is testing a >>>>>>> larger build process in a VM) if needed?there are some additional >>>>>>> details in the above-linked downstream bug report, though. I hope that >>>>>>> having identified the problematic commit is enough for someone with >>>>>>> more context to go off of. Happy to provide more information if >>>>>>> needed. >>>>>> >>>>>> Don't worry about not having a reproducer, having the backport commit >>>>>> pin pointed will do just fine. I'll take a look at this. >>>>> >>>>> I think stable is missing: >>>>> >>>>> 6b231248e97fc3 ("io_uring: consolidate overflow flushing") >>>> >>>> I think you need to go back further than that, this one already >>>> unconditionally holds ->uring_lock around overflow flushing... >>> >>> Took a look, it's this one: >>> >>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 >>> Author: Pavel Begunkov <asml.silence@gmail.com> >>> Date: Wed Apr 10 02:26:54 2024 +0100 >>> >>> io_uring: always lock __io_cqring_overflow_flush >>> >>> Greg/stable, can you pick this one for 6.6-stable? It picks >>> cleanly. >>> >>> For 6.1, which is the other stable of that age that has the backport, >>> the attached patch will do the trick. >>> >>> With that, I believe it should be sorted. Hopefully that can make >>> 6.6.60 and 6.1.116. >>> >>> -- >>> Jens Axboe >>> Attachments: >>> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch >> >> Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my >> reproducer (run a few times). Your first quick patch also passed, for >> what it?s worth. Thanks for the quick responses! > > Correction: I cherry-picked and tested > 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you > identified), not 6b231248e97fc3. Apologies for any confusion. Thanks for clarifying, so it's as expected. Hopefully -stable can pick this backport up soonish, so the next stable release will be sorted. Thanks for reporting the issue!
On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote: > On 11/3/24 5:06 PM, Jens Axboe wrote: > > On 11/3/24 5:01 PM, Keith Busch wrote: > >> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote: > >>> On 11/3/24 4:47 PM, Andrew Marshall wrote: > >>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely > >>>> problematic commit simply by browsing git log. As indicated above; > >>>> reverting that atop 6.6.59 results in success. Since it is passing on > >>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some > >>>> other semantic merge conflict. Unfortunately I do not have a compact, > >>>> minimal reproducer, but can provide my large one (it is testing a > >>>> larger build process in a VM) if needed?there are some additional > >>>> details in the above-linked downstream bug report, though. I hope that > >>>> having identified the problematic commit is enough for someone with > >>>> more context to go off of. Happy to provide more information if > >>>> needed. > >>> > >>> Don't worry about not having a reproducer, having the backport commit > >>> pin pointed will do just fine. I'll take a look at this. > >> > >> I think stable is missing: > >> > >> 6b231248e97fc3 ("io_uring: consolidate overflow flushing") > > > > I think you need to go back further than that, this one already > > unconditionally holds ->uring_lock around overflow flushing... > > Took a look, it's this one: > > commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 > Author: Pavel Begunkov <asml.silence@gmail.com> > Date: Wed Apr 10 02:26:54 2024 +0100 > > io_uring: always lock __io_cqring_overflow_flush > > Greg/stable, can you pick this one for 6.6-stable? It picks > cleanly. > > For 6.1, which is the other stable of that age that has the backport, > the attached patch will do the trick. > > With that, I believe it should be sorted. Hopefully that can make > 6.6.60 and 6.1.116. Now queued up, thanks. greg k-h
On 11/5/24 11:05 PM, Greg Kroah-Hartman wrote: > On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote: >> On 11/3/24 5:06 PM, Jens Axboe wrote: >>> On 11/3/24 5:01 PM, Keith Busch wrote: >>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote: >>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote: >>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely >>>>>> problematic commit simply by browsing git log. As indicated above; >>>>>> reverting that atop 6.6.59 results in success. Since it is passing on >>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some >>>>>> other semantic merge conflict. Unfortunately I do not have a compact, >>>>>> minimal reproducer, but can provide my large one (it is testing a >>>>>> larger build process in a VM) if needed?there are some additional >>>>>> details in the above-linked downstream bug report, though. I hope that >>>>>> having identified the problematic commit is enough for someone with >>>>>> more context to go off of. Happy to provide more information if >>>>>> needed. >>>>> >>>>> Don't worry about not having a reproducer, having the backport commit >>>>> pin pointed will do just fine. I'll take a look at this. >>>> >>>> I think stable is missing: >>>> >>>> 6b231248e97fc3 ("io_uring: consolidate overflow flushing") >>> >>> I think you need to go back further than that, this one already >>> unconditionally holds ->uring_lock around overflow flushing... >> >> Took a look, it's this one: >> >> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 >> Author: Pavel Begunkov <asml.silence@gmail.com> >> Date: Wed Apr 10 02:26:54 2024 +0100 >> >> io_uring: always lock __io_cqring_overflow_flush >> >> Greg/stable, can you pick this one for 6.6-stable? It picks >> cleanly. >> >> For 6.1, which is the other stable of that age that has the backport, >> the attached patch will do the trick. >> >> With that, I believe it should be sorted. Hopefully that can make >> 6.6.60 and 6.1.116. > > Now queued up, thanks. Thanks Greg!
From 3f1c33f03386c481caf2044a836f3ca611094098 Mon Sep 17 00:00:00 2001 From: Pavel Begunkov <asml.silence@gmail.com> Date: Wed, 10 Apr 2024 02:26:54 +0100 Subject: [PATCH] io_uring: always lock __io_cqring_overflow_flush Commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 upstream. Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> --- io_uring/io_uring.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index f902b161f02c..92c1aa8f3501 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -593,6 +593,8 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) bool all_flushed; size_t cqe_size = sizeof(struct io_uring_cqe); + lockdep_assert_held(&ctx->uring_lock); + if (!force && __io_cqring_events(ctx) == ctx->cq_entries) return false; @@ -647,12 +649,9 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx) bool ret = true; if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) { - /* iopoll syncs against uring_lock, not completion_lock */ - if (ctx->flags & IORING_SETUP_IOPOLL) - mutex_lock(&ctx->uring_lock); + mutex_lock(&ctx->uring_lock); ret = __io_cqring_overflow_flush(ctx, false); - if (ctx->flags & IORING_SETUP_IOPOLL) - mutex_unlock(&ctx->uring_lock); + mutex_unlock(&ctx->uring_lock); } return ret; @@ -1405,6 +1404,8 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min) int ret = 0; unsigned long check_cq; + lockdep_assert_held(&ctx->uring_lock); + if (!io_allowed_run_tw(ctx)) return -EEXIST; -- 2.45.2