
[1/1] iomap: propagate nowait to block layer

Message ID: ca8f7e4efb902ee6500ab5b1fafd67acb3224c45.1740533564.git.asml.silence@gmail.com (mailing list archive)
State: New

Commit Message

Pavel Begunkov Feb. 26, 2025, 1:33 a.m. UTC
There are reports of high io_uring submission latency for ext4 and xfs,
which is due to iomap not propagating nowait flag to the block layer
resulting in waiting for IO during tag allocation.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
Reported-by: wu lei <uwydoc@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/iomap/direct-io.c | 3 +++
 1 file changed, 3 insertions(+)
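
For context, the stall shows up at submission time. A minimal sketch of the kind of async O_DIRECT submission that hits this path (illustrative only, not taken from the linked report; error handling and O_DIRECT buffer alignment are elided):

/*
 * Illustrative liburing user: an O_DIRECT write submitted async.
 * With the bug above, io_uring_submit() itself can block waiting on
 * block-layer tag allocation under device queue pressure instead of
 * returning immediately. "buf" must be suitably aligned for O_DIRECT.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>

int submit_one_write(const char *path, void *buf, unsigned int len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	int fd;

	io_uring_queue_init(8, &ring, 0);
	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);

	/* expected to queue the IO and return without waiting */
	return io_uring_submit(&ring);
}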

Comments

Darrick J. Wong Feb. 26, 2025, 1:53 a.m. UTC | #1
On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
> There are reports of high io_uring submission latency for ext4 and xfs,
> which is due to iomap not propagating nowait flag to the block layer
> resulting in waiting for IO during tag allocation.
> 
> Cc: stable@vger.kernel.org
> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> Reported-by: wu lei <uwydoc@gmail.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/iomap/direct-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..25c5e87dbd94 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>  		WRITE_ONCE(iocb->private, bio);
>  	}
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		bio->bi_opf |= REQ_NOWAIT;

Shouldn't this go in iomap_dio_bio_opflags?

--D

> +
>  	if (dio->dops && dio->dops->submit_io)
>  		dio->dops->submit_io(iter, bio, pos);
>  	else
> -- 
> 2.48.1
>
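
For reference, a rough sketch of the suggestion: set the flag in the helper that computes bi_opf, abridged from the mainline iomap_dio_bio_opflags() of this era (this is not the patch as posted, and the exact body varies by kernel version):

static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
		const struct iomap *iomap, bool use_fua)
{
	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;

	/* proposed: propagate nowait where the other flags are built */
	if (dio->iocb->ki_flags & IOCB_NOWAIT)
		opflags |= REQ_NOWAIT;

	/* read vs write, FUA, etc. handled as in mainline (elided) */
	return opflags;
}

One wrinkle, discussed below: the mainline read path returns early from this helper, and iomap_dio_zero() builds its bio without calling it at all.
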
Pavel Begunkov Feb. 26, 2025, 2:06 a.m. UTC | #2
On 2/26/25 01:53, Darrick J. Wong wrote:
> On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
>> There are reports of high io_uring submission latency for ext4 and xfs,
>> which is due to iomap not propagating nowait flag to the block layer
>> resulting in waiting for IO during tag allocation.
>>
>> Cc: stable@vger.kernel.org
>> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
>> Reported-by: wu lei <uwydoc@gmail.com>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   fs/iomap/direct-io.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index b521eb15759e..25c5e87dbd94 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>>   		WRITE_ONCE(iocb->private, bio);
>>   	}
>>   
>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>> +		bio->bi_opf |= REQ_NOWAIT;
> 
> Shouldn't this go in iomap_dio_bio_opflags?

It can, if that's the preference, but iomap_dio_zero() would need
to have a separate check. It also affects 5.4, and I'm not sure
which version would be easier to backport.
Darrick J. Wong Feb. 26, 2025, 2:20 a.m. UTC | #3
On Wed, Feb 26, 2025 at 02:06:51AM +0000, Pavel Begunkov wrote:
> On 2/26/25 01:53, Darrick J. Wong wrote:
> > On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
> > > There are reports of high io_uring submission latency for ext4 and xfs,
> > > which is due to iomap not propagating nowait flag to the block layer
> > > resulting in waiting for IO during tag allocation.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> > > Reported-by: wu lei <uwydoc@gmail.com>
> > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > > ---
> > >   fs/iomap/direct-io.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index b521eb15759e..25c5e87dbd94 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
> > >   		WRITE_ONCE(iocb->private, bio);
> > >   	}
> > > +	if (iocb->ki_flags & IOCB_NOWAIT)
> > > +		bio->bi_opf |= REQ_NOWAIT;
> > 
> > Shouldn't this go in iomap_dio_bio_opflags?
> 
> It can, if that's the preference, but iomap_dio_zero() would need
> to have a separate check. It also affects 5.4, and I'm not sure
> which version would be easier to backport.

Yes, please don't go scattering the bi_opf setting code all around the
file.  Also, do you need to modify iomap_dio_zero?

--D

> -- 
> Pavel Begunkov
>
Pavel Begunkov Feb. 26, 2025, 2:46 a.m. UTC | #4
On 2/26/25 02:20, Darrick J. Wong wrote:
> On Wed, Feb 26, 2025 at 02:06:51AM +0000, Pavel Begunkov wrote:
>> On 2/26/25 01:53, Darrick J. Wong wrote:
>>> On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
>>>> There are reports of high io_uring submission latency for ext4 and xfs,
>>>> which is due to iomap not propagating nowait flag to the block layer
>>>> resulting in waiting for IO during tag allocation.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
>>>> Reported-by: wu lei <uwydoc@gmail.com>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>    fs/iomap/direct-io.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>>>> index b521eb15759e..25c5e87dbd94 100644
>>>> --- a/fs/iomap/direct-io.c
>>>> +++ b/fs/iomap/direct-io.c
>>>> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>>>>    		WRITE_ONCE(iocb->private, bio);
>>>>    	}
>>>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>>>> +		bio->bi_opf |= REQ_NOWAIT;
>>>
>>> Shouldn't this go in iomap_dio_bio_opflags?
>>
>> It can, if that's the preference, but iomap_dio_zero() would need
>> to have a separate check. It also affects 5.4, and I'm not sure
>> which version would be easier to backport.
> 
> Yes, please don't go scattering the bi_opf setting code all around the

Sure, even though the function already adjusts bi_opf through
bio_set_polled.
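
(For reference, bio_set_polled() in include/linux/bio.h looks roughly like this in recent kernels; note that it, too, only sets REQ_NOWAIT for async iocbs:)

static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
{
	bio->bi_opf |= REQ_POLLED;
	if (!is_sync_kiocb(kiocb))
		bio->bi_opf |= REQ_NOWAIT;
}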

> file.  Also, do you need to modify iomap_dio_zero?

Is there a reason why it can't be hit? It's in the same path, and
from a quick look ending up there seems as easy as writing just a
little beyond the file size.
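
For illustration, the separate check iomap_dio_zero() would need might look like this, a sketch against the mainline helper (signatures vary by kernel version; the ZERO_PAGE setup and submission are elided):

static void iomap_dio_zero(const struct iomap_iter *iter,
		struct iomap_dio *dio, loff_t pos, unsigned len)
{
	struct bio *bio;
	blk_opf_t opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;

	/* hypothetical: sub-block zeroing bios also need the flag
	 * if it is set centrally in iomap_dio_bio_opflags() */
	if (dio->iocb->ki_flags & IOCB_NOWAIT)
		opf |= REQ_NOWAIT;

	bio = iomap_dio_alloc_bio(iter, dio, 1, opf);
	/* ... add ZERO_PAGE(0), set bi_iter/bi_end_io, submit (elided) ... */
}
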
Dave Chinner Feb. 26, 2025, 4:52 a.m. UTC | #5
On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
> There are reports of high io_uring submission latency for ext4 and xfs,
> which is due to iomap not propagating nowait flag to the block layer
> resulting in waiting for IO during tag allocation.
> 
> Cc: stable@vger.kernel.org
> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> Reported-by: wu lei <uwydoc@gmail.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/iomap/direct-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..25c5e87dbd94 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>  		WRITE_ONCE(iocb->private, bio);
>  	}
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		bio->bi_opf |= REQ_NOWAIT;

ISTR that this was omitted on purpose because REQ_NOWAIT doesn't
work in the way iomap filesystems expect IO to behave.

I think it has to do with large direct IOs that require multiple
calls to submit_bio(). Each bio that is allocated and submitted
takes a reference to the iomap_dio object, and the iomap_dio is not
completed until that reference count goes to zero.

Hence if we have submitted a series of bios in an IOCB_NOWAIT DIO
and then the next bio submission in the DIO triggers a REQ_NOWAIT
condition, that bio is marked with a BLK_STS_AGAIN and completed.
This error is then caught by the iomap dio bio completion function,
recorded in the iomap_dio structure, but because there are still
bios in flight, the iomap_dio ref count does not fall to zero and so
the DIO itself is not completed.

Then submission loops again, sees dio->error is set and aborts
submission. Because this is AIO, and the iomap_dio refcount is
non-zero at this point, __iomap_dio_rw() returns -EIOCBQUEUED.
It does not return the -EAGAIN state that was reported to bio
completion because the overall DIO has not yet been completed
and all the IO completion status gathered.

Hence when the in flight async bios actually complete, they drop the
iomap dio reference count to zero, iomap_dio_complete() is called,
and the BLK_STS_AGAIN error is gathered from the previous submission
failure. This then calls AIO completion, and reports a -EAGAIN error
to the AIO/io_uring completion code.

IOWs, -EAGAIN is *not reported to the IO submitter* that needs
this information to defer and resubmit the IO - it is reported to IO
completion where it is completely useless and, most likely, not in a
context that can resubmit the IO.

Put simply: any code that submits multiple bios (either individually
or as a bio chain) for a single high level IO can not use REQ_NOWAIT
reliably for async IO submission.

We have similar limitations on IO polling (IOCB_HIPRI) in iomap, but
I'm not sure if REQ_NOWAIT can be handled the same way. i.e. only
setting REQ_NOWAIT on the first bio means that the second+ bio can
still block and cause latency issues.

So, yeah, fixing this source of latency is not as simple as just
setting REQ_NOWAIT. I don't know if there is a better solution than
what we currently have, but causing large AIO DIOs to
randomly fail with EAGAIN reported at IO completion (with the likely
result of unexpected data corruption) is far worse behaviour than
occasionally having to deal with a long IO submission latency.

-Dave.
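
The completion-side flow described above is visible in the bio end_io handler in fs/iomap/direct-io.c; abridged here (the page-dirtying and polled-IO details are elided):

/* Abridged: a nowait-failed bio stashes BLK_STS_AGAIN in the dio here,
 * but the dio (and hence the AIO/io_uring request) only completes when
 * the last bio reference is dropped. */
static void iomap_dio_bio_end_io(struct bio *bio)
{
	struct iomap_dio *dio = bio->bi_private;

	if (bio->bi_status)
		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));

	if (atomic_dec_and_test(&dio->ref)) {
		/* last reference: complete the whole DIO; only here does
		 * a previously stashed -EAGAIN finally surface */
	}
	bio_put(bio);
}
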
Pavel Begunkov Feb. 26, 2025, 12:33 p.m. UTC | #6
On 2/26/25 04:52, Dave Chinner wrote:
> On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
>> There are reports of high io_uring submission latency for ext4 and xfs,
>> which is due to iomap not propagating nowait flag to the block layer
>> resulting in waiting for IO during tag allocation.
>>
>> Cc: stable@vger.kernel.org
>> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
>> Reported-by: wu lei <uwydoc@gmail.com>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   fs/iomap/direct-io.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index b521eb15759e..25c5e87dbd94 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>>   		WRITE_ONCE(iocb->private, bio);
>>   	}
>>   
>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>> +		bio->bi_opf |= REQ_NOWAIT;
> 
> ISTR that this was omitted on purpose because REQ_NOWAIT doesn't
> work in the way iomap filesystems expect IO to behave.
> 
> I think it has to do with large direct IOs that require multiple
> calls to submit_bio(). Each bio that is allocated and submitted
> takes a reference to the iomap_dio object, and the iomap_dio is not
> completed until that reference count goes to zero.
> 
> Hence if we have submitted a series of bios in an IOCB_NOWAIT DIO
> and then the next bio submission in the DIO triggers a REQ_NOWAIT
> condition, that bio is marked with a BLK_STS_AGAIN and completed.
> This error is then caught by the iomap dio bio completion function,
> recorded in the iomap_dio structure, but because there are still
> bios in flight, the iomap_dio ref count does not fall to zero and so
> the DIO itself is not completed.
> 
> Then submission loops again, sees dio->error is set and aborts
> submission. Because this is AIO, and the iomap_dio refcount is
> non-zero at this point, __iomap_dio_rw() returns -EIOCBQUEUED.
> It does not return the -EAGAIN state that was reported to bio
> completion because the overall DIO has not yet been completed
> and all the IO completion status gathered.
> 
> Hence when the in flight async bios actually complete, they drop the
> iomap dio reference count to zero, iomap_dio_complete() is called,
> and the BLK_STS_AGAIN error is gathered from the previous submission
> failure. This then calls AIO completion, and reports a -EAGAIN error
> to the AIO/io_uring completion code.
> 
> IOWs, -EAGAIN is *not reported to the IO submitter* that needs
> this information to defer and resubmit the IO - it is reported to IO
> completion where it is completely useless and, most likely, not in a
> context that can resubmit the IO.
> 
> Put simply: any code that submits multiple bios (either individually
> or as a bio chain) for a single high level IO can not use REQ_NOWAIT
> reliably for async IO submission.

I know the issue, but admittedly forgot about it here, thanks for
the reminder! Considering that attempts to change the situation failed
some years ago and I haven't heard about it since, I don't think
it's going to change any time soon.

So how about following what the block layer does and disabling
multi-bio nowait submissions for async IO?

if (!iocb_is_sync(iocb)) {
	if (multi_bio)
		return -EAGAIN;
	bio_opf |= REQ_NOWAIT;
}

Is there anything else but io_uring and AIO that can issue async
IO through this path?
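
Spelled out a little more, the idea might look like the following; is_sync_kiocb() is the existing predicate, while the helper name and "multi_bio" are placeholders for whatever state iomap would have to track:

/* Placeholder sketch, not existing iomap code. */
static blk_status_t iomap_dio_set_nowait(struct iomap_dio *dio,
					 struct bio *bio, bool multi_bio)
{
	if (!(dio->iocb->ki_flags & IOCB_NOWAIT))
		return BLK_STS_OK;

	/* async nowait DIO: refuse to go multi-bio, since a mid-DIO
	 * BLK_STS_AGAIN cannot be reported back to the submitter */
	if (!is_sync_kiocb(dio->iocb) && multi_bio)
		return BLK_STS_AGAIN;

	bio->bi_opf |= REQ_NOWAIT;
	return BLK_STS_OK;
}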

> We have similar limitations on IO polling (IOCB_HIPRI) in iomap, but
> I'm not sure if REQ_NOWAIT can be handled the same way. i.e. only
> setting REQ_NOWAIT on the first bio means that the second+ bio can
> still block and cause latency issues.
> 
> So, yeah, fixing this source of latency is not as simple as just
> setting REQ_NOWAIT. I don't know if there is a better solution that
> what we currently have, but causing large AIO DIOs to
> randomly fail with EAGAIN reported at IO completion (with the likely
> result of unexpected data corruption) is far worse behaviour that
> occasionally having to deal with a long IO submission latency.

At the end of the day, it's waiting for IO, the very thing the
user doesn't want to see for async IO, and that's pretty much what
makes AIO borderline unusable. We just can't have it for an asynchronous
interface. If we can't fix it up here, the only other option I see
is to push all such io_uring requests to a slow path where we can
block, and that'd be quite a large regression.
Dave Chinner Feb. 26, 2025, 8:49 p.m. UTC | #7
On Wed, Feb 26, 2025 at 12:33:21PM +0000, Pavel Begunkov wrote:
> On 2/26/25 04:52, Dave Chinner wrote:
> > On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
> > > There are reports of high io_uring submission latency for ext4 and xfs,
> > > which is due to iomap not propagating nowait flag to the block layer
> > > resulting in waiting for IO during tag allocation.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> > > Reported-by: wu lei <uwydoc@gmail.com>
> > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > > ---
> > >   fs/iomap/direct-io.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index b521eb15759e..25c5e87dbd94 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
> > >   		WRITE_ONCE(iocb->private, bio);
> > >   	}
> > > +	if (iocb->ki_flags & IOCB_NOWAIT)
> > > +		bio->bi_opf |= REQ_NOWAIT;
> > 
> > ISTR that this was omitted on purpose because REQ_NOWAIT doesn't
> > work in the way iomap filesystems expect IO to behave.
> > 
> > I think it has to do with large direct IOs that require multiple
> > calls to submit_bio(). Each bio that is allocated and submitted
> > takes a reference to the iomap_dio object, and the iomap_dio is not
> > completed until that reference count goes to zero.
> > 
> > Hence if we have submitted a series of bios in an IOCB_NOWAIT DIO
> > and then the next bio submission in the DIO triggers a REQ_NOWAIT
> > condition, that bio is marked with a BLK_STS_AGAIN and completed.
> > This error is then caught by the iomap dio bio completion function,
> > recorded in the iomap_dio structure, but because there are still
> > bios in flight, the iomap_dio ref count does not fall to zero and so
> > the DIO itself is not completed.
> > 
> > Then submission loops again, sees dio->error is set and aborts
> > submission. Because this is AIO, and the iomap_dio refcount is
> > non-zero at this point, __iomap_dio_rw() returns -EIOCBQUEUED.
> > It does not return the -EAGAIN state that was reported to bio
> > completion because the overall DIO has not yet been completed
> > and all the IO completion status gathered.
> > 
> > Hence when the in flight async bios actually complete, they drop the
> > iomap dio reference count to zero, iomap_dio_complete() is called,
> > and the BLK_STS_AGAIN error is gathered from the previous submission
> > failure. This then calls AIO completion, and reports a -EAGAIN error
> > to the AIO/io_uring completion code.
> > 
> > IOWs, -EAGAIN is *not reported to the IO submitter* that needs
> > this information to defer and resubmit the IO - it is reported to IO
> > completion where it is completely useless and, most likely, not in a
> > context that can resubmit the IO.
> > 
> > Put simply: any code that submits multiple bios (either individually
> > or as a bio chain) for a single high level IO can not use REQ_NOWAIT
> > reliably for async IO submission.
> 
> I know the issue, but admittedly forgot about it here, thanks for
> the reminder! Considering that attempts to change the situation failed
> some years ago and I haven't heard about it since, I don't think
> it's going to change any time soon.
> 
> So how about following what the block layer does and disabling
> multi-bio nowait submissions for async IO?
> 
> if (!iocb_is_sync(iocb)) {
> 	if (multi_bio)
> 		return -EAGAIN;
> 	bio_opf |= REQ_NOWAIT;
> }

How do we know it's going to be multi-bio before we actually start
packing the data into the bios? More below, because I kinda pointed
out how this might be solved...

> Is there anything else but io_uring and AIO that can issue async
> IO through this path?

We can't assume anything about the callers in the lower layers.
Anything that can call the VFS read/write paths could be using async
IO.

> > We have similar limitations on IO polling (IOCB_HIPRI) in iomap, but
> > I'm not sure if REQ_NOWAIT can be handled the same way. i.e. only
> > setting REQ_NOWAIT on the first bio means that the second+ bio can
> > still block and cause latency issues.

Please have a look at how IOCB_HIPRI is handled by iomap for
multi-bio IOs. I -think- the same can be done with IOMAP_NOWAIT
bios, because the bio IO completion for the EAGAIN error will be
present on the iomap_dio by the time submit_bio returns. i.e.
REQ_NOWAIT can be set on the first bio in the submission chain,
but only on the first bio.

i.e. if REQ_NOWAIT causes the first bio submission to fail with
-EAGAIN being reported to completion, we abort the submission of
more bios because dio->error is now set. As there are no actual bios
in flight at this point in time, the only reference to the iomap_dio
is held by the iomap submission code.  Hence as we finalise the
aborted DIO submission, __iomap_dio_rw() drops the last reference
and iomap_dio_rw() calls iomap_dio_complete() on the iomap_dio. This
then gathers the -EAGAIN error that was stashed in the iomap_dio
and returns it to the caller.

i.e. I *think* this "REQ_NOWAIT only for the first bio" method will
solve most of the issues that cause submission latency (especially
for apps doing small IOs), but still behave correctly when large,
multi-bio DIOs are submitted.

Confirming that the logic is sound and writing fstests that exercise
the functionality to demonstrate your eventual kernel change works
correctly (and that we don't break it in future) is your problem,
though.
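
For comparison, this is roughly how iomap_dio_bio_iter() already degrades IOCB_HIPRI once a DIO goes multi-bio (abridged from mainline); the suggestion above would treat nowait analogously:

nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
	/* ... allocate and fill one bio ... */

	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
	/*
	 * We can only poll for single bio I/Os.
	 */
	if (nr_pages)
		dio->iocb->ki_flags &= ~IOCB_HIPRI;

	/* hypothetical analogue: only the first bio keeps REQ_NOWAIT;
	 * second and subsequent bios would have it cleared */

	iomap_dio_submit_bio(iter, dio, bio, pos);
	/* ... */
} while (nr_pages);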

> > So, yeah, fixing this source of latency is not as simple as just
> > setting REQ_NOWAIT. I don't know if there is a better solution than
> > what we currently have, but causing large AIO DIOs to
> > randomly fail with EAGAIN reported at IO completion (with the likely
> > result of unexpected data corruption) is far worse behaviour than
> > occasionally having to deal with a long IO submission latency.
> 
> At the end of the day, it's waiting for IO, the very thing the
> user doesn't want to see for async IO, and that's pretty much what
> makes AIO borderline unusable. We just can't have it for an asynchronous
> interface.

Tough cookies. Random load-related IO errors that can result in
unexpected user data corruption are a far worse outcome than an
application suffering from a bit of unexpected latency. You are not
going to win that argument, so don't bother wasting time on it.

> If we can't fix it up here, the only other option I see
> is to push all such io_uring requests to a slow path where we can
> block, and that'd be quite a large regression.

Don't be so melodramatic. Async IO has always been, and will always
be, -best effort- within the constraints of filesystem
implementation, data integrity and behavioural correctness.

-Dave.

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..25c5e87dbd94 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
 		WRITE_ONCE(iocb->private, bio);
 	}
 
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+
 	if (dio->dops && dio->dops->submit_io)
 		dio->dops->submit_io(iter, bio, pos);
 	else