diff mbox series

block: Allow REQ_FUA|REQ_READ

Message ID 20250311133517.3095878-1-kent.overstreet@linux.dev (mailing list archive)
State New
Headers show
Series block: Allow REQ_FUA|REQ_READ | expand

Commit Message

Kent Overstreet March 11, 2025, 1:35 p.m. UTC
REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
the same as writes.

This is useful for when the filesystem gets a checksum error, it's
possible that a bit was flipped in the controller cache, and when we
retry we want to retry the entire IO, not just from cache.

The nvme driver already passes through REQ_FUA for reads, not just
writes, so disabling the warning is sufficient to start using it, and
bcachefs is implementing additional retries for checksum errors so can
immediately use it.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/blk-core.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Comments

Niklas Cassel March 11, 2025, 2:53 p.m. UTC | #1
Hello Kent,

On Tue, Mar 11, 2025 at 09:35:16AM -0400, Kent Overstreet wrote:
> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> the same as writes.
> 
> This is useful for when the filesystem gets a checksum error, it's
> possible that a bit was flipped in the controller cache, and when we
> retry we want to retry the entire IO, not just from cache.

Looking at ATA Command Set - 6 (ACS-6),
7.23 READ FPDMA QUEUED - 60h

"""
If the Forced Unit Access (FUA) bit is set to one, the device shall retrieve
the data from the non-volatile media regardless of whether the device holds
the requested information in the volatile cache.

If the device holds a modified copy of the requested data as a result of
having volatile cached writes, the modified data shall be written to the
non-volatile media before being retrieved from the non-volatile media as
part of this operation.
"""

So IIUC, at least for ATA, if something is corrupted in the volatile
write cache, setting the FUA bit will ensure that the corruption will
get propagated to the non-volatile media.


Kind regards,
Niklas
Martin K. Petersen March 11, 2025, 3:02 p.m. UTC | #2
Kent,

> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> the same as writes.

FUA does not bypass anything. On the contrary, A FUA READ causes data in
the volatile cache (if any) to be flushed to non-volatile storage
(either cache or media or both). So I'm afraid it has the exact opposite
effect of what you are looking for.
Kent Overstreet March 11, 2025, 3:09 p.m. UTC | #3
On Tue, Mar 11, 2025 at 03:53:51PM +0100, Niklas Cassel wrote:
> Hello Kent,
> 
> On Tue, Mar 11, 2025 at 09:35:16AM -0400, Kent Overstreet wrote:
> > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> > the same as writes.
> > 
> > This is useful for when the filesystem gets a checksum error, it's
> > possible that a bit was flipped in the controller cache, and when we
> > retry we want to retry the entire IO, not just from cache.
> 
> Looking at ATA Command Set - 6 (ACS-6),
> 7.23 READ FPDMA QUEUED - 60h
> 
> """
> If the Forced Unit Access (FUA) bit is set to one, the device shall retrieve
> the data from the non-volatile media regardless of whether the device holds
> the requested information in the volatile cache.
> 
> If the device holds a modified copy of the requested data as a result of
> having volatile cached writes, the modified data shall be written to the
> non-volatile media before being retrieved from the non-volatile media as
> part of this operation.
> """
> 
> So IIUC, at least for ATA, if something is corrupted in the volatile
> write cache, setting the FUA bit will ensure that the corruption will
> get propagated to the non-volatile media.

Corrupted in the volatile writeback cache is not the expected case - we
don't usually read data we've just written.

Generally, the data will be in the device's cache only because we just
read it, and we don't want to retry the read from the device cache.

It's possible that either the device's cache was faulty, or there was a
transient error in reading from flash (SSD ec algorithms are effectively
probabilistic these days), so in either case retrying with a FUA read
has a much better chance of clearing a transient error and correctly
reading the bad data.
Keith Busch March 11, 2025, 3:13 p.m. UTC | #4
On Tue, Mar 11, 2025 at 03:53:51PM +0100, Niklas Cassel wrote:
> So IIUC, at least for ATA, if something is corrupted in the volatile
> write cache, setting the FUA bit will ensure that the corruption will
> get propagated to the non-volatile media.

It's not necessarily going to corrupt persistent media if the volatile
write cache is corrupted. It should just apply only to data written to
the volatile write cache that hasn't been flushed to the media. If the
device reads persisted media data into a corrupted cache, the FUA here
should catch that and see good data.

But if you haven't flushed previous writes from a corrupted volatile
write cache, then I believe you're right: the read FUA will presist that
corruption. Same is true for NVMe.
diff mbox series

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index d6c4fa3943b5..7b1103eb877d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -793,20 +793,21 @@  void submit_bio_noacct(struct bio *bio)
 			goto end_io;
 	}
 
+	if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
+			 bio_op(bio) != REQ_OP_WRITE &&
+			 bio_op(bio) != REQ_OP_ZONE_APPEND))
+		goto end_io;
+
 	/*
 	 * Filter flush bio's early so that bio based drivers without flush
 	 * support don't have to worry about them.
 	 */
-	if (op_is_flush(bio->bi_opf)) {
-		if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
-				 bio_op(bio) != REQ_OP_ZONE_APPEND))
+	if (op_is_flush(bio->bi_opf) &&
+	    !bdev_write_cache(bdev)) {
+		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+		if (!bio_sectors(bio)) {
+			status = BLK_STS_OK;
 			goto end_io;
-		if (!bdev_write_cache(bdev)) {
-			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-			if (!bio_sectors(bio)) {
-				status = BLK_STS_OK;
-				goto end_io;
-			}
 		}
 	}