bcachefs: fix incorrect usage of REQ_OP_FLUSH

Message ID 20240111073655.2095423-1-hch@lst.de (mailing list archive)
State New, archived
Series bcachefs: fix incorrect usage of REQ_OP_FLUSH

Commit Message

Christoph Hellwig Jan. 11, 2024, 7:36 a.m. UTC
REQ_OP_FLUSH is only for internal use in the blk-mq and request based
drivers. File systems and other block layer consumers must use
REQ_OP_WRITE | REQ_PREFLUSH as documented in
Documentation/block/writeback_cache_control.rst.
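
A minimal sketch of the documented pattern; flush_endio and
issue_cache_flush are illustrative names here, not kernel symbols:

	static void flush_endio(struct bio *bio)
	{
		/* bio->bi_status carries the result of the flush */
		bio_put(bio);
	}

	static void issue_cache_flush(struct block_device *bdev)
	{
		/* an empty REQ_OP_WRITE | REQ_PREFLUSH bio is a pure cache flush */
		struct bio *bio = bio_alloc(bdev, 0,
					    REQ_OP_WRITE | REQ_PREFLUSH,
					    GFP_KERNEL);

		bio->bi_end_io = flush_endio;
		submit_bio(bio);
	}

For the synchronous case the blkdev_issue_flush() helper wraps the same
pattern.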

While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
get the proper flush state machine handling, and completely fails
for any bio based drivers, including all the stacking drivers.  The
block layer will also get a check in 6.8 to reject this use case
entirely.
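
The check is roughly of the following shape (a sketch with a
hypothetical helper name, not the verbatim mainline code):

	static bool bio_op_valid_for_submission(struct bio *bio)
	{
		switch (bio_op(bio)) {
		case REQ_OP_FLUSH:
			/*
			 * REQ_OP_FLUSH only exists inside struct request,
			 * synthesized by the blk-mq flush state machine;
			 * a bio carrying it is a caller bug.
			 */
			return false;
		default:
			return true;
		}
	}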

[Note: completely untested, but as this never got fixed since the
original bug report in November:

   https://bugzilla.kernel.org/show_bug.cgi?id=218184

and the discussion in December:

    https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/

this seems to be the best way to force it]

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/bcachefs/fs-io.c      | 2 +-
 fs/bcachefs/journal_io.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

Comments

Kent Overstreet Jan. 11, 2024, 4:37 p.m. UTC | #1
On Thu, Jan 11, 2024 at 08:36:55AM +0100, Christoph Hellwig wrote:
> REQ_OP_FLUSH is only for internal use in the blk-mq and request based
> drivers. File systems and other block layer consumers must use
> REQ_OP_WRITE | REQ_PREFLUSH as documented in
> Documentation/block/writeback_cache_control.rst.
> 
> While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
> get the proper flush state machine handling, and completely fails
> for any bio based drivers, including all the stacking drivers.  The
> block layer will also get a check in 6.8 to reject this use case
> entirely.
> 
> [Note: completely untested, but as this never got fixed since the
> original bug report in November:
> 
>    https://bugzilla.kernel.org/show_bug.cgi?id=218184
> 
> and the discussion in December:
> 
>     https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/
> 
> this seems to be the best way to force it]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Hey Christoph, thanks - applied.
Kent Overstreet Jan. 19, 2024, 9:32 p.m. UTC | #2
On Thu, Jan 11, 2024 at 08:36:55AM +0100, Christoph Hellwig wrote:
> REQ_OP_FLUSH is only for internal use in the blk-mq and request based
> drivers. File systems and other block layer consumers must use
> REQ_OP_WRITE | REQ_PREFLUSH as documented in
> Documentation/block/writeback_cache_control.rst.
> 
> While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
> get the proper flush state machine handling, and completely fails
> for any bio based drivers, including all the stacking drivers.  The
> block layer will also get a check in 6.8 to reject this use case
> entirely.
> 
> [Note: completely untested, but as this never got fixed since the
> original bug report in November:
> 
>    https://bugzilla.kernel.org/show_bug.cgi?id=218184
> 
> and the discussion in December:
> 
>     https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/
> 
> this seems to be the best way to force it]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I've got a new bug report with this patch:
https://www.reddit.com/r/bcachefs/comments/19a2u3c/error_writing_journal_entry/

We seem to be getting -EOPNOTSUPP back from REQ_OP_FLUSH?
Christoph Hellwig Jan. 22, 2024, 6:30 a.m. UTC | #3
On Fri, Jan 19, 2024 at 04:32:44PM -0500, Kent Overstreet wrote:
> 
> I've got a new bug report with this patch:
> https://www.reddit.com/r/bcachefs/comments/19a2u3c/error_writing_journal_entry/
> 
> We seem to be getting -EOPNOTSUPP back from REQ_OP_FLUSH?

With the patch you are replying to you will not get -EOPNOTSUPP
for REQ_OP_FLUSH, because bcachefs no longer sends it, as intended.

Without this patch, as in current mainline, you will get -EOPNOTSUPP
because bcachefs keeps sending REQ_OP_FLUSH, and the block layer now
checks for that to catch bugs like the one fixed with this patch.
Kent Overstreet Jan. 22, 2024, 6:37 a.m. UTC | #4
On Mon, Jan 22, 2024 at 07:30:07AM +0100, Christoph Hellwig wrote:
> On Fri, Jan 19, 2024 at 04:32:44PM -0500, Kent Overstreet wrote:
> > 
> > I've got a new bug report with this patch:
> > https://www.reddit.com/r/bcachefs/comments/19a2u3c/error_writing_journal_entry/
> > 
> > We seem to be getting -EOPNOTSUPP back from REQ_OP_FLUSH?
> 
> With the patch you are replying to you will not get -EOPNOTSUPP
> for REQ_OP_FLUSH, because bcachefs no longer sends it, as intended.
> 
> Without this patch, as in current mainline, you will get -EOPNOTSUPP
> because bcachefs keeps sending REQ_OP_FLUSH, and the block layer now
> checks for that to catch bugs like the one fixed with this patch.

Then why did the user report -EOPNOTSUPP with the patch, which went away
when reverted?
Christoph Hellwig Jan. 22, 2024, 6:50 a.m. UTC | #5
On Mon, Jan 22, 2024 at 01:37:45AM -0500, Kent Overstreet wrote:
> > Without this patch, as in current mainline, you will get -EOPNOTSUPP
> > because bcachefs keeps sending REQ_OP_FLUSH, and the block layer now
> > checks for that to catch bugs like the one fixed with this patch.
> 
> Then why did the user report -EOPNOTSUPP with the patch, which went away
> when reverted?

I have no idea; as we never return -EOPNOTSUPP for
REQ_OP_WRITE | REQ_PREFLUSH, this would be odd.  According to your
report the user runs a Fedora kernel, and this commit never went
upstream or even into linux-next, which doesn't quite add up to me
either.

But in the end just try it yourself.  On current mainline with the buggy
REQ_OP_FLUSH you should be seeing -EOPNOTSUPP every single time you
try a data integrity operation, without it you should not.  Even the most
basic test should be catching this.
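
Such a basic test can be a few lines of userspace; a sketch, where the
mount point and file name are just examples:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/bcachefs/flushtest",
			      O_CREAT | O_WRONLY, 0644);

		if (fd < 0 || write(fd, "x", 1) != 1) {
			perror("setup");
			return 1;
		}

		/*
		 * fsync() forces a cache flush; with the buggy REQ_OP_FLUSH
		 * it is expected to fail (the reports show -EOPNOTSUPP),
		 * with the fix it should succeed.
		 */
		if (fsync(fd) < 0) {
			perror("fsync");
			return 1;
		}
		puts("flush ok");
		return close(fd) != 0;
	}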
Kent Overstreet Jan. 22, 2024, 5:37 p.m. UTC | #6
On Mon, Jan 22, 2024 at 07:50:38AM +0100, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 01:37:45AM -0500, Kent Overstreet wrote:
> > > Without this patch, as in current mainline, you will get -EOPNOTSUPP
> > > because bcachefs keeps sending REQ_OP_FLUSH, and the block layer now
> > > checks for that to catch bugs like the one fixed with this patch.
> > 
> > Then why did the user report -EOPNOTSUPP with the patch, which went away
> > when reverted?
> 
> I have no idea; as we never return -EOPNOTSUPP for
> REQ_OP_WRITE | REQ_PREFLUSH, this would be odd.  According to your
> report the user runs a Fedora kernel, and this commit never went
> upstream or even into linux-next, which doesn't quite add up to me
> either.

Ahh - I misread the bug report (Fedora puts out kernels before rc1!?).
Thanks, your patch is back in :)
Christoph Hellwig Jan. 22, 2024, 5:38 p.m. UTC | #7
On Mon, Jan 22, 2024 at 12:37:10PM -0500, Kent Overstreet wrote:
> Ahh - I misread the bug report (Fedora puts out kernels before rc1!?).
> Thanks, your patch is back in :)

Please throw it in your test setup and watch the results carefully.
I'm pretty sure it is correct, but I do not actually have a way to
verify it right now.
Kent Overstreet Jan. 22, 2024, 5:42 p.m. UTC | #8
On Mon, Jan 22, 2024 at 06:38:09PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 12:37:10PM -0500, Kent Overstreet wrote:
> > Ahh - I misread the bug report (Fedora puts out kernels before rc1!?).
> > Thanks, your patch is back in :)
> 
> Please throw it in your test setup and watch the results carefully.
> I'm pretty sure it is correct, but I do not actually have a way to
> verify it right now.

updating my tests for the MD_FAULTY removal, then will do. Is there
anything you want me to look for?

considering most tests won't break or won't clearly break if flush/fua
is being dropped (even generic/388 was passing reliably, of course,
since virtual block devices aren't going to reorder writes...) maybe we
could do some print statement sanity checking...
Christoph Hellwig Jan. 22, 2024, 6:41 p.m. UTC | #9
On Mon, Jan 22, 2024 at 12:42:24PM -0500, Kent Overstreet wrote:
> updating my tests for the MD_FAULTY removal, then will do. Is there
> anything you want me to look for?

Nothing fancy.  Just that you do data integrity operations and do
not see an error.

> considering most tests won't break or won't clearly break if flush/fua
> is being dropped (even generic/388 was passing reliably, of course,
> since virtual block devices aren't going to reorder writes...) maybe we
> could do some print statement sanity checking...

Well, xfstests is not very good about power fail testing, because it
can't inject a power fail.  That is a problem in its own right, but
fixing it would need hardware power failing, or at least some pretty
good VM emulation of the same, outside the scope of the actual
xfstests runner.

But basically what this patch needs is just seeing that you don't
get I/O errors on anything that flushes, which on current mainline
without the patch you should be seeing.
Dave Chinner Jan. 23, 2024, 12:42 a.m. UTC | #10
On Mon, Jan 22, 2024 at 07:41:47PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 12:42:24PM -0500, Kent Overstreet wrote:
> > updating my tests for the MD_FAULTY removal, then will do. Is there
> > anything you want me to look for?
> 
> Nothing fancy.  Just that you do data integrity operations and do
> not see an error.
> 
> > considering most tests won't break or won't clearly break if flush/fua
> > is being dropped (even generic/388 was passing reliably, of course,
> > since virtual block devices aren't going to reorder writes...) maybe we
> > could do some print statement sanity checking...
> 
> Well, xfstests is not very good about power fail testing, because it
> can't inject a power fail.  That is a problem in its own right, but
> fixing it would need hardware power failing, or at least some pretty
> good VM emulation of the same, outside the scope of the actual
> xfstests runner.

We do actually have stuff in fstests that checks write vs flush
ordering as issued by the filesystem. We use dm-logwrites for
tracking the writes and flushes and to be able to replay arbitrary
groups of writes between flushes.  generic/482 is one of those
tests.

These are the ordering problems that power failures expose,
so we do actually have tests that cover the same conditions that
pulling the power from a device would exercise.

I even wrote a debugging script (tools/dm-logwrite-replay) to
iterate over write+fua groups in the test corpus and isolate the
write groups where the consistency failure occurs; I used it while
working to optimise the flushes issued by the XFS journal checkpoint
writes.

Cheers,

Dave.

Patch

diff --git a/fs/bcachefs/fs-io.c b/fs/bcachefs/fs-io.c
index b0e8144ec5500c..a8500af6c7438f 100644
--- a/fs/bcachefs/fs-io.c
+++ b/fs/bcachefs/fs-io.c
@@ -79,7 +79,7 @@  void bch2_inode_flush_nocow_writes_async(struct bch_fs *c,
 			continue;
 
 		bio = container_of(bio_alloc_bioset(ca->disk_sb.bdev, 0,
-						    REQ_OP_FLUSH,
+						    REQ_OP_WRITE | REQ_PREFLUSH,
 						    GFP_KERNEL,
 						    &c->nocow_flush_bioset),
 				   struct nocow_flush, bio);
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 3eb6c3f62a811b..43c76c9ad9a316 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1948,7 +1948,8 @@  CLOSURE_CALLBACK(bch2_journal_write)
 			percpu_ref_get(&ca->io_ref);
 
 			bio = ca->journal.bio;
-			bio_reset(bio, ca->disk_sb.bdev, REQ_OP_FLUSH);
+			bio_reset(bio, ca->disk_sb.bdev,
+					REQ_OP_WRITE | REQ_PREFLUSH);
 			bio->bi_end_io		= journal_write_endio;
 			bio->bi_private		= ca;
 			closure_bio_submit(bio, cl);