[Q] gfs2: mmap write vs. punch_hole consistency

Message ID	20190906205241.2292-1-agruenba@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=8ISX=XB=vger.kernel.org=linux-fsdevel-owner@kernel.org> From: Andreas Gruenbacher <agruenba@redhat.com> To: cluster-devel@redhat.com, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org Cc: Andreas Gruenbacher <agruenba@redhat.com>, "Darrick J . Wong" <darrick.wong@oracle.com>, Dave Chinner <david@fromorbit.com>, Christoph Hellwig <hch@lst.de>, Lukas Czerner <lczerner@redhat.com> Subject: [Q] gfs2: mmap write vs. punch_hole consistency Date: Fri, 6 Sep 2019 22:52:41 +0200 Message-Id: <20190906205241.2292-1-agruenba@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	[Q] gfs2: mmap write vs. punch_hole consistency \| expand [Q] gfs2: mmap write vs. punch_hole consistency

Message ID

20190906205241.2292-1-agruenba@redhat.com (mailing list archive)

State

New, archived

Headers

From: Andreas Gruenbacher <agruenba@redhat.com>
To: cluster-devel@redhat.com, linux-fsdevel@vger.kernel.org,
        linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org
Cc: Andreas Gruenbacher <agruenba@redhat.com>,
        "Darrick J . Wong" <darrick.wong@oracle.com>,
        Dave Chinner <david@fromorbit.com>,
        Christoph Hellwig <hch@lst.de>,
        Lukas Czerner <lczerner@redhat.com>
Subject: [Q] gfs2: mmap write vs. punch_hole consistency
Date: Fri,  6 Sep 2019 22:52:41 +0200
Message-Id: <20190906205241.2292-1-agruenba@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Series

[Q] gfs2: mmap write vs. punch_hole consistency | expand

Commit Message

Andreas Gruenbacher Sept. 6, 2019, 8:52 p.m. UTC

Hi,

I've just fixed a mmap write vs. truncate consistency issue on gfs on
filesystems with a block size smaller that the page size [1].

It turns out that the same problem exists between mmap write and hole
punching, and since xfstests doesn't seem to cover that, I've written a
new test [2].  Ext4 and xfs both pass that test; they both apparently
mark the pages that have a hole punched in them as read-only so that
page_mkwrite is called before those pages can be written to again.

gfs2 fails that: for some reason, the partially block-mapped pages are
not marked read-only on gfs2, and so page_mkwrite is not called for the
partially block-mapped pages, and the hole is not filled in correctly.

The attached patch fixes the problem, but this really doesn't look right
as neither ext4 nor xfs require this kind of hack.  So what am I
overlooking, how does this work on ext4 and xfs?

Thanks,
Andreas

[1] gfs2: Improve mmap write vs. truncate consistency
https://www.redhat.com/archives/cluster-devel/2019-September/msg00009.html

[2] generic/567: test mmap write vs. hole punching
https://www.spinics.net/lists/fstests/msg12474.html

[PATCH] gfs2: Improve mmap write vs. punch_hole consistency

Fixes xfstest generic/567.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Dave Chinner Sept. 6, 2019, 9:27 p.m. UTC | #1

On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote:
> Hi,
> 
> I've just fixed a mmap write vs. truncate consistency issue on gfs on
> filesystems with a block size smaller that the page size [1].
> 
> It turns out that the same problem exists between mmap write and hole
> punching, and since xfstests doesn't seem to cover that,

AFAIA, fsx exercises it pretty often. Certainly it's found problems
with XFS in the past w.r.t. these operations.

> I've written a
> new test [2].

I suspect that what we really want is a test that runs
_test_generic_punch using mmap rather than pwrite...

> Ext4 and xfs both pass that test; they both apparently
> mark the pages that have a hole punched in them as read-only so that
> page_mkwrite is called before those pages can be written to again.

XFS invalidates the range being hole punched (see
xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any
attempt to fault that page back in will block on the MMAPLOCK until
the hole punch finishes.

> gfs2 fails that: for some reason, the partially block-mapped pages are
> not marked read-only on gfs2, and so page_mkwrite is not called for the
> partially block-mapped pages, and the hole is not filled in correctly.
> 
> The attached patch fixes the problem, but this really doesn't look right
> as neither ext4 nor xfs require this kind of hack.  So what am I
> overlooking, how does this work on ext4 and xfs?

XFS uses XFS_MMAPLOCK_* to serialise page faults against extent
manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses
a similar locking mechanism to do the same thing. Fundamentally, the
page cache does not provide the necessary mechanisms to detect and
prevent invalidation races inside EOF....

> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  fs/gfs2/bmap.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 9ef543dd38e2..e677e813be4c 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
>  			if (error)
>  				goto out;
>  		}
> +		/*
> +		 * If the first or last page partially lies in the hole, mark
> +		 * the page read-only so that memory-mapped writes will trigger
> +		 * page_mkwrite.
> +		 */
> +		pagecache_isize_extended(inode, offset, inode->i_size);
> +		pagecache_isize_extended(inode, offset + length, inode->i_size);

See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL
to serialise against incoming page faults...

Cheers,

Dave.

Andreas Gruenbacher Sept. 6, 2019, 9:48 p.m. UTC | #2

On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote:
> > Hi,
> >
> > I've just fixed a mmap write vs. truncate consistency issue on gfs on
> > filesystems with a block size smaller that the page size [1].
> >
> > It turns out that the same problem exists between mmap write and hole
> > punching, and since xfstests doesn't seem to cover that,
>
> AFAIA, fsx exercises it pretty often. Certainly it's found problems
> with XFS in the past w.r.t. these operations.
>
> > I've written a
> > new test [2].
>
> I suspect that what we really want is a test that runs
> _test_generic_punch using mmap rather than pwrite...
>
> > Ext4 and xfs both pass that test; they both apparently
> > mark the pages that have a hole punched in them as read-only so that
> > page_mkwrite is called before those pages can be written to again.
>
> XFS invalidates the range being hole punched (see
> xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any
> attempt to fault that page back in will block on the MMAPLOCK until
> the hole punch finishes.

This isn't about writes during the hole punching, this is about writes
once the hole is punched. For example, the test case I've posted
creates the following file layout with 1k blocksize:

  DDDD DDDD DDDD

Then it punches a hole like this:

  DDHH HHHH HHDD

Then it fills the hole again with mwrite:

  DDDD DDDD DDDD

As far as I can tell, that needs to trigger page faults on all three
pages. I did get these on ext4; judging from the fact that xfs works,
the also seem to occur there; but on gfs2, page_mkwrite isn't called
for the two partially mapped pages, only for the page in the middle
that's entirely within the hole. And I don't see where those pages are
marked read-only; it appears like pagecache_isize_extended isn't
called on ext4 or xfs. So how does this work there?

> > gfs2 fails that: for some reason, the partially block-mapped pages are
> > not marked read-only on gfs2, and so page_mkwrite is not called for the
> > partially block-mapped pages, and the hole is not filled in correctly.
> >
> > The attached patch fixes the problem, but this really doesn't look right
> > as neither ext4 nor xfs require this kind of hack.  So what am I
> > overlooking, how does this work on ext4 and xfs?
>
> XFS uses XFS_MMAPLOCK_* to serialise page faults against extent
> manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses
> a similar locking mechanism to do the same thing. Fundamentally, the
> page cache does not provide the necessary mechanisms to detect and
> prevent invalidation races inside EOF....

Yes, but that unfortunately doesn't answer my question.

Thanks,
Andreas

> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > ---
> >  fs/gfs2/bmap.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> > index 9ef543dd38e2..e677e813be4c 100644
> > --- a/fs/gfs2/bmap.c
> > +++ b/fs/gfs2/bmap.c
> > @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
> >                       if (error)
> >                               goto out;
> >               }
> > +             /*
> > +              * If the first or last page partially lies in the hole, mark
> > +              * the page read-only so that memory-mapped writes will trigger
> > +              * page_mkwrite.
> > +              */
> > +             pagecache_isize_extended(inode, offset, inode->i_size);
> > +             pagecache_isize_extended(inode, offset + length, inode->i_size);
>
> See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL
> to serialise against incoming page faults...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Sept. 6, 2019, 11:15 p.m. UTC | #3

On Fri, Sep 06, 2019 at 11:48:31PM +0200, Andreas Gruenbacher wrote:
> On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote:
> > > Hi,
> > >
> > > I've just fixed a mmap write vs. truncate consistency issue on gfs on
> > > filesystems with a block size smaller that the page size [1].
> > >
> > > It turns out that the same problem exists between mmap write and hole
> > > punching, and since xfstests doesn't seem to cover that,
> >
> > AFAIA, fsx exercises it pretty often. Certainly it's found problems
> > with XFS in the past w.r.t. these operations.
> >
> > > I've written a
> > > new test [2].
> >
> > I suspect that what we really want is a test that runs
> > _test_generic_punch using mmap rather than pwrite...
> >
> > > Ext4 and xfs both pass that test; they both apparently
> > > mark the pages that have a hole punched in them as read-only so that
> > > page_mkwrite is called before those pages can be written to again.
> >
> > XFS invalidates the range being hole punched (see
> > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any
> > attempt to fault that page back in will block on the MMAPLOCK until
> > the hole punch finishes.
> 
> This isn't about writes during the hole punching, this is about writes
> once the hole is punched.

How do you prevent this:

hole punch process:		other process
  clean page
  invalidate page
				write page fault
				instantiate new page
				page_mkwrite
				page dirty
  do hole punch
  overwrite hole

Firstly, you end up with a dirty page over a hole with no backing
store, and second, you get no page fault on overwrite because the
pages are already dirty.

That's the problem the MMAPLOCK in XFS solves - it's integral to
ensuring the page faults on mmap write are correctly serialised with
both the page cache invalidation and the underlying extent
manipulation that the hole punch operation then does.

i.e. if you want a page fault /after/ a hole punch, you have to
prevent it occurring during the hole punch after the page has
already been marked clean and/or invalidated.

> For example, the test case I've posted
> creates the following file layout with 1k blocksize:
> 
>   DDDD DDDD DDDD
> 
> Then it punches a hole like this:
> 
>   DDHH HHHH HHDD
> 
> Then it fills the hole again with mwrite:
> 
>   DDDD DDDD DDDD
> 
> As far as I can tell, that needs to trigger page faults on all three
> pages.

Yes, but only /after/ the hole has been punched.

> I did get these on ext4; judging from the fact that xfs works,
> the also seem to occur there; but on gfs2, page_mkwrite isn't called
> for the two partially mapped pages, only for the page in the middle
> that's entirely within the hole. And I don't see where those pages are
> marked read-only; it appears like pagecache_isize_extended isn't
> called on ext4 or xfs. So how does this work there?

As I said in my last response, the answer is in
xfs_flush_unmap_range(). That uses truncate_pagecache_range() to do
the necessary page cache manipulation.

Cheers,

Dave.

Jan Kara Sept. 9, 2019, 9:19 a.m. UTC | #4

On Fri 06-09-19 23:48:31, Andreas Gruenbacher wrote:
> On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote:
> > > Hi,
> > >
> > > I've just fixed a mmap write vs. truncate consistency issue on gfs on
> > > filesystems with a block size smaller that the page size [1].
> > >
> > > It turns out that the same problem exists between mmap write and hole
> > > punching, and since xfstests doesn't seem to cover that,
> >
> > AFAIA, fsx exercises it pretty often. Certainly it's found problems
> > with XFS in the past w.r.t. these operations.
> >
> > > I've written a
> > > new test [2].
> >
> > I suspect that what we really want is a test that runs
> > _test_generic_punch using mmap rather than pwrite...
> >
> > > Ext4 and xfs both pass that test; they both apparently
> > > mark the pages that have a hole punched in them as read-only so that
> > > page_mkwrite is called before those pages can be written to again.
> >
> > XFS invalidates the range being hole punched (see
> > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any
> > attempt to fault that page back in will block on the MMAPLOCK until
> > the hole punch finishes.
> 
> This isn't about writes during the hole punching, this is about writes
> once the hole is punched. For example, the test case I've posted
> creates the following file layout with 1k blocksize:
> 
>   DDDD DDDD DDDD
> 
> Then it punches a hole like this:
> 
>   DDHH HHHH HHDD
> 
> Then it fills the hole again with mwrite:
> 
>   DDDD DDDD DDDD
> 
> As far as I can tell, that needs to trigger page faults on all three
> pages. I did get these on ext4; judging from the fact that xfs works,
> the also seem to occur there; but on gfs2, page_mkwrite isn't called
> for the two partially mapped pages, only for the page in the middle
> that's entirely within the hole. And I don't see where those pages are
> marked read-only; it appears like pagecache_isize_extended isn't
> called on ext4 or xfs. So how does this work there?

The trick ext4 & xfs use is that they writeout the range being punched
first (see e.g. ext4_punch_hole() calling filemap_write_and_wait_range() or
xfs_flush_unmap_range() called from xfs_free_file_space()). This writeout
also has the effect that all the page mappings for that range get
write-protected.

Another related issue is what Dave points out: Even if you use writeout to
writeprotect pages, GFS2 still seems to have a race where page fault can
come while you are freeing blocks and if you allow that you usually get
into a problematic state. Effects depend on fs implementation details but
usually it can result in stale data exposure or fs corruption.

								Honza

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 9ef543dd38e2..e677e813be4c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -2475,6 +2475,13 @@  int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
 			if (error)
 				goto out;
 		}
+		/*
+		 * If the first or last page partially lies in the hole, mark
+		 * the page read-only so that memory-mapped writes will trigger
+		 * page_mkwrite.
+		 */
+		pagecache_isize_extended(inode, offset, inode->i_size);
+		pagecache_isize_extended(inode, offset + length, inode->i_size);
 	}
 
 	if (gfs2_is_jdata(ip)) {

[Q] gfs2: mmap write vs. punch_hole consistency

Commit Message

Comments

Patch