
iomap: set page dirty after partial delalloc on mkwrite

Message ID 20180928173956.42428-1-bfoster@redhat.com (mailing list archive)
State New, archived
Series: iomap: set page dirty after partial delalloc on mkwrite

Commit Message

Brian Foster Sept. 28, 2018, 5:39 p.m. UTC
The iomap page fault mechanism currently dirties the associated page
after the full block range of the page has been allocated. On
filesystems with sub-page block sizes, this leaves the page
susceptible to having delayed allocation performed against it
without the page ever being set dirty.

For example, consider a page fault on a page with one preexisting
real (non-delalloc) block allocated in the middle of the page. The
first iomap_apply() iteration performs delayed allocation on the
range up to the preexisting block, the next iteration finds the
preexisting block, and the last iteration attempts to perform
delayed allocation on the range after the preexisting block to the
end of the page. If the first allocation succeeds and the final
allocation fails with -ENOSPC, iomap_apply() returns the error and
iomap_page_mkwrite() fails to dirty the page having already
performed partial delayed allocation. This eventually results in the
page being invalidated without ever converting the delayed
allocation to real blocks.
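
For reference, here is a simplified sketch of the current
iomap_page_mkwrite() flow (locking, EOF handling and some details
trimmed; the loop and tail correspond to the hunk context in the
patch below), showing how a mid-page failure bypasses the dirtying
step:

  int iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
  {
          struct page *page = vmf->page;
          struct inode *inode = file_inode(vmf->vma->vm_file);
          loff_t offset = page_offset(page);
          loff_t length = PAGE_SIZE;      /* simplified: page wholly inside EOF */
          ssize_t ret;

          /* one iomap_apply() iteration per extent within the page */
          while (length > 0) {
                  ret = iomap_apply(inode, offset, length,
                                  IOMAP_WRITE | IOMAP_FAULT, ops, page,
                                  iomap_page_mkwrite_actor);
                  if (ret <= 0)
                          goto out_unlock;  /* -ENOSPC skips set_page_dirty() */
                  offset += ret;
                  length -= ret;
          }

          /* only reached if every iteration succeeded */
          set_page_dirty(page);
          wait_for_stable_page(page);
          return VM_FAULT_LOCKED;
  out_unlock:
          unlock_page(page);
          return block_page_mkwrite_return(ret);
  }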

This problem is reliably reproduced by generic/083 on XFS on ppc64
systems (64k page size, 4k block size). It results in leaked
delalloc blocks on inode reclaim, which triggers an assert failure
in xfs_fs_destroy_inode() and filesystem accounting inconsistency.

Move the set_page_dirty() call from iomap_page_mkwrite() to the
actor callback, similar to how the buffer head implementation works.
The actor callback is called iff ->iomap_begin() returns success,
which ensures the page is dirtied as soon as possible after an
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This patch addresses the problem with generic/083, but I'm still in the
process of running broader testing. I wanted to get it on the list for
review in the meantime...

Brian

 fs/iomap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christoph Hellwig Sept. 30, 2018, 10:44 p.m. UTC | #1
On Fri, Sep 28, 2018 at 01:39:56PM -0400, Brian Foster wrote:
> This problem is reliably reproduced by generic/083 on XFS on ppc64
> systems (64k page size, 4k block size). It results in leaked
> delalloc blocks on inode reclaim, which triggers an assert failure
> in xfs_fs_destroy_inode() and filesystem accounting inconsistency.

Interesting - I've not seen this on 1k block size / 4k page size
systems.

> This patch addresses the problem with generic/083, but I'm still in the
> process of running broader testing. I wanted to get it on the list for
> review in the meantime...

But in general this looks sensible to me.  So if the testing passes
this looks good to me.
Dave Chinner Oct. 1, 2018, 12:31 a.m. UTC | #2
On Fri, Sep 28, 2018 at 01:39:56PM -0400, Brian Foster wrote:
> The iomap page fault mechanism currently dirties the associated page
> after the full block range of the page has been allocated. On
> filesystems with sub-page block sizes, this leaves the page
> susceptible to having delayed allocation performed against it
> without the page ever being set dirty.
> 
> For example, consider a page fault on a page with one preexisting
> real (non-delalloc) block allocated in the middle of the page. The
> first iomap_apply() iteration performs delayed allocation on the
> range up to the preexisting block, the next iteration finds the
> preexisting block, and the last iteration attempts to perform
> delayed allocation on the range after the preexisting block to the
> end of the page. If the first allocation succeeds and the final
> allocation fails with -ENOSPC, iomap_apply() returns the error and
> iomap_page_mkwrite() fails to dirty the page having already
> performed partial delayed allocation. This eventually results in the
> page being invalidated without ever converting the delayed
> allocation to real blocks.
> 
> This problem is reliably reproduced by generic/083 on XFS on ppc64
> systems (64k page size, 4k block size). It results in leaked
> delalloc blocks on inode reclaim, which triggers an assert failure
> in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
> 
> Move the set_page_dirty() call from iomap_page_mkwrite() to the
> actor callback, similar to how the buffer head implementation works.
> The actor callback is called iff ->iomap_begin() returns success,
> which ensures the page is dirtied as soon as possible after an
> allocation.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks sensible to me.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> ---
> 
> This patch addresses the problem with generic/083, but I'm still in the
> process of running broader testing. I wanted to get it on the list for
> review in the meantime...

It's been fine on my weekend test runs in my candidate fixes for 4.19
branch. There have been no regressions in the v4 512b and v5 1k block
size test runs with it - I got 14 complete fstests runs of sub-page
block size configs across multiple test machines over the weekend, so
I've pushed it with that branch.

-Dave.
Brian Foster Oct. 1, 2018, 10:54 a.m. UTC | #3
On Mon, Oct 01, 2018 at 12:44:35AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 28, 2018 at 01:39:56PM -0400, Brian Foster wrote:
> > This problem is reliably reproduced by generic/083 on XFS on ppc64
> > systems (64k page size, 4k block size). It results in leaked
> > delalloc blocks on inode reclaim, which triggers an assert failure
> > in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
> 
> Interesting - I've not seen this on 1k block size / 4k page size
> systems.
> 

I hadn't either. Once I realized generic/083 only reproduced on ppc64
with 4k blocks, I tried again on x86-64 with a smaller block size and
still couldn't reproduce it.

That said, an explicit test of the circumstances described in the commit
log (write fault on a page that performs partial delalloc and fails with
-ENOSPC) reproduces the problem reliably regardless of page size. I
suspect that a 1k/512b block size against a 4k page is just too small
a window for a sub-page/partial delalloc -ENOSPC to reproduce reliably
via fsstress, as opposed to 4k fsb and 64k pages.
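
The explicit test is roughly along the lines of the sketch below. It's
only an illustration: the mount point, the 4k block size and the
out-of-band ENOSPC setup are assumptions, and the scratch fs needs a
block size smaller than the page size with almost no free space left
by the time the fault happens:

  /* hypothetical reproducer sketch, not the actual test */
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psize = sysconf(_SC_PAGESIZE);     /* e.g. 64k on ppc64 */
          char block[4096] = { 0 }, *buf;
          int fd = open("/mnt/scratch/testfile", O_CREAT | O_RDWR, 0644);

          if (fd < 0)
                  return 1;

          /* allocate one real block in the middle of the first page ... */
          pwrite(fd, block, sizeof(block), psize / 2);
          fsync(fd);              /* ... converting its delalloc to a real extent */
          ftruncate(fd, psize);   /* extend the file to cover the whole page */

          /*
           * Free space must be exhausted here (out of band, e.g. by filling
           * the fs with another file) so that allocating the remaining
           * blocks of the page fails.
           */

          /*
           * Write fault over the page: delalloc before the real block can
           * succeed while delalloc after it fails with -ENOSPC, which
           * typically surfaces as SIGBUS.
           */
          buf = mmap(NULL, psize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (buf == MAP_FAILED)
                  return 1;
          memset(buf, 0xab, psize);

          munmap(buf, psize);
          close(fd);
          return 0;
  }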

> > This patch addresses the problem with generic/083, but I'm still in the
> > process of running broader testing. I wanted to get it on the list for
> > review in the meantime...
> 
> But in general this looks sensible to me.  So if the testing passes
> this looks good to me.

As a follow-up to Dave's additional testing (thanks!), I ran a few
xfstests runs on x86-64 and ppc64 over the weekend and didn't hit any
regressions (though I note that XFS has a high failure rate with
-bsize=64k, but that's a separate problem).

Brian

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 74762b1ec233..ec15cf2ec696 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1051,6 +1051,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 	} else {
 		WARN_ON_ONCE(!PageUptodate(page));
 		iomap_page_create(inode, page);
+		set_page_dirty(page);
 	}
 
 	return length;
@@ -1090,7 +1091,6 @@ int iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 		length -= ret;
 	}
 
-	set_page_dirty(page);
 	wait_for_stable_page(page);
 	return VM_FAULT_LOCKED;
 out_unlock: