[v4,36/39] netfs: Implement a write-through caching option

Message ID

20231213152350.431591-37-dhowells@redhat.com (mailing list archive)

State

New

Headers

From: David Howells <dhowells@redhat.com>
To: Jeff Layton <jlayton@kernel.org>,
	Steve French <smfrench@gmail.com>
Cc: David Howells <dhowells@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Marc Dionne <marc.dionne@auristor.com>,
	Paulo Alcantara <pc@manguebit.com>,
	Shyam Prasad N <sprasad@microsoft.com>,
	Tom Talpey <tom@talpey.com>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Christian Brauner <christian@brauner.io>,
	linux-cachefs@redhat.com,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org,
	netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v4 36/39] netfs: Implement a write-through caching option
Date: Wed, 13 Dec 2023 15:23:46 +0000
Message-ID: <20231213152350.431591-37-dhowells@redhat.com>
In-Reply-To: <20231213152350.431591-1-dhowells@redhat.com>
References: <20231213152350.431591-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

netfs, afs, 9p: Delegate high-level I/O to netfslib

Commit Message

David Howells Dec. 13, 2023, 3:23 p.m. UTC

Provide a flag whereby a filesystem may request that netfs_perform_write()
perform write-through caching.  This involves putting pages directly into
writeback rather than dirty and attaching them to a write operation as we
go.
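
The accounting behind "attaching them to a write operation as we go" can be
sketched in userspace. This is only a rough model of the begin/advance/end
helpers in the diff, not the kernel code: struct wt_model and the wt_*()
functions are hypothetical names, and a counter stands in for issuing a write
RPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the write-request fields the patch uses. */
struct wt_model {
	size_t count;		/* Bytes copied into the pagecache so far */
	size_t submitted;	/* Bytes already handed on for upload */
	size_t wsize;		/* Maximum size of a single write RPC */
	unsigned int nr_submits; /* Number of submissions made (model only) */
};

/* Hand pending bytes on; unless final, only whole multiples of wsize. */
static void wt_submit(struct wt_model *w, bool final)
{
	size_t len = w->count - w->submitted;

	assert(w->wsize > 0);
	if (!final)
		len -= len % w->wsize;	/* Round down to whole RPCs */
	if (len == 0)
		return;
	w->submitted += len;
	w->nr_submits++;
}

/* Data was copied in; submit once a folio is complete and wsize is reached. */
static void wt_advance(struct wt_model *w, size_t copied, bool to_folio_end)
{
	w->count += copied;
	if (to_folio_end && w->count - w->submitted >= w->wsize)
		wt_submit(w, false);
}

/* The write() is finished; flush whatever has not been submitted yet. */
static void wt_end(struct wt_model *w)
{
	if (w->submitted < w->count)
		wt_submit(w, true);
}
```

With wsize set to 4096, two calls to wt_advance() that together fill one
4096-byte folio trigger a single submission, and a trailing 100-byte copy is
only sent when wt_end() is called.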

Further, the writes being made are limited to the byte range being written
rather than whole folios being written.  This can be used by cifs, for
example, to deal with strict byte-range locking.
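
To illustrate why the narrower range matters for byte-range locking, here is a
small userspace sketch. It assumes fixed-size folios, and all the names
(struct byte_range, whole_folio_range(), writethrough_range(),
ranges_overlap()) are made up for the illustration rather than taken from
netfslib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct byte_range {
	unsigned long long start;	/* First byte covered */
	unsigned long long end;		/* One past the last byte covered */
};

/* Range covered if every folio touched by the write is sent in full. */
static struct byte_range whole_folio_range(unsigned long long pos, size_t len,
					   size_t folio_size)
{
	struct byte_range r;

	assert(folio_size > 0 && len > 0);
	r.start = pos - pos % folio_size;	/* Round down to folio start */
	r.end = pos + len;
	if (r.end % folio_size)			/* Round up to folio end */
		r.end += folio_size - r.end % folio_size;
	return r;
}

/* Range covered when only the bytes actually written are sent. */
static struct byte_range writethrough_range(unsigned long long pos, size_t len)
{
	struct byte_range r = { .start = pos, .end = pos + len };

	return r;
}

/* Do two half-open ranges share at least one byte? */
static bool ranges_overlap(struct byte_range a, struct byte_range b)
{
	return a.start < b.end && b.start < a.end;
}
```

For a 100-byte write at offset 5000 with 4096-byte folios, the whole-folio
range is [4096, 8192) and would collide with another client's lock on
[4096, 4200), whereas the write-through range [5000, 5100) would not.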

This can't be used with content encryption as that may require expansion of
the write RPC beyond the write being made.

This doesn't affect writes via mmap - those are written back in the normal
way; similarly failed writethrough writes are marked dirty and left to
writeback to retry.  Another option would be to simply invalidate them, but
the contents can be simultaneously accessed by read() and through mmap.
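
The two failure options mentioned above can be contrasted with a toy state
model. This is a sketch under heavy simplification: struct folio_model, enum
wt_fail_policy and wt_write_failed() are hypothetical and only track a few
booleans, not real folio state.

```c
#include <assert.h>
#include <stdbool.h>

enum wt_fail_policy {
	WT_FAIL_REDIRTY,	/* Mark dirty and let writeback retry */
	WT_FAIL_INVALIDATE,	/* Drop the folio from the pagecache */
};

struct folio_model {
	bool cached;		/* Still present in the pagecache */
	bool dirty;		/* Waiting for ordinary writeback */
	bool writeback;		/* A write to the server is in flight */
};

/*
 * Apply a failure policy to a folio whose write-through write failed.
 * Returns true if the written data is still visible to read() and mmap.
 */
static bool wt_write_failed(struct folio_model *f, enum wt_fail_policy policy)
{
	assert(f->cached && f->writeback);

	f->writeback = false;
	switch (policy) {
	case WT_FAIL_REDIRTY:
		f->dirty = true;	/* Data kept; retried later */
		break;
	case WT_FAIL_INVALIDATE:
		f->dirty = false;
		f->cached = false;	/* Data gone from the cache */
		break;
	}
	return f->cached;
}
```

In this model, redirtying keeps the data reachable by concurrent read() and
mmap users at the cost of a later retry, while invalidation discards it, which
is the concern raised about simultaneous access.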

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/netfs/buffered_write.c    | 69 +++++++++++++++++++++++----
 fs/netfs/internal.h          |  3 ++
 fs/netfs/main.c              |  1 +
 fs/netfs/objects.c           |  1 +
 fs/netfs/output.c            | 90 ++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h        |  2 +
 include/trace/events/netfs.h |  8 +++-
 7 files changed, 162 insertions(+), 12 deletions(-)

Comments

Jeff Layton Dec. 14, 2023, 1:49 p.m. UTC | #1

On Wed, 2023-12-13 at 15:23 +0000, David Howells wrote:
> Provide a flag whereby a filesystem may request that netfs_perform_write()
> perform write-through caching.  This involves putting pages directly into
> writeback rather than dirty and attaching them to a write operation as we
> go.
> 
> Further, the writes being made are limited to the byte range being written
> rather than whole folios being written.  This can be used by cifs, for
> example, to deal with strict byte-range locking.
> 

This is pretty cool. I wonder if that will help cifs pass more locking
tests?

> This can't be used with content encryption as that may require expansion of
> the write RPC beyond the write being made.
> 
> This doesn't affect writes via mmap - those are written back in the normal
> way; similarly failed writethrough writes are marked dirty and left to
> writeback to retry.  Another option would be to simply invalidate them, but
> the contents can be simultaneously accessed by read() and through mmap.
> 

I do wish Linux were less of a mess in this regard. Different
filesystems behave differently when writeback fails.

That said, the modern consensus with local filesystems is to just leave
the pages clean when buffered writeback fails, but set a writeback error
on the inode. That at least keeps dirty pages from stacking up in the
cache. In the case of something like a netfs, we usually invalidate the
inode and the pages -- netfs's usually have to spontaneously deal with
that anyway, so we might as well.

Marking the pages dirty here should mean that they'll effectively get a
second try at writeback, which is a change in behavior from most
filesystems. I'm not sure it's a bad one, but writeback can take a long
time if you have a laggy network.

When a write has already failed once, why do you think it'll succeed on
a second attempt (and probably with page-aligned I/O, I guess)?

Another question: when the writeback is (re)attempted, will it end up
just doing page-aligned I/O, or is the byte range still going to be
limited to the written range?

The more I consider it, I think it might be a lot simpler to just "fail
fast" here rather than re-marking the write dirty.

> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: linux-cachefs@redhat.com
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  fs/netfs/buffered_write.c    | 69 +++++++++++++++++++++++----
>  fs/netfs/internal.h          |  3 ++
>  fs/netfs/main.c              |  1 +
>  fs/netfs/objects.c           |  1 +
>  fs/netfs/output.c            | 90 ++++++++++++++++++++++++++++++++++++
>  include/linux/netfs.h        |  2 +
>  include/trace/events/netfs.h |  8 +++-
>  7 files changed, 162 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
> index 8e0ebb7175a4..dce6995fb644 100644
> --- a/fs/netfs/buffered_write.c
> +++ b/fs/netfs/buffered_write.c
> @@ -26,6 +26,8 @@ enum netfs_how_to_modify {
>  	NETFS_FLUSH_CONTENT,		/* Flush incompatible content. */
>  };
>  
> +static void netfs_cleanup_buffered_write(struct netfs_io_request *wreq);
> +
>  static void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
>  {
>  	if (netfs_group && !folio_get_private(folio))
> @@ -133,6 +135,14 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
>  	struct inode *inode = file_inode(file);
>  	struct address_space *mapping = inode->i_mapping;
>  	struct netfs_inode *ctx = netfs_inode(inode);
> +	struct writeback_control wbc = {
> +		.sync_mode	= WB_SYNC_NONE,
> +		.for_sync	= true,
> +		.nr_to_write	= LONG_MAX,
> +		.range_start	= iocb->ki_pos,
> +		.range_end	= iocb->ki_pos + iter->count,
> +	};
> +	struct netfs_io_request *wreq = NULL;
>  	struct netfs_folio *finfo;
>  	struct folio *folio;
>  	enum netfs_how_to_modify howto;
> @@ -143,6 +153,30 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
>  	size_t max_chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>  	bool maybe_trouble = false;
>  
> +	if (unlikely(test_bit(NETFS_ICTX_WRITETHROUGH, &ctx->flags) ||
> +		     iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC))
> +	    ) {
> +		if (pos < i_size_read(inode)) {
> +			ret = filemap_write_and_wait_range(mapping, pos, pos + iter->count);
> +			if (ret < 0) {
> +				goto out;
> +			}
> +		}
> +
> +		wbc_attach_fdatawrite_inode(&wbc, mapping->host);
> +
> +		wreq = netfs_begin_writethrough(iocb, iter->count);
> +		if (IS_ERR(wreq)) {
> +			wbc_detach_inode(&wbc);
> +			ret = PTR_ERR(wreq);
> +			wreq = NULL;
> +			goto out;
> +		}
> +		if (!is_sync_kiocb(iocb))
> +			wreq->iocb = iocb;
> +		wreq->cleanup = netfs_cleanup_buffered_write;
> +	}
> +
>  	do {
>  		size_t flen;
>  		size_t offset;	/* Offset into pagecache folio */
> @@ -315,7 +349,25 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
>  		}
>  		written += copied;
>  
> -		folio_mark_dirty(folio);
> +		if (likely(!wreq)) {
> +			folio_mark_dirty(folio);
> +		} else {
> +			if (folio_test_dirty(folio))
> +				/* Sigh.  mmap. */
> +				folio_clear_dirty_for_io(folio);
> +			/* We make multiple writes to the folio... */
> +			if (!folio_test_writeback(folio)) {
> +				folio_wait_fscache(folio);
> +				folio_start_writeback(folio);
> +				folio_start_fscache(folio);
> +				if (wreq->iter.count == 0)
> +					trace_netfs_folio(folio, netfs_folio_trace_wthru);
> +				else
> +					trace_netfs_folio(folio, netfs_folio_trace_wthru_plus);
> +			}
> +			netfs_advance_writethrough(wreq, copied,
> +						   offset + copied == flen);
> +		}
>  	retry:
>  		folio_unlock(folio);
>  		folio_put(folio);
> @@ -325,17 +377,14 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
>  	} while (iov_iter_count(iter));
>  
>  out:
> -	if (likely(written)) {
> -		/* Flush and wait for a write that requires immediate synchronisation. */
> -		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) {
> -			_debug("dsync");
> -			ret = filemap_fdatawait_range(mapping, iocb->ki_pos,
> -						      iocb->ki_pos + written);
> -		}
> -
> -		iocb->ki_pos += written;
> +	if (unlikely(wreq)) {
> +		ret = netfs_end_writethrough(wreq, iocb);
> +		wbc_detach_inode(&wbc);
> +		if (ret == -EIOCBQUEUED)
> +			return ret;
>  	}
>  
> +	iocb->ki_pos += written;
>  	_leave(" = %zd [%zd]", written, ret);
>  	return written ? written : ret;
>  
> diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
> index fe72280b0f30..b3749d6ec1ff 100644
> --- a/fs/netfs/internal.h
> +++ b/fs/netfs/internal.h
> @@ -101,6 +101,9 @@ static inline void netfs_see_request(struct netfs_io_request *rreq,
>   */
>  int netfs_begin_write(struct netfs_io_request *wreq, bool may_wait,
>  		      enum netfs_write_trace what);
> +struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len);
> +int netfs_advance_writethrough(struct netfs_io_request *wreq, size_t copied, bool to_page_end);
> +int netfs_end_writethrough(struct netfs_io_request *wreq, struct kiocb *iocb);
>  
>  /*
>   * stats.c
> diff --git a/fs/netfs/main.c b/fs/netfs/main.c
> index 8d5ee0f56f28..7139397931b7 100644
> --- a/fs/netfs/main.c
> +++ b/fs/netfs/main.c
> @@ -33,6 +33,7 @@ static const char *netfs_origins[nr__netfs_io_origin] = {
>  	[NETFS_READPAGE]		= "RP",
>  	[NETFS_READ_FOR_WRITE]		= "RW",
>  	[NETFS_WRITEBACK]		= "WB",
> +	[NETFS_WRITETHROUGH]		= "WT",
>  	[NETFS_LAUNDER_WRITE]		= "LW",
>  	[NETFS_UNBUFFERED_WRITE]	= "UW",
>  	[NETFS_DIO_READ]		= "DR",
> diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
> index 16252cc4576e..37626328577e 100644
> --- a/fs/netfs/objects.c
> +++ b/fs/netfs/objects.c
> @@ -42,6 +42,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
>  	rreq->debug_id	= atomic_inc_return(&debug_ids);
>  	xa_init(&rreq->bounce);
>  	INIT_LIST_HEAD(&rreq->subrequests);
> +	INIT_WORK(&rreq->work, NULL);
>  	refcount_set(&rreq->ref, 1);
>  
>  	__set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
> diff --git a/fs/netfs/output.c b/fs/netfs/output.c
> index cc9065733b42..625eb68f3e5a 100644
> --- a/fs/netfs/output.c
> +++ b/fs/netfs/output.c
> @@ -386,3 +386,93 @@ int netfs_begin_write(struct netfs_io_request *wreq, bool may_wait,
>  		    TASK_UNINTERRUPTIBLE);
>  	return wreq->error;
>  }
> +
> +/*
> + * Begin a write operation for writing through the pagecache.
> + */
> +struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len)
> +{
> +	struct netfs_io_request *wreq;
> +	struct file *file = iocb->ki_filp;
> +
> +	wreq = netfs_alloc_request(file->f_mapping, file, iocb->ki_pos, len,
> +				   NETFS_WRITETHROUGH);
> +	if (IS_ERR(wreq))
> +		return wreq;
> +
> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);
> +
> +	__set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
> +	iov_iter_xarray(&wreq->iter, ITER_SOURCE, &wreq->mapping->i_pages, wreq->start, 0);
> +	wreq->io_iter = wreq->iter;
> +
> +	/* ->outstanding > 0 carries a ref */
> +	netfs_get_request(wreq, netfs_rreq_trace_get_for_outstanding);
> +	atomic_set(&wreq->nr_outstanding, 1);
> +	return wreq;
> +}
> +
> +static void netfs_submit_writethrough(struct netfs_io_request *wreq, bool final)
> +{
> +	struct netfs_inode *ictx = netfs_inode(wreq->inode);
> +	unsigned long long start;
> +	size_t len;
> +
> +	if (!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
> +		return;
> +
> +	start = wreq->start + wreq->submitted;
> +	len = wreq->iter.count - wreq->submitted;
> +	if (!final) {
> +		len /= wreq->wsize; /* Round to number of maximum packets */
> +		len *= wreq->wsize;
> +	}
> +
> +	ictx->ops->create_write_requests(wreq, start, len);
> +	wreq->submitted += len;
> +}
> +
> +/*
> + * Advance the state of the write operation used when writing through the
> + * pagecache.  Data has been copied into the pagecache that we need to append
> + * to the request.  If we've added more than wsize then we need to create a new
> + * subrequest.
> + */
> +int netfs_advance_writethrough(struct netfs_io_request *wreq, size_t copied, bool to_page_end)
> +{
> +	_enter("ic=%zu sb=%zu ws=%u cp=%zu tp=%u",
> +	       wreq->iter.count, wreq->submitted, wreq->wsize, copied, to_page_end);
> +
> +	wreq->iter.count += copied;
> +	wreq->io_iter.count += copied;
> +	if (to_page_end && wreq->io_iter.count - wreq->submitted >= wreq->wsize)
> +		netfs_submit_writethrough(wreq, false);
> +
> +	return wreq->error;
> +}
> +
> +/*
> + * End a write operation used when writing through the pagecache.
> + */
> +int netfs_end_writethrough(struct netfs_io_request *wreq, struct kiocb *iocb)
> +{
> +	int ret = -EIOCBQUEUED;
> +
> +	_enter("ic=%zu sb=%zu ws=%u",
> +	       wreq->iter.count, wreq->submitted, wreq->wsize);
> +
> +	if (wreq->submitted < wreq->io_iter.count)
> +		netfs_submit_writethrough(wreq, true);
> +
> +	if (atomic_dec_and_test(&wreq->nr_outstanding))
> +		netfs_write_terminated(wreq, false);
> +
> +	if (is_sync_kiocb(iocb)) {
> +		wait_on_bit(&wreq->flags, NETFS_RREQ_IN_PROGRESS,
> +			    TASK_UNINTERRUPTIBLE);
> +		ret = wreq->error;
> +	}
> +
> +	netfs_put_request(wreq, false, netfs_rreq_trace_put_return);
> +	return ret;
> +}
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index a7c2cb856e81..fc77f7be220a 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -139,6 +139,7 @@ struct netfs_inode {
>  	unsigned long		flags;
>  #define NETFS_ICTX_ODIRECT	0		/* The file has DIO in progress */
>  #define NETFS_ICTX_UNBUFFERED	1		/* I/O should not use the pagecache */
> +#define NETFS_ICTX_WRITETHROUGH	2		/* Write-through caching */
>  };
>  
>  /*
> @@ -227,6 +228,7 @@ enum netfs_io_origin {
>  	NETFS_READPAGE,			/* This read is a synchronous read */
>  	NETFS_READ_FOR_WRITE,		/* This read is to prepare a write */
>  	NETFS_WRITEBACK,		/* This write was triggered by writepages */
> +	NETFS_WRITETHROUGH,		/* This write was made by netfs_perform_write() */
>  	NETFS_LAUNDER_WRITE,		/* This is triggered by ->launder_folio() */
>  	NETFS_UNBUFFERED_WRITE,		/* This is an unbuffered write */
>  	NETFS_DIO_READ,			/* This is a direct I/O read */
> diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
> index cc998798e20a..447a8c21cf57 100644
> --- a/include/trace/events/netfs.h
> +++ b/include/trace/events/netfs.h
> @@ -27,13 +27,15 @@
>  	EM(netfs_write_trace_dio_write,		"DIO-WRITE")	\
>  	EM(netfs_write_trace_launder,		"LAUNDER  ")	\
>  	EM(netfs_write_trace_unbuffered_write,	"UNB-WRITE")	\
> -	E_(netfs_write_trace_writeback,		"WRITEBACK")
> +	EM(netfs_write_trace_writeback,		"WRITEBACK")	\
> +	E_(netfs_write_trace_writethrough,	"WRITETHRU")
>  
>  #define netfs_rreq_origins					\
>  	EM(NETFS_READAHEAD,			"RA")		\
>  	EM(NETFS_READPAGE,			"RP")		\
>  	EM(NETFS_READ_FOR_WRITE,		"RW")		\
>  	EM(NETFS_WRITEBACK,			"WB")		\
> +	EM(NETFS_WRITETHROUGH,			"WT")		\
>  	EM(NETFS_LAUNDER_WRITE,			"LW")		\
>  	EM(NETFS_UNBUFFERED_WRITE,		"UW")		\
>  	EM(NETFS_DIO_READ,			"DR")		\
> @@ -136,7 +138,9 @@
>  	EM(netfs_folio_trace_redirty,		"redirty")	\
>  	EM(netfs_folio_trace_redirtied,		"redirtied")	\
>  	EM(netfs_folio_trace_store,		"store")	\
> -	E_(netfs_folio_trace_store_plus,	"store+")
> +	EM(netfs_folio_trace_store_plus,	"store+")	\
> +	EM(netfs_folio_trace_wthru,		"wthru")	\
> +	E_(netfs_folio_trace_wthru_plus,	"wthru+")
>  
>  #ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
>  #define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
>

David Howells Dec. 19, 2023, 4:51 p.m. UTC | #2

Jeff Layton <jlayton@kernel.org> wrote:

> > This can't be used with content encryption as that may require expansion of
> > the write RPC beyond the write being made.
> > 
> > This doesn't affect writes via mmap - those are written back in the normal
> > way; similarly failed writethrough writes are marked dirty and left to
> > writeback to retry.  Another option would be to simply invalidate them, but
> > the contents can be simultaneously accessed by read() and through mmap.
> > 
> 
> I do wish Linux were less of a mess in this regard. Different
> filesystems behave differently when writeback fails.

Cifs is particularly, um, entertaining in this regard as it allows the write
to fail on the server due to a checksum failure if the source data changes
during the write and then just retries it later.

> That said, the modern consensus with local filesystems is to just leave
> the pages clean when buffered writeback fails, but set a writeback error
> on the inode. That at least keeps dirty pages from stacking up in the
> cache. In the case of something like a netfs, we usually invalidate the
> inode and the pages -- netfs's usually have to spontaneously deal with
> that anyway, so we might as well.
> 
> Marking the pages dirty here should mean that they'll effectively get a
> second try at writeback, which is a change in behavior from most
> filesystems. I'm not sure it's a bad one, but writeback can take a long
> time if you have a laggy network.

I'm not sure what the best thing to do is.  If everything is doing
O_DSYNC/writethrough I/O on an inode and there is no mmap, then invalidating
the pages is probably not a bad way to deal with failure here.

> When a write has already failed once, why do you think it'll succeed on
> a second attempt (and probably with page-aligned I/O, I guess)?

See above with cifs.  I wonder if the pages being written to should be made RO
and page_mkwrite() forced to lock against DSYNC writethrough.

> Another question: when the writeback is (re)attempted, will it end up
> just doing page-aligned I/O, or is the byte range still going to be
> limited to the written range?

At the moment, it then happens exactly as it would if it wasn't doing
writethrough - so it will write partial folios if it's doing a streaming write
and will do full folios otherwise.

> The more I consider it, I think it might be a lot simpler to just "fail
> fast" here rather than re-marking the write dirty.

You may be right - but, again, mmap:-/

David

Jeff Layton Dec. 19, 2023, 5:19 p.m. UTC | #3

On Tue, 2023-12-19 at 16:51 +0000, David Howells wrote:
> Jeff Layton <jlayton@kernel.org> wrote:
> 
> > > This can't be used with content encryption as that may require expansion of
> > > the write RPC beyond the write being made.
> > > 
> > > This doesn't affect writes via mmap - those are written back in the normal
> > > way; similarly failed writethrough writes are marked dirty and left to
> > > writeback to retry.  Another option would be to simply invalidate them, but
> > > the contents can be simultaneously accessed by read() and through mmap.
> > > 
> > 
> > I do wish Linux were less of a mess in this regard. Different
> > filesystems behave differently when writeback fails.
> 
> Cifs is particularly, um, entertaining in this regard as it allows the write
> to fail on the server due to a checksum failure if the source data changes
> during the write and then just retries it later.
> 

Should they be using bounce pages here? Maybe that's more efficient in
the common case though and worth the extra hit if it happens seldom
enough.

> > That said, the modern consensus with local filesystems is to just leave
> > the pages clean when buffered writeback fails, but set a writeback error
> > on the inode. That at least keeps dirty pages from stacking up in the
> > cache. In the case of something like a netfs, we usually invalidate the
> > inode and the pages -- netfs's usually have to spontaneously deal with
> > that anyway, so we might as well.
> > 
> > Marking the pages dirty here should mean that they'll effectively get a
> > second try at writeback, which is a change in behavior from most
> > filesystems. I'm not sure it's a bad one, but writeback can take a long
> > time if you have a laggy network.
> 
> I'm not sure what the best thing to do is.  If everything is doing
> O_DSYNC/writethrough I/O on an inode and there is no mmap, then invalidating
> the pages is probably not a bad way to deal with failure here.
> 

That's a big if ;)

> > When a write has already failed once, why do you think it'll succeed on
> > a second attempt (and probably with page-aligned I/O, I guess)?
> 
> See above with cifs.  I wonder if the pages being written to should be made RO
> and page_mkwrite() forced to lock against DSYNC writethrough.
> 

That sounds pretty heavy handed, particularly if the server goes offline
for a bit. Now you're stuck in some locking call in page_mkwrite...

> > Another question: when the writeback is (re)attempted, will it end up
> > just doing page-aligned I/O, or is the byte range still going to be
> > limited to the written range?
> 
> At the moment, it then happens exactly as it would if it wasn't doing
> writethrough - so it will write partial folios if it's doing a streaming write
> and will do full folios otherwise.
> 
>
> > The more I consider it, I think it might be a lot simpler to just "fail
> > fast" here rather than re-marking the write dirty.
> 
> You may be right - but, again, mmap:-/
> 

There's nothing we can do about mmap -- we're stuck with page-sized I/Os
there.

With normal buffered I/O I still think just leaving the pages clean is
probably the least bad option. I think it's also sort of the Linux
"standard" behavior (for better or worse).

Willy, do you have any thoughts here?

diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 8e0ebb7175a4..dce6995fb644 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -26,6 +26,8 @@  enum netfs_how_to_modify {
 	NETFS_FLUSH_CONTENT,		/* Flush incompatible content. */
 };
 
+static void netfs_cleanup_buffered_write(struct netfs_io_request *wreq);
+
 static void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
 {
 	if (netfs_group && !folio_get_private(folio))
@@ -133,6 +135,14 @@  ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct inode *inode = file_inode(file);
 	struct address_space *mapping = inode->i_mapping;
 	struct netfs_inode *ctx = netfs_inode(inode);
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_NONE,
+		.for_sync	= true,
+		.nr_to_write	= LONG_MAX,
+		.range_start	= iocb->ki_pos,
+		.range_end	= iocb->ki_pos + iter->count,
+	};
+	struct netfs_io_request *wreq = NULL;
 	struct netfs_folio *finfo;
 	struct folio *folio;
 	enum netfs_how_to_modify howto;
@@ -143,6 +153,30 @@  ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 	size_t max_chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	bool maybe_trouble = false;
 
+	if (unlikely(test_bit(NETFS_ICTX_WRITETHROUGH, &ctx->flags) ||
+		     iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC))
+	    ) {
+		if (pos < i_size_read(inode)) {
+			ret = filemap_write_and_wait_range(mapping, pos, pos + iter->count);
+			if (ret < 0) {
+				goto out;
+			}
+		}
+
+		wbc_attach_fdatawrite_inode(&wbc, mapping->host);
+
+		wreq = netfs_begin_writethrough(iocb, iter->count);
+		if (IS_ERR(wreq)) {
+			wbc_detach_inode(&wbc);
+			ret = PTR_ERR(wreq);
+			wreq = NULL;
+			goto out;
+		}
+		if (!is_sync_kiocb(iocb))
+			wreq->iocb = iocb;
+		wreq->cleanup = netfs_cleanup_buffered_write;
+	}
+
 	do {
 		size_t flen;
 		size_t offset;	/* Offset into pagecache folio */
@@ -315,7 +349,25 @@  ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		written += copied;
 
-		folio_mark_dirty(folio);
+		if (likely(!wreq)) {
+			folio_mark_dirty(folio);
+		} else {
+			if (folio_test_dirty(folio))
+				/* Sigh.  mmap. */
+				folio_clear_dirty_for_io(folio);
+			/* We make multiple writes to the folio... */
+			if (!folio_test_writeback(folio)) {
+				folio_wait_fscache(folio);
+				folio_start_writeback(folio);
+				folio_start_fscache(folio);
+				if (wreq->iter.count == 0)
+					trace_netfs_folio(folio, netfs_folio_trace_wthru);
+				else
+					trace_netfs_folio(folio, netfs_folio_trace_wthru_plus);
+			}
+			netfs_advance_writethrough(wreq, copied,
+						   offset + copied == flen);
+		}
 	retry:
 		folio_unlock(folio);
 		folio_put(folio);
@@ -325,17 +377,14 @@  ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 	} while (iov_iter_count(iter));
 
 out:
-	if (likely(written)) {
-		/* Flush and wait for a write that requires immediate synchronisation. */
-		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) {
-			_debug("dsync");
-			ret = filemap_fdatawait_range(mapping, iocb->ki_pos,
-						      iocb->ki_pos + written);
-		}
-
-		iocb->ki_pos += written;
+	if (unlikely(wreq)) {
+		ret = netfs_end_writethrough(wreq, iocb);
+		wbc_detach_inode(&wbc);
+		if (ret == -EIOCBQUEUED)
+			return ret;
 	}
 
+	iocb->ki_pos += written;
 	_leave(" = %zd [%zd]", written, ret);
 	return written ? written : ret;
 
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index fe72280b0f30..b3749d6ec1ff 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -101,6 +101,9 @@  static inline void netfs_see_request(struct netfs_io_request *rreq,
  */
 int netfs_begin_write(struct netfs_io_request *wreq, bool may_wait,
 		      enum netfs_write_trace what);
+struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len);
+int netfs_advance_writethrough(struct netfs_io_request *wreq, size_t copied, bool to_page_end);
+int netfs_end_writethrough(struct netfs_io_request *wreq, struct kiocb *iocb);
 
 /*
  * stats.c
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 8d5ee0f56f28..7139397931b7 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -33,6 +33,7 @@  static const char *netfs_origins[nr__netfs_io_origin] = {
 	[NETFS_READPAGE]		= "RP",
 	[NETFS_READ_FOR_WRITE]		= "RW",
 	[NETFS_WRITEBACK]		= "WB",
+	[NETFS_WRITETHROUGH]		= "WT",
 	[NETFS_LAUNDER_WRITE]		= "LW",
 	[NETFS_UNBUFFERED_WRITE]	= "UW",
 	[NETFS_DIO_READ]		= "DR",
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 16252cc4576e..37626328577e 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -42,6 +42,7 @@  struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 	rreq->debug_id	= atomic_inc_return(&debug_ids);
 	xa_init(&rreq->bounce);
 	INIT_LIST_HEAD(&rreq->subrequests);
+	INIT_WORK(&rreq->work, NULL);
 	refcount_set(&rreq->ref, 1);
 
 	__set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
diff --git a/fs/netfs/output.c b/fs/netfs/output.c
index cc9065733b42..625eb68f3e5a 100644
--- a/fs/netfs/output.c
+++ b/fs/netfs/output.c
@@ -386,3 +386,93 @@  int netfs_begin_write(struct netfs_io_request *wreq, bool may_wait,
 		    TASK_UNINTERRUPTIBLE);
 	return wreq->error;
 }
+
+/*
+ * Begin a write operation for writing through the pagecache.
+ */
+struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len)
+{
+	struct netfs_io_request *wreq;
+	struct file *file = iocb->ki_filp;
+
+	wreq = netfs_alloc_request(file->f_mapping, file, iocb->ki_pos, len,
+				   NETFS_WRITETHROUGH);
+	if (IS_ERR(wreq))
+		return wreq;
+
+	trace_netfs_write(wreq, netfs_write_trace_writethrough);
+
+	__set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
+	iov_iter_xarray(&wreq->iter, ITER_SOURCE, &wreq->mapping->i_pages, wreq->start, 0);
+	wreq->io_iter = wreq->iter;
+
+	/* ->outstanding > 0 carries a ref */
+	netfs_get_request(wreq, netfs_rreq_trace_get_for_outstanding);
+	atomic_set(&wreq->nr_outstanding, 1);
+	return wreq;
+}
+
+static void netfs_submit_writethrough(struct netfs_io_request *wreq, bool final)
+{
+	struct netfs_inode *ictx = netfs_inode(wreq->inode);
+	unsigned long long start;
+	size_t len;
+
+	if (!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
+		return;
+
+	start = wreq->start + wreq->submitted;
+	len = wreq->iter.count - wreq->submitted;
+	if (!final) {
+		len /= wreq->wsize; /* Round to number of maximum packets */
+		len *= wreq->wsize;
+	}
+
+	ictx->ops->create_write_requests(wreq, start, len);
+	wreq->submitted += len;
+}
+
+/*
+ * Advance the state of the write operation used when writing through the
+ * pagecache.  Data has been copied into the pagecache that we need to append
+ * to the request.  If we've added more than wsize then we need to create a new
+ * subrequest.
+ */
+int netfs_advance_writethrough(struct netfs_io_request *wreq, size_t copied, bool to_page_end)
+{
+	_enter("ic=%zu sb=%zu ws=%u cp=%zu tp=%u",
+	       wreq->iter.count, wreq->submitted, wreq->wsize, copied, to_page_end);
+
+	wreq->iter.count += copied;
+	wreq->io_iter.count += copied;
+	if (to_page_end && wreq->io_iter.count - wreq->submitted >= wreq->wsize)
+		netfs_submit_writethrough(wreq, false);
+
+	return wreq->error;
+}
+
+/*
+ * End a write operation used when writing through the pagecache.
+ */
+int netfs_end_writethrough(struct netfs_io_request *wreq, struct kiocb *iocb)
+{
+	int ret = -EIOCBQUEUED;
+
+	_enter("ic=%zu sb=%zu ws=%u",
+	       wreq->iter.count, wreq->submitted, wreq->wsize);
+
+	if (wreq->submitted < wreq->io_iter.count)
+		netfs_submit_writethrough(wreq, true);
+
+	if (atomic_dec_and_test(&wreq->nr_outstanding))
+		netfs_write_terminated(wreq, false);
+
+	if (is_sync_kiocb(iocb)) {
+		wait_on_bit(&wreq->flags, NETFS_RREQ_IN_PROGRESS,
+			    TASK_UNINTERRUPTIBLE);
+		ret = wreq->error;
+	}
+
+	netfs_put_request(wreq, false, netfs_rreq_trace_put_return);
+	return ret;
+}
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a7c2cb856e81..fc77f7be220a 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -139,6 +139,7 @@  struct netfs_inode {
 	unsigned long		flags;
 #define NETFS_ICTX_ODIRECT	0		/* The file has DIO in progress */
 #define NETFS_ICTX_UNBUFFERED	1		/* I/O should not use the pagecache */
+#define NETFS_ICTX_WRITETHROUGH	2		/* Write-through caching */
 };
 
 /*
@@ -227,6 +228,7 @@  enum netfs_io_origin {
 	NETFS_READPAGE,			/* This read is a synchronous read */
 	NETFS_READ_FOR_WRITE,		/* This read is to prepare a write */
 	NETFS_WRITEBACK,		/* This write was triggered by writepages */
+	NETFS_WRITETHROUGH,		/* This write was made by netfs_perform_write() */
 	NETFS_LAUNDER_WRITE,		/* This is triggered by ->launder_folio() */
 	NETFS_UNBUFFERED_WRITE,		/* This is an unbuffered write */
 	NETFS_DIO_READ,			/* This is a direct I/O read */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index cc998798e20a..447a8c21cf57 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -27,13 +27,15 @@ 
 	EM(netfs_write_trace_dio_write,		"DIO-WRITE")	\
 	EM(netfs_write_trace_launder,		"LAUNDER  ")	\
 	EM(netfs_write_trace_unbuffered_write,	"UNB-WRITE")	\
-	E_(netfs_write_trace_writeback,		"WRITEBACK")
+	EM(netfs_write_trace_writeback,		"WRITEBACK")	\
+	E_(netfs_write_trace_writethrough,	"WRITETHRU")
 
 #define netfs_rreq_origins					\
 	EM(NETFS_READAHEAD,			"RA")		\
 	EM(NETFS_READPAGE,			"RP")		\
 	EM(NETFS_READ_FOR_WRITE,		"RW")		\
 	EM(NETFS_WRITEBACK,			"WB")		\
+	EM(NETFS_WRITETHROUGH,			"WT")		\
 	EM(NETFS_LAUNDER_WRITE,			"LW")		\
 	EM(NETFS_UNBUFFERED_WRITE,		"UW")		\
 	EM(NETFS_DIO_READ,			"DR")		\
@@ -136,7 +138,9 @@ 
 	EM(netfs_folio_trace_redirty,		"redirty")	\
 	EM(netfs_folio_trace_redirtied,		"redirtied")	\
 	EM(netfs_folio_trace_store,		"store")	\
-	E_(netfs_folio_trace_store_plus,	"store+")
+	EM(netfs_folio_trace_store_plus,	"store+")	\
+	EM(netfs_folio_trace_wthru,		"wthru")	\
+	E_(netfs_folio_trace_wthru_plus,	"wthru+")
 
 #ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
 #define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
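
The byte accounting in netfs_advance_writethrough(), netfs_submit_writethrough() and netfs_end_writethrough() above can be followed more easily in isolation. The following is a minimal userspace sketch, not kernel code: struct wt_sketch and the wt_*() helpers are hypothetical names invented for illustration, and they model only the count/submitted/wsize arithmetic. Subrequest creation, refcounting and error handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the byte counters kept in struct netfs_io_request. */
struct wt_sketch {
	size_t count;		/* Bytes copied into the pagecache so far (iter.count) */
	size_t submitted;	/* Bytes already handed off for upload */
	size_t wsize;		/* Maximum size of a single write to the server */
};

/*
 * Models netfs_submit_writethrough(): submit the pending span, rounded down
 * to a whole number of wsize-sized units unless this is the final flush.
 * Returns the number of bytes submitted by this call.
 */
static size_t wt_submit(struct wt_sketch *w, bool final)
{
	size_t len = w->count - w->submitted;

	assert(w->wsize > 0);
	if (!final)
		len -= len % w->wsize;	/* Same effect as len /= wsize; len *= wsize */
	w->submitted += len;
	return len;
}

/*
 * Models netfs_advance_writethrough(): account for newly copied data and
 * submit only when the copy reached the end of a page and at least wsize
 * bytes are pending.
 */
static size_t wt_advance(struct wt_sketch *w, size_t copied, bool to_page_end)
{
	w->count += copied;
	if (to_page_end && w->count - w->submitted >= w->wsize)
		return wt_submit(w, false);
	return 0;
}

/* Models the flush in netfs_end_writethrough(): submit whatever is left. */
static size_t wt_end(struct wt_sketch *w)
{
	if (w->submitted < w->count)
		return wt_submit(w, true);
	return 0;
}
```

With wsize set to 8192, two page-aligned 4096-byte copies trigger one 8192-byte submission on the second call. A trailing 1000-byte copy that stops short of the page end is held back until wt_end() flushes it.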
