[v5,40/40] 9p: Use netfslib read/write_iter

Message ID	20231221132400.1601991-41-dhowells@redhat.com (mailing list archive)
State	New, archived
From	David Howells <dhowells@redhat.com>
To	Jeff Layton <jlayton@kernel.org>, Steve French <smfrench@gmail.com>
Cc	David Howells <dhowells@redhat.com>, Matthew Wilcox <willy@infradead.org>, Marc Dionne <marc.dionne@auristor.com>, Paulo Alcantara <pc@manguebit.com>, Shyam Prasad N <sprasad@microsoft.com>, Tom Talpey <tom@talpey.com>, Dominique Martinet <asmadeus@codewreck.org>, Eric Van Hensbergen <ericvh@kernel.org>, Ilya Dryomov <idryomov@gmail.com>, Christian Brauner <christian@brauner.io>, linux-cachefs@redhat.com, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Latchesar Ionkov <lucho@ionkov.net>, Christian Schoenebeck <linux_oss@crudebyte.com>
Subject	[PATCH v5 40/40] 9p: Use netfslib read/write_iter
Date	Thu, 21 Dec 2023 13:23:35 +0000
In-Reply-To	<20231221132400.1601991-1-dhowells@redhat.com>
Series	netfs, afs, 9p: Delegate high-level I/O to netfslib

David Howells Dec. 21, 2023, 1:23 p.m. UTC

Use netfslib's read and write iteration helpers, allowing netfslib to take
over the management of the page cache for 9p files and to manage local disk
caching.  In particular, this eliminates write_begin, write_end, writepage
and all mentions of struct page and struct folio from 9p.

Note that netfslib now offers the possibility of write-through caching if
that is desirable for 9p: just set the NETFS_ICTX_WRITETHROUGH flag in
v9inode->netfs.flags in v9fs_set_netfs_context().

Note also this is untested as I can't get ganesha.nfsd to correctly parse
the config to turn on 9p support.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
---

Notes:
    Changes
    =======
    ver #5)
     - Added some missing remote_i_size setting.
     - Added missing writepages (else mmap write never written back).

 fs/9p/vfs_addr.c       | 293 ++++++++++-------------------------------
 fs/9p/vfs_file.c       |  89 ++-----------
 fs/9p/vfs_inode.c      |   5 +-
 fs/9p/vfs_inode_dotl.c |   7 +-
 4 files changed, 85 insertions(+), 309 deletions(-)

David Howells Jan. 2, 2024, 3:39 p.m. UTC | #1

Hi Eric, Latchesar, Dominique,

Would you have any chance to look at the 9p patch before the merge window
opens?  If not, what should I do with the patch?  Should I keep it, or should
I drop it for now and give it to you to take through the 9p tree if Linus
accepts the rest of the patchset?

Thanks,
David

Dominique Martinet Jan. 3, 2024, 7:22 a.m. UTC | #2

David Howells wrote on Thu, Dec 21, 2023 at 01:23:35PM +0000:
> Use netfslib's read and write iteration helpers, allowing netfslib to take
> over the management of the page cache for 9p files and to manage local disk
> caching.  In particular, this eliminates write_begin, write_end, writepage
> and all mentions of struct page and struct folio from 9p.
> 
> Note that netfslib now offers the possibility of write-through caching if
> that is desirable for 9p: just set the NETFS_ICTX_WRITETHROUGH flag in
> v9inode->netfs.flags in v9fs_set_netfs_context().
> 
> Note also this is untested as I can't get ganesha.nfsd to correctly parse
> the config to turn on 9p support.

(that's apparently no longer true and might need updating)


> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> cc: Eric Van Hensbergen <ericvh@kernel.org>
> cc: Latchesar Ionkov <lucho@ionkov.net>
> cc: Dominique Martinet <asmadeus@codewreck.org>

At a fairly high level, I've played with this a bit and see no obvious
regression with the extra patch.

I've also manually confirmed one of the big improvements I'd been asking
for: writes in cached modes, which used to be chunked to 4k, are now
properly aggregated, so e.g. 'dd bs=1M count=1' will properly issue a
minimal number of TWRITE calls capped by msize -- this is great!
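
As a rough illustration of that arithmetic, here is a userspace sketch
(not kernel code). The 24-byte header allowance is meant to mirror the
kernel's P9_IOHDRSZ, and the 128 KiB msize is only an example value; the
helper names are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-message header allowance, mirroring P9_IOHDRSZ (24 bytes). */
#define SKETCH_IOHDRSZ 24u

/* RPCs needed to send len bytes when each RPC carries at most chunk bytes. */
static size_t sk_rpc_count(size_t len, size_t chunk)
{
	assert(chunk > 0);
	return (len + chunk - 1) / chunk;	/* ceiling division */
}

/* Payload that fits in one TWRITE for a given msize (hypothetical helper). */
static size_t sk_twrite_payload(size_t msize)
{
	assert(msize > SKETCH_IOHDRSZ);
	return msize - SKETCH_IOHDRSZ;
}
```

Under those assumptions a 1 MiB write needs 256 requests when chunked to
4k, versus 9 when each request is filled up to msize.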

I've noticed we don't cache xattrs at all, so with the default mount
options on a kernel built with 9P_FS_SECURITY we'll get a gazillion
lookups for security.capabilities... But that's another problem, and
this is still an improvement so no reason to hold back.

I've got a couple of questions below, but:

Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>


(I'd still be extremely thankful if Christian and/or Eric would have
time to check as well, but I won't push back on merging it this merge
window next week if they don't have time... I'll also keep trying to run
some more tests as time allows)

> cc: Christian Schoenebeck <linux_oss@crudebyte.com>
> cc: v9fs@lists.linux.dev
> cc: linux-cachefs@redhat.com
> cc: linux-fsdevel@vger.kernel.org
> ---
> 
> Notes:
>     Changes
>     =======
>     ver #5)
>      - Added some missing remote_i_size setting.
>      - Added missing writepages (else mmap write never written back).
> 
>  fs/9p/vfs_addr.c       | 293 ++++++++++-------------------------------
>  fs/9p/vfs_file.c       |  89 ++-----------
>  fs/9p/vfs_inode.c      |   5 +-
>  fs/9p/vfs_inode_dotl.c |   7 +-
>  4 files changed, 85 insertions(+), 309 deletions(-)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index 055b672a247d..20f072c18ce9 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -19,12 +19,48 @@
>  #include <linux/netfs.h>
>  #include <net/9p/9p.h>
>  #include <net/9p/client.h>
> +#include <trace/events/netfs.h>
>  
>  #include "v9fs.h"
>  #include "v9fs_vfs.h"
>  #include "cache.h"
>  #include "fid.h"
>  
> +static void v9fs_upload_to_server(struct netfs_io_subrequest *subreq)
> +{
> +	struct inode *inode = subreq->rreq->inode;
> +	struct v9fs_inode __maybe_unused *v9inode = V9FS_I(inode);

Any reason to have this variable assignment at all?
(I assume it'll get optimized away, but it looks like that's not a maybe
here so was a bit surprised -- I guess it's just been copy-pasted from
the old code getting the fscache cookie?)

> +	struct p9_fid *fid = subreq->rreq->netfs_priv;
> +	int err;
> +
> +	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> +	p9_client_write(fid, subreq->start, &subreq->io_iter, &err);

p9_client_write return value should always be subreq->len, but I believe
we should use it unless err is set.
(It's also possible for partial writes to happen, e.g. p9_client_write
looped a few times and then failed, at which point the size returned
would be the amount that actually got through -- we probably should do
something with that?)

> +	netfs_write_subrequest_terminated(subreq, err < 0 ? err : subreq->len,
> +					  false);
> +}
> +
> +static void v9fs_upload_to_server_worker(struct work_struct *work)
> +{
> +	struct netfs_io_subrequest *subreq =
> +		container_of(work, struct netfs_io_subrequest, work);
> +
> +	v9fs_upload_to_server(subreq);
> +}
> +
> +/*
> + * Set up write requests for a writeback slice.  We need to add a write request
> + * for each write we want to make.
> + */
> +static void v9fs_create_write_requests(struct netfs_io_request *wreq, loff_t start, size_t len)
> +{
> +	struct netfs_io_subrequest *subreq;
> +
> +	subreq = netfs_create_write_request(wreq, NETFS_UPLOAD_TO_SERVER,
> +					    start, len, v9fs_upload_to_server_worker);
> +	if (subreq)
> +		netfs_queue_write_request(subreq);
> +}
> +
>  /**
>   * v9fs_issue_read - Issue a read from 9P
>   * @subreq: The read to make
> @@ -33,14 +69,10 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
>  {
>  	struct netfs_io_request *rreq = subreq->rreq;
>  	struct p9_fid *fid = rreq->netfs_priv;
> -	struct iov_iter to;
> -	loff_t pos = subreq->start + subreq->transferred;
> -	size_t len = subreq->len   - subreq->transferred;
>  	int total, err;
>  
> -	iov_iter_xarray(&to, ITER_DEST, &rreq->mapping->i_pages, pos, len);
> -
> -	total = p9_client_read(fid, pos, &to, &err);
> +	total = p9_client_read(fid, subreq->start + subreq->transferred,
> +			       &subreq->io_iter, &err);

Just to clarify: subreq->io_iter didn't exist (or some conditions to use
it weren't cleared) before?

>  
>  	/* if we just extended the file size, any portion not in
>  	 * cache won't be on server and is zeroes */
> @@ -50,23 +82,37 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
>  }
>  
>  /**
> - * v9fs_init_request - Initialise a read request
> + * v9fs_init_request - Initialise a request
>   * @rreq: The read request
>   * @file: The file being read from
>   */
>  static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
>  {
> -	struct p9_fid *fid = file->private_data;
> -
> -	BUG_ON(!fid);
> +	struct p9_fid *fid;
> +	bool writing = (rreq->origin == NETFS_READ_FOR_WRITE ||
> +			rreq->origin == NETFS_WRITEBACK ||
> +			rreq->origin == NETFS_WRITETHROUGH ||
> +			rreq->origin == NETFS_LAUNDER_WRITE ||
> +			rreq->origin == NETFS_UNBUFFERED_WRITE ||
> +			rreq->origin == NETFS_DIO_WRITE);
> +
> +	if (file) {
> +		fid = file->private_data;
> +		BUG_ON(!fid);

This probably should be WARN + return EINVAL like find by inode?
It's certainly a huge problem, but we should avoid BUG if possible...

> +		p9_fid_get(fid);
> +	} else {
> +		fid = v9fs_fid_find_inode(rreq->inode, writing, INVALID_UID, true);
> +		if (!fid) {
> +			WARN_ONCE(1, "folio expected an open fid inode->i_private=%p\n",
> +				  rreq->inode->i_private);

nit: not sure what's cleaner?
Since there's a message that makes for a bit awkward if...

if (WARN_ONCE(!fid, "folio expected an open fid inode->i_private=%p\n",
	      rreq->inode->i_private))
	return -EINVAL;

(as a side note, I'm not sure what to make of this i_private pointer
here, but if that'll help you figure something out sure..)

> +			return -EINVAL;
> +		}
> +	}
>  
>  	/* we might need to read from a fid that was opened write-only
>  	 * for read-modify-write of page cache, use the writeback fid
>  	 * for that */
> -	WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE &&
> -			!(fid->mode & P9_ORDWR));
> -
> -	p9_fid_get(fid);
> +	WARN_ON(writing && !(fid->mode & P9_ORDWR));

This is as follows on your netfs-lib branch:
-       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE &&
-                       !(fid->mode & P9_ORDWR));
-
-       p9_fid_get(fid);
+       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE && !(fid->mode & P9_ORDWR));

So the WARN_ON has been reverted back with only indentation changed;
I guess there were patterns that were writing despite the fid not having
been open as RDWR?
Do you still have details about these?

If a file has been open without the write bit it might not go through,
and it's incredibly difficult to get such users back to userspace in
async cases (e.g. mmap flushes), so would like to understand that.

>  	rreq->netfs_priv = fid;
>  	return 0;
>  }
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 11cd8d23f6f2..bae330c2f0cf 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -353,25 +353,15 @@ static ssize_t
>  v9fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct p9_fid *fid = iocb->ki_filp->private_data;
> -	int ret, err = 0;
>  
>  	p9_debug(P9_DEBUG_VFS, "fid %d count %zu offset %lld\n",
>  		 fid->fid, iov_iter_count(to), iocb->ki_pos);
>  
> -	if (!(fid->mode & P9L_DIRECT)) {
> -		p9_debug(P9_DEBUG_VFS, "(cached)\n");
> -		return generic_file_read_iter(iocb, to);
> -	}
> -
> -	if (iocb->ki_filp->f_flags & O_NONBLOCK)
> -		ret = p9_client_read_once(fid, iocb->ki_pos, to, &err);
> -	else
> -		ret = p9_client_read(fid, iocb->ki_pos, to, &err);
> -	if (!ret)
> -		return err;
> +	if (fid->mode & P9L_DIRECT)
> +		return netfs_unbuffered_read_iter(iocb, to);
>  
> -	iocb->ki_pos += ret;
> -	return ret;
> +	p9_debug(P9_DEBUG_VFS, "(cached)\n");

(Not a new problem so no need to address here, but having just
"(cached)" on a split line is a bit weird.. We could first compute cached
or not as a bool and make it %s + cached ? " (cached)" : "" or
something... I'll send a patch after this gets in to avoid conflicts)
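
A userspace sketch of what such a single-line message could look like,
using snprintf() in place of p9_debug(); the function name and argument
list are made up for the illustration and are not taken from any patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Format the read debug line in one go, appending " (cached)" only for
 * the cached path.  Returns snprintf()'s result.
 */
static int sk_format_read_debug(char *buf, size_t size, int fid,
				size_t count, long long offset, bool cached)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "fid %d count %zu offset %lld%s\n",
			fid, count, offset, cached ? " (cached)" : "");
}
```

The suffix is chosen once, so the cached and uncached paths share one
format string and nothing ends up on a line of its own.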

> +	return netfs_file_read_iter(iocb, to);
>  }
>  
>  /*
> @@ -407,46 +397,14 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct p9_fid *fid = file->private_data;
> -	ssize_t retval;
> -	loff_t origin;
> -	int err = 0;
>  
>  	p9_debug(P9_DEBUG_VFS, "fid %d\n", fid->fid);
>  
> -	if (!(fid->mode & (P9L_DIRECT | P9L_NOWRITECACHE))) {
> -		p9_debug(P9_DEBUG_CACHE, "(cached)\n");
> -		return generic_file_write_iter(iocb, from);
> -	}
> +	if (fid->mode & (P9L_DIRECT | P9L_NOWRITECACHE))
> +		return netfs_unbuffered_write_iter(iocb, from);
>  
> -	retval = generic_write_checks(iocb, from);
> -	if (retval <= 0)
> -		return retval;
> -
> -	origin = iocb->ki_pos;
> -	retval = p9_client_write(file->private_data, iocb->ki_pos, from, &err);
> -	if (retval > 0) {
> -		struct inode *inode = file_inode(file);
> -		loff_t i_size;
> -		unsigned long pg_start, pg_end;
> -
> -		pg_start = origin >> PAGE_SHIFT;
> -		pg_end = (origin + retval - 1) >> PAGE_SHIFT;
> -		if (inode->i_mapping && inode->i_mapping->nrpages)
> -			invalidate_inode_pages2_range(inode->i_mapping,
> -						      pg_start, pg_end);
> -		iocb->ki_pos += retval;
> -		i_size = i_size_read(inode);
> -		if (iocb->ki_pos > i_size) {
> -			inode_add_bytes(inode, iocb->ki_pos - i_size);
> -			/*
> -			 * Need to serialize against i_size_write() in
> -			 * v9fs_stat2inode()
> -			 */
> -			v9fs_i_size_write(inode, iocb->ki_pos);
> -		}
> -		return retval;
> -	}
> -	return err;
> +	p9_debug(P9_DEBUG_CACHE, "(cached)\n");
> +	return netfs_file_write_iter(iocb, from);
>  }
>  
>  static int v9fs_file_fsync(struct file *filp, loff_t start, loff_t end,
> @@ -519,36 +477,7 @@ v9fs_file_mmap(struct file *filp, struct vm_area_struct *vma)
>  static vm_fault_t
>  v9fs_vm_page_mkwrite(struct vm_fault *vmf)
>  {
> -	struct folio *folio = page_folio(vmf->page);
> -	struct file *filp = vmf->vma->vm_file;
> -	struct inode *inode = file_inode(filp);
> -
> -
> -	p9_debug(P9_DEBUG_VFS, "folio %p fid %lx\n",
> -		 folio, (unsigned long)filp->private_data);
> -
> -	/* Wait for the page to be written to the cache before we allow it to
> -	 * be modified.  We then assume the entire page will need writing back.
> -	 */
> -#ifdef CONFIG_9P_FSCACHE
> -	if (folio_test_fscache(folio) &&
> -	    folio_wait_fscache_killable(folio) < 0)
> -		return VM_FAULT_NOPAGE;
> -#endif
> -
> -	/* Update file times before taking page lock */
> -	file_update_time(filp);
> -
> -	if (folio_lock_killable(folio) < 0)
> -		return VM_FAULT_RETRY;
> -	if (folio_mapping(folio) != inode->i_mapping)
> -		goto out_unlock;
> -	folio_wait_stable(folio);
> -
> -	return VM_FAULT_LOCKED;
> -out_unlock:
> -	folio_unlock(folio);
> -	return VM_FAULT_NOPAGE;
> +	return netfs_page_mkwrite(vmf, NULL);

(I guess there's no helper that could be used directly in .page_mkwrite
op?)

>  }
>  
>  static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
> index 74122540e00f..55345753ae8d 100644
> --- a/fs/9p/vfs_inode.c
> +++ b/fs/9p/vfs_inode.c
> @@ -374,10 +374,8 @@ void v9fs_evict_inode(struct inode *inode)
>  
>  	truncate_inode_pages_final(&inode->i_data);
>  
> -#ifdef CONFIG_9P_FSCACHE
>  	version = cpu_to_le32(v9inode->qid.version);
>  	netfs_clear_inode_writeback(inode, &version);
> -#endif
>  
>  	clear_inode(inode);
>  	filemap_fdatawrite(&inode->i_data);
> @@ -1112,7 +1110,7 @@ static int v9fs_vfs_setattr(struct mnt_idmap *idmap,
>  	if ((iattr->ia_valid & ATTR_SIZE) &&
>  		 iattr->ia_size != i_size_read(inode)) {
>  		truncate_setsize(inode, iattr->ia_size);
> -		truncate_pagecache(inode, iattr->ia_size);
> +		netfs_resize_file(netfs_inode(inode), iattr->ia_size, true);
>  
>  #ifdef CONFIG_9P_FSCACHE
>  		if (v9ses->cache & CACHE_FSCACHE) {
> @@ -1180,6 +1178,7 @@ v9fs_stat2inode(struct p9_wstat *stat, struct inode *inode,
>  	mode |= inode->i_mode & ~S_IALLUGO;
>  	inode->i_mode = mode;
>  
> +	v9inode->netfs.remote_i_size = stat->length;
>  	if (!(flags & V9FS_STAT2INODE_KEEP_ISIZE))
>  		v9fs_i_size_write(inode, stat->length);
>  	/* not real number of blocks, but 512 byte ones ... */
> diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
> index c7319af2f471..e25fbc988f09 100644
> --- a/fs/9p/vfs_inode_dotl.c
> +++ b/fs/9p/vfs_inode_dotl.c
> @@ -598,7 +598,7 @@ int v9fs_vfs_setattr_dotl(struct mnt_idmap *idmap,
>  	if ((iattr->ia_valid & ATTR_SIZE) && iattr->ia_size !=
>  		 i_size_read(inode)) {
>  		truncate_setsize(inode, iattr->ia_size);
> -		truncate_pagecache(inode, iattr->ia_size);
> +		netfs_resize_file(netfs_inode(inode), iattr->ia_size, true);
>  
>  #ifdef CONFIG_9P_FSCACHE
>  		if (v9ses->cache & CACHE_FSCACHE)
> @@ -655,6 +655,7 @@ v9fs_stat2inode_dotl(struct p9_stat_dotl *stat, struct inode *inode,
>  		mode |= inode->i_mode & ~S_IALLUGO;
>  		inode->i_mode = mode;
>  
> +		v9inode->netfs.remote_i_size = stat->st_size;
>  		if (!(flags & V9FS_STAT2INODE_KEEP_ISIZE))
>  			v9fs_i_size_write(inode, stat->st_size);
>  		inode->i_blocks = stat->st_blocks;
> @@ -683,8 +684,10 @@ v9fs_stat2inode_dotl(struct p9_stat_dotl *stat, struct inode *inode,
>  			inode->i_mode = mode;
>  		}
>  		if (!(flags & V9FS_STAT2INODE_KEEP_ISIZE) &&
> -		    stat->st_result_mask & P9_STATS_SIZE)
> +		    stat->st_result_mask & P9_STATS_SIZE) {
> +			v9inode->netfs.remote_i_size = stat->st_size;
>  			v9fs_i_size_write(inode, stat->st_size);
> +		}
>  		if (stat->st_result_mask & P9_STATS_BLOCKS)
>  			inode->i_blocks = stat->st_blocks;
>  	}
>

David Howells Jan. 3, 2024, 12:08 p.m. UTC | #3

Dominique Martinet <asmadeus@codewreck.org> wrote:

> I've also manually confirmed one of the big improvements I'd been asking
> for: writes in cached modes, which used to be chunked to 4k, are now
> properly aggregated, so e.g. 'dd bs=1M count=1' will properly issue a
> minimal number of TWRITE calls capped by msize -- this is great!

After the merge window, we can look at enabling multipage folios for 9p.

> I've noticed we don't cache xattrs at all,

I haven't given this any particular thought.  We could attach them to the
cachefile object as xattrs, but it means you have to do two xattr lookups in
the event of a cache miss.

At this point, I'm going to ask Christian to stack the extra patch on his
branch rather than folding it down and retagging.

> I've got a couple of questions below, but:

I'll address those separately.

> Tested-by: Dominique Martinet <asmadeus@codewreck.org>
> Acked-by: Dominique Martinet <asmadeus@codewreck.org>

Thanks!

David

David Howells Jan. 3, 2024, 12:39 p.m. UTC | #4

Dominique Martinet <asmadeus@codewreck.org> wrote:

> > +static void v9fs_upload_to_server(struct netfs_io_subrequest *subreq)
> > +{
> > +	struct inode *inode = subreq->rreq->inode;
> > +	struct v9fs_inode __maybe_unused *v9inode = V9FS_I(inode);
> 
> Any reason to have this variable assignment at all?

I'll just remove it.  The __maybe_unused suppressed the warning, otherwise I'd
have removed it already.

> p9_client_write return value should always be subreq->len, but I believe
> we should use it unless err is set.
> (It's also possible for partial writes to happen, e.g. p9_client_write
> looped a few times and then failed, at which point the size returned
> would be the amount that actually got through -- we probably should do
> something with that?)

How about something like:

-	int err;
+	int err, len;

 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-	p9_client_write(fid, subreq->start, &subreq->io_iter, &err);
-	netfs_write_subrequest_terminated(subreq, err < 0 ? err : subreq->len,
-					  false);
+	len = p9_client_write(fid, subreq->start, &subreq->io_iter, &err);
+	netfs_write_subrequest_terminated(subreq, len ?: err, false);
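
The effect of 'len ?: err' can be modelled in plain userspace C. This is
only a sketch: the function name is invented, and it spells out the GNU
'?:' shorthand as a standard conditional.

```c
#include <assert.h>

/*
 * Value handed to the terminator: the byte count if anything was
 * written (even partially), otherwise the error code.
 */
static long sk_write_outcome(long len, int err)
{
	assert(len >= 0 && err <= 0);
	return len ? len : err;
}
```

So a write that moved some data before failing reports the short count,
and the error is only reported when nothing got through.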

> > +	total = p9_client_read(fid, subreq->start + subreq->transferred,
> > +			       &subreq->io_iter, &err);
> 
> Just to clarify: subreq->io_iter didn't exist (or some conditions to use
> it weren't cleared) before?

Correct.  It's added in the netfs-lib patches.  I've provided a way to
separate the user-side iterator from the I/O-side iterator to allow the use of
a bounce buffer for the purpose of content crypto, compression or just having
to deal with RMW cycles to a larger block size on the server.

> > +	if (file) {
> > +		fid = file->private_data;
> > +		BUG_ON(!fid);
> 
> This probably should be WARN + return EINVAL like find by inode?
> It's certainly a huge problem, but we should avoid BUG if possible...

Sure.  The BUG_ON() was already there, but I can turn it into a WARN+error.

> nit: not sure what's cleaner?
> Since there's a message that makes for a bit awkward if...
> 
> if (WARN_ONCE(!fid, "folio expected an open fid inode->i_private=%p\n",
> 	      rreq->inode->i_private))
> 	return -EINVAL;
> 
> (as a side note, I'm not sure what to make of this i_private pointer
> here, but if that'll help you figure something out sure..)

Um.  9p is using i_private.  But perhaps i_ino would be a better choice:

	if (file) {
		fid = file->private_data;
		if (!fid)
			goto no_fid;
		p9_fid_get(fid);
	} else {
		fid = v9fs_fid_find_inode(rreq->inode, writing, INVALID_UID, true);
		if (!fid)
			goto no_fid;
	}

	...

no_fid:
	WARN_ONCE(1, "folio expected an open fid inode->i_ino=%lx\n",
		  rreq->inode->i_ino);
	return -EINVAL;
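
The same control flow as a self-contained userspace model, for
illustration only. The sk_* types are stand-ins invented for the sketch,
the refcount bump stands in for p9_fid_get(), and the lookup is assumed
to hand back a fid that the request may hold on to:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_fid { int refs; };
struct sk_file { struct sk_fid *private_data; };

/*
 * Pick the fid from the file if there is one, else from the inode
 * lookup result.  Both failure cases share a single error exit.
 */
static int sk_init_request(struct sk_file *file, struct sk_fid *inode_fid,
			   struct sk_fid **out)
{
	struct sk_fid *fid;

	assert(out != NULL);
	if (file) {
		fid = file->private_data;
		if (!fid)
			goto no_fid;
	} else {
		fid = inode_fid;	/* models v9fs_fid_find_inode() */
		if (!fid)
			goto no_fid;
	}

	fid->refs++;	/* the request now holds one reference */
	*out = fid;
	return 0;

no_fid:
	/* The kernel version would WARN_ONCE() here before failing. */
	return -EINVAL;
}
```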

> This is as follows on your netfs-lib branch:
> -       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE &&
> -                       !(fid->mode & P9_ORDWR));
> -
> -       p9_fid_get(fid);
> +       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE && !(fid->mode & P9_ORDWR));
> 
> So the WARN_ON has been reverted back with only indentation changed;
> I guess there were patterns that were writing despite the fid not having
> been open as RDWR?
> Do you still have details about these?

The condition in the WARN_ON() here got changed.  It was:

	WARN_ON(writing && ...

at one point, but that caused a bunch of incorrect warnings to appear because
only NETFS_READ_FOR_WRITE requires read-access as well as write-access.  All
the others:

	bool writing = (rreq->origin == NETFS_READ_FOR_WRITE ||
			rreq->origin == NETFS_WRITEBACK ||
			rreq->origin == NETFS_WRITETHROUGH ||
			rreq->origin == NETFS_LAUNDER_WRITE ||
			rreq->origin == NETFS_UNBUFFERED_WRITE ||
			rreq->origin == NETFS_DIO_WRITE);

only require write-access.

There will be an additional one if we roll out content crypto to 9p as we may
need to do RMW cycles occasionally - but that's off to one side just for the
moment.
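
To make the difference between the two conditions concrete, here is a
small userspace model. The enum names are local stand-ins for the
origins listed above, and the mode values follow the 9P open-mode
encoding (OREAD 0, OWRITE 1, ORDWR 2); none of this is kernel code.

```c
#include <stdbool.h>

/* Local stand-ins for the request origins named in the thread. */
enum sk_origin {
	SK_READAHEAD,
	SK_READ_FOR_WRITE,
	SK_WRITEBACK,
	SK_WRITETHROUGH,
	SK_LAUNDER_WRITE,
	SK_UNBUFFERED_WRITE,
	SK_DIO_WRITE,
};

/* 9P open modes as used by the check. */
enum { SK_OREAD = 0, SK_OWRITE = 1, SK_ORDWR = 2 };

/* Mirrors the 'writing' flag: every origin except a plain read. */
static bool sk_is_writing(enum sk_origin origin)
{
	return origin != SK_READAHEAD;
}

/* The over-broad check: warns for any write-side origin without ORDWR. */
static bool sk_warn_broad(enum sk_origin origin, unsigned int mode)
{
	return sk_is_writing(origin) && !(mode & SK_ORDWR);
}

/* The narrow check: only read-modify-write needs read access too. */
static bool sk_warn_narrow(enum sk_origin origin, unsigned int mode)
{
	return origin == SK_READ_FOR_WRITE && !(mode & SK_ORDWR);
}
```

With a write-only fid, writeback trips the broad check but not the
narrow one, which is the spurious warning described above.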

> If a file has been open without the write bit it might not go through,
> and it's incredibly difficult to get such users back to userspace in
> async cases (e.g. mmap flushes), so would like to understand that.

The VFS/VM should prevent writing to files that aren't open O_WRONLY or
O_RDWR, so I don't think we should be called in otherwise.

Note that I'm intending to change the way fscache is driven when we fetch
cacheable data from the server so that I can free up the PG_fscache bit and
return it to the MM folks.  Instead of marking the page PG_fscache, I mark it
PG_dirty and set page->private with a special value to indicate it should only
be written to the cache - then the writepages sees that and just writes these
pages to the cache.  I have a patch to do this and it seems to work, but I
need to make ceph and cifs use netfslib before I can apply it.

> > +	p9_debug(P9_DEBUG_VFS, "(cached)\n");
> 
> (Not a new problem so no need to address here, but having just
> "(cached)" on a split line is a bit weird.. We could first compute cached
> or not as a bool and make it %s + cached ? " (cached)" : "" or
> something... I'll send a patch after this gets in to avoid conflicts)

Okay.

> > +	return netfs_page_mkwrite(vmf, NULL);
> 
> (I guess there's no helper that could be used directly in .page_mkwrite
> op?)

I could provide a helper that just supplies NULL as the second argument.  I
think only 9p will use it, but that's fine.

David

Dominique Martinet Jan. 3, 2024, 1 p.m. UTC | #5

David Howells wrote on Wed, Jan 03, 2024 at 12:39:34PM +0000:
> > p9_client_write return value should always be subreq->len, but I believe
> > we should use it unless err is set.
> > (It's also possible for partial writes to happen, e.g. p9_client_write
> > looped a few times and then failed, at which point the size returned
> > would be the amount that actually got through -- we probably should do
> > something with that?)
> 
> How about something like:
> 
> -	int err;
> +	int err, len;
>  
>  	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> -	p9_client_write(fid, subreq->start, &subreq->io_iter, &err);
> -	netfs_write_subrequest_terminated(subreq, err < 0 ? err : subreq->len,
> -					  false);
> +	len = p9_client_write(fid, subreq->start, &subreq->io_iter, &err);
> +	netfs_write_subrequest_terminated(subreq, len ?: err, false);

I think that'll be fine; plain write() syscall works like this when an
error happens after some data has been flushed, and I assume there'll be
some retry if this happened on something like a flush dirty and it got
a partial write reported?

> > > +	if (file) {
> > > +		fid = file->private_data;
> > > +		BUG_ON(!fid);
> > 
> > This probably should be WARN + return EINVAL like find by inode?
> > It's certainly a huge problem, but we should avoid BUG if possible...
> 
> Sure.  The BUG_ON() was already there, but I can turn it into a WARN+error.

Thanks.

> > nit: not sure what's cleaner?
> > Since there's a message that makes for a bit awkward if...
> > 
> > if (WARN_ONCE(!fid, "folio expected an open fid inode->i_private=%p\n",
> > 	      rreq->inode->i_private))
> > 	return -EINVAL;
> > 
> > (as a side note, I'm not sure what to make of this i_private pointer
> > here, but if that'll help you figure something out sure..)
> 
> Um.  9p is using i_private.  But perhaps i_ino would be a better choice:
> 
> 	if (file) {
> 		fid = file->private_data;
> 		if (!fid)
> 			goto no_fid;
> 		p9_fid_get(fid);
> 	} else {
> 		fid = v9fs_fid_find_inode(rreq->inode, writing, INVALID_UID, true);
> 		if (!fid)
> 			goto no_fid;
> 	}
> 
> 	...
> 
> no_fid:
> 	WARN_ONCE(1, "folio expected an open fid inode->i_ino=%lx\n",
> 		  rreq->inode->i_ino);
> 	return -EINVAL;

Might be useful to track down if this came from a file without private
data or lookup failing, but given this was a bug I guess we can deal
with that when that happens -- ack.

> > This is as follows on your netfs-lib branch:
> > -       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE &&
> > -                       !(fid->mode & P9_ORDWR));
> > -
> > -       p9_fid_get(fid);
> > +       WARN_ON(rreq->origin == NETFS_READ_FOR_WRITE && !(fid->mode & P9_ORDWR));
> > 
> > So the WARN_ON has been reverted back with only indentation changed;
> > I guess there were patterns that were writing despite the fid not having
> > been open as RDWR?
> > Do you still have details about these?
> 
> The condition in the WARN_ON() here got changed.  It was:
> 
> 	WARN_ON(writing && ...
> 
> at one point, but that caused a bunch of incorrect warnings to appear because
> only NETFS_READ_FOR_WRITE requires read-access as well as write-access.  All
> the others:
> 
> 	bool writing = (rreq->origin == NETFS_READ_FOR_WRITE ||
> 			rreq->origin == NETFS_WRITEBACK ||
> 			rreq->origin == NETFS_WRITETHROUGH ||
> 			rreq->origin == NETFS_LAUNDER_WRITE ||
> 			rreq->origin == NETFS_UNBUFFERED_WRITE ||
> 			rreq->origin == NETFS_DIO_WRITE);
> 
> only require write-access.

Thanks for clarifying

> > If a file has been open without the write bit it might not go through,
> > and it's incredibly difficult to get such users back to userspace in
> > async cases (e.g. mmap flushes), so would like to understand that.
> 
> The VFS/VM should prevent writing to files that aren't open O_WRONLY or
> O_RDWR, so I don't think we should be called in otherwise.

Historically this check was more about finding a fid that wasn't opened
properly than the VFS doing something weird (e.g. by calling mprotect
after mmap and us missing that -- would need to check if that works
actually...)

> > > +	return netfs_page_mkwrite(vmf, NULL);
> > 
> > (I guess there's no helper that could be used directly in .page_mkwrite
> > op?)
> 
> I could provide a helper that just supplies NULL as the second argument.  I
> think only 9p will use it, but that's fine.

If we're the only user I guess we shouldn't bother with it at this
point, we can come back to it if this ever becomes common.

Eric Van Hensbergen Jan. 3, 2024, 7:52 p.m. UTC | #6

On Wed, Jan 03, 2024 at 04:22:29PM +0900, Dominique Martinet wrote:
> David Howells wrote on Thu, Dec 21, 2023 at 01:23:35PM +0000:
> 
> I've noticed we don't cache xattrs at all, so with the default mount
> options on a kernel built with 9P_FS_SECURITY we'll get a gazillion
> lookups for security.capabilities... But that's another problem, and
> this is still an improvement so no reason to hold back.
>

This is a big problem and already on my backlog list since some things
default to this even if the remote file system doesn't support
xattrs.  The quick fix is to disable on a mount when we detect the
host side isn't supporting them (of course this could be weird for
exports that cross file system boundaries) -- at the very least we
could keep this info on an inode basis and not request as long as the
inode info is cached.  Caching the actual properties is also a step,
but given this is a security feature, I imagine we don't want to trust
our cache and will always have to ask server unless we can come up with
something clever to indicate xattr changes (haven't looked into that
much yet).

> 
> (I'd still be extremely thankful if Christian and/or Eric would have
> time to check as well, but I won't push back to merging it this merge
> window next week if they don't have time... I'll also keep trying to run
> some more tests as time allows)
>

I'll try to run through my regression tests as well, but sure we
can fix things up after the merge window if we miss things.

    -eric

Andrea Righi May 9, 2024, 5:15 p.m. UTC | #7

On Thu, Dec 21, 2023 at 01:23:35PM +0000, David Howells wrote:
> Use netfslib's read and write iteration helpers, allowing netfslib to take
> over the management of the page cache for 9p files and to manage local disk
> caching.  In particular, this eliminates write_begin, write_end, writepage
> and all mentions of struct page and struct folio from 9p.
> 
> Note that netfslib now offers the possibility of write-through caching if
> that is desirable for 9p: just set the NETFS_ICTX_WRITETHROUGH flag in
> v9inode->netfs.flags in v9fs_set_netfs_context().
> 
> Note also this is untested as I can't get ganesha.nfsd to correctly parse
> the config to turn on 9p support.

It looks like this patch has introduced a regression with autopkgtest,
see: https://bugs.launchpad.net/bugs/2056461

I haven't looked at the details yet, I just did some bisecting and
apparently reverting this one seems to fix the problem.

Let me know if you want me to test something in particular or if you
already have a potential fix. Otherwise I'll take a look.

Thanks,
-Andrea

David Howells May 9, 2024, 9:33 p.m. UTC | #8

Andrea Righi <andrea.righi@canonical.com> wrote:

> On Thu, Dec 21, 2023 at 01:23:35PM +0000, David Howells wrote:
> > Use netfslib's read and write iteration helpers, allowing netfslib to take
> > over the management of the page cache for 9p files and to manage local disk
> > caching.  In particular, this eliminates write_begin, write_end, writepage
> > and all mentions of struct page and struct folio from 9p.
> > 
> > Note that netfslib now offers the possibility of write-through caching if
> > that is desirable for 9p: just set the NETFS_ICTX_WRITETHROUGH flag in
> > v9inode->netfs.flags in v9fs_set_netfs_context().
> > 
> > Note also this is untested as I can't get ganesha.nfsd to correctly parse
> > the config to turn on 9p support.
> 
> It looks like this patch has introduced a regression with autopkgtest,
> see: https://bugs.launchpad.net/bugs/2056461
> 
> I haven't looked at the details yet, I just did some bisecting and
> apparently reverting this one seems to fix the problem.
> 
> Let me know if you want me to test something in particular or if you
> already have a potential fix. Otherwise I'll take a look.

Do you have a reproducer?

I'll be at LSF next week, so if I can't fix it tomorrow, I won't be able to
poke at it until after that.

David

Andrea Righi May 10, 2024, 5:53 a.m. UTC | #9

On Thu, May 09, 2024 at 10:33:37PM +0100, David Howells wrote:
> Andrea Righi <andrea.righi@canonical.com> wrote:
> 
> > On Thu, Dec 21, 2023 at 01:23:35PM +0000, David Howells wrote:
> > > Use netfslib's read and write iteration helpers, allowing netfslib to take
> > > over the management of the page cache for 9p files and to manage local disk
> > > caching.  In particular, this eliminates write_begin, write_end, writepage
> > > and all mentions of struct page and struct folio from 9p.
> > > 
> > > Note that netfslib now offers the possibility of write-through caching if
> > > that is desirable for 9p: just set the NETFS_ICTX_WRITETHROUGH flag in
> > > v9inode->netfs.flags in v9fs_set_netfs_context().
> > > 
> > > Note also this is untested as I can't get ganesha.nfsd to correctly parse
> > > the config to turn on 9p support.
> > 
> > It looks like this patch has introduced a regression with autopkgtest,
> > see: https://bugs.launchpad.net/bugs/2056461
> > 
> > I haven't looked at the details yet, I just did some bisecting and
> > apparently reverting this one seems to fix the problem.
> > 
> > Let me know if you want me to test something in particular or if you
> > already have a potential fix. Otherwise I'll take a look.
> 
> Do you have a reproducer?
> 
> I'll be at LSF next week, so if I can't fix it tomorrow, I won't be able to
> poke at it until after that.
> 
> David

The only reproducer that I have at the moment is the autopkgtest command
mentioned in the bug, that is a bit convoluted, I'll try to see if I can
better isolate the problem and find a simpler reproducer, but I'll also
be travelling next week to a Canonical event.

At the moment I'll temporarily revert the commit (that seems to prevent
the issue from happening) and I'll keep you posted if I find something.

Thanks,
-Andrea

David Howells May 10, 2024, 7:57 a.m. UTC | #10

Andrea Righi <andrea.righi@canonical.com> wrote:

> The only reproducer that I have at the moment is the autopkgtest command
> mentioned in the bug, that is a bit convoluted, I'll try to see if I can
> better isolate the problem and find a simpler reproducer, but I'll also
> be travelling next week to a Canonical event.

Note that the netfslib has some tracepoints that might help debug it.

David

David Howells May 23, 2024, 7:44 a.m. UTC | #11

Hi Andrea,

Note that there are changes to the netfslib write-side upstream and you might
also want to apply the attached.

In https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461 you say:

| It seems that kernel 6.8 introduced a regression in the 9pfs related to
| caching and netfslib, that can cause some user-space apps to read content
| from files that is not up-to-date (when they are used in a producer/consumer
| fashion).

Can you clarify how these files are being used?

David
---
commit 39302c160390441ed5b4f4f7ad480c44eddf0962
Author: David Howells <dhowells@redhat.com>
Date:   Wed May 22 17:30:22 2024 +0100

    netfs, 9p: Fix race between umount and async request completion
    
    There's a problem in 9p's interaction with netfslib whereby a crash occurs
    because the 9p_fid structs get forcibly destroyed during client teardown
    (without paying attention to their refcounts) before netfslib has finished
    with them.  However, it's not a simple case of deferring the clunking that
    p9_fid_put() does as that requires the client.
    
    The problem is that netfslib has to unlock pages and clear the IN_PROGRESS
    flag before destroying the objects involved - including the fid - and, in
    any case, nothing checks to see if writeback completed barring looking at
    the page flags.
    
    Fix this by keeping a count of outstanding I/O requests (of any type) and
    waiting for it to quiesce during inode eviction.
    
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Eric Van Hensbergen <ericvh@kernel.org>
    cc: Latchesar Ionkov <lucho@ionkov.net>
    cc: Dominique Martinet <asmadeus@codewreck.org>
    cc: Christian Schoenebeck <linux_oss@crudebyte.com>
    cc: Jeff Layton <jlayton@kernel.org>
    cc: Steve French <sfrench@samba.org>
    cc: v9fs@lists.linux.dev
    cc: linux-afs@lists.infradead.org
    cc: linux-cifs@vger.kernel.org
    cc: netfs@lists.linux.dev
    cc: linux-fsdevel@vger.kernel.org

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 8c9a896d691e..57cfa9f65046 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -354,6 +354,7 @@ void v9fs_evict_inode(struct inode *inode)
 		version = cpu_to_le32(v9inode->qid.version);
 		netfs_clear_inode_writeback(inode, &version);
 
+		netfs_wait_for_outstanding_io(inode);
 		clear_inode(inode);
 		filemap_fdatawrite(&inode->i_data);
 
@@ -361,8 +362,10 @@ void v9fs_evict_inode(struct inode *inode)
 		if (v9fs_inode_cookie(v9inode))
 			fscache_relinquish_cookie(v9fs_inode_cookie(v9inode), false);
 #endif
-	} else
+	} else {
+		netfs_wait_for_outstanding_io(inode);
 		clear_inode(inode);
+	}
 }
 
 struct inode *
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 94fc049aff58..c831e711a4ac 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -652,6 +652,7 @@ void afs_evict_inode(struct inode *inode)
 
 	afs_set_cache_aux(vnode, &aux);
 	netfs_clear_inode_writeback(inode, &aux);
+	netfs_wait_for_outstanding_io(inode);
 	clear_inode(inode);
 
 	while (!list_empty(&vnode->wb_keys)) {
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index c90d482b1650..f4a642727479 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -72,6 +72,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 		}
 	}
 
+	atomic_inc(&ctx->io_count);
 	trace_netfs_rreq_ref(rreq->debug_id, 1, netfs_rreq_trace_new);
 	netfs_proc_add_rreq(rreq);
 	netfs_stat(&netfs_n_rh_rreq);
@@ -124,6 +125,7 @@ static void netfs_free_request(struct work_struct *work)
 {
 	struct netfs_io_request *rreq =
 		container_of(work, struct netfs_io_request, work);
+	struct netfs_inode *ictx = netfs_inode(rreq->inode);
 	unsigned int i;
 
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
@@ -142,6 +144,9 @@ static void netfs_free_request(struct work_struct *work)
 		}
 		kvfree(rreq->direct_bv);
 	}
+
+	if (atomic_dec_and_test(&ictx->io_count))
+		wake_up_var(&ictx->io_count);
 	call_rcu(&rreq->rcu, netfs_free_request_rcu);
 }
 
diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index ec5b639f421a..21c9e173ea9a 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -435,6 +435,7 @@ cifs_evict_inode(struct inode *inode)
 	if (inode->i_state & I_PINNING_NETFS_WB)
 		cifs_fscache_unuse_inode_cookie(inode, true);
 	cifs_fscache_release_inode_cookie(inode);
+	netfs_wait_for_outstanding_io(inode);
 	clear_inode(inode);
 }
 
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index d2d291a9cdad..3ca3906bb8da 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -68,6 +68,7 @@ struct netfs_inode {
 	loff_t			remote_i_size;	/* Size of the remote file */
 	loff_t			zero_point;	/* Size after which we assume there's no data
 						 * on the server */
+	atomic_t		io_count;	/* Number of outstanding reqs */
 	unsigned long		flags;
 #define NETFS_ICTX_ODIRECT	0		/* The file has DIO in progress */
 #define NETFS_ICTX_UNBUFFERED	1		/* I/O should not use the pagecache */
@@ -474,6 +475,7 @@ static inline void netfs_inode_init(struct netfs_inode *ctx,
 	ctx->remote_i_size = i_size_read(&ctx->inode);
 	ctx->zero_point = LLONG_MAX;
 	ctx->flags = 0;
+	atomic_set(&ctx->io_count, 0);
 #if IS_ENABLED(CONFIG_FSCACHE)
 	ctx->cache = NULL;
 #endif
@@ -517,4 +519,20 @@ static inline struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx)
 #endif
 }
 
+/**
+ * netfs_wait_for_outstanding_io - Wait for outstanding I/O to complete
+ * @inode: The netfs inode to wait on
+ *
+ * Wait for outstanding I/O requests of any type to complete.  This is intended
+ * to be called from inode eviction routines.  This makes sure that any
+ * resources held by those requests are cleaned up before we let the inode get
+ * cleaned up.
+ */
+static inline void netfs_wait_for_outstanding_io(struct inode *inode)
+{
+	struct netfs_inode *ictx = netfs_inode(inode);
+
+	wait_var_event(&ictx->io_count, atomic_read(&ictx->io_count) == 0);
+}
+
 #endif /* _LINUX_NETFS_H */
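
The count-and-quiesce idea in the patch above can be tried out in userspace. This is a minimal sketch, assuming POSIX threads and C11 atomics in place of the kernel's atomic_t, wake_up_var() and wait_var_event(); struct io_ctx and the io_*() and demo_quiesce() functions are hypothetical names for this illustration, not netfslib API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-inode context. */
struct io_ctx {
	atomic_int	io_count;	/* Number of outstanding requests */
	pthread_mutex_t	lock;
	pthread_cond_t	idle;
};

static void io_ctx_init(struct io_ctx *ctx)
{
	atomic_init(&ctx->io_count, 0);
	pthread_mutex_init(&ctx->lock, NULL);
	pthread_cond_init(&ctx->idle, NULL);
}

/* Called when a request is allocated. */
static void io_begin(struct io_ctx *ctx)
{
	atomic_fetch_add(&ctx->io_count, 1);
}

/* Called when a request is freed; the last one out wakes the waiter. */
static void io_end(struct io_ctx *ctx)
{
	if (atomic_fetch_sub(&ctx->io_count, 1) == 1) {
		pthread_mutex_lock(&ctx->lock);
		pthread_cond_broadcast(&ctx->idle);
		pthread_mutex_unlock(&ctx->lock);
	}
}

/* Called from the teardown path; returns once the count has hit zero. */
static void io_wait_idle(struct io_ctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	while (atomic_load(&ctx->io_count) != 0)
		pthread_cond_wait(&ctx->idle, &ctx->lock);
	pthread_mutex_unlock(&ctx->lock);
}

static void *worker(void *arg)
{
	io_end(arg);	/* pretend the request completed */
	return NULL;
}

/* Start a few "requests", wait for them, and report what is left over. */
static int demo_quiesce(void)
{
	struct io_ctx ctx;
	pthread_t threads[4];
	int started[4] = { 0 };
	int remaining;

	io_ctx_init(&ctx);
	for (int i = 0; i < 4; i++) {
		io_begin(&ctx);
		if (pthread_create(&threads[i], NULL, worker, &ctx) == 0)
			started[i] = 1;
		else
			io_end(&ctx);	/* could not start it: drop the count */
	}

	io_wait_idle(&ctx);
	remaining = atomic_load(&ctx.io_count);

	/*
	 * Unlike the kernel helper, io_end() here still touches ctx->lock
	 * after the decrement, so join the workers before destroying ctx.
	 */
	for (int i = 0; i < 4; i++)
		if (started[i])
			pthread_join(threads[i], NULL);
	pthread_cond_destroy(&ctx.idle);
	pthread_mutex_destroy(&ctx.lock);
	return remaining;
}
```

Calling demo_quiesce() should return 0, since io_wait_idle() only returns once every started request has called io_end(). Depending on the toolchain this may need to be built with -pthread.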

Emanuele Rocca May 30, 2024, 7:16 p.m. UTC | #12

Hi David,

On 2024-05-23 08:44, David Howells wrote:
> In https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461 you say:
> 
> | It seems that kernel 6.8 introduced a regression in the 9pfs related to
> | caching and netfslib, that can cause some user-space apps to read content
> | from files that is not up-to-date (when they are used in a producer/consumer
> | fashion).
> 
> Can you clarify how these files are being used?

I don't know the details of the 9pfs operations involved, but still I
wanted to mention that to reliably reproduce the issue on a Debian
system one can run:

 autopkgtest-build-qemu unstable /tmp/sid.img
 autopkgtest -ddd -B dpdk -- autopkgtest-virt-qemu --debug --show-boot /tmp/sid.img

If the kernel installed in the guest VM is affected by this problem,
after a while the test hangs with something like:

 autopkgtest-virt-qemu: DBG: executing copydown /tmp/alog/tests-tree/ /tmp/autopkgtest.uG6tsJ/build.6QA/src/
 [...]
 autopkgtest-virt-qemu: DBG:  +>?

Full logs at https://people.debian.org/~ema/1072004-6.10-rc1.log

Part of the code mounting the 9pfs in case it helps is at:
https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/virt/autopkgtest-virt-qemu#L290

I could reproduce the issue with both 6.9.2 and 6.10-rc1.

Emanuele Rocca May 31, 2024, 3:06 p.m. UTC | #13

Hi again,

On 2024-05-23 08:44, David Howells wrote:
> commit 39302c160390441ed5b4f4f7ad480c44eddf0962
> Author: David Howells <dhowells@redhat.com>
> Date:   Wed May 22 17:30:22 2024 +0100
> 
>     netfs, 9p: Fix race between umount and async request completion

I have tried this patch on top of 6.10-rc1 and unfortunately the problem
persists.

Meanwhile TJ (in CC) has been doing a lot of further investigation and
opened https://bugzilla.kernel.org/show_bug.cgi?id=218916.

Christian Kastner June 17, 2024, 5:10 p.m. UTC | #14

Hi,

On 2024-05-31 17:06, Emanuele Rocca wrote:
> Meanwhile TJ (in CC) has been doing a lot of further investigation and
> opened https://bugzilla.kernel.org/show_bug.cgi?id=218916.

just to loop back to the MLs: in the referenced bug, TJ posted an
analysis and added a patch that fixed the issue for multiple testers.

Best,
Christian

Dominique Martinet June 17, 2024, 9:50 p.m. UTC | #15

Christian Kastner wrote on Mon, Jun 17, 2024 at 07:10:56PM +0200:
> On 2024-05-31 17:06, Emanuele Rocca wrote:
> > Meanwhile TJ (in CC) has been doing a lot of further investigation and
> > opened https://bugzilla.kernel.org/show_bug.cgi?id=218916.
> 
> just to loop back to the MLs: in the referenced bug, TJ posted an
> analysis and added a patch that fixed the issue for multiple testers.

Thanks for the mail, one of these days I'll try to understand how to
make bugzilla automatically put me in cc of all the 9p bugs..

Analysis and tentative fix are of great help! Looks like we now
understand what's wrong -- if I understand the description correctly we
know the correct size (files aren't modified in the background, just
other threads within the VM, right?); and the problem is that the netfs
IO reverts the size back to an incorrect value when it completes?

If so then the fix looks odd to me, the problem ought to be fixed at the
netfs/9p interface level, I don't see why an unbuffered read should
update the size metadata when it's done...

David, what do you think?
