[1/6] kernfs: create vm_operations_struct without page_mkwrite()

Message ID 20240605192934.742369-2-martin.oliveira@eideticom.com (mailing list archive)
State Superseded
Series Enable P2PDMA in Userspace RDMA

Commit Message

Martin Oliveira June 5, 2024, 7:29 p.m. UTC
The standard kernfs vm_ops installs a page_mkwrite() operator which
modifies the file update time on write.

This is not always required (or makes sense), such as in the P2PDMA, which
uses the sysfs file as an allocator from userspace.

Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.

Fix this by adding a new boolean on kernfs_ops to differentiate between
the different behaviours.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
 fs/kernfs/file.c       | 15 ++++++++++++++-
 include/linux/kernfs.h |  7 +++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas June 5, 2024, 9:43 p.m. UTC | #1
On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> The standard kernfs vm_ops installs a page_mkwrite() operator which
> modifies the file update time on write.
> 
> This not always required (or makes sense), such as in the P2PDMA, which

s/This/This is/ ?

> uses the sysfs file as an allocator from userspace.
> 
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
> 
> Fix this by adding a new boolean on kernfs_ops to differentiate between
> the different behaviours.

> +	 * Use the file as an allocator from userspace. This disables
> +	 * page_mkwrite() to prevent the file time from being updated on write
> +	 * which enables using GUP with FOLL_LONGTERM with memory that's been
> +	 * mmaped.

"mmaped" does seem more commonly used in Linux than "mmapped", but the
base word "mapped" definitely requires "pp", so "mmaped" looks funny
to me.
Greg KH June 6, 2024, 8:54 p.m. UTC | #2
On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> The standard kernfs vm_ops installs a page_mkwrite() operator which
> modifies the file update time on write.
> 
> This not always required (or makes sense), such as in the P2PDMA, which
> uses the sysfs file as an allocator from userspace.

That's not a good idea, please don't do that.  sysfs binary files are
"pass through", why would you want to use this as an allocator?

> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
> 
> Fix this by adding a new boolean on kernfs_ops to differentiate between
> the different behaviours.

This isn't going to work well.

What exactly are you wanting to do in sysfs that you feel this is
required?

thanks,

greg k-h
Logan Gunthorpe June 6, 2024, 9:32 p.m. UTC | #3
Hi Greg,

On 2024-06-06 14:54, Greg Kroah-Hartman wrote:
> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
>> The standard kernfs vm_ops installs a page_mkwrite() operator which
>> modifies the file update time on write.
>>
>> This not always required (or makes sense), such as in the P2PDMA, which
>> uses the sysfs file as an allocator from userspace.
> 
> That's not a good idea, please don't do that.  sysfs binary files are
> "pass through", why would you want to use this as an allocator?

The P2PDMA code already creates a binary attribute which is used to
allocate P2PDMA memory into userspace[1]. It was done this way a couple
of years ago at the suggestion of Christoph[2]. Using a sysfs attribute
made the code substantially simpler and got rid of a bunch of pseudofs
mess that was required when mmaping a char device. The attribute already
exists and is used by userspace so it's not something we can change at
this point.

The attribute has worked well for what was needed until we wanted to use
P2PDMA memory with FOLL_LONGTERM and GUP. That path specifically denies
FOLL_LONGTERM pins when the underlying VMA has a .page_mkwrite operator,
which sysfs/kernfs forces on us. P2PDMA doesn't benefit from this
operator in any way so the simplest thing is to remove it for this use case.
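
To illustrate the check described above, here is a minimal userspace
model of the decision, as I understand it, that GUP makes for a
FOLL_PIN | FOLL_LONGTERM request on a shared writable file mapping. It
is a sketch, not kernel code: every model_* name, the struct layouts
and the flag values are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; NOT the kernel's FOLL_* values. */
#define MODEL_FOLL_PIN      0x1u
#define MODEL_FOLL_LONGTERM 0x2u

/* Reduced stand-in for struct vm_operations_struct (hypothetical). */
struct model_vm_ops {
	int (*fault)(void);
	int (*page_mkwrite)(void);
	int (*pfn_mkwrite)(void);
};

/* Reduced stand-in for struct vm_area_struct (hypothetical). */
struct model_vma {
	bool shared_writable;   /* mapping is both shared and writable */
	bool fs_can_writeback;  /* backing mapping supports writeback */
	const struct model_vm_ops *vm_ops;
};

/* Models the role of vma_needs_dirty_tracking(). */
static bool model_needs_dirty_tracking(const struct model_vma *vma)
{
	/* Only shared, writable mappings need dirty tracking. */
	if (!vma->shared_writable)
		return false;

	/* A mkwrite hook means the filesystem wants write notification. */
	if (vma->vm_ops &&
	    (vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite))
		return true;

	return vma->fs_can_writeback;
}

/* Models the role of writable_file_mapping_allowed(). */
static bool model_writable_file_mapping_allowed(const struct model_vma *vma,
						unsigned int gup_flags)
{
	const unsigned int longterm_pin = MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM;

	/* Anything short of a long-term pin is not restricted here. */
	if ((gup_flags & longterm_pin) != longterm_pin)
		return true;

	return !model_needs_dirty_tracking(vma);
}

static int model_stub(void)
{
	return 0;
}

/* What a kernfs mapping looks like today: a page_mkwrite hook is set. */
static const struct model_vm_ops model_ops_with_mkwrite = {
	.fault		= model_stub,
	.page_mkwrite	= model_stub,
};

/* The same mapping with the hook dropped. */
static const struct model_vm_ops model_ops_without_mkwrite = {
	.fault		= model_stub,
};
```

In this model a shared writable VMA using model_ops_with_mkwrite is
refused for a long-term pin, while the same VMA using
model_ops_without_mkwrite is allowed, which is the behaviour change we
are after.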

>> Furthermore, having the page_mkwrite() operator causes
>> writable_file_mapping_allowed() to fail due to
>> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
>> enabling P2PDMA over RDMA.
>>
>> Fix this by adding a new boolean on kernfs_ops to differentiate between
>> the different behaviours.
> 
> This isn't going to work well.

What about it are you worried won't work well? We're open to other
suggestions.

Thanks,

Logan

[1] https://elixir.bootlin.com/linux/latest/source/drivers/pci/p2pdma.c#L164
[2] https://lore.kernel.org/all/20220705075108.GB17451@lst.de/
Christoph Hellwig June 7, 2024, 5:03 a.m. UTC | #4
On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> > The standard kernfs vm_ops installs a page_mkwrite() operator which
> > modifies the file update time on write.
> > 
> > This not always required (or makes sense), such as in the P2PDMA, which
> > uses the sysfs file as an allocator from userspace.
> 
> That's not a good idea, please don't do that.  sysfs binary files are
> "pass through", why would you want to use this as an allocator?

I think the real question is why sysfs binary files implement
page_mkwrite by default.  page_mkwrite is needed for file systems that
need to allocate space from a free space pool, which seems odd for
sysfs.
Logan Gunthorpe June 7, 2024, 4:16 p.m. UTC | #5
On 2024-06-06 23:03, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
>> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
>>> The standard kernfs vm_ops installs a page_mkwrite() operator which
>>> modifies the file update time on write.
>>>
>>> This not always required (or makes sense), such as in the P2PDMA, which
>>> uses the sysfs file as an allocator from userspace.
>>
>> That's not a good idea, please don't do that.  sysfs binary files are
>> "pass through", why would you want to use this as an allocator?
> 
> I think the real question is why sysfs binary files implement
> page_mkwrite by default.  page_mkwrite is needed for file systems that
> need to allocate space from a free space pool, which seems odd for
> sysfs.

The default page_mkwrite in kernfs just calls file_update_time() but, as
I understand it, the fault code should call file_update_time() if
page_mkwrite isn't set. So perhaps the easiest thing is to simply not
add a page_mkwrite unless the vm_ops adds one.

It's not the easiest thing to trace, but as best as I can tell there are
no kernfs binary attributes that use page_mkwrite. So alternatively,
perhaps we could just disallow page_mkwrite in kernfs entirely?
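
A rough userspace sketch of the second option, to make the idea
concrete. It is not a patch: the sketch_* names and reduced structs are
hypothetical, and returning -EINVAL is an assumption. The wrapper ops
carry no page_mkwrite at all, and an attribute whose own vm_ops sets
page_mkwrite is refused at mmap time, the same way close is refused.

```c
#include <errno.h>
#include <stddef.h>

/* Reduced stand-in for struct vm_operations_struct (hypothetical). */
struct sketch_vm_ops {
	void (*open)(void);
	void (*close)(void);
	int (*fault)(void);
	int (*page_mkwrite)(void);
};

/* Reduced stand-in for the per-open-file state (hypothetical). */
struct sketch_open_file {
	const struct sketch_vm_ops *saved_ops;	/* the attribute's own ops */
};

static int sketch_fault(void)
{
	return 0;
}

static int sketch_mkwrite(void)
{
	return 0;
}

/* Wrapper ops installed on every mapping: note there is no .page_mkwrite. */
static const struct sketch_vm_ops sketch_wrapper_ops = {
	.fault		= sketch_fault,
};

/* Example attribute ops: one plain, one that wants page_mkwrite. */
static const struct sketch_vm_ops sketch_attr_ops_plain = {
	.fault		= sketch_fault,
};

static const struct sketch_vm_ops sketch_attr_ops_mkwrite = {
	.fault		= sketch_fault,
	.page_mkwrite	= sketch_mkwrite,
};

/*
 * Models the tail of an mmap handler: *vma_ops holds whatever the
 * attribute's mmap callback installed and is replaced by the wrapper.
 */
static int sketch_wrap_vm_ops(struct sketch_open_file *of,
			      const struct sketch_vm_ops **vma_ops)
{
	const struct sketch_vm_ops *inner = *vma_ops;

	/* close cannot be wrapped, so it is rejected. */
	if (inner && inner->close)
		return -EINVAL;

	/* New in this sketch: page_mkwrite is rejected as well. */
	if (inner && inner->page_mkwrite)
		return -EINVAL;

	/* Save the attribute's ops before overwriting them. */
	of->saved_ops = inner;
	*vma_ops = &sketch_wrapper_ops;
	return 0;
}
```

With sketch_attr_ops_plain the call succeeds and the mapping ends up
with ops that have no page_mkwrite; with sketch_attr_ops_mkwrite the
mmap fails up front instead of silently dropping the hook.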

Logan
Greg KH June 7, 2024, 7:18 p.m. UTC | #6
On Fri, Jun 07, 2024 at 10:16:58AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2024-06-06 23:03, Christoph Hellwig wrote:
> > On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
> >> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> >>> The standard kernfs vm_ops installs a page_mkwrite() operator which
> >>> modifies the file update time on write.
> >>>
> >>> This not always required (or makes sense), such as in the P2PDMA, which
> >>> uses the sysfs file as an allocator from userspace.
> >>
> >> That's not a good idea, please don't do that.  sysfs binary files are
> >> "pass through", why would you want to use this as an allocator?
> > 
> > I think the real question is why sysfs binary files implement
> > page_mkwrite by default.  page_mkwrite is needed for file systems that
> > need to allocate space from a free space pool, which seems odd for
> > sysfs.
> 
> The default page_mkwrite in kernfs just calls file_update_time() but, as
> I understand it, the fault code should call file_update_time() if
> page_mkwrite isn't set. So perhaps the easiest thing is to simply not
> add a page_mkwrite unless the vm_ops adds one.
> 
> It's not the easiest thing to trace, but as best as I can tell there are
> no kernfs binary attributes that use page_mkwrite. So alternatively,
> perhaps we could just disallow page_mkwrite in kernfs entirely?

Sure, let's do that.
Patch

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b..d5e9fbded3dd 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -436,6 +436,12 @@  static const struct vm_operations_struct kernfs_vm_ops = {
 	.access		= kernfs_vma_access,
 };
 
+static const struct vm_operations_struct kernfs_vm_ops_mmap_allocates = {
+	.open		= kernfs_vma_open,
+	.fault		= kernfs_vma_fault,
+	.access		= kernfs_vma_access,
+};
+
 static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct kernfs_open_file *of = kernfs_of(file);
@@ -482,13 +488,20 @@  static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->close)
 		goto out_put;
 
+	if (ops->mmap_allocates)
+		vma->vm_ops = &kernfs_vm_ops_mmap_allocates;
+	else
+		vma->vm_ops = &kernfs_vm_ops;
+
+	if (ops->mmap_allocates && vma->vm_ops->page_mkwrite)
+		goto out_put;
+
 	rc = 0;
 	if (!of->mmapped) {
 		of->mmapped = true;
 		of_on(of)->nr_mmapped++;
 		of->vm_ops = vma->vm_ops;
 	}
-	vma->vm_ops = &kernfs_vm_ops;
 out_put:
 	kernfs_put_active(of->kn);
 out_unlock:
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 87c79d076d6d..d6ae7d4b0011 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -311,6 +311,13 @@  struct kernfs_ops {
 	 * ->prealloc.  Provide ->read and ->write with ->prealloc.
 	 */
 	bool prealloc;
+	/*
+	 * Use the file as an allocator from userspace. This disables
+	 * page_mkwrite() to prevent the file time from being updated on write
+	 * which enables using GUP with FOLL_LONGTERM with memory that's been
+	 * mmaped.
+	 */
+	bool mmap_allocates;
 	ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes,
 			 loff_t off);