Message ID | 20250121153646.37895-1-me@davidreaver.com
---|---
Series | samples/kernfs: Add a pseudo-filesystem to demonstrate kernfs usage
On Tue, Jan 21, 2025 at 07:36:34AM -0800, David Reaver wrote:
> This patch series creates a toy pseudo-filesystem built on top of kernfs in
> samples/kernfs/.

Is that a good idea? kernfs and the interactions with the users of it
are a pretty convoluted mess. I'd much prefer people writing their
pseudo file systems to the VFS APIs over spreading kernfs usage further.
On Mon, 27 Jan 2025 22:08:29 -0800
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 21, 2025 at 07:36:34AM -0800, David Reaver wrote:
> > This patch series creates a toy pseudo-filesystem built on top of kernfs in
> > samples/kernfs/.
>
> Is that a good idea? kernfs and the interactions with the users of it
> are a pretty convoluted mess. I'd much prefer people writing their
> pseudo file systems to the VFS APIs over spreading kernfs usage further.

I have to disagree with this. As someone that uses a pseudo file system to
interact with my subsystem, I really don't want to have to know the
internals of the virtual file system layer just so I can interact via the
file system. Not knowing how to do that properly was what got me in trouble
with Linus in the first place.

The VFS layer is best for developing file systems that are for storage, like
XFS, ext4, bcachefs, etc. And yes, if you are developing a new storage
layout, then you should know the VFS APIs. But pseudo file systems are a
completely different beast. The files are not for storage but for control of
the kernel; they map to control objects. For tracefs, there's a
"current_tracer" file: if you write "function" to it, it starts the function
tracer. It has to maintain state, but only for the life of the boot, not
across boots.

All of debugfs is the same way, and unfortunately, the kernel API for
debugfs is wrong. It uses dentries as the handle to its files, which it
should not be doing. A dentry is a complex internal cache element within the
VFS, and I assumed that because debugfs used it, it was OK to use it as
well; that's where my arguments with Linus stemmed from.

For people like myself that only need a way to have a control interface via
the file system, kernfs appears to cover that. Maybe kernfs isn't
implemented the way you like? If that's the case, we should fix that.

But from my point of view, it would be really great if I could create a file
system control interface without having to know anything about how the VFS
is implemented.

BTW, I was going to work on converting debugfs over to kernfs if I ever got
the chance (or mentor someone else to do it).

Whether it's kernfs or something else, it would be really great to have a
kernel abstraction layer that lets you expose a pseudo file system without
having to implement a pseudo file system yourself. debugfs was that, and
became very popular, but it was done incorrectly.

-- Steve
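[Editor's note: the "dentry as a handle" complaint is visible right in the debugfs API. Below is a hedged sketch of a typical debugfs user — the `demo` names are hypothetical illustrations, not code from this thread; the calls themselves (`debugfs_create_dir()`, `debugfs_create_u32()`, `debugfs_remove()`) are the real public API.]

```c
#include <linux/debugfs.h>
#include <linux/module.h>

static struct dentry *demo_dir;	/* the directory "handle" is a bare VFS dentry */
static u32 demo_val;

static int __init demo_init(void)
{
	/* debugfs_create_dir() hands back a struct dentry *, an internal
	 * VFS cache object, and callers are expected to store it and pass
	 * it back in as the parent for every file they create. */
	demo_dir = debugfs_create_dir("demo", NULL);
	debugfs_create_u32("value", 0644, demo_dir, &demo_val);
	return 0;
}

static void __exit demo_exit(void)
{
	debugfs_remove(demo_dir);	/* tears down the whole subtree */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

By contrast, kernfs hands its callers an opaque `struct kernfs_node *`, so users never hold raw VFS cache objects.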
On Tue, 28 Jan 2025 at 07:27, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 27 Jan 2025 22:08:29 -0800
> Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Is that a good idea? kernfs and the interactions with the users of it
> > are a pretty convoluted mess. I'd much prefer people writing their
> > pseudo file systems to the VFS APIs over spreading kernfs usage further.
>
> I have to disagree with this. As someone that uses a pseudo file system to
> interact with my subsystem, I really don't want to have to know the
> internals of the virtual file system layer just so I can interact via the
> file system. Not knowing how to do that properly was what got me in trouble
> with Linus in the first place.

Well, honestly, you were doing some odd things.

For a *simple* filesystem that actually acts as a filesystem, all you need
is in libfs with things like &simple_dir_operations etc.

And we have a *lot* of perfectly regular users of things like that. Not
like the ftrace mess that had very *non*-filesystem semantics with separate
lifetime confusion etc, and that tried to maintain a separate notion of
permissions etc.

To make matters worse, tracefs then had a completely different model for
events, and these interacted oddly in non-filesystem ways.

In other words, all the tracefs problems were self-inflicted, and a lot of
them were because you wanted to go behind the VFS layer's back because you
had millions of nodes but didn't want to have millions of inodes etc.

That's not normal.

I mean, you can pretty much literally look at ramfs:

  fs/ramfs/inode.c

and it is a real example filesystem that does a lot of things, but almost
all of it is just using the direct vfs helpers (simple_lookup /
simple_link / simple_rmdir etc etc). It plays *zero* games with dentries.

Or look at fs/pstore. Or any number of other examples.

And no, nobody should *EVER* look at the horror that is tracefs and eventfs.

           Linus
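[Editor's note: for readers unfamiliar with the libfs route Linus points at, here is a hedged sketch of a minimal pseudo-filesystem built purely from libfs helpers. The `demofs` name and magic number are made up for illustration; the helpers (`simple_fill_super()`, `struct tree_descr`, `get_tree_single()`, `kill_litter_super()`) are real, with signatures as in recent kernels.]

```c
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>
#include <linux/uaccess.h>

#define DEMOFS_MAGIC 0x64656d6f	/* hypothetical magic number */

static ssize_t demo_read(struct file *file, char __user *buf,
			 size_t count, loff_t *ppos)
{
	static const char msg[] = "hello from demofs\n";

	return simple_read_from_buffer(buf, count, ppos, msg, sizeof(msg) - 1);
}

static const struct file_operations demo_fops = {
	.read	= demo_read,
	.llseek	= default_llseek,
};

static int demo_fill_super(struct super_block *sb, struct fs_context *fc)
{
	/* Array index becomes the inode number; slots 0 and 1 are reserved
	 * and the empty name terminates the list. */
	static const struct tree_descr demo_files[] = {
		[2] = { "hello", &demo_fops, 0444 },
		{ "" }
	};

	/* simple_fill_super() wires up simple_dir_operations for the root
	 * and creates pinned inodes/dentries for the listed files. */
	return simple_fill_super(sb, DEMOFS_MAGIC, demo_files);
}

static int demo_get_tree(struct fs_context *fc)
{
	return get_tree_single(fc, demo_fill_super);
}

static const struct fs_context_operations demo_fc_ops = {
	.get_tree = demo_get_tree,
};

static int demo_init_fs_context(struct fs_context *fc)
{
	fc->ops = &demo_fc_ops;
	return 0;
}

static struct file_system_type demo_fs_type = {
	.owner		 = THIS_MODULE,
	.name		 = "demofs",
	.init_fs_context = demo_init_fs_context,
	.kill_sb	 = kill_litter_super,
};

static int __init demofs_init(void)
{
	return register_filesystem(&demo_fs_type);
}
module_init(demofs_init);
MODULE_LICENSE("GPL");
```

Note the trade-off at the heart of this thread: everything created by `simple_fill_super()` has a dentry and inode pinned for the lifetime of the mount, which is exactly what kernfs was designed to avoid.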
On Tue, 28 Jan 2025 14:05:05 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Well, honestly, you were doing some odd things.

Some of those odd things were because of the use of the dentry as a handle,
which also required making an inode for every file. When the number of
event files blew up to tens of thousands, that caused a lot of memory to be
used.

> For a *simple* filesystem that actually acts as a filesystem, all you
> need is in libfs with things like &simple_dir_operations etc.
>
> And we have a *lot* of perfectly regular users of things like that. Not
> like the ftrace mess that had very *non*-filesystem semantics with
> separate lifetime confusion etc, and that tried to maintain a separate
> notion of permissions etc.

I would also say that the proc file system is rather messy. But that's very
old and has a long history, which probably built up its complexity.

> To make matters worse, tracefs then had a completely different model
> for events, and these interacted oddly in non-filesystem ways.

Ideally, I'd rather it not have been done that way. To save memory, since
every event in eventfs has the same files, it was better to just make a
single array that represents those files for every event. That saved over
20 megabytes per tracing instance.

> In other words, all the tracefs problems were self-inflicted, and a
> lot of them were because you wanted to go behind the VFS layer's back
> because you had millions of nodes but didn't want to have millions of
> inodes etc.
>
> That's not normal.
>
> I mean, you can pretty much literally look at ramfs:
>
>   fs/ramfs/inode.c
>
> and it is a real example filesystem that does a lot of things, but
> almost all of it is just using the direct vfs helpers (simple_lookup /
> simple_link / simple_rmdir etc etc). It plays *zero* games with
> dentries.

It's also a storage file system. It's just that it stores to memory: it
simply uses the page cache and never needs to write anything to disk. It's
not a good example for a control interface.

> Or look at fs/pstore.

Another storage file system.

> Or any number of other examples.
>
> And no, nobody should *EVER* look at the horror that is tracefs and eventfs.

I believe kernfs is meant to cover control interfaces like sysfs and
debugfs, whose files actually change kernel behavior when written to.
That's also likely why procfs is such a mess: it too is a control
interface.

Yes, eventfs is "special", but tracefs could easily be converted to kernfs.
I believe Christian even wrote a POC that did that.

-- Steve
On Tue, Jan 28, 2025 at 05:42:57PM -0500, Steven Rostedt wrote:
...
> I believe kernfs is meant to cover control interfaces like sysfs and
> debugfs, whose files actually change kernel behavior when written to.
> That's also likely why procfs is such a mess: it too is a control
> interface.

Just for context, kernfs was factored out of sysfs. One of the factors that
drove the design was memory overhead. On large systems (IIRC especially
with iSCSI), there can be a huge number of sysfs nodes, and allocating a
dentry and inode pair for each file made some machines run out of memory
during boot. So sysfs implemented a memory-backed filesystem store, which
made its interface to its users depart from the VFS layer.

This requirement holds for cgroup too - there are systems with a *lot* of
cgroups and associated interface files, and we don't want to pin a dentry
and inode for all of them.

Thanks.
On Tue, 28 Jan 2025 12:51:47 -1000
Tejun Heo <tj@kernel.org> wrote:

> Just for context, kernfs was factored out of sysfs. One of the factors
> that drove the design was memory overhead. On large systems (IIRC
> especially with iSCSI), there can be a huge number of sysfs nodes, and
> allocating a dentry and inode pair for each file made some machines run
> out of memory during boot. So sysfs implemented a memory-backed
> filesystem store, which made its interface to its users depart from the
> VFS layer.
>
> This requirement holds for cgroup too - there are systems with a *lot* of
> cgroups and associated interface files, and we don't want to pin a dentry
> and inode for all of them.

Right. And going back to ramfs, it too has a dentry and inode for every
file that is created. Thus, if you have a lot of files, you'll have a lot
of memory dedicated to their dentries and inodes that will never be freed.
ramfs_create() and ramfs_mkdir() both call ramfs_mknod(), which does a
d_instantiate() and a dget() on the dentry, so they are persistent until
they are deleted or a reboot happens.

What I did for eventfs, and what I believe kernfs does, is to create a
small descriptor to represent the control data and reference the
descriptors like what you would have on disk. That is, the control elements
(like a trace event descriptor) are really what is on "disk". When someone
does an "ls" on the pseudo file system, there needs to be a way for the VFS
layer to query the control structures, just as a normal file system would
query the data stored on disk, and then let the VFS layer create the
dentries and inodes when referenced and, more importantly, free them when
they are no longer referenced and there's memory pressure. I believe kernfs
does the same thing.

And my point is, it would be nice to have an abstraction layer that
represents control descriptors that may be around for the entirety of the
boot (like trace events are) without needing to pin a dentry and inode for
each one of these files. Currently, that abstraction layer is kernfs.

-- Steve
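[Editor's note: the descriptor model described above is visible in the kernfs API itself — a file is just a small `struct kernfs_node` plus a `struct kernfs_ops`, and kernfs materializes dentries/inodes only while a node is actually referenced. Below is a hedged sketch; the `demo` names and the "idle" payload are hypothetical, the mount wiring is omitted, and `kernfs_create_file()`'s exact signature has varied across kernel versions (this follows recent ones).]

```c
#include <linux/kernfs.h>
#include <linux/seq_file.h>

/* One read-only attribute: the show callback pulls its backing data out
 * of the priv pointer stored in the kernfs_node at creation time. */
static int demo_seq_show(struct seq_file *sf, void *v)
{
	struct kernfs_open_file *of = sf->private;

	seq_printf(sf, "%s\n", (const char *)of->kn->priv);
	return 0;
}

static const struct kernfs_ops demo_kf_ops = {
	.seq_show = demo_seq_show,
};

static int demo_setup(void)
{
	struct kernfs_root *root;
	struct kernfs_node *dir, *kn;

	root = kernfs_create_root(NULL, 0, NULL);
	if (IS_ERR(root))
		return PTR_ERR(root);

	dir = kernfs_create_dir(kernfs_root_to_node(root), "demo", 0755, NULL);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	/* No dentry or inode exists for this file yet; kernfs keeps only
	 * this small node, and the VFS objects are created on lookup and
	 * reclaimed under memory pressure. */
	kn = kernfs_create_file(dir, "state", 0444, GLOBAL_ROOT_UID,
				GLOBAL_ROOT_GID, 0, &demo_kf_ops,
				(void *)"idle");
	return PTR_ERR_OR_ZERO(kn);
}
```

This is the inversion Steve describes: the descriptor tree plays the role of the on-disk metadata, and the VFS queries it on demand instead of pinning a dentry/inode pair per file.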
On Tue, Jan 28, 2025 at 06:29:57PM -0500, Steven Rostedt wrote:
> What I did for eventfs, and what I believe kernfs does, is to create a
> small descriptor to represent the control data and reference the
> descriptors like what you would have on disk. That is, the control
> elements (like a trace event descriptor) are really what is on "disk".
> When someone does an "ls" on the pseudo file system, there needs to be a
> way for the VFS layer to query the control structures, just as a normal
> file system would query the data stored on disk, and then let the VFS
> layer create the dentries and inodes when referenced and, more
> importantly, free them when they are no longer referenced and there's
> memory pressure.

Yeap, that's exactly what kernfs does.

Thanks.
On Tue, 28 Jan 2025 13:38:42 -1000
Tejun Heo <tj@kernel.org> wrote:

> On Tue, Jan 28, 2025 at 06:29:57PM -0500, Steven Rostedt wrote:
> > What I did for eventfs, and what I believe kernfs does, is to create a
> > small descriptor to represent the control data and reference the
> > descriptors like what you would have on disk. [...]
>
> Yeap, that's exactly what kernfs does.

And eventfs goes one step further. Because there's a full directory layout
that's identical for every event, it has a single descriptor per directory
and none per file. As there can be over 10 files per directory/event, I
didn't want to waste even that memory. This is why I couldn't use kernfs
for eventfs: I was able to save a couple more megabytes by not having the
files have any descriptor representing them (besides a single array shared
by all events).

-- Steve
On Tue, Jan 28, 2025 at 07:02:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Jan 2025 13:38:42 -1000
> Tejun Heo <tj@kernel.org> wrote:
> >
> > Yeap, that's exactly what kernfs does.
>
> And eventfs goes one step further. Because there's a full directory layout
> that's identical for every event, it has a single descriptor per directory
> and none per file. As there can be over 10 files per directory/event, I
> didn't want to waste even that memory. This is why I couldn't use kernfs
> for eventfs: I was able to save a couple more megabytes by not having the
> files have any descriptor representing them (besides a single array shared
> by all events).

Ok, that's fine, but the original point of "are you sure you want to use
kernfs for anything other than what we have today" remains. It's only a
limited set of use cases that kernfs is good for; libfs is still the best
place to start out for a virtual filesystem. The fact that the majority of
our "fake" filesystems are using libfs and not kernfs is semi-proof of
that?

Or is it proof that kernfs is just so undocumented that no one wants to
move to it? I don't know, but adding samples like this really isn't the
answer to that. The answer would be moving an existing libfs implementation
to use kernfs, and then that patch series would be the example for others
to follow.

thanks,

greg k-h
Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Tue, Jan 28, 2025 at 07:02:24PM -0500, Steven Rostedt wrote:
> >
> > And eventfs goes one step further. Because there's a full directory
> > layout that's identical for every event, it has a single descriptor per
> > directory and none per file. [...]
>
> Ok, that's fine, but the original point of "are you sure you want to use
> kernfs for anything other than what we have today" remains. It's only a
> limited set of use cases that kernfs is good for; libfs is still the
> best place to start out for a virtual filesystem. The fact that the
> majority of our "fake" filesystems are using libfs and not kernfs is
> semi-proof of that?
>
> Or is it proof that kernfs is just so undocumented that no one wants to
> move to it? I don't know, but adding samples like this really isn't the
> answer to that. The answer would be moving an existing libfs
> implementation to use kernfs, and then that patch series would be the
> example for others to follow.
>
> thanks,
>
> greg k-h

Thanks for reviewing the patch, Greg!

I put this sample together with the idea that some documentation is better
than none. I researched how kernfs could be useful in tracefs and debugfs,
but I haven't looked deeply into other virtual filesystems, so I may have
overestimated how well kernfs fits other use cases. From this discussion, I
see that a real libfs-to-kernfs port would provide a better understanding
of kernfs's viability elsewhere and also serve as documentation.

Thanks for the discussion, folks! I learned a lot from this thread.

Thanks,
David Reaver