Message ID | 20250121153646.37895-1-me@davidreaver.com
---|---
Series | samples/kernfs: Add a pseudo-filesystem to demonstrate kernfs usage
On Tue, Jan 21, 2025 at 07:36:34AM -0800, David Reaver wrote:
> This patch series creates a toy pseudo-filesystem built on top of kernfs in
> samples/kernfs/.

Is that a good idea? kernfs and the interactions with the users of it
are a pretty convoluted mess. I'd much prefer people writing their
pseudo file systems to the VFS APIs over spreading kernfs usage further.
On Mon, 27 Jan 2025 22:08:29 -0800
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 21, 2025 at 07:36:34AM -0800, David Reaver wrote:
> > This patch series creates a toy pseudo-filesystem built on top of kernfs in
> > samples/kernfs/.
>
> Is that a good idea? kernfs and the interactions with the users of it
> are a pretty convoluted mess. I'd much prefer people writing their
> pseudo file systems to the VFS APIs over spreading kernfs usage further.

I have to disagree with this. As someone that uses a pseudo file system to
interact with my subsystem, I really don't want to have to know the
internals of the virtual file system layer just so I can interact via the
file system. Not knowing how to do that properly was what got me in trouble
with Linus in the first place.

The VFS layer is best for developing file systems that are for storage, like
XFS, ext4, bcachefs, etc. And yes, if you are developing a new storage
layout, then you should know the VFS APIs. But pseudo file systems are a
completely different beast. The files are not for storage but for control of
the kernel; they map to control objects. For tracefs, there's a
"current_tracer" file: if you write "function" to it, it starts the function
tracer. It has to maintain state, but only for the life of the boot, not
across boots.

All of debugfs is the same way, and unfortunately, the kernel API for
debugfs is wrong. It uses dentries as the handle to its files, which it
should not be doing. A dentry is a complex internal cache element within the
VFS, and I assumed that because debugfs used it, it was OK to use it as
well; that's where my arguments with Linus stemmed from.

For people like myself that only need a way to have a control interface via
the file system, kernfs appears to cover that. Maybe kernfs isn't
implemented the way you like? If that's the case, we should fix that.

But from my point of view, it would be really great if I could create a file
system control interface without having to know anything about how the VFS
is implemented.

BTW, I was going to work on converting debugfs over to kernfs if I ever got
the chance (or mentor someone else to do it).

Whether it's kernfs or something else, it would be really great to have a
kernel abstraction layer that lets you expose a pseudo file system without
having to implement a pseudo file system yourself. debugfs was that, and
became very popular, but it was done incorrectly.

-- Steve
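[Editor's note: the "dentry as a handle" complaint is visible right in the debugfs API. Below is a hedged sketch of a typical debugfs user — the `demo` names are hypothetical illustrations, not code from this thread; the calls themselves (`debugfs_create_dir()`, `debugfs_create_u32()`, `debugfs_remove()`) are the real public API.]

```c
#include <linux/debugfs.h>
#include <linux/module.h>

static struct dentry *demo_dir;	/* the directory "handle" is a bare VFS dentry */
static u32 demo_val;

static int __init demo_init(void)
{
	/* debugfs_create_dir() hands back a struct dentry *, an internal
	 * VFS cache object, and callers are expected to store it and pass
	 * it back in as the parent for every file they create. */
	demo_dir = debugfs_create_dir("demo", NULL);
	debugfs_create_u32("value", 0644, demo_dir, &demo_val);
	return 0;
}

static void __exit demo_exit(void)
{
	debugfs_remove(demo_dir);	/* tears down the whole subtree */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

By contrast, kernfs hands its callers an opaque `struct kernfs_node *`, so users never hold raw VFS cache objects.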
On Tue, 28 Jan 2025 at 07:27, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 27 Jan 2025 22:08:29 -0800
> Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Is that a good idea? kernfs and the interactions with the users of it
> > are a pretty convoluted mess. I'd much prefer people writing their
> > pseudo file systems to the VFS APIs over spreading kernfs usage further.
>
> I have to disagree with this. As someone that uses a pseudo file system to
> interact with my subsystem, I really don't want to have to know the
> internals of the virtual file system layer just so I can interact via the
> file system. Not knowing how to do that properly was what got me in trouble
> with Linus in the first place.

Well, honestly, you were doing some odd things.

For a *simple* filesystem that actually acts as a filesystem, all you need
is in libfs with things like &simple_dir_operations etc.

And we have a *lot* of perfectly regular users of things like that. Not
like the ftrace mess that had very *non*-filesystem semantics with separate
lifetime confusion etc, and that tried to maintain a separate notion of
permissions etc.

To make matters worse, tracefs then had a completely different model for
events, and these interacted oddly in non-filesystem ways.

In other words, all the tracefs problems were self-inflicted, and a lot of
them were because you wanted to go behind the VFS layer's back because you
had millions of nodes but didn't want to have millions of inodes etc.

That's not normal.

I mean, you can pretty much literally look at ramfs:

  fs/ramfs/inode.c

and it is a real example filesystem that does a lot of things, but almost
all of it is just using the direct vfs helpers (simple_lookup /
simple_link / simple_rmdir etc etc). It plays *zero* games with dentries.

Or look at fs/pstore. Or any number of other examples.

And no, nobody should *EVER* look at the horror that is tracefs and eventfs.

           Linus
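[Editor's note: for readers unfamiliar with the libfs route Linus points at, here is a hedged sketch of a minimal pseudo-filesystem built purely from libfs helpers. The `demofs` name and magic number are made up for illustration; the helpers (`simple_fill_super()`, `struct tree_descr`, `get_tree_single()`, `kill_litter_super()`) are real, with signatures as in recent kernels.]

```c
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>
#include <linux/uaccess.h>

#define DEMOFS_MAGIC 0x64656d6f	/* hypothetical magic number */

static ssize_t demo_read(struct file *file, char __user *buf,
			 size_t count, loff_t *ppos)
{
	static const char msg[] = "hello from demofs\n";

	return simple_read_from_buffer(buf, count, ppos, msg, sizeof(msg) - 1);
}

static const struct file_operations demo_fops = {
	.read	= demo_read,
	.llseek	= default_llseek,
};

static int demo_fill_super(struct super_block *sb, struct fs_context *fc)
{
	/* Array index becomes the inode number; slots 0 and 1 are reserved
	 * and the empty name terminates the list. */
	static const struct tree_descr demo_files[] = {
		[2] = { "hello", &demo_fops, 0444 },
		{ "" }
	};

	/* simple_fill_super() wires up simple_dir_operations for the root
	 * and creates pinned inodes/dentries for the listed files. */
	return simple_fill_super(sb, DEMOFS_MAGIC, demo_files);
}

static int demo_get_tree(struct fs_context *fc)
{
	return get_tree_single(fc, demo_fill_super);
}

static const struct fs_context_operations demo_fc_ops = {
	.get_tree = demo_get_tree,
};

static int demo_init_fs_context(struct fs_context *fc)
{
	fc->ops = &demo_fc_ops;
	return 0;
}

static struct file_system_type demo_fs_type = {
	.owner		 = THIS_MODULE,
	.name		 = "demofs",
	.init_fs_context = demo_init_fs_context,
	.kill_sb	 = kill_litter_super,
};

static int __init demofs_init(void)
{
	return register_filesystem(&demo_fs_type);
}
module_init(demofs_init);
MODULE_LICENSE("GPL");
```

Note the trade-off at the heart of this thread: everything created by `simple_fill_super()` has a dentry and inode pinned for the lifetime of the mount, which is exactly what kernfs was designed to avoid.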
On Tue, 28 Jan 2025 14:05:05 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Well, honestly, you were doing some odd things.

Some of those odd things were because of the use of the dentry as a handle,
which also required making an inode for every file. When the number of
event files blew up to tens of thousands, that caused a lot of memory to be
used.

> For a *simple* filesystem that actually acts as a filesystem, all you
> need is in libfs with things like &simple_dir_operations etc.
>
> And we have a *lot* of perfectly regular users of things like that. Not
> like the ftrace mess that had very *non*-filesystem semantics with
> separate lifetime confusion etc, and that tried to maintain a separate
> notion of permissions etc.

I would also say that the proc file system is rather messy. But that's very
old and has a long history, which probably built up its complexity.

> To make matters worse, tracefs then had a completely different model
> for events, and these interacted oddly in non-filesystem ways.

Ideally, I'd rather it not have been done that way. To save memory, since
every event in eventfs has the same files, it was better to just make a
single array that represents those files for every event. That saved over
20 megabytes per tracing instance.

> In other words, all the tracefs problems were self-inflicted, and a
> lot of them were because you wanted to go behind the VFS layer's back
> because you had millions of nodes but didn't want to have millions of
> inodes etc.
>
> That's not normal.
>
> I mean, you can pretty much literally look at ramfs:
>
>   fs/ramfs/inode.c
>
> and it is a real example filesystem that does a lot of things, but
> almost all of it is just using the direct vfs helpers (simple_lookup /
> simple_link / simple_rmdir etc etc). It plays *zero* games with
> dentries.

It's also a storage file system. It's just that it stores to memory: it
simply uses the page cache and never needs to write anything to disk. It's
not a good example for a control interface.

> Or look at fs/pstore.

Another storage file system.

> Or any number of other examples.
>
> And no, nobody should *EVER* look at the horror that is tracefs and eventfs.

I believe kernfs is meant to cover control interfaces like sysfs and
debugfs, whose files actually change kernel behavior when written to.
That's also likely why procfs is such a mess: it too is a control
interface.

Yes, eventfs is "special", but tracefs could easily be converted to kernfs.
I believe Christian even wrote a POC that did that.

-- Steve
On Tue, Jan 28, 2025 at 05:42:57PM -0500, Steven Rostedt wrote:
...
> I believe kernfs is meant to cover control interfaces like sysfs and
> debugfs, whose files actually change kernel behavior when written to.
> That's also likely why procfs is such a mess: it too is a control
> interface.

Just for context, kernfs was factored out of sysfs. One of the factors that
drove the design was memory overhead. On large systems (IIRC especially
with iSCSI), there can be a huge number of sysfs nodes, and allocating a
dentry and inode pair for each file made some machines run out of memory
during boot. So sysfs implemented a memory-backed filesystem store, which
made its interface to its users depart from the VFS layer.

This requirement holds for cgroup too - there are systems with a *lot* of
cgroups and associated interface files, and we don't want to pin a dentry
and inode for all of them.

Thanks.
On Tue, 28 Jan 2025 12:51:47 -1000
Tejun Heo <tj@kernel.org> wrote:

> Just for context, kernfs was factored out of sysfs. One of the factors
> that drove the design was memory overhead. On large systems (IIRC
> especially with iSCSI), there can be a huge number of sysfs nodes, and
> allocating a dentry and inode pair for each file made some machines run
> out of memory during boot. So sysfs implemented a memory-backed
> filesystem store, which made its interface to its users depart from the
> VFS layer.
>
> This requirement holds for cgroup too - there are systems with a *lot* of
> cgroups and associated interface files, and we don't want to pin a dentry
> and inode for all of them.

Right. And going back to ramfs, it too has a dentry and inode for every
file that is created. Thus, if you have a lot of files, you'll have a lot
of memory dedicated to their dentries and inodes that will never be freed.
ramfs_create() and ramfs_mkdir() both call ramfs_mknod(), which does a
d_instantiate() and a dget() on the dentry, so they are persistent until
they are deleted or a reboot happens.

What I did for eventfs, and what I believe kernfs does, is to create a
small descriptor to represent the control data and reference the
descriptors like what you would have on disk. That is, the control elements
(like a trace event descriptor) are really what is on "disk". When someone
does an "ls" on the pseudo file system, there needs to be a way for the VFS
layer to query the control structures, just as a normal file system would
query the data stored on disk, and then let the VFS layer create the
dentries and inodes when referenced and, more importantly, free them when
they are no longer referenced and there's memory pressure. I believe kernfs
does the same thing.

And my point is, it would be nice to have an abstraction layer that
represents control descriptors that may be around for the entirety of the
boot (like trace events are) without needing to pin a dentry and inode for
each one of these files. Currently, that abstraction layer is kernfs.

-- Steve
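[Editor's note: the descriptor model described above is visible in the kernfs API itself — a file is just a small `struct kernfs_node` plus a `struct kernfs_ops`, and kernfs materializes dentries/inodes only while a node is actually referenced. Below is a hedged sketch; the `demo` names and the "idle" payload are hypothetical, the mount wiring is omitted, and `kernfs_create_file()`'s exact signature has varied across kernel versions (this follows recent ones).]

```c
#include <linux/kernfs.h>
#include <linux/seq_file.h>

/* One read-only attribute: the show callback pulls its backing data out
 * of the priv pointer stored in the kernfs_node at creation time. */
static int demo_seq_show(struct seq_file *sf, void *v)
{
	struct kernfs_open_file *of = sf->private;

	seq_printf(sf, "%s\n", (const char *)of->kn->priv);
	return 0;
}

static const struct kernfs_ops demo_kf_ops = {
	.seq_show = demo_seq_show,
};

static int demo_setup(void)
{
	struct kernfs_root *root;
	struct kernfs_node *dir, *kn;

	root = kernfs_create_root(NULL, 0, NULL);
	if (IS_ERR(root))
		return PTR_ERR(root);

	dir = kernfs_create_dir(kernfs_root_to_node(root), "demo", 0755, NULL);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	/* No dentry or inode exists for this file yet; kernfs keeps only
	 * this small node, and the VFS objects are created on lookup and
	 * reclaimed under memory pressure. */
	kn = kernfs_create_file(dir, "state", 0444, GLOBAL_ROOT_UID,
				GLOBAL_ROOT_GID, 0, &demo_kf_ops,
				(void *)"idle");
	return PTR_ERR_OR_ZERO(kn);
}
```

This is the inversion Steve describes: the descriptor tree plays the role of the on-disk metadata, and the VFS queries it on demand instead of pinning a dentry/inode pair per file.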
On Tue, Jan 28, 2025 at 06:29:57PM -0500, Steven Rostedt wrote:
> What I did for eventfs, and what I believe kernfs does, is to create a
> small descriptor to represent the control data and reference the
> descriptors like what you would have on disk. That is, the control
> elements (like a trace event descriptor) are really what is on "disk".
> When someone does an "ls" on the pseudo file system, there needs to be a
> way for the VFS layer to query the control structures, just as a normal
> file system would query the data stored on disk, and then let the VFS
> layer create the dentries and inodes when referenced and, more
> importantly, free them when they are no longer referenced and there's
> memory pressure.

Yeap, that's exactly what kernfs does.

Thanks.
On Tue, 28 Jan 2025 13:38:42 -1000
Tejun Heo <tj@kernel.org> wrote:

> On Tue, Jan 28, 2025 at 06:29:57PM -0500, Steven Rostedt wrote:
> > What I did for eventfs, and what I believe kernfs does, is to create a
> > small descriptor to represent the control data and reference the
> > descriptors like what you would have on disk. [...]
>
> Yeap, that's exactly what kernfs does.

And eventfs goes one step further. Because there's a full directory layout
that's identical for every event, it has a single descriptor per directory
and none per file. As there can be over 10 files per directory/event, I
didn't want to waste even that memory. This is why I couldn't use kernfs
for eventfs: I was able to save a couple more megabytes by not having the
files have any descriptor representing them (besides a single array shared
by all events).

-- Steve
On Tue, Jan 28, 2025 at 07:02:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Jan 2025 13:38:42 -1000
> Tejun Heo <tj@kernel.org> wrote:
> >
> > Yeap, that's exactly what kernfs does.
>
> And eventfs goes one step further. Because there's a full directory layout
> that's identical for every event, it has a single descriptor per directory
> and none per file. As there can be over 10 files per directory/event, I
> didn't want to waste even that memory. This is why I couldn't use kernfs
> for eventfs: I was able to save a couple more megabytes by not having the
> files have any descriptor representing them (besides a single array shared
> by all events).

Ok, that's fine, but the original point of "are you sure you want to use
kernfs for anything other than what we have today" remains. It's only a
limited set of use cases that kernfs is good for; libfs is still the best
place to start out for a virtual filesystem. The fact that the majority of
our "fake" filesystems are using libfs and not kernfs is semi-proof of
that?

Or is it proof that kernfs is just so undocumented that no one wants to
move to it? I don't know, but adding samples like this really isn't the
answer to that. The answer would be moving an existing libfs implementation
to use kernfs, and then that patch series would be the example for others
to follow.

thanks,

greg k-h
Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Tue, Jan 28, 2025 at 07:02:24PM -0500, Steven Rostedt wrote:
> >
> > And eventfs goes one step further. Because there's a full directory
> > layout that's identical for every event, it has a single descriptor per
> > directory and none per file. [...]
>
> Ok, that's fine, but the original point of "are you sure you want to use
> kernfs for anything other than what we have today" remains. It's only a
> limited set of use cases that kernfs is good for; libfs is still the
> best place to start out for a virtual filesystem. The fact that the
> majority of our "fake" filesystems are using libfs and not kernfs is
> semi-proof of that?
>
> Or is it proof that kernfs is just so undocumented that no one wants to
> move to it? I don't know, but adding samples like this really isn't the
> answer to that. The answer would be moving an existing libfs
> implementation to use kernfs, and then that patch series would be the
> example for others to follow.
>
> thanks,
>
> greg k-h

Thanks for reviewing the patch, Greg!

I put this sample together with the idea that some documentation is better
than none. I researched how kernfs could be useful in tracefs and debugfs,
but I haven't looked deeply into other virtual filesystems, so I may have
overestimated how well kernfs fits other use cases. From this discussion, I
see that a real libfs-to-kernfs port would provide a better understanding
of kernfs's viability elsewhere and also serve as documentation.

Thanks for the discussion, folks! I learned a lot from this thread.

Thanks,
David Reaver