mbox series

[DRAFT,0/4] : Port tracefs to kernfs

Message ID 20240131-tracefs-kernfs-v1-0-f20e2e9a8d61@kernel.org (mailing list archive)
Headers show
Series : Port tracefs to kernfs | expand

Message

Christian Brauner Jan. 31, 2024, 1:36 p.m. UTC
Back in 2022 we already had a session at LSFMM where we talked about
eventfs and we said that it should be based on kernfs and any missing
functionality be implemented in kernfs. Instead we've gotten a
hand-rolled version of similar functionality and 100+ mails exchanges
over the last weeks to fix bugs in there binding people's time.

All we've heard so far were either claims that it would be too difficult
to port tracefs to kernfs or that it somehow wouldn't work but we've
never heard why and it's never been demonstrated why.

So I went and started a draft for porting all of tracefs to kernfs in
the hopes that someone picks this up and finishes the work. I've gotten
the core of it done and it's pretty easy to do logical copy-pasta to
port this to eventfs as well.

I want to see tracefs and eventfs ported to kernfs and get rid of the
hand-rolled implementation. I don't see the value in any additional
talks about why eventfs is special until we've seen an implementation of
tracefs on kernfs.

I'm pretty certain that we have capable people that can and want to
finish the port (I frankly don't have time for this unless I drop all
reviews.). I've started just jotting down the basics yesterday evening
and came to the conclusion that:

* It'll get rid of pointless dentry pinning in various places that is
  currently done in the first place. Instead only a kernfs root and a
  kernfs node need to be stashed. Dentries and inodes are added
  on-demand.

* It'll make _all of_ tracefs capable of on-demand dentry and inode
  creation.

* Quoting [1]:

  > The biggest savings in eventfs is the fact that it has no meta data for
  > files. All the directories in eventfs has a fixed number of files when they
  > are created. The creating of a directory passes in an array that has a list
  > of names and callbacks to call when the file needs to be accessed. Note,
  > this array is static for all events. That is, there's one array for all
  > event files, and one array for all event systems, they are not allocated per
  > directory.

  This is all possible with kernfs.

* All ownership information (mode, uid, gid) is stashed and kept
  kernfs_node->iattrs. So the parent kernfs_node's ownership can be used
  to set the child's ownership information. This will allow to get rid
  of any custom permission checking and ->getattr() and ->setattr()
  calls.

* Private tracefs data that was stashed in inode->i_private is stashed
  in kernfs_node->priv. That's always accessible in kernfs->open() calls
  via kernfs_open_file->kn->priv but it could also be transferred to
  kernfs_open_file->priv. In any case, it makes it a lot easier to
  handle private data than tracefs does it now.

* It'll make maintenance of tracefs easier in the long run because new
  functionality and improvements get added to kernfs including better
  integration with namespaces (I've had patchsets for kernfs a while ago
  to unlock additional namespaces.)

* There's no need for separate i_ops for "instances" and regular tracefs
  directories. Simply compare the stashed kernfs_node of the "instances"
  directory against the current kernfs_node passed to ->mkdir() or
  ->rmdir() whether the directory creation or deletion is allowed.

* Frankly, another big reason to do it is simply maintenance. All of the
  maintenance burden neeeds to be shifted to the generic kernfs
  implementation which is maintained by people familar with filesystem
  details. I'm willing to support it too.

  No shade, but currently I don't see how eventfs can be maintained
  without the involvement of others. Maintainability alone should be a
  sufficient reason to move all of this to kernfs and add any missing
  functionality.

* If we have a session about this at LSFMM and I want to see a POC of
  tracefs and eventfs built on top of kernfs. I'm tired of talking about
  a private implementation of functionality that already exists.
  Otherwise, this is just wasting everyone's time and eventfs as it is
  will not become common infrastructure.

* Yes, debugfs could or should be ported as well but it's almost
  irrelevant for debugfs. It's a debugging filesystem. If you enable it
  on a production workload then you have bigger problems to worry about
  than wasted memory. So I don't consider that urgent. But tracefs is
  causing us headaches right now and I'm weary of cementing a
  hand-rolled implementation.

So really, please let's move this to kernfs, fix any things that aren't
supported in kernfs (I haven't seen any) and get rid of all the custom
functionality. Part of the work is moving tracefs to the new mount api
(which should've been done anyway).

The fs/tracefs/ part already compiles. The rest I haven't finished
converting. All the file_operations need to be moved to kernfs_ops which
shouldn't be too difficult.

To: Steven Rostedt <rostedt@goodmis.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
To: Amir Goldstein <amir73il@gmail.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: lsf-pc@lists.linux-foundation.org,
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>

Link: https://lore.kernel.org/r/20240129105726.2c2f77f0@gandalf.local.home [1]
Link: https://lore.kernel.org/r/20240129105726.2c2f77f0@gandalf.local.home
---
Christian Brauner (4):
      [DRAFT]: tracefs: port to kernfs
      [DRAFT]: trace: stash kernfs_node instead of dentries
      [DRAFT]: hwlat: port struct file_operations thread_mode_fops to struct kernfs_ops
      [DRAFT]: trace: illustrate how to convert basic open functions

 fs/kernfs/mount.c                 |  10 +
 fs/tracefs/inode.c                | 649 +++++++++++++-------------------------
 include/linux/kernfs.h            |   3 +
 include/linux/tracefs.h           |  18 +-
 kernel/trace/trace.c              |  22 +-
 kernel/trace/trace.h              |   4 +-
 kernel/trace/trace_events_synth.c |   4 +-
 kernel/trace/trace_events_user.c  |   2 +-
 kernel/trace/trace_hwlat.c        |  45 +--
 9 files changed, 270 insertions(+), 487 deletions(-)
---
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
change-id: 20240131-tracefs-kernfs-3f2def6eab11

Comments

Steven Rostedt Jan. 31, 2024, 2:41 p.m. UTC | #1
On Wed, 31 Jan 2024 14:36:37 +0100
Christian Brauner <brauner@kernel.org> wrote:

> Back in 2022 we already had a session at LSFMM where we talked about
> eventfs and we said that it should be based on kernfs and any missing
> functionality be implemented in kernfs. Instead we've gotten a
> hand-rolled version of similar functionality and 100+ mails exchanges
> over the last weeks to fix bugs in there binding people's time.

Note, tracefs was written way before that. This was only about eventfs.

> 
> All we've heard so far were either claims that it would be too difficult
> to port tracefs to kernfs or that it somehow wouldn't work but we've
> never heard why and it's never been demonstrated why.

Well, because mainly lack of documentation.

> 
> So I went and started a draft for porting all of tracefs to kernfs in
> the hopes that someone picks this up and finishes the work. I've gotten
> the core of it done and it's pretty easy to do logical copy-pasta to
> port this to eventfs as well.

tracefs yes, but I'm not so sure about eventfs.

> 
> I want to see tracefs and eventfs ported to kernfs and get rid of the
> hand-rolled implementation. I don't see the value in any additional
> talks about why eventfs is special until we've seen an implementation of
> tracefs on kernfs.
> 
> I'm pretty certain that we have capable people that can and want to
> finish the port (I frankly don't have time for this unless I drop all
> reviews.). I've started just jotting down the basics yesterday evening
> and came to the conclusion that:

I don't have the time either. But if someone else wants to, I'm fine with
that. I was supposed to be done with my eventfs work by December. It's
almost already February. It put me behind so much I worked throughout the
entire time between Christmas and New Years, including the weekends :-p, I
haven't stopped and I'm really starting to feel burnt.

All I ask is to keep the requirement in eventfs not to allocate anything
for files on creation (see below).

> 
> * It'll get rid of pointless dentry pinning in various places that is
>   currently done in the first place. Instead only a kernfs root and a
>   kernfs node need to be stashed. Dentries and inodes are added
>   on-demand.
> 
> * It'll make _all of_ tracefs capable of on-demand dentry and inode
>   creation.
> 
> * Quoting [1]:
> 
>   > The biggest savings in eventfs is the fact that it has no meta data for
>   > files. All the directories in eventfs has a fixed number of files when they
>   > are created. The creating of a directory passes in an array that has a list
>   > of names and callbacks to call when the file needs to be accessed. Note,
>   > this array is static for all events. That is, there's one array for all
>   > event files, and one array for all event systems, they are not allocated per
>   > directory.  
> 
>   This is all possible with kernfs.

Is it? Let me explain how eventfs files are created, and then you can tell
me how to do this with kernfs. Maybe it is possible.

There's only three types of directories in eventfs.

1) the "events" directory. This is the top node and is created in
   /sys/kernel/tracing/events as well as in /sys/kernel/tracing/instances/<instance>/events

2) The system directory. This is the events/<system>

3) The event directory. This is the events/<system>/<event>

Each of these directories have the exact same files for their type. For the
"events" directory at the top instance, or each created instance, it has
the same files (although they all have different state). All <system>
directories have the same files, and all the <event> directories have the
same files.

When one of these directories is created, it is passed a static array that
contains the name and a callback function for each of these files. That
means, for the three types of directories, there is only three arrays that
represent all files in eventfs.

Let's look at the events directory:

	ei = eventfs_create_dir(name, e_events, event_entries, nr_entries, file);

Where it passes the "name" of the event directory, the parent eventfs_inode
(e_events), a static array of the files within this directory
(event_entries) along with he size of that array (nr_entries), and finally
it passes the default data for all of the files (file) that is passed to
the callbacks and the callbacks can override the data to add to the
inode->i_private.

Let's look at that event_entries:

	static struct eventfs_entry event_entries[] = {
		{
			.name		= "enable",
			.callback	= event_callback,
		},
		{
			.name		= "filter",
			.callback	= event_callback,
		},
		{
			.name		= "trigger",
			.callback	= event_callback,
		},
		{
			.name		= "format",
			.callback	= event_callback,
		},
#ifdef CONFIG_PERF_EVENTS
		{
			.name		= "id",
			.callback	= event_callback,
		},
#endif
#ifdef CONFIG_HIST_TRIGGERS
		{
			.name		= "hist",
			.callback	= event_callback,
		},
#endif
#ifdef CONFIG_HIST_TRIGGERS_DEBUG
		{
			.name		= "hist_debug",
			.callback	= event_callback,
		},
#endif
#ifdef CONFIG_TRACE_EVENT_INJECT
		{
			.name		= "inject",
			.callback	= event_callback,
		},
#endif
	};

Notice the "static" in front. That means, *all* event directories use the
same array. There is zero allocation for any file within each of these
directories. The only exception for allocation is if the attrs change. Then
we need to allocate an array in the directory to handle that. But that too
is only allocated when referenced.

And there's a lot of those directories:

  # ls -d events/*/* |wc -l
2212

The dentry and inode are only created on lookup in the lookup code:

	for (int i = 0; i < ei->nr_entries; i++) {
		void *data;
		umode_t mode;
		const struct file_operations *fops;
		const struct eventfs_entry *entry = &ei->entries[i];

		if (strcmp(name, entry->name) != 0)
			continue;

		data = ei->data;
		if (entry->callback(name, &mode, &data, &fops) <= 0)
			goto enoent;

		result = lookup_file_dentry(dentry, ei, i, mode, data, fops);
		goto out;
	}

The ei represents the events directory (or a system directory or even the
"events" directory). It loops that static array looking for a matching
name (if needed, I have thought about requiring it to be sorted, to do a
binary search instead). If it finds one, it then calls the callback
function, as some event directories do not have all the files, the callback
can inform the lookup that "no, this directory doesn't get this file". For
example, the internal ftrace events (like what is used for function
tracing) doesn't have an "enable" or "filter" file as you can't enable
those events, the callback will inform the lookup about that.

	if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
		if (call->class->reg && strcmp(name, "enable") == 0) {
			*mode = TRACE_MODE_WRITE;
			*fops = &ftrace_enable_fops;
			return 1;
		}

		if (strcmp(name, "filter") == 0) {
			*mode = TRACE_MODE_WRITE;
			*fops = &ftrace_event_filter_fops;
			return 1;
		}
	}
	[..]
	return 0;

 # ls events/sched/sched_switch/
enable  filter  format  hist  hist_debug  id  inject  trigge

 # ls events/ftrace/function/
format  hist  hist_debug  id  inject

My point being, that eventfs doesn't allocate any resources for the files.
Well, you could say it does allocate the ei->entries pointer that points to
the static array.

tracefs has tracefs_create_file() but evenfs does not. What is the
equivalent to that in kernfs?

  # ls -d events/*/*/* |wc -l
15819

I do not want to allocate 15 thousand kernfs_node's for this.

Perhaps kernfs has a way to do the same, or maybe it's trivial to make it
do it? I don't know.

> 
> * All ownership information (mode, uid, gid) is stashed and kept
>   kernfs_node->iattrs. So the parent kernfs_node's ownership can be used
>   to set the child's ownership information. This will allow to get rid
>   of any custom permission checking and ->getattr() and ->setattr()
>   calls.
> 
> * Private tracefs data that was stashed in inode->i_private is stashed
>   in kernfs_node->priv. That's always accessible in kernfs->open() calls
>   via kernfs_open_file->kn->priv but it could also be transferred to
>   kernfs_open_file->priv. In any case, it makes it a lot easier to
>   handle private data than tracefs does it now.
> 
> * It'll make maintenance of tracefs easier in the long run because new
>   functionality and improvements get added to kernfs including better
>   integration with namespaces (I've had patchsets for kernfs a while ago
>   to unlock additional namespaces.)
> 
> * There's no need for separate i_ops for "instances" and regular tracefs
>   directories. Simply compare the stashed kernfs_node of the "instances"
>   directory against the current kernfs_node passed to ->mkdir() or
>   ->rmdir() whether the directory creation or deletion is allowed.  
> 
> * Frankly, another big reason to do it is simply maintenance. All of the
>   maintenance burden neeeds to be shifted to the generic kernfs
>   implementation which is maintained by people familar with filesystem
>   details. I'm willing to support it too.
> 
>   No shade, but currently I don't see how eventfs can be maintained
>   without the involvement of others. Maintainability alone should be a
>   sufficient reason to move all of this to kernfs and add any missing
>   functionality.
> 
> * If we have a session about this at LSFMM and I want to see a POC of
>   tracefs and eventfs built on top of kernfs. I'm tired of talking about
>   a private implementation of functionality that already exists.
>   Otherwise, this is just wasting everyone's time and eventfs as it is
>   will not become common infrastructure.

That was never the point. I believe the point was how do we make it easier
to not have this situation happen again. I don't want eventfs to be the
standard way of doing things. I'm looking at this as more of a post-mortem
session than "let's do it this way" one.

-- Steve

> 
> * Yes, debugfs could or should be ported as well but it's almost
>   irrelevant for debugfs. It's a debugging filesystem. If you enable it
>   on a production workload then you have bigger problems to worry about
>   than wasted memory. So I don't consider that urgent. But tracefs is
>   causing us headaches right now and I'm weary of cementing a
>   hand-rolled implementation.
> 
> So really, please let's move this to kernfs, fix any things that aren't
> supported in kernfs (I haven't seen any) and get rid of all the custom
> functionality. Part of the work is moving tracefs to the new mount api
> (which should've been done anyway).
> 
> The fs/tracefs/ part already compiles. The rest I haven't finished
> converting. All the file_operations need to be moved to kernfs_ops which
> shouldn't be too difficult.
James Bottomley Jan. 31, 2024, 2:59 p.m. UTC | #2
On Wed, 2024-01-31 at 14:36 +0100, Christian Brauner wrote:
> * If we have a session about this at LSFMM and I want to see a POC of
>   tracefs and eventfs built on top of kernfs. I'm tired of talking
> about
>   a private implementation of functionality that already exists.
>   Otherwise, this is just wasting everyone's time and eventfs as it
> is
>   will not become common infrastructure.

Note: I asked for this to be updated to be a "how do we prevent this
happening again" type session.  I believe you took this as "the VFS is
insufficiently documented" but that wasn't my intent.  There's clearly
something missing that should give people looking to do filesystems
like this a clear direction on how to start, where to look and what vfs
properties are required, which properties break some tools if not
implemented (which may or may not be important to the use case) and
which are nice to have.  When I did shiftfs, my biggest problem was
actually getting configfs to work for it due to being unable to operate
without a superblock, so learning all the VFS intricacies came in
second to that.  I did think at the time I should do a talk more
focussed on what I learned about the basics of the VFS for psuedo
filesystems, but that got lost in the need to push shiftfs itself. 
After that most of my subsequent talks were about extending configfs
because that was the area I had the most problems ...

James
Steven Rostedt Jan. 31, 2024, 3:28 p.m. UTC | #3
On Wed, 31 Jan 2024 09:41:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > So I went and started a draft for porting all of tracefs to kernfs in
> > the hopes that someone picks this up and finishes the work. I've gotten
> > the core of it done and it's pretty easy to do logical copy-pasta to
> > port this to eventfs as well.  
> 
> tracefs yes, but I'm not so sure about eventfs.

BTW, I do want to thank you for doing this. I would *love* to have
tracefs switched over to kernfs. Unfortunately, I have no time to do
it. I also don't want to lose the memory savings that is done in
eventfs.

Even if kernfs couldn't do what is needed in eventfs, I'd still be
happy if just the tracefs portion was converted over to kernfs, as long
as it treated the eventfs the same as tracefs does today.

I never really wanted to be the tracefs/eventfs maintainer. I just
needed the interface for tracing. The more generic code it can use, the
better. This is why I'm ecstatic for the simplification changes that
Linus is making.

-- Steve