Message ID | 20230922062558.1739642-1-max.kellermann@ionos.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] fs/ceph/debugfs: make all files world-readable |
On 9/22/23 14:25, Max Kellermann wrote:
> I'd like to be able to run metrics collector processes without special
> privileges.
>
> In the kernel, there is a mix of debugfs files being world-readable
> and not world-readable; with a naive "git grep", I found 723
> world-readable debugfs_create_file() calls and 582 calls which were
> only accessible to privileged processes.
>
> From the code, I cannot derive a consistent policy for that, but the
> ceph statistics seem harmless (and useful) enough.

I am not sure whether this will make sense, because the 'debug' directory
under '/sys/kernel/' is also only accessible by privileged processes.

Ilya, Jeff

Any idea?

Thanks

- Xiubo

> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
>  fs/ceph/debugfs.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> index 3904333fa6c3..2abee7e18144 100644
> --- a/fs/ceph/debugfs.c
> +++ b/fs/ceph/debugfs.c
> @@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
>  						  name);
>
>  	fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
> -						  0400,
> +						  0444,
>  						  fsc->client->debugfs_dir,
>  						  fsc,
>  						  &mdsmap_fops);
>
>  	fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
> -							0400,
> +							0444,
>  							fsc->client->debugfs_dir,
>  							fsc,
>  							&mds_sessions_fops);
>
>  	fsc->debugfs_mdsc = debugfs_create_file("mdsc",
> -						0400,
> +						0444,
>  						fsc->client->debugfs_dir,
>  						fsc,
>  						&mdsc_fops);
>
>  	fsc->debugfs_caps = debugfs_create_file("caps",
> -						0400,
> +						0444,
>  						fsc->client->debugfs_dir,
>  						fsc,
>  						&caps_fops);
>
>  	fsc->debugfs_status = debugfs_create_file("status",
> -						  0400,
> +						  0444,
>  						  fsc->client->debugfs_dir,
>  						  fsc,
>  						  &status_fops);
> @@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
>  	fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
>  						      fsc->client->debugfs_dir);
>
> -	debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
> +	debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
>  			    &metrics_file_fops);
> -	debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
> +	debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
>  			    &metrics_latency_fops);
> -	debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
> +	debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
>  			    &metrics_size_fops);
> -	debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
> +	debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
>  			    &metrics_caps_fops);
>  }
>
On Mon, 2023-09-25 at 13:18 +0800, Xiubo Li wrote:
> On 9/22/23 14:25, Max Kellermann wrote:
> > I'd like to be able to run metrics collector processes without special
> > privileges.
> >
> > In the kernel, there is a mix of debugfs files being world-readable
> > and not world-readable; with a naive "git grep", I found 723
> > world-readable debugfs_create_file() calls and 582 calls which were
> > only accessible to privileged processes.
> >
> > From the code, I cannot derive a consistent policy for that, but the
> > ceph statistics seem harmless (and useful) enough.
>
> I am not sure whether this will make sense. Because the 'debug' under
> '/sys/kernel/' is also only accessible by privileged processes.
>
> Ilya, Jeff
>
> Any idea?
>

Yeah, I don't think this makes much sense. At least on my machine:

# stat -c '%A' /sys/kernel/debug
drwx------

Without at least x permissions, an unprivileged user can't pathwalk
through there.

Max, how are you testing this?

> > Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> > ---
> >  fs/ceph/debugfs.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> > index 3904333fa6c3..2abee7e18144 100644
> > --- a/fs/ceph/debugfs.c
> > +++ b/fs/ceph/debugfs.c
> > @@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> >  						  name);
> >
> >  	fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
> > -						  0400,
> > +						  0444,
> >  						  fsc->client->debugfs_dir,
> >  						  fsc,
> >  						  &mdsmap_fops);
> >
> >  	fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
> > -							0400,
> > +							0444,
> >  							fsc->client->debugfs_dir,
> >  							fsc,
> >  							&mds_sessions_fops);
> >
> >  	fsc->debugfs_mdsc = debugfs_create_file("mdsc",
> > -						0400,
> > +						0444,
> >  						fsc->client->debugfs_dir,
> >  						fsc,
> >  						&mdsc_fops);
> >
> >  	fsc->debugfs_caps = debugfs_create_file("caps",
> > -						0400,
> > +						0444,
> >  						fsc->client->debugfs_dir,
> >  						fsc,
> >  						&caps_fops);
> >
> >  	fsc->debugfs_status = debugfs_create_file("status",
> > -						  0400,
> > +						  0444,
> >  						  fsc->client->debugfs_dir,
> >  						  fsc,
> >  						  &status_fops);
> > @@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> >  	fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
> >  						      fsc->client->debugfs_dir);
> >
> > -	debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
> > +	debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
> >  			    &metrics_file_fops);
> > -	debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
> > +	debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
> >  			    &metrics_latency_fops);
> > -	debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
> > +	debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
> >  			    &metrics_size_fops);
> > -	debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
> > +	debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
> >  			    &metrics_caps_fops);
> >  }
> >
On Fri, Sep 22, 2023 at 8:26 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> I'd like to be able to run metrics collector processes without special
> privileges.

Hi Max,

A word of caution about building metrics collectors based on debugfs
output: there are no stability guarantees. While the format won't be
changed just for the sake of change, of course, expect zero effort to
preserve backwards compatibility.

The latency metrics in particular are sent to the MDS in binary form
and are intended to be consumed through commands like "ceph fs top".
The debugfs files are there just for an occasional sneak peek (apart
from actual debugging).

Thanks,

                Ilya
On Mon, 2023-09-25 at 12:41 +0200, Ilya Dryomov wrote:
> On Fri, Sep 22, 2023 at 8:26 AM Max Kellermann <max.kellermann@ionos.com> wrote:
> >
> > I'd like to be able to run metrics collector processes without special
> > privileges
>
> Hi Max,
>
> A word of caution about building metrics collectors based on debugfs
> output: there are no stability guarantees. While the format won't be
> changed just for the sake of change of course, expect zero effort to
> preserve backwards compatibility.
>
> The latency metrics in particular are sent to the MDS in binary form
> and are intended to be consumed through commands like "ceph fs top".
> debugfs stuff is there just for an occasional sneak peek (apart from
> actual debugging).
>

FWIW, I wish we had gone with netlink for this functionality instead of
a seqfile. Lorenzo has been working with netlink for some similar
functionality with nfsd[1], and it's much nicer for this sort of thing.

[1]: https://lore.kernel.org/linux-nfs/ZQTM6l7NrsVHFoR5@lore-desk/T/#t
On Mon, Sep 25, 2023 at 7:18 AM Xiubo Li <xiubli@redhat.com> wrote:
> I am not sure whether this will make sense. Because the 'debug' under
> '/sys/kernel/' is also only accessible by privileged process.

Not exactly correct. It is by default accessible to processes that have
CAP_DAC_OVERRIDE, and additionally it is accessible to (unprivileged)
processes running as uid=0 (those two traits usually overlap).

But we don't want to run kernel-exporter as uid=0, and neither do we
want to give it CAP_DAC_OVERRIDE; both would be too much, since either
would affect much more than just (read) access to debugfs.

Instead, we mount debugfs with "gid=X,mode=0710". That way, we can give
(unprivileged) processes which are members of a certain group access to
debugfs, and we put our kernel-exporter process in that group.

We can use these mount options to change debugfs defaults, but if a
debugfs implementor (such as cephfs) decides to override these global
debugfs settings by passing stricter file permissions, we can't easily
override that. And that is what my patch is about: restore the ability
to override debugfs permissions with a mount option, as debugfs was
designed.

Max
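[Editor's note: the "gid=X,mode=0710" setup described above can be expressed as an fstab entry. The group id below is a hypothetical placeholder for illustration; debugfs accepts the uid, gid, and mode mount options mentioned in the message.]

```
# /etc/fstab sketch -- gid 2000 stands in for a hypothetical "metrics" group.
# mode=0710 gives that group execute-only (--x) on /sys/kernel/debug: its
# members can traverse into the tree without listing it, and the per-file
# modes (0400 vs 0444) then decide what they can actually read.
debugfs  /sys/kernel/debug  debugfs  gid=2000,mode=0710  0  0
```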
On Mon, Sep 25, 2023 at 12:42 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> A word of caution about building metrics collectors based on debugfs
> output: there are no stability guarantees. While the format won't be
> changed just for the sake of change of course, expect zero effort to
> preserve backwards compatibility.

Agreed, but there's nothing else. We have been using my patch for quite
some time, and it has been very useful.

Maybe we can discuss promoting these statistics to sysfs/proc? (The raw
numbers, not the existing aggregates, which are useless for any
practical purpose.)

> The latency metrics in particular are sent to the MDS in binary form
> and are intended to be consumed through commands like "ceph fs top".
> debugfs stuff is there just for an occasional sneak peek (apart from
> actual debugging).

I don't know the whole Ceph ecosystem that well, but "ceph" is a
command that is supposed to run on a Ceph server, not on a machine that
mounts a cephfs, right? If that's right, then this command is useless
for me.

Max
On Tue, Sep 26, 2023 at 8:16 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:42 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > A word of caution about building metrics collectors based on debugfs
> > output: there are no stability guarantees. While the format won't be
> > changed just for the sake of change of course, expect zero effort to
> > preserve backwards compatibility.
>
> Agree, but there's nothing else. We have been using my patch for quite
> some time, and it has been very useful.
>
> Maybe we can discuss promoting these statistics to sysfs/proc? (the
> raw numbers, not the existing aggregates which are useless for any
> practical purpose)
>
> > The latency metrics in particular are sent to the MDS in binary form
> > and are intended to be consumed through commands like "ceph fs top".
> > debugfs stuff is there just for an occasional sneak peek (apart from
> > actual debugging).
>
> I don't know the whole Ceph ecosystem so well, but "ceph" is a command
> that is supposed to run on a Ceph server, and not on a machine that
> mounts a cephfs, right? If that's right, then this command is useless
> for me.

No, the "ceph" command (as well as "rbd", "rados", etc.) can be run
from anywhere -- it's just a matter of installing a package, which is
likely already installed unless you are mounting CephFS manually
without using the /sbin/mount.ceph mount helper.

Thanks,

                Ilya
On Tue, Sep 26, 2023 at 10:46 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> No, "ceph" command (as well as "rbd", "rados", etc) can be run from
> anywhere -- it's just a matter of installing a package which is likely
> already installed unless you are mounting CephFS manually without using
> /sbin/mount.ceph mount helper.

I have never heard of that helper, so no, we're not using it - should we?

This "ceph" tool requires installing 90 MB of additional Debian
packages, which I just tried on a test cluster, and "ceph fs top" fails
with "Error initializing cluster client: ObjectNotFound('RADOS object
not found (error calling conf_read_file)')". Okay, so I have to
configure something... but I don't get why I would want to do that when
I can get the same information from the kernel without installing or
configuring anything. This sounds like overcomplicating the thing for
no reason.

Max
On Tue, Sep 26, 2023 at 11:09 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Sep 26, 2023 at 10:46 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > No, "ceph" command (as well as "rbd", "rados", etc) can be run from
> > anywhere -- it's just a matter of installing a package which is likely
> > already installed unless you are mounting CephFS manually without using
> > /sbin/mount.ceph mount helper.
>
> I have never heard of that helper, so no, we're not using it - should we?

If you have figured out the right mount options, you might as well not.
The helper does things like determine whether v1 or v2 addresses should
be used, fetch the key and pass it via the kernel keyring (whereas you
are probably passing it verbatim on the command line), etc. It's the
same syscall in the end, so the helper is certainly not required.

> This "ceph" tool requires installing 90 MB of additional Debian
> packages, which I just tried on a test cluster, and "ceph fs top"
> fails with "Error initializing cluster client: ObjectNotFound('RADOS
> object not found (error calling conf_read_file)')". Okay, so I have to
> configure something.... but .... I don't get why I would want to do
> that, when I can get the same information from the kernel without
> installing or configuring anything. This sounds like overcomplexifying
> the thing for no reason.

I have relayed my understanding of this feature (or rather how it was
presented to me). I see where you are coming from, so adding more
CephFS folks to chime in.

Thanks,

                Ilya
On Wed, Sep 27, 2023 at 12:53 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> > This "ceph" tool requires installing 90 MB of additional Debian
> > packages, which I just tried on a test cluster, and "ceph fs top"
> > fails with "Error initializing cluster client: ObjectNotFound('RADOS
> > object not found (error calling conf_read_file)')". Okay, so I have to
> > configure something.... but .... I don't get why I would want to do
> > that, when I can get the same information from the kernel without
> > installing or configuring anything. This sounds like overcomplexifying
> > the thing for no reason.
>
> I have relayed my understanding of this feature (or rather how it was
> presented to me). I see where you are coming from, so adding more
> CephFS folks to chime in.

Let me show these folks how badly "ceph fs perf stats" performs:

# time ceph fs perf stats
{"version": 2, "global_counters": ["cap_hit", "read_latency", "write_latency"[...]

real    0m0.502s
user    0m0.393s
sys     0m0.053s

Now my debugfs-based solution:

# time cat /sys/kernel/debug/ceph/*/metrics/latency
item          total       avg_lat(us)     min_lat(us)     max_lat(us)     stdev(us)
[...]

real    0m0.002s
user    0m0.002s
sys     0m0.001s

debugfs is more than 200 times faster. It is so fast that it can hardly
be measured by "time" - and most of these 2 ms is the overhead of
executing /bin/cat, not of actually reading the debugfs file.

Our kernel-exporter is a daemon process; it only needs a single pread()
system call in each iteration, so it has even less overhead.
Integrating the "ceph" tool instead would require forking the process
each time, starting a new Python VM, and so on. For obtaining real-time
latency statistics, the "ceph" script is the wrong tool for the job.

Max
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 3904333fa6c3..2abee7e18144 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 						  name);
 
 	fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
-						  0400,
+						  0444,
 						  fsc->client->debugfs_dir,
 						  fsc,
 						  &mdsmap_fops);
 
 	fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
-							0400,
+							0444,
 							fsc->client->debugfs_dir,
 							fsc,
 							&mds_sessions_fops);
 
 	fsc->debugfs_mdsc = debugfs_create_file("mdsc",
-						0400,
+						0444,
 						fsc->client->debugfs_dir,
 						fsc,
 						&mdsc_fops);
 
 	fsc->debugfs_caps = debugfs_create_file("caps",
-						0400,
+						0444,
 						fsc->client->debugfs_dir,
 						fsc,
 						&caps_fops);
 
 	fsc->debugfs_status = debugfs_create_file("status",
-						  0400,
+						  0444,
 						  fsc->client->debugfs_dir,
 						  fsc,
 						  &status_fops);
@@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 	fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
 						      fsc->client->debugfs_dir);
 
-	debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
+	debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
 			    &metrics_file_fops);
-	debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
+	debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
 			    &metrics_latency_fops);
-	debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
+	debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
 			    &metrics_size_fops);
-	debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
+	debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
 			    &metrics_caps_fops);
 }
I'd like to be able to run metrics collector processes without special
privileges.

In the kernel, there is a mix of debugfs files being world-readable and
not world-readable; with a naive "git grep", I found 723 world-readable
debugfs_create_file() calls and 582 calls which were only accessible to
privileged processes.

From the code, I cannot derive a consistent policy for that, but the
ceph statistics seem harmless (and useful) enough.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/ceph/debugfs.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)