mbox series

[0/4] pidfs: implement file handle support

Message ID 20241101135452.19359-1-erin.shepherd@e43.eu (mailing list archive)
Headers show
Series pidfs: implement file handle support | expand

Message

Erin Shepherd Nov. 1, 2024, 1:54 p.m. UTC
Since the introduction of pidfs, we have had 64-bit process identifiers 
that will not be reused for the entire uptime of the system. This greatly 
facilitates process tracking in userspace.

There are two limitations at present:

 * These identifiers are currently only exposed to processes on 64-bit 
   systems. On 32-bit systems, inode space is also limited to 32 bits and 
   therefore is subject to the same reuse issues.
 * There is no way to go from one of these unique identifiers to a pid or 
   pidfd.

Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
userspace  can retrieve a unique process identifier even on 32-bit systems 
via name_to_handle_at.

Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
userspace can convert back from a file handle to the corresponding pidfd. 
To support us going from a file handle to a pidfd, we have to store a pid 
inside the file handle. To ensure file handles are invariant and can move 
between pid namespaces, we stash a pid from the initial namespace inside 
the file handle.

I'm not quite sure if stashing an initial-namespace pid inside the file 
handle is the right approach here; if not, I think that patch 1 & 2 are 
useful on their own.

Erin Shepherd (4):
  pseudofs: add support for export_ops
  pidfs: implement file handle export support
  pid: introduce find_get_pid_ns
  pidfs: implement fh_to_dentry

 fs/libfs.c                |  1 +
 fs/pidfs.c                | 57 +++++++++++++++++++++++++++++++++++++++
 include/linux/pid.h       |  1 +
 include/linux/pseudo_fs.h |  1 +
 kernel/pid.c              | 10 +++++--
 5 files changed, 68 insertions(+), 2 deletions(-)

Comments

Christian Brauner Nov. 12, 2024, 1:10 p.m. UTC | #1
On Fri, Nov 01, 2024 at 01:54:48PM +0000, Erin Shepherd wrote:
> Since the introduction of pidfs, we have had 64-bit process identifiers 
> that will not be reused for the entire uptime of the system. This greatly 
> facilitates process tracking in userspace.
> 
> There are two limitations at present:
> 
>  * These identifiers are currently only exposed to processes on 64-bit 
>    systems. On 32-bit systems, inode space is also limited to 32 bits and 
>    therefore is subject to the same reuse issues.
>  * There is no way to go from one of these unique identifiers to a pid or 
>    pidfd.
> 
> Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
> userspace  can retrieve a unique process identifier even on 32-bit systems 
> via name_to_handle_at.
> 
> Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
> userspace can convert back from a file handle to the corresponding pidfd. 
> To support us going from a file handle to a pidfd, we have to store a pid 
> inside the file handle. To ensure file handles are invariant and can move 
> between pid namespaces, we stash a pid from the initial namespace inside 
> the file handle.
> 
> I'm not quite sure if stashing an initial-namespace pid inside the file 
> handle is the right approach here; if not, I think that patch 1 & 2 are 
> useful on their own.

Sorry for the delayed reply (I'm recovering from a lengthy illness.).

I like the idea in general. I think this is really useful. A few of my
thoughts but I need input from Amir and Jeff:

* In the last patch of the series you already implement decoding of
  pidfd file handles by adding a .fh_to_dentry export_operations method.

  There are a few things to consider because of how open_by_handle_at()
  works.

  - open_by_handle_at() needs to be restricted so it only creates pidfds
    from pidfs file handles that resolve to a struct pid that is
    reachable in the caller's pid namespace. In other words, it should
    mirror pidfd_open().

    Put another way, open_by_handle_at() must not be usable to open
    arbitrary pids to prevent a container from constructing a pidfd file
    handle for a process that lives outside it's pid namespace
    hierarchy.

    With this restriction in place open_by_handle_at() can be available
    to let unprivileged processes open pidfd file handles.

    Related to that, I don't think we need to make open_by_handle_at()
    open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
    because any process in the initial pid namespace can open any other
    process via pidfd_open() anyway because pid namespaces are
    hierarchical.

    IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
    provided pidfs file handle must be reachable from the caller's pid
    namespace.

  - open_by_handle_at() uses may_decode_fh() to determine whether it's
    possible to decode a file handle as an unprivileged user. The
    current checks don't make sense for pidfs. Conceptually, I think
    there don't need to place any restrictions based on global
    CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
    mount on pidfs file handles.

    The only restriction that matters is that the requested pidfs file
    handle is reachable from the caller's pid namespace.

  - A pidfd always has exactly a single inode and a single dentry.
    There's no aliases.

  - Generally, in my naive opinion, I think that decoding pidfs file
    handles should be a lot simpler than decoding regular path based
    file handles. Because there should be no need to verify any
    ancestors, or reconnect paths. Pidfs also doesn't have directory
    inodes, only regular inodes. In other words, any dentry is
    acceptable.

    Essentially, the only thing we need is for exportfs_decode_fh_raw()
    to verify that the provided pidfs file handle is resolvable in the
    caller's pid namespace. If so we're done. The challenge is how to
    nicely plumb this into the code without it sticking out like a sore
    thumb.

  - Pidfs should not be exportable via NFS. It doesn't make sense.
Jeff Layton Nov. 12, 2024, 1:57 p.m. UTC | #2
On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> On Fri, Nov 01, 2024 at 01:54:48PM +0000, Erin Shepherd wrote:
> > Since the introduction of pidfs, we have had 64-bit process identifiers 
> > that will not be reused for the entire uptime of the system. This greatly 
> > facilitates process tracking in userspace.
> > 
> > There are two limitations at present:
> > 
> >  * These identifiers are currently only exposed to processes on 64-bit 
> >    systems. On 32-bit systems, inode space is also limited to 32 bits and 
> >    therefore is subject to the same reuse issues.

We should really just move to storing 64-bit inode numbers internally
on 32-bit machines. That would at least make statx() give you all 64
bits on 32-bit host.

> >  * There is no way to go from one of these unique identifiers to a pid or 
> >    pidfd.
> > 
> > Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
> > userspace  can retrieve a unique process identifier even on 32-bit systems 
> > via name_to_handle_at.
> > 
> > Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
> > userspace can convert back from a file handle to the corresponding pidfd. 
> > To support us going from a file handle to a pidfd, we have to store a pid 
> > inside the file handle. To ensure file handles are invariant and can move 
> > between pid namespaces, we stash a pid from the initial namespace inside 
> > the file handle.
> > 
> > I'm not quite sure if stashing an initial-namespace pid inside the file 
> > handle is the right approach here; if not, I think that patch 1 & 2 are 
> > useful on their own.

Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
mount namespaces do? In that case, stashing the pid from init_ns is
probably the next best thing.

> 
> Sorry for the delayed reply (I'm recovering from a lengthy illness.).
> 
> I like the idea in general. I think this is really useful. A few of my
> thoughts but I need input from Amir and Jeff:
> 
> * In the last patch of the series you already implement decoding of
>   pidfd file handles by adding a .fh_to_dentry export_operations method.
> 
>   There are a few things to consider because of how open_by_handle_at()
>   works.
> 
>   - open_by_handle_at() needs to be restricted so it only creates pidfds
>     from pidfs file handles that resolve to a struct pid that is
>     reachable in the caller's pid namespace. In other words, it should
>     mirror pidfd_open().
> 
>     Put another way, open_by_handle_at() must not be usable to open
>     arbitrary pids to prevent a container from constructing a pidfd file
>     handle for a process that lives outside it's pid namespace
>     hierarchy.
> 
>     With this restriction in place open_by_handle_at() can be available
>     to let unprivileged processes open pidfd file handles.
> 
>     Related to that, I don't think we need to make open_by_handle_at()
>     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
>     because any process in the initial pid namespace can open any other
>     process via pidfd_open() anyway because pid namespaces are
>     hierarchical.
> 
>     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
>     provided pidfs file handle must be reachable from the caller's pid
>     namespace.
> 
>   - open_by_handle_at() uses may_decode_fh() to determine whether it's
>     possible to decode a file handle as an unprivileged user. The
>     current checks don't make sense for pidfs. Conceptually, I think
>     there don't need to place any restrictions based on global
>     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
>     mount on pidfs file handles.
> 
>     The only restriction that matters is that the requested pidfs file
>     handle is reachable from the caller's pid namespace.
>
>   - A pidfd always has exactly a single inode and a single dentry.
>     There's no aliases.
> 
>   - Generally, in my naive opinion, I think that decoding pidfs file
>     handles should be a lot simpler than decoding regular path based
>     file handles. Because there should be no need to verify any
>     ancestors, or reconnect paths. Pidfs also doesn't have directory
>     inodes, only regular inodes. In other words, any dentry is
>     acceptable.
> 
>     Essentially, the only thing we need is for exportfs_decode_fh_raw()
>     to verify that the provided pidfs file handle is resolvable in the
>     caller's pid namespace. If so we're done. The challenge is how to
>     nicely plumb this into the code without it sticking out like a sore
>     thumb.
> 
>   - Pidfs should not be exportable via NFS. It doesn't make sense.

I haven't looked over the patchset yet, but those restrictions all
sound pretty reasonable to me. Special casing the may_decode_fh
permission checks may be the tricky bit. I'm not sure what that should
look like, tbqh.
Erin Shepherd Nov. 12, 2024, 10:43 p.m. UTC | #3
On 12/11/2024 14:57, Jeff Layton wrote:
> On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> We should really just move to storing 64-bit inode numbers internally
> on 32-bit machines. That would at least make statx() give you all 64
> bits on 32-bit host.

I think that would be ideal from the perspective of exposing it to
userspace.
It does leave the question of going back from inode to pidfd unsolved
though.I like the name_to_handle_at/open_by_handle_at approach because
it neatly solves both sides of the problem with APIs we already have and
understand

> Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> mount namespaces do? In that case, stashing the pid from init_ns is
> probably the next best thing.

Not that I could identify, no; so stashing the PID seemed like the most
pragmatic
approach.

I'm not 100% sure it should be a documented property of the file handle
format; I
somewhat think that everything after the PID inode should be opaque to
userspace
and subject to change in the future (to the point I considered xoring it
with a
magic constant to make it less obvious to userspace/make it more obvious
that its
not to be relied upon; but that to my knowledge is not something that
the kernel
has done elsewhere).

- Erin
Erin Shepherd Nov. 12, 2024, 11:03 p.m. UTC | #4
On 12/11/2024 14:10, Christian Brauner wrote:
> Sorry for the delayed reply (I'm recovering from a lengthy illness.).
No worries on my part, and I hope you're feeling better!
> I like the idea in general. I think this is really useful. A few of my
> thoughts but I need input from Amir and Jeff:
>
> * In the last patch of the series you already implement decoding of
>   pidfd file handles by adding a .fh_to_dentry export_operations method.
>
>   There are a few things to consider because of how open_by_handle_at()
>   works.
>
>   - open_by_handle_at() needs to be restricted so it only creates pidfds
>     from pidfs file handles that resolve to a struct pid that is
>     reachable in the caller's pid namespace. In other words, it should
>     mirror pidfd_open().
>
>     Put another way, open_by_handle_at() must not be usable to open
>     arbitrary pids to prevent a container from constructing a pidfd file
>     handle for a process that lives outside it's pid namespace
>     hierarchy.
>
>     With this restriction in place open_by_handle_at() can be available
>     to let unprivileged processes open pidfd file handles.
>
>     Related to that, I don't think we need to make open_by_handle_at()
>     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
>     because any process in the initial pid namespace can open any other
>     process via pidfd_open() anyway because pid namespaces are
>     hierarchical.
>
>     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
>     provided pidfs file handle must be reachable from the caller's pid
>     namespace.

The pid_vnr(pid) == 0 check catches this case -- we return an error to the
caller if there isn't a pid mapping in the caller's namespace

Perhaps I should have called this out explicitly.

>   - open_by_handle_at() uses may_decode_fh() to determine whether it's
>     possible to decode a file handle as an unprivileged user. The
>     current checks don't make sense for pidfs. Conceptually, I think
>     there don't need to place any restrictions based on global
>     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
>     mount on pidfs file handles.
>
>     The only restriction that matters is that the requested pidfs file
>     handle is reachable from the caller's pid namespace.

I wonder if this could be handled through an addition to export_operations'
flags member?

>   - A pidfd always has exactly a single inode and a single dentry.
>     There's no aliases.
>
>   - Generally, in my naive opinion, I think that decoding pidfs file
>     handles should be a lot simpler than decoding regular path based
>     file handles. Because there should be no need to verify any
>     ancestors, or reconnect paths. Pidfs also doesn't have directory
>     inodes, only regular inodes. In other words, any dentry is
>     acceptable.
>
>     Essentially, the only thing we need is for exportfs_decode_fh_raw()
>     to verify that the provided pidfs file handle is resolvable in the
>     caller's pid namespace. If so we're done. The challenge is how to
>     nicely plumb this into the code without it sticking out like a sore
>     thumb.

Theoretically you should be able to use PIDFD_SELF as well (assuming that
makes its way into mainline this release :-)) but I am a bit concerned about
potentially polluting the open_by_handle_at logic with pidfd specificities.

>   - Pidfs should not be exportable via NFS. It doesn't make sense.

Hmm, I guess I might have made that possible, though I'm certainly not
familiar enough with the internals of nfsd to be able to test if I've done
so.

I guess probably this case calls for another export_ops flag? Not like we're
short on them

Thanks,
    - Erin
Darrick J. Wong Nov. 13, 2024, 12:37 a.m. UTC | #5
On Tue, Nov 12, 2024 at 11:43:13PM +0100, Erin Shepherd wrote:
> On 12/11/2024 14:57, Jeff Layton wrote:
> > On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> > We should really just move to storing 64-bit inode numbers internally
> > on 32-bit machines. That would at least make statx() give you all 64
> > bits on 32-bit host.
> 
> I think that would be ideal from the perspective of exposing it to
> userspace.
> It does leave the question of going back from inode to pidfd unsolved
> though.I like the name_to_handle_at/open_by_handle_at approach because
> it neatly solves both sides of the problem with APIs we already have and
> understand
> 
> > Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> > mount namespaces do? In that case, stashing the pid from init_ns is
> > probably the next best thing.
> 
> Not that I could identify, no; so stashing the PID seemed like the most
> pragmatic
> approach.
> 
> I'm not 100% sure it should be a documented property of the file
> handle format; I somewhat think that everything after the PID inode
> should be opaque to userspace and subject to change in the future (to
> the point I considered xoring it with a magic constant to make it less
> obvious to userspace/make it more obvious that its not to be relied
> upon; but that to my knowledge is not something that the kernel has
> done elsewhere).

It's a handle, the internal details of its layout of it is supposed to
be opaque to userspace.  I wonder how well userspace deals with weirdly
sized handles though...

--D

> - Erin
> 
>
Darrick J. Wong Nov. 13, 2024, 12:40 a.m. UTC | #6
On Wed, Nov 13, 2024 at 12:03:08AM +0100, Erin Shepherd wrote:
> 
> On 12/11/2024 14:10, Christian Brauner wrote:
> > Sorry for the delayed reply (I'm recovering from a lengthy illness.).
> No worries on my part, and I hope you're feeling better!
> > I like the idea in general. I think this is really useful. A few of my
> > thoughts but I need input from Amir and Jeff:
> >
> > * In the last patch of the series you already implement decoding of
> >   pidfd file handles by adding a .fh_to_dentry export_operations method.
> >
> >   There are a few things to consider because of how open_by_handle_at()
> >   works.
> >
> >   - open_by_handle_at() needs to be restricted so it only creates pidfds
> >     from pidfs file handles that resolve to a struct pid that is
> >     reachable in the caller's pid namespace. In other words, it should
> >     mirror pidfd_open().
> >
> >     Put another way, open_by_handle_at() must not be usable to open
> >     arbitrary pids to prevent a container from constructing a pidfd file
> >     handle for a process that lives outside it's pid namespace
> >     hierarchy.
> >
> >     With this restriction in place open_by_handle_at() can be available
> >     to let unprivileged processes open pidfd file handles.
> >
> >     Related to that, I don't think we need to make open_by_handle_at()
> >     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
> >     because any process in the initial pid namespace can open any other
> >     process via pidfd_open() anyway because pid namespaces are
> >     hierarchical.
> >
> >     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
> >     provided pidfs file handle must be reachable from the caller's pid
> >     namespace.
> 
> The pid_vnr(pid) == 0 check catches this case -- we return an error to the
> caller if there isn't a pid mapping in the caller's namespace
> 
> Perhaps I should have called this out explicitly.
> 
> >   - open_by_handle_at() uses may_decode_fh() to determine whether it's
> >     possible to decode a file handle as an unprivileged user. The
> >     current checks don't make sense for pidfs. Conceptually, I think
> >     there don't need to place any restrictions based on global
> >     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
> >     mount on pidfs file handles.
> >
> >     The only restriction that matters is that the requested pidfs file
> >     handle is reachable from the caller's pid namespace.
> 
> I wonder if this could be handled through an addition to export_operations'
> flags member?
> 
> >   - A pidfd always has exactly a single inode and a single dentry.
> >     There's no aliases.
> >
> >   - Generally, in my naive opinion, I think that decoding pidfs file
> >     handles should be a lot simpler than decoding regular path based
> >     file handles. Because there should be no need to verify any
> >     ancestors, or reconnect paths. Pidfs also doesn't have directory
> >     inodes, only regular inodes. In other words, any dentry is
> >     acceptable.
> >
> >     Essentially, the only thing we need is for exportfs_decode_fh_raw()
> >     to verify that the provided pidfs file handle is resolvable in the
> >     caller's pid namespace. If so we're done. The challenge is how to
> >     nicely plumb this into the code without it sticking out like a sore
> >     thumb.
> 
> Theoretically you should be able to use PIDFD_SELF as well (assuming that
> makes its way into mainline this release :-)) but I am a bit concerned about
> potentially polluting the open_by_handle_at logic with pidfd specificities.
> 
> >   - Pidfs should not be exportable via NFS. It doesn't make sense.
> 
> Hmm, I guess I might have made that possible, though I'm certainly not
> familiar enough with the internals of nfsd to be able to test if I've done
> so.

AFAIK check_export() in fs/nfsd/export.c spells this it out:

	/* There are two requirements on a filesystem to be exportable.
	 * 1:  We must be able to identify the filesystem from a number.
	 *       either a device number (so FS_REQUIRES_DEV needed)
	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
	 * 2:  We must be able to find an inode from a filehandle.
	 *       This means that s_export_op must be set.
	 * 3: We must not currently be on an idmapped mount.
	 */

Granted I've been wrong on account of stale docs before. :$

Though it would be kinda funny if you *could* mess with another
machine's processes over NFS.

--D

> I guess probably this case calls for another export_ops flag? Not like we're
> short on them
> 
> Thanks,
>     - Erin
> 
>
Erin Shepherd Nov. 13, 2024, 10:17 a.m. UTC | #7
On 13/11/2024 01:40, Darrick J. Wong wrote:
>> Hmm, I guess I might have made that possible, though I'm certainly not
>> familiar enough with the internals of nfsd to be able to test if I've done
>> so.
> AFAIK check_export() in fs/nfsd/export.c spells this it out:
>
> 	/* There are two requirements on a filesystem to be exportable.
> 	 * 1:  We must be able to identify the filesystem from a number.
> 	 *       either a device number (so FS_REQUIRES_DEV needed)
> 	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> 	 * 2:  We must be able to find an inode from a filehandle.
> 	 *       This means that s_export_op must be set.
> 	 * 3: We must not currently be on an idmapped mount.
> 	 */
>
> Granted I've been wrong on account of stale docs before. :$
>
> Though it would be kinda funny if you *could* mess with another
> machine's processes over NFS.
>
> --D

To be clear I'm not familiar enough with the workings of nfsd to tell if
pidfs fails those requirements and therefore wouldn't become exportable as
a result of this patch, though I gather from you're message that we're in the
clear?

Regardless I think my question is: do we think either those requirements could
change in the future, or the properties of pidfs could change in the future,
in ways that could accidentally make the filesystem exportable?

I guess though that the same concern would apply to cgroupfs and it hasn't posed
an issue so far.

- Erin
Christian Brauner Nov. 13, 2024, 11:35 a.m. UTC | #8
On Tue, Nov 12, 2024 at 11:43:13PM +0100, Erin Shepherd wrote:
> On 12/11/2024 14:57, Jeff Layton wrote:
> > On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> > We should really just move to storing 64-bit inode numbers internally
> > on 32-bit machines. That would at least make statx() give you all 64
> > bits on 32-bit host.
> 
> I think that would be ideal from the perspective of exposing it to
> userspace.
> It does leave the question of going back from inode to pidfd unsolved
> though.I like the name_to_handle_at/open_by_handle_at approach because

Indeed it doesn't solve it because it's possible that a given struct pid
never had a pidfd created for it and thus no inode actually does exist.
So when you're decoding a pidfs file handle you need to go to a struct
pid based on some property. The pid is fine for that and it is
equivalen to how pidfd_open() works.

> it neatly solves both sides of the problem with APIs we already have and
> understand
> 
> > Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> > mount namespaces do? In that case, stashing the pid from init_ns is
> > probably the next best thing.
> 
> Not that I could identify, no; so stashing the PID seemed like the most
> pragmatic
> approach.
> 
> I'm not 100% sure it should be a documented property of the file handle
> format; I
> somewhat think that everything after the PID inode should be opaque to
> userspace
> and subject to change in the future (to the point I considered xoring it
> with a
> magic constant to make it less obvious to userspace/make it more obvious
> that its
> not to be relied upon; but that to my knowledge is not something that
> the kernel
> has done elsewhere).
> 
> - Erin
>
Jeff Layton Nov. 13, 2024, 1:29 p.m. UTC | #9
On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> On 13/11/2024 01:40, Darrick J. Wong wrote:
> > > Hmm, I guess I might have made that possible, though I'm certainly not
> > > familiar enough with the internals of nfsd to be able to test if I've done
> > > so.
> > AFAIK check_export() in fs/nfsd/export.c spells this it out:
> > 
> > 	/* There are two requirements on a filesystem to be exportable.
> > 	 * 1:  We must be able to identify the filesystem from a number.
> > 	 *       either a device number (so FS_REQUIRES_DEV needed)
> > 	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> > 	 * 2:  We must be able to find an inode from a filehandle.
> > 	 *       This means that s_export_op must be set.
> > 	 * 3: We must not currently be on an idmapped mount.
> > 	 */
> > 
> > Granted I've been wrong on account of stale docs before. :$
> > 
> > Though it would be kinda funny if you *could* mess with another
> > machine's processes over NFS.
> > 
> > --D
> 
> To be clear I'm not familiar enough with the workings of nfsd to tell if
> pidfs fails those requirements and therefore wouldn't become exportable as
> a result of this patch, though I gather from you're message that we're in the
> clear?
> 
> Regardless I think my question is: do we think either those requirements could
> change in the future, or the properties of pidfs could change in the future,
> in ways that could accidentally make the filesystem exportable?
> 
> I guess though that the same concern would apply to cgroupfs and it hasn't posed
> an issue so far.

We have other filesystems that do this sort of thing (like cgroupfs),
and we don't allow them to be exportable. We'll need to make sure that
that's the case before we merge this, of course, as I forget the
details of how that works.
Chuck Lever III Nov. 13, 2024, 2:41 p.m. UTC | #10
> On Nov 13, 2024, at 8:29 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
>> On 13/11/2024 01:40, Darrick J. Wong wrote:
>>>> Hmm, I guess I might have made that possible, though I'm certainly not
>>>> familiar enough with the internals of nfsd to be able to test if I've done
>>>> so.
>>> AFAIK check_export() in fs/nfsd/export.c spells this it out:
>>> 
>>> /* There are two requirements on a filesystem to be exportable.
>>> * 1:  We must be able to identify the filesystem from a number.
>>> *       either a device number (so FS_REQUIRES_DEV needed)
>>> *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
>>> * 2:  We must be able to find an inode from a filehandle.
>>> *       This means that s_export_op must be set.
>>> * 3: We must not currently be on an idmapped mount.
>>> */
>>> 
>>> Granted I've been wrong on account of stale docs before. :$
>>> 
>>> Though it would be kinda funny if you *could* mess with another
>>> machine's processes over NFS.
>>> 
>>> --D
>> 
>> To be clear I'm not familiar enough with the workings of nfsd to tell if
>> pidfs fails those requirements and therefore wouldn't become exportable as
>> a result of this patch, though I gather from you're message that we're in the
>> clear?
>> 
>> Regardless I think my question is: do we think either those requirements could
>> change in the future, or the properties of pidfs could change in the future,
>> in ways that could accidentally make the filesystem exportable?
>> 
>> I guess though that the same concern would apply to cgroupfs and it hasn't posed
>> an issue so far.
> 
> We have other filesystems that do this sort of thing (like cgroupfs),
> and we don't allow them to be exportable. We'll need to make sure that
> that's the case before we merge this, of course, as I forget the
> details of how that works.

It's far easier to add exportability later than it is
to remove it if we think it was a mistake. I would err
on the side of caution if there isn't an immediate
need/use-case for exposure via NFS.

--
Chuck Lever
Amir Goldstein Nov. 14, 2024, 6:55 a.m. UTC | #11
On Wed, Nov 13, 2024 at 2:29 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> > On 13/11/2024 01:40, Darrick J. Wong wrote:
> > > > Hmm, I guess I might have made that possible, though I'm certainly not
> > > > familiar enough with the internals of nfsd to be able to test if I've done
> > > > so.
> > > AFAIK check_export() in fs/nfsd/export.c spells this it out:
> > >
> > >     /* There are two requirements on a filesystem to be exportable.
> > >      * 1:  We must be able to identify the filesystem from a number.
> > >      *       either a device number (so FS_REQUIRES_DEV needed)
> > >      *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> > >      * 2:  We must be able to find an inode from a filehandle.
> > >      *       This means that s_export_op must be set.
> > >      * 3: We must not currently be on an idmapped mount.
> > >      */
> > >
> > > Granted I've been wrong on account of stale docs before. :$
> > >
> > > Though it would be kinda funny if you *could* mess with another
> > > machine's processes over NFS.
> > >
> > > --D
> >
> > To be clear I'm not familiar enough with the workings of nfsd to tell if
> > pidfs fails those requirements and therefore wouldn't become exportable as
> > a result of this patch, though I gather from you're message that we're in the
> > clear?
> >
> > Regardless I think my question is: do we think either those requirements could
> > change in the future, or the properties of pidfs could change in the future,
> > in ways that could accidentally make the filesystem exportable?
> >
> > I guess though that the same concern would apply to cgroupfs and it hasn't posed
> > an issue so far.
>
> We have other filesystems that do this sort of thing (like cgroupfs),
> and we don't allow them to be exportable. We'll need to make sure that
> that's the case before we merge this, of course, as I forget the
> details of how that works.

TBH, I cannot find how export of cgroups with NFSEXP_FSID
is prevented.

We should probably block nfs export of SB_NOUSER and anyway,
this should be tied to the flag for relaxing CAP_DAC_READ_SEARCH,
because this is a strong indication that it's not a traditional nfs file handle.

Thanks,
Amir.
Christian Brauner Nov. 14, 2024, 10:39 a.m. UTC | #12
On Wed, Nov 13, 2024 at 02:41:29PM +0000, Chuck Lever III wrote:
> 
> 
> > On Nov 13, 2024, at 8:29 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> >> On 13/11/2024 01:40, Darrick J. Wong wrote:
> >>>> Hmm, I guess I might have made that possible, though I'm certainly not
> >>>> familiar enough with the internals of nfsd to be able to test if I've done
> >>>> so.
> >>> AFAIK check_export() in fs/nfsd/export.c spells this it out:
> >>> 
> >>> /* There are two requirements on a filesystem to be exportable.
> >>> * 1:  We must be able to identify the filesystem from a number.
> >>> *       either a device number (so FS_REQUIRES_DEV needed)
> >>> *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> >>> * 2:  We must be able to find an inode from a filehandle.
> >>> *       This means that s_export_op must be set.
> >>> * 3: We must not currently be on an idmapped mount.
> >>> */
> >>> 
> >>> Granted I've been wrong on account of stale docs before. :$
> >>> 
> >>> Though it would be kinda funny if you *could* mess with another
> >>> machine's processes over NFS.
> >>> 
> >>> --D
> >> 
> >> To be clear I'm not familiar enough with the workings of nfsd to tell if
> >> pidfs fails those requirements and therefore wouldn't become exportable as
> >> a result of this patch, though I gather from you're message that we're in the
> >> clear?
> >> 
> >> Regardless I think my question is: do we think either those requirements could
> >> change in the future, or the properties of pidfs could change in the future,
> >> in ways that could accidentally make the filesystem exportable?
> >> 
> >> I guess though that the same concern would apply to cgroupfs and it hasn't posed
> >> an issue so far.
> > 
> > We have other filesystems that do this sort of thing (like cgroupfs),
> > and we don't allow them to be exportable. We'll need to make sure that
> > that's the case before we merge this, of course, as I forget the
> > details of how that works.
> 
> It's far easier to add exportability later than it is
> to remove it if we think it was a mistake. I would err
> on the side of caution if there isn't an immediate
> need/use-case for exposure via NFS.

Tbh, the idea itself of exporting pidfs via nfs is a) crazy and b)
pretty interesting. If we could really export pidfs over NFS cleanly
somehow then we would have a filesystem-like representation of a remote
machine's processes. There could be a lot of interesting things in this.
But I would think that this requires some proper massaging of how pidfs
works. But in principle it might be possible.

Again, I'm not saying it's a great idea and we definitely shouldn't do
it now. But it's an interesting thought experiment at least.