[RFC,v2,0/3] pidfs: file handle preliminaries

Message ID	20241129-work-pidfs-v2-0-61043d66fbce@kernel.org (mailing list archive)
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 04E231E515; Fri, 29 Nov 2024 13:02:40 +0000 (UTC) From: Christian Brauner <brauner@kernel.org> Subject: [PATCH RFC v2 0/3] pidfs: file handle preliminaries Date: Fri, 29 Nov 2024 14:02:22 +0100 Message-Id: <20241129-work-pidfs-v2-0-61043d66fbce@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit To: Erin Shepherd <erin.shepherd@e43.eu>, Amir Goldstein <amir73il@gmail.com>, Jeff Layton <jlayton@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>, Chuck Lever <chuck.lever@oracle.com>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org, Christian Brauner <brauner@kernel.org>
Series	pidfs: file handle preliminaries \| expand [RFC,v2,0/3] pidfs: file handle preliminaries [RFC,v2,1/3] pidfs: rework inode number allocation [RFC,v2,2/3] pidfs: remove 32bit inode number handling [RFC,v2,3/3] pidfs: support FS_IOC_GETVERSION

Message ID

20241129-work-pidfs-v2-0-61043d66fbce@kernel.org (mailing list archive)

Headers

From: Christian Brauner <brauner@kernel.org>
Subject: [PATCH RFC v2 0/3] pidfs: file handle preliminaries
Date: Fri, 29 Nov 2024 14:02:22 +0100
Message-Id: <20241129-work-pidfs-v2-0-61043d66fbce@kernel.org>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: Erin Shepherd <erin.shepherd@e43.eu>,
 Amir Goldstein <amir73il@gmail.com>, Jeff Layton <jlayton@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
 Chuck Lever <chuck.lever@oracle.com>, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
 Christian Brauner <brauner@kernel.org>

Series

pidfs: file handle preliminaries | expand

Message

Christian Brauner Nov. 29, 2024, 1:02 p.m. UTC

Hey,

This reworks the inode number allocation for pidfs in order to support
file handles properly.

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the
design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

Thanks!
Christian

---
Changes in v2:
- Remove __maybe_unused pidfd_ino_get_pid() function that was only there
  for initial illustration purposes.
- Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org

---
Christian Brauner (3):
      pidfs: rework inode number allocation
      pidfs: remove 32bit inode number handling
      pidfs: support FS_IOC_GETVERSION

 fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
 include/linux/pidfs.h |   2 +
 kernel/pid.c          |  14 +++---
 3 files changed, 86 insertions(+), 48 deletions(-)
---
base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
change-id: 20241128-work-pidfs-2bd42c7ea772

Comments

Christian Brauner Nov. 29, 2024, 1:37 p.m. UTC | #1

Hey,

Now that we have the preliminaries to lookup struct pid based on its
inode number alone we can implement file handle support.

This is based on custom export operation methods which allows pidfs to
implement permission checking and opening of pidfs file handles cleanly
without hacking around in the core file handle code too much.

This is lightly tested.

Thanks!
Christian

---
Christian Brauner (5):
      fhandle: simplify error handling
      exportfs: add open method
      fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()
      exportfs: add permission method
      pidfs: implement file handle support

Erin Shepherd (1):
      pseudofs: add support for export_ops

 fs/fhandle.c              | 97 ++++++++++++++++++++++-------------------------
 fs/libfs.c                |  1 +
 fs/pidfs.c                | 96 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/exportfs.h  | 20 ++++++++++
 include/linux/pseudo_fs.h |  1 +
 5 files changed, 164 insertions(+), 51 deletions(-)
---
base-commit: 94c9a56ad3521a28177610c63298d66de634cb9d
change-id: 20241129-work-pidfs-file_handle-07bdfb860a38

Jeff Layton Nov. 29, 2024, 2:27 p.m. UTC | #2

On Fri, 2024-11-29 at 14:02 +0100, Christian Brauner wrote:
> Hey,
> 
> This reworks the inode number allocation for pidfs in order to support
> file handles properly.
> 
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
> 
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the
> design.
> 
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
> 
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
> 
> Thanks!
> Christian
> 
> ---
> Changes in v2:
> - Remove __maybe_unused pidfd_ino_get_pid() function that was only there
>   for initial illustration purposes.
> - Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org
> 
> ---
> Christian Brauner (3):
>       pidfs: rework inode number allocation
>       pidfs: remove 32bit inode number handling
>       pidfs: support FS_IOC_GETVERSION
> 
>  fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
>  include/linux/pidfs.h |   2 +
>  kernel/pid.c          |  14 +++---
>  3 files changed, 86 insertions(+), 48 deletions(-)
> ---
> base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
> change-id: 20241128-work-pidfs-2bd42c7ea772
> 
> 

This seems like a good stopgap fix until we can sort out how to get to
64-bit inode numbers internally everywhere.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Amir Goldstein Nov. 29, 2024, 2:34 p.m. UTC | #3

On Fri, Nov 29, 2024 at 3:27 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2024-11-29 at 14:02 +0100, Christian Brauner wrote:
> > Hey,
> >
> > This reworks the inode number allocation for pidfs in order to support
> > file handles properly.
> >
> > Recently we received a patchset that aims to enable file handle encoding
> > and decoding via name_to_handle_at(2) and open_by_handle_at(2).
> >
> > A crucical step in the patch series is how to go from inode number to
> > struct pid without leaking information into unprivileged contexts. The
> > issue is that in order to find a struct pid the pid number in the
> > initial pid namespace must be encoded into the file handle via
> > name_to_handle_at(2). This can be used by containers using a separate
> > pid namespace to learn what the pid number of a given process in the
> > initial pid namespace is. While this is a weak information leak it could
> > be used in various exploits and in general is an ugly wart in the
> > design.
> >
> > To solve this problem a new way is needed to lookup a struct pid based
> > on the inode number allocated for that struct pid. The other part is to
> > remove the custom inode number allocation on 32bit systems that is also
> > an ugly wart that should go away.
> >
> > So, a new scheme is used that I was discusssing with Tejun some time
> > back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> > are used for the generation number. This gives a 64 bit inode number
> > that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> > recycled slowly and can be used to lookup struct pids.
> >
> > Thanks!
> > Christian
> >
> > ---
> > Changes in v2:
> > - Remove __maybe_unused pidfd_ino_get_pid() function that was only there
> >   for initial illustration purposes.
> > - Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org
> >
> > ---
> > Christian Brauner (3):
> >       pidfs: rework inode number allocation
> >       pidfs: remove 32bit inode number handling
> >       pidfs: support FS_IOC_GETVERSION
> >
> >  fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
> >  include/linux/pidfs.h |   2 +
> >  kernel/pid.c          |  14 +++---
> >  3 files changed, 86 insertions(+), 48 deletions(-)
> > ---
> > base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
> > change-id: 20241128-work-pidfs-2bd42c7ea772
> >
> >
>
> This seems like a good stopgap fix until we can sort out how to get to
> 64-bit inode numbers internally everywhere.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

Yep. look good

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Amir Goldstein Nov. 30, 2024, 12:22 p.m. UTC | #4

On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Hey,
>
> Now that we have the preliminaries to lookup struct pid based on its
> inode number alone we can implement file handle support.
>
> This is based on custom export operation methods which allows pidfs to
> implement permission checking and opening of pidfs file handles cleanly
> without hacking around in the core file handle code too much.
>
> This is lightly tested.

With my comments addressed as you pushed to vfs-6.14.pidfs branch
in your tree, you may add to the patches posted:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

HOWEVER,
IMO there is still one thing that has to be addressed before merge -
We must make sure that nfsd cannot export pidfs.

In principal, SB_NOUSER filesystems should not be accessible to
userspace paths, so exportfs should not be able to configure nfsd
export of pidfs, but maybe this limitation can be worked around by
using magic link paths?

I think it may be worth explicitly disallowing nfsd export of SB_NOUSER
filesystems and we could also consider blocking SB_KERNMOUNT,
but may there are users exporting ramfs?

Jeff has mentioned that he thinks we are blocking export of cgroupfs
by nfsd, but I really don't see where that is being enforced.
The requirement for FS_REQUIRES_DEV in check_export() is weak
because user can overrule it with manual fsid argument to exportfs.

So maybe we disallow nfsd export of kernfs and backport to stable kernels
to be on the safe side?

On top of that, we may also want to reject nfsd export of any fs
with custom ->open() or ->permission() export ops, on the grounds
that nfsd does not call these ops?

Regarding the two other kernel users of exportfs, namely,
overlayfs and fanotify -

For overlayfs, I think that in ovl_can_decode_fh() we can safely
opt-out of SB_NOUSER and SB_KERNMOUNT filesystems,
to not allow nfs exporting of overlayfs over those lower fs.

For fanotify, there is already a check in fanotify_events_supported()
to disallow sb/mount marks on SB_NOUSER and a comment that
questions the value of allowing them for SB_KERNMOUNT.
So for pidfs there is no risk wrt fanotify and it does not look like pidfs
is going to generate any fanotify events anyway.

Thanks,
Amir.