diff mbox series

[2/2] openat2: add OA2_INHERIT_CRED flag

Message ID 20240424105248.189032-3-stsp2@yandex.ru (mailing list archive)
State New, archived
Series implement OA2_INHERIT_CRED flag for openat2()

Commit Message

stsp April 24, 2024, 10:52 a.m. UTC
This flag performs the open operation with the fs credentials
(fsuid, fsgid, group_info) that were in effect when dir_fd was opened.
This allows the process to pre-open some directories and then
change eUID (and all other UIDs/GIDs) to a less-privileged user,
retaining the ability to open/create files within these directories.

Design goal:
The idea is to provide very lightweight sandboxing, where the
process can restrict its access to a set of pre-opened directories
without resorting to heavyweight techniques such as chroot within
namespaces.
This patch is just a first step toward such sandboxing. If things go
well, the same extension can later be added to more syscalls.
These should include at least unlinkat(), renameat2() and the
not-yet-upstreamed setxattrat().

Security considerations:
- Only the bare minimal set of credentials is overridden:
  fsuid, fsgid and group_info. The rest, for example capabilities,
  are not overridden to avoid unneeded security risks.
- To avoid sandbox escape, this patch makes sure that a restricted
  lookup mode is used, namely RESOLVE_BENEATH or RESOLVE_IN_ROOT.
- To avoid leaking creds across exec, this patch requires the
  O_CLOEXEC flag on the directory fd.
- Magic /proc symlinks are discarded, as suggested by
  Andy Lutomirski <luto@kernel.org>

Use cases:
Virtual machines that deal with untrusted code can use this
instead of more heavyweight approaches.
Currently the approach is being tested on a dosemu2 VM.

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>

CC: Stefan Metzmacher <metze@samba.org>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andy Lutomirski <luto@kernel.org>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Jeff Layton <jlayton@kernel.org>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Alexander Aring <alex.aring@gmail.com>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Christian Göttsche <cgzones@googlemail.com>
---
 fs/internal.h                |  2 +-
 fs/namei.c                   | 56 ++++++++++++++++++++++++++++++++++--
 fs/open.c                    | 10 ++++++-
 include/linux/fcntl.h        |  2 ++
 include/uapi/linux/openat2.h |  3 ++
 5 files changed, 69 insertions(+), 4 deletions(-)

Comments

Al Viro April 25, 2024, 2:31 a.m. UTC | #1
On Wed, Apr 24, 2024 at 01:52:48PM +0300, Stas Sergeev wrote:
> @@ -3793,8 +3828,23 @@ static struct file *path_openat(struct nameidata *nd,
>  			error = do_o_path(nd, flags, file);
>  	} else {
>  		const char *s = path_init(nd, flags);
> -		file = alloc_empty_file(open_flags, current_cred());
> -		error = PTR_ERR_OR_ZERO(file);
> +		const struct cred *old_cred = NULL;
> +
> +		error = 0;
> +		if (open_flags & OA2_INHERIT_CRED) {
> +			/* Only work with O_CLOEXEC dirs. */
> +			if (!get_close_on_exec(nd->dfd))
> +				error = -EPERM;
> +
> +			if (!error)
> +				old_cred = openat2_override_creds(nd);
> +		}
> +		if (!error) {
> +			file = alloc_empty_file(open_flags, current_cred());

Consider the following, currently absolutely harmless situation:
	* process is owned by luser:students.
	* descriptor 69 refers to root-opened root directory (O_RDONLY)
What's the expected result of
	fcntl(69, F_SETFD, O_CLOEXEC);
	opening "etc/shadow" with dirfd equal to 69 and your flag given
	subsequent read() from the resulting descriptor?

At which point will the kernel say "go fuck yourself, I'm not letting you
read that file", provided that attacker passes that new flag of yours?

As a bonus question, how about opening it for _write_, seeing that this
is an obvious instant roothole?

Again, currently the setup that has a root-opened directory in descriptor
table of a non-root process is safe.

Incidentally, suppose you have the same process run with stdin opened
(r/o) by root.  F_SETFD it to O_CLOEXEC, then use your open with
dirfd being 0, pathname - "" and flags - O_RDWR.

AFAICS, without an explicit opt-in by the original opener it's
a non-starter, and TBH I doubt that even with such opt-in (FMODE_CRED,
whatever) it would be a good idea - it gives too much.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
stsp April 25, 2024, 7:24 a.m. UTC | #2
25.04.2024 05:31, Al Viro wrote:
> Consider the following, currently absolutely harmless situation:
> 	* process is owned by luser:students.
> 	* descriptor 69 refers to root-opened root directory (O_RDONLY)
> What's the expected result of
> 	fcntl(69, F_SETFD, O_CLOEXEC);
> 	opening "etc/shadow" with dirfd equal to 69 and your flag given
> 	subsequent read() from the resulting descriptor?
>
> At which point will the kernel say "go fuck yourself, I'm not letting you
> read that file", provided that attacker passes that new flag of yours?
>
> As a bonus question, how about opening it for _write_, seeing that this
> is an obvious instant roothole?
>
> Again, currently the setup that has a root-opened directory in descriptor
> table of a non-root process is safe.
>
> Incidentally, suppose you have the same process run with stdin opened
> (r/o) by root.  F_SETFD it to O_CLOEXEC, then use your open with
> dirfd being 0, pathname - "" and flags - O_RDWR.

Ok, F_SETFD, how simple. :(

> AFAICS, without an explicit opt-in by the original opener it's
> a non-starter, and TBH I doubt that even with such opt-in (FMODE_CRED,
> whatever) it would be a good idea - it gives too much.
Yes, which is why I am quite sceptical
of this FMODE_CRED idea.

Please note that my O_CLOEXEC check
was actually meant to verify that this very
process had opened the dir. It just doesn't
work that way, as you pointed out.
Can I replace the O_CLOEXEC check with
some explicit check that makes sure the
fd was opened by exactly this process?
stsp April 25, 2024, 9:23 a.m. UTC | #3
25.04.2024 05:31, Al Viro wrote:
> Incidentally, suppose you have the same process run with stdin opened
> (r/o) by root.  F_SETFD it to O_CLOEXEC, then use your open with
> dirfd being 0, pathname - "" and flags - O_RDWR.
I actually checked this with the test-case.
It seems to return ENOENT:


Breakpoint 1, openat2 (dirfd=0, pathname=0x7fffffffdbee "",
     how=0x7fffffffd5e0, size=24) at tst.c:13
13        return syscall(SYS_openat2, dirfd, pathname, how, size);
(gdb) fin
Run till exit from #0  openat2 (dirfd=0, pathname=0x7fffffffdbee "",
     how=0x7fffffffd5e0, size=24) at tst.c:13
0x000000000040167b in main (argc=3, argv=0x7fffffffd7b8) at tst.c:140
140        fd = openat2(0, efile, &how1, sizeof(how1));
Value returned is $1 = -1
(gdb) list
135        err = fcntl(0, F_SETFD, O_CLOEXEC);
136        if (err) {
137            perror("fcntl(F_SETFD)");
138            return EXIT_FAILURE;
139        }
140        fd = openat2(0, efile, &how1, sizeof(how1));
141        if (fd == -1) {
142            perror("openat2(1)");
143    //        return EXIT_FAILURE;
144        } else {
(gdb) p errno
$2 = 2


So it seems the creds can't be stolen
from a non-dir fd, but I wonder why
ENOENT is returned instead of ENOTDIR.
Such an ENOENT is not documented in the
man page of openat2(), so I guess there
is some problem here even w/o my patch. :)
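
A note on the ENOENT above: it appears to come from the empty pathname
itself rather than from the type of dirfd. On Linux an empty pathname is
rejected with ENOENT unless the call supports and is given AT_EMPTY_PATH,
which openat2() does not. That check seems to happen while the pathname is
copied in, before dirfd is looked at, so this test probably never reached
the credential path. The sketch below checks this by comparing a valid and
an invalid dirfd; empty_path_errno() is a made-up helper and the
SYS_openat2 fallback number is an assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437	/* assumption: generic syscall number */
#endif

/* Returns errno from openat2(dirfd, "", {O_RDWR}), or 0 if it succeeded. */
static int empty_path_errno(int dirfd)
{
	struct open_how how = { .flags = O_RDWR };
	long fd = syscall(SYS_openat2, dirfd, "", &how, sizeof(how));

	if (fd >= 0) {
		close((int)fd);
		return 0;
	}
	return errno;
}
```

If empty_path_errno(0) and empty_path_errno(-1) report the same error, the
result says nothing about what kind of file dirfd refers to.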
kernel test robot April 25, 2024, 1:50 p.m. UTC | #4
Hello,

kernel test robot noticed "BUG:KASAN:wild-memory-access_in_terminate_walk" on:

commit: 97bb54b42b1d6150e9ae11a7bf7833ed9f8c471d ("[PATCH 2/2] openat2: add OA2_INHERIT_CRED flag")
url: https://github.com/intel-lab-lkp/linux/commits/Stas-Sergeev/fs-reorganize-path_openat/20240424-185527
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 9d1ddab261f3e2af7c384dc02238784ce0cf9f98
patch link: https://lore.kernel.org/all/20240424105248.189032-3-stsp2@yandex.ru/
patch subject: [PATCH 2/2] openat2: add OA2_INHERIT_CRED flag

in testcase: boot

compiler: clang-17
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------------------------------------------------+------------+------------+
|                                                                                       | 831d3c6cc6 | 97bb54b42b |
+---------------------------------------------------------------------------------------+------------+------------+
| BUG:KASAN:wild-memory-access_in_terminate_walk                                        | 0          | 12         |
| canonical_address#:#[##]                                                              | 0          | 12         |
| RIP:terminate_walk                                                                    | 0          | 12         |
| Kernel_panic-not_syncing:Fatal_exception                                              | 0          | 12         |
+---------------------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202404252107.3c18eed2-lkp@intel.com


[ 2.555857][ T16] BUG: KASAN: wild-memory-access in terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[    2.556181][   T16] Write of size 4 at addr aaaaaaaaaaaaaaaa by task kdevtmpfs/16
[    2.556181][   T16]
[    2.556181][   T16] CPU: 0 PID: 16 Comm: kdevtmpfs Tainted: G                T  6.9.0-rc5-00038-g97bb54b42b1d #1 c90cc2d91176f38ca16e85ead0a72934082854cd
[    2.556181][   T16] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[    2.556181][   T16] Call Trace:
[    2.556181][   T16]  <TASK>
[ 2.556181][ T16] dump_stack_lvl (lib/dump_stack.c:116) 
[ 2.556181][ T16] print_report (mm/kasan/report.c:?) 
[ 2.556181][ T16] ? kasan_report (mm/kasan/report.c:214 mm/kasan/report.c:590) 
[ 2.556181][ T16] ? terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.556181][ T16] kasan_report (mm/kasan/report.c:603) 
[ 2.556181][ T16] ? terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.556181][ T16] kasan_check_range (mm/kasan/generic.c:?) 
[ 2.556181][ T16] terminate_walk (include/linux/instrumented.h:? include/linux/atomic/atomic-instrumented.h:400 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.556181][ T16] path_lookupat (fs/namei.c:2515) 
[ 2.556181][ T16] filename_lookup (fs/namei.c:2526) 
[ 2.556181][ T16] kern_path (fs/namei.c:2634) 
[ 2.556181][ T16] init_mount (fs/init.c:22) 
[ 2.556181][ T16] devtmpfs_setup (drivers/base/devtmpfs.c:419) 
[ 2.556181][ T16] devtmpfsd (drivers/base/devtmpfs.c:436) 
[ 2.556181][ T16] kthread (kernel/kthread.c:390) 
[ 2.556181][ T16] ? vclkdev_alloc (drivers/base/devtmpfs.c:435) 
[ 2.556181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 2.556181][ T16] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 2.556181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 2.556181][ T16] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[    2.556181][   T16]  </TASK>
[    2.556181][   T16] ==================================================================
[    2.556184][   T16] Disabling lock debugging due to kernel taint
[    2.556901][   T16] general protection fault, probably for non-canonical address 0xaaaaaaaaaaaaaaaa: 0000 [#1] KASAN PTI
[    2.558131][   T16] CPU: 0 PID: 16 Comm: kdevtmpfs Tainted: G    B           T  6.9.0-rc5-00038-g97bb54b42b1d #1 c90cc2d91176f38ca16e85ead0a72934082854cd
[    2.559653][   T16] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2.560181][ T16] RIP: 0010:terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.560181][ T16] Code: 03 43 80 3c 2e 00 74 08 4c 89 ff e8 01 61 f4 ff 49 8b 1f 48 85 db 74 41 48 89 df be 04 00 00 00 e8 dc 61 f4 ff b8 ff ff ff ff <0f> c1 03 83 f8 01 75 25 43 80 3c 2e 00 74 08 4c 89 ff e8 d0 60 f4
All code
========
   0:	03 43 80             	add    -0x80(%rbx),%eax
   3:	3c 2e                	cmp    $0x2e,%al
   5:	00 74 08 4c          	add    %dh,0x4c(%rax,%rcx,1)
   9:	89 ff                	mov    %edi,%edi
   b:	e8 01 61 f4 ff       	call   0xfffffffffff46111
  10:	49 8b 1f             	mov    (%r15),%rbx
  13:	48 85 db             	test   %rbx,%rbx
  16:	74 41                	je     0x59
  18:	48 89 df             	mov    %rbx,%rdi
  1b:	be 04 00 00 00       	mov    $0x4,%esi
  20:	e8 dc 61 f4 ff       	call   0xfffffffffff46201
  25:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
  2a:*	0f c1 03             	xadd   %eax,(%rbx)		<-- trapping instruction
  2d:	83 f8 01             	cmp    $0x1,%eax
  30:	75 25                	jne    0x57
  32:	43 80 3c 2e 00       	cmpb   $0x0,(%r14,%r13,1)
  37:	74 08                	je     0x41
  39:	4c 89 ff             	mov    %r15,%rdi
  3c:	e8                   	.byte 0xe8
  3d:	d0 60 f4             	shlb   -0xc(%rax)

Code starting with the faulting instruction
===========================================
   0:	0f c1 03             	xadd   %eax,(%rbx)
   3:	83 f8 01             	cmp    $0x1,%eax
   6:	75 25                	jne    0x2d
   8:	43 80 3c 2e 00       	cmpb   $0x0,(%r14,%r13,1)
   d:	74 08                	je     0x17
   f:	4c 89 ff             	mov    %r15,%rdi
  12:	e8                   	.byte 0xe8
  13:	d0 60 f4             	shlb   -0xc(%rax)
[    2.560181][   T16] RSP: 0000:ffffc9000010fc40 EFLAGS: 00010246
[    2.560181][   T16] RAX: 00000000ffffffff RBX: aaaaaaaaaaaaaaaa RCX: ffffffff811e4a0f
[    2.560181][   T16] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff8792adc0
[    2.560181][   T16] RBP: 0000000000000011 R08: ffffffff8792adc7 R09: 1ffffffff0f255b8
[    2.560181][   T16] R10: dffffc0000000000 R11: fffffbfff0f255b9 R12: 1ffff92000021fc4
[    2.560181][   T16] R13: dffffc0000000000 R14: 1ffff92000021fc1 R15: ffffc9000010fe08
[    2.560181][   T16] FS:  0000000000000000(0000) GS:ffffffff878dc000(0000) knlGS:0000000000000000
[    2.560181][   T16] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.560181][   T16] CR2: ffff88843ffff000 CR3: 000000000789c000 CR4: 00000000000406f0
[    2.560181][   T16] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.560181][   T16] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.560181][   T16] Call Trace:
[    2.560181][   T16]  <TASK>
[ 2.560181][ T16] ? __die_body (arch/x86/kernel/dumpstack.c:421) 
[ 2.560181][ T16] ? die_addr (arch/x86/kernel/dumpstack.c:?) 
[ 2.560181][ T16] ? exc_general_protection (arch/x86/kernel/traps.c:?) 
[ 2.560181][ T16] ? end_report (arch/x86/include/asm/current.h:49 mm/kasan/report.c:240) 
[ 2.560181][ T16] ? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:617) 
[ 2.560181][ T16] ? add_taint (arch/x86/include/asm/bitops.h:60 include/asm-generic/bitops/instrumented-atomic.h:29 kernel/panic.c:555) 
[ 2.560181][ T16] ? terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.560181][ T16] path_lookupat (fs/namei.c:2515) 
[ 2.560181][ T16] filename_lookup (fs/namei.c:2526) 
[ 2.560181][ T16] kern_path (fs/namei.c:2634) 
[ 2.560181][ T16] init_mount (fs/init.c:22) 
[ 2.560181][ T16] devtmpfs_setup (drivers/base/devtmpfs.c:419) 
[ 2.560181][ T16] devtmpfsd (drivers/base/devtmpfs.c:436) 
[ 2.560181][ T16] kthread (kernel/kthread.c:390) 
[ 2.560181][ T16] ? vclkdev_alloc (drivers/base/devtmpfs.c:435) 
[ 2.560181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 2.560181][ T16] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 2.560181][ T16] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 2.560181][ T16] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[    2.560181][   T16]  </TASK>
[    2.560181][   T16] Modules linked in:
[    2.560183][   T16] ---[ end trace 0000000000000000 ]---
[ 2.560820][ T16] RIP: 0010:terminate_walk (arch/x86/include/asm/atomic.h:103 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:264 include/linux/refcount.h:307 include/linux/refcount.h:325 fs/namei.c:702) 
[ 2.561462][ T16] Code: 03 43 80 3c 2e 00 74 08 4c 89 ff e8 01 61 f4 ff 49 8b 1f 48 85 db 74 41 48 89 df be 04 00 00 00 e8 dc 61 f4 ff b8 ff ff ff ff <0f> c1 03 83 f8 01 75 25 43 80 3c 2e 00 74 08 4c 89 ff e8 d0 60 f4
All code
========
   0:	03 43 80             	add    -0x80(%rbx),%eax
   3:	3c 2e                	cmp    $0x2e,%al
   5:	00 74 08 4c          	add    %dh,0x4c(%rax,%rcx,1)
   9:	89 ff                	mov    %edi,%edi
   b:	e8 01 61 f4 ff       	call   0xfffffffffff46111
  10:	49 8b 1f             	mov    (%r15),%rbx
  13:	48 85 db             	test   %rbx,%rbx
  16:	74 41                	je     0x59
  18:	48 89 df             	mov    %rbx,%rdi
  1b:	be 04 00 00 00       	mov    $0x4,%esi
  20:	e8 dc 61 f4 ff       	call   0xfffffffffff46201
  25:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
  2a:*	0f c1 03             	xadd   %eax,(%rbx)		<-- trapping instruction
  2d:	83 f8 01             	cmp    $0x1,%eax
  30:	75 25                	jne    0x57
  32:	43 80 3c 2e 00       	cmpb   $0x0,(%r14,%r13,1)
  37:	74 08                	je     0x41
  39:	4c 89 ff             	mov    %r15,%rdi
  3c:	e8                   	.byte 0xe8
  3d:	d0 60 f4             	shlb   -0xc(%rax)

Code starting with the faulting instruction
===========================================
   0:	0f c1 03             	xadd   %eax,(%rbx)
   3:	83 f8 01             	cmp    $0x1,%eax
   6:	75 25                	jne    0x2d
   8:	43 80 3c 2e 00       	cmpb   $0x0,(%r14,%r13,1)
   d:	74 08                	je     0x17
   f:	4c 89 ff             	mov    %r15,%rdi
  12:	e8                   	.byte 0xe8
  13:	d0 60 f4             	shlb   -0xc(%rax)


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240425/202404252107.3c18eed2-lkp@intel.com
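
One possible reading of the report, offered as an assumption rather than a
confirmed diagnosis: RBX holds 0xaaaaaaaaaaaaaaaa, which matches the byte
pattern clang uses for pattern-initialised stack variables, and the fault
is in the refcount decrement inside terminate_walk(). That would fit
nd->dir_open_groups being read without ever having been written, for
example when path_init() returns early on an absolute pathname before
reaching the branches the patch modifies. The reduced model below uses
hypothetical names throughout and only mimics that shape.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct nameidata. */
struct nd_sketch {
	const char *name;
	uintptr_t dir_open_groups;	/* models the pointer added by the patch */
};

/* Mimic pattern auto-init: fill the object with 0xAA, then set the name. */
static void poison(struct nd_sketch *nd, const char *name)
{
	memset(nd, 0xAA, sizeof(*nd));
	nd->name = name;
}

/* Shape of the patched path_init(): the absolute-path case returns early. */
static void path_init_patched(struct nd_sketch *nd, uintptr_t groups)
{
	if (nd->name[0] == '/')
		return;			/* dir_open_groups is left untouched */
	nd->dir_open_groups = groups;
}

/* One possible fix: give the field a defined value before any early return. */
static void path_init_fixed(struct nd_sketch *nd, uintptr_t groups)
{
	nd->dir_open_groups = 0;
	if (nd->name[0] == '/')
		return;
	nd->dir_open_groups = groups;
}

/* The test terminate_walk() performs before calling put_group_info(). */
static int would_put_groups(const struct nd_sketch *nd)
{
	return nd->dir_open_groups != 0;
}
```

In this model the patched variant leaves the poison value in place for "/"
and the cleanup would then treat it as a live pointer, while the fixed
variant leaves a zero that the cleanup skips.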
Christian Brauner April 25, 2024, 2:02 p.m. UTC | #5
>  struct open_flags {
> -	int open_flag;
> +	u64 open_flag;

Btw, this change taken together with

> +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED)

is also ripe to cause subtle bugs and security issues. This new
VALID_OPENAT2_FLAGS define bypasses

	BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
			 "struct open_flags doesn't yet handle flags > 32 bits");

in build_open_flags(). And right now lookup_open(), open_last_lookups(),
and do_open() just do:

	int open_flag = op->open_flag;

Because op->open_flag was 32-bit that was fine. But now ->open_flag is
64-bit, which means we truncate the upper 32 bits, including
OA2_INHERIT_CRED or any other new flag up there, in those functions.

So as soon as there's an additional check in e.g., do_open() for
OA2_INHERIT_CRED or in any of the other helpers that's security relevant
we're fscked because that flag is never seen and no bot will help us
here. And it's super easy to miss during review...
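
The truncation described above can be reproduced in plain C. This is a
sketch with a made-up struct and helper names, not kernel code; only the
OA2_INHERIT_CRED value is taken from the patch. The narrowing conversion to
int is implementation-defined in ISO C, and the sketch assumes the usual
behaviour of keeping the low 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the patch; not part of any released kernel UAPI. */
#define OA2_INHERIT_CRED (1ULL << 32)

/* The flag sits above bit 31, so it cannot survive a 32-bit copy. */
static_assert((OA2_INHERIT_CRED >> 32) != 0,
	      "flag is expected to live in the upper 32 bits");

/* Hypothetical stand-in for struct open_flags after the patch. */
struct open_flags_sketch {
	uint64_t open_flag;
};

/* Check done directly on the 64-bit field: the flag is visible. */
static bool sees_flag_u64(const struct open_flags_sketch *op)
{
	return (op->open_flag & OA2_INHERIT_CRED) != 0;
}

/* Check done after the "int open_flag = op->open_flag;" style of copy. */
static bool sees_flag_after_int_copy(const struct open_flags_sketch *op)
{
	int open_flag = (int)op->open_flag;	/* upper 32 bits dropped here */

	/* Widen via unsigned so sign extension cannot fake the bit. */
	return ((uint64_t)(unsigned int)open_flag & OA2_INHERIT_CRED) != 0;
}
```

With open_flag set to OA2_INHERIT_CRED | 2, the first helper reports the
flag and the second does not, and the compiler stays silent about the
narrowing under default warning levels.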
stsp April 26, 2024, 1:36 p.m. UTC | #6
25.04.2024 17:02, Christian Brauner wrote:
>>   struct open_flags {
>> -	int open_flag;
>> +	u64 open_flag;
> Btw, this change taken together with
All fixed in v5.
I dropped u64 use.
Other comments are addressed as well.
Please let me know if I missed some.

Thank you.
Patch

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..692b53b19aad 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -169,7 +169,7 @@  static inline void sb_end_ro_state_change(struct super_block *sb)
  * open.c
  */
 struct open_flags {
-	int open_flag;
+	u64 open_flag;
 	umode_t mode;
 	int acc_mode;
 	int intent;
diff --git a/fs/namei.c b/fs/namei.c
index 413eef134234..aeb9f504538e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -586,6 +586,9 @@  struct nameidata {
 	int		dfd;
 	vfsuid_t	dir_vfsuid;
 	umode_t		dir_mode;
+	kuid_t		dir_open_fsuid;
+	kgid_t		dir_open_fsgid;
+	struct group_info *dir_open_groups;
 } __randomize_layout;
 
 #define ND_ROOT_PRESET 1
@@ -695,6 +698,8 @@  static void terminate_walk(struct nameidata *nd)
 	nd->depth = 0;
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
+	if (nd->dir_open_groups)
+		put_group_info(nd->dir_open_groups);
 }
 
 /* path_put is needed afterwards regardless of success or failure */
@@ -2414,6 +2419,9 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 			get_fs_pwd(current->fs, &nd->path);
 			nd->inode = nd->path.dentry->d_inode;
 		}
+		nd->dir_open_fsuid = current_cred()->fsuid;
+		nd->dir_open_fsgid = current_cred()->fsgid;
+		nd->dir_open_groups = get_current_groups();
 	} else {
 		/* Caller must check execute permissions on the starting path component */
 		struct fd f = fdget_raw(nd->dfd);
@@ -2437,6 +2445,10 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 			path_get(&nd->path);
 			nd->inode = nd->path.dentry->d_inode;
 		}
+		nd->dir_open_fsuid = f.file->f_cred->fsuid;
+		nd->dir_open_fsgid = f.file->f_cred->fsgid;
+		nd->dir_open_groups = get_group_info(
+				f.file->f_cred->group_info);
 		fdput(f);
 	}
 
@@ -3776,6 +3788,29 @@  static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	return error;
 }
 
+static const struct cred *openat2_override_creds(struct nameidata *nd)
+{
+	const struct cred *old_cred;
+	struct cred *override_cred;
+
+	override_cred = prepare_creds();
+	if (!override_cred)
+		return NULL;
+
+	override_cred->fsuid = nd->dir_open_fsuid;
+	override_cred->fsgid = nd->dir_open_fsgid;
+	override_cred->group_info = nd->dir_open_groups;
+
+	override_cred->non_rcu = 1;
+
+	old_cred = override_creds(override_cred);
+
+	/* override_cred() gets its own ref */
+	put_cred(override_cred);
+
+	return old_cred;
+}
+
 static struct file *path_openat(struct nameidata *nd,
 			const struct open_flags *op, unsigned flags)
 {
@@ -3793,8 +3828,23 @@  static struct file *path_openat(struct nameidata *nd,
 			error = do_o_path(nd, flags, file);
 	} else {
 		const char *s = path_init(nd, flags);
-		file = alloc_empty_file(open_flags, current_cred());
-		error = PTR_ERR_OR_ZERO(file);
+		const struct cred *old_cred = NULL;
+
+		error = 0;
+		if (open_flags & OA2_INHERIT_CRED) {
+			/* Only work with O_CLOEXEC dirs. */
+			if (!get_close_on_exec(nd->dfd))
+				error = -EPERM;
+
+			if (!error)
+				old_cred = openat2_override_creds(nd);
+		}
+		if (!error) {
+			file = alloc_empty_file(open_flags, current_cred());
+			error = PTR_ERR_OR_ZERO(file);
+		} else {
+			file = ERR_PTR(error);
+		}
 		if (!error) {
 			while (!(error = link_path_walk(s, nd)) &&
 			       (s = open_last_lookups(nd, file, op)) != NULL)
@@ -3802,6 +3852,8 @@  static struct file *path_openat(struct nameidata *nd,
 		}
 		if (!error)
 			error = do_open(nd, file, op);
+		if (old_cred)
+			revert_creds(old_cred);
 		terminate_walk(nd);
 		if (IS_ERR(file))
 			return file;
diff --git a/fs/open.c b/fs/open.c
index ee8460c83c77..c871ff8fc6e3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1225,7 +1225,7 @@  inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 	 * values before calling build_open_flags(), but openat2(2) checks all
 	 * of its arguments.
 	 */
-	if (flags & ~VALID_OPEN_FLAGS)
+	if (flags & ~VALID_OPENAT2_FLAGS)
 		return -EINVAL;
 	if (how->resolve & ~VALID_RESOLVE_FLAGS)
 		return -EINVAL;
@@ -1326,6 +1326,14 @@  inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		lookup_flags |= LOOKUP_CACHED;
 	}
 
+	if (flags & OA2_INHERIT_CRED) {
+		/* Inherit creds only with scoped look-up modes. */
+		if (!(lookup_flags & LOOKUP_IS_SCOPED))
+			return -EPERM;
+		/* Reject /proc "magic" links if inheriting creds. */
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	}
+
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..b71f8b162102 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -12,6 +12,8 @@ 
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
 
+#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED)
+
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index a5feb7604948..cdd676a10b62 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -40,4 +40,7 @@  struct open_how {
 					return -EAGAIN if that's not
 					possible. */
 
+/* openat2-specific flags go to upper 4 bytes. */
+#define OA2_INHERIT_CRED		(1ULL << 32)
+
 #endif /* _UAPI_LINUX_OPENAT2_H */