mbox series

[GIT,PULL] Add trusted_for(2) (was O_MAYEXEC)

Message ID 20220321161557.495388-1-mic@digikod.net (mailing list archive)
State New, archived
Headers show
Series [GIT,PULL] Add trusted_for(2) (was O_MAYEXEC) | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux.git tags/trusted-for-v18

Message

Mickaël Salaün March 21, 2022, 4:15 p.m. UTC
Hi Linus,

This patch series adds a new syscall named trusted_for.  It enables user
space to ask the kernel: is this file descriptor's content trusted to be
used for this purpose?  The set of usage currently only contains
execution, but other may follow (e.g. configuration, sensitive data).
If the kernel identifies the file descriptor as trustworthy for this
usage, user space should then take this information into account.  The
"execution" usage means that the content of the file descriptor is
trusted according to the system policy to be executed by user space,
which means that it interprets the content or (try to) maps it as
executable memory.

A simple system-wide security policy can be set by the system
administrator through a sysctl configuration consistent with the mount
points or the file access rights.  The documentation explains the
prerequisites.

It is important to note that this can only enable to extend access
control managed by the kernel.  Hence it enables current access control
mechanism to be extended and become a superset of what they can
currently control.  Indeed, the security policy could also be delegated
to an LSM, either a MAC system or an integrity system.  For instance,
this is required to close a major IMA measurement/appraisal interpreter
integrity gap by bringing the ability to check the use of scripts.
Other uses are expected as well.

For further details, please see the latest cover letter:
https://lore.kernel.org/r/20220104155024.48023-1-mic@digikod.net

Commit dae71698b6c5 ("printk: Move back proc_dointvec_minmax_sysadmin()
to sysctl.c") was recently added due to the sysctl refactoring.

Commit e674341a90b9 ("selftests/interpreter: fix separate directory
build") will fix some test build cases as explained here:
https://lore.kernel.org/r/20220119101531.2850400-1-usama.anjum@collabora.com
Merging this commit without the new KHDR_INCLUDES is not an issue.
The upcoming kselftest pull request is ready:
https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/log/?h=next

This patch series has been open for review for more than three years and
got a lot of feedbacks (and bikeshedding) which were all considered.
Since I heard no objection, please consider to pull this code for
v5.18-rc1 .  These five patches have been successfully tested in the
latest linux-next releases for several weeks.

Regards,
 Mickaël

--
The following changes since commit dcb85f85fa6f142aae1fe86f399d4503d49f2b60:

  gcc-plugins/stackleak: Use noinstr in favor of notrace (2022-02-03 17:02:21 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux.git tags/trusted-for-v18

for you to fetch changes up to e674341a90b95c3458d684ae25e6891afc3e03ad:

  selftests/interpreter: fix separate directory build (2022-03-04 10:56:25 +0100)

----------------------------------------------------------------
Add the trusted_for system call (v18)

The final goal of this patch series is to enable the kernel to be a
global policy manager by entrusting processes with access control at
their level.  To reach this goal, two complementary parts are required:
* user space needs to be able to know if it can trust some file
  descriptor content for a specific usage;
* and the kernel needs to make available some part of the policy
  configured by the system administrator.

In a nutshell, this is a required building block to control script
execution.

For further details see the latest cover letter:
https://lore.kernel.org/r/20220104155024.48023-1-mic@digikod.net

----------------------------------------------------------------
Mickaël Salaün (4):
      printk: Move back proc_dointvec_minmax_sysadmin() to sysctl.c
      fs: Add trusted_for(2) syscall implementation and related sysctl
      arch: Wire up trusted_for(2)
      selftest/interpreter: Add tests for trusted_for(2) policies

Muhammad Usama Anjum (1):
      selftests/interpreter: fix separate directory build

 Documentation/admin-guide/sysctl/fs.rst            |  50 +++
 arch/alpha/kernel/syscalls/syscall.tbl             |   1 +
 arch/arm/tools/syscall.tbl                         |   1 +
 arch/arm64/include/asm/unistd.h                    |   2 +-
 arch/arm64/include/asm/unistd32.h                  |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl              |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl              |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl        |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl          |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl          |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl          |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl            |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl           |   1 +
 arch/s390/kernel/syscalls/syscall.tbl              |   1 +
 arch/sh/kernel/syscalls/syscall.tbl                |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl            |   1 +
 fs/open.c                                          | 133 ++++++++
 fs/proc/proc_sysctl.c                              |   2 +-
 include/linux/syscalls.h                           |   1 +
 include/linux/sysctl.h                             |   3 +
 include/uapi/asm-generic/unistd.h                  |   5 +-
 include/uapi/linux/trusted-for.h                   |  18 +
 kernel/printk/sysctl.c                             |   9 -
 kernel/sysctl.c                                    |   9 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/interpreter/.gitignore     |   2 +
 tools/testing/selftests/interpreter/Makefile       |  21 ++
 tools/testing/selftests/interpreter/config         |   1 +
 .../selftests/interpreter/trust_policy_test.c      | 362 +++++++++++++++++++++
 32 files changed, 625 insertions(+), 12 deletions(-)
 create mode 100644 include/uapi/linux/trusted-for.h
 create mode 100644 tools/testing/selftests/interpreter/.gitignore
 create mode 100644 tools/testing/selftests/interpreter/Makefile
 create mode 100644 tools/testing/selftests/interpreter/config
 create mode 100644 tools/testing/selftests/interpreter/trust_policy_test.c

Comments

Luis Chamberlain March 21, 2022, 5:38 p.m. UTC | #1
On Mon, Mar 21, 2022 at 05:15:57PM +0100, Mickaël Salaün wrote:
> Since I heard no objection, please consider to pull this code for
> v5.18-rc1 .  These five patches have been successfully tested in the
> latest linux-next releases for several weeks.

>  kernel/sysctl.c                                    |   9 +

Please don't add more sysctls there. We're slowly trying to move
all these to their respective places so this does not become a larger
kitchen sink mess.

  Luis
Mickaël Salaün March 21, 2022, 6:05 p.m. UTC | #2
On 21/03/2022 18:38, Luis Chamberlain wrote:
> On Mon, Mar 21, 2022 at 05:15:57PM +0100, Mickaël Salaün wrote:
>> Since I heard no objection, please consider to pull this code for
>> v5.18-rc1 .  These five patches have been successfully tested in the
>> latest linux-next releases for several weeks.
> 
>>   kernel/sysctl.c                                    |   9 +
> 
> Please don't add more sysctls there. We're slowly trying to move
> all these to their respective places so this does not become a larger
> kitchen sink mess.

It is not a new sysctl but proc_dointvec_minmax_sysadmin(). This helper 
is shared between printk and fs subsystems.
Luis Chamberlain March 21, 2022, 11:32 p.m. UTC | #3
On Mon, Mar 21, 2022 at 07:05:42PM +0100, Mickaël Salaün wrote:
> 
> On 21/03/2022 18:38, Luis Chamberlain wrote:
> > On Mon, Mar 21, 2022 at 05:15:57PM +0100, Mickaël Salaün wrote:
> > > Since I heard no objection, please consider to pull this code for
> > > v5.18-rc1 .  These five patches have been successfully tested in the
> > > latest linux-next releases for several weeks.
> > 
> > >   kernel/sysctl.c                                    |   9 +
> > 
> > Please don't add more sysctls there. We're slowly trying to move
> > all these to their respective places so this does not become a larger
> > kitchen sink mess.
> 
> It is not a new sysctl but proc_dointvec_minmax_sysadmin(). This helper is
> shared between printk and fs subsystems.

That should be good then, thanks!

  Luis
Mickaël Salaün March 30, 2022, 4:06 p.m. UTC | #4
Hi,

What is the status of this pull request? Do you need something more?

Regards,
  Mickaël


On 21/03/2022 17:15, Mickaël Salaün wrote:
> Hi Linus,
> 
> This patch series adds a new syscall named trusted_for.  It enables user
> space to ask the kernel: is this file descriptor's content trusted to be
> used for this purpose?  The set of usage currently only contains
> execution, but other may follow (e.g. configuration, sensitive data).
> If the kernel identifies the file descriptor as trustworthy for this
> usage, user space should then take this information into account.  The
> "execution" usage means that the content of the file descriptor is
> trusted according to the system policy to be executed by user space,
> which means that it interprets the content or (try to) maps it as
> executable memory.
> 
> A simple system-wide security policy can be set by the system
> administrator through a sysctl configuration consistent with the mount
> points or the file access rights.  The documentation explains the
> prerequisites.
> 
> It is important to note that this can only enable to extend access
> control managed by the kernel.  Hence it enables current access control
> mechanism to be extended and become a superset of what they can
> currently control.  Indeed, the security policy could also be delegated
> to an LSM, either a MAC system or an integrity system.  For instance,
> this is required to close a major IMA measurement/appraisal interpreter
> integrity gap by bringing the ability to check the use of scripts.
> Other uses are expected as well.
> 
> For further details, please see the latest cover letter:
> https://lore.kernel.org/r/20220104155024.48023-1-mic@digikod.net
> 
> Commit dae71698b6c5 ("printk: Move back proc_dointvec_minmax_sysadmin()
> to sysctl.c") was recently added due to the sysctl refactoring.
> 
> Commit e674341a90b9 ("selftests/interpreter: fix separate directory
> build") will fix some test build cases as explained here:
> https://lore.kernel.org/r/20220119101531.2850400-1-usama.anjum@collabora.com
> Merging this commit without the new KHDR_INCLUDES is not an issue.
> The upcoming kselftest pull request is ready:
> https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/log/?h=next
> 
> This patch series has been open for review for more than three years and
> got a lot of feedbacks (and bikeshedding) which were all considered.
> Since I heard no objection, please consider to pull this code for
> v5.18-rc1 .  These five patches have been successfully tested in the
> latest linux-next releases for several weeks.
> 
> Regards,
>   Mickaël
> 
> --
> The following changes since commit dcb85f85fa6f142aae1fe86f399d4503d49f2b60:
> 
>    gcc-plugins/stackleak: Use noinstr in favor of notrace (2022-02-03 17:02:21 -0800)
> 
> are available in the Git repository at:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux.git tags/trusted-for-v18
> 
> for you to fetch changes up to e674341a90b95c3458d684ae25e6891afc3e03ad:
> 
>    selftests/interpreter: fix separate directory build (2022-03-04 10:56:25 +0100)
> 
> ----------------------------------------------------------------
> Add the trusted_for system call (v18)
> 
> The final goal of this patch series is to enable the kernel to be a
> global policy manager by entrusting processes with access control at
> their level.  To reach this goal, two complementary parts are required:
> * user space needs to be able to know if it can trust some file
>    descriptor content for a specific usage;
> * and the kernel needs to make available some part of the policy
>    configured by the system administrator.
> 
> In a nutshell, this is a required building block to control script
> execution.
> 
> For further details see the latest cover letter:
> https://lore.kernel.org/r/20220104155024.48023-1-mic@digikod.net
> 
> ----------------------------------------------------------------
> Mickaël Salaün (4):
>        printk: Move back proc_dointvec_minmax_sysadmin() to sysctl.c
>        fs: Add trusted_for(2) syscall implementation and related sysctl
>        arch: Wire up trusted_for(2)
>        selftest/interpreter: Add tests for trusted_for(2) policies
> 
> Muhammad Usama Anjum (1):
>        selftests/interpreter: fix separate directory build
> 
>   Documentation/admin-guide/sysctl/fs.rst            |  50 +++
>   arch/alpha/kernel/syscalls/syscall.tbl             |   1 +
>   arch/arm/tools/syscall.tbl                         |   1 +
>   arch/arm64/include/asm/unistd.h                    |   2 +-
>   arch/arm64/include/asm/unistd32.h                  |   2 +
>   arch/ia64/kernel/syscalls/syscall.tbl              |   1 +
>   arch/m68k/kernel/syscalls/syscall.tbl              |   1 +
>   arch/microblaze/kernel/syscalls/syscall.tbl        |   1 +
>   arch/mips/kernel/syscalls/syscall_n32.tbl          |   1 +
>   arch/mips/kernel/syscalls/syscall_n64.tbl          |   1 +
>   arch/mips/kernel/syscalls/syscall_o32.tbl          |   1 +
>   arch/parisc/kernel/syscalls/syscall.tbl            |   1 +
>   arch/powerpc/kernel/syscalls/syscall.tbl           |   1 +
>   arch/s390/kernel/syscalls/syscall.tbl              |   1 +
>   arch/sh/kernel/syscalls/syscall.tbl                |   1 +
>   arch/sparc/kernel/syscalls/syscall.tbl             |   1 +
>   arch/x86/entry/syscalls/syscall_32.tbl             |   1 +
>   arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
>   arch/xtensa/kernel/syscalls/syscall.tbl            |   1 +
>   fs/open.c                                          | 133 ++++++++
>   fs/proc/proc_sysctl.c                              |   2 +-
>   include/linux/syscalls.h                           |   1 +
>   include/linux/sysctl.h                             |   3 +
>   include/uapi/asm-generic/unistd.h                  |   5 +-
>   include/uapi/linux/trusted-for.h                   |  18 +
>   kernel/printk/sysctl.c                             |   9 -
>   kernel/sysctl.c                                    |   9 +
>   tools/testing/selftests/Makefile                   |   1 +
>   tools/testing/selftests/interpreter/.gitignore     |   2 +
>   tools/testing/selftests/interpreter/Makefile       |  21 ++
>   tools/testing/selftests/interpreter/config         |   1 +
>   .../selftests/interpreter/trust_policy_test.c      | 362 +++++++++++++++++++++
>   32 files changed, 625 insertions(+), 12 deletions(-)
>   create mode 100644 include/uapi/linux/trusted-for.h
>   create mode 100644 tools/testing/selftests/interpreter/.gitignore
>   create mode 100644 tools/testing/selftests/interpreter/Makefile
>   create mode 100644 tools/testing/selftests/interpreter/config
>   create mode 100644 tools/testing/selftests/interpreter/trust_policy_test.c
Kees Cook April 4, 2022, 6:40 p.m. UTC | #5
On Mon, Mar 21, 2022 at 05:15:57PM +0100, Mickaël Salaün wrote:
> [...]
> For further details, please see the latest cover letter:
> https://lore.kernel.org/r/20220104155024.48023-1-mic@digikod.net
> 
> Commit dae71698b6c5 ("printk: Move back proc_dointvec_minmax_sysadmin()
> to sysctl.c") was recently added due to the sysctl refactoring.
> 
> Commit e674341a90b9 ("selftests/interpreter: fix separate directory
> build") will fix some test build cases as explained here:
> https://lore.kernel.org/r/20220119101531.2850400-1-usama.anjum@collabora.com
> Merging this commit without the new KHDR_INCLUDES is not an issue.
> The upcoming kselftest pull request is ready:
> https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/log/?h=next
> 
> This patch series has been open for review for more than three years and
> got a lot of feedbacks (and bikeshedding) which were all considered.
> Since I heard no objection, please consider to pull this code for
> v5.18-rc1 .  These five patches have been successfully tested in the
> latest linux-next releases for several weeks.

Hi Linus,

It looks like this didn't get pulled for -rc1 even though it was sent
during the merge window and has been in -next for a while. It would be
really nice to get this landed since userspace can't make any forward
progress without the kernel support.

Was there some issue blocking this from being merged? All the feedback I
can find on prior versions was addressed.

-Kees
Linus Torvalds April 4, 2022, 6:47 p.m. UTC | #6
On Mon, Apr 4, 2022 at 11:40 AM Kees Cook <keescook@chromium.org> wrote:
>
> It looks like this didn't get pulled for -rc1 even though it was sent
> during the merge window and has been in -next for a while. It would be
> really nice to get this landed since userspace can't make any forward
> progress without the kernel support.

Honestly, I need a *lot* better reasoning for random new non-standard
system calls than this had.

And this kind of "completely random interface with no semantics except
for random 'future flags'" I will not pull even *with* good reasoning.

I already told Mickaël in private that I wouldn't pull this.

Honestly, we have a *horrible* history with non-standard system calls,
and that's been true even for well-designed stuff that actually
matters, that people asked for.

Something  like this, which adds one very special system call and
where the whole thing is designed for "let's add something random
later because we don't even know what we want" is right out.

What the system call seems to actually *want* is basically a new flag
to access() (and faccessat()). One that is very close to what X_OK
already is.

But that wasn't how it was sold.

So no. No way will this ever get merged, and whoever came up with that
disgusting "trusted_for()" (for WHAT? WHO TRUSTS? WHY?) should look
themselves in the mirror.

If you add a new X_OK variant to access(), maybe that could fly.

                Linus
Mickaël Salaün April 4, 2022, 8:30 p.m. UTC | #7
On 04/04/2022 20:47, Linus Torvalds wrote:
> On Mon, Apr 4, 2022 at 11:40 AM Kees Cook <keescook@chromium.org> wrote:
>>
>> It looks like this didn't get pulled for -rc1 even though it was sent
>> during the merge window and has been in -next for a while. It would be
>> really nice to get this landed since userspace can't make any forward
>> progress without the kernel support.
> 
> Honestly, I need a *lot* better reasoning for random new non-standard
> system calls than this had.
> 
> And this kind of "completely random interface with no semantics except
> for random 'future flags'" I will not pull even *with* good reasoning.

I think the semantic is well defined:
"This new syscall enables user space to ask the kernel: is this file
descriptor's content trusted to be used for this purpose?"
See the trusted_for_policy sysctl documentation: 
https://lore.kernel.org/all/20220104155024.48023-3-mic@digikod.net/

There is currently only one defined and implemented purpose: execution 
(or script interpretation). There is room for other flags because it is 
a good practice to do so, and other purposes were proposed.


> 
> I already told Mickaël in private that I wouldn't pull this.
> 
> Honestly, we have a *horrible* history with non-standard system calls,
> and that's been true even for well-designed stuff that actually
> matters, that people asked for.
> 
> Something  like this, which adds one very special system call and
> where the whole thing is designed for "let's add something random
> later because we don't even know what we want" is right out.
> 
> What the system call seems to actually *want* is basically a new flag
> to access() (and faccessat()). One that is very close to what X_OK
> already is.

I agree.


> 
> But that wasn't how it was sold.
> 
> So no. No way will this ever get merged, and whoever came up with that
> disgusting "trusted_for()" (for WHAT? WHO TRUSTS? WHY?) should look
> themselves in the mirror.

Well, naming is difficult, but I'm open to suggestion. :)

As explained in the description, the WHAT is the file descriptor 
content, the WHO TRUSTS is the system security policy (e.g. the mount 
point options) and the WHY is defined by the usage flag 
(TRUSTED_FOR_EXECUTION).
This translates to: is this file descriptor's content trusted to be used 
for this specified purpose/usage?


> 
> If you add a new X_OK variant to access(), maybe that could fly.

As answered in private, that was the approach I took for one of the 
early versions but a dedicated syscall was requested by Al Viro: 
https://lore.kernel.org/r/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net
The main reason behind this request was that it doesn't have the exact 
same semantic as faccessat(2). The changes for this syscall are 
documented here: 
https://lore.kernel.org/all/20220104155024.48023-3-mic@digikod.net/
The whole history is linked in the cover letter: 
https://lore.kernel.org/all/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net/

This initial proposal was using a new faccessat2(2) flag: 
AT_INTERPRETED, see 
https://lore.kernel.org/all/20200908075956.1069018-2-mic@digikod.net/
What do you think about that? I'm happy to get back to this version if 
everyone is OK with it.
Linus Torvalds April 4, 2022, 9:28 p.m. UTC | #8
On Mon, Apr 4, 2022 at 1:29 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> This initial proposal was using a new faccessat2(2) flag:
> AT_INTERPRETED, see
> https://lore.kernel.org/all/20200908075956.1069018-2-mic@digikod.net/
> What do you think about that? I'm happy to get back to this version if
> everyone is OK with it.

I'm certainly happi_er_ with that, but I find that particular patch
odd for other reasons.

In no particular order:

 - what's with the insane non-C syntax? Double parentheses have no
actual meaning in C:

     if ((flags & AT_INTERPRETED)) {
         if ((mode & MAY_EXEC)) {

   so I don't understand why you'd use that strance thing.

 - why is this an AT_INTERPRETED flag? I don't understand the name, I
don't understand the semantics.

   Why would that flag have the same value as AT_SYMLINK_FOLLOW?

   Why isn't this just a new _mode_ bit, which is what I feel is
sensible? We only use three bits (with no bits set meaning
"existence"), so we have *tons* of bits left in that namespace, and it
would make much more sense to me to have

        #define EXECVE_OK 8

    which is the same as the "group exec" bit, so it actually makes
some kind of sense too.

 - related to that "I don't understand the semantics", the
"documentation" for that thing doesn't make sense either:

    +         The
    +    main usage is for script
    +    interpreters to enforce a policy
    +    consistent with the kernel's one
    +    (through sysctl configuration or LSM
    +    policy).  */

Now, what I *think* you mean is

 (1) user-space executable loaders want to be able to test the *same*
policy as the kernel does for execve()

 (2) access(path, EXECVE_OK) will do the same permission checks as
"execve()" would do for that path

 (3) if you already have the fd open, use "faccess(fd, NULL,
F_OK_TO_EXECUTE, AT_EMPTY_PATH)"

 (4) maybe we want to add a flag for the "euid vs real uid", and that
would be in the "flags" field, since that changes the actual *lookup*
semantics

Note that that (4) is something that some normal user space has wanted
in the past too (GNU libcs has a "eaccess()" thing for "effective uid
access").

And yes, we still need to talk details:

 - no backwards compatibility issues, because we've happily always
checked 'mode' for being valid, so old kernels will always return
-EINVAL.

 - POSIX namespace issues for EXECVE_OK means that the name probably
needs some thinking (maybe we just need to call it __ACCESS_OK_EXECVE
internally or something - the kernel actually doesn't even export the
existing [FRWX]_OK values, because we "know" they map tho the usual
"owner RWX" bits, with F being "no bit set")

 - I really want the exact semantics very clearly defined. I think
it's ok to say "exact same security check as for 'execve()'", but even
then we need to have that discussion about

    (a) "what about suid bits that user space cannot react to"
    (b) that whole "effective vs real" discussion

So to recap: I'm very much ok with some access() extension, but I
think even that very much needs clarification, and the existing patch
is just odd in many many ways.

                  Linus
Linus Torvalds April 4, 2022, 9:40 p.m. UTC | #9
On Mon, Apr 4, 2022 at 2:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  (4) maybe we want to add a flag for the "euid vs real uid", and that
> would be in the "flags" field, since that changes the actual *lookup*
> semantics

Duh. We already did that long ago, and it's there as AT_EACCESS.

I should have just looked at the code more closely.

But that "you didn't even check what we already do, Linus" thing just
makes it even more obvious that all of this makes perfect sense in the
confines of access() already, and a new "check _these_ protections"
should just be a new mode bit.

                 Linus
Kees Cook April 4, 2022, 10:25 p.m. UTC | #10
On Mon, Apr 04, 2022 at 02:28:19PM -0700, Linus Torvalds wrote:
> Now, what I *think* you mean is
> 
>  (1) user-space executable loaders want to be able to test the *same*
> policy as the kernel does for execve()

Right. The script interpreter wants to ask "if this file were actually
an ELF going through execve(), would the kernel allow it?"

>  (2) access(path, EXECVE_OK) will do the same permission checks as
> "execve()" would do for that path

Maybe. I defer to Mickaël here, but my instinct is to avoid creating an
API that can be accidentally misused. I'd like this to be fd-only based,
since that removes path name races. (e.g. trusted_for() required an fd.)

>  (3) if you already have the fd open, use "faccess(fd, NULL,
> F_OK_TO_EXECUTE, AT_EMPTY_PATH)"

Yes, specifically faccessat2(). (And continuing the race thought above,
yes, there could still be races if the content of the file could be
changed, but that case is less problematic under real-world conditions.)

>  (4) maybe we want to add a flag for the "euid vs real uid", and that
> would be in the "flags" field, since that changes the actual *lookup*
> semantics
> 
> Note that that (4) is something that some normal user space has wanted
> in the past too (GNU libcs has a "eaccess()" thing for "effective uid
> access").

I think this already exists as AT_EACCESS? It was added with
faccessat2() itself, if I'm reading the history correctly.

And I just need to say that the thought of setuid script interpreters
still makes me sad. :)

>  - I really want the exact semantics very clearly defined. I think
> it's ok to say "exact same security check as for 'execve()'", but even
> then we need to have that discussion about
> 
>     (a) "what about suid bits that user space cannot react to"

What do you mean here? Do you mean setid bits on the file itself?

>     (b) that whole "effective vs real" discussion

I think this is handled with AT_EACCESS?
Linus Torvalds April 4, 2022, 11:26 p.m. UTC | #11
On Mon, Apr 4, 2022 at 3:25 PM Kees Cook <keescook@chromium.org> wrote:
>
> Maybe. I defer to Mickaël here, but my instinct is to avoid creating an
> API that can be accidentally misused. I'd like this to be fd-only based,
> since that removes path name races. (e.g. trusted_for() required an fd.)

Some people want pathnames. Think things like just the PATH thing just
to find the right executable in the first place.

For things like that, races don't matter, because you're just trying
to find the right path to begin with.

> I think this already exists as AT_EACCESS? It was added with
> faccessat2() itself, if I'm reading the history correctly.

Yeah, I noticed myself, I just hadn't looked (and I don't do enough
user-space programming to be aware of if that way).

> >     (a) "what about suid bits that user space cannot react to"
>
> What do you mean here? Do you mean setid bits on the file itself?

Right.

Maybe we don't care.

Maybe we do.

Is the user-space loader going to honor them? Is it going to ignore
them? I don't know. And it actually interacts with things like
'nosuid', which the kernel does know about, and user space has a hard
time figuring out.

So if the point is "give me an interface so that I can do the same
thing a kernel execve() loader would do", then those sgid/suid bits
actually may be exactly the kind of thing that user space wants the
kernel to react to - should it ignore them, or should it do something
special when it sees that they are set?

I'm not saying that they *should* be something we care about. All I'm
saying is that I want that *discussion* to happen.

               Linus
Theodore Ts'o April 5, 2022, 2:54 p.m. UTC | #12
On Mon, Apr 04, 2022 at 10:30:13PM +0200, Mickaël Salaün wrote:
> > If you add a new X_OK variant to access(), maybe that could fly.
> 
> As answered in private, that was the approach I took for one of the early
> versions but a dedicated syscall was requested by Al Viro:
> https://lore.kernel.org/r/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net
> The main reason behind this request was that it doesn't have the exact same
> semantic as faccessat(2). The changes for this syscall are documented here:
> https://lore.kernel.org/all/20220104155024.48023-3-mic@digikod.net/
> The whole history is linked in the cover letter:
> https://lore.kernel.org/all/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net/

As a suggestion, something that can be helpful for something which has
been as heavily bike-sheded as this concept might be to write a
"legislative history", or perhaps, a "bike shed history".

And not just with links to mailing list discussions, but a short
summary of why, for example, we moved from the open flag O_MAYEXEC to
the faccessat(2) approach.  I looked, but I couldn't find the
reasoning while diving into the mail archives.  And there was some
kind of request for some new functionality for IMA and other LSM's
that I couldn't follow that is why the new AT_INTERETED flag, but at
this point my time quantuum for mailing list archeology most
definitely expired.  :-)

It might be that when all of this is laid out, we can either revisit
prior design decisions as "that bike-shed request to support this
corner case was unreasonable", or "oh, OK, this is why we need as
fully general a solution as this".

Also, some of examples of potential future use cases such as "magic
links" that were linked in the cover letter, it's not clear to me
actually make sense in the context of a "trusted for" system call
(although might make more sense in the context of an open flag).  So
revisiting some of those other cases to see whether they actually
*could* be implemented as new "TRUSTED_FOR" flags might be
instructive.

Personally, I'm a bit skeptical about the prospct of additional use
cases, since trusted_for(2) is essentially a mother_should_I(2)
request where userspace is asking the kernel whether they should go
ahead and do some particular policy thing.  And it's not clear to me
how many of these policy questions exist where (a) the kernel is in
the past position to answer that question, and (b) there isn't some
additional information that the kernel doesn't have that might be
needed to answer that question.

For example, "Mother should I use that private key file" might require
information about whether the SRE is currently on pager duty or not,
at least for some policies, and the kernel isn't going to have that
information.

Other examples of TRUSTED_FOR flags that really make sense and would
be useful might help alleviate my skepticsm.  And the "bike shed
history" would help with my question about why some folks didn't like
the original O_MAYEXEC flag?

Cheers,

					- Ted
Mickaël Salaün April 5, 2022, 3:38 p.m. UTC | #13
On 04/04/2022 23:28, Linus Torvalds wrote:
> On Mon, Apr 4, 2022 at 1:29 PM Mickaël Salaün <mic@digikod.net> wrote:
>>
>> This initial proposal was using a new faccessat2(2) flag:
>> AT_INTERPRETED, see
>> https://lore.kernel.org/all/20200908075956.1069018-2-mic@digikod.net/
>> What do you think about that? I'm happy to get back to this version if
>> everyone is OK with it.
> 
> I'm certainly happi_er_ with that, but I find that particular patch
> odd for other reasons.
> 
> In no particular order:
> 
>   - what's with the insane non-C syntax? Double parentheses have no
> actual meaning in C:
> 
>       if ((flags & AT_INTERPRETED)) {
>           if ((mode & MAY_EXEC)) {
> 
>     so I don't understand why you'd use that strance thing.

I guess it comes from a previous version that ANDed two booleans.


> 
>   - why is this an AT_INTERPRETED flag? I don't understand the name, I
> don't understand the semantics.

I wasn't sure it was a good idea to add another mode bit, so I ended up 
using a flag to not break compatibility of other mode checks but extend 
the semantic to interpreted scripts. But I agree that a new mode makes 
sense.


> 
>     Why would that flag have the same value as AT_SYMLINK_FOLLOW?

It was a bug.


> 
>     Why isn't this just a new _mode_ bit, which is what I feel is
> sensible? We only use three bits (with no bits set meaning
> "existence"), so we have *tons* of bits left in that namespace, and it
> would make much more sense to me to have
> 
>          #define EXECVE_OK 8
> 
>      which is the same as the "group exec" bit, so it actually makes
> some kind of sense too.

Looks fine to me. The "EXECVE_" prefix is a bit weird but it will not be 
defined in the kernel like X_OK and others anyway, and as you said, it 
matches S_IXGRP.


> 
>   - related to that "I don't understand the semantics", the
> "documentation" for that thing doesn't make sense either:
> 
>      +         The
>      +    main usage is for script
>      +    interpreters to enforce a policy
>      +    consistent with the kernel's one
>      +    (through sysctl configuration or LSM
>      +    policy).  */

I'll synchronize the documentation with a next series.


> 
> Now, what I *think* you mean is
See a following email in reply to Kees.
[...]

> 
> And yes, we still need to talk details:
> 
>   - no backwards compatibility issues, because we've happily always
> checked 'mode' for being valid, so old kernels will always return
> -EINVAL.
> 
>   - POSIX namespace issues for EXECVE_OK means that the name probably
> needs some thinking (maybe we just need to call it __ACCESS_OK_EXECVE
> internally or something - the kernel actually doesn't even export the
> existing [FRWX]_OK values, because we "know" they map tho the usual
> "owner RWX" bits, with F being "no bit set")

Right, I cannot find a better name for now.

See a following email in reply to Kees.
[...]

> 
> So to recap: I'm very much ok with some access() extension, but I
> think even that very much needs clarification, and the existing patch
> is just odd in many many ways.

This v8 was kind of an early version, I'll update everything. Thanks!
Mickaël Salaün April 5, 2022, 3:55 p.m. UTC | #14
On 05/04/2022 00:25, Kees Cook wrote:
> On Mon, Apr 04, 2022 at 02:28:19PM -0700, Linus Torvalds wrote:
>> Now, what I *think* you mean is
>>
>>   (1) user-space executable loaders want to be able to test the *same*
>> policy as the kernel does for execve()
> 
> Right. The script interpreter wants to ask "if this file were actually
> an ELF going through execve(), would the kernel allow it?"

The current behavior was a bit more flexible thanks to the sysctl. It 
was either the mount exec option check, the file perm check or both. The 
rationale is to let sysadmins adapt their system to existing 
applications/interpreters without breaking. Only basing the check on 
mount exec and file perm could be an issue in the short term, but I 
guess it would deter inconsistencies in existing systems… I'm not sure 
it is a wise move because if no interpreter want to use this check it 
would then be useless. See commit message in 
https://lore.kernel.org/all/20220104155024.48023-3-mic@digikod.net/ and 
the trusted_for_policy sysctl documentation:

"Even without enforced security policy, user space interpreters can use
this syscall to try as much as possible to enforce the system policy at
their level, knowing that it will not break anything on running systems
which do not care about this feature.  However, on systems which want
this feature enforced, there will be knowledgeable people (i.e. system
administrator who configured fs.trusted_for_policy deliberately) to
manage it. [...]"


> 
>>   (2) access(path, EXECVE_OK) will do the same permission checks as
>> "execve()" would do for that path
> 
> Maybe. I defer to Mickaël here, but my instinct is to avoid creating an
> API that can be accidentally misused. I'd like this to be fd-only based,
> since that removes path name races. (e.g. trusted_for() required an fd.)

The fd-based approach is definitely better from a security point of view 
but there is indeed a use case for pathnames. I guess we could highlight 
this point in the documentation.

> 
>>   (3) if you already have the fd open, use "faccess(fd, NULL,
>> F_OK_TO_EXECUTE, AT_EMPTY_PATH)"
> 
> Yes, specifically faccessat2(). (And continuing the race thought above,
> yes, there could still be races if the content of the file could be
> changed, but that case is less problematic under real-world conditions.)

I'm not worried about changes in the file once it is opened. This could 
be an issue but not in the kernel (e.g. flaky update system).

[...]
Mickaël Salaün April 5, 2022, 4:09 p.m. UTC | #15
On 05/04/2022 01:26, Linus Torvalds wrote:
> On Mon, Apr 4, 2022 at 3:25 PM Kees Cook <keescook@chromium.org> wrote:

[...]

> 
>> I think this already exists as AT_EACCESS? It was added with
>> faccessat2() itself, if I'm reading the history correctly.
> 
> Yeah, I noticed myself, I just hadn't looked (and I don't do enough
> user-space programming to be aware of if that way).

I think AT_EACCESS should be usable with the new EXECVE_OK too.


> 
>>>      (a) "what about suid bits that user space cannot react to"
>>
>> What do you mean here? Do you mean setid bits on the file itself?
> 
> Right.
> 
> Maybe we don't care.

I think we don't. I think the only corner case that could be different 
is for files that are executable, SUID and non-readable. In this case it 
wouldn't matter because userspace could not read the file, which is 
required for interpretation/execution. Anyway, S[GU]ID bits in scripts 
are just ignored by execve and we want to follow the same semantic.


> 
> Maybe we do.
> 
> Is the user-space loader going to honor them? Is it going to ignore
> them? I don't know. And it actually interacts with things like
> 'nosuid', which the kernel does know about, and user space has a hard
> time figuring out.
> 
> So if the point is "give me an interface so that I can do the same
> thing a kernel execve() loader would do", then those sgid/suid bits
> actually may be exactly the kind of thing that user space wants the
> kernel to react to - should it ignore them, or should it do something
> special when it sees that they are set?
> 
> I'm not saying that they *should* be something we care about. All I'm
> saying is that I want that *discussion* to happen.
> 
>                 Linus
Mickaël Salaün April 5, 2022, 4:14 p.m. UTC | #16
On 05/04/2022 16:54, Theodore Ts'o wrote:
> On Mon, Apr 04, 2022 at 10:30:13PM +0200, Mickaël Salaün wrote:
>>> If you add a new X_OK variant to access(), maybe that could fly.
>>
>> As answered in private, that was the approach I took for one of the early
>> versions but a dedicated syscall was requested by Al Viro:
>> https://lore.kernel.org/r/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net
>> The main reason behind this request was that it doesn't have the exact same
>> semantic as faccessat(2). The changes for this syscall are documented here:
>> https://lore.kernel.org/all/20220104155024.48023-3-mic@digikod.net/
>> The whole history is linked in the cover letter:
>> https://lore.kernel.org/all/2ed377c4-3500-3ddc-7181-a5bc114ddf94@digikod.net/
> 
> As a suggestion, something that can be helpful for something which has
> been as heavily bike-sheded as this concept might be to write a
> "legislative history", or perhaps, a "bike shed history".
> 
> And not just with links to mailing list discussions, but a short
> summary of why, for example, we moved from the open flag O_MAYEXEC to
> the faccessat(2) approach.  I looked, but I couldn't find the
> reasoning while diving into the mail archives.  And there was some
> kind of request for some new functionality for IMA and other LSM's
> that I couldn't follow that is why the new AT_INTERETED flag, but at
> this point my time quantuum for mailing list archeology most
> definitely expired.  :-)
> 
> It might be that when all of this is laid out, we can either revisit
> prior design decisions as "that bike-shed request to support this
> corner case was unreasonable", or "oh, OK, this is why we need as
> fully general a solution as this".
> 
> Also, some of examples of potential future use cases such as "magic
> links" that were linked in the cover letter, it's not clear to me
> actually make sense in the context of a "trusted for" system call
> (although might make more sense in the context of an open flag).  So
> revisiting some of those other cases to see whether they actually
> *could* be implemented as new "TRUSTED_FOR" flags might be
> instructive.
> 
> Personally, I'm a bit skeptical about the prospct of additional use
> cases, since trusted_for(2) is essentially a mother_should_I(2)

That would be an interesting syscall name. ;)


> request where userspace is asking the kernel whether they should go
> ahead and do some particular policy thing.  And it's not clear to me
> how many of these policy questions exist where (a) the kernel is in
> the past position to answer that question, and (b) there isn't some
> additional information that the kernel doesn't have that might be
> needed to answer that question.

Script execution is definitely the main use case and the semantic is 
already known by the kernel.


> 
> For example, "Mother should I use that private key file" might require
> information about whether the SRE is currently on pager duty or not,
> at least for some policies, and the kernel isn't going to have that
> information.
> 
> Other examples of TRUSTED_FOR flags that really make sense and would
> be useful might help alleviate my skepticsm.  And the "bike shed
> history" would help with my question about why some folks didn't like
> the original O_MAYEXEC flag?

Thanks, I'll do some that.

> 
> Cheers,
> 
> 					- Ted
Linus Torvalds April 5, 2022, 4:17 p.m. UTC | #17
On Tue, Apr 5, 2022 at 9:08 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> I think we don't. I think the only corner case that could be different
> is for files that are executable, SUID and non-readable. In this case it
> wouldn't matter because userspace could not read the file, which is
> required for interpretation/execution. Anyway, S[GU]ID bits in scripts
> are just ignored by execve and we want to follow the same semantic.

So I just want to bring up the possibility that somebody wants to just
implement execve() in user space for some reason - not just "script
interpreter".

It's *doable*.

Don't ask me if it's sane or useful, but people have done insane
things before. Things like "emulate other operating systems in user
space" etc

Such a user can trivially see the suid/sgid bit on the file (just do
fstat() on it), but wouldn't necessarily see if that file is then in a
mount that is mounted nosuid.

Now, I think the right thing to do is to just say "we don't support
it", but I do think it should perhaps be mentioned somewhere
explicitly.

Particularly since I can well imagine that a security policy might
have some "no, I don't allow suid exec" and return an actual error for
it, and then the access() call would fail for that case.

(Ok, so the security policies would look at the actual bprm data on a
real exec, not the inode executable, so that's kind of made up and
theoretical, but I just want this issue to be mentioned somewhere so
that people are aware that the "it's the same basic file checking that
execve does, but a _real_ execve might then have _other_ issues going
on, including suid bits etc")

               Linus
Theodore Ts'o April 5, 2022, 10:21 p.m. UTC | #18
On Mon, Apr 04, 2022 at 04:26:44PM -0700, Linus Torvalds wrote:
> > >     (a) "what about suid bits that user space cannot react to"
> >
> > What do you mean here? Do you mean setid bits on the file itself?
> 
> Right.
> 
> Maybe we don't care.
> 
> Maybe we do.
> 
> Is the user-space loader going to honor them? Is it going to ignore
> them? I don't know. And it actually interacts with things like
> 'nosuid', which the kernel does know about, and user space has a hard
> time figuring out.

So there *used* to be suidperl which was a setuid version of perl with
some extra security checks.  (See [1] for more details.)  The suidperl
binary would be used by #!/usr/bin/perl so it could honor setuid bits
on perl scripts, but it was deprecated in Perl 5.8 and removed in Perl
5.12 in 2010[2].

[1] https://mattmccutchen.net/suidperl.html
[2] https://metacpan.org/release/SHAY/perl-5.20.2/view/pod/perl5120delta.pod#Deprecations

So it's possible that the user-space loader might try to honor them,
and if there was such an example "in the field", it might be nice if
there was a way for the kernel to advise userspace about the nosuid.
But I'm not aware of any other shell script interpreter that tried do
what perl did with suidperl.

> So if the point is "give me an interface so that I can do the same
> thing a kernel execve() loader would do", then those sgid/suid bits
> actually may be exactly the kind of thing that user space wants the
> kernel to react to - should it ignore them, or should it do something
> special when it sees that they are set?
> 
> I'm not saying that they *should* be something we care about. All I'm
> saying is that I want that *discussion* to happen.

I'm not convinced we should.  I suppose *if* the shell script was
suid, *and* the file system was mounted nosuid, then the check could
return false, and that would be mostly harmless even if the script
interpreter didn't support setuid.  But it's extra complexity, and in
theory it could break a setuid script, where the setuid bit was
previously a no-op, and it now might cause a problem for that user.

	     	    	       	     	   - Ted
Kees Cook Feb. 8, 2023, 7:32 p.m. UTC | #19
*thread necromancy*

On Tue, Apr 05, 2022 at 06:09:03PM +0200, Mickaël Salaün wrote:
> 
> On 05/04/2022 01:26, Linus Torvalds wrote:
> > On Mon, Apr 4, 2022 at 3:25 PM Kees Cook <keescook@chromium.org> wrote:
> 
> [...]
> 
> > 
> > > I think this already exists as AT_EACCESS? It was added with
> > > faccessat2() itself, if I'm reading the history correctly.
> > 
> > Yeah, I noticed myself, I just hadn't looked (and I don't do enough
> > user-space programming to be aware of if that way).
> 
> I think AT_EACCESS should be usable with the new EXECVE_OK too.
> 
> 
> > 
> > > >      (a) "what about suid bits that user space cannot react to"
> > > 
> > > What do you mean here? Do you mean setid bits on the file itself?
> > 
> > Right.
> > 
> > Maybe we don't care.
> 
> I think we don't. I think the only corner case that could be different is
> for files that are executable, SUID and non-readable. In this case it
> wouldn't matter because userspace could not read the file, which is required
> for interpretation/execution. Anyway, S[GU]ID bits in scripts are just
> ignored by execve and we want to follow the same semantic.

Hi Mickaël,

Is there a new version of this being worked on? It would be really nice
to have the O_MAYEXEC/faccessat2() visibility for script execution control
in userspace. It seems like it would be mainly a respin of an earlier
version of this series before trusted_for() was proposed.

-Kees
Mickaël Salaün Feb. 9, 2023, 3:43 p.m. UTC | #20
On 08/02/2023 20:32, Kees Cook wrote:
> *thread necromancy*
> 
> On Tue, Apr 05, 2022 at 06:09:03PM +0200, Mickaël Salaün wrote:
>>
>> On 05/04/2022 01:26, Linus Torvalds wrote:
>>> On Mon, Apr 4, 2022 at 3:25 PM Kees Cook <keescook@chromium.org> wrote:
>>
>> [...]
>>
>>>
>>>> I think this already exists as AT_EACCESS? It was added with
>>>> faccessat2() itself, if I'm reading the history correctly.
>>>
>>> Yeah, I noticed myself, I just hadn't looked (and I don't do enough
>>> user-space programming to be aware of if that way).
>>
>> I think AT_EACCESS should be usable with the new EXECVE_OK too.
>>
>>
>>>
>>>>>       (a) "what about suid bits that user space cannot react to"
>>>>
>>>> What do you mean here? Do you mean setid bits on the file itself?
>>>
>>> Right.
>>>
>>> Maybe we don't care.
>>
>> I think we don't. I think the only corner case that could be different is
>> for files that are executable, SUID and non-readable. In this case it
>> wouldn't matter because userspace could not read the file, which is required
>> for interpretation/execution. Anyway, S[GU]ID bits in scripts are just
>> ignored by execve and we want to follow the same semantic.
> 
> Hi Mickaël,
> 
> Is there a new version of this being worked on? It would be really nice
> to have the O_MAYEXEC/faccessat2() visibility for script execution control
> in userspace. It seems like it would be mainly a respin of an earlier
> version of this series before trusted_for() was proposed.

Yes, I plan to send a new version in a few weeks.

> 
> -Kees
>