diff mbox series

[man-pages,v2,2/2] openat2.2: document new openat2(2) syscall

Message ID 20200202151907.23587-3-cyphar@cyphar.com (mailing list archive)
State New, archived
Headers show
Series document openat2(2) | expand

Commit Message

Aleksa Sarai Feb. 2, 2020, 3:19 p.m. UTC
Rather than trying to merge the new syscall documentation into open.2
(which would probably result in the man-page being incomprehensible),
instead the new syscall gets its own dedicated page with links between
open(2) and openat2(2) to avoid duplicating information such as the list
of O_* flags or common errors.

In addition to describing all of the key flags, information about the
extensibility design is provided so that users can better understand why
they need to pass sizeof(struct open_how) and how their programs will
work across kernels. After some discussions with David Laight, I also
included explicit instructions to zero the structure to avoid issues
when recompiling with new headers.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2    |  17 ++
 man2/openat2.2 | 471 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 488 insertions(+)
 create mode 100644 man2/openat2.2

Comments

Michael Kerrisk (man-pages) March 30, 2020, 9:08 a.m. UTC | #1
Hello Aleksa,

On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> Rather than trying to merge the new syscall documentation into open.2
> (which would probably result in the man-page being incomprehensible),
> instead the new syscall gets its own dedicated page with links between
> open(2) and openat2(2) to avoid duplicating information such as the list
> of O_* flags or common errors.
> 
> In addition to describing all of the key flags, information about the
> extensibility design is provided so that users can better understand why
> they need to pass sizeof(struct open_how) and how their programs will
> work across kernels. After some discussions with David Laight, I also
> included explicit instructions to zero the structure to avoid issues
> when recompiling with new headers.>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

I'm just editing this page, and have a question on one piece.

> +Unlike
> +.BR openat (2),
> +it is an error to provide
> +.BR openat2 ()
> +with a
> +.I mode
> +which contains bits other than
> +.IR 0777 ,

This piece appears not to be true, both from my reading of the
source code, and from testing (i.e., I wrote a a small program that
successfully called openat2() and created a file that had the
set-UID, set-GID, and sticky bits set).

Is this a bug in the implementation or a bug in the manual page text?

Thanks,

Michael
Aleksa Sarai March 30, 2020, 9:20 a.m. UTC | #2
On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
> > 
> > In addition to describing all of the key flags, information about the
> > extensibility design is provided so that users can better understand why
> > they need to pass sizeof(struct open_how) and how their programs will
> > work across kernels. After some discussions with David Laight, I also
> > included explicit instructions to zero the structure to avoid issues
> > when recompiling with new headers.>
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> I'm just editing this page, and have a question on one piece.
> 
> > +Unlike
> > +.BR openat (2),
> > +it is an error to provide
> > +.BR openat2 ()
> > +with a
> > +.I mode
> > +which contains bits other than
> > +.IR 0777 ,
> 
> This piece appears not to be true, both from my reading of the
> source code, and from testing (i.e., I wrote a a small program that
> successfully called openat2() and created a file that had the
> set-UID, set-GID, and sticky bits set).
> 
> Is this a bug in the implementation or a bug in the manual page text?

My bad -- it's a bug in the manual. The actual check (which does work,
there are selftests for this) is:

	if (how->mode & ~S_IALLUGO)
		return -EINVAL;

But when writing the man page I forgot that S_IALLUGO also includes
those bits. Do you want me to send an updated version or would you
prefer to clean it up?
Michael Kerrisk (man-pages) March 30, 2020, 9:36 a.m. UTC | #3
On 3/30/20 11:20 AM, Aleksa Sarai wrote:
> On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> Hello Aleksa,
>>
>> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
>>> Rather than trying to merge the new syscall documentation into open.2
>>> (which would probably result in the man-page being incomprehensible),
>>> instead the new syscall gets its own dedicated page with links between
>>> open(2) and openat2(2) to avoid duplicating information such as the list
>>> of O_* flags or common errors.
>>>
>>> In addition to describing all of the key flags, information about the
>>> extensibility design is provided so that users can better understand why
>>> they need to pass sizeof(struct open_how) and how their programs will
>>> work across kernels. After some discussions with David Laight, I also
>>> included explicit instructions to zero the structure to avoid issues
>>> when recompiling with new headers.>
>>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
>>
>> I'm just editing this page, and have a question on one piece.
>>
>>> +Unlike
>>> +.BR openat (2),
>>> +it is an error to provide
>>> +.BR openat2 ()
>>> +with a
>>> +.I mode
>>> +which contains bits other than
>>> +.IR 0777 ,
>>
>> This piece appears not to be true, both from my reading of the
>> source code, and from testing (i.e., I wrote a a small program that
>> successfully called openat2() and created a file that had the
>> set-UID, set-GID, and sticky bits set).
>>
>> Is this a bug in the implementation or a bug in the manual page text?
> 
> My bad -- it's a bug in the manual. The actual check (which does work,
> there are selftests for this) is:
> 
> 	if (how->mode & ~S_IALLUGO)
> 		return -EINVAL;
> 
> But when writing the man page I forgot that S_IALLUGO also includes
> those bits. Do you want me to send an updated version or would you
> prefer to clean it up?

I'll clean it up.

So, it should say, "bits other than 07777", right?

Thanks,

Michael
Aleksa Sarai March 30, 2020, 9:48 a.m. UTC | #4
On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On 3/30/20 11:20 AM, Aleksa Sarai wrote:
> > On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> Hello Aleksa,
> >>
> >> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> >>> Rather than trying to merge the new syscall documentation into open.2
> >>> (which would probably result in the man-page being incomprehensible),
> >>> instead the new syscall gets its own dedicated page with links between
> >>> open(2) and openat2(2) to avoid duplicating information such as the list
> >>> of O_* flags or common errors.
> >>>
> >>> In addition to describing all of the key flags, information about the
> >>> extensibility design is provided so that users can better understand why
> >>> they need to pass sizeof(struct open_how) and how their programs will
> >>> work across kernels. After some discussions with David Laight, I also
> >>> included explicit instructions to zero the structure to avoid issues
> >>> when recompiling with new headers.>
> >>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >>
> >> I'm just editing this page, and have a question on one piece.
> >>
> >>> +Unlike
> >>> +.BR openat (2),
> >>> +it is an error to provide
> >>> +.BR openat2 ()
> >>> +with a
> >>> +.I mode
> >>> +which contains bits other than
> >>> +.IR 0777 ,
> >>
> >> This piece appears not to be true, both from my reading of the
> >> source code, and from testing (i.e., I wrote a a small program that
> >> successfully called openat2() and created a file that had the
> >> set-UID, set-GID, and sticky bits set).
> >>
> >> Is this a bug in the implementation or a bug in the manual page text?
> > 
> > My bad -- it's a bug in the manual. The actual check (which does work,
> > there are selftests for this) is:
> > 
> > 	if (how->mode & ~S_IALLUGO)
> > 		return -EINVAL;
> > 
> > But when writing the man page I forgot that S_IALLUGO also includes
> > those bits. Do you want me to send an updated version or would you
> > prefer to clean it up?
> 
> I'll clean it up.
> 
> So, it should say, "bits other than 07777", right?

Yes, that would be correct.
Michael Kerrisk (man-pages) March 30, 2020, 8:43 p.m. UTC | #5
Hello Aleksa,

On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> Rather than trying to merge the new syscall documentation into open.2
> (which would probably result in the man-page being incomprehensible),
> instead the new syscall gets its own dedicated page with links between
> open(2) and openat2(2) to avoid duplicating information such as the list
> of O_* flags or common errors.
> 
> In addition to describing all of the key flags, information about the
> extensibility design is provided so that users can better understand why
> they need to pass sizeof(struct open_how) and how their programs will
> work across kernels. After some discussions with David Laight, I also
> included explicit instructions to zero the structure to avoid issues
> when recompiling with new headers.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks. I've applied this patch, but also done quite a lot of
editing of the page. The current draft is below (and also pushed 
to Git). Could I ask you to review the page, to see if I injected
any error during my edits.

In addition, I've added a number of FIXMEs in comments
in the page source. Can you please check these, and let me
know your thoughts.

Cheers,

Michael

.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.TH OPENAT2 2 2019-12-20 "Linux" "Linux Programmer's Manual"
.SH NAME
openat2 \- open and possibly create a file (extended)
.SH SYNOPSIS
.nf
.B #include <sys/types.h>
.B #include <sys/stat.h>
.B #include <fcntl.h>
.B #include <openat2.h>
.PP
.BI "int openat2(int " dirfd ", const char *" pathname ,
.BI "            struct open_how *" how ", size_t " size ");
.fi
.PP
.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
The
.BR openat2 ()
system call is an extension of
.BR openat (2)
and provides a superset of its functionality.
.PP
The
.BR openat2 ()
system call opens the file specified by
.IR pathname .
If the specified file does not exist, it may optionally (if
.B O_CREAT
is specified in
.IR how.flags )
be created.
.PP
As with
.BR openat (2),
if
.I pathname
is a relative pathname, then it is interpreted relative to the
directory referred to by the file descriptor
.I dirfd
(or the current working directory of the calling process, if
.I dirfd
is the special value
.BR AT_FDCWD ).
If
.I pathname
is an absolute pathname, then
.I dirfd
is ignored (unless
.I how.resolve
contains
.BR RESOLVE_IN_ROOT,
in which case
.I pathname
is resolved relative to
.IR dirfd ).
.PP
Rather than taking a single
.I flags
argument, an extensible structure (\fIhow\fP) is passed to allow for
future extensions.
The
.I size
argument must be specified as
.IR "sizeof(struct open_how)" .
.\"
.SS The open_how structure
The
.I how
argument specifies how
.I pathname
should be opened, and acts as a superset of the
.IR flags
and
.IR mode
arguments to
.BR openat (2).
This argument is a pointer to a structure of the following form:
.PP
.in +4n
.EX
struct open_how {
    u64 flags;    /* O_* flags */
    u64 mode;     /* Mode for O_{CREAT,TMPFILE} */
    u64 resolve;  /* RESOLVE_* flags */
    /* ... */
};
.EE
.in
.PP
Any future extensions to
.BR openat2 ()
will be implemented as new fields appended to the above structure,
with a zero value in a new field resulting in the kernel behaving
as though that extension field was not present.
Therefore, the caller
.I must
zero-fill this structure on
initialization.
(See the "Extensibility" section of the
.B NOTES
for more detail on why this is necessary.)
.PP
The fields of the
.I open_how
structure are as follows:
.TP
.I flags
This field specifies
the file creation and file status flags to use when opening the file.
All of the
.B O_*
flags defined for
.BR openat (2)
are valid
.BR openat2 ()
flag values.
.IP
Whereas
.BR openat (2)
ignores unknown bits in its
.I flags
argument,
.BR openat2 ()
returns an error if unknown or conflicting flags are specified in
.IR how.flags .
.TP
.I mode
This field specifies the
mode for the new file, with identical semantics to the
.I mode
argument of
.BR openat (2).
.IP
Whereas
.BR openat (2)
ignores bits other than those in the range
.I 07777
in its
.I mode
argument,
.BR openat2 ()
returns an error if
.I how.mode
contains bits other than
.IR 07777 .
Similarly, an error is returned if
.BR openat2 ()
is called with a non-zero
.IR how.mode
and
.IR how.flags
does not contain
.BR O_CREAT
or
.BR O_TMPFILE .
.TP
.I resolve
This is a bit-mask of flags that modify the way in which
.B all
components of
.I pathname
will be resolved.
(See
.BR path_resolution (7)
for background information.)
.IP
The primary use case for these flags is to allow trusted programs to restrict
how untrusted paths (or paths inside untrusted directories) are resolved.
The full list of
.I resolve
flags is as follows:
.RS
.TP
.B RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including all bind
mounts).
.IP
Applications that employ
this flag are encouraged to make its use configurable (unless it is
used for a specific security purpose), as bind mounts are very widely used by
end-users.
Setting this flag indiscriminately for all uses of
.BR openat2 ()
may result in spurious errors on previously-functional systems.
.\" FIXME I find the "previously-functional systems" in the previous
.\" sentence a little odd (since openat2() ia new sysycall), so I would
.\" like to clarify a little...
.\" Are you referring to the scenario where someone might take an
.\" existing application that uses openat() and replaces the uses
.\" of openat() with openat2()? In which case, is it correct to
.\" understand that you mean that one should not just indiscriminately
.\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
.\" If I'm not on the right track, could you point me in the right
.\" direction please.
.TP
.B RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution.
This option implies
.BR RESOLVE_NO_MAGICLINKS .
.IP
If the trailing component (i.e., basename) of
.I pathname
is a symbolic link, and
.I how.flags
contains both
.BR O_PATH
and
.BR O_NOFOLLOW ,
then an
.B O_PATH
file descriptor referencing the symbolic link will be returned.
.IP
Applications that employ
this flag are encouraged to make its use configurable (unless it is
used for a specific security purpose), as symbolic links are very widely used
by end-users.
Setting this flag indiscriminately for all uses of
.BR openat2 ()
may result in spurious errors on previously-functional systems.
.TP
.B RESOLVE_NO_MAGICLINKS
Disallow all magic-link resolution during path resolution.
.IP
If the trailing component (i.e., basename) of
.I pathname
is a magic link, and
.I how.flags
contains both
.BR O_PATH
and
.BR O_NOFOLLOW ,
then an
.B O_PATH
file descriptor referencing the magic link will be returned.
.IP
Magic links are symbolic link-like objects that are most notably found in
.BR proc (5)
(examples include
.IR /proc/[pid]/exe
and
.IR /proc/[pid]/fd/* ).
Due to the potential danger of unknowingly opening these magic links,
it may be
preferable for users to disable their resolution entirely.
.\" FIXME: what specific details in symlink(7) are being referred
.\" by the following sentence? It's not clear.
(See
.BR symlink (7)
for more details.)
.TP
.B RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the resolution
is not a descendant of the directory indicated by
.IR dirfd .
This causes absolute symbolic links (and absolute values of
.IR pathname )
to be rejected.
.IP
Currently, this flag also disables magic-link resolution.
However, this may change in the future.
Therefore, to ensure that magic links are not resolved,
the caller should explicitly specify
.BR RESOLVE_NO_MAGICLINKS .
.TP
.B RESOLVE_IN_ROOT
Treat the directory referred to by
.I dirfd
as the root directory while resolving
.IR pathname .
.\" FIXME I found the following hard to understand (in particular, the
.\" meaning of "scoped" is unclear) , and reworded as below. Is it okay?
.\"     Absolute symbolic links and ".." path components will be scoped to
.\"     .IR dirfd .
Absolute symbolic links are interpreted relative to
.IR dirfd .
If a prefix component of
.I pathname
equates to
.IR dirfd ,
then an immediately following
.IR ..
component likewise equates to
.IR dirfd
(just as
.I /..
is traditionally equivalent to
.IR / ).
If
.I pathname
is an absolute path, it is also interpreted relative to
.IR dirfd .
.IP
The effect of this flag is as though the calling process had used
.BR chroot (2)
to (temporarily) modify its root directory (to the directory
referred to by
.IR dirfd ).
However, unlike
.BR chroot (2)
(which changes the filesystem root permanently for a process),
.B RESOLVE_IN_ROOT
allows a program to efficiently restrict path resolution on a per-open basis.
.\" FIXME The next piece is unclear (to me). What kind of ".." escape
.\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?
The
.B RESOLVE_IN_ROOT
flag also has several hardening features
(such as detecting escape attempts during
.I ".."
resolution) which
.BR chroot (2)
does not.
.IP
Currently, this flag also disables magic-link resolution.
However, this may change in the future.
Therefore, to ensure that magic links are not resolved,
the caller should explicitly specify
.BR RESOLVE_NO_MAGICLINKS .
.RE
.IP
If any bits other than those listed above are set in
.IR how.resolve ,
an error is returned.
.SH RETURN VALUE
On success, a new file descriptor is returned.
On error, \-1 is returned, and
.I errno
is set appropriately.
.SH ERRORS
The set of errors returned by
.BR openat2 ()
includes all of the errors returned by
.BR openat (2),
as well as the following additional errors:
.TP
.B E2BIG
An extension that this kernel does not support was specified in
.IR how .
(See the "Extensibility" section of
.B NOTES
for more detail on how extensions are handled.)
.TP
.B EAGAIN
.I how.resolve
contains either
.BR RESOLVE_IN_ROOT
or
.BR RESOLVE_BENEATH ,
and the kernel could not ensure that a ".." component didn't escape (due to a
race condition or potential attack).
The caller may choose to retry the
.BR openat2 ()
call.
.TP
.B EINVAL
An unknown flag or invalid value was specified in
.IR how .
.TP
.B EINVAL
.I mode
is non-zero, but
.I how.flags
does not contain
.BR O_CREAT
or
.BR O_TMPFILE .
.TP
.B EINVAL
.I size
was smaller than any known version of
.IR "struct open_how" .
.TP
.B ELOOP
.I how.resolve
contains
.BR RESOLVE_NO_SYMLINKS ,
and one of the path components was a symbolic link (or magic link).
.TP
.B ELOOP
.I how.resolve
contains
.BR RESOLVE_NO_MAGICLINKS ,
and one of the path components was a magic link.
.TP
.B EXDEV
.I how.resolve
contains either
.BR RESOLVE_IN_ROOT
or
.BR RESOLVE_BENEATH ,
and an escape from the root during path resolution was detected.
.TP
.B EXDEV
.I how.resolve
contains
.BR RESOLVE_NO_XDEV ,
and a path component crosses a mount point.
.SH VERSIONS
.BR openat2 ()
first appeared in Linux 5.6.
.SH CONFORMING TO
This system call is Linux-specific.
.PP
The semantics of
.B RESOLVE_BENEATH
were modeled after FreeBSD's
.BR O_BENEATH .
.SH NOTES
Glibc does not provide a wrapper for this system call; call it using
.BR syscall (2).
.\"
.SS Extensibility
In order to allow for future extensibility,
.BR openat2 ()
requires the user-space application to specify the size of the
.I open_how
structure that it is passing.
By providing this information, it is possible for
.BR openat2 ()
to provide both forwards- and backwards-compatibility, with
.I size
acting as an implicit version number.
(Because new extension fields will always
be appended, the structure size will always increase.)
This extensibility design is very similar to other system calls such as
.BR perf_setattr (2),
.BR perf_event_open (2),
and
.BR clone3 (2).
.PP
If we let
.I usize
be the size of the structure as specified by the user-space application, and
.I ksize
be the size of the structure which the kernel supports, then there are
three cases to consider:
.IP \(bu 2
If
.IR ksize
equals
.IR usize ,
then there is no version mismatch and
.I how
can be used verbatim.
.IP \(bu
If
.IR ksize
is larger than
.IR usize ,
then there are some extension fields that the kernel supports
which the user-space application
is unaware of.
Because a zero value in any added extension field signifies a no-op,
the kernel
treats all of the extension fields not provided by the user-space application
as having zero values.
This provides backwards-compatibility.
.IP \(bu
If
.IR ksize
is smaller than
.IR usize ,
then there are some extension fields which the user-space application
is aware of but which the kernel does not support.
Because any extension field must have its zero values signify a no-op,
the kernel can
safely ignore the unsupported extension fields if they are all-zero.
If any unsupported extension fields are non-zero, then \-1 is returned and
.I errno
is set to
.BR E2BIG .
This provides forwards-compatibility.
.PP
Because the definition of
.I struct open_how
may change in the future (with new fields being added when system headers are
updated), user-space applications should zero-fill
.I struct open_how
to ensure that recompiling the program with new headers will not result in
spurious errors at runtime.
The simplest way is to use a designated
initializer:
.PP
.in +4n
.EX
struct open_how how = { .flags = O_RDWR,
                        .resolve = RESOLVE_IN_ROOT };
.EE
.in
.PP
or explicitly using
.BR memset (3)
or similar:
.PP
.in +4n
.EX
struct open_how how;
memset(&how, 0, sizeof(how));
how.flags = O_RDWR;
how.resolve = RESOLVE_IN_ROOT;
.EE
.in
.PP
A user-space application that wishes to determine which extensions
the running kernel supports can do so by conducting a binary search on
.IR size
with a structure which has every byte nonzero (to find the largest value
which doesn't produce an error of
.BR E2BIG ).
.SH SEE ALSO
.BR openat (2),
.BR path_resolution (7),
.BR symlink (7)
Aleksa Sarai March 31, 2020, 2:39 p.m. UTC | #6
On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> Hello Aleksa,
> 
> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
> > 
> > In addition to describing all of the key flags, information about the
> > extensibility design is provided so that users can better understand why
> > they need to pass sizeof(struct open_how) and how their programs will
> > work across kernels. After some discussions with David Laight, I also
> > included explicit instructions to zero the structure to avoid issues
> > when recompiling with new headers.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Thanks. I've applied this patch, but also done quite a lot of
> editing of the page. The current draft is below (and also pushed 
> to Git). Could I ask you to review the page, to see if I injected
> any error during my edits.

Looks good to me.

> In addition, I've added a number of FIXMEs in comments
> in the page source. Can you please check these, and let me
> know your thoughts.

Will do, see below.

> .\" FIXME I find the "previously-functional systems" in the previous
> .\" sentence a little odd (since openat2() ia new sysycall), so I would
> .\" like to clarify a little...
> .\" Are you referring to the scenario where someone might take an
> .\" existing application that uses openat() and replaces the uses
> .\" of openat() with openat2()? In which case, is it correct to
> .\" understand that you mean that one should not just indiscriminately
> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> .\" If I'm not on the right track, could you point me in the right
> .\" direction please.

This is mostly meant as a warning to hopefully avoid applications
because the developer didn't realise that system paths may contain
symlinks or bind-mounts. For an application which has switched to
openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
it's possible that on some distributions (or future versions of a
distribution) that their application will stop working because a system
path suddenly contains a symlink or is a bind-mount.

This was a concern which was brought up on LWN some time ago. If you can
think of a phrasing that makes this more clear, I'd appreciate it.

> .\" FIXME: what specific details in symlink(7) are being referred
> .\" by the following sentence? It's not clear.

The section on magic-links, but you're right that the sentence ordering
is a bit odd. It should probably go after the first sentence.

> .\" FIXME I found the following hard to understand (in particular, the
> .\" meaning of "scoped" is unclear) , and reworded as below. Is it okay?
> .\"     Absolute symbolic links and ".." path components will be scoped to
> .\"     .IR dirfd .

Scoped does broadly mean "interpreted relative to", though the
difference is mainly that when I said scoped it's meant to be more of an
assertive claim ("the kernel promises to always treat this path inside
dirfd"). But "interpreted relative to" is a clearer way of phrasing the
semantics, so I'm okay with this change.

> .\" FIXME The next piece is unclear (to me). What kind of ".." escape
> .\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?

If the root is moved, you can escape from a chroot(2). But this sentence
might not really belong in a man-page since it's describing (important)
aspects of the implementation and not the semantics.
Michael Kerrisk (man-pages) April 1, 2020, 6:38 a.m. UTC | #7
Hello Aleksa,

On 3/31/20 4:39 PM, Aleksa Sarai wrote:
> On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> Hello Aleksa,
>>
>> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
>>> Rather than trying to merge the new syscall documentation into open.2
>>> (which would probably result in the man-page being incomprehensible),
>>> instead the new syscall gets its own dedicated page with links between
>>> open(2) and openat2(2) to avoid duplicating information such as the list
>>> of O_* flags or common errors.
>>>
>>> In addition to describing all of the key flags, information about the
>>> extensibility design is provided so that users can better understand why
>>> they need to pass sizeof(struct open_how) and how their programs will
>>> work across kernels. After some discussions with David Laight, I also
>>> included explicit instructions to zero the structure to avoid issues
>>> when recompiling with new headers.
>>>
>>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
>>
>> Thanks. I've applied this patch, but also done quite a lot of
>> editing of the page. The current draft is below (and also pushed 
>> to Git). Could I ask you to review the page, to see if I injected
>> any error during my edits.
> 
> Looks good to me.
> 
>> In addition, I've added a number of FIXMEs in comments
>> in the page source. Can you please check these, and let me
>> know your thoughts.
> 
> Will do, see below.
> 
>> .\" FIXME I find the "previously-functional systems" in the previous
>> .\" sentence a little odd (since openat2() ia new sysycall), so I would
>> .\" like to clarify a little...
>> .\" Are you referring to the scenario where someone might take an
>> .\" existing application that uses openat() and replaces the uses
>> .\" of openat() with openat2()? In which case, is it correct to
>> .\" understand that you mean that one should not just indiscriminately
>> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
>> .\" If I'm not on the right track, could you point me in the right
>> .\" direction please.
> 
> This is mostly meant as a warning to hopefully avoid applications
> because the developer didn't realise that system paths may contain
> symlinks or bind-mounts. For an application which has switched to
> openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
> it's possible that on some distributions (or future versions of a
> distribution) that their application will stop working because a system
> path suddenly contains a symlink or is a bind-mount.
> 
> This was a concern which was brought up on LWN some time ago. If you can
> think of a phrasing that makes this more clear, I'd appreciate it.

Thanks. I've made the text:

                     Applications  that  employ  the RESOLVE_NO_XDEV flag
                     are encouraged to make its use configurable  (unless
                     it is used for a specific security purpose), as bind
                     mounts are widely used by end-users.   Setting  this
                     flag indiscriminately—i.e., for purposes not specif‐
                     ically related to security—for all uses of openat2()
                     may  result  in  spurious errors on previously-func‐
                     tional systems.  This may occur if, for  example,  a
                     system  pathname  that  is used by an application is
                     modified (e.g., in a new  distribution  release)  so
                     that  a  pathname  component  (now)  contains a bind
                     mount.

Okay?

>> .\" FIXME: what specific details in symlink(7) are being referred
>> .\" by the following sentence? It's not clear.
> 
> The section on magic-links, but you're right that the sentence ordering
> is a bit odd. It should probably go after the first sentence.

I must admit that I'm still confused. There's only the briefest of 
mentions of magic links in symlink(7). Perhaps that needs to be fixed?

And, while I think of it, the text just preceding that FIXME says:

    Due to the potential danger of unknowingly opening 
    these magic links, it may be preferable for users to 
    disable their resolution entirely.

This sentence reads a little strangely. Could you please give me some
concrete examples, and I will try rewording that sentence a bit.

>> .\" FIXME I found the following hard to understand (in particular, the
>> .\" meaning of "scoped" is unclear) , and reworded as below. Is it okay?
>> .\"     Absolute symbolic links and ".." path components will be scoped to
>> .\"     .IR dirfd .
> 
> Scoped does broadly mean "interpreted relative to", though the
> difference is mainly that when I said scoped it's meant to be more of an
> assertive claim ("the kernel promises to always treat this path inside
> dirfd"). But "interpreted relative to" is a clearer way of phrasing the
> semantics, so I'm okay with this change.

Okay.

>> .\" FIXME The next piece is unclear (to me). What kind of ".." escape
>> .\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?
> 
> If the root is moved, you can escape from a chroot(2). But this sentence
> might not really belong in a man-page since it's describing (important)
> aspects of the implementation and not the semantics.

So, should I just remove the sentence?

Thanks,

Michael
Michael Kerrisk (man-pages) April 8, 2020, 9:29 p.m. UTC | #8
Hi Aleksa,

Ping!

Cheers,

Michael

On Wed, 1 Apr 2020 at 08:38, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello Aleksa,
>
> On 3/31/20 4:39 PM, Aleksa Sarai wrote:
> > On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> Hello Aleksa,
> >>
> >> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> >>> Rather than trying to merge the new syscall documentation into open.2
> >>> (which would probably result in the man-page being incomprehensible),
> >>> instead the new syscall gets its own dedicated page with links between
> >>> open(2) and openat2(2) to avoid duplicating information such as the list
> >>> of O_* flags or common errors.
> >>>
> >>> In addition to describing all of the key flags, information about the
> >>> extensibility design is provided so that users can better understand why
> >>> they need to pass sizeof(struct open_how) and how their programs will
> >>> work across kernels. After some discussions with David Laight, I also
> >>> included explicit instructions to zero the structure to avoid issues
> >>> when recompiling with new headers.
> >>>
> >>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >>
> >> Thanks. I've applied this patch, but also done quite a lot of
> >> editing of the page. The current draft is below (and also pushed
> >> to Git). Could I ask you to review the page, to see if I injected
> >> any error during my edits.
> >
> > Looks good to me.
> >
> >> In addition, I've added a number of FIXMEs in comments
> >> in the page source. Can you please check these, and let me
> >> know your thoughts.
> >
> > Will do, see below.
> >
> >> .\" FIXME I find the "previously-functional systems" in the previous
> >> .\" sentence a little odd (since openat2() ia new sysycall), so I would
> >> .\" like to clarify a little...
> >> .\" Are you referring to the scenario where someone might take an
> >> .\" existing application that uses openat() and replaces the uses
> >> .\" of openat() with openat2()? In which case, is it correct to
> >> .\" understand that you mean that one should not just indiscriminately
> >> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> >> .\" If I'm not on the right track, could you point me in the right
> >> .\" direction please.
> >
> > This is mostly meant as a warning to hopefully avoid applications
> > because the developer didn't realise that system paths may contain
> > symlinks or bind-mounts. For an application which has switched to
> > openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
> > it's possible that on some distributions (or future versions of a
> > distribution) that their application will stop working because a system
> > path suddenly contains a symlink or is a bind-mount.
> >
> > This was a concern which was brought up on LWN some time ago. If you can
> > think of a phrasing that makes this more clear, I'd appreciate it.
>
> Thanks. I've made the text:
>
>                      Applications  that  employ  the RESOLVE_NO_XDEV flag
>                      are encouraged to make its use configurable  (unless
>                      it is used for a specific security purpose), as bind
>                      mounts are widely used by end-users.   Setting  this
>                      flag indiscriminately—i.e., for purposes not specif‐
>                      ically related to security—for all uses of openat2()
>                      may  result  in  spurious errors on previously-func‐
>                      tional systems.  This may occur if, for  example,  a
>                      system  pathname  that  is used by an application is
>                      modified (e.g., in a new  distribution  release)  so
>                      that  a  pathname  component  (now)  contains a bind
>                      mount.
>
> Okay?
>
> >> .\" FIXME: what specific details in symlink(7) are being referred
> >> .\" by the following sentence? It's not clear.
> >
> > The section on magic-links, but you're right that the sentence ordering
> > is a bit odd. It should probably go after the first sentence.
>
> I must admit that I'm still confused. There's only the briefest of
> mentions of magic links in symlink(7). Perhaps that needs to be fixed?
>
> And, while I think of it, the text just preceding that FIXME says:
>
>     Due to the potential danger of unknowingly opening
>     these magic links, it may be preferable for users to
>     disable their resolution entirely.
>
> This sentence reads a little strangely. Could you please give me some
> concrete examples, and I will try rewording that sentence a bit.
>
> >> .\" FIXME I found the following hard to understand (in particular, the
> >> .\" meaning of "scoped" is unclear) , and reworded as below. Is it okay?
> >> .\"     Absolute symbolic links and ".." path components will be scoped to
> >> .\"     .IR dirfd .
> >
> > Scoped does broadly mean "interpreted relative to", though the
> > difference is mainly that when I said scoped it's meant to be more of an
> > assertive claim ("the kernel promises to always treat this path inside
> > dirfd"). But "interpreted relative to" is a clearer way of phrasing the
> > semantics, so I'm okay with this change.
>
> Okay.
>
> >> .\" FIXME The next piece is unclear (to me). What kind of ".." escape
> >> .\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?
> >
> > If the root is moved, you can escape from a chroot(2). But this sentence
> > might not really belong in a man-page since it's describing (important)
> > aspects of the implementation and not the semantics.
>
> So, should I just remove the sentence?
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
Aleksa Sarai April 12, 2020, 4:49 p.m. UTC | #9
Sorry, I could've sworn I responded when you posted this -- comments
below. And sorry for not getting back to you before the 5.06 release.

On 2020-04-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On 3/31/20 4:39 PM, Aleksa Sarai wrote:
> > On 2020-03-30, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> On 2/2/20 4:19 PM, Aleksa Sarai wrote:
> >>> Rather than trying to merge the new syscall documentation into open.2
> >>> (which would probably result in the man-page being incomprehensible),
> >>> instead the new syscall gets its own dedicated page with links between
> >>> open(2) and openat2(2) to avoid duplicating information such as the list
> >>> of O_* flags or common errors.
> >>>
> >>> In addition to describing all of the key flags, information about the
> >>> extensibility design is provided so that users can better understand why
> >>> they need to pass sizeof(struct open_how) and how their programs will
> >>> work across kernels. After some discussions with David Laight, I also
> >>> included explicit instructions to zero the structure to avoid issues
> >>> when recompiling with new headers.
> >>>
> >>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >>
> >> Thanks. I've applied this patch, but also done quite a lot of
> >> editing of the page. The current draft is below (and also pushed 
> >> to Git). Could I ask you to review the page, to see if I injected
> >> any error during my edits.
> > 
> > Looks good to me.
> > 
> >> In addition, I've added a number of FIXMEs in comments
> >> in the page source. Can you please check these, and let me
> >> know your thoughts.
> > 
> > Will do, see below.
> > 
> >> .\" FIXME I find the "previously-functional systems" in the previous
> >> .\" sentence a little odd (since openat2() ia new sysycall), so I would
> >> .\" like to clarify a little...
> >> .\" Are you referring to the scenario where someone might take an
> >> .\" existing application that uses openat() and replaces the uses
> >> .\" of openat() with openat2()? In which case, is it correct to
> >> .\" understand that you mean that one should not just indiscriminately
> >> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> >> .\" If I'm not on the right track, could you point me in the right
> >> .\" direction please.
> > 
> > This is mostly meant as a warning to hopefully avoid applications
> > because the developer didn't realise that system paths may contain
> > symlinks or bind-mounts. For an application which has switched to
> > openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
> > it's possible that on some distributions (or future versions of a
> > distribution) that their application will stop working because a system
> > path suddenly contains a symlink or is a bind-mount.
> > 
> > This was a concern which was brought up on LWN some time ago. If you can
> > think of a phrasing that makes this more clear, I'd appreciate it.
> 
> Thanks. I've made the text:
> 
>                      Applications  that  employ  the RESOLVE_NO_XDEV flag
>                      are encouraged to make its use configurable  (unless
>                      it is used for a specific security purpose), as bind
>                      mounts are widely used by end-users.   Setting  this
>                      flag indiscriminately—i.e., for purposes not specif‐
>                      ically related to security—for all uses of openat2()
>                      may  result  in  spurious errors on previously-func‐
>                      tional systems.  This may occur if, for  example,  a
>                      system  pathname  that  is used by an application is
>                      modified (e.g., in a new  distribution  release)  so
>                      that  a  pathname  component  (now)  contains a bind
>                      mount.
> 
> Okay?

Yup, and the same text should be used for the same warning I gave for
RESOLVE_NO_SYMLINKS (for the same reason, because system paths may
switch to symlinks -- the prime example being what Arch Linux did
several years ago).

> >> .\" FIXME: what specific details in symlink(7) are being referred
> >> .\" by the following sentence? It's not clear.
> > 
> > The section on magic-links, but you're right that the sentence ordering
> > is a bit odd. It should probably go after the first sentence.
> 
> I must admit that I'm still confused. There's only the briefest of 
> mentions of magic links in symlink(7). Perhaps that needs to be fixed?

It wouldn't hurt to add a longer description of magic-links in
symlink(7). I'll send you a small patch to beef up the description (I
had planned to include a longer rewrite with the O_EMPTYPATH patches but
those require quite a bit more work to land).

> And, while I think of it, the text just preceding that FIXME says:
> 
>     Due to the potential danger of unknowingly opening 
>     these magic links, it may be preferable for users to 
>     disable their resolution entirely.
> 
> This sentence reads a little strangely. Could you please give me some
> concrete examples, and I will try rewording that sentence a bit.

The primary example is that certain files (such as tty devices) are
best not opened by an unsuspecting program (if you do not have a
controlling TTY, and you open such a file that console becomes your
controlling TTY unless you use O_NOCTTY).

But more generally, magic-links allow programs to be "beamed" all over
the system (bypassing ordinary mount namespace restrictions). Since they
are fairly rarely used intentionally by most programs, this is more of a
tip to programmers that maybe they should play it safe and disallow
magic-links unless they are expecting to have to use them.

> >> .\" FIXME I found the following hard to understand (in particular, the
> >> .\" meaning of "scoped" is unclear) , and reworded as below. Is it okay?
> >> .\"     Absolute symbolic links and ".." path components will be scoped to
> >> .\"     .IR dirfd .
> > 
> > Scoped does broadly mean "interpreted relative to", though the
> > difference is mainly that when I said scoped it's meant to be more of an
> > assertive claim ("the kernel promises to always treat this path inside
> > dirfd"). But "interpreted relative to" is a clearer way of phrasing the
> > semantics, so I'm okay with this change.
> 
> Okay.
> 
> >> .\" FIXME The next piece is unclear (to me). What kind of ".." escape
> >> .\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?
> > 
> > If the root is moved, you can escape from a chroot(2). But this sentence
> > might not really belong in a man-page since it's describing (important)
> > aspects of the implementation and not the semantics.
> 
> So, should I just remove the sentence?

Yup, sounds reasonable.
Michael Kerrisk (man-pages) April 13, 2020, 7:22 a.m. UTC | #10
Hello Aleksa,

On 4/12/20 6:49 PM, Aleksa Sarai wrote:
> Sorry, I could've sworn I responded when you posted this -- comments
> below. And sorry for not getting back to you before the 5.06 release.

No worries and ahanks for your feedback below.

[...]

>>>> .\" FIXME I find the "previously-functional systems" in the previous
>>>> .\" sentence a little odd (since openat2() ia new sysycall), so I would
>>>> .\" like to clarify a little...
>>>> .\" Are you referring to the scenario where someone might take an
>>>> .\" existing application that uses openat() and replaces the uses
>>>> .\" of openat() with openat2()? In which case, is it correct to
>>>> .\" understand that you mean that one should not just indiscriminately
>>>> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
>>>> .\" If I'm not on the right track, could you point me in the right
>>>> .\" direction please.
>>>
>>> This is mostly meant as a warning to hopefully avoid applications
>>> because the developer didn't realise that system paths may contain
>>> symlinks or bind-mounts. For an application which has switched to
>>> openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
>>> it's possible that on some distributions (or future versions of a
>>> distribution) that their application will stop working because a system
>>> path suddenly contains a symlink or is a bind-mount.
>>>
>>> This was a concern which was brought up on LWN some time ago. If you can
>>> think of a phrasing that makes this more clear, I'd appreciate it.
>>
>> Thanks. I've made the text:
>>
>>                      Applications  that  employ  the RESOLVE_NO_XDEV flag
>>                      are encouraged to make its use configurable  (unless
>>                      it is used for a specific security purpose), as bind
>>                      mounts are widely used by end-users.   Setting  this
>>                      flag indiscriminately—i.e., for purposes not specif‐
>>                      ically related to security—for all uses of openat2()
>>                      may  result  in  spurious errors on previously-func‐
>>                      tional systems.  This may occur if, for  example,  a
>>                      system  pathname  that  is used by an application is
>>                      modified (e.g., in a new  distribution  release)  so
>>                      that  a  pathname  component  (now)  contains a bind
>>                      mount.
>>
>> Okay?
> 
> Yup,

Thanks.

> and the same text should be used for the same warning I gave for
> RESOLVE_NO_SYMLINKS (for the same reason, because system paths may
> switch to symlinks -- the prime example being what Arch Linux did
> several years ago).

Okay -- I added similar text to RESOLVE_NO_SYMLINKS.

>>>> .\" FIXME: what specific details in symlink(7) are being referred
>>>> .\" by the following sentence? It's not clear.
>>>
>>> The section on magic-links, but you're right that the sentence ordering
>>> is a bit odd. It should probably go after the first sentence.
>>
>> I must admit that I'm still confused. There's only the briefest of 
>> mentions of magic links in symlink(7). Perhaps that needs to be fixed?
> 
> It wouldn't hurt to add a longer description of magic-links in
> symlink(7). I'll send you a small patch to beef up the description (I
> had planned to include a longer rewrite with the O_EMPTYPATH patches but
> those require quite a bit more work to land).

That would be great. Thank you!

>> And, while I think of it, the text just preceding that FIXME says:
>>
>>     Due to the potential danger of unknowingly opening 
>>     these magic links, it may be preferable for users to 
>>     disable their resolution entirely.
>>
>> This sentence reads a little strangely. Could you please give me some
>> concrete examples, and I will try rewording that sentence a bit.
> 
> The primary example is that certain files (such as tty devices) are
> best not opened by an unsuspecting program (if you do not have a
> controlling TTY, and you open such a file that console becomes your
> controlling TTY unless you use O_NOCTTY).
> 
> But more generally, magic-links allow programs to be "beamed" all over
> the system (bypassing ordinary mount namespace restrictions). Since they
> are fairly rarely used intentionally by most programs, this is more of a
> tip to programmers that maybe they should play it safe and disallow
> magic-links unless they are expecting to have to use them.


I've reworked the text on RESOLVE_NO_MAGICLINKS substantially:

       RESOLVE_NO_MAGICLINKS
              Disallow all magic-link resolution during path reso‐
              lution.

              Magic links are symbolic link-like objects that  are
              most  notably  found  in  proc(5);  examples include
              /proc/[pid]/exe  and  /proc/[pid]/fd/*.   (See  sym‐
              link(7) for more details.)

              Unknowingly  opening  magic  links  can be risky for
              some applications.  Examples of such  risks  include
              the following:

              · If the process opening a pathname is a controlling
                process that currently has no controlling terminal
                (see  credentials(7)),  then  opening a magic link
                inside /proc/[pid]/fd that happens to refer  to  a
                terminal would cause the process to acquire a con‐
                trolling terminal.

              · In  a  containerized  environment,  a  magic  link
                inside  /proc  may  refer to an object outside the
                container, and thus may provide a means to  escape
                from the container.

[The above example derives from https://lwn.net/Articles/796868/]

              Because  of such risks, an application may prefer to
              disable   magic   link    resolution    using    the
              RESOLVE_NO_MAGICLINKS flag.

              If  the trailing component (i.e., basename) of path‐
              name is a magic link, and  how.flags  contains  both
              O_PATH  and O_NOFOLLOW, then an O_PATH file descrip‐
              tor referencing the magic link will be returned.

How does the above look?

Also, regarding the last paragraph, I  have a question.  The
text doesn't seem quite to relate to the rest of the discussion.
Should it be saying something like:

If the trailing component (i.e., basename) of pathname is a magic link,
**how.resolve contains RESOLVE_NO_MAGICLINKS,**
and how.flags contains both O_PATH and O_NOFOLLOW, then an O_PATH
file descriptor referencing the magic link will be returned.

?

[...]

>>>> .\" FIXME The next piece is unclear (to me). What kind of ".." escape
>>>> .\" attempts does chroot() not detect that RESOLVE_IN_ROOT does?
>>>
>>> If the root is moved, you can escape from a chroot(2). But this sentence
>>> might not really belong in a man-page since it's describing (important)
>>> aspects of the implementation and not the semantics.
>>
>> So, should I just remove the sentence?
> 
> Yup, sounds reasonable.

Done.

Thanks,

Michael
Aleksa Sarai April 14, 2020, 10:35 a.m. UTC | #11
On 2020-04-13, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >>>> .\" FIXME I find the "previously-functional systems" in the previous
> >>>> .\" sentence a little odd (since openat2() ia new sysycall), so I would
> >>>> .\" like to clarify a little...
> >>>> .\" Are you referring to the scenario where someone might take an
> >>>> .\" existing application that uses openat() and replaces the uses
> >>>> .\" of openat() with openat2()? In which case, is it correct to
> >>>> .\" understand that you mean that one should not just indiscriminately
> >>>> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> >>>> .\" If I'm not on the right track, could you point me in the right
> >>>> .\" direction please.
> >>>
> >>> This is mostly meant as a warning to hopefully avoid applications
> >>> because the developer didn't realise that system paths may contain
> >>> symlinks or bind-mounts. For an application which has switched to
> >>> openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
> >>> it's possible that on some distributions (or future versions of a
> >>> distribution) that their application will stop working because a system
> >>> path suddenly contains a symlink or is a bind-mount.
> >>>
> >>> This was a concern which was brought up on LWN some time ago. If you can
> >>> think of a phrasing that makes this more clear, I'd appreciate it.
> >>
> >> Thanks. I've made the text:
> >>
> >>                      Applications  that  employ  the RESOLVE_NO_XDEV flag
> >>                      are encouraged to make its use configurable  (unless
> >>                      it is used for a specific security purpose), as bind
> >>                      mounts are widely used by end-users.   Setting  this
> >>                      flag indiscriminately—i.e., for purposes not specif‐
> >>                      ically related to security—for all uses of openat2()
> >>                      may  result  in  spurious errors on previously-func‐
> >>                      tional systems.  This may occur if, for  example,  a
> >>                      system  pathname  that  is used by an application is
> >>                      modified (e.g., in a new  distribution  release)  so
> >>                      that  a  pathname  component  (now)  contains a bind
> >>                      mount.
> >>
> >> Okay?
> > 
> > Yup,
> 
> Thanks.
> 
> > and the same text should be used for the same warning I gave for
> > RESOLVE_NO_SYMLINKS (for the same reason, because system paths may
> > switch to symlinks -- the prime example being what Arch Linux did
> > several years ago).
> 
> Okay -- I added similar text to RESOLVE_NO_SYMLINKS.

Much appreciated.

> >>>> .\" FIXME: what specific details in symlink(7) are being referred
> >>>> .\" by the following sentence? It's not clear.
> >>>
> >>> The section on magic-links, but you're right that the sentence ordering
> >>> is a bit odd. It should probably go after the first sentence.
> >>
> >> I must admit that I'm still confused. There's only the briefest of 
> >> mentions of magic links in symlink(7). Perhaps that needs to be fixed?
> > 
> > It wouldn't hurt to add a longer description of magic-links in
> > symlink(7). I'll send you a small patch to beef up the description (I
> > had planned to include a longer rewrite with the O_EMPTYPATH patches but
> > those require quite a bit more work to land).
> 
> That would be great. Thank you!

I'll cook something up later this week.

> >> And, while I think of it, the text just preceding that FIXME says:
> >>
> >>     Due to the potential danger of unknowingly opening 
> >>     these magic links, it may be preferable for users to 
> >>     disable their resolution entirely.
> >>
> >> This sentence reads a little strangely. Could you please give me some
> >> concrete examples, and I will try rewording that sentence a bit.
> > 
> > The primary example is that certain files (such as tty devices) are
> > best not opened by an unsuspecting program (if you do not have a
> > controlling TTY, and you open such a file that console becomes your
> > controlling TTY unless you use O_NOCTTY).
> > 
> > But more generally, magic-links allow programs to be "beamed" all over
> > the system (bypassing ordinary mount namespace restrictions). Since they
> > are fairly rarely used intentionally by most programs, this is more of a
> > tip to programmers that maybe they should play it safe and disallow
> > magic-links unless they are expecting to have to use them.
> 
> 
> I've reworked the text on RESOLVE_NO_MAGICLINKS substantially:
> 
>        RESOLVE_NO_MAGICLINKS
>               Disallow all magic-link resolution during path reso‐
>               lution.
> 
>               Magic links are symbolic link-like objects that  are
>               most  notably  found  in  proc(5);  examples include
>               /proc/[pid]/exe  and  /proc/[pid]/fd/*.   (See  sym‐
>               link(7) for more details.)
> 
>               Unknowingly  opening  magic  links  can be risky for
>               some applications.  Examples of such  risks  include
>               the following:
> 
>               · If the process opening a pathname is a controlling
>                 process that currently has no controlling terminal
>                 (see  credentials(7)),  then  opening a magic link
>                 inside /proc/[pid]/fd that happens to refer  to  a
>                 terminal would cause the process to acquire a con‐
>                 trolling terminal.
> 
>               · In  a  containerized  environment,  a  magic  link
>                 inside  /proc  may  refer to an object outside the
>                 container, and thus may provide a means to  escape
>                 from the container.
> 
> [The above example derives from https://lwn.net/Articles/796868/]
> 
>               Because  of such risks, an application may prefer to
>               disable   magic   link    resolution    using    the
>               RESOLVE_NO_MAGICLINKS flag.
> 
>               If  the trailing component (i.e., basename) of path‐
>               name is a magic link, and  how.flags  contains  both
>               O_PATH  and O_NOFOLLOW, then an O_PATH file descrip‐
>               tor referencing the magic link will be returned.
> 
> How does the above look?

The changes look correct, though you could end up going through procfs
even if you weren't resolving a path inside proc directly (since you can
bind-mount symlinks or have a symlink to procfs). But I'm not sure if
it's necessary to outline all the ways a program could be tricked into
doing something unintended.

> Also, regarding the last paragraph, I  have a question.  The
> text doesn't seem quite to relate to the rest of the discussion.
> Should it be saying something like:
> 
> If the trailing component (i.e., basename) of pathname is a magic link,
> **how.resolve contains RESOLVE_NO_MAGICLINKS,**
> and how.flags contains both O_PATH and O_NOFOLLOW, then an O_PATH
> file descriptor referencing the magic link will be returned.
> 
> ?

Yes, that is what I meant to write -- and I believe that the
RESOLVE_NO_SYMLINKS section is missing similar text in the second
paragraph (except it should refer to RESOLVE_NO_SYMLINKS, obviously).

Thanks!
Michael Kerrisk (man-pages) April 15, 2020, 8:24 p.m. UTC | #12
Hello Aleksa,

On Tue, 14 Apr 2020 at 12:35, Aleksa Sarai <asarai@suse.de> wrote:

[...]

> > >> I must admit that I'm still confused. There's only the briefest of
> > >> mentions of magic links in symlink(7). Perhaps that needs to be fixed?
> > >
> > > It wouldn't hurt to add a longer description of magic-links in
> > > symlink(7). I'll send you a small patch to beef up the description (I
> > > had planned to include a longer rewrite with the O_EMPTYPATH patches but
> > > those require quite a bit more work to land).
> >
> > That would be great. Thank you!
>
> I'll cook something up later this week.

Thank you!

[...]

> > I've reworked the text on RESOLVE_NO_MAGICLINKS substantially:
> >
> >        RESOLVE_NO_MAGICLINKS
> >               Disallow all magic-link resolution during path reso‐
> >               lution.
> >
> >               Magic links are symbolic link-like objects that  are
> >               most  notably  found  in  proc(5);  examples include
> >               /proc/[pid]/exe  and  /proc/[pid]/fd/*.   (See  sym‐
> >               link(7) for more details.)
> >
> >               Unknowingly  opening  magic  links  can be risky for
> >               some applications.  Examples of such  risks  include
> >               the following:
> >
> >               · If the process opening a pathname is a controlling
> >                 process that currently has no controlling terminal
> >                 (see  credentials(7)),  then  opening a magic link
> >                 inside /proc/[pid]/fd that happens to refer  to  a
> >                 terminal would cause the process to acquire a con‐
> >                 trolling terminal.
> >
> >               · In  a  containerized  environment,  a  magic  link
> >                 inside  /proc  may  refer to an object outside the
> >                 container, and thus may provide a means to  escape
> >                 from the container.
> >
> > [The above example derives from https://lwn.net/Articles/796868/]
> >
> >               Because  of such risks, an application may prefer to
> >               disable   magic   link    resolution    using    the
> >               RESOLVE_NO_MAGICLINKS flag.
> >
> >               If  the trailing component (i.e., basename) of path‐
> >               name is a magic link, and  how.flags  contains  both
> >               O_PATH  and O_NOFOLLOW, then an O_PATH file descrip‐
> >               tor referencing the magic link will be returned.
> >
> > How does the above look?
>
> The changes look correct, though you could end up going through procfs
> even if you weren't resolving a path inside proc directly (since you can
> bind-mount symlinks or have a symlink to procfs). But I'm not sure if
> it's necessary to outline all the ways a program could be tricked into
> doing something unintended.

Yes, indeed. These paragraphs are merely intended to give the reader
some ideas about what the issues are.

> > Also, regarding the last paragraph, I  have a question.  The
> > text doesn't seem quite to relate to the rest of the discussion.
> > Should it be saying something like:
> >
> > If the trailing component (i.e., basename) of pathname is a magic link,
> > **how.resolve contains RESOLVE_NO_MAGICLINKS,**
> > and how.flags contains both O_PATH and O_NOFOLLOW, then an O_PATH
> > file descriptor referencing the magic link will be returned.
> >
> > ?
>
> Yes, that is what I meant to write --

Good. Fixed.

> and I believe that the
> RESOLVE_NO_SYMLINKS section is missing similar text in the second
> paragraph (except it should refer to RESOLVE_NO_SYMLINKS, obviously).

Also fixed.

Thanks,

Michael
Michael Kerrisk (man-pages) May 4, 2020, 10:17 a.m. UTC | #13
Hi Aleksa,

Ping on this piece please:

> > > It wouldn't hurt to add a longer description of magic-links in
> > > symlink(7). I'll send you a small patch to beef up the description (I
> > > had planned to include a longer rewrite with the O_EMPTYPATH patches but
> > > those require quite a bit more work to land).
> >
> > That would be great. Thank you!
>
> I'll cook something up later this week.

Thanks,

Michael
diff mbox series

Patch

diff --git a/man2/open.2 b/man2/open.2
index b0f485b41589..2a721c991a20 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@  open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -933,6 +937,15 @@  If
 is absolute, then
 .I dirfd
 is ignored.
+.SS openat2(2)
+The
+.BR openat2 (2)
+system call is an extension of
+.BR openat (),
+with a superset of features. To avoid making this man page too long, the
+description of
+.BR openat2 (2)
+and its features is documented in a separate man page.
 .SH RETURN VALUE
 .BR open (),
 .BR openat (),
@@ -1220,6 +1233,9 @@  SVr4, 4.3BSD, POSIX.1-2001, POSIX.1-2008.
 .BR openat ():
 POSIX.1-2008.
 .PP
+.BR openat2 (2)
+is Linux-specific.
+.PP
 The
 .BR O_DIRECT ,
 .BR O_NOATIME ,
@@ -1778,6 +1794,7 @@  is ignored).
 .BR mknod (2),
 .BR mmap (2),
 .BR mount (2),
+.BR openat2 (2),
 .BR open_by_handle_at (2),
 .BR read (2),
 .BR socket (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..6ec87a040f37
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,471 @@ 
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-12-20 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.B #include <openat2.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call opens the file specified by
+.IR pathname .
+If the specified file does not exist, it may optionally (if
+.B O_CREAT
+is specified in
+.IR how.flags )
+be created by
+.BR openat2() .
+.PP
+As with
+.BR openat (2),
+if
+.I pathname
+is relative, then it is interpreted relative to the
+directory referred to by the file descriptor
+.I dirfd
+(or the current working directory of the calling process, if
+.I dirfd
+is the special value
+.BR AT_FDCWD .)
+If
+.I pathname
+is absolute, then
+.I dirfd
+is ignored (unless
+.I how.resolve
+contains
+.BR RESOLVE_IN_ROOT,
+in which case
+.I pathname
+is resolved relative to
+.IR dirfd .)
+.PP
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality.
+Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+future extensions.
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+.B NOTES
+for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    u64 flags;    /* O_* flags. */
+    u64 mode;     /* Mode for O_{CREAT,TMPFILE}. */
+    u64 resolve;  /* RESOLVE_* flags. */
+    /* ... */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+Therefore, users must ensure that they zero-fill this structure on
+initialisation (see the "Extensibility" section of
+the
+.B NOTES
+for more detail on why this is necessary.)
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation.
+All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+
+Unlike
+.BR openat (2),
+it is an error to provide
+.BR openat2 ()
+unknown or conflicting flags in
+.IR flags .
+.RE
+
+.IR mode
+.RS
+File mode for the new file, with identical semantics to the
+.I mode
+argument to
+.BR openat (2).
+
+Unlike
+.BR openat (2),
+it is an error to provide
+.BR openat2 ()
+with a
+.I mode
+which contains bits other than
+.IR 0777 ,
+or to provide
+.BR openat2 ()
+a non-zero
+.IR mode " if " flags
+does not contain
+.BR O_CREAT " or " O_TMPFILE .
+.RE
+
+.I resolve
+.RS
+Change how
+.B all
+components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.)
+The primary use case for these flags is to allow trusted programs to restrict
+how untrusted paths (or paths inside untrusted directories) are resolved.
+The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow traversal of mount points during path resolution (including all bind
+mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind mounts are very widely used by
+end-users.
+Setting this flag indiscrimnately for all uses of
+.IR openat2 ()
+may result in spurious errors on previously-functional systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow resolution of symbolic links during path resolution.
+This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+If the trailing component is a symbolic link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symbolic link will be returned.
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symbolic links are very widely used
+by end-users.
+Setting this flag indiscrimnately for all uses of
+.IR openat2 ()
+may result in spurious errors on previously-functional systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic link resolution during path resolution.
+
+If the trailing component is a magic link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic link will be returned.
+
+Magic-links are symbolic link-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symbolic link (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This results in absolute symbolic links (and absolute values of
+.IR pathname )
+to be rejected.
+
+Currently, this flag also disables magic link resolution.
+However, this may change in the future.
+The caller should explicitly specify
+.B RESOLVE_NO_MAGICLINKS
+to ensure that magic links are not resolved.
+
+.TP
+.B RESOLVE_IN_ROOT
+Treat
+.I dirfd
+as the root directory while resolving
+.I pathname
+(as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.)
+Absolute symbolic links and ".." path components will be scoped to
+.IR dirfd .
+If
+.I pathname
+is an absolute path, it is also treated relative to
+.IR dirfd .
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root permanently for a process),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations.
+It also has several hardening features (such detecting escape attempts during
+.I ".."
+resolution) which
+.BR chroot (2)
+does not.
+
+Currently, this flag also disables magic link resolution.
+However, this may change in the future.
+The caller should explicitly specify
+.B RESOLVE_NO_MAGICLINKS
+to ensure that magic links are not resolved.
+.PP
+It is an error to provide
+.BR openat2 ()
+unknown flags in
+.IR resolve .
+.RE
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I mode
+is non-zero, but
+.I flags
+does not contain
+.BR O_CREAT " or " O_TMPFILE .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the
+.B NOTES
+for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack.)
+Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and an escape from the root during path resolution was detected.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symbolic link (or magic link).
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic link.
+
+.SH VERSIONS
+.BR openat2 ()
+first appeared in Linux 5.6.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify the size of
+.I struct open_how
+structure they are passing.
+By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.)
+This extensibility design is very similar to other system calls such as
+.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of.
+Because all extensions must have their zero values be a no-op, the kernel
+treats all of the extension fields not set by userspace to have zero values.
+This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support.
+Because all extensions must have their zero values be a no-op, the kernel can
+safely ignore the unsupported extension fields if they are all-zero.
+If any unsupported extension fields are non-zero, then -1 is returned and
+.I errno
+is set to
+.BR E2BIG .
+This provides forwards-compatibility.
+.RE
+.PP
+Therefore most userspace programs will not need to have any special handling
+of extensions.
+.PP
+However, because the definition of
+.I struct open_how
+may change in the future (with new fields being added when system headers are
+updated), userspace programs should zero-fill
+.I struct open_how
+to ensure that re-compiling the program with new headers will not result in
+spurious errors at runtime. The simplest way is to use a designated
+initialiser:
+.PP
+.in +4n
+.EX
+struct open_how how = { .flags = O_RDWR, .resolve = RESOLVE_IN_ROOT };
+.EE
+.in
+.PP
+or explicitly using something like
+.BR memset (3):
+.PP
+.in +4n
+.EX
+struct open_how how;
+memset(&how, 0, sizeof(how));
+how.flags = O_RDWR;
+how.resolve = RESOLVE_IN_ROOT;
+.EE
+.in
+.PP
+If a userspace program wishes to determine what extensions the running kernel
+supports, they may conduct a binary search on
+.IR size
+with a structure which has every byte non-zero (to find the largest value which
+doesn't produce an error of
+.BR E2BIG .)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)