[v2,0/2] mm/memfd: add ioctl(MEMFD_CHECK_IF_ORIGINAL)

Message ID 20230908175738.41895-1-mclapinski@google.com (mailing list archive)

Message

Michał Cłapiński Sept. 8, 2023, 5:57 p.m. UTC
This change introduces a way to check if an fd points to a memfd's
original open fd (the one created by memfd_create).

We encountered an issue with migrating memfds in CRIU (checkpoint
restore in userspace - it migrates running processes between
machines). Imagine a scenario:
1. Create a memfd. By default it's open with O_RDWR and yet one can
exec() to it (unlike with regular files, where one would get ETXTBSY).
2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.

Now those 2 fds are indistinguishable from userspace. You can't exec()
to either of them (since the reopen incremented inode->i_writecount)
and their /proc/self/fdinfo/ are exactly the same. Unfortunately they
are not the same. If you close the second one, the first one becomes
exec()able again. If you close the first one, the other doesn't become
exec()able. Therefore during migration it does matter which is recreated
first and which is reopened but there is no way for CRIU to tell which
was first.
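
The two steps above can be sketched from userspace roughly as follows. This
is only an illustration, not code from the patch: the helper names
reopen_rdwr() and fds_look_alike() are hypothetical, and the sketch assumes a
Linux system with /proc mounted and a libc that provides memfd_create(). It
compares only the inode and the access mode of the two fds; it does not
exercise exec().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: reopen an existing fd read-write through its
 * /proc/self/fd entry (step 2 of the scenario).
 * Returns the new fd, or -1 on error.
 */
static int reopen_rdwr(int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, O_RDWR);
}

/*
 * Hypothetical helper: compare what userspace can cheaply observe
 * about two fds (device, inode, access mode).
 * Returns 1 if they match, 0 if they differ, -1 on error.
 */
static int fds_look_alike(int a, int b)
{
	struct stat sa, sb;
	int fa, fb;

	if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
		return -1;

	fa = fcntl(a, F_GETFL);
	fb = fcntl(b, F_GETFL);
	if (fa < 0 || fb < 0)
		return -1;

	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino &&
	       (fa & O_ACCMODE) == (fb & O_ACCMODE);
}
```

With fd = memfd_create("demo", 0), calling fds_look_alike(fd, reopen_rdwr(fd))
would be expected to return 1. That is the ambiguity described above: neither
check tells which fd came from memfd_create() and which from the reopen.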

---
Changes since v1 at [1]:
  - Rewrote it from fcntl to ioctl. This was requested by a filesystems
    maintainer.

Links:
  [1] https://lore.kernel.org/all/20230831203647.558079-1-mclapinski@google.com/

Michal Clapinski (2):
  mm/memfd: add ioctl(MEMFD_CHECK_IF_ORIGINAL)
  selftests: test ioctl(MEMFD_CHECK_IF_ORIGINAL)

 .../userspace-api/ioctl/ioctl-number.rst      |  1 +
 fs/hugetlbfs/inode.c                          |  9 ++++++
 include/linux/memfd.h                         | 12 +++++++
 mm/memfd.c                                    |  9 ++++++
 mm/shmem.c                                    |  9 ++++++
 tools/testing/selftests/memfd/memfd_test.c    | 32 +++++++++++++++++++
 6 files changed, 72 insertions(+)

Comments

Jonathan Corbet Sept. 8, 2023, 8:34 p.m. UTC | #1
Michal Clapinski <mclapinski@google.com> writes:

> This change introduces a way to check if an fd points to a memfd's
> original open fd (the one created by memfd_create).
>
> We encountered an issue with migrating memfds in CRIU (checkpoint
> restore in userspace - it migrates running processes between
> machines). Imagine a scenario:
> 1. Create a memfd. By default it's open with O_RDWR and yet one can
> exec() to it (unlike with regular files, where one would get ETXTBSY).
> 2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.
>
> Now those 2 fds are indistinguishable from userspace. You can't exec()
> to either of them (since the reopen incremented inode->i_writecount)
> and their /proc/self/fdinfo/ are exactly the same. Unfortunately they
> are not the same. If you close the second one, the first one becomes
> exec()able again. If you close the first one, the other doesn't become
> exec()able. Therefore during migration it does matter which is recreated
> first and which is reopened but there is no way for CRIU to tell which
> was first.

So please bear with me...I'll confess that I don't fully understand the
situation here, so this is probably a dumb question.

It seems like you are adding this "original open" test as a way of
working around a quirk with the behavior of subsequent opens.  I don't
*think* that this is part of the intended, documented behavior of
memfds, it's just something that happens.  You're exposing an artifact
of the current implementation.

Given that the two file descriptors are otherwise indistinguishable,
might a better fix be to make them indistinguishable in this regard as
well?  Is there a good reason why the second fd doesn't become
exec()able in this scenario and, if not, perhaps that behavior could be
changed instead?

Thanks,

jon

Michał Cłapiński Sept. 8, 2023, 9:55 p.m. UTC | #2
On Fri, Sep 8, 2023 at 10:34 PM Jonathan Corbet <corbet@lwn.net> wrote:
>
> Michal Clapinski <mclapinski@google.com> writes:
>
> > This change introduces a way to check if an fd points to a memfd's
> > original open fd (the one created by memfd_create).
> >
> > We encountered an issue with migrating memfds in CRIU (checkpoint
> > restore in userspace - it migrates running processes between
> > machines). Imagine a scenario:
> > 1. Create a memfd. By default it's open with O_RDWR and yet one can
> > exec() to it (unlike with regular files, where one would get ETXTBSY).
> > 2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.
> >
> > Now those 2 fds are indistinguishable from userspace. You can't exec()
> > to either of them (since the reopen incremented inode->i_writecount)
> > and their /proc/self/fdinfo/ are exactly the same. Unfortunately they
> > are not the same. If you close the second one, the first one becomes
> > exec()able again. If you close the first one, the other doesn't become
> > exec()able. Therefore during migration it does matter which is recreated
> > first and which is reopened but there is no way for CRIU to tell which
> > was first.
>
> So please bear with me...I'll confess that I don't fully understand the
> situation here, so this is probably a dumb question.
>
> It seems like you are adding this "original open" test as a way of
> working around a quirk with the behavior of subsequent opens.  I don't
> *think* that this is part of the intended, documented behavior of
> memfds, it's just something that happens.  You're exposing an artifact
> of the current implementation.

I don't know if the exec()ability of the original memfd was intended,
let alone the non-exec()ability of subsequent opens. But otherwise
yes.

> Given that the two file descriptors are otherwise indistinguishable,
> might a better fix be to make them indistinguishable in this regard as
> well?  Is there a good reason why the second fd doesn't become
> exec()able in this scenario and, if not, perhaps that behavior could be
> changed instead?

It probably could be changed, yes. But I'm worried that would be
broadening the bug that is the exec()ability of memfds. AFAIK no other
fd that is opened as writable can be exec()ed. If maintainers would
prefer this, I could do this.

Jonathan Corbet Sept. 8, 2023, 10:07 p.m. UTC | #3
Michał Cłapiński <mclapinski@google.com> writes:

> On Fri, Sep 8, 2023 at 10:34 PM Jonathan Corbet <corbet@lwn.net> wrote:
>> Given that the two file descriptors are otherwise indistinguishable,
>> might a better fix be to make them indistinguishable in this regard as
>> well?  Is there a good reason why the second fd doesn't become
>> exec()able in this scenario and, if not, perhaps that behavior could be
>> changed instead?
>
> It probably could be changed, yes. But I'm worried that would be
> broadening the bug that is the exec()ability of memfds. AFAIK no other
> fd that is opened as writable can be exec()ed. If maintainers would
> prefer this, I could do this.

I'm not convinced that perpetuating the behavior and adding an ioctl()
workaround would be better than that; it seems to me that consistency
would be better.  But I don't have any real say in that matter, of
course; I'm curious what others think.

Thanks,

jon