[v4,0/2] System Generation ID driver and VMGENID backend

Message ID 1610453760-13812-1-git-send-email-acatan@amazon.com (mailing list archive)

Message

Catangiu, Adrian Costin Jan. 12, 2021, 12:15 p.m. UTC
This feature is aimed at virtualized or containerized environments
where VM or container snapshotting duplicates memory state, which is a
challenge for applications that want to generate unique data such as
request IDs, UUIDs, and cryptographic nonces.

The patch set introduces a mechanism that provides a userspace
interface for applications and libraries to be made aware of
uniqueness-breaking events such as VM or container snapshotting, and
allows them to react and adapt to such events.

Solving the uniqueness problem strongly enough for cryptographic
purposes requires a mechanism which can deterministically reseed
userspace PRNGs with new entropy at restore time. This mechanism must
also support the high-throughput and low-latency use-cases that led
programmers to pick a userspace PRNG in the first place; be usable by
both application code and libraries; allow transparent retrofitting
behind existing popular PRNG interfaces without changing application
code; be efficient, especially on snapshot restore; and be simple
enough for wide adoption.

The first patch in the set implements a device driver which exposes a
read-only device /dev/sysgenid to userspace, which contains a
monotonically increasing u32 generation counter. Libraries and
applications are expected to open() the device, and then call read()
which blocks until the SysGenId changes. Following an update, read()
calls no longer block until the application acknowledges the new
SysGenId by write()ing it back to the device. Non-blocking read() calls
return EAGAIN when there is no new SysGenId available. Alternatively,
libraries can mmap() the device to get a single shared page which
contains the latest SysGenId at offset 0.

SysGenId also supports a notification mechanism exposed as two IOCTLs
on the device. SYSGENID_GET_OUTDATED_WATCHERS immediately returns the
number of file descriptors to the device that were open during the last
SysGenId change but have not yet acknowledged the new id.
SYSGENID_WAIT_WATCHERS blocks until there are no open file handles on
the device which haven’t acknowledged the new id. These two interfaces
are intended for serverless and container control planes, which want to
confirm that all application code has detected and reacted to the new
SysGenId before sending an invoke to the newly-restored sandbox.

The second patch in the set adds a VmGenId driver which makes use of
the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
on VM snapshots.

---

v3 -> v4:

  - split functionality in two separate kernel modules: 
    1. drivers/misc/sysgenid.c which provides the generic userspace
       interface and mechanisms
    2. drivers/virt/vmgenid.c as VMGENID ACPI device driver that seeds
       kernel entropy and acts as a driving backend for the generic
       sysgenid
  - renamed /dev/vmgenid -> /dev/sysgenid
  - renamed uapi header file vmgenid.h -> sysgenid.h
  - renamed ioctls VMGENID_* -> SYSGENID_*
  - added ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
  - fixed races in documentation examples
  - various style nits
  - rebased on top of linus latest

v2 -> v3:

  - separate the core driver logic and interface, from the ACPI device.
    The ACPI vmgenid device is now one possible backend.
  - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
  - add locking to avoid races between fs ops handlers and hw irq
    driven generation updates
  - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
    outdated or a generation change happens while waiting (thus making
    current caller outdated), the ioctl returns -EINTR to signal the
    user to handle event and retry. Fixes blocking on oneself.
  - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
    CAP_CHECKPOINT_RESTORE capability, through which software can force
    generation bump.

v1 -> v2:

  - expose to userspace a monotonically increasing u32 Vm Gen Counter
    instead of the hw VmGen UUID
  - since the hw/hypervisor-provided 128-bit UUID is not public
    anymore, add it to the kernel RNG as device randomness
  - insert driver page containing Vm Gen Counter in the user vma in
    the driver's mmap handler instead of using a fault handler
  - turn driver into a misc device driver to auto-create /dev/vmgenid
  - change ioctl arg to avoid leaking kernel structs to userspace
  - update documentation
  - various nits
  - rebase on top of linus latest

Adrian Catangiu (2):
  drivers/misc: sysgenid: add system generation id driver
  drivers/virt: vmgenid: add vm generation id driver

 Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
 Documentation/virt/vmgenid.rst          |  34 ++++
 drivers/misc/Kconfig                    |  16 ++
 drivers/misc/Makefile                   |   1 +
 drivers/misc/sysgenid.c                 | 298 ++++++++++++++++++++++++++++++++
 drivers/virt/Kconfig                    |  14 ++
 drivers/virt/Makefile                   |   1 +
 drivers/virt/vmgenid.c                  | 153 ++++++++++++++++
 include/uapi/linux/sysgenid.h           |  18 ++
 9 files changed, 775 insertions(+)
 create mode 100644 Documentation/misc-devices/sysgenid.rst
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/misc/sysgenid.c
 create mode 100644 drivers/virt/vmgenid.c
 create mode 100644 include/uapi/linux/sysgenid.h

Comments

Michael S. Tsirkin Jan. 12, 2021, 12:48 p.m. UTC | #1
On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> The first patch in the set implements a device driver which exposes a
> read-only device /dev/sysgenid to userspace, which contains a
> monotonically increasing u32 generation counter. Libraries and
> applications are expected to open() the device, and then call read()
> which blocks until the SysGenId changes. Following an update, read()
> calls no longer block until the application acknowledges the new
> SysGenId by write()ing it back to the device. Non-blocking read() calls
> return EAGAIN when there is no new SysGenId available. Alternatively,
> libraries can mmap() the device to get a single shared page which
> contains the latest SysGenId at offset 0.

Looking at some specifications, the gen ID might actually be located
at an arbitrary address. How about instead of hard-coding the offset,
we expose it e.g. in sysfs?


Catangiu, Adrian Costin Jan. 21, 2021, 10:28 a.m. UTC | #2
On 12/01/2021, 14:49, "Michael S. Tsirkin" <mst@redhat.com> wrote:

    On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
    > The first patch in the set implements a device driver which exposes a
    > read-only device /dev/sysgenid to userspace, which contains a
    > monotonically increasing u32 generation counter. Libraries and
    > applications are expected to open() the device, and then call read()
    > which blocks until the SysGenId changes. Following an update, read()
    > calls no longer block until the application acknowledges the new
    > SysGenId by write()ing it back to the device. Non-blocking read() calls
    > return EAGAIN when there is no new SysGenId available. Alternatively,
    > libraries can mmap() the device to get a single shared page which
    > contains the latest SysGenId at offset 0.

    Looking at some specifications, the gen ID might actually be located
    at an arbitrary address. How about instead of hard-coding the offset,
    we expose it e.g. in sysfs?

The functionality is split between SysGenID which exposes an internal u32
counter to userspace, and an (optional) VmGenID backend which drives
SysGenID generation changes based on hw vmgenid updates.

The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
otherwise exposed to userspace. It is only used internally by the vmgenid
driver to find out about VM generation changes and drive the more generic
SysGenID.

The SysGenID u32 monotonically increasing counter is the one that is mmapped to
userspace, but it is a software counter. I don't see any value in using a dynamic
offset in the mmapped page. Offset 0 is fast and easy and, most importantly, it is
static, so there is no need to dynamically calculate or find it at runtime.
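
To illustrate the fixed-offset argument, a library-side check against the mapped page might look roughly like the sketch below. The helper names sysgenid_map and sysgenid_changed are hypothetical and not part of the series; the sketch assumes a single read-only shared page with the u32 counter at offset 0, as described in the cover letter. It does not close the window between the check and the use of stale state, which is the kind of race raised elsewhere in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the single shared page of an already open
 * descriptor read-only. Returns NULL on failure.
 */
static const volatile uint32_t *sysgenid_map(int fd)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/*
 * Hypothetical helper: compare the counter at offset 0 with the value the
 * caller cached earlier. Returns 1 and updates *cached when the generation
 * changed, 0 otherwise. No system call is needed on this path.
 */
static int sysgenid_changed(const volatile uint32_t *counter, uint32_t *cached)
{
	uint32_t now = counter[0];	/* static offset 0, nothing to look up */

	if (now == *cached)
		return 0;
	*cached = now;
	return 1;
}
```

A library would call sysgenid_changed() before handing out data derived from cached state and regenerate that state when it returns 1.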

Thanks,
Adrian.




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Michael S. Tsirkin Jan. 27, 2021, 12:47 p.m. UTC | #3
On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
> On 12/01/2021, 14:49, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>     On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
>     > The first patch in the set implements a device driver which exposes a
>     > read-only device /dev/sysgenid to userspace, which contains a
>     > monotonically increasing u32 generation counter. Libraries and
>     > applications are expected to open() the device, and then call read()
>     > which blocks until the SysGenId changes. Following an update, read()
>     > calls no longer block until the application acknowledges the new
>     > SysGenId by write()ing it back to the device. Non-blocking read() calls
>     > return EAGAIN when there is no new SysGenId available. Alternatively,
>     > libraries can mmap() the device to get a single shared page which
>     > contains the latest SysGenId at offset 0.
> 
>     Looking at some specifications, the gen ID might actually be located
>     at an arbitrary address. How about instead of hard-coding the offset,
>     we expose it e.g. in sysfs?
> 
> The functionality is split between SysGenID which exposes an internal u32
> counter to userspace, and an (optional) VmGenID backend which drives
> SysGenID generation changes based on hw vmgenid updates.
> 
> The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
> otherwise exposed to userspace. It is only used internally by the vmgenid
> driver to find out about VM generation changes and drive the more generic
> SysGenID.
> 
> The SysGenID u32 monotonic increasing counter is the one that is mmaped to
> userspace, but it is a software counter. I don't see any value in using a dynamic
> offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
> static so no need to dynamically calculate or find it at runtime.

Well you are burning a whole page on it, using an offset the page
can be shared with other functionality.

> Thanks,
> Adrian.
> 
> 
> 
> 
Alexander Graf Jan. 28, 2021, 12:58 p.m. UTC | #4
Hey Michael!

On 27.01.21 13:47, Michael S. Tsirkin wrote:
> 
> On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
>> On 12/01/2021, 14:49, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>      On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
>>      > The first patch in the set implements a device driver which exposes a
>>      > read-only device /dev/sysgenid to userspace, which contains a
>>      > monotonically increasing u32 generation counter. Libraries and
>>      > applications are expected to open() the device, and then call read()
>>      > which blocks until the SysGenId changes. Following an update, read()
>>      > calls no longer block until the application acknowledges the new
>>      > SysGenId by write()ing it back to the device. Non-blocking read() calls
>>      > return EAGAIN when there is no new SysGenId available. Alternatively,
>>      > libraries can mmap() the device to get a single shared page which
>>      > contains the latest SysGenId at offset 0.
>>
>>      Looking at some specifications, the gen ID might actually be located
>>      at an arbitrary address. How about instead of hard-coding the offset,
>>      we expose it e.g. in sysfs?
>>
>> The functionality is split between SysGenID which exposes an internal u32
>> counter to userspace, and an (optional) VmGenID backend which drives
>> SysGenID generation changes based on hw vmgenid updates.
>>
>> The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
>> otherwise exposed to userspace. It is only used internally by the vmgenid
>> driver to find out about VM generation changes and drive the more generic
>> SysGenID.
>>
>> The SysGenID u32 monotonic increasing counter is the one that is mmaped to
>> userspace, but it is a software counter. I don't see any value in using a dynamic
>> offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
>> static so no need to dynamically calculate or find it at runtime.
> 
> Well you are burning a whole page on it, using an offset the page
> can be shared with other functionality.

Currently, the SysGenID lives in one page owned by Linux that we share
out to multiple user space clients. So yes, we burn a single page of the
system here.

If we put more data in that same page, what data would you put there? 
Random other bits from other subsystems? At that point, we'd be 
reinventing vdso all over again, no? Probably with the same problems.

Which gets me to the second alternative: Reuse VDSO. The problem there 
is that the VDSO is an extremely architecture specific mechanism. Any 
new architecture we'd want to support would need multiple layers of 
changes in multiple layers of both kernel and libc. I'd like to avoid 
that if we can :).

So that leaves us with either wasting a page per system or not having an 
mmap() interface in the first place.

The reason we have the mmap() interface is that it's easier to
consume for libraries that are not hooked into the main event loop.

So, uh, what are you suggesting? :)


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Michael S. Tsirkin Feb. 2, 2021, 2:34 p.m. UTC | #5
On Thu, Jan 28, 2021 at 01:58:12PM +0100, Alexander Graf wrote:
> Hey Michael!
> 
> On 27.01.21 13:47, Michael S. Tsirkin wrote:
> > 
> > On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
> > > On 12/01/2021, 14:49, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > >      On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> > >      > The first patch in the set implements a device driver which exposes a
> > >      > read-only device /dev/sysgenid to userspace, which contains a
> > >      > monotonically increasing u32 generation counter. Libraries and
> > >      > applications are expected to open() the device, and then call read()
> > >      > which blocks until the SysGenId changes. Following an update, read()
> > >      > calls no longer block until the application acknowledges the new
> > >      > SysGenId by write()ing it back to the device. Non-blocking read() calls
> > >      > return EAGAIN when there is no new SysGenId available. Alternatively,
> > >      > libraries can mmap() the device to get a single shared page which
> > >      > contains the latest SysGenId at offset 0.
> > > 
> > >      Looking at some specifications, the gen ID might actually be located
> > >      at an arbitrary address. How about instead of hard-coding the offset,
> > >      we expose it e.g. in sysfs?
> > > 
> > > The functionality is split between SysGenID which exposes an internal u32
> > > counter to userspace, and an (optional) VmGenID backend which drives
> > > SysGenID generation changes based on hw vmgenid updates.
> > > 
> > > The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
> > > otherwise exposed to userspace. It is only used internally by the vmgenid
> > > driver to find out about VM generation changes and drive the more generic
> > > SysGenID.
> > > 
> > > The SysGenID u32 monotonic increasing counter is the one that is mmaped to
> > > userspace, but it is a software counter. I don't see any value in using a dynamic
> > > offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
> > > static so no need to dynamically calculate or find it at runtime.
> > 
> > Well you are burning a whole page on it, using an offset the page
> > can be shared with other functionality.
> 
> Currently, the SysGenID lives in one page owned by Linux that we share out
> to multiple user space clients. So yes, we burn a single page of the system
> here.
> 
> If we put more data in that same page, what data would you put there? Random
> other bits from other subsystems? At that point, we'd be reinventing vdso
> all over again, no? Probably with the same problems.
> 
> Which gets me to the second alternative: Reuse VDSO. The problem there is
> that the VDSO is an extremely architecture specific mechanism. Any new
> architecture we'd want to support would need multiple layers of changes in
> multiple layers of both kernel and libc. I'd like to avoid that if we can
> :).
> 
> So that leaves us with either wasting a page per system or not having an
> mmap() interface in the first place.
> 
> The reason we have the mmap() interface is that it's easier to consume
> for libraries that are not hooked into the main event loop.
> 
> So, uh, what are you suggesting? :)

I'd drop mmap at this point. I haven't seen a way to use it
that isn't racy.

> 
> Alex