diff mbox series

drivers/virt: vmgenid: add vm generation id driver

Message ID 788878CE-2578-4991-A5A6-669DCABAC2F2@amazon.com (mailing list archive)
State New, archived
Headers show
Series drivers/virt: vmgenid: add vm generation id driver | expand

Commit Message

Catangiu, Adrian Costin Oct. 16, 2020, 2:33 p.m. UTC
- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.

The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.

- Problem

The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors but neither the vendors or upstream Linux have no
default driver for it leaving users to fend for themselves.

Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.

- Solution

This patch is a driver which exposes the Virtual Machine Generation ID
via a char-dev FS interface that provides ID update sync and async
notification, retrieval and confirmation mechanisms:

When the device is 'open()'ed a copy of the current vm UUID is
associated with the file handle. 'read()' operations block until the
associated UUID is no longer up to date - until HW vm gen id changes -
at which point the new UUID is provided/returned. Nonblocking 'read()'
uses EWOULDBLOCK to signal that there is no _new_ UUID available.

'poll()' is implemented to allow polling for UUID updates. Such
updates result in 'EPOLLIN' events.

Subsequent read()s following a UUID update no longer block, but return
the updated UUID. The application needs to acknowledge the UUID update
by confirming it through a 'write()'.
Only on writing back to the driver the right/latest UUID, will the
driver mark this "watcher" as up to date and remove EPOLLIN status.

'mmap()' support allows mapping a single read-only shared page which
will always contain the latest UUID value at offset 0.

The driver also adds support for tracking count of open file handles
that haven't acknowledged an UUID update. This is exposed through
two IOCTLs:
 * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
   _outdated_ watchers - number of file handles that were open during
   a VM generation change, and which have not yet confirmed the new
   Vm-Gen-Id.
 * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_
   watchers, or if a 'timeout' argument is provided, until the timeout
   expires.

This patch builds on top of Or Idgar <oridgar@gmail.com>'s proposal
https://lkml.org/lkml/2018/3/1/498

- Future improvements

Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.

There is a path for platform devices to match on _CID when _HID is
'PRP0001' - which is not the case for the Qemu vmgenid device.

Guidance and help here would be greatly appreciated.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
 Documentation/virt/vmgenid.rst | 211 +++++++++++++++++++++
 drivers/virt/Kconfig           |  13 ++
 drivers/virt/Makefile          |   1 +
 drivers/virt/vmgenid.c         | 419 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vmgenid.h   |  22 +++
 5 files changed, 666 insertions(+)
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/virt/vmgenid.c
 create mode 100644 include/uapi/linux/vmgenid.h

Comments

Catangiu, Adrian Costin Oct. 16, 2020, 3 p.m. UTC | #1
Sorry, I forgot to add a few people interested in this and the KVM ML to CC. Added them.

On 16/10/2020, 17:33, "Catangiu, Adrian Costin" <acatan@amazon.com> wrote:

    - Background
    
    The VM Generation ID is a feature defined by Microsoft (paper:
    http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
    multiple hypervisor vendors.
    
    The feature is required in virtualized environments by apps that work
    with local copies/caches of world-unique data such as random values,
    uuids, monotonically increasing counters, etc.
    Such apps can be negatively affected by VM snapshotting when the VM
    is either cloned or returned to an earlier point in time.
    
    The VM Generation ID is a simple concept meant to alleviate the issue
    by providing a unique ID that changes each time the VM is restored
    from a snapshot. The hw provided UUID value can be used to
    differentiate between VMs or different generations of the same VM.
    
    - Problem
    
    The VM Generation ID is exposed through an ACPI device by multiple
    hypervisor vendors but neither the vendors or upstream Linux have no
    default driver for it leaving users to fend for themselves.
    
    Furthermore, simply finding out about a VM generation change is only
    the starting point of a process to renew internal states of possibly
    multiple applications across the system. This process could benefit
    from a driver that provides an interface through which orchestration
    can be easily done.
    
    - Solution
    
    This patch is a driver which exposes the Virtual Machine Generation ID
    via a char-dev FS interface that provides ID update sync and async
    notification, retrieval and confirmation mechanisms:
    
    When the device is 'open()'ed a copy of the current vm UUID is
    associated with the file handle. 'read()' operations block until the
    associated UUID is no longer up to date - until HW vm gen id changes -
    at which point the new UUID is provided/returned. Nonblocking 'read()'
    uses EWOULDBLOCK to signal that there is no _new_ UUID available.
    
    'poll()' is implemented to allow polling for UUID updates. Such
    updates result in 'EPOLLIN' events.
    
    Subsequent read()s following a UUID update no longer block, but return
    the updated UUID. The application needs to acknowledge the UUID update
    by confirming it through a 'write()'.
    Only on writing back to the driver the right/latest UUID, will the
    driver mark this "watcher" as up to date and remove EPOLLIN status.
    
    'mmap()' support allows mapping a single read-only shared page which
    will always contain the latest UUID value at offset 0.
    
    The driver also adds support for tracking count of open file handles
    that haven't acknowledged an UUID update. This is exposed through
    two IOCTLs:
     * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
       _outdated_ watchers - number of file handles that were open during
       a VM generation change, and which have not yet confirmed the new
       Vm-Gen-Id.
     * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_
       watchers, or if a 'timeout' argument is provided, until the timeout
       expires.
    
    This patch builds on top of Or Idgar <oridgar@gmail.com>'s proposal
    https://lkml.org/lkml/2018/3/1/498
    
    - Future improvements
    
    Ideally we would want the driver to register itself based on devices'
    _CID and not _HID, but unfortunately I couldn't find a way to do that.
    The problem is that ACPI device matching is done by
    '__acpi_match_device()' which exclusively looks at
    'acpi_hardware_id *hwid'.
    
    There is a path for platform devices to match on _CID when _HID is
    'PRP0001' - which is not the case for the Qemu vmgenid device.
    
    Guidance and help here would be greatly appreciated.
    
    Signed-off-by: Adrian Catangiu <acatan@amazon.com>
    ---
     Documentation/virt/vmgenid.rst | 211 +++++++++++++++++++++
     drivers/virt/Kconfig           |  13 ++
     drivers/virt/Makefile          |   1 +
     drivers/virt/vmgenid.c         | 419 +++++++++++++++++++++++++++++++++++++++++
     include/uapi/linux/vmgenid.h   |  22 +++
     5 files changed, 666 insertions(+)
     create mode 100644 Documentation/virt/vmgenid.rst
     create mode 100644 drivers/virt/vmgenid.c
     create mode 100644 include/uapi/linux/vmgenid.h
    
    diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
    new file mode 100644
    index 0000000..5224415
    --- /dev/null
    +++ b/Documentation/virt/vmgenid.rst
    @@ -0,0 +1,211 @@
    +.. SPDX-License-Identifier: GPL-2.0
    +
    +============
    +VMGENID
    +============
    +
    +The VM Generation ID is a feature defined by Microsoft (paper:
    +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
    +multiple hypervisor vendors.
    +
    +The feature is required in virtualized environments by apps that work
    +with local copies/caches of world-unique data such as random values,
    +uuids, monotonically increasing counters, etc.
    +Such apps can be negatively affected by VM snapshotting when the VM
    +is either cloned or returned to an earlier point in time.
    +
    +The VM Generation ID is a simple concept meant to alleviate the issue
    +by providing a unique ID that changes each time the VM is restored
    +from a snapshot. The hw provided UUID value can be used to
    +differentiate between VMs or different generations of the same VM.
    +
    +The VM Generation ID is exposed through an ACPI device by multiple
    +hypervisor vendors. The driver for it lives at
    +``drivers/virt/vmgenid.c``
    +
    +The driver exposes the Virtual Machine Generation ID via a char-dev FS
    +interface that provides ID update sync/async notification, retrieval
    +and confirmation mechanisms:
    +
    +``open()``:
    +  When the device is opened, a copy of the current vm UUID is
    +  associated with the file handle. The driver now tracks this file
    +  handle as an independent *watcher*. The driver tracks how many
    +  watchers are aware of the latest Vm-Gen-Id uuid and how many of
    +  them are *outdated*, outdated being those that have lived through
    +  a Vm-Gen-Id change but not yet confirmed the generation change event.
    +
    +``read()``:
    +  Read is meant to provide the *new* Vm-Gen-Id when a generation change
    +  takes place. The read operation blocks until the associated UUID is
    +  no longer up to date - until HW vm gen id changes - at which point
    +  the new UUID is provided/returned. Nonblocking ``read()``
    +  uses ``EAGAIN`` to signal that there is no *new* UUID available.
    +  The hw UUID is considered *new* for each open file handle that hasn't
    +  confirmed the new value, following a generation change. Therefore,
    +  once a generation change takes place, all ``read()`` calls will
    +  immediately return the new uuid and will continue to do so until the
    +  new value is confirmed back to the driver through ``write()``.
    +  Partial reads are not allowed - read buffer needs to be at least
    +  ``sizeof(uuid_t)`` in size.
    +
    +``write()``:
    +  Write is used to confirm the up-to-date Vm-Gen-Id back to the driver.
    +  Following a VM generation change, all existing watchers are marked
    +  as *outdated*. Each file handle will maintain the *outdated* status
    +  until a ``write()`` confirms the up-to-date UUID back to the driver.
    +  Partial writes are not allowed - write buffer should be exactly
    +  ``sizeof(uuid_t)`` in size.
    +
    +``poll()``:
    +  Poll is implemented to allow polling for UUID updates. Such
    +  updates result in ``EPOLLIN`` polling status until the new up-to-date
    +  UUID is confirmed back to the driver through a ``write()``.
    +
    +``ioctl()``:
    +  The driver also adds support for tracking count of open file handles
    +  that haven't acknowledged an UUID update. This is exposed through
    +  two IOCTLs:
    +
    +  - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
    +    *outdated* watchers - number of file handles that were open during
    +    a VM generation change, and which have not yet confirmed the new
    +    Vm-Gen-Id.
    +  - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
    +    watchers, or if a ``timeout`` argument is provided, until the
    +    timeout expires.
    +
    +``mmap()``:
    +  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
    +  in size. The first 16 bytes of the mapped page will contain an
    +  up-to-date copy of the VM generation UUID.
    +  The mapped memory can be used as a low-latency UUID probe mechanism
    +  in critical sections - see examples.
    +
    +``close()``:
    +  Removes the file handle as a Vm-Gen-Id watcher.
    +
    +Example application workflows
    +-----------------------------
    +
    +1) Watchdog thread simplified example::
    +
    +	void watchdog_thread_handler(int *thread_active)
    +	{
    +		uuid_t uuid;
    +		int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
    +
    +		do {
    +			// read new UUID - blocks until VM generation changes
    +			read(fd, &uuid, sizeof(uuid));
    +
    +			// because of VM generation change, we need to rebuild world
    +			reseed_app_env();
    +
    +			// confirm we're done handling UUID update
    +			write(fd, &uuid, sizeof(uuid));
    +		} while (atomic_read(thread_active));
    +
    +		close(fd);
    +	}
    +
    +2) ASYNC simplified example::
    +
    +	void handle_io_on_vmgenfd(int vmgenfd)
    +	{
    +		uuid_t uuid;
    +
    +		// because of VM generation change, we need to rebuild world
    +		reseed_app_env();
    +
    +		// read new UUID - we need it to confirm we've handled update
    +		read(fd, &uuid, sizeof(uuid));
    +
    +		// confirm we're done handling UUID update
    +		write(fd, &uuid, sizeof(uuid));
    +	}
    +
    +	int main() {
    +		int epfd, vmgenfd;
    +		struct epoll_event ev;
    +
    +		epfd = epoll_create(EPOLL_QUEUE_LEN);
    +
    +		vmgenfd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
    +
    +		// register vmgenid for polling
    +		ev.events = EPOLLIN;
    +		ev.data.fd = vmgenfd;
    +		epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
    +
    +		// register other parts of your app for polling
    +		// ...
    +
    +		while (1) {
    +			// wait for something to do...
    +			int nfds = epoll_wait(epfd, events,
    +				MAX_EPOLL_EVENTS_PER_RUN,
    +				EPOLL_RUN_TIMEOUT);
    +			if (nfds < 0) die("Error in epoll_wait!");
    +
    +			// for each ready fd
    +			for(int i = 0; i < nfds; i++) {
    +				int fd = events[i].data.fd;
    +
    +				if (fd == vmgenfd)
    +					handle_io_on_vmgenfd(vmgenfd);
    +				else
    +					handle_some_other_part_of_the_app(fd);
    +			}
    +		}
    +
    +		return 0;
    +	}
    +
    +3) Mapped memory polling simplified example::
    +
    +	/*
    +	 * app/library function that provides cached secrets
    +	 */
    +	char * safe_cached_secret(app_data_t *app)
    +	{
    +		char *secret;
    +		volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app);
    +	again:
    +		secret = __cached_secret(app);
    +
    +		if (unlikely(*uuid_ptr != app->cached_uuid)) {
    +			app->cached_uuid = *uuid_ptr;
    +
    +			// rebuild world then confirm the uuid update (thru write)
    +			rebuild_caches(app);
    +			ack_vmgenid_update(app);
    +
    +			goto again;
    +		}
    +
    +		return secret;
    +	}
    +
    +4) Orchestrator simplified example::
    +
    +	/*
    +	 * orchestrator - manages multiple apps and libraries used by a service
    +	 * and tries to make sure all sensitive components gracefully handle
    +	 * VM generation changes.
    +	 * Following function is called on detection of a VM generation change.
    +	 */
    +	int handle_vmgen_update(int vmgenfd, uuid_t new_uuid)
    +	{
    +		// pause until all components have handled event
    +		pause_service();
    +
    +		// confirm *this* watcher as up-to-date
    +		write(fd, &new_uuid, sizeof(uuid_t));
    +
    +		// wait for all *others*
    +		ioctl(fd, VMGENID_WAIT_WATCHERS, NULL);
    +
    +		// all apps on the system have rebuilt worlds
    +		resume_service();
    +	}
    diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
    index 363af2e..c80f1ce 100644
    --- a/drivers/virt/Kconfig
    +++ b/drivers/virt/Kconfig
    @@ -13,6 +13,19 @@ menuconfig VIRT_DRIVERS
     
     if VIRT_DRIVERS
     
    +config VMGENID
    +	tristate "Virtual Machine Generation ID driver"
    +	depends on ACPI
    +	default M
    +	help
    +	  This is a Virtual Machine Generation ID driver which provides
    +	  a virtual machine unique identifier. The provided UUID can be
    +	  watched through the FS interface exposed by this driver, and
    +	  thus can provide notifications for VM snapshot or cloning events.
    +	  This enables applications and libraries that store or cache
    +	  sensitive information, to know that they need to regenerate it
    +	  after process memory has been exposed to potential copying.
    +
     config FSL_HV_MANAGER
     	tristate "Freescale hypervisor management driver"
     	depends on FSL_SOC
    diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
    index fd33124..a1f8dcc 100644
    --- a/drivers/virt/Makefile
    +++ b/drivers/virt/Makefile
    @@ -4,4 +4,5 @@
     #
     
     obj-$(CONFIG_FSL_HV_MANAGER)	+= fsl_hypervisor.o
    +obj-$(CONFIG_VMGENID)		+= vmgenid.o
     obj-y				+= vboxguest/
    diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
    new file mode 100644
    index 0000000..d314c72
    --- /dev/null
    +++ b/drivers/virt/vmgenid.c
    @@ -0,0 +1,419 @@
    +// SPDX-License-Identifier: GPL-2.0
    +/*
    + * Virtual Machine Generation ID driver
    + *
    + * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc
    + * All rights reserved.
    + *	Authors:
    + *	  Adrian Catangiu <acatan@amazon.com>
    + *	  Or Idgar <oridgar@gmail.com>
    + *	  Gal Hammer <ghammer@redhat.com>
    + *
    + */
    +#include <linux/acpi.h>
    +#include <linux/cdev.h>
    +#include <linux/kernel.h>
    +#include <linux/mm.h>
    +#include <linux/module.h>
    +#include <linux/poll.h>
    +#include <linux/uuid.h>
    +#include <linux/vmgenid.h>
    +
    +#define DEV_NAME "vmgenid"
    +ACPI_MODULE_NAME(DEV_NAME);
    +
    +struct dev_data {
    +	struct cdev       cdev;
    +	dev_t             dev_id;
    +	unsigned long     map_buf;
    +
    +	void              *uuid_iomap;
    +	uuid_t            uuid;
    +	wait_queue_head_t read_wait;
    +
    +	atomic_t          watchers;
    +	atomic_t          outdated_watchers;
    +	wait_queue_head_t outdated_wait;
    +};
    +
    +struct file_data {
    +	struct dev_data  *dev_data;
    +	uuid_t           acked_uuid;
    +};
    +
    +static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid)
    +{
    +	return !memcmp(uuid, &priv->uuid, sizeof(uuid_t));
    +}
    +
    +static void vmgenid_put_outdated_watchers(struct dev_data *priv)
    +{
    +	if (atomic_dec_and_test(&priv->outdated_watchers))
    +		wake_up_interruptible(&priv->outdated_wait);
    +}
    +
    +static int vmgenid_open(struct inode *inode, struct file *file)
    +{
    +	struct dev_data *priv =
    +		container_of(inode->i_cdev, struct dev_data, cdev);
    +	struct file_data *file_data =
    +		kzalloc(sizeof(struct file_data), GFP_KERNEL);
    +
    +	if (!file_data)
    +		return -ENOMEM;
    +
    +	file_data->acked_uuid = priv->uuid;
    +	file_data->dev_data = priv;
    +
    +	file->private_data = file_data;
    +	atomic_inc(&priv->watchers);
    +
    +	return 0;
    +}
    +
    +static int vmgenid_close(struct inode *inode, struct file *file)
    +{
    +	struct file_data *file_data = (struct file_data *) file->private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +
    +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
    +		vmgenid_put_outdated_watchers(priv);
    +	atomic_dec(&priv->watchers);
    +	kfree(file->private_data);
    +
    +	return 0;
    +}
    +
    +static ssize_t
    +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
    +{
    +	struct file_data *file_data =
    +		(struct file_data *) file->private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +	ssize_t ret;
    +
    +	if (nbytes == 0)
    +		return 0;
    +	/* disallow partial UUID reads */
    +	if (nbytes < sizeof(uuid_t))
    +		return -EINVAL;
    +	nbytes = sizeof(uuid_t);
    +
    +	if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
    +		if (file->f_flags & O_NONBLOCK)
    +			return -EAGAIN;
    +		ret = wait_event_interruptible(
    +			priv->read_wait,
    +			!vmgenid_uuid_matches(priv, &file_data->acked_uuid)
    +		);
    +		if (ret)
    +			return ret;
    +	}
    +
    +	ret = copy_to_user(ubuf, &priv->uuid, nbytes);
    +	if (ret)
    +		return -EFAULT;
    +
    +	return nbytes;
    +}
    +
    +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
    +				size_t count, loff_t *ppos)
    +{
    +	struct file_data *file_data = (struct file_data *) file->private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +	uuid_t ack_uuid;
    +
    +	/* disallow partial UUID writes */
    +	if (count != sizeof(uuid_t))
    +		return -EINVAL;
    +	if (copy_from_user(&ack_uuid, ubuf, count))
    +		return -EFAULT;
    +	/* wrong UUID acknowledged */
    +	if (!vmgenid_uuid_matches(priv, &ack_uuid))
    +		return -EINVAL;
    +
    +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
    +		/* update local view of UUID */
    +		file_data->acked_uuid = ack_uuid;
    +		vmgenid_put_outdated_watchers(priv);
    +	}
    +
    +	return (ssize_t)count;
    +}
    +
    +static __poll_t
    +vmgenid_poll(struct file *file, poll_table *wait)
    +{
    +	__poll_t mask = 0;
    +	struct file_data *file_data =
    +		(struct file_data *) file->private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +
    +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
    +		return EPOLLIN | EPOLLRDNORM;
    +
    +	poll_wait(file, &priv->read_wait, wait);
    +
    +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
    +		mask = EPOLLIN | EPOLLRDNORM;
    +
    +	return mask;
    +}
    +
    +static long vmgenid_ioctl(struct file *file,
    +		unsigned int cmd, unsigned long arg)
    +{
    +	struct file_data *file_data =
    +		(struct file_data *) file->private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +	struct timespec __user *timeout = (void *) arg;
    +	struct timespec kts;
    +	ktime_t until;
    +	int ret;
    +
    +	switch (cmd) {
    +	case VMGENID_GET_OUTDATED_WATCHERS:
    +		ret = atomic_read(&priv->outdated_watchers);
    +		break;
    +	case VMGENID_WAIT_WATCHERS:
    +		if (timeout) {
    +			ret = copy_from_user(&kts, timeout, sizeof(kts));
    +			if (ret)
    +				return -EFAULT;
    +			until = timespec_to_ktime(kts);
    +		} else {
    +			until = KTIME_MAX;
    +		}
    +
    +		ret = wait_event_interruptible_hrtimeout(
    +			priv->outdated_wait,
    +			!atomic_read(&priv->outdated_watchers),
    +			until
    +		);
    +		break;
    +	default:
    +		ret = -EINVAL;
    +		break;
    +	}
    +	return ret;
    +}
    +
    +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf)
    +{
    +	struct page *page;
    +	struct file_data *file_data =
    +			(struct file_data *) vmf->vma->vm_private_data;
    +	struct dev_data *priv = file_data->dev_data;
    +
    +	if (priv->map_buf) {
    +		page = virt_to_page(priv->map_buf);
    +		get_page(page);
    +		vmf->page = page;
    +	}
    +
    +	return 0;
    +}
    +
    +static const struct vm_operations_struct vmgenid_vm_ops = {
    +	.fault = vmgenid_vm_fault,
    +};
    +
    +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
    +{
    +	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
    +		return -EINVAL;
    +
    +	if ((vma->vm_flags & VM_WRITE) != 0)
    +		return -EPERM;
    +
    +	vma->vm_ops = &vmgenid_vm_ops;
    +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
    +	vma->vm_private_data = file->private_data;
    +
    +	return 0;
    +}
    +
    +static const struct file_operations fops = {
    +	.owner          = THIS_MODULE,
    +	.mmap           = vmgenid_mmap,
    +	.open           = vmgenid_open,
    +	.release        = vmgenid_close,
    +	.read           = vmgenid_read,
    +	.write          = vmgenid_write,
    +	.poll           = vmgenid_poll,
    +	.compat_ioctl   = vmgenid_ioctl,
    +	.unlocked_ioctl = vmgenid_ioctl,
    +};
    +
    +static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle)
    +{
    +	int i;
    +	phys_addr_t phys_addr;
    +	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
    +	acpi_status status;
    +	union acpi_object *pss;
    +	union acpi_object *element;
    +
    +	status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
    +	if (ACPI_FAILURE(status)) {
    +		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
    +		return -ENODEV;
    +	}
    +	pss = buffer.pointer;
    +	if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
    +		return -EINVAL;
    +
    +	phys_addr = 0;
    +	for (i = 0; i < pss->package.count; i++) {
    +		element = &(pss->package.elements[i]);
    +		if (element->type != ACPI_TYPE_INTEGER)
    +			return -EINVAL;
    +		phys_addr |= element->integer.value << i * 32;
    +	}
    +
    +	priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
    +	if (!priv->uuid_iomap) {
    +		pr_err("Could not map memory at 0x%llx, size %u",
    +			   phys_addr,
    +			   (u32)sizeof(uuid_t));
    +		return -ENOMEM;
    +	}
    +
    +	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
    +	memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
    +
    +	return 0;
    +}
    +
    +static int vmgenid_acpi_add(struct acpi_device *device)
    +{
    +	int ret;
    +	struct dev_data *priv;
    +
    +	if (!device)
    +		return -EINVAL;
    +
    +	priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL);
    +	if (!priv)
    +		return -ENOMEM;
    +
    +	priv->map_buf = get_zeroed_page(GFP_KERNEL);
    +	if (!priv->map_buf) {
    +		ret = -ENOMEM;
    +		goto free;
    +	}
    +
    +	device->driver_data = priv;
    +
    +	init_waitqueue_head(&priv->read_wait);
    +	atomic_set(&priv->watchers, 0);
    +	atomic_set(&priv->outdated_watchers, 0);
    +	init_waitqueue_head(&priv->outdated_wait);
    +
    +	ret = vmgenid_acpi_map(priv, device->handle);
    +	if (ret < 0)
    +		goto err;
    +
    +	ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME);
    +	if (ret < 0) {
    +		pr_err("alloc_chrdev_region() failed for vmgenid\n");
    +		goto err;
    +	}
    +
    +	cdev_init(&priv->cdev, &fops);
    +	cdev_add(&priv->cdev, priv->dev_id, 1);
    +
    +	return 0;
    +
    +err:
    +	if (priv->uuid_iomap)
    +		acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
    +
    +	free_pages(priv->map_buf, 0);
    +
    +free:
    +	kfree(priv);
    +
    +	return ret;
    +}
    +
    +static int vmgenid_acpi_remove(struct acpi_device *device)
    +{
    +	struct dev_data *priv;
    +
    +	if (!device || !acpi_driver_data(device))
    +		return -EINVAL;
    +	priv = acpi_driver_data(device);
    +
    +	cdev_del(&priv->cdev);
    +	unregister_chrdev_region(priv->dev_id, 1);
    +	device->driver_data = NULL;
    +
    +	if (priv->uuid_iomap)
    +		acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
    +	free_pages(priv->map_buf, 0);
    +	kfree(priv);
    +
    +	return 0;
    +}
    +
    +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
    +{
    +	uuid_t old_uuid;
    +	struct dev_data *priv;
    +
    +	pr_debug("VMGENID notified, event %u", event);
    +
    +	if (!device || !acpi_driver_data(device)) {
    +		pr_err("VMGENID notify with NULL private data");
    +		return;
    +	}
    +	priv = acpi_driver_data(device);
    +
    +	/* update VM Generation UUID */
    +	old_uuid = priv->uuid;
    +	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
    +
    +	if (!vmgenid_uuid_matches(priv, &old_uuid)) {
    +		/* HW uuid updated */
    +		memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
    +		atomic_set(&priv->outdated_watchers,
    +			 atomic_read(&priv->watchers));
    +		wake_up_interruptible(&priv->read_wait);
    +	}
    +}
    +
    +static const struct acpi_device_id vmgenid_ids[] = {
    +	{"QEMUVGID", 0},
    +	{"", 0},
    +};
    +
    +static struct acpi_driver acpi_vmgenid_driver = {
    +	.name = "vm_generation_id",
    +	.ids = vmgenid_ids,
    +	.owner = THIS_MODULE,
    +	.ops = {
    +		.add = vmgenid_acpi_add,
    +		.remove = vmgenid_acpi_remove,
    +		.notify = vmgenid_acpi_notify,
    +	}
    +};
    +
    +static int __init vmgenid_init(void)
    +{
    +	return acpi_bus_register_driver(&acpi_vmgenid_driver);
    +}
    +
    +static void __exit vmgenid_exit(void)
    +{
    +	acpi_bus_unregister_driver(&acpi_vmgenid_driver);
    +}
    +
    +module_init(vmgenid_init);
    +module_exit(vmgenid_exit);
    +
    +MODULE_AUTHOR("Adrian Catangiu");
    +MODULE_DESCRIPTION("Virtual Machine Generation ID");
    +MODULE_LICENSE("GPL");
    +MODULE_VERSION("0.1");
    diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
    new file mode 100644
    index 0000000..f7fca7b
    --- /dev/null
    +++ b/include/uapi/linux/vmgenid.h
    @@ -0,0 +1,22 @@
    +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
    +/*
    + * Copyright (c) 2020, Amazon.com Inc.
    + *
    + * This program is free software; you can redistribute it and/or
    + * modify it under the terms of the GNU General Public License
    + * as published by the Free Software Foundation; either version
    + * 2 of the License, or (at your option) any later version.
    + */
    +
    +#ifndef _UAPI_LINUX_VMGENID_H
    +#define _UAPI_LINUX_VMGENID_H
    +
    +#include <linux/ioctl.h>
    +#include <linux/time.h>
    +
    +#define VMGENID_IOCTL 0x2d
    +#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
    +#define VMGENID_WAIT_WATCHERS         _IOW(VMGENID_IOCTL, 2, struct timespec)
    +
    +#endif /* _UAPI_LINUX_VMGENID_H */
    +
    -- 
    2.7.4
    
    
    




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Greg KH Oct. 16, 2020, 3:14 p.m. UTC | #2
On Fri, Oct 16, 2020 at 02:33:15PM +0000, Catangiu, Adrian Costin wrote:
> +config VMGENID
> +	tristate "Virtual Machine Generation ID driver"
> +	depends on ACPI
> +	default M

Unless this is required to boot a machine, this should be removed.

> +	help
> +	  This is a Virtual Machine Generation ID driver which provides
> +	  a virtual machine unique identifier. The provided UUID can be
> +	  watched through the FS interface exposed by this driver, and
> +	  thus can provide notifications for VM snapshot or cloning events.

Where is the uuid exposed in the filesystem?

> +	  This enables applications and libraries that store or cache
> +	  sensitive information, to know that they need to regenerate it
> +	  after process memory has been exposed to potential copying.

Module name?

> +
>  config FSL_HV_MANAGER
>  	tristate "Freescale hypervisor management driver"
>  	depends on FSL_SOC
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index fd33124..a1f8dcc 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -4,4 +4,5 @@
>  #
>  
>  obj-$(CONFIG_FSL_HV_MANAGER)	+= fsl_hypervisor.o
> +obj-$(CONFIG_VMGENID)		+= vmgenid.o
>  obj-y				+= vboxguest/
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> new file mode 100644
> index 0000000..d314c72
> --- /dev/null
> +++ b/drivers/virt/vmgenid.c
> @@ -0,0 +1,419 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Virtual Machine Generation ID driver
> + *
> + * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc

That's not how you write copyright lines, please fix up.

> + * All rights reserved.
> + *	Authors:
> + *	  Adrian Catangiu <acatan@amazon.com>
> + *	  Or Idgar <oridgar@gmail.com>
> + *	  Gal Hammer <ghammer@redhat.com>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/cdev.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/uuid.h>
> +#include <linux/vmgenid.h>
> +
> +#define DEV_NAME "vmgenid"
> +ACPI_MODULE_NAME(DEV_NAME);
> +
> +struct dev_data {
> +	struct cdev       cdev;
> +	dev_t             dev_id;
> +	unsigned long     map_buf;
> +
> +	void              *uuid_iomap;
> +	uuid_t            uuid;
> +	wait_queue_head_t read_wait;
> +
> +	atomic_t          watchers;
> +	atomic_t          outdated_watchers;
> +	wait_queue_head_t outdated_wait;
> +};
> +
> +struct file_data {
> +	struct dev_data  *dev_data;
> +	uuid_t           acked_uuid;
> +};
> +
> +static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid)
> +{
> +	return !memcmp(uuid, &priv->uuid, sizeof(uuid_t));
> +}
> +
> +static void vmgenid_put_outdated_watchers(struct dev_data *priv)
> +{
> +	if (atomic_dec_and_test(&priv->outdated_watchers))
> +		wake_up_interruptible(&priv->outdated_wait);
> +}
> +
> +static int vmgenid_open(struct inode *inode, struct file *file)
> +{
> +	struct dev_data *priv =
> +		container_of(inode->i_cdev, struct dev_data, cdev);
> +	struct file_data *file_data =
> +		kzalloc(sizeof(struct file_data), GFP_KERNEL);
> +
> +	if (!file_data)
> +		return -ENOMEM;
> +
> +	file_data->acked_uuid = priv->uuid;
> +	file_data->dev_data = priv;
> +
> +	file->private_data = file_data;
> +	atomic_inc(&priv->watchers);
> +
> +	return 0;
> +}
> +
> +static int vmgenid_close(struct inode *inode, struct file *file)
> +{
> +	struct file_data *file_data = (struct file_data *) file->private_data;
> +	struct dev_data *priv = file_data->dev_data;
> +
> +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
> +		vmgenid_put_outdated_watchers(priv);
> +	atomic_dec(&priv->watchers);
> +	kfree(file->private_data);
> +
> +	return 0;
> +}
> +
> +static ssize_t
> +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
> +{
> +	struct file_data *file_data =
> +		(struct file_data *) file->private_data;
> +	struct dev_data *priv = file_data->dev_data;
> +	ssize_t ret;
> +
> +	if (nbytes == 0)
> +		return 0;
> +	/* disallow partial UUID reads */
> +	if (nbytes < sizeof(uuid_t))
> +		return -EINVAL;
> +	nbytes = sizeof(uuid_t);
> +
> +	if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
> +		if (file->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +		ret = wait_event_interruptible(
> +			priv->read_wait,
> +			!vmgenid_uuid_matches(priv, &file_data->acked_uuid)
> +		);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = copy_to_user(ubuf, &priv->uuid, nbytes);
> +	if (ret)
> +		return -EFAULT;
> +
> +	return nbytes;
> +}
> +
> +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct file_data *file_data = (struct file_data *) file->private_data;
> +	struct dev_data *priv = file_data->dev_data;
> +	uuid_t ack_uuid;
> +
> +	/* disallow partial UUID writes */
> +	if (count != sizeof(uuid_t))
> +		return -EINVAL;
> +	if (copy_from_user(&ack_uuid, ubuf, count))
> +		return -EFAULT;
> +	/* wrong UUID acknowledged */
> +	if (!vmgenid_uuid_matches(priv, &ack_uuid))
> +		return -EINVAL;
> +
> +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
> +		/* update local view of UUID */
> +		file_data->acked_uuid = ack_uuid;
> +		vmgenid_put_outdated_watchers(priv);
> +	}
> +
> +	return (ssize_t)count;
> +}
> +
> +static __poll_t
> +vmgenid_poll(struct file *file, poll_table *wait)
> +{
> +	__poll_t mask = 0;
> +	struct file_data *file_data =
> +		(struct file_data *) file->private_data;
> +	struct dev_data *priv = file_data->dev_data;
> +
> +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
> +		return EPOLLIN | EPOLLRDNORM;
> +
> +	poll_wait(file, &priv->read_wait, wait);
> +
> +	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
> +		mask = EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static long vmgenid_ioctl(struct file *file,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	struct file_data *file_data =
> +		(struct file_data *) file->private_data;
> +	struct dev_data *priv = file_data->dev_data;
> +	struct timespec __user *timeout = (void *) arg;
> +	struct timespec kts;
> +	ktime_t until;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VMGENID_GET_OUTDATED_WATCHERS:
> +		ret = atomic_read(&priv->outdated_watchers);
> +		break;
> +	case VMGENID_WAIT_WATCHERS:
> +		if (timeout) {
> +			ret = copy_from_user(&kts, timeout, sizeof(kts));
> +			if (ret)
> +				return -EFAULT;
> +			until = timespec_to_ktime(kts);

No validation of structure fields?

And you are exposing a kernel structure to userspace?

I don't see this function in Linus's tree right now, are you sure this
builds?


> +		} else {
> +			until = KTIME_MAX;
> +		}
> +
> +		ret = wait_event_interruptible_hrtimeout(
> +			priv->outdated_wait,
> +			!atomic_read(&priv->outdated_watchers),
> +			until
> +		);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf)
> +{
> +	struct page *page;
> +	struct file_data *file_data =
> +			(struct file_data *) vmf->vma->vm_private_data;

You do know you don't need to cast stuff like this, right?

Also run checkpatch.pl so maintainers are not grumpy and tell you to run
checkpatch.pl.

> +	struct dev_data *priv = file_data->dev_data;
> +
> +	if (priv->map_buf) {
> +		page = virt_to_page(priv->map_buf);
> +		get_page(page);
> +		vmf->page = page;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct vmgenid_vm_ops = {
> +	.fault = vmgenid_vm_fault,
> +};
> +
> +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> +		return -EINVAL;
> +
> +	if ((vma->vm_flags & VM_WRITE) != 0)
> +		return -EPERM;
> +
> +	vma->vm_ops = &vmgenid_vm_ops;
> +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_private_data = file->private_data;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations fops = {
> +	.owner          = THIS_MODULE,
> +	.mmap           = vmgenid_mmap,
> +	.open           = vmgenid_open,
> +	.release        = vmgenid_close,
> +	.read           = vmgenid_read,
> +	.write          = vmgenid_write,
> +	.poll           = vmgenid_poll,
> +	.compat_ioctl   = vmgenid_ioctl,
> +	.unlocked_ioctl = vmgenid_ioctl,
> +};
> +
> +static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle)
> +{
> +	int i;
> +	phys_addr_t phys_addr;
> +	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
> +	acpi_status status;
> +	union acpi_object *pss;
> +	union acpi_object *element;
> +
> +	status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
> +	if (ACPI_FAILURE(status)) {
> +		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
> +		return -ENODEV;
> +	}
> +	pss = buffer.pointer;
> +	if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
> +		return -EINVAL;
> +
> +	phys_addr = 0;
> +	for (i = 0; i < pss->package.count; i++) {
> +		element = &(pss->package.elements[i]);
> +		if (element->type != ACPI_TYPE_INTEGER)
> +			return -EINVAL;
> +		phys_addr |= element->integer.value << i * 32;
> +	}
> +
> +	priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
> +	if (!priv->uuid_iomap) {
> +		pr_err("Could not map memory at 0x%llx, size %u",
> +			   phys_addr,
> +			   (u32)sizeof(uuid_t));
> +		return -ENOMEM;
> +	}
> +
> +	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
> +	memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
> +
> +	return 0;
> +}
> +
> +static int vmgenid_acpi_add(struct acpi_device *device)
> +{
> +	int ret;
> +	struct dev_data *priv;
> +
> +	if (!device)
> +		return -EINVAL;
> +
> +	priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	priv->map_buf = get_zeroed_page(GFP_KERNEL);
> +	if (!priv->map_buf) {
> +		ret = -ENOMEM;
> +		goto free;
> +	}
> +
> +	device->driver_data = priv;
> +
> +	init_waitqueue_head(&priv->read_wait);
> +	atomic_set(&priv->watchers, 0);
> +	atomic_set(&priv->outdated_watchers, 0);
> +	init_waitqueue_head(&priv->outdated_wait);
> +
> +	ret = vmgenid_acpi_map(priv, device->handle);
> +	if (ret < 0)
> +		goto err;
> +
> +	ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME);

Why is this not a misc device driver instead of messing around with all
of this "by hand"?

By doing things this way, you are not creating a device node in /dev/
automatically, right?  How did you test this?

> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +/*
> + * Copyright (c) 2020, Amazon.com Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.

License boilerplate is not needed.

> + */
> +
> +#ifndef _UAPI_LINUX_VMGENID_H
> +#define _UAPI_LINUX_VMGENID_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/time.h>
> +
> +#define VMGENID_IOCTL 0x2d
> +#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
> +#define VMGENID_WAIT_WATCHERS         _IOW(VMGENID_IOCTL, 2, struct timespec)

Why do you need ioctls for this?  Why not just use read/write?

thanks,

greg k-h
Jann Horn Oct. 17, 2020, 1:40 a.m. UTC | #3
[adding some more people who are interested in RNG stuff: Andy, Jason,
Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
concerns some pretty fundamental API stuff related to RNG usage]

On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
<acatan@amazon.com> wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver which exposes the Virtual Machine Generation ID
> via a char-dev FS interface that provides ID update sync and async
> notification, retrieval and confirmation mechanisms:
>
> When the device is 'open()'ed a copy of the current vm UUID is
> associated with the file handle. 'read()' operations block until the
> associated UUID is no longer up to date - until HW vm gen id changes -
> at which point the new UUID is provided/returned. Nonblocking 'read()'
> uses EWOULDBLOCK to signal that there is no _new_ UUID available.
>
> 'poll()' is implemented to allow polling for UUID updates. Such
> updates result in 'EPOLLIN' events.
>
> Subsequent read()s following a UUID update no longer block, but return
> the updated UUID. The application needs to acknowledge the UUID update
> by confirming it through a 'write()'.
> Only on writing back to the driver the right/latest UUID, will the
> driver mark this "watcher" as up to date and remove EPOLLIN status.
>
> 'mmap()' support allows mapping a single read-only shared page which
> will always contain the latest UUID value at offset 0.

It would be nicer if that page just contained an incrementing counter,
instead of a UUID. It's not like the application cares *what* the UUID
changed to, just that it *did* change and all RNGs state now needs to
be reseeded from the kernel, right? And an application can't reliably
read the entire UUID from the memory mapping anyway, because the VM
might be forked in the middle.

So I think your kernel driver should detect UUID changes and then turn
those into a monotonically incrementing counter. (Probably 64 bits
wide?) (That's probably also a little bit faster than comparing an
entire UUID.)

An option might be to put that counter into the vDSO, instead of a
separate VMA; but I don't know how the other folks feel about that.
Andy, do you have opinions on this? That way, normal userspace code
that uses this infrastructure wouldn't have to mess around with a
special device at all. And it'd be usable in seccomp sandboxes and so
on without needing special plumbing. And libraries wouldn't have to
call open() and mess with file descriptor numbers.

> The driver also adds support for tracking count of open file handles
> that haven't acknowledged an UUID update. This is exposed through
> two IOCTLs:
>  * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
>    _outdated_ watchers - number of file handles that were open during
>    a VM generation change, and which have not yet confirmed the new
>    Vm-Gen-Id.
>  * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_
>    watchers, or if a 'timeout' argument is provided, until the timeout
>    expires.

Does this mean that code that uses the memory mapping to detect
changes is still supposed to confirm generation IDs? What about
multithreaded processes, especially ones that use libraries - if a
library opens a single file descriptor that is used from multiple
threads, is the library required to synchronize all its threads before
confirming the change? That seems like it could get messy. And it
again means that this interface can't easily be used from inside
seccomp sandboxes.

[...]
> diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
[...]
> +``close()``:
> +  Removes the file handle as a Vm-Gen-Id watcher.

(Linux doesn't have "file handles". Technically close() just closes a
file descriptor, and if that file descriptor points to the same file
description object (aka struct file) as another file descriptor,
nothing happens.)

> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> +       void watchdog_thread_handler(int *thread_active)
> +       {
> +               uuid_t uuid;
> +               int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);

In case we actually keep this API, you should use O_CLOEXEC here.

> +
> +               do {
> +                       // read new UUID - blocks until VM generation changes
> +                       read(fd, &uuid, sizeof(uuid));
> +
> +                       // because of VM generation change, we need to rebuild world
> +                       reseed_app_env();
> +
> +                       // confirm we're done handling UUID update
> +                       write(fd, &uuid, sizeof(uuid));
> +               } while (atomic_read(thread_active));
> +
> +               close(fd);
> +       }
[...]
> +3) Mapped memory polling simplified example::
> +
> +       /*
> +        * app/library function that provides cached secrets
> +        */
> +       char * safe_cached_secret(app_data_t *app)
> +       {
> +               char *secret;
> +               volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app);
> +       again:
> +               secret = __cached_secret(app);
> +
> +               if (unlikely(*uuid_ptr != app->cached_uuid)) {
> +                       app->cached_uuid = *uuid_ptr;
> +
> +                       // rebuild world then confirm the uuid update (thru write)
> +                       rebuild_caches(app);
> +                       ack_vmgenid_update(app);
> +
> +                       goto again;
> +               }
> +
> +               return secret;
> +       }
> +
> +4) Orchestrator simplified example::
> +
> +       /*
> +        * orchestrator - manages multiple apps and libraries used by a service
> +        * and tries to make sure all sensitive components gracefully handle
> +        * VM generation changes.
> +        * Following function is called on detection of a VM generation change.
> +        */
> +       int handle_vmgen_update(int vmgenfd, uuid_t new_uuid)
> +       {
> +               // pause until all components have handled event
> +               pause_service();
> +
> +               // confirm *this* watcher as up-to-date
> +               write(fd, &new_uuid, sizeof(uuid_t));
> +
> +               // wait for all *others*
> +               ioctl(fd, VMGENID_WAIT_WATCHERS, NULL);
> +
> +               // all apps on the system have rebuilt worlds
> +               resume_service();
> +       }

Can you describe what value such an "Orchestrator" would add? Because
it seems to me like this will just unnecessarily complicate things.

[...]
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index fd33124..a1f8dcc 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -4,4 +4,5 @@
>  #
>
>  obj-$(CONFIG_FSL_HV_MANAGER)   += fsl_hypervisor.o
> +obj-$(CONFIG_VMGENID)          += vmgenid.o
>  obj-y                          += vboxguest/
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
[...]
> +static int vmgenid_close(struct inode *inode, struct file *file)
> +{
> +       struct file_data *file_data = (struct file_data *) file->private_data;
> +       struct dev_data *priv = file_data->dev_data;
> +
> +       if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
> +               vmgenid_put_outdated_watchers(priv);
> +       atomic_dec(&priv->watchers);

What happens if the UUID changes between the previous two calls? Then
the outdated watcher count will go out of sync, right?

(But as I've said, I think that maybe the outdated watcher counting is
a bad idea in general, and we should just get rid of it.)

> +       kfree(file->private_data);
> +
> +       return 0;
> +}
> +
> +static ssize_t
> +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
> +{
> +       struct file_data *file_data =
> +               (struct file_data *) file->private_data;
> +       struct dev_data *priv = file_data->dev_data;
> +       ssize_t ret;
> +
> +       if (nbytes == 0)
> +               return 0;
> +       /* disallow partial UUID reads */
> +       if (nbytes < sizeof(uuid_t))
> +               return -EINVAL;
> +       nbytes = sizeof(uuid_t);
> +
> +       if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
> +               if (file->f_flags & O_NONBLOCK)
> +                       return -EAGAIN;
> +               ret = wait_event_interruptible(
> +                       priv->read_wait,
> +                       !vmgenid_uuid_matches(priv, &file_data->acked_uuid)
> +               );
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       ret = copy_to_user(ubuf, &priv->uuid, nbytes);

If the VM is again forked in the middle of this, will userspace see a
torn UUID (consisting of half old and half new value)?

> +       if (ret)
> +               return -EFAULT;
> +
> +       return nbytes;
> +}
[...]
> +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf)
> +{
> +       struct page *page;
> +       struct file_data *file_data =
> +                       (struct file_data *) vmf->vma->vm_private_data;
> +       struct dev_data *priv = file_data->dev_data;
> +
> +       if (priv->map_buf) {
> +               page = virt_to_page(priv->map_buf);
> +               get_page(page);
> +               vmf->page = page;
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct vm_operations_struct vmgenid_vm_ops = {
> +       .fault = vmgenid_vm_fault,
> +};
> +
> +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> +               return -EINVAL;
> +
> +       if ((vma->vm_flags & VM_WRITE) != 0)
> +               return -EPERM;

This doesn't work, you also need to clear VM_MAYWRITE. See e.g. binder_mmap().

Also, I think mmap handlers for special mappings like this usually
directly install the page inside the mmap handler, using something
like vm_insert_page(). And then they don't need a ->fault handler.

(But if we decide to put this into the vDSO, the whole memory mapping
thing would become unnecessary anyway.)

> +       vma->vm_ops = &vmgenid_vm_ops;
> +       vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +       vma->vm_private_data = file->private_data;
> +
> +       return 0;
> +}
> +
> +static const struct file_operations fops = {
> +       .owner          = THIS_MODULE,
> +       .mmap           = vmgenid_mmap,
> +       .open           = vmgenid_open,
> +       .release        = vmgenid_close,
> +       .read           = vmgenid_read,
> +       .write          = vmgenid_write,
> +       .poll           = vmgenid_poll,
> +       .compat_ioctl   = vmgenid_ioctl,

You don't need to define a compat_ioctl if the normal ioctl handler is the same.

> +       .unlocked_ioctl = vmgenid_ioctl,
> +};
[...]
> +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
> +{
> +       uuid_t old_uuid;
> +       struct dev_data *priv;
> +
> +       pr_debug("VMGENID notified, event %u", event);
> +
> +       if (!device || !acpi_driver_data(device)) {
> +               pr_err("VMGENID notify with NULL private data");
> +               return;
> +       }
> +       priv = acpi_driver_data(device);
> +
> +       /* update VM Generation UUID */
> +       old_uuid = priv->uuid;
> +       memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
> +
> +       if (!vmgenid_uuid_matches(priv, &old_uuid)) {
> +               /* HW uuid updated */
> +               memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
> +               atomic_set(&priv->outdated_watchers,
> +                        atomic_read(&priv->watchers));
> +               wake_up_interruptible(&priv->read_wait);
> +       }
> +}

If we know that the VM just got forked, we should probably also make
sure that we reseed the kernel's internal RNG before we tell userspace
to fetch new RNG seeds from the kernel? Otherwise this is kinda
pointless. Or are we already taking care of that elsewhere? If not, we
should probably mix the UUID into the entropy pool (at least
`write_pool(&input_pool, uuid, sizeof(uuid_t))`, although technically
it would be better to do it in a way that ensures that userspace can't
write the same value into the RNG - maybe we should introduce type
prefixes into write_pool()?) and then trigger a reseed of everything
else (`crng_reseed(&primary_crng, &input_pool)`).
Willy Tarreau Oct. 17, 2020, 3:36 a.m. UTC | #4
On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote:
> [adding some more people who are interested in RNG stuff: Andy, Jason,
> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> concerns some pretty fundamental API stuff related to RNG usage]
> 
> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
> <acatan@amazon.com> wrote:
> > This patch is a driver which exposes the Virtual Machine Generation ID
> > via a char-dev FS interface that provides ID update sync and async
> > notification, retrieval and confirmation mechanisms:
> >
> > When the device is 'open()'ed a copy of the current vm UUID is
> > associated with the file handle. 'read()' operations block until the
> > associated UUID is no longer up to date - until HW vm gen id changes -
> > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> >
> > 'poll()' is implemented to allow polling for UUID updates. Such
> > updates result in 'EPOLLIN' events.
> >
> > Subsequent read()s following a UUID update no longer block, but return
> > the updated UUID. The application needs to acknowledge the UUID update
> > by confirming it through a 'write()'.
> > Only on writing back to the driver the right/latest UUID, will the
> > driver mark this "watcher" as up to date and remove EPOLLIN status.
> >
> > 'mmap()' support allows mapping a single read-only shared page which
> > will always contain the latest UUID value at offset 0.
> 
> It would be nicer if that page just contained an incrementing counter,
> instead of a UUID. It's not like the application cares *what* the UUID
> changed to, just that it *did* change and all RNGs state now needs to
> be reseeded from the kernel, right? And an application can't reliably
> read the entire UUID from the memory mapping anyway, because the VM
> might be forked in the middle.
> 
> So I think your kernel driver should detect UUID changes and then turn
> those into a monotonically incrementing counter. (Probably 64 bits
> wide?) (That's probably also a little bit faster than comparing an
> entire UUID.)

I agree with this. Further, I'm observing there is a very common
confusion between "universally unique" and "random". Randoms are
needed when seeking unpredictability. A random number generator
*must* be able to return the same value multiple times in a row
(though this is rare), otherwise it's not random.

To illustrate this, a die has less than 3 bits of randomness and
is sufficient to play games with friends where a counter would allow
everyone to cheat. Conversely if you want to assign IDs to members
of your family you'd rather use a counter than a die for this or
you risk collisions and/or long series of retries to pick unique
IDs.

RFC4122 explains in great length how to produce guaranteed unique
IDs, and this only involves space, time and counters. There's
indeed a lazy variant that probably everyone uses nowadays,
consisting in picking random numbers, but this is not guaranteed
unique anymore.

If the UUIDs used there are real UUIDs, it could be as simple as
updating them according to their format, i.e. updating the timestamp,
and if the timestamp is already the same, just increase the seq counter.
Doing this doesn't require entropy, doesn't need to block and doesn't
needlessly leak randoms that sometimes make people feel nervous.

Just my two cents,
Willy
Jann Horn Oct. 17, 2020, 4:02 a.m. UTC | #5
On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote:
> > [adding some more people who are interested in RNG stuff: Andy, Jason,
> > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> > concerns some pretty fundamental API stuff related to RNG usage]
> >
> > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
> > <acatan@amazon.com> wrote:
> > > This patch is a driver which exposes the Virtual Machine Generation ID
> > > via a char-dev FS interface that provides ID update sync and async
> > > notification, retrieval and confirmation mechanisms:
> > >
> > > When the device is 'open()'ed a copy of the current vm UUID is
> > > associated with the file handle. 'read()' operations block until the
> > > associated UUID is no longer up to date - until HW vm gen id changes -
> > > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> > >
> > > 'poll()' is implemented to allow polling for UUID updates. Such
> > > updates result in 'EPOLLIN' events.
> > >
> > > Subsequent read()s following a UUID update no longer block, but return
> > > the updated UUID. The application needs to acknowledge the UUID update
> > > by confirming it through a 'write()'.
> > > Only on writing back to the driver the right/latest UUID, will the
> > > driver mark this "watcher" as up to date and remove EPOLLIN status.
> > >
> > > 'mmap()' support allows mapping a single read-only shared page which
> > > will always contain the latest UUID value at offset 0.
> >
> > It would be nicer if that page just contained an incrementing counter,
> > instead of a UUID. It's not like the application cares *what* the UUID
> > changed to, just that it *did* change and all RNGs state now needs to
> > be reseeded from the kernel, right? And an application can't reliably
> > read the entire UUID from the memory mapping anyway, because the VM
> > might be forked in the middle.
> >
> > So I think your kernel driver should detect UUID changes and then turn
> > those into a monotonically incrementing counter. (Probably 64 bits
> > wide?) (That's probably also a little bit faster than comparing an
> > entire UUID.)
>
> I agree with this. Further, I'm observing there is a very common
> confusion between "universally unique" and "random". Randoms are
> needed when seeking unpredictability. A random number generator
> *must* be able to return the same value multiple times in a row
> (though this is rare), otherwise it's not random.
[...]
> If the UUIDs used there are real UUIDs, it could be as simple as
> updating them according to their format, i.e. updating the timestamp,
> and if the timestamp is already the same, just increase the seq counter.
> Doing this doesn't require entropy, doesn't need to block and doesn't
> needlessly leak randoms that sometimes make people feel nervous.

Those UUIDs are supplied by existing hypervisor code; in that regard,
this is almost like a driver for a hardware device. It is written
against a fixed API provided by the underlying machine. Making sure
that the sequence of UUIDs, as seen from inside the machine, never
changes back to a previous one is the responsibility of the hypervisor
and out of scope for this driver.

Microsoft's spec document (which is a .docx file for reasons I don't
understand) actually promises us that it is a cryptographically random
128-bit integer value, which means that if you fork a VM 2^64 times,
the probability that any two of those VMs have the same counter is
2^-64. That should be good enough.

But in userspace, we just need a simple counter. There's no need for
us to worry about anything else, like timestamps or whatever. If we
repeatedly fork a paused VM, the forked VMs will see the same counter
value, but that's totally fine, because the only thing that matters to
userspace is that the counter changes when the VM is forked.

And actually, since the value is a cryptographically random 128-bit
value, I think that we should definitely use it to help reseed the
kernel's RNG, and keep it secret from userspace. That way, even if the
VM image is public, we can ensure that going forward, the kernel RNG
will return securely random data.
Colm MacCarthaigh Oct. 17, 2020, 4:34 a.m. UTC | #6
On 16 Oct 2020, at 21:02, Jann Horn wrote:
> On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote:
> But in userspace, we just need a simple counter. There's no need for
> us to worry about anything else, like timestamps or whatever. If we
> repeatedly fork a paused VM, the forked VMs will see the same counter
> value, but that's totally fine, because the only thing that matters to
> userspace is that the counter changes when the VM is forked.

For user-space, even a single bit would do. We added MADVISE_WIPEONFORK 
so that userspace libraries can detect fork()/clone() robustly, for the 
same reasons. It just wipes a page as the indicator, which is 
effectively a single-bit signal, and it works well. On the user-space 
side of this, I’m keen to find a solution like that that we can use 
fairly easily inside of portable libraries and applications. The “have 
I forked” checks do end up in hot paths, so it’s nice if they can be 
CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my 
favorite.

> And actually, since the value is a cryptographically random 128-bit
> value, I think that we should definitely use it to help reseed the
> kernel's RNG, and keep it secret from userspace. That way, even if the
> VM image is public, we can ensure that going forward, the kernel RNG
> will return securely random data.

If the image is public, you need some extra new raw entropy from 
somewhere. The gen-id could be mixed in, that can’t do any harm as 
long as rigorous cryptographic mixing with the prior state is used, but 
if that’s all you do then the final state is still deterministic and 
non-secret. The kernel would need to use the change as a trigger to 
measure some entropy (e.g. interrupts and RDRAND, or whatever). Our just 
define the machine contract as “this has to be unique random data and 
if it’s not unique, or if it’s pubic, you’re toast”.


-
Colm
Jann Horn Oct. 17, 2020, 5:01 a.m. UTC | #7
On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh <colmmacc@amazon.com> wrote:
> On 16 Oct 2020, at 21:02, Jann Horn wrote:
> > On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote:
> > But in userspace, we just need a simple counter. There's no need for
> > us to worry about anything else, like timestamps or whatever. If we
> > repeatedly fork a paused VM, the forked VMs will see the same counter
> > value, but that's totally fine, because the only thing that matters to
> > userspace is that the counter changes when the VM is forked.
>
> For user-space, even a single bit would do. We added MADVISE_WIPEONFORK
> so that userspace libraries can detect fork()/clone() robustly, for the
> same reasons. It just wipes a page as the indicator, which is
> effectively a single-bit signal, and it works well. On the user-space
> side of this, I’m keen to find a solution like that that we can use
> fairly easily inside of portable libraries and applications. The “have
> I forked” checks do end up in hot paths, so it’s nice if they can be
> CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my
> favorite.

I'm pretty sure a single bit is not enough if you want to have a
single page, shared across the entire system, that stores the VM
forking state; you need a counter for that.

> > And actually, since the value is a cryptographically random 128-bit
> > value, I think that we should definitely use it to help reseed the
> > kernel's RNG, and keep it secret from userspace. That way, even if the
> > VM image is public, we can ensure that going forward, the kernel RNG
> > will return securely random data.
>
> If the image is public, you need some extra new raw entropy from
> somewhere. The gen-id could be mixed in, that can’t do any harm as
> long as rigorous cryptographic mixing with the prior state is used, but
> if that’s all you do then the final state is still deterministic and
> non-secret.

Microsoft's documentation
(http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
Generation ID that we get after a fork "is a 128-bit,
cryptographically random integer value". If multiple people use the
same image, it guarantees that each use of the image gets its own,
fresh ID: The table in section "How to implement virtual machine
generation ID support in a virtualization platform" says that (among
other things) "Virtual machine is imported, copied, or cloned"
generates a new generation ID.

So the RNG state after mixing in the new VM Generation ID would
contain 128 bits of secret entropy not known to anyone else, including
people with access to the VM image.

Now, 128 bits of cryptographically random data aren't _optimal_; I
think something on the order of 256 bits would be nicer from a
theoretical standpoint. But in practice I think we'll be good with the
128 bits we're getting (since the number of users who fork a VM image
is probably not going to be so large that worst-case collision
probabilities matter).

> The kernel would need to use the change as a trigger to
> measure some entropy (e.g. interrupts and RDRAND, or whatever). Our just
> define the machine contract as “this has to be unique random data and
> if it’s not unique, or if it’s pubic, you’re toast”.

As far as I can tell from Microsoft's spec, that is a guarantee we're
already getting.
Colm MacCarthaigh Oct. 17, 2020, 5:29 a.m. UTC | #8
On 16 Oct 2020, at 22:01, Jann Horn wrote:
>
> On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh 
> <colmmacc@amazon.com> wrote:
>> For user-space, even a single bit would do. We added 
>> MADVISE_WIPEONFORK
>> so that userspace libraries can detect fork()/clone() robustly, for 
>> the
>> same reasons. It just wipes a page as the indicator, which is
>> effectively a single-bit signal, and it works well. On the user-space
>> side of this, I’m keen to find a solution like that that we can use
>> fairly easily inside of portable libraries and applications. The 
>> “have
>> I forked” checks do end up in hot paths, so it’s nice if they can 
>> be
>> CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my
>> favorite.
>
> I'm pretty sure a single bit is not enough if you want to have a
> single page, shared across the entire system, that stores the VM
> forking state; you need a counter for that.

You’re right. WIPEONFORK is more like a single-bit per use. If it’s 
something system wide then a counter is better.

> So the RNG state after mixing in the new VM Generation ID would
> contain 128 bits of secret entropy not known to anyone else, including
> people with access to the VM image.
>
> Now, 128 bits of cryptographically random data aren't _optimal_; I
> think something on the order of 256 bits would be nicer from a
> theoretical standpoint. But in practice I think we'll be good with the
> 128 bits we're getting (since the number of users who fork a VM image
> is probably not going to be so large that worst-case collision
> probabilities matter).

This reminds me on key/IV usage limits for AES encryption, where the 
same birthday bounds apply, and even though 256-bits would be better, we 
routinely make 128-bit birthday bounds work for massively scalable 
systems.

>> The kernel would need to use the change as a trigger to
>> measure some entropy (e.g. interrupts and RDRAND, or whatever). Our 
>> just
>> define the machine contract as “this has to be unique random data 
>> and
>> if it’s not unique, or if it’s pubic, you’re toast”.
>
> As far as I can tell from Microsoft's spec, that is a guarantee we're
> already getting.

Neat.

-
Colm
Willy Tarreau Oct. 17, 2020, 5:37 a.m. UTC | #9
On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> Microsoft's documentation
> (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> Generation ID that we get after a fork "is a 128-bit,
> cryptographically random integer value". If multiple people use the
> same image, it guarantees that each use of the image gets its own,
> fresh ID:

No. It cannot be more unique than the source that feeds that cryptographic
transformation. All it guarantees is that the entropy source is protected
from being guessed based on the output. Applying cryptography on a simple
counter provides apparently random numbers that will be unique for a long
period for the same source, but as soon as you duplicate that code between
users and they start from the same counter they'll get the same IDs.

This is why I think that using a counter is better if you really need something
unique. Randoms only reduce predictability which helps avoiding collisions.
And I'm saying this as someone who had on his external gateway the same SSH
host key as 89 other hosts on the net, each of them using randoms to provide
a universally unique one...

Willy
Jann Horn Oct. 17, 2020, 5:52 a.m. UTC | #10
On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> > Microsoft's documentation
> > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> > Generation ID that we get after a fork "is a 128-bit,
> > cryptographically random integer value". If multiple people use the
> > same image, it guarantees that each use of the image gets its own,
> > fresh ID:
>
> No. It cannot be more unique than the source that feeds that cryptographic
> transformation. All it guarantees is that the entropy source is protected
> from being guessed based on the output. Applying cryptography on a simple
> counter provides apparently random numbers that will be unique for a long
> period for the same source, but as soon as you duplicate that code between
> users and they start from the same counter they'll get the same IDs.
>
> This is why I think that using a counter is better if you really need something
> unique. Randoms only reduce predictability which helps avoiding collisions.

Microsoft's spec tells us that they're giving us cryptographically
random numbers. Where they're getting those from is not our problem.
(And if even the hypervisor is not able to collect enough entropy to
securely generate random numbers, worrying about RNG reseeding in the
guest would be kinda pointless, we'd be fairly screwed anyway.)

Also note that we don't actually need to *always* reinitialize RNG
state on forks for functional correctness; it is fine if that fails
with a probability of 2^-128, because functionally everything will be
fine, and an attacker who is that lucky could also just guess an AES
key (which has the same probability of being successful). (And also
2^-128 is such a tiny number that it doesn't matter anyway.)

> And I'm saying this as someone who had on his external gateway the same SSH
> host key as 89 other hosts on the net, each of them using randoms to provide
> a universally unique one...

If your SSH host key was shared with 89 other hosts, it evidently
wasn't generated from cryptographically random numbers. :P Either
because the key generator was not properly hooked up to the system's
entropy pool (if you're talking about the Debian fiasco), or because
the system simply did not have enough entropy available. (Or because
the key generator is broken, but I don't think that ever happened with
OpenSSH?)
Willy Tarreau Oct. 17, 2020, 6:44 a.m. UTC | #11
On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote:
> On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote:
> > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> > > Microsoft's documentation
> > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> > > Generation ID that we get after a fork "is a 128-bit,
> > > cryptographically random integer value". If multiple people use the
> > > same image, it guarantees that each use of the image gets its own,
> > > fresh ID:
> >
> > No. It cannot be more unique than the source that feeds that cryptographic
> > transformation. All it guarantees is that the entropy source is protected
> > from being guessed based on the output. Applying cryptography on a simple
> > counter provides apparently random numbers that will be unique for a long
> > period for the same source, but as soon as you duplicate that code between
> > users and they start from the same counter they'll get the same IDs.
> >
> > This is why I think that using a counter is better if you really need something
> > unique. Randoms only reduce predictability which helps avoiding collisions.
> 
> Microsoft's spec tells us that they're giving us cryptographically
> random numbers. Where they're getting those from is not our problem.
> (And if even the hypervisor is not able to collect enough entropy to
> securely generate random numbers, worrying about RNG reseeding in the
> guest would be kinda pointless, we'd be fairly screwed anyway.)

Sorry if I sound annoying, but it's a matter of terminology and needs.

Cryptograhically random means safe for use with cryptography in that it
is unguessable enough so that you can use it for encryption keys that
nobody will be able to guess. It in no ways guarantees uniqueness, just
like you don't really care if the symmetric crypto key of you VPN has
already been used once somewhere else as long as there's no way to know.
However with the good enough distribution that a CSPRNG provides,
collisions within a *same* generator are bound to a very low, predictable
rate which is by generally considered as acceptable for all use cases.

Something random (cryptographically or not) *cannot* be unique by
definition, otherwise it's not random anymore, since each draw has an
influence on the remaining list of possible draws, which is contrary to
randomness. And conversely something unique cannot be completely random
because if you know it's unique, you can already rule out all other known
values from the candidates, thus it's more predictable than random.

With this in mind, picking randoms from a same RNG is often highly
sufficient to consider they're highly likely unique within a long
period. But it's not a guarantee. And it's even less one between two
RNGs (e.g. if uniqueness is required between multiple hypervisors in
case VMs are migrated or centrally managed, which I don't know).

If what is sought here is a strong guarantee of uniqueness, using a
counter as you first suggested is better. If what is sought is pure
randomness (in the sense that it's unpredictable, which I don't think
is needed here), then randoms are better. If both are required, just
concatenate a counter and a random. And if you need them to be spatially
unique, just include a node identifier.

Now the initial needs in the forwarded message are not entirely clear
to me but I wanted to rule out the apparent mismatch between the expressed
needs for uniqueness and the proposed solutions solely based on randomness.

Cheers,
Willy
Jann Horn Oct. 17, 2020, 6:55 a.m. UTC | #12
On Sat, Oct 17, 2020 at 8:44 AM Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote:
> > On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote:
> > > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> > > > Microsoft's documentation
> > > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> > > > Generation ID that we get after a fork "is a 128-bit,
> > > > cryptographically random integer value". If multiple people use the
> > > > same image, it guarantees that each use of the image gets its own,
> > > > fresh ID:
> > >
> > > No. It cannot be more unique than the source that feeds that cryptographic
> > > transformation. All it guarantees is that the entropy source is protected
> > > from being guessed based on the output. Applying cryptography on a simple
> > > counter provides apparently random numbers that will be unique for a long
> > > period for the same source, but as soon as you duplicate that code between
> > > users and they start from the same counter they'll get the same IDs.
> > >
> > > This is why I think that using a counter is better if you really need something
> > > unique. Randoms only reduce predictability which helps avoiding collisions.
> >
> > Microsoft's spec tells us that they're giving us cryptographically
> > random numbers. Where they're getting those from is not our problem.
> > (And if even the hypervisor is not able to collect enough entropy to
> > securely generate random numbers, worrying about RNG reseeding in the
> > guest would be kinda pointless, we'd be fairly screwed anyway.)
>
> Sorry if I sound annoying, but it's a matter of terminology and needs.
>
> Cryptograhically random means safe for use with cryptography in that it
> is unguessable enough so that you can use it for encryption keys that
> nobody will be able to guess. It in no ways guarantees uniqueness, just
> like you don't really care if the symmetric crypto key of you VPN has
> already been used once somewhere else as long as there's no way to know.
> However with the good enough distribution that a CSPRNG provides,
> collisions within a *same* generator are bound to a very low, predictable
> rate which is by generally considered as acceptable for all use cases.

Yes.

> Something random (cryptographically or not) *cannot* be unique by
> definition, otherwise it's not random anymore, since each draw has an
> influence on the remaining list of possible draws, which is contrary to
> randomness. And conversely something unique cannot be completely random
> because if you know it's unique, you can already rule out all other known
> values from the candidates, thus it's more predictable than random.

Yes.

> With this in mind, picking randoms from a same RNG is often highly
> sufficient to consider they're highly likely unique within a long
> period. But it's not a guarantee. And it's even less one between two
> RNGs (e.g. if uniqueness is required between multiple hypervisors in
> case VMs are migrated or centrally managed, which I don't know).
>
> If what is sought here is a strong guarantee of uniqueness, using a
> counter as you first suggested is better.

My suggestion is to use a counter *in the UAPI*, not in the hypervisor
protocol. (And as long as that counter can only miss increments in a
cryptographically negligible fraction of cases, everything's fine.)

> If what is sought is pure
> randomness (in the sense that it's unpredictable, which I don't think
> is needed here), then randoms are better.

And this is what *the hypervisor protocol* gives us (which could be
very useful for reseeding the kernel RNG).

> If both are required, just
> concatenate a counter and a random. And if you need them to be spatially
> unique, just include a node identifier.
>
> Now the initial needs in the forwarded message are not entirely clear
> to me but I wanted to rule out the apparent mismatch between the expressed
> needs for uniqueness and the proposed solutions solely based on randomness.

Sure, from a theoretical standpoint, it would be a little bit nicer if
the hypervisor protocol included a generation number along with the
128-bit random value. But AFAIU it doesn't, so if we want this to just
work under Microsoft's existing hypervisor, we'll have to make do with
checking whether the random value changed. :P
Willy Tarreau Oct. 17, 2020, 7:17 a.m. UTC | #13
On Sat, Oct 17, 2020 at 08:55:34AM +0200, Jann Horn wrote:
> My suggestion is to use a counter *in the UAPI*, not in the hypervisor
> protocol. (And as long as that counter can only miss increments in a
> cryptographically negligible fraction of cases, everything's fine.)

OK I got it now and I agree.

> > If what is sought is pure
> > randomness (in the sense that it's unpredictable, which I don't think
> > is needed here), then randoms are better.
> 
> And this is what *the hypervisor protocol* gives us (which could be
> very useful for reseeding the kernel RNG).

As an external source, yes very likely, as long as it's not trivially
observable by everyone under the same hypervisor :-)

> > Now the initial needs in the forwarded message are not entirely clear
> > to me but I wanted to rule out the apparent mismatch between the expressed
> > needs for uniqueness and the proposed solutions solely based on randomness.
> 
> Sure, from a theoretical standpoint, it would be a little bit nicer if
> the hypervisor protocol included a generation number along with the
> 128-bit random value. But AFAIU it doesn't, so if we want this to just
> work under Microsoft's existing hypervisor, we'll have to make do with
> checking whether the random value changed. :P

OK got it, thanks for the explanation!

Willy
Jason A. Donenfeld Oct. 17, 2020, 1:24 p.m. UTC | #14
After discussing this offline with Jann a bit, I have a few general
comments on the design of this.

First, the UUID communicated by the hypervisor should be consumed by
the kernel -- added as another input to the rng -- and then userspace
should be notified that it should reseed any userspace RNGs that it
may have, without actually communicating that UUID to userspace. IOW,
I agree with Jann there. Then, it's the functioning of this
notification mechanism to userspace that is interesting to me.

There are a few design goals of notifying userspace: it should be
fast, because people who are using userspace RNGs are usually doing so
in the first place to completely avoid syscall overhead for whatever
high performance application they have - e.g. I recall conversations
with Colm about his TLS implementation needing to make random IVs
_really_ fast. It should also happen as early as possible, with no
race or as minimal as possible race window, so that userspace doesn't
begin using old randomness and then switch over after the damage is
already done.

I'm also not wedded to using Microsoft's proprietary hypervisor design
for this. If we come up with a better interface, I don't think it's
asking too much to implement that and reasonably expect for Microsoft
to catch up. Maybe someone here will find that controversial, but
whatever -- discussing ideal designs does not seem out of place or
inappropriate for how we usually approach things in the kernel, and a
closed source hypervisor coming along shouldn't disrupt that.

So, anyway, here are a few options with some pros and cons for the
kernel notifying userspace that its RNG should reseed.

1. SIGRND - a new signal. Lol.

2. Userspace opens a file descriptor that it can epoll on. Pros are
that many notification mechanisms already use this. Cons is that this
requires syscall and might be more racy than we want. Another con is
that this a new thing for userspace programs to do.

3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are
that this is extremely fast, and also simple to use and implement.
There are enough sequence points in typical crypto programs that
checking to see whether this counter has changed before doing whatever
operation seems easy enough. Cons are that typically we've been
conservative about adding things to the vDSO, and this is also a new
thing for userspace programs to do.

4. We already have a mechanism for this kind of thing, because the
same issue comes up when fork()ing. The solution was MADV_WIPEONFORK,
where userspace marks a page to be zeroed when forking, for the
purposes of the RNG being notified when its world gets split in two.
This is basically the same thing as we're discussing here with guest
snapshots, except it's on the system level rather than the process
level, and a system has many processes. But the problem space is still
almost the same, and we could simply reuse that same mechanism. There
are a few implementation strategies for that:

4a. We mess with the PTEs of all processes' pages that are
MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us
to do so. Then we wind up reusing the already existing logic for
userspace RNGs. Cons might be that this usually requires semaphores,
and we're in irq context, so we'd have to hoist to a workqueue, which
means either more wake up latency, or a larger race window.

4b. We just memzero all processes' pages that are MADV_WIPEONFORK,
when the hypervisor notifies us to do so. Then we wind up reusing the
already existing logic for userspace RNGs.

4c. The guest kernel maintains an array of physical addresses that are
MADV_WIPEONFORK. The hypervisor knows about this array and its
location through whatever protocol, and before resuming a
moved/snapshotted/duplicated VM, it takes the responsibility for
memzeroing this memory. The huge pro here would be that this
eliminates all races, and reduces complexity quite a bit, because the
hypervisor can perfectly synchronize its bringup (and SMP bringup)
with this, and it can even optimize things like on-disk memory
snapshots to simply not write out those pages to disk.

A 4c-like approach seems like it'd be a lot of bang for the buck -- we
reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
userspace API to deal with, and it'd be race free, and eliminate a lot
of kernel complexity.

But 4b and 3 don't seem too bad either.

Any thoughts on 4c? Is that utterly insane, or does that actually get
us somewhere close to what we want?

Jason
Catangiu, Adrian Costin Oct. 17, 2020, 6:06 p.m. UTC | #15
After discussing this offline with Jann a bit, I have a few general
    comments on the design of this. 
    First, the UUID communicated by the hypervisor should be consumed by
    the kernel -- added as another input to the rng -- and then userspace
    should be notified that it should reseed any userspace RNGs that it
    may have, without actually communicating that UUID to userspace. IOW,
    I agree with Jann there. Then, it's the functioning of this
    notification mechanism to userspace that is interesting to me.

Agreed! The UUID/vmgenid is the glue to the hypervisor to be able to find 
out about VM snapshots/forks. The really interesting (and important) topic
here is finding the right notification mechanism to userspace.

...In retrospect, I should have posted this as RFC instead of PATCH.
    
    So, anyway, here are a few options with some pros and cons for the
    kernel notifying userspace that its RNG should reseed.
    1. SIGRND - a new signal. Lol.
    2. Userspace opens a file descriptor that it can epoll on. Pros are
    that many notification mechanisms already use this. Cons is that this
    requires syscall and might be more racy than we want. Another con is
    that this a new thing for userspace programs to do.
    3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are
    that this is extremely fast, and also simple to use and implement.
    There are enough sequence points in typical crypto programs that
    checking to see whether this counter has changed before doing whatever
    operation seems easy enough. Cons are that typically we've been
    conservative about adding things to the vDSO, and this is also a new
    thing for userspace programs to do.

For each 1, 2, and 3 options, userspace programs _have to do smth new_
anyway, so I wouldn't weigh that as a con.

An atomic counter in the vDSO looks like the most bang for the buck to me.
I'm really curious to hear more opinions on why we shouldn't do it.
    
    4. We already have a mechanism for this kind of thing, because the
    same issue comes up when fork()ing. The solution was MADV_WIPEONFORK,
    where userspace marks a page to be zeroed when forking, for the
    purposes of the RNG being notified when its world gets split in two.
    This is basically the same thing as we're discussing here with guest
    snapshots, except it's on the system level rather than the process
    level, and a system has many processes. But the problem space is still
    almost the same, and we could simply reuse that same mechanism. There
    are a few implementation strategies for that:

I don't think we can piggy back on MADV_WIPEONFORK. That madvise flag
has a clear contract of only wiping _on fork_. Overloading it with wiping
on VM-fork - while process doesn't fork - might break some of its users.
    
    4a. We mess with the PTEs of all processes' pages that are
    MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us
    to do so. Then we wind up reusing the already existing logic for
    userspace RNGs. Cons might be that this usually requires semaphores,
    and we're in irq context, so we'd have to hoist to a workqueue, which
    means either more wake up latency, or a larger race window.
    4b. We just memzero all processes' pages that are MADV_WIPEONFORK,
    when the hypervisor notifies us to do so. Then we wind up reusing the
    already existing logic for userspace RNGs.
    4c. The guest kernel maintains an array of physical addresses that are
    MADV_WIPEONFORK. The hypervisor knows about this array and its
    location through whatever protocol, and before resuming a
    moved/snapshotted/duplicated VM, it takes the responsibility for
    memzeroing this memory. The huge pro here would be that this
    eliminates all races, and reduces complexity quite a bit, because the
    hypervisor can perfectly synchronize its bringup (and SMP bringup)
    with this, and it can even optimize things like on-disk memory
    snapshots to simply not write out those pages to disk.

I've previously proposed a path similar (in concept at least) with a combination
of 4 a,b and c - https://lwn.net/ml/linux-mm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/
without reusing MADV_WIPEONFORK, but by adding a dedicated
MADV_WIPEONSUSPEND.

That proposal was clunky however with many people raising concerns around
how the interface is too subtle and hard to work with.

A vmgenid driver offering a clean FS interface seemed cleaner, although, like
some of you observed, it still allows a window of time between actual VM fork
and userspace handling of the event.

One other direction that I would like to explore and I feel it’s similar to your 4c
proposal is to do smth like:
"mm: extend memfd with ability to create 'secret' memory"
https://patchwork.kernel.org/project/linux-mm/patch/20200130162340.GA14232@rapoport-lnx/

Maybe we can combine ideas from the two patches in smth like: instead of libs
using anon mem with MADV_WIPEONSUSPEND, they can use a secret_mem_fd
with a (secretmem specific) WIPEONSUSPEND flag; then instead of crawling
PTEs in the core PM code looking for MADV_WIPEONSUSPEND mappings,
we can register this secretmem device to wipe its own secrets during a pm callback.

From a crypto safety pov, the ideal solution would be one where wiping happens
before or during VM forks, not after.
Having said that, I think a vDSO (updated by the guest kernel _after_ VM fork) still
closes that gap enough to be practically safe.

Thanks,
Adrian.




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Alexander Graf Oct. 17, 2020, 6:09 p.m. UTC | #16
Hi Jason,

On 17.10.20 15:24, Jason A. Donenfeld wrote:
> 
> After discussing this offline with Jann a bit, I have a few general
> comments on the design of this.
> 
> First, the UUID communicated by the hypervisor should be consumed by
> the kernel -- added as another input to the rng -- and then userspace

We definitely want a kernel internal notifier as well, yes :).

> should be notified that it should reseed any userspace RNGs that it
> may have, without actually communicating that UUID to userspace. IOW,

I also tend to agree that it makes sense to disconnect the actual UUID 
we receive from the notification to user space. This would allow us to 
create a generic mechanism for VM save/restore cycles across different 
hypervisors. Let me add PPC and s390x people to the CC list to see 
whether they have anything remotely similar to the VmGenID mechanism. 
For x86 and aarch64, the ACPI and memory based VmGenID implemented here 
is the most obvious option to implement IMHO. It's also already 
implemented in all major hypervisors.

> I agree with Jann there. Then, it's the functioning of this
> notification mechanism to userspace that is interesting to me.

Absolutely! Please have a look at the previous discussion here:

 
https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/

The user space interface is absolutely what this is about.

> There are a few design goals of notifying userspace: it should be
> fast, because people who are using userspace RNGs are usually doing so
> in the first place to completely avoid syscall overhead for whatever
> high performance application they have - e.g. I recall conversations
> with Colm about his TLS implementation needing to make random IVs
> _really_ fast. It should also happen as early as possible, with no
> race or as minimal as possible race window, so that userspace doesn't
> begin using old randomness and then switch over after the damage is
> already done.

There are multiple facets and different types of consumers here. For a 
user space RNG, I agree that fast and as race free as possible is key. 
That's what the mmap interface is there for.

There are applications way beyond that though. What do you do with 
applications that already consumed randomness? For example a cached pool 
of SSL keys. Or a higher level language primitive that consumes 
randomness and caches its seed somewhere in an internal data structure. 
Or even worse: your system's host ssh key.

For those types of events, an mmap (or vDSO) interface does not work. We 
need to actively allow user space applications to readjust to the new 
environment - either internally (the language primitive case) or through 
a system event, maybe even as systemd trigger (the ssh host key case).

To give everyone enough time before we consider a system as "updated to 
the new environment", we have the callback logic with the "Orchestrator" 
that can check whether all listeners to system wide updates confirms 
they adjusted themselves.

That's what the rest of the logic is there for: A read+poll interface 
and all of the orchestration logic. It's not for the user space RNG 
case, it's for all of its downstream users.

> I'm also not wedded to using Microsoft's proprietary hypervisor design
> for this. If we come up with a better interface, I don't think it's
> asking too much to implement that and reasonably expect for Microsoft
> to catch up. Maybe someone here will find that controversial, but
> whatever -- discussing ideal designs does not seem out of place or
> inappropriate for how we usually approach things in the kernel, and a
> closed source hypervisor coming along shouldn't disrupt that.

The main bonus point on this interface is that Hyper-V, VMware and QEMU 
implement it already. It would be a very natural for into the ecosystem. 
I agree though that we shouldn't have our user space interface 
necessarily dictated by it: Other hypervisors may implement different 
ways such as a simple edge IRQ that gets triggered whenever the VM gets 
resumed.

> So, anyway, here are a few options with some pros and cons for the
> kernel notifying userspace that its RNG should reseed.

I can only stress again that we should not be laser focused on the RNG 
case. In a lot of cases, data has already been generated by the RNG 
before the snapshot and needs to be reinitialized after the snapshot. In 
other cases such as system UUIDs, it's completely orthogonal to the RNG.

> 
> 1. SIGRND - a new signal. Lol.

Doable, but a lot of plumbing in user space. It's also not necessarily a 
good for for event notification in most user space applications.

> 
> 2. Userspace opens a file descriptor that it can epoll on. Pros are
> that many notification mechanisms already use this. Cons is that this
> requires syscall and might be more racy than we want. Another con is
> that this a new thing for userspace programs to do.

That's part of what this patch does, right? This patch implements 
read+poll as well as mmap() for high speed reads.

> 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are
> that this is extremely fast, and also simple to use and implement.
> There are enough sequence points in typical crypto programs that
> checking to see whether this counter has changed before doing whatever
> operation seems easy enough. Cons are that typically we've been
> conservative about adding things to the vDSO, and this is also a new
> thing for userspace programs to do.

The big con is that its use is going to be super limited to applications 
that can be adapted to check their "vm generation" through a vDSO call / 
read every time they consume data that may potentially need to be 
regenerated.

This probably works for the pure RNG case. It falls apart for more 
sophisticated things such as "redo my ssh host keys and restart the 
service" or "regenerate my samba machine uuid".

> 4. We already have a mechanism for this kind of thing, because the
> same issue comes up when fork()ing. The solution was MADV_WIPEONFORK,
> where userspace marks a page to be zeroed when forking, for the
> purposes of the RNG being notified when its world gets split in two.
> This is basically the same thing as we're discussing here with guest
> snapshots, except it's on the system level rather than the process
> level, and a system has many processes. But the problem space is still
> almost the same, and we could simply reuse that same mechanism. There
> are a few implementation strategies for that:

Yup, that's where we started from :). And then we ran into resistance by 
the mm people (on CC here). And then we looked at the problem more in 
depth and checked what it would take to for example implement this for 
user space RNGs in Java. It's ... more complicated than one may think at 
first.

> 4a. We mess with the PTEs of all processes' pages that are
> MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us
> to do so. Then we wind up reusing the already existing logic for
> userspace RNGs. Cons might be that this usually requires semaphores,
> and we're in irq context, so we'd have to hoist to a workqueue, which
> means either more wake up latency, or a larger race window.
> 
> 4b. We just memzero all processes' pages that are MADV_WIPEONFORK,
> when the hypervisor notifies us to do so. Then we wind up reusing the
> already existing logic for userspace RNGs.
> 
> 4c. The guest kernel maintains an array of physical addresses that are
> MADV_WIPEONFORK. The hypervisor knows about this array and its
> location through whatever protocol, and before resuming a
> moved/snapshotted/duplicated VM, it takes the responsibility for
> memzeroing this memory. The huge pro here would be that this
> eliminates all races, and reduces complexity quite a bit, because the
> hypervisor can perfectly synchronize its bringup (and SMP bringup)
> with this, and it can even optimize things like on-disk memory
> snapshots to simply not write out those pages to disk.
> 
> A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> userspace API to deal with, and it'd be race free, and eliminate a lot
> of kernel complexity.
> 
> But 4b and 3 don't seem too bad either.
> 
> Any thoughts on 4c? Is that utterly insane, or does that actually get
> us somewhere close to what we want?

All of the options for "4" are possible and have an RFC out. Please 
check out the discussion linked above :).

The problem with anything that relies on close loop reads (options 3+4) 
is not going to work well with the more sophisticated use case of 
derived data.

IMHO it will boil down to "both". We will need a high-speed interface 
that with close-to-0 overhead tells you either the generation ID or 
clears pages (options 3+4) as well as something that is bigger for 
applications that can either intrinsically (sshd) or by system design 
(Java) not adopt the mechanisms above easily.

That said, we need to start somewhere. I don't mind which angle we start 
from. But this is a real world problem and one that will only become 
more prevalent over time as VMs are used for more than only your 
traditional enterprise hardware consolidation.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Andy Lutomirski Oct. 17, 2020, 6:10 p.m. UTC | #17
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote:
>
> [adding some more people who are interested in RNG stuff: Andy, Jason,
> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> concerns some pretty fundamental API stuff related to RNG usage]
>
> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
> <acatan@amazon.com> wrote:
> > - Background
> >
> > The VM Generation ID is a feature defined by Microsoft (paper:
> > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> > multiple hypervisor vendors.
> >
> > The feature is required in virtualized environments by apps that work
> > with local copies/caches of world-unique data such as random values,
> > uuids, monotonically increasing counters, etc.
> > Such apps can be negatively affected by VM snapshotting when the VM
> > is either cloned or returned to an earlier point in time.
> >
> > The VM Generation ID is a simple concept meant to alleviate the issue
> > by providing a unique ID that changes each time the VM is restored
> > from a snapshot. The hw provided UUID value can be used to
> > differentiate between VMs or different generations of the same VM.
> >
> > - Problem
> >
> > The VM Generation ID is exposed through an ACPI device by multiple
> > hypervisor vendors but neither the vendors or upstream Linux have no
> > default driver for it leaving users to fend for themselves.
> >
> > Furthermore, simply finding out about a VM generation change is only
> > the starting point of a process to renew internal states of possibly
> > multiple applications across the system. This process could benefit
> > from a driver that provides an interface through which orchestration
> > can be easily done.
> >
> > - Solution
> >
> > This patch is a driver which exposes the Virtual Machine Generation ID
> > via a char-dev FS interface that provides ID update sync and async
> > notification, retrieval and confirmation mechanisms:
> >
> > When the device is 'open()'ed a copy of the current vm UUID is
> > associated with the file handle. 'read()' operations block until the
> > associated UUID is no longer up to date - until HW vm gen id changes -
> > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> >
> > 'poll()' is implemented to allow polling for UUID updates. Such
> > updates result in 'EPOLLIN' events.
> >
> > Subsequent read()s following a UUID update no longer block, but return
> > the updated UUID. The application needs to acknowledge the UUID update
> > by confirming it through a 'write()'.
> > Only on writing back to the driver the right/latest UUID, will the
> > driver mark this "watcher" as up to date and remove EPOLLIN status.
> >
> > 'mmap()' support allows mapping a single read-only shared page which
> > will always contain the latest UUID value at offset 0.
>
> It would be nicer if that page just contained an incrementing counter,
> instead of a UUID. It's not like the application cares *what* the UUID
> changed to, just that it *did* change and all RNGs state now needs to
> be reseeded from the kernel, right? And an application can't reliably
> read the entire UUID from the memory mapping anyway, because the VM
> might be forked in the middle.
>
> So I think your kernel driver should detect UUID changes and then turn
> those into a monotonically incrementing counter. (Probably 64 bits
> wide?) (That's probably also a little bit faster than comparing an
> entire UUID.)
>
> An option might be to put that counter into the vDSO, instead of a
> separate VMA; but I don't know how the other folks feel about that.
> Andy, do you have opinions on this? That way, normal userspace code
> that uses this infrastructure wouldn't have to mess around with a
> special device at all. And it'd be usable in seccomp sandboxes and so
> on without needing special plumbing. And libraries wouldn't have to
> call open() and mess with file descriptor numbers.

The vDSO might be annoyingly slow for this.  Something like the rseq
page might make sense.  It could be a generic indication of "system
went through some form of suspend".
Jann Horn Oct. 18, 2020, 2:08 a.m. UTC | #18
On Sat, Oct 17, 2020 at 8:09 PM Alexander Graf <graf@amazon.de> wrote:
> There are applications way beyond that though. What do you do with
> applications that already consumed randomness? For example a cached pool
> of SSL keys. Or a higher level language primitive that consumes
> randomness and caches its seed somewhere in an internal data structure.

For deterministic protection, those would also have to poll some
memory location that tells them whether the VmGenID changed:

1. between reading entropy from their RNG pool and using it
2. between collecting data from external sources (user input, clock,
...) and encrypting it

and synchronously shoot down the connection if a change happened. If
e.g. an application inside the VM has an AES-GCM-encrypted TLS
connection and, directly after the VM is restored, triggers an
application-level timeout that sends some fixed message across the
connection, then the TLS library must guarantee that either the VM was
already committed to sending exactly that message before the VM was
forked or the message will be blocked. If we don't do that, an
attacker who captures both a single packet from the forked VM and
traffic from the old VM can decrypt the next message from the old VM
after the fork (because AES-GCM is like AES-CTR plus an authenticator,
and CTR means that when keystream reuse occurs and one of the
plaintexts is known, the attacker can simply recover the other
plaintext using XOR).

(Or maybe, in disaster failover environments, TLS 1.3 servers could
get away with rekeying the connection instead of shooting it down? Ask
your resident friendly cryptographer whether that would be secure, I
am not one.)

I don't think a mechanism based around asynchronously telling the
application and waiting for it to confirm the rotation at a later
point is going to cut it; we should have some hard semantics on when
an application needs to poll this value.

> Or even worse: your system's host ssh key.

Mmmh... I think I normally would not want a VM to reset its host ssh
key after merely restoring a snapshot though? And more importantly,
Microsoft's docs say that they also change the VmGenID on disaster
failover. I think you very much wouldn't want your server to lose its
host key every time disaster failover happens. On the other hand,
after importing a public VM image, it might be a good idea.

I guess you could push that responsibility on the user, by adding an
option to the sshd_config that tells OpenSSH whether the host key
should be rotated on an ID change or not... but that still would not
be particularly pretty.

Ideally we would have the host tell us what type of events happened to
the VM, or something like that... or maybe even get the host VM
management software to ask the user whether they're importing a public
image... I really feel like with Microsoft's current protocol, we
don't get enough information to figure out what we should do about
private long-term authentication keys.
Colm MacCarthaigh Oct. 18, 2020, 3:14 a.m. UTC | #19
On 17 Oct 2020, at 6:24, Jason A. Donenfeld wrote:

> There are a few design goals of notifying userspace: it should be
> fast, because people who are using userspace RNGs are usually doing so
> in the first place to completely avoid syscall overhead for whatever
> high performance application they have - e.g. I recall conversations
> with Colm about his TLS implementation needing to make random IVs
> _really_ fast.

That’s our old friend TLS1.1 in CBC mode, which needs a random 
explicit IV for every record sent. Speed is still a reason at the 
margins in cases like that, but getrandom() is really fast. A stickier 
problem is that getrandom() is not certified for use with every 
compliance standard, and those often dictate precise use of some NIST 
DRBG or NRBG construction. That keeps people proliferating user-space 
RNGs even when speed isn’t as important.

> It should also happen as early as possible, with no
> race or as minimal as possible race window, so that userspace doesn't
> begin using old randomness and then switch over after the damage is
> already done.

+1 to this, and I’d add that anyone making VM snapshots that they plan 
to restore from multiple times really needs to think this through top to 
bottom. The system would likely need to be put in to some kind of 
quiescent state when the snapshot is taken.

> So, anyway, here are a few options with some pros and cons for the
> kernel notifying userspace that its RNG should reseed.
>
> 1. SIGRND - a new signal. Lol.
>
> 2. Userspace opens a file descriptor that it can epoll on. Pros are
> that many notification mechanisms already use this. Cons is that this
> requires syscall and might be more racy than we want. Another con is
> that this a new thing for userspace programs to do.

A library like OpenSSL or BoringSSL also has to account for running 
inside a chroot, which also makes this hard.

> Any thoughts on 4c? Is that utterly insane, or does that actually get
> us somewhere close to what we want?

I still like 4c, and as a user-space crypto-person, and a VM person, 
they have a lot of appeal. Alex and Adrian’s replies get into some of 
the sufficiency challenge. But for user-space libraries like the *SSLs, 
the JVMs, and other runtimes where RNGs show up, it could plug in easily 
enough.

-
Colm
Michael S. Tsirkin Oct. 18, 2020, 3:52 p.m. UTC | #20
On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> 4c. The guest kernel maintains an array of physical addresses that are
> MADV_WIPEONFORK. The hypervisor knows about this array and its
> location through whatever protocol, and before resuming a
> moved/snapshotted/duplicated VM, it takes the responsibility for
> memzeroing this memory. The huge pro here would be that this
> eliminates all races, and reduces complexity quite a bit, because the
> hypervisor can perfectly synchronize its bringup (and SMP bringup)
> with this, and it can even optimize things like on-disk memory
> snapshots to simply not write out those pages to disk.
> 
> A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> userspace API to deal with, and it'd be race free, and eliminate a lot
> of kernel complexity.

Clearly this has a chance to break applications, right?
If there's an app that uses this as a non-system-calls way
to find out whether there was a fork, it will break
when wipe triggers without a fork ...
For example, imagine:

MADV_WIPEONFORK
copy secret data to MADV_DONTFORK
fork


used to work, with this change it gets 0s instead of the secret data.


I am also not sure it's wise to expose each guest process
to the hypervisor like this. E.g. each process needs a
guest physical address of its own then. This is a finite resource.


The mmap interface proposed here is somewhat baroque, but it is
certainly simple to implement ...
Andy Lutomirski Oct. 18, 2020, 3:54 p.m. UTC | #21
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > 4c. The guest kernel maintains an array of physical addresses that are
> > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > location through whatever protocol, and before resuming a
> > moved/snapshotted/duplicated VM, it takes the responsibility for
> > memzeroing this memory. The huge pro here would be that this
> > eliminates all races, and reduces complexity quite a bit, because the
> > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > with this, and it can even optimize things like on-disk memory
> > snapshots to simply not write out those pages to disk.
> >
> > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > userspace API to deal with, and it'd be race free, and eliminate a lot
> > of kernel complexity.
>
> Clearly this has a chance to break applications, right?
> If there's an app that uses this as a non-system-calls way
> to find out whether there was a fork, it will break
> when wipe triggers without a fork ...
> For example, imagine:
>
> MADV_WIPEONFORK
> copy secret data to MADV_DONTFORK
> fork
>
>
> used to work, with this change it gets 0s instead of the secret data.
>
>
> I am also not sure it's wise to expose each guest process
> to the hypervisor like this. E.g. each process needs a
> guest physical address of its own then. This is a finite resource.
>
>
> The mmap interface proposed here is somewhat baroque, but it is
> certainly simple to implement ...

Wipe of fork/vmgenid/whatever could end up being much more problematic
than it naively appears -- it could be wiped in the middle of a read.
Either the API needs to handle this cleanly, or we need something more
aggressive like signal-on-fork.

--Andy
Michael S. Tsirkin Oct. 18, 2020, 3:59 p.m. UTC | #22
On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > 4c. The guest kernel maintains an array of physical addresses that are
> > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > location through whatever protocol, and before resuming a
> > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > memzeroing this memory. The huge pro here would be that this
> > > eliminates all races, and reduces complexity quite a bit, because the
> > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > with this, and it can even optimize things like on-disk memory
> > > snapshots to simply not write out those pages to disk.
> > >
> > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > of kernel complexity.
> >
> > Clearly this has a chance to break applications, right?
> > If there's an app that uses this as a non-system-calls way
> > to find out whether there was a fork, it will break
> > when wipe triggers without a fork ...
> > For example, imagine:
> >
> > MADV_WIPEONFORK
> > copy secret data to MADV_DONTFORK
> > fork
> >
> >
> > used to work, with this change it gets 0s instead of the secret data.
> >
> >
> > I am also not sure it's wise to expose each guest process
> > to the hypervisor like this. E.g. each process needs a
> > guest physical address of its own then. This is a finite resource.
> >
> >
> > The mmap interface proposed here is somewhat baroque, but it is
> > certainly simple to implement ...
> 
> Wipe of fork/vmgenid/whatever could end up being much more problematic
> than it naively appears -- it could be wiped in the middle of a read.
> Either the API needs to handle this cleanly, or we need something more
> aggressive like signal-on-fork.
> 
> --Andy


Right, it's not on fork, it's actually when process is snapshotted.

If we assume it's CRIU we care about, then I
wonder what's wrong with something like
MADV_CHANGEONPTRACE_SEIZE
and basically say it's X bytes which change the value...
Andy Lutomirski Oct. 18, 2020, 4:14 p.m. UTC | #23
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > location through whatever protocol, and before resuming a
> > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > memzeroing this memory. The huge pro here would be that this
> > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > with this, and it can even optimize things like on-disk memory
> > > > snapshots to simply not write out those pages to disk.
> > > >
> > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > of kernel complexity.
> > >
> > > Clearly this has a chance to break applications, right?
> > > If there's an app that uses this as a non-system-calls way
> > > to find out whether there was a fork, it will break
> > > when wipe triggers without a fork ...
> > > For example, imagine:
> > >
> > > MADV_WIPEONFORK
> > > copy secret data to MADV_DONTFORK
> > > fork
> > >
> > >
> > > used to work, with this change it gets 0s instead of the secret data.
> > >
> > >
> > > I am also not sure it's wise to expose each guest process
> > > to the hypervisor like this. E.g. each process needs a
> > > guest physical address of its own then. This is a finite resource.
> > >
> > >
> > > The mmap interface proposed here is somewhat baroque, but it is
> > > certainly simple to implement ...
> >
> > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > than it naively appears -- it could be wiped in the middle of a read.
> > Either the API needs to handle this cleanly, or we need something more
> > aggressive like signal-on-fork.
> >
> > --Andy
>
>
> Right, it's not on fork, it's actually when process is snapshotted.
>
> If we assume it's CRIU we care about, then I
> wonder what's wrong with something like
> MADV_CHANGEONPTRACE_SEIZE
> and basically say it's X bytes which change the value...

I feel like we may be approaching this from the wrong end.  Rather
than saying "what data structure can the kernel expose that might
plausibly be useful", how about we try identifying some specific
userspace needs and see what a good solution could look like.  I can
identify two major cryptographic use cases:

1. A userspace RNG.  The API exposed by the userspace end is a
function that generates random numbers.  The userspace code in turn
wants to know some things from the kernel: it wants some
best-quality-available random seed data from the kernel (and possibly
an indication of how good it is) as well as an indication of whether
the userspace memory may have been cloned or rolled back, or, failing
that, an indication of whether a reseed is needed.  Userspace could
implement a wide variety of algorithms on top depending on its goals
and compliance requirements, but the end goal is for the userspace
part to be very, very fast.

2. A userspace crypto stack that wants to avoid shooting itself in the
foot due to inadvertently doing the same thing twice.  For example, an
AES-GCM stack does not want to reuse an IV, *expecially* if there is
even the slightest chance that it might reuse the IV for different
data.  This use case doesn't necessarily involve random numbers, but,
if anything, it needs to be even faster than #1.

The threats here are not really the same.  For #1, a userspace RNG
should be able to recover from a scenario in which an adversary clones
the entire process *and gets to own the clone*.  For example, in
Android, an adversary can often gain complete control of a fork of the
zygote -- this shouldn't adversely affect the security properties of
other forks.  Similarly, a server farm could operate by having one
booted server that is cloned to create more workers.  Those clones
could be provisioned with secrets and permissions post-clone, and at
attacker gaining control of a fresh clone could be considered
acceptable.  For #2, in contrast, if an adversary gains control of a
clone of an AES-GCM session, they learn the key outright -- the
relevant attack scenario is that the adversary gets to interact with
two clones without compromising either clone per se.

It's worth noting that, in both cases, there could possibly be more
than one instance of an RNG or an AES-GCM session in the same process.
This means that using signals is awkward but not necessarily
impossibly.  (This is an area in which Linux, and POSIX in general, is
much weaker than Windows.)
Michael S. Tsirkin Oct. 19, 2020, 3 p.m. UTC | #24
On Sun, Oct 18, 2020 at 09:14:00AM -0700, Andy Lutomirski wrote:
> On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > > location through whatever protocol, and before resuming a
> > > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > > memzeroing this memory. The huge pro here would be that this
> > > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > > with this, and it can even optimize things like on-disk memory
> > > > > snapshots to simply not write out those pages to disk.
> > > > >
> > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > > of kernel complexity.
> > > >
> > > > Clearly this has a chance to break applications, right?
> > > > If there's an app that uses this as a non-system-calls way
> > > > to find out whether there was a fork, it will break
> > > > when wipe triggers without a fork ...
> > > > For example, imagine:
> > > >
> > > > MADV_WIPEONFORK
> > > > copy secret data to MADV_DONTFORK
> > > > fork
> > > >
> > > >
> > > > used to work, with this change it gets 0s instead of the secret data.
> > > >
> > > >
> > > > I am also not sure it's wise to expose each guest process
> > > > to the hypervisor like this. E.g. each process needs a
> > > > guest physical address of its own then. This is a finite resource.
> > > >
> > > >
> > > > The mmap interface proposed here is somewhat baroque, but it is
> > > > certainly simple to implement ...
> > >
> > > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > > than it naively appears -- it could be wiped in the middle of a read.
> > > Either the API needs to handle this cleanly, or we need something more
> > > aggressive like signal-on-fork.
> > >
> > > --Andy
> >
> >
> > Right, it's not on fork, it's actually when process is snapshotted.
> >
> > If we assume it's CRIU we care about, then I
> > wonder what's wrong with something like
> > MADV_CHANGEONPTRACE_SEIZE
> > and basically say it's X bytes which change the value...
> 
> I feel like we may be approaching this from the wrong end.  Rather
> than saying "what data structure can the kernel expose that might
> plausibly be useful", how about we try identifying some specific
> userspace needs and see what a good solution could look like.  I can
> identify two major cryptographic use cases:

Well, I'm aware of a non-cryptographic use-case:
https://bugzilla.redhat.com/show_bug.cgi?id=1118834

this seems to just ask for the guest to have a way to detect that
a VM cloning triggered.
Mathieu Desnoyers Oct. 19, 2020, 5:15 p.m. UTC | #25
----- On Oct 17, 2020, at 2:10 PM, Andy Lutomirski luto@kernel.org wrote:

> On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote:
>>
>> [adding some more people who are interested in RNG stuff: Andy, Jason,
>> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
>> concerns some pretty fundamental API stuff related to RNG usage]
>>
>> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
>> <acatan@amazon.com> wrote:
>> > - Background
>> >
>> > The VM Generation ID is a feature defined by Microsoft (paper:
>> > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>> > multiple hypervisor vendors.
>> >
>> > The feature is required in virtualized environments by apps that work
>> > with local copies/caches of world-unique data such as random values,
>> > uuids, monotonically increasing counters, etc.
>> > Such apps can be negatively affected by VM snapshotting when the VM
>> > is either cloned or returned to an earlier point in time.
>> >
>> > The VM Generation ID is a simple concept meant to alleviate the issue
>> > by providing a unique ID that changes each time the VM is restored
>> > from a snapshot. The hw provided UUID value can be used to
>> > differentiate between VMs or different generations of the same VM.
>> >
>> > - Problem
>> >
>> > The VM Generation ID is exposed through an ACPI device by multiple
>> > hypervisor vendors but neither the vendors or upstream Linux have no
>> > default driver for it leaving users to fend for themselves.
>> >
>> > Furthermore, simply finding out about a VM generation change is only
>> > the starting point of a process to renew internal states of possibly
>> > multiple applications across the system. This process could benefit
>> > from a driver that provides an interface through which orchestration
>> > can be easily done.
>> >
>> > - Solution
>> >
>> > This patch is a driver which exposes the Virtual Machine Generation ID
>> > via a char-dev FS interface that provides ID update sync and async
>> > notification, retrieval and confirmation mechanisms:
>> >
>> > When the device is 'open()'ed a copy of the current vm UUID is
>> > associated with the file handle. 'read()' operations block until the
>> > associated UUID is no longer up to date - until HW vm gen id changes -
>> > at which point the new UUID is provided/returned. Nonblocking 'read()'
>> > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
>> >
>> > 'poll()' is implemented to allow polling for UUID updates. Such
>> > updates result in 'EPOLLIN' events.
>> >
>> > Subsequent read()s following a UUID update no longer block, but return
>> > the updated UUID. The application needs to acknowledge the UUID update
>> > by confirming it through a 'write()'.
>> > Only on writing back to the driver the right/latest UUID, will the
>> > driver mark this "watcher" as up to date and remove EPOLLIN status.
>> >
>> > 'mmap()' support allows mapping a single read-only shared page which
>> > will always contain the latest UUID value at offset 0.
>>
>> It would be nicer if that page just contained an incrementing counter,
>> instead of a UUID. It's not like the application cares *what* the UUID
>> changed to, just that it *did* change and all RNGs state now needs to
>> be reseeded from the kernel, right? And an application can't reliably
>> read the entire UUID from the memory mapping anyway, because the VM
>> might be forked in the middle.
>>
>> So I think your kernel driver should detect UUID changes and then turn
>> those into a monotonically incrementing counter. (Probably 64 bits
>> wide?) (That's probably also a little bit faster than comparing an
>> entire UUID.)
>>
>> An option might be to put that counter into the vDSO, instead of a
>> separate VMA; but I don't know how the other folks feel about that.
>> Andy, do you have opinions on this? That way, normal userspace code
>> that uses this infrastructure wouldn't have to mess around with a
>> special device at all. And it'd be usable in seccomp sandboxes and so
>> on without needing special plumbing. And libraries wouldn't have to
>> call open() and mess with file descriptor numbers.
> 
> The vDSO might be annoyingly slow for this.  Something like the rseq
> page might make sense.  It could be a generic indication of "system
> went through some form of suspend".

This might indeed fit nicely as an extension of my KTLS prototype (extensible rseq):

https://lore.kernel.org/lkml/20200925181518.4141-1-mathieu.desnoyers@efficios.com/

There are a few ways we could wire things up. One might be to add the
UUID field into the extended KTLS structure (so it's always updated after it
changes on next return to user-space). For this I assume that the Linux scheduler
within the guest VM always preempts all threads before a VM is suspended (is that
indeed true ?).

This leads to one important question though: how is the UUID check vs commit operation
made atomic with respect to suspend ? Unless we use rseq critical sections in assembly,
where the kernel will abort the rseq critical section on preemption, I don't see how we
can ensure that the UUID value does not change right after it has been checked, before
the "commit" side-effect. And what is the expected "commit" side-effect ? Is it a store
to a variable in user-space memory, or is it issuing a system call which sends a packet over
the network ?

Thanks,

Mathieu
Christian Borntraeger Oct. 20, 2020, 9:35 a.m. UTC | #26
On 17.10.20 20:09, Alexander Graf wrote:
> Hi Jason,
> 
> On 17.10.20 15:24, Jason A. Donenfeld wrote:
>>
>> After discussing this offline with Jann a bit, I have a few general
>> comments on the design of this.
>>
>> First, the UUID communicated by the hypervisor should be consumed by
>> the kernel -- added as another input to the rng -- and then userspace
> 
> We definitely want a kernel internal notifier as well, yes :).
> 
>> should be notified that it should reseed any userspace RNGs that it
>> may have, without actually communicating that UUID to userspace. IOW,
> 
> I also tend to agree that it makes sense to disconnect the actual UUID we receive from the notification to user space. This would allow us to create a generic mechanism for VM save/restore cycles across different hypervisors. Let me add PPC and s390x people to the CC list to see whether they have anything remotely similar to the VmGenID mechanism. For x86 and aarch64, the ACPI and memory based VmGenID implemented here is the most obvious option to implement IMHO. It's also already implemented in all major hypervisors.

Hmm, what we do have configurations (e.g. stfle bits) and we do have a notification mechanism via sclp that notifies guests when things change.
As of today neither KVM nor Linux implement the sclp change notification mechanism, but I do see value in such a thing.

> 
>> I agree with Jann there. Then, it's the functioning of this
>> notification mechanism to userspace that is interesting to me.
> 
> Absolutely! Please have a look at the previous discussion here:
> 
> 
> https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/
> 
> The user space interface is absolutely what this is about.

Yes. Passing a notification to userspace is essential. Where I do not see a solution yet is the race between notification and
already running with the old knowledge.
> 
>> There are a few design goals of notifying userspace: it should be
>> fast, because people who are using userspace RNGs are usually doing so
>> in the first place to completely avoid syscall overhead for whatever
>> high performance application they have - e.g. I recall conversations
>> with Colm about his TLS implementation needing to make random IVs
>> _really_ fast. It should also happen as early as possible, with no
>> race or as minimal as possible race window, so that userspace doesn't
>> begin using old randomness and then switch over after the damage is
>> already done.
> 
> There are multiple facets and different types of consumers here. For a user space RNG, I agree that fast and as race free as possible is key. That's what the mmap interface is there for.
> 
> There are applications way beyond that though. What do you do with applications that already consumed randomness? For example a cached pool of SSL keys. Or a higher level language primitive that consumes randomness and caches its seed somewhere in an internal data structure. Or even worse: your system's host ssh key.
> 
> For those types of events, an mmap (or vDSO) interface does not work. We need to actively allow user space applications to readjust to the new environment - either internally (the language primitive case) or through a system event, maybe even as systemd trigger (the ssh host key case).
> 
> To give everyone enough time before we consider a system as "updated to the new environment", we have the callback logic with the "Orchestrator" that can check whether all listeners to system wide updates confirms they adjusted themselves.
> 
> That's what the rest of the logic is there for: A read+poll interface and all of the orchestration logic. It's not for the user space RNG case, it's for all of its downstream users.
> 
>> I'm also not wedded to using Microsoft's proprietary hypervisor design
>> for this. If we come up with a better interface, I don't think it's
>> asking too much to implement that and reasonably expect for Microsoft
>> to catch up. Maybe someone here will find that controversial, but
>> whatever -- discussing ideal designs does not seem out of place or
>> inappropriate for how we usually approach things in the kernel, and a
>> closed source hypervisor coming along shouldn't disrupt that.
> 
> The main bonus point on this interface is that Hyper-V, VMware and QEMU implement it already. It would be a very natural for into the ecosystem. I agree though that we shouldn't have our user space interface necessarily dictated by it: Other hypervisors may implement different ways such as a simple edge IRQ that gets triggered whenever the VM gets resumed.
> 
>> So, anyway, here are a few options with some pros and cons for the
>> kernel notifying userspace that its RNG should reseed.
> 
> I can only stress again that we should not be laser focused on the RNG case. In a lot of cases, data has already been generated by the RNG before the snapshot and needs to be reinitialized after the snapshot. In other cases such as system UUIDs, it's completely orthogonal to the RNG.
> 
>>
>> 1. SIGRND - a new signal. Lol.
> 
> Doable, but a lot of plumbing in user space. It's also not necessarily a good for for event notification in most user space applications.
> 
>>
>> 2. Userspace opens a file descriptor that it can epoll on. Pros are
>> that many notification mechanisms already use this. Cons is that this
>> requires syscall and might be more racy than we want. Another con is
>> that this a new thing for userspace programs to do.
> 
> That's part of what this patch does, right? This patch implements read+poll as well as mmap() for high speed reads.
> 
>> 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are
>> that this is extremely fast, and also simple to use and implement.
>> There are enough sequence points in typical crypto programs that
>> checking to see whether this counter has changed before doing whatever
>> operation seems easy enough. Cons are that typically we've been
>> conservative about adding things to the vDSO, and this is also a new
>> thing for userspace programs to do.
> 
> The big con is that its use is going to be super limited to applications that can be adapted to check their "vm generation" through a vDSO call / read every time they consume data that may potentially need to be regenerated.
> 
> This probably works for the pure RNG case. It falls apart for more sophisticated things such as "redo my ssh host keys and restart the service" or "regenerate my samba machine uuid".
> 
>> 4. We already have a mechanism for this kind of thing, because the
>> same issue comes up when fork()ing. The solution was MADV_WIPEONFORK,
>> where userspace marks a page to be zeroed when forking, for the
>> purposes of the RNG being notified when its world gets split in two.
>> This is basically the same thing as we're discussing here with guest
>> snapshots, except it's on the system level rather than the process
>> level, and a system has many processes. But the problem space is still
>> almost the same, and we could simply reuse that same mechanism. There
>> are a few implementation strategies for that:
> 
> Yup, that's where we started from :). And then we ran into resistance by the mm people (on CC here). And then we looked at the problem more in depth and checked what it would take to for example implement this for user space RNGs in Java. It's ... more complicated than one may think at first.
> 
>> 4a. We mess with the PTEs of all processes' pages that are
>> MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us
>> to do so. Then we wind up reusing the already existing logic for
>> userspace RNGs. Cons might be that this usually requires semaphores,
>> and we're in irq context, so we'd have to hoist to a workqueue, which
>> means either more wake up latency, or a larger race window.
>>
>> 4b. We just memzero all processes' pages that are MADV_WIPEONFORK,
>> when the hypervisor notifies us to do so. Then we wind up reusing the
>> already existing logic for userspace RNGs.
>>
>> 4c. The guest kernel maintains an array of physical addresses that are
>> MADV_WIPEONFORK. The hypervisor knows about this array and its
>> location through whatever protocol, and before resuming a
>> moved/snapshotted/duplicated VM, it takes the responsibility for
>> memzeroing this memory. The huge pro here would be that this
>> eliminates all races, and reduces complexity quite a bit, because the
>> hypervisor can perfectly synchronize its bringup (and SMP bringup)
>> with this, and it can even optimize things like on-disk memory
>> snapshots to simply not write out those pages to disk.
>>
>> A 4c-like approach seems like it'd be a lot of bang for the buck -- we
>> reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
>> userspace API to deal with, and it'd be race free, and eliminate a lot
>> of kernel complexity.
>>
>> But 4b and 3 don't seem too bad either.
>>
>> Any thoughts on 4c? Is that utterly insane, or does that actually get
>> us somewhere close to what we want?
> 
> All of the options for "4" are possible and have an RFC out. Please check out the discussion linked above :).
> 
> The problem with anything that relies on close loop reads (options 3+4) is not going to work well with the more sophisticated use case of derived data.
> 
> IMHO it will boil down to "both". We will need a high-speed interface that with close-to-0 overhead tells you either the generation ID or clears pages (options 3+4) as well as something that is bigger for applications that can either intrinsically (sshd) or by system design (Java) not adopt the mechanisms above easily.
> 
> That said, we need to start somewhere. I don't mind which angle we start from. But this is a real world problem and one that will only become more prevalent over time as VMs are used for more than only your traditional enterprise hardware consolidation.
> 
> 
> Alex
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
>
Alexander Graf Oct. 20, 2020, 9:54 a.m. UTC | #27
On 20.10.20 11:35, Christian Borntraeger wrote:
> On 17.10.20 20:09, Alexander Graf wrote:
>> Hi Jason,
>>
>> On 17.10.20 15:24, Jason A. Donenfeld wrote:
>>>
>>> After discussing this offline with Jann a bit, I have a few general
>>> comments on the design of this.
>>>
>>> First, the UUID communicated by the hypervisor should be consumed by
>>> the kernel -- added as another input to the rng -- and then userspace
>>
>> We definitely want a kernel internal notifier as well, yes :).
>>
>>> should be notified that it should reseed any userspace RNGs that it
>>> may have, without actually communicating that UUID to userspace. IOW,
>>
>> I also tend to agree that it makes sense to disconnect the actual UUID we receive from the notification to user space. This would allow us to create a generic mechanism for VM save/restore cycles across different hypervisors. Let me add PPC and s390x people to the CC list to see whether they have anything remotely similar to the VmGenID mechanism. For x86 and aarch64, the ACPI and memory based VmGenID implemented here is the most obvious option to implement IMHO. It's also already implemented in all major hypervisors.
> 
> Hmm, what we do have configurations (e.g. stfle bits) and we do have a notification mechanism via sclp that notifies guests when things change.
> As of today neither KVM nor Linux implement the sclp change notification mechanism, but I do see value in such a thing.

stfle only stores architected CPU capabilities, no? The UUID is about 
uniquely identifying clones of the same base image, so they can 
reestablish their uniqueness.

That said, your interest means that there may be a mechanism on s390 one 
day, so we should think about making it more generic.

> 
>>
>>> I agree with Jann there. Then, it's the functioning of this
>>> notification mechanism to userspace that is interesting to me.
>>
>> Absolutely! Please have a look at the previous discussion here:
>>
>>
>> https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/
>>
>> The user space interface is absolutely what this is about.
> 
> Yes. Passing a notification to userspace is essential. Where I do not see a solution yet is the race between notification and
> already running with the old knowledge.

With a post-mortem interface, we will always have a tiny race window. 
I'm not really convinced that this is a serious problem yet though.

In order to do extract anything off a virtual machine that was cloned, 
you need to communicate with it. If you for example stop the network 
link until all of this device's users have indicated they are finished 
adjusting, the race should be small enough for any practical purposes I 
can think of.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Alexander Graf Oct. 20, 2020, 10 a.m. UTC | #28
On 19.10.20 19:15, Mathieu Desnoyers wrote:
> 
> 
> ----- On Oct 17, 2020, at 2:10 PM, Andy Lutomirski luto@kernel.org wrote:
> 
>> On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote:
>>>
>>> [adding some more people who are interested in RNG stuff: Andy, Jason,
>>> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
>>> concerns some pretty fundamental API stuff related to RNG usage]
>>>
>>> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
>>> <acatan@amazon.com> wrote:
>>>> - Background
>>>>
>>>> The VM Generation ID is a feature defined by Microsoft (paper:
>>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>>>> multiple hypervisor vendors.
>>>>
>>>> The feature is required in virtualized environments by apps that work
>>>> with local copies/caches of world-unique data such as random values,
>>>> uuids, monotonically increasing counters, etc.
>>>> Such apps can be negatively affected by VM snapshotting when the VM
>>>> is either cloned or returned to an earlier point in time.
>>>>
>>>> The VM Generation ID is a simple concept meant to alleviate the issue
>>>> by providing a unique ID that changes each time the VM is restored
>>>> from a snapshot. The hw provided UUID value can be used to
>>>> differentiate between VMs or different generations of the same VM.
>>>>
>>>> - Problem
>>>>
>>>> The VM Generation ID is exposed through an ACPI device by multiple
>>>> hypervisor vendors but neither the vendors or upstream Linux have no
>>>> default driver for it leaving users to fend for themselves.
>>>>
>>>> Furthermore, simply finding out about a VM generation change is only
>>>> the starting point of a process to renew internal states of possibly
>>>> multiple applications across the system. This process could benefit
>>>> from a driver that provides an interface through which orchestration
>>>> can be easily done.
>>>>
>>>> - Solution
>>>>
>>>> This patch is a driver which exposes the Virtual Machine Generation ID
>>>> via a char-dev FS interface that provides ID update sync and async
>>>> notification, retrieval and confirmation mechanisms:
>>>>
>>>> When the device is 'open()'ed a copy of the current vm UUID is
>>>> associated with the file handle. 'read()' operations block until the
>>>> associated UUID is no longer up to date - until HW vm gen id changes -
>>>> at which point the new UUID is provided/returned. Nonblocking 'read()'
>>>> uses EWOULDBLOCK to signal that there is no _new_ UUID available.
>>>>
>>>> 'poll()' is implemented to allow polling for UUID updates. Such
>>>> updates result in 'EPOLLIN' events.
>>>>
>>>> Subsequent read()s following a UUID update no longer block, but return
>>>> the updated UUID. The application needs to acknowledge the UUID update
>>>> by confirming it through a 'write()'.
>>>> Only on writing back to the driver the right/latest UUID, will the
>>>> driver mark this "watcher" as up to date and remove EPOLLIN status.
>>>>
>>>> 'mmap()' support allows mapping a single read-only shared page which
>>>> will always contain the latest UUID value at offset 0.
>>>
>>> It would be nicer if that page just contained an incrementing counter,
>>> instead of a UUID. It's not like the application cares *what* the UUID
>>> changed to, just that it *did* change and all RNGs state now needs to
>>> be reseeded from the kernel, right? And an application can't reliably
>>> read the entire UUID from the memory mapping anyway, because the VM
>>> might be forked in the middle.
>>>
>>> So I think your kernel driver should detect UUID changes and then turn
>>> those into a monotonically incrementing counter. (Probably 64 bits
>>> wide?) (That's probably also a little bit faster than comparing an
>>> entire UUID.)
>>>
>>> An option might be to put that counter into the vDSO, instead of a
>>> separate VMA; but I don't know how the other folks feel about that.
>>> Andy, do you have opinions on this? That way, normal userspace code
>>> that uses this infrastructure wouldn't have to mess around with a
>>> special device at all. And it'd be usable in seccomp sandboxes and so
>>> on without needing special plumbing. And libraries wouldn't have to
>>> call open() and mess with file descriptor numbers.
>>
>> The vDSO might be annoyingly slow for this.  Something like the rseq
>> page might make sense.  It could be a generic indication of "system
>> went through some form of suspend".
> 
> This might indeed fit nicely as an extension of my KTLS prototype (extensible rseq):
> 
> https://lore.kernel.org/lkml/20200925181518.4141-1-mathieu.desnoyers@efficios.com/
> 
> There are a few ways we could wire things up. One might be to add the
> UUID field into the extended KTLS structure (so it's always updated after it
> changes on next return to user-space). For this I assume that the Linux scheduler

I think one that that became apparent in the discussion in this thread 
was that we want a Linux internal generation counter rather than expose 
the UUID verbatim. That way, we don't give away potential secrets to 
user space and we can support other architectures more easily.

> within the guest VM always preempts all threads before a VM is suspended (is that
> indeed true ?).

The VM does not know that it gets snapshotted. It only knows that it 
gets resumed (through this interface).

> This leads to one important question though: how is the UUID check vs commit operation
> made atomic with respect to suspend ? Unless we use rseq critical sections in assembly,
> where the kernel will abort the rseq critical section on preemption, I don't see how we
> can ensure that the UUID value does not change right after it has been checked, before
> the "commit" side-effect. And what is the expected "commit" side-effect ? Is it a store
> to a variable in user-space memory, or is it issuing a system call which sends a packet over
> the network ?

I think the easiest answer I could come up with here would be "make it a 
u32". Then you can just access it atomically anywhere, no?

The burden on user space with such an interface is still pretty high 
though. All user space that wants to do a "transaction" based on secrets 
would now need to read the generation ID at the beginning of the 
transaction and double check whether it's still the same at the end of 
it (e.g. before sending out a network packet based on a key derived from 
randomness?).


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Catangiu, Adrian Costin Oct. 20, 2020, 4:54 p.m. UTC | #29
Hi all,

On 17/10/2020, 21:09, "Graf (AWS), Alexander" <graf@amazon.de> wrote:
    
    On 17.10.20 15:24, Jason A. Donenfeld wrote:
    > 
    > After discussing this offline with Jann a bit, I have a few general
    > comments on the design of this.
    > 
    > First, the UUID communicated by the hypervisor should be consumed by
    > the kernel -- added as another input to the rng -- and then userspace
    
    We definitely want a kernel internal notifier as well, yes :).

What would be a generic event trigger mechanism we could use from a kernel
module/driver for triggering rng reseed (possibly adding the uuid to the mix
as well)?

  




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
diff mbox series

Patch

diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..5224415
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,211 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+VMGENID
+============
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by apps that work
+with local copies/caches of world-unique data such as random values,
+uuids, monotonically increasing counters, etc.
+Such apps can be negatively affected by VM snapshotting when the VM
+is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hw provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The driver exposes the Virtual Machine Generation ID via a char-dev FS
+interface that provides ID update sync/async notification, retrieval
+and confirmation mechanisms:
+
+``open()``:
+  When the device is opened, a copy of the current vm UUID is
+  associated with the file handle. The driver now tracks this file
+  handle as an independent *watcher*. The driver tracks how many
+  watchers are aware of the latest Vm-Gen-Id uuid and how many of
+  them are *outdated*, outdated being those that have lived through
+  a Vm-Gen-Id change but not yet confirmed the generation change event.
+
+``read()``:
+  Read is meant to provide the *new* Vm-Gen-Id when a generation change
+  takes place. The read operation blocks until the associated UUID is
+  no longer up to date - until HW vm gen id changes - at which point
+  the new UUID is provided/returned. Nonblocking ``read()``
+  uses ``EAGAIN`` to signal that there is no *new* UUID available.
+  The hw UUID is considered *new* for each open file handle that hasn't
+  confirmed the new value, following a generation change. Therefore,
+  once a generation change takes place, all ``read()`` calls will
+  immediately return the new uuid and will continue to do so until the
+  new value is confirmed back to the driver through ``write()``.
+  Partial reads are not allowed - read buffer needs to be at least
+  ``sizeof(uuid_t)`` in size.
+
+``write()``:
+  Write is used to confirm the up-to-date Vm-Gen-Id back to the driver.
+  Following a VM generation change, all existing watchers are marked
+  as *outdated*. Each file handle will maintain the *outdated* status
+  until a ``write()`` confirms the up-to-date UUID back to the driver.
+  Partial writes are not allowed - write buffer should be exactly
+  ``sizeof(uuid_t)`` in size.
+
+``poll()``:
+  Poll is implemented to allow polling for UUID updates. Such
+  updates result in ``EPOLLIN`` polling status until the new up-to-date
+  UUID is confirmed back to the driver through a ``write()``.
+
+``ioctl()``:
+  The driver also adds support for tracking count of open file handles
+  that haven't acknowledged an UUID update. This is exposed through
+  two IOCTLs:
+
+  - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
+    *outdated* watchers - number of file handles that were open during
+    a VM generation change, and which have not yet confirmed the new
+    Vm-Gen-Id.
+  - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+    watchers, or if a ``timeout`` argument is provided, until the
+    timeout expires.
+
+``mmap()``:
+  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+  in size. The first 16 bytes of the mapped page will contain an
+  up-to-date copy of the VM generation UUID.
+  The mapped memory can be used as a low-latency UUID probe mechanism
+  in critical sections - see examples.
+
+``close()``:
+  Removes the file handle as a Vm-Gen-Id watcher.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+	void watchdog_thread_handler(int *thread_active)
+	{
+		uuid_t uuid;
+		int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
+
+		do {
+			// read new UUID - blocks until VM generation changes
+			read(fd, &uuid, sizeof(uuid));
+
+			// because of VM generation change, we need to rebuild world
+			reseed_app_env();
+
+			// confirm we're done handling UUID update
+			write(fd, &uuid, sizeof(uuid));
+		} while (atomic_read(thread_active));
+
+		close(fd);
+	}
+
+2) ASYNC simplified example::
+
+	void handle_io_on_vmgenfd(int vmgenfd)
+	{
+		uuid_t uuid;
+
+		// because of VM generation change, we need to rebuild world
+		reseed_app_env();
+
+		// read new UUID - we need it to confirm we've handled update
+		read(fd, &uuid, sizeof(uuid));
+
+		// confirm we're done handling UUID update
+		write(fd, &uuid, sizeof(uuid));
+	}
+
+	int main() {
+		int epfd, vmgenfd;
+		struct epoll_event ev;
+
+		epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+		vmgenfd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
+
+		// register vmgenid for polling
+		ev.events = EPOLLIN;
+		ev.data.fd = vmgenfd;
+		epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
+
+		// register other parts of your app for polling
+		// ...
+
+		while (1) {
+			// wait for something to do...
+			int nfds = epoll_wait(epfd, events,
+				MAX_EPOLL_EVENTS_PER_RUN,
+				EPOLL_RUN_TIMEOUT);
+			if (nfds < 0) die("Error in epoll_wait!");
+
+			// for each ready fd
+			for(int i = 0; i < nfds; i++) {
+				int fd = events[i].data.fd;
+
+				if (fd == vmgenfd)
+					handle_io_on_vmgenfd(vmgenfd);
+				else
+					handle_some_other_part_of_the_app(fd);
+			}
+		}
+
+		return 0;
+	}
+
+3) Mapped memory polling simplified example::
+
+	/*
+	 * app/library function that provides cached secrets
+	 */
+	char * safe_cached_secret(app_data_t *app)
+	{
+		char *secret;
+		volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app);
+	again:
+		secret = __cached_secret(app);
+
+		if (unlikely(*uuid_ptr != app->cached_uuid)) {
+			app->cached_uuid = *uuid_ptr;
+
+			// rebuild world then confirm the uuid update (thru write)
+			rebuild_caches(app);
+			ack_vmgenid_update(app);
+
+			goto again;
+		}
+
+		return secret;
+	}
+
+4) Orchestrator simplified example::
+
+	/*
+	 * orchestrator - manages multiple apps and libraries used by a service
+	 * and tries to make sure all sensitive components gracefully handle
+	 * VM generation changes.
+	 * Following function is called on detection of a VM generation change.
+	 */
+	int handle_vmgen_update(int vmgenfd, uuid_t new_uuid)
+	{
+		// pause until all components have handled event
+		pause_service();
+
+		// confirm *this* watcher as up-to-date
+		write(fd, &new_uuid, sizeof(uuid_t));
+
+		// wait for all *others*
+		ioctl(fd, VMGENID_WAIT_WATCHERS, NULL);
+
+		// all apps on the system have rebuilt worlds
+		resume_service();
+	}
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 363af2e..c80f1ce 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,19 @@  menuconfig VIRT_DRIVERS
 
 if VIRT_DRIVERS
 
+config VMGENID
+	tristate "Virtual Machine Generation ID driver"
+	depends on ACPI
+	default M
+	help
+	  This is a Virtual Machine Generation ID driver which provides
+	  a virtual machine unique identifier. The provided UUID can be
+	  watched through the FS interface exposed by this driver, and
+	  thus can provide notifications for VM snapshot or cloning events.
+	  This enables applications and libraries that store or cache
+	  sensitive information, to know that they need to regenerate it
+	  after process memory has been exposed to potential copying.
+
 config FSL_HV_MANAGER
 	tristate "Freescale hypervisor management driver"
 	depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index fd33124..a1f8dcc 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,4 +4,5 @@ 
 #
 
 obj-$(CONFIG_FSL_HV_MANAGER)	+= fsl_hypervisor.o
+obj-$(CONFIG_VMGENID)		+= vmgenid.o
 obj-y				+= vboxguest/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..d314c72
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,419 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc
+ * All rights reserved.
+ *	Authors:
+ *	  Adrian Catangiu <acatan@amazon.com>
+ *	  Or Idgar <oridgar@gmail.com>
+ *	  Gal Hammer <ghammer@redhat.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/cdev.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/uuid.h>
+#include <linux/vmgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct dev_data {
+	struct cdev       cdev;
+	dev_t             dev_id;
+	unsigned long     map_buf;
+
+	void              *uuid_iomap;
+	uuid_t            uuid;
+	wait_queue_head_t read_wait;
+
+	atomic_t          watchers;
+	atomic_t          outdated_watchers;
+	wait_queue_head_t outdated_wait;
+};
+
+struct file_data {
+	struct dev_data  *dev_data;
+	uuid_t           acked_uuid;
+};
+
+static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid)
+{
+	return !memcmp(uuid, &priv->uuid, sizeof(uuid_t));
+}
+
+static void vmgenid_put_outdated_watchers(struct dev_data *priv)
+{
+	if (atomic_dec_and_test(&priv->outdated_watchers))
+		wake_up_interruptible(&priv->outdated_wait);
+}
+
+static int vmgenid_open(struct inode *inode, struct file *file)
+{
+	struct dev_data *priv =
+		container_of(inode->i_cdev, struct dev_data, cdev);
+	struct file_data *file_data =
+		kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+	if (!file_data)
+		return -ENOMEM;
+
+	file_data->acked_uuid = priv->uuid;
+	file_data->dev_data = priv;
+
+	file->private_data = file_data;
+	atomic_inc(&priv->watchers);
+
+	return 0;
+}
+
+static int vmgenid_close(struct inode *inode, struct file *file)
+{
+	struct file_data *file_data = (struct file_data *) file->private_data;
+	struct dev_data *priv = file_data->dev_data;
+
+	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+		vmgenid_put_outdated_watchers(priv);
+	atomic_dec(&priv->watchers);
+	kfree(file->private_data);
+
+	return 0;
+}
+
+static ssize_t
+vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
+{
+	struct file_data *file_data =
+		(struct file_data *) file->private_data;
+	struct dev_data *priv = file_data->dev_data;
+	ssize_t ret;
+
+	if (nbytes == 0)
+		return 0;
+	/* disallow partial UUID reads */
+	if (nbytes < sizeof(uuid_t))
+		return -EINVAL;
+	nbytes = sizeof(uuid_t);
+
+	if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+		ret = wait_event_interruptible(
+			priv->read_wait,
+			!vmgenid_uuid_matches(priv, &file_data->acked_uuid)
+		);
+		if (ret)
+			return ret;
+	}
+
+	ret = copy_to_user(ubuf, &priv->uuid, nbytes);
+	if (ret)
+		return -EFAULT;
+
+	return nbytes;
+}
+
+static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	struct file_data *file_data = (struct file_data *) file->private_data;
+	struct dev_data *priv = file_data->dev_data;
+	uuid_t ack_uuid;
+
+	/* disallow partial UUID writes */
+	if (count != sizeof(uuid_t))
+		return -EINVAL;
+	if (copy_from_user(&ack_uuid, ubuf, count))
+		return -EFAULT;
+	/* wrong UUID acknowledged */
+	if (!vmgenid_uuid_matches(priv, &ack_uuid))
+		return -EINVAL;
+
+	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
+		/* update local view of UUID */
+		file_data->acked_uuid = ack_uuid;
+		vmgenid_put_outdated_watchers(priv);
+	}
+
+	return (ssize_t)count;
+}
+
+static __poll_t
+vmgenid_poll(struct file *file, poll_table *wait)
+{
+	__poll_t mask = 0;
+	struct file_data *file_data =
+		(struct file_data *) file->private_data;
+	struct dev_data *priv = file_data->dev_data;
+
+	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+		return EPOLLIN | EPOLLRDNORM;
+
+	poll_wait(file, &priv->read_wait, wait);
+
+	if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+		mask = EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static long vmgenid_ioctl(struct file *file,
+		unsigned int cmd, unsigned long arg)
+{
+	struct file_data *file_data =
+		(struct file_data *) file->private_data;
+	struct dev_data *priv = file_data->dev_data;
+	struct timespec __user *timeout = (void *) arg;
+	struct timespec kts;
+	ktime_t until;
+	int ret;
+
+	switch (cmd) {
+	case VMGENID_GET_OUTDATED_WATCHERS:
+		ret = atomic_read(&priv->outdated_watchers);
+		break;
+	case VMGENID_WAIT_WATCHERS:
+		if (timeout) {
+			ret = copy_from_user(&kts, timeout, sizeof(kts));
+			if (ret)
+				return -EFAULT;
+			until = timespec_to_ktime(kts);
+		} else {
+			until = KTIME_MAX;
+		}
+
+		ret = wait_event_interruptible_hrtimeout(
+			priv->outdated_wait,
+			!atomic_read(&priv->outdated_watchers),
+			until
+		);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf)
+{
+	struct page *page;
+	struct file_data *file_data =
+			(struct file_data *) vmf->vma->vm_private_data;
+	struct dev_data *priv = file_data->dev_data;
+
+	if (priv->map_buf) {
+		page = virt_to_page(priv->map_buf);
+		get_page(page);
+		vmf->page = page;
+	}
+
+	return 0;
+}
+
+static const struct vm_operations_struct vmgenid_vm_ops = {
+	.fault = vmgenid_vm_fault,
+};
+
+static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+		return -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) != 0)
+		return -EPERM;
+
+	vma->vm_ops = &vmgenid_vm_ops;
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_private_data = file->private_data;
+
+	return 0;
+}
+
+static const struct file_operations fops = {
+	.owner          = THIS_MODULE,
+	.mmap           = vmgenid_mmap,
+	.open           = vmgenid_open,
+	.release        = vmgenid_close,
+	.read           = vmgenid_read,
+	.write          = vmgenid_write,
+	.poll           = vmgenid_poll,
+	.compat_ioctl   = vmgenid_ioctl,
+	.unlocked_ioctl = vmgenid_ioctl,
+};
+
+static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle)
+{
+	int i;
+	phys_addr_t phys_addr;
+	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+	acpi_status status;
+	union acpi_object *pss;
+	union acpi_object *element;
+
+	status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+	if (ACPI_FAILURE(status)) {
+		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+		return -ENODEV;
+	}
+	pss = buffer.pointer;
+	if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+		return -EINVAL;
+
+	phys_addr = 0;
+	for (i = 0; i < pss->package.count; i++) {
+		element = &(pss->package.elements[i]);
+		if (element->type != ACPI_TYPE_INTEGER)
+			return -EINVAL;
+		phys_addr |= element->integer.value << i * 32;
+	}
+
+	priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+	if (!priv->uuid_iomap) {
+		pr_err("Could not map memory at 0x%llx, size %u",
+			   phys_addr,
+			   (u32)sizeof(uuid_t));
+		return -ENOMEM;
+	}
+
+	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+	memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
+
+	return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+	int ret;
+	struct dev_data *priv;
+
+	if (!device)
+		return -EINVAL;
+
+	priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->map_buf = get_zeroed_page(GFP_KERNEL);
+	if (!priv->map_buf) {
+		ret = -ENOMEM;
+		goto free;
+	}
+
+	device->driver_data = priv;
+
+	init_waitqueue_head(&priv->read_wait);
+	atomic_set(&priv->watchers, 0);
+	atomic_set(&priv->outdated_watchers, 0);
+	init_waitqueue_head(&priv->outdated_wait);
+
+	ret = vmgenid_acpi_map(priv, device->handle);
+	if (ret < 0)
+		goto err;
+
+	ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME);
+	if (ret < 0) {
+		pr_err("alloc_chrdev_region() failed for vmgenid\n");
+		goto err;
+	}
+
+	cdev_init(&priv->cdev, &fops);
+	cdev_add(&priv->cdev, priv->dev_id, 1);
+
+	return 0;
+
+err:
+	if (priv->uuid_iomap)
+		acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+
+	free_pages(priv->map_buf, 0);
+
+free:
+	kfree(priv);
+
+	return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+	struct dev_data *priv;
+
+	if (!device || !acpi_driver_data(device))
+		return -EINVAL;
+	priv = acpi_driver_data(device);
+
+	cdev_del(&priv->cdev);
+	unregister_chrdev_region(priv->dev_id, 1);
+	device->driver_data = NULL;
+
+	if (priv->uuid_iomap)
+		acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+	free_pages(priv->map_buf, 0);
+	kfree(priv);
+
+	return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+	uuid_t old_uuid;
+	struct dev_data *priv;
+
+	pr_debug("VMGENID notified, event %u", event);
+
+	if (!device || !acpi_driver_data(device)) {
+		pr_err("VMGENID notify with NULL private data");
+		return;
+	}
+	priv = acpi_driver_data(device);
+
+	/* update VM Generation UUID */
+	old_uuid = priv->uuid;
+	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+	if (!vmgenid_uuid_matches(priv, &old_uuid)) {
+		/* HW uuid updated */
+		memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
+		atomic_set(&priv->outdated_watchers,
+			 atomic_read(&priv->watchers));
+		wake_up_interruptible(&priv->read_wait);
+	}
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+	{"QEMUVGID", 0},
+	{"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+	.name = "vm_generation_id",
+	.ids = vmgenid_ids,
+	.owner = THIS_MODULE,
+	.ops = {
+		.add = vmgenid_acpi_add,
+		.remove = vmgenid_acpi_remove,
+		.notify = vmgenid_acpi_notify,
+	}
+};
+
+static int __init vmgenid_init(void)
+{
+	return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+	acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
new file mode 100644
index 0000000..f7fca7b
--- /dev/null
+++ b/include/uapi/linux/vmgenid.h
@@ -0,0 +1,22 @@ 
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2020, Amazon.com Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_VMGENID_H
+#define _UAPI_LINUX_VMGENID_H
+
+#include <linux/ioctl.h>
+#include <linux/time.h>
+
+#define VMGENID_IOCTL 0x2d
+#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
+#define VMGENID_WAIT_WATCHERS         _IOW(VMGENID_IOCTL, 2, struct timespec)
+
+#endif /* _UAPI_LINUX_VMGENID_H */
+