Message ID | 788878CE-2578-4991-A5A6-669DCABAC2F2@amazon.com (mailing list archive)
---|---
State | New, archived
Series | drivers/virt: vmgenid: add vm generation id driver
Sorry, I forgot to add a few people interested in this and the KVM ML to CC. Added them.
On 16/10/2020, 17:33, "Catangiu, Adrian Costin" <acatan@amazon.com> wrote:
- Background
The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.
The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.
The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.
- Problem
The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors, but neither the vendors nor upstream Linux provide a
default driver for it, leaving users to fend for themselves.
Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.
- Solution
This patch is a driver which exposes the Virtual Machine Generation ID
via a char-dev FS interface that provides ID update sync and async
notification, retrieval and confirmation mechanisms:
When the device is 'open()'ed a copy of the current vm UUID is
associated with the file handle. 'read()' operations block until the
associated UUID is no longer up to date - until HW vm gen id changes -
at which point the new UUID is provided/returned. Nonblocking 'read()'
uses EWOULDBLOCK to signal that there is no _new_ UUID available.
'poll()' is implemented to allow polling for UUID updates. Such
updates result in 'EPOLLIN' events.
Subsequent read()s following a UUID update no longer block, but return
the updated UUID. The application needs to acknowledge the UUID update
by confirming it through a 'write()'.
Only when the right/latest UUID is written back will the driver mark
this "watcher" as up to date and clear its EPOLLIN status.
'mmap()' support allows mapping a single read-only shared page which
will always contain the latest UUID value at offset 0.
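A mapped-page probe could then look roughly like the sketch below. This
mirrors example 3 in the documentation added by this patch, except that
memcmp() is used for the comparison; uuid_t is the libuuid type from
<uuid/uuid.h> and error handling is omitted:

    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <uuid/uuid.h>          /* userspace uuid_t: unsigned char[16] */

    /* Map the single read-only page exposed by /dev/vmgenid. */
    static const volatile unsigned char *map_vmgenid(int fd)
    {
        void *p = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

        return p == MAP_FAILED ? NULL : p;
    }

    /* Nonzero when the UUID at offset 0 differs from our cached copy. */
    static int vmgen_changed(const volatile unsigned char *page, const uuid_t cached)
    {
        return memcmp((const void *) page, cached, sizeof(uuid_t)) != 0;
    }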
The driver also adds support for tracking the count of open file handles
that haven't acknowledged a UUID update. This is exposed through
two IOCTLs:
* VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
_outdated_ watchers - number of file handles that were open during
a VM generation change, and which have not yet confirmed the new
Vm-Gen-Id.
* VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_
watchers, or if a 'timeout' argument is provided, until the timeout
expires.
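As a condensed sketch of how these pieces fit together (based on the
example workflows in the added documentation; reseed_local_state() stands
in for whatever application-specific rebuild is needed, and error handling
is omitted):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <uuid/uuid.h>
    #include <linux/vmgenid.h>

    extern void reseed_local_state(void);       /* application-specific hook */

    void handle_one_generation_change(void)
    {
        uuid_t uuid;
        int fd = open("/dev/vmgenid", O_RDWR);

        read(fd, &uuid, sizeof(uuid));          /* blocks until VM gen changes */
        reseed_local_state();                   /* rebuild caches, RNG state, ... */
        write(fd, &uuid, sizeof(uuid));         /* confirm this watcher */
        ioctl(fd, VMGENID_WAIT_WATCHERS, NULL); /* wait for all other watchers */
        close(fd);
    }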
This patch builds on top of Or Idgar <oridgar@gmail.com>'s proposal
https://lkml.org/lkml/2018/3/1/498
- Future improvements
Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.
There is a path for platform devices to match on _CID when _HID is
'PRP0001' - which is not the case for the Qemu vmgenid device.
Guidance and help here would be greatly appreciated.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
Documentation/virt/vmgenid.rst | 211 +++++++++++++++++++++
drivers/virt/Kconfig | 13 ++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 419 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vmgenid.h | 22 +++
5 files changed, 666 insertions(+)
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/virt/vmgenid.c
create mode 100644 include/uapi/linux/vmgenid.h
diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..5224415
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,211 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+VMGENID
+============
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by apps that work
+with local copies/caches of world-unique data such as random values,
+uuids, monotonically increasing counters, etc.
+Such apps can be negatively affected by VM snapshotting when the VM
+is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hw provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The driver exposes the Virtual Machine Generation ID via a char-dev FS
+interface that provides ID update sync/async notification, retrieval
+and confirmation mechanisms:
+
+``open()``:
+ When the device is opened, a copy of the current vm UUID is
+ associated with the file handle. The driver now tracks this file
+ handle as an independent *watcher*. The driver also tracks how many
+ watchers are aware of the latest Vm-Gen-Id UUID and how many of
+ them are *outdated* - those that have lived through a Vm-Gen-Id
+ change but have not yet confirmed the generation change event.
+
+``read()``:
+ Read is meant to provide the *new* Vm-Gen-Id when a generation change
+ takes place. The read operation blocks until the associated UUID is
+ no longer up to date - until HW vm gen id changes - at which point
+ the new UUID is provided/returned. Nonblocking ``read()``
+ uses ``EAGAIN`` to signal that there is no *new* UUID available.
+ The hw UUID is considered *new* for each open file handle that hasn't
+ confirmed the new value, following a generation change. Therefore,
+ once a generation change takes place, all ``read()`` calls will
+ immediately return the new uuid and will continue to do so until the
+ new value is confirmed back to the driver through ``write()``.
+ Partial reads are not allowed - read buffer needs to be at least
+ ``sizeof(uuid_t)`` in size.
+
+``write()``:
+ Write is used to confirm the up-to-date Vm-Gen-Id back to the driver.
+ Following a VM generation change, all existing watchers are marked
+ as *outdated*. Each file handle will maintain the *outdated* status
+ until a ``write()`` confirms the up-to-date UUID back to the driver.
+ Partial writes are not allowed - write buffer should be exactly
+ ``sizeof(uuid_t)`` in size.
+
+``poll()``:
+ Poll is implemented to allow polling for UUID updates. Such
+ updates result in ``EPOLLIN`` polling status until the new up-to-date
+ UUID is confirmed back to the driver through a ``write()``.
+
+``ioctl()``:
+ The driver also adds support for tracking the count of open file handles
+ that haven't acknowledged a UUID update. This is exposed through
+ two IOCTLs:
+
+ - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
+ *outdated* watchers - number of file handles that were open during
+ a VM generation change, and which have not yet confirmed the new
+ Vm-Gen-Id.
+ - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+ watchers, or if a ``timeout`` argument is provided, until the
+ timeout expires.
+
+``mmap()``:
+ The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+ in size. The first 16 bytes of the mapped page will contain an
+ up-to-date copy of the VM generation UUID.
+ The mapped memory can be used as a low-latency UUID probe mechanism
+ in critical sections - see examples.
+
+``close()``:
+ Removes the file handle as a Vm-Gen-Id watcher.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+ void watchdog_thread_handler(int *thread_active)
+ {
+ uuid_t uuid;
+ int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
+
+ do {
+ // read new UUID - blocks until VM generation changes
+ read(fd, &uuid, sizeof(uuid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling UUID update
+ write(fd, &uuid, sizeof(uuid));
+ } while (atomic_read(thread_active));
+
+ close(fd);
+ }
+
+2) ASYNC simplified example::
+
+ void handle_io_on_vmgenfd(int vmgenfd)
+ {
+ uuid_t uuid;
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // read new UUID - we need it to confirm we've handled update
+ read(vmgenfd, &uuid, sizeof(uuid));
+
+ // confirm we're done handling UUID update
+ write(vmgenfd, &uuid, sizeof(uuid));
+ }
+
+ int main() {
+ int epfd, vmgenfd;
+ struct epoll_event ev;
+
+ epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+ vmgenfd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR);
+
+ // register vmgenid for polling
+ ev.events = EPOLLIN;
+ ev.data.fd = vmgenfd;
+ epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
+
+ // register other parts of your app for polling
+ // ...
+
+ while (1) {
+ // wait for something to do...
+ int nfds = epoll_wait(epfd, events,
+ MAX_EPOLL_EVENTS_PER_RUN,
+ EPOLL_RUN_TIMEOUT);
+ if (nfds < 0) die("Error in epoll_wait!");
+
+ // for each ready fd
+ for(int i = 0; i < nfds; i++) {
+ int fd = events[i].data.fd;
+
+ if (fd == vmgenfd)
+ handle_io_on_vmgenfd(vmgenfd);
+ else
+ handle_some_other_part_of_the_app(fd);
+ }
+ }
+
+ return 0;
+ }
+
+3) Mapped memory polling simplified example::
+
+ /*
+ * app/library function that provides cached secrets
+ */
+ char * safe_cached_secret(app_data_t *app)
+ {
+ char *secret;
+ volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app);
+ again:
+ secret = __cached_secret(app);
+
+ if (unlikely(memcmp((void *) uuid_ptr, &app->cached_uuid, sizeof(uuid_t)))) {
+ memcpy(&app->cached_uuid, (void *) uuid_ptr, sizeof(uuid_t));
+
+ // rebuild world then confirm the uuid update (thru write)
+ rebuild_caches(app);
+ ack_vmgenid_update(app);
+
+ goto again;
+ }
+
+ return secret;
+ }
+
+4) Orchestrator simplified example::
+
+ /*
+ * orchestrator - manages multiple apps and libraries used by a service
+ * and tries to make sure all sensitive components gracefully handle
+ * VM generation changes.
+ * Following function is called on detection of a VM generation change.
+ */
+ int handle_vmgen_update(int vmgenfd, uuid_t new_uuid)
+ {
+ // pause until all components have handled event
+ pause_service();
+
+ // confirm *this* watcher as up-to-date
+ write(vmgenfd, &new_uuid, sizeof(uuid_t));
+
+ // wait for all *others*
+ ioctl(vmgenfd, VMGENID_WAIT_WATCHERS, NULL);
+
+ // all apps on the system have rebuilt worlds
+ resume_service();
+ }
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 363af2e..c80f1ce 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,19 @@ menuconfig VIRT_DRIVERS
if VIRT_DRIVERS
+config VMGENID
+ tristate "Virtual Machine Generation ID driver"
+ depends on ACPI
+ default M
+ help
+ This is a Virtual Machine Generation ID driver which provides
+ a virtual machine unique identifier. The provided UUID can be
+ watched through the FS interface exposed by this driver, and
+ thus can provide notifications for VM snapshot or cloning events.
+ This enables applications and libraries that store or cache
+ sensitive information, to know that they need to regenerate it
+ after process memory has been exposed to potential copying.
+
config FSL_HV_MANAGER
tristate "Freescale hypervisor management driver"
depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index fd33124..a1f8dcc 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,4 +4,5 @@
#
obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
+obj-$(CONFIG_VMGENID) += vmgenid.o
obj-y += vboxguest/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..d314c72
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,419 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc
+ * All rights reserved.
+ * Authors:
+ * Adrian Catangiu <acatan@amazon.com>
+ * Or Idgar <oridgar@gmail.com>
+ * Gal Hammer <ghammer@redhat.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/cdev.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/uuid.h>
+#include <linux/vmgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct dev_data {
+ struct cdev cdev;
+ dev_t dev_id;
+ unsigned long map_buf;
+
+ void *uuid_iomap;
+ uuid_t uuid;
+ wait_queue_head_t read_wait;
+
+ atomic_t watchers;
+ atomic_t outdated_watchers;
+ wait_queue_head_t outdated_wait;
+};
+
+struct file_data {
+ struct dev_data *dev_data;
+ uuid_t acked_uuid;
+};
+
+static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid)
+{
+ return !memcmp(uuid, &priv->uuid, sizeof(uuid_t));
+}
+
+static void vmgenid_put_outdated_watchers(struct dev_data *priv)
+{
+ if (atomic_dec_and_test(&priv->outdated_watchers))
+ wake_up_interruptible(&priv->outdated_wait);
+}
+
+static int vmgenid_open(struct inode *inode, struct file *file)
+{
+ struct dev_data *priv =
+ container_of(inode->i_cdev, struct dev_data, cdev);
+ struct file_data *file_data =
+ kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+ if (!file_data)
+ return -ENOMEM;
+
+ file_data->acked_uuid = priv->uuid;
+ file_data->dev_data = priv;
+
+ file->private_data = file_data;
+ atomic_inc(&priv->watchers);
+
+ return 0;
+}
+
+static int vmgenid_close(struct inode *inode, struct file *file)
+{
+ struct file_data *file_data = (struct file_data *) file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+ vmgenid_put_outdated_watchers(priv);
+ atomic_dec(&priv->watchers);
+ kfree(file->private_data);
+
+ return 0;
+}
+
+static ssize_t
+vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
+{
+ struct file_data *file_data =
+ (struct file_data *) file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ ssize_t ret;
+
+ if (nbytes == 0)
+ return 0;
+ /* disallow partial UUID reads */
+ if (nbytes < sizeof(uuid_t))
+ return -EINVAL;
+ nbytes = sizeof(uuid_t);
+
+ if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
+ if (file->f_flags & O_NONBLOCK)
+ return -EAGAIN;
+ ret = wait_event_interruptible(
+ priv->read_wait,
+ !vmgenid_uuid_matches(priv, &file_data->acked_uuid)
+ );
+ if (ret)
+ return ret;
+ }
+
+ ret = copy_to_user(ubuf, &priv->uuid, nbytes);
+ if (ret)
+ return -EFAULT;
+
+ return nbytes;
+}
+
+static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct file_data *file_data = (struct file_data *) file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ uuid_t ack_uuid;
+
+ /* disallow partial UUID writes */
+ if (count != sizeof(uuid_t))
+ return -EINVAL;
+ if (copy_from_user(&ack_uuid, ubuf, count))
+ return -EFAULT;
+ /* wrong UUID acknowledged */
+ if (!vmgenid_uuid_matches(priv, &ack_uuid))
+ return -EINVAL;
+
+ if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) {
+ /* update local view of UUID */
+ file_data->acked_uuid = ack_uuid;
+ vmgenid_put_outdated_watchers(priv);
+ }
+
+ return (ssize_t)count;
+}
+
+static __poll_t
+vmgenid_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct file_data *file_data =
+ (struct file_data *) file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+ return EPOLLIN | EPOLLRDNORM;
+
+ poll_wait(file, &priv->read_wait, wait);
+
+ if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid))
+ mask = EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static long vmgenid_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct file_data *file_data =
+ (struct file_data *) file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ struct timespec __user *timeout = (void *) arg;
+ struct timespec kts;
+ ktime_t until;
+ int ret;
+
+ switch (cmd) {
+ case VMGENID_GET_OUTDATED_WATCHERS:
+ ret = atomic_read(&priv->outdated_watchers);
+ break;
+ case VMGENID_WAIT_WATCHERS:
+ if (timeout) {
+ ret = copy_from_user(&kts, timeout, sizeof(kts));
+ if (ret)
+ return -EFAULT;
+ until = timespec_to_ktime(kts);
+ } else {
+ until = KTIME_MAX;
+ }
+
+ ret = wait_event_interruptible_hrtimeout(
+ priv->outdated_wait,
+ !atomic_read(&priv->outdated_watchers),
+ until
+ );
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf)
+{
+ struct page *page;
+ struct file_data *file_data =
+ (struct file_data *) vmf->vma->vm_private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (priv->map_buf) {
+ page = virt_to_page(priv->map_buf);
+ get_page(page);
+ vmf->page = page;
+ }
+
+ return 0;
+}
+
+static const struct vm_operations_struct vmgenid_vm_ops = {
+ .fault = vmgenid_vm_fault,
+};
+
+static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) != 0)
+ return -EPERM;
+
+ vma->vm_ops = &vmgenid_vm_ops;
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_private_data = file->private_data;
+
+ return 0;
+}
+
+static const struct file_operations fops = {
+ .owner = THIS_MODULE,
+ .mmap = vmgenid_mmap,
+ .open = vmgenid_open,
+ .release = vmgenid_close,
+ .read = vmgenid_read,
+ .write = vmgenid_write,
+ .poll = vmgenid_poll,
+ .compat_ioctl = vmgenid_ioctl,
+ .unlocked_ioctl = vmgenid_ioctl,
+};
+
+static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle)
+{
+ int i;
+ phys_addr_t phys_addr;
+ struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+ acpi_status status;
+ union acpi_object *pss;
+ union acpi_object *element;
+
+ status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ pss = buffer.pointer;
+ if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+ return -EINVAL;
+
+ phys_addr = 0;
+ for (i = 0; i < pss->package.count; i++) {
+ element = &(pss->package.elements[i]);
+ if (element->type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+ phys_addr |= element->integer.value << i * 32;
+ }
+
+ priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+ if (!priv->uuid_iomap) {
+ pr_err("Could not map memory at 0x%llx, size %u",
+ phys_addr,
+ (u32)sizeof(uuid_t));
+ return -ENOMEM;
+ }
+
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+ memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
+
+ return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+ int ret;
+ struct dev_data *priv;
+
+ if (!device)
+ return -EINVAL;
+
+ priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->map_buf = get_zeroed_page(GFP_KERNEL);
+ if (!priv->map_buf) {
+ ret = -ENOMEM;
+ goto free;
+ }
+
+ device->driver_data = priv;
+
+ init_waitqueue_head(&priv->read_wait);
+ atomic_set(&priv->watchers, 0);
+ atomic_set(&priv->outdated_watchers, 0);
+ init_waitqueue_head(&priv->outdated_wait);
+
+ ret = vmgenid_acpi_map(priv, device->handle);
+ if (ret < 0)
+ goto err;
+
+ ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME);
+ if (ret < 0) {
+ pr_err("alloc_chrdev_region() failed for vmgenid\n");
+ goto err;
+ }
+
+ cdev_init(&priv->cdev, &fops);
+ cdev_add(&priv->cdev, priv->dev_id, 1);
+
+ return 0;
+
+err:
+ if (priv->uuid_iomap)
+ acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+
+ free_pages(priv->map_buf, 0);
+
+free:
+ kfree(priv);
+
+ return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+ struct dev_data *priv;
+
+ if (!device || !acpi_driver_data(device))
+ return -EINVAL;
+ priv = acpi_driver_data(device);
+
+ cdev_del(&priv->cdev);
+ unregister_chrdev_region(priv->dev_id, 1);
+ device->driver_data = NULL;
+
+ if (priv->uuid_iomap)
+ acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+ free_pages(priv->map_buf, 0);
+ kfree(priv);
+
+ return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+ uuid_t old_uuid;
+ struct dev_data *priv;
+
+ pr_debug("VMGENID notified, event %u", event);
+
+ if (!device || !acpi_driver_data(device)) {
+ pr_err("VMGENID notify with NULL private data");
+ return;
+ }
+ priv = acpi_driver_data(device);
+
+ /* update VM Generation UUID */
+ old_uuid = priv->uuid;
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ if (!vmgenid_uuid_matches(priv, &old_uuid)) {
+ /* HW uuid updated */
+ memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t));
+ atomic_set(&priv->outdated_watchers,
+ atomic_read(&priv->watchers));
+ wake_up_interruptible(&priv->read_wait);
+ }
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+ {"QEMUVGID", 0},
+ {"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+ .name = "vm_generation_id",
+ .ids = vmgenid_ids,
+ .owner = THIS_MODULE,
+ .ops = {
+ .add = vmgenid_acpi_add,
+ .remove = vmgenid_acpi_remove,
+ .notify = vmgenid_acpi_notify,
+ }
+};
+
+static int __init vmgenid_init(void)
+{
+ return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+ acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
new file mode 100644
index 0000000..f7fca7b
--- /dev/null
+++ b/include/uapi/linux/vmgenid.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2020, Amazon.com Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_VMGENID_H
+#define _UAPI_LINUX_VMGENID_H
+
+#include <linux/ioctl.h>
+#include <linux/time.h>
+
+#define VMGENID_IOCTL 0x2d
+#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
+#define VMGENID_WAIT_WATCHERS _IOW(VMGENID_IOCTL, 2, struct timespec)
+
+#endif /* _UAPI_LINUX_VMGENID_H */
+
--
2.7.4
On Fri, Oct 16, 2020 at 02:33:15PM +0000, Catangiu, Adrian Costin wrote: > +config VMGENID > + tristate "Virtual Machine Generation ID driver" > + depends on ACPI > + default M Unless this is required to boot a machine, this should be removed. > + help > + This is a Virtual Machine Generation ID driver which provides > + a virtual machine unique identifier. The provided UUID can be > + watched through the FS interface exposed by this driver, and > + thus can provide notifications for VM snapshot or cloning events. Where is the uuid exposed in the filesystem? > + This enables applications and libraries that store or cache > + sensitive information, to know that they need to regenerate it > + after process memory has been exposed to potential copying. Module name? > + > config FSL_HV_MANAGER > tristate "Freescale hypervisor management driver" > depends on FSL_SOC > diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile > index fd33124..a1f8dcc 100644 > --- a/drivers/virt/Makefile > +++ b/drivers/virt/Makefile > @@ -4,4 +4,5 @@ > # > > obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o > +obj-$(CONFIG_VMGENID) += vmgenid.o > obj-y += vboxguest/ > diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c > new file mode 100644 > index 0000000..d314c72 > --- /dev/null > +++ b/drivers/virt/vmgenid.c > @@ -0,0 +1,419 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Virtual Machine Generation ID driver > + * > + * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc That's not how you write copyright lines, please fix up. > + * All rights reserved. > + * Authors: > + * Adrian Catangiu <acatan@amazon.com> > + * Or Idgar <oridgar@gmail.com> > + * Gal Hammer <ghammer@redhat.com> > + * > + */ > +#include <linux/acpi.h> > +#include <linux/cdev.h> > +#include <linux/kernel.h> > +#include <linux/mm.h> > +#include <linux/module.h> > +#include <linux/poll.h> > +#include <linux/uuid.h> > +#include <linux/vmgenid.h> > + > +#define DEV_NAME "vmgenid" > +ACPI_MODULE_NAME(DEV_NAME); > + > +struct dev_data { > + struct cdev cdev; > + dev_t dev_id; > + unsigned long map_buf; > + > + void *uuid_iomap; > + uuid_t uuid; > + wait_queue_head_t read_wait; > + > + atomic_t watchers; > + atomic_t outdated_watchers; > + wait_queue_head_t outdated_wait; > +}; > + > +struct file_data { > + struct dev_data *dev_data; > + uuid_t acked_uuid; > +}; > + > +static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid) > +{ > + return !memcmp(uuid, &priv->uuid, sizeof(uuid_t)); > +} > + > +static void vmgenid_put_outdated_watchers(struct dev_data *priv) > +{ > + if (atomic_dec_and_test(&priv->outdated_watchers)) > + wake_up_interruptible(&priv->outdated_wait); > +} > + > +static int vmgenid_open(struct inode *inode, struct file *file) > +{ > + struct dev_data *priv = > + container_of(inode->i_cdev, struct dev_data, cdev); > + struct file_data *file_data = > + kzalloc(sizeof(struct file_data), GFP_KERNEL); > + > + if (!file_data) > + return -ENOMEM; > + > + file_data->acked_uuid = priv->uuid; > + file_data->dev_data = priv; > + > + file->private_data = file_data; > + atomic_inc(&priv->watchers); > + > + return 0; > +} > + > +static int vmgenid_close(struct inode *inode, struct file *file) > +{ > + struct file_data *file_data = (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + > + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) > + vmgenid_put_outdated_watchers(priv); > + atomic_dec(&priv->watchers); > + kfree(file->private_data); > + > + return 0; > 
+} > + > +static ssize_t > +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos) > +{ > + struct file_data *file_data = > + (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + ssize_t ret; > + > + if (nbytes == 0) > + return 0; > + /* disallow partial UUID reads */ > + if (nbytes < sizeof(uuid_t)) > + return -EINVAL; > + nbytes = sizeof(uuid_t); > + > + if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) { > + if (file->f_flags & O_NONBLOCK) > + return -EAGAIN; > + ret = wait_event_interruptible( > + priv->read_wait, > + !vmgenid_uuid_matches(priv, &file_data->acked_uuid) > + ); > + if (ret) > + return ret; > + } > + > + ret = copy_to_user(ubuf, &priv->uuid, nbytes); > + if (ret) > + return -EFAULT; > + > + return nbytes; > +} > + > +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf, > + size_t count, loff_t *ppos) > +{ > + struct file_data *file_data = (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + uuid_t ack_uuid; > + > + /* disallow partial UUID writes */ > + if (count != sizeof(uuid_t)) > + return -EINVAL; > + if (copy_from_user(&ack_uuid, ubuf, count)) > + return -EFAULT; > + /* wrong UUID acknowledged */ > + if (!vmgenid_uuid_matches(priv, &ack_uuid)) > + return -EINVAL; > + > + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) { > + /* update local view of UUID */ > + file_data->acked_uuid = ack_uuid; > + vmgenid_put_outdated_watchers(priv); > + } > + > + return (ssize_t)count; > +} > + > +static __poll_t > +vmgenid_poll(struct file *file, poll_table *wait) > +{ > + __poll_t mask = 0; > + struct file_data *file_data = > + (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + > + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) > + return EPOLLIN | EPOLLRDNORM; > + > + poll_wait(file, &priv->read_wait, wait); > + > + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) > + mask = EPOLLIN | EPOLLRDNORM; > + > + return mask; > +} > + > +static long vmgenid_ioctl(struct file *file, > + unsigned int cmd, unsigned long arg) > +{ > + struct file_data *file_data = > + (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + struct timespec __user *timeout = (void *) arg; > + struct timespec kts; > + ktime_t until; > + int ret; > + > + switch (cmd) { > + case VMGENID_GET_OUTDATED_WATCHERS: > + ret = atomic_read(&priv->outdated_watchers); > + break; > + case VMGENID_WAIT_WATCHERS: > + if (timeout) { > + ret = copy_from_user(&kts, timeout, sizeof(kts)); > + if (ret) > + return -EFAULT; > + until = timespec_to_ktime(kts); No validation of structure fields? And you are exposing a kernel structure to userspace? I don't see this function in Linus's tree right now, are you sure this builds? > + } else { > + until = KTIME_MAX; > + } > + > + ret = wait_event_interruptible_hrtimeout( > + priv->outdated_wait, > + !atomic_read(&priv->outdated_watchers), > + until > + ); > + break; > + default: > + ret = -EINVAL; > + break; > + } > + return ret; > +} > + > +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf) > +{ > + struct page *page; > + struct file_data *file_data = > + (struct file_data *) vmf->vma->vm_private_data; You do know you don't need to cast stuff like this, right? Also run checkpatch.pl so maintainers are not grumpy and tell you to run checkpatch.pl. 
> + struct dev_data *priv = file_data->dev_data; > + > + if (priv->map_buf) { > + page = virt_to_page(priv->map_buf); > + get_page(page); > + vmf->page = page; > + } > + > + return 0; > +} > + > +static const struct vm_operations_struct vmgenid_vm_ops = { > + .fault = vmgenid_vm_fault, > +}; > + > +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1) > + return -EINVAL; > + > + if ((vma->vm_flags & VM_WRITE) != 0) > + return -EPERM; > + > + vma->vm_ops = &vmgenid_vm_ops; > + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; > + vma->vm_private_data = file->private_data; > + > + return 0; > +} > + > +static const struct file_operations fops = { > + .owner = THIS_MODULE, > + .mmap = vmgenid_mmap, > + .open = vmgenid_open, > + .release = vmgenid_close, > + .read = vmgenid_read, > + .write = vmgenid_write, > + .poll = vmgenid_poll, > + .compat_ioctl = vmgenid_ioctl, > + .unlocked_ioctl = vmgenid_ioctl, > +}; > + > +static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle) > +{ > + int i; > + phys_addr_t phys_addr; > + struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; > + acpi_status status; > + union acpi_object *pss; > + union acpi_object *element; > + > + status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer); > + if (ACPI_FAILURE(status)) { > + ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR")); > + return -ENODEV; > + } > + pss = buffer.pointer; > + if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2) > + return -EINVAL; > + > + phys_addr = 0; > + for (i = 0; i < pss->package.count; i++) { > + element = &(pss->package.elements[i]); > + if (element->type != ACPI_TYPE_INTEGER) > + return -EINVAL; > + phys_addr |= element->integer.value << i * 32; > + } > + > + priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t)); > + if (!priv->uuid_iomap) { > + pr_err("Could not map memory at 0x%llx, size %u", > + phys_addr, > + (u32)sizeof(uuid_t)); > + return -ENOMEM; > + } > + > + memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t)); > + memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t)); > + > + return 0; > +} > + > +static int vmgenid_acpi_add(struct acpi_device *device) > +{ > + int ret; > + struct dev_data *priv; > + > + if (!device) > + return -EINVAL; > + > + priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL); > + if (!priv) > + return -ENOMEM; > + > + priv->map_buf = get_zeroed_page(GFP_KERNEL); > + if (!priv->map_buf) { > + ret = -ENOMEM; > + goto free; > + } > + > + device->driver_data = priv; > + > + init_waitqueue_head(&priv->read_wait); > + atomic_set(&priv->watchers, 0); > + atomic_set(&priv->outdated_watchers, 0); > + init_waitqueue_head(&priv->outdated_wait); > + > + ret = vmgenid_acpi_map(priv, device->handle); > + if (ret < 0) > + goto err; > + > + ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME); Why is this not a misc device driver instead of messing around with all of this "by hand"? By doing things this way, you are not creating a device node in /dev/ automatically, right? How did you test this? > @@ -0,0 +1,22 @@ > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ > +/* > + * Copyright (c) 2020, Amazon.com Inc. > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License > + * as published by the Free Software Foundation; either version > + * 2 of the License, or (at your option) any later version. License boilerplate is not needed. 
> + */ > + > +#ifndef _UAPI_LINUX_VMGENID_H > +#define _UAPI_LINUX_VMGENID_H > + > +#include <linux/ioctl.h> > +#include <linux/time.h> > + > +#define VMGENID_IOCTL 0x2d > +#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1) > +#define VMGENID_WAIT_WATCHERS _IOW(VMGENID_IOCTL, 2, struct timespec) Why do you need ioctls for this? Why not just use read/write? thanks, greg k-h
[adding some more people who are interested in RNG stuff: Andy, Jason, Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this concerns some pretty fundamental API stuff related to RNG usage] On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin <acatan@amazon.com> wrote: > - Background > > The VM Generation ID is a feature defined by Microsoft (paper: > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > multiple hypervisor vendors. > > The feature is required in virtualized environments by apps that work > with local copies/caches of world-unique data such as random values, > uuids, monotonically increasing counters, etc. > Such apps can be negatively affected by VM snapshotting when the VM > is either cloned or returned to an earlier point in time. > > The VM Generation ID is a simple concept meant to alleviate the issue > by providing a unique ID that changes each time the VM is restored > from a snapshot. The hw provided UUID value can be used to > differentiate between VMs or different generations of the same VM. > > - Problem > > The VM Generation ID is exposed through an ACPI device by multiple > hypervisor vendors but neither the vendors or upstream Linux have no > default driver for it leaving users to fend for themselves. > > Furthermore, simply finding out about a VM generation change is only > the starting point of a process to renew internal states of possibly > multiple applications across the system. This process could benefit > from a driver that provides an interface through which orchestration > can be easily done. > > - Solution > > This patch is a driver which exposes the Virtual Machine Generation ID > via a char-dev FS interface that provides ID update sync and async > notification, retrieval and confirmation mechanisms: > > When the device is 'open()'ed a copy of the current vm UUID is > associated with the file handle. 'read()' operations block until the > associated UUID is no longer up to date - until HW vm gen id changes - > at which point the new UUID is provided/returned. Nonblocking 'read()' > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > 'poll()' is implemented to allow polling for UUID updates. Such > updates result in 'EPOLLIN' events. > > Subsequent read()s following a UUID update no longer block, but return > the updated UUID. The application needs to acknowledge the UUID update > by confirming it through a 'write()'. > Only on writing back to the driver the right/latest UUID, will the > driver mark this "watcher" as up to date and remove EPOLLIN status. > > 'mmap()' support allows mapping a single read-only shared page which > will always contain the latest UUID value at offset 0. It would be nicer if that page just contained an incrementing counter, instead of a UUID. It's not like the application cares *what* the UUID changed to, just that it *did* change and all RNGs state now needs to be reseeded from the kernel, right? And an application can't reliably read the entire UUID from the memory mapping anyway, because the VM might be forked in the middle. So I think your kernel driver should detect UUID changes and then turn those into a monotonically incrementing counter. (Probably 64 bits wide?) (That's probably also a little bit faster than comparing an entire UUID.) An option might be to put that counter into the vDSO, instead of a separate VMA; but I don't know how the other folks feel about that. Andy, do you have opinions on this? 
That way, normal userspace code that uses this infrastructure wouldn't have to mess around with a special device at all. And it'd be usable in seccomp sandboxes and so on without needing special plumbing. And libraries wouldn't have to call open() and mess with file descriptor numbers. > The driver also adds support for tracking count of open file handles > that haven't acknowledged an UUID update. This is exposed through > two IOCTLs: > * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of > _outdated_ watchers - number of file handles that were open during > a VM generation change, and which have not yet confirmed the new > Vm-Gen-Id. > * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_ > watchers, or if a 'timeout' argument is provided, until the timeout > expires. Does this mean that code that uses the memory mapping to detect changes is still supposed to confirm generation IDs? What about multithreaded processes, especially ones that use libraries - if a library opens a single file descriptor that is used from multiple threads, is the library required to synchronize all its threads before confirming the change? That seems like it could get messy. And it again means that this interface can't easily be used from inside seccomp sandboxes. [...] > diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst [...] > +``close()``: > + Removes the file handle as a Vm-Gen-Id watcher. (Linux doesn't have "file handles". Technically close() just closes a file descriptor, and if that file descriptor points to the same file description object (aka struct file) as another file descriptor, nothing happens.) > +Example application workflows > +----------------------------- > + > +1) Watchdog thread simplified example:: > + > + void watchdog_thread_handler(int *thread_active) > + { > + uuid_t uuid; > + int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR); In case we actually keep this API, you should use O_CLOEXEC here. > + > + do { > + // read new UUID - blocks until VM generation changes > + read(fd, &uuid, sizeof(uuid)); > + > + // because of VM generation change, we need to rebuild world > + reseed_app_env(); > + > + // confirm we're done handling UUID update > + write(fd, &uuid, sizeof(uuid)); > + } while (atomic_read(thread_active)); > + > + close(fd); > + } [...] > +3) Mapped memory polling simplified example:: > + > + /* > + * app/library function that provides cached secrets > + */ > + char * safe_cached_secret(app_data_t *app) > + { > + char *secret; > + volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app); > + again: > + secret = __cached_secret(app); > + > + if (unlikely(*uuid_ptr != app->cached_uuid)) { > + app->cached_uuid = *uuid_ptr; > + > + // rebuild world then confirm the uuid update (thru write) > + rebuild_caches(app); > + ack_vmgenid_update(app); > + > + goto again; > + } > + > + return secret; > + } > + > +4) Orchestrator simplified example:: > + > + /* > + * orchestrator - manages multiple apps and libraries used by a service > + * and tries to make sure all sensitive components gracefully handle > + * VM generation changes. > + * Following function is called on detection of a VM generation change. 
> + */ > + int handle_vmgen_update(int vmgenfd, uuid_t new_uuid) > + { > + // pause until all components have handled event > + pause_service(); > + > + // confirm *this* watcher as up-to-date > + write(fd, &new_uuid, sizeof(uuid_t)); > + > + // wait for all *others* > + ioctl(fd, VMGENID_WAIT_WATCHERS, NULL); > + > + // all apps on the system have rebuilt worlds > + resume_service(); > + } Can you describe what value such an "Orchestrator" would add? Because it seems to me like this will just unnecessarily complicate things. [...] > diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile > index fd33124..a1f8dcc 100644 > --- a/drivers/virt/Makefile > +++ b/drivers/virt/Makefile > @@ -4,4 +4,5 @@ > # > > obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o > +obj-$(CONFIG_VMGENID) += vmgenid.o > obj-y += vboxguest/ > diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c [...] > +static int vmgenid_close(struct inode *inode, struct file *file) > +{ > + struct file_data *file_data = (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + > + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) > + vmgenid_put_outdated_watchers(priv); > + atomic_dec(&priv->watchers); What happens if the UUID changes between the previous two calls? Then the outdated watcher count will go out of sync, right? (But as I've said, I think that maybe the outdated watcher counting is a bad idea in general, and we should just get rid of it.) > + kfree(file->private_data); > + > + return 0; > +} > + > +static ssize_t > +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos) > +{ > + struct file_data *file_data = > + (struct file_data *) file->private_data; > + struct dev_data *priv = file_data->dev_data; > + ssize_t ret; > + > + if (nbytes == 0) > + return 0; > + /* disallow partial UUID reads */ > + if (nbytes < sizeof(uuid_t)) > + return -EINVAL; > + nbytes = sizeof(uuid_t); > + > + if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) { > + if (file->f_flags & O_NONBLOCK) > + return -EAGAIN; > + ret = wait_event_interruptible( > + priv->read_wait, > + !vmgenid_uuid_matches(priv, &file_data->acked_uuid) > + ); > + if (ret) > + return ret; > + } > + > + ret = copy_to_user(ubuf, &priv->uuid, nbytes); If the VM is again forked in the middle of this, will userspace see a torn UUID (consisting of half old and half new value)? > + if (ret) > + return -EFAULT; > + > + return nbytes; > +} [...] > +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf) > +{ > + struct page *page; > + struct file_data *file_data = > + (struct file_data *) vmf->vma->vm_private_data; > + struct dev_data *priv = file_data->dev_data; > + > + if (priv->map_buf) { > + page = virt_to_page(priv->map_buf); > + get_page(page); > + vmf->page = page; > + } > + > + return 0; > +} > + > +static const struct vm_operations_struct vmgenid_vm_ops = { > + .fault = vmgenid_vm_fault, > +}; > + > +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1) > + return -EINVAL; > + > + if ((vma->vm_flags & VM_WRITE) != 0) > + return -EPERM; This doesn't work, you also need to clear VM_MAYWRITE. See e.g. binder_mmap(). Also, I think mmap handlers for special mappings like this usually directly install the page inside the mmap handler, using something like vm_insert_page(). And then they don't need a ->fault handler. (But if we decide to put this into the vDSO, the whole memory mapping thing would become unnecessary anyway.) 
> + vma->vm_ops = &vmgenid_vm_ops; > + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; > + vma->vm_private_data = file->private_data; > + > + return 0; > +} > + > +static const struct file_operations fops = { > + .owner = THIS_MODULE, > + .mmap = vmgenid_mmap, > + .open = vmgenid_open, > + .release = vmgenid_close, > + .read = vmgenid_read, > + .write = vmgenid_write, > + .poll = vmgenid_poll, > + .compat_ioctl = vmgenid_ioctl, You don't need to define a compat_ioctl if the normal ioctl handler is the same. > + .unlocked_ioctl = vmgenid_ioctl, > +}; [...] > +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event) > +{ > + uuid_t old_uuid; > + struct dev_data *priv; > + > + pr_debug("VMGENID notified, event %u", event); > + > + if (!device || !acpi_driver_data(device)) { > + pr_err("VMGENID notify with NULL private data"); > + return; > + } > + priv = acpi_driver_data(device); > + > + /* update VM Generation UUID */ > + old_uuid = priv->uuid; > + memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t)); > + > + if (!vmgenid_uuid_matches(priv, &old_uuid)) { > + /* HW uuid updated */ > + memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t)); > + atomic_set(&priv->outdated_watchers, > + atomic_read(&priv->watchers)); > + wake_up_interruptible(&priv->read_wait); > + } > +} If we know that the VM just got forked, we should probably also make sure that we reseed the kernel's internal RNG before we tell userspace to fetch new RNG seeds from the kernel? Otherwise this is kinda pointless. Or are we already taking care of that elsewhere? If not, we should probably mix the UUID into the entropy pool (at least `write_pool(&input_pool, uuid, sizeof(uuid_t))`, although technically it would be better to do it in a way that ensures that userspace can't write the same value into the RNG - maybe we should introduce type prefixes into write_pool()?) and then trigger a reseed of everything else (`crng_reseed(&primary_crng, &input_pool)`).
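A rough sketch of the kind of hook being discussed, assuming the driver's ACPI notify path simply calls the exported add_device_randomness() helper; note that this mixes the bytes into the input pool but does not by itself force the immediate crng_reseed() mentioned above:

    #include <linux/random.h>

    /*
     * Hypothetical addition to vmgenid_acpi_notify(), called once the
     * hardware UUID is seen to change: mix the fresh, host-provided
     * 128-bit value into the entropy pool before waking userspace watchers.
     */
    static void vmgenid_mix_into_rng(struct dev_data *priv)
    {
        add_device_randomness(&priv->uuid, sizeof(priv->uuid));
    }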
On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote: > [adding some more people who are interested in RNG stuff: Andy, Jason, > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > concerns some pretty fundamental API stuff related to RNG usage] > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > <acatan@amazon.com> wrote: > > This patch is a driver which exposes the Virtual Machine Generation ID > > via a char-dev FS interface that provides ID update sync and async > > notification, retrieval and confirmation mechanisms: > > > > When the device is 'open()'ed a copy of the current vm UUID is > > associated with the file handle. 'read()' operations block until the > > associated UUID is no longer up to date - until HW vm gen id changes - > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > updates result in 'EPOLLIN' events. > > > > Subsequent read()s following a UUID update no longer block, but return > > the updated UUID. The application needs to acknowledge the UUID update > > by confirming it through a 'write()'. > > Only on writing back to the driver the right/latest UUID, will the > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > 'mmap()' support allows mapping a single read-only shared page which > > will always contain the latest UUID value at offset 0. > > It would be nicer if that page just contained an incrementing counter, > instead of a UUID. It's not like the application cares *what* the UUID > changed to, just that it *did* change and all RNGs state now needs to > be reseeded from the kernel, right? And an application can't reliably > read the entire UUID from the memory mapping anyway, because the VM > might be forked in the middle. > > So I think your kernel driver should detect UUID changes and then turn > those into a monotonically incrementing counter. (Probably 64 bits > wide?) (That's probably also a little bit faster than comparing an > entire UUID.) I agree with this. Further, I'm observing there is a very common confusion between "universally unique" and "random". Randoms are needed when seeking unpredictability. A random number generator *must* be able to return the same value multiple times in a row (though this is rare), otherwise it's not random. To illustrate this, a die has less than 3 bits of randomness and is sufficient to play games with friends where a counter would allow everyone to cheat. Conversely if you want to assign IDs to members of your family you'd rather use a counter than a die for this or you risk collisions and/or long series of retries to pick unique IDs. RFC4122 explains in great length how to produce guaranteed unique IDs, and this only involves space, time and counters. There's indeed a lazy variant that probably everyone uses nowadays, consisting in picking random numbers, but this is not guaranteed unique anymore. If the UUIDs used there are real UUIDs, it could be as simple as updating them according to their format, i.e. updating the timestamp, and if the timestamp is already the same, just increase the seq counter. Doing this doesn't require entropy, doesn't need to block and doesn't needlessly leak randoms that sometimes make people feel nervous. Just my two cents, Willy
On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote: > On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > > concerns some pretty fundamental API stuff related to RNG usage] > > > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > > <acatan@amazon.com> wrote: > > > This patch is a driver which exposes the Virtual Machine Generation ID > > > via a char-dev FS interface that provides ID update sync and async > > > notification, retrieval and confirmation mechanisms: > > > > > > When the device is 'open()'ed a copy of the current vm UUID is > > > associated with the file handle. 'read()' operations block until the > > > associated UUID is no longer up to date - until HW vm gen id changes - > > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > > updates result in 'EPOLLIN' events. > > > > > > Subsequent read()s following a UUID update no longer block, but return > > > the updated UUID. The application needs to acknowledge the UUID update > > > by confirming it through a 'write()'. > > > Only on writing back to the driver the right/latest UUID, will the > > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > > > 'mmap()' support allows mapping a single read-only shared page which > > > will always contain the latest UUID value at offset 0. > > > > It would be nicer if that page just contained an incrementing counter, > > instead of a UUID. It's not like the application cares *what* the UUID > > changed to, just that it *did* change and all RNGs state now needs to > > be reseeded from the kernel, right? And an application can't reliably > > read the entire UUID from the memory mapping anyway, because the VM > > might be forked in the middle. > > > > So I think your kernel driver should detect UUID changes and then turn > > those into a monotonically incrementing counter. (Probably 64 bits > > wide?) (That's probably also a little bit faster than comparing an > > entire UUID.) > > I agree with this. Further, I'm observing there is a very common > confusion between "universally unique" and "random". Randoms are > needed when seeking unpredictability. A random number generator > *must* be able to return the same value multiple times in a row > (though this is rare), otherwise it's not random. [...] > If the UUIDs used there are real UUIDs, it could be as simple as > updating them according to their format, i.e. updating the timestamp, > and if the timestamp is already the same, just increase the seq counter. > Doing this doesn't require entropy, doesn't need to block and doesn't > needlessly leak randoms that sometimes make people feel nervous. Those UUIDs are supplied by existing hypervisor code; in that regard, this is almost like a driver for a hardware device. It is written against a fixed API provided by the underlying machine. Making sure that the sequence of UUIDs, as seen from inside the machine, never changes back to a previous one is the responsibility of the hypervisor and out of scope for this driver. 
Microsoft's spec document (which is a .docx file for reasons I don't understand) actually promises us that it is a cryptographically random 128-bit integer value, which means that if you fork a VM 2^64 times, the probability that any two of those VMs have the same counter is 2^-64. That should be good enough. But in userspace, we just need a simple counter. There's no need for us to worry about anything else, like timestamps or whatever. If we repeatedly fork a paused VM, the forked VMs will see the same counter value, but that's totally fine, because the only thing that matters to userspace is that the counter changes when the VM is forked. And actually, since the value is a cryptographically random 128-bit value, I think that we should definitely use it to help reseed the kernel's RNG, and keep it secret from userspace. That way, even if the VM image is public, we can ensure that going forward, the kernel RNG will return securely random data.
On 16 Oct 2020, at 21:02, Jann Horn wrote: > On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote: > But in userspace, we just need a simple counter. There's no need for > us to worry about anything else, like timestamps or whatever. If we > repeatedly fork a paused VM, the forked VMs will see the same counter > value, but that's totally fine, because the only thing that matters to > userspace is that the counter changes when the VM is forked. For user-space, even a single bit would do. We added MADV_WIPEONFORK so that userspace libraries can detect fork()/clone() robustly, for the same reasons. It just wipes a page as the indicator, which is effectively a single-bit signal, and it works well. On the user-space side of this, I’m keen to find a solution like that that we can use fairly easily inside of portable libraries and applications. The “have I forked” checks do end up in hot paths, so it’s nice if they can be CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my favorite. > And actually, since the value is a cryptographically random 128-bit > value, I think that we should definitely use it to help reseed the > kernel's RNG, and keep it secret from userspace. That way, even if the > VM image is public, we can ensure that going forward, the kernel RNG > will return securely random data. If the image is public, you need some extra new raw entropy from somewhere. The gen-id could be mixed in, that can’t do any harm as long as rigorous cryptographic mixing with the prior state is used, but if that’s all you do then the final state is still deterministic and non-secret. The kernel would need to use the change as a trigger to measure some entropy (e.g. interrupts and RDRAND, or whatever). Or just define the machine contract as “this has to be unique random data and if it’s not unique, or if it’s public, you’re toast”. - Colm
On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh <colmmacc@amazon.com> wrote: > On 16 Oct 2020, at 21:02, Jann Horn wrote: > > On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau <w@1wt.eu> wrote: > > But in userspace, we just need a simple counter. There's no need for > > us to worry about anything else, like timestamps or whatever. If we > > repeatedly fork a paused VM, the forked VMs will see the same counter > > value, but that's totally fine, because the only thing that matters to > > userspace is that the counter changes when the VM is forked. > > For user-space, even a single bit would do. We added MADVISE_WIPEONFORK > so that userspace libraries can detect fork()/clone() robustly, for the > same reasons. It just wipes a page as the indicator, which is > effectively a single-bit signal, and it works well. On the user-space > side of this, I’m keen to find a solution like that that we can use > fairly easily inside of portable libraries and applications. The “have > I forked” checks do end up in hot paths, so it’s nice if they can be > CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my > favorite. I'm pretty sure a single bit is not enough if you want to have a single page, shared across the entire system, that stores the VM forking state; you need a counter for that. > > And actually, since the value is a cryptographically random 128-bit > > value, I think that we should definitely use it to help reseed the > > kernel's RNG, and keep it secret from userspace. That way, even if the > > VM image is public, we can ensure that going forward, the kernel RNG > > will return securely random data. > > If the image is public, you need some extra new raw entropy from > somewhere. The gen-id could be mixed in, that can’t do any harm as > long as rigorous cryptographic mixing with the prior state is used, but > if that’s all you do then the final state is still deterministic and > non-secret. Microsoft's documentation (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM Generation ID that we get after a fork "is a 128-bit, cryptographically random integer value". If multiple people use the same image, it guarantees that each use of the image gets its own, fresh ID: The table in section "How to implement virtual machine generation ID support in a virtualization platform" says that (among other things) "Virtual machine is imported, copied, or cloned" generates a new generation ID. So the RNG state after mixing in the new VM Generation ID would contain 128 bits of secret entropy not known to anyone else, including people with access to the VM image. Now, 128 bits of cryptographically random data aren't _optimal_; I think something on the order of 256 bits would be nicer from a theoretical standpoint. But in practice I think we'll be good with the 128 bits we're getting (since the number of users who fork a VM image is probably not going to be so large that worst-case collision probabilities matter). > The kernel would need to use the change as a trigger to > measure some entropy (e.g. interrupts and RDRAND, or whatever). Our just > define the machine contract as “this has to be unique random data and > if it’s not unique, or if it’s pubic, you’re toast”. As far as I can tell from Microsoft's spec, that is a guarantee we're already getting.
On 16 Oct 2020, at 22:01, Jann Horn wrote: > > On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh > <colmmacc@amazon.com> wrote: >> For user-space, even a single bit would do. We added >> MADVISE_WIPEONFORK >> so that userspace libraries can detect fork()/clone() robustly, for >> the >> same reasons. It just wipes a page as the indicator, which is >> effectively a single-bit signal, and it works well. On the user-space >> side of this, I’m keen to find a solution like that that we can use >> fairly easily inside of portable libraries and applications. The >> “have >> I forked” checks do end up in hot paths, so it’s nice if they can >> be >> CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my >> favorite. > > I'm pretty sure a single bit is not enough if you want to have a > single page, shared across the entire system, that stores the VM > forking state; you need a counter for that. You’re right. WIPEONFORK is more like a single bit per use. If it’s something system-wide, then a counter is better. > So the RNG state after mixing in the new VM Generation ID would > contain 128 bits of secret entropy not known to anyone else, including > people with access to the VM image. > > Now, 128 bits of cryptographically random data aren't _optimal_; I > think something on the order of 256 bits would be nicer from a > theoretical standpoint. But in practice I think we'll be good with the > 128 bits we're getting (since the number of users who fork a VM image > is probably not going to be so large that worst-case collision > probabilities matter). This reminds me of key/IV usage limits for AES encryption, where the same birthday bounds apply, and even though 256 bits would be better, we routinely make 128-bit birthday bounds work for massively scalable systems. >> The kernel would need to use the change as a trigger to >> measure some entropy (e.g. interrupts and RDRAND, or whatever). Our >> just >> define the machine contract as “this has to be unique random data >> and >> if it’s not unique, or if it’s pubic, you’re toast”. > > As far as I can tell from Microsoft's spec, that is a guarantee we're > already getting. Neat. - Colm
On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > Microsoft's documentation > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > Generation ID that we get after a fork "is a 128-bit, > cryptographically random integer value". If multiple people use the > same image, it guarantees that each use of the image gets its own, > fresh ID: No. It cannot be more unique than the source that feeds that cryptographic transformation. All it guarantees is that the entropy source is protected from being guessed based on the output. Applying cryptography on a simple counter provides apparently random numbers that will be unique for a long period for the same source, but as soon as you duplicate that code between users and they start from the same counter they'll get the same IDs. This is why I think that using a counter is better if you really need something unique. Randoms only reduce predictability, which helps avoid collisions. And I'm saying this as someone who had on his external gateway the same SSH host key as 89 other hosts on the net, each of them using randoms to provide a universally unique one... Willy
On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote: > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > > Microsoft's documentation > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > > Generation ID that we get after a fork "is a 128-bit, > > cryptographically random integer value". If multiple people use the > > same image, it guarantees that each use of the image gets its own, > > fresh ID: > > No. It cannot be more unique than the source that feeds that cryptographic > transformation. All it guarantees is that the entropy source is protected > from being guessed based on the output. Applying cryptography on a simple > counter provides apparently random numbers that will be unique for a long > period for the same source, but as soon as you duplicate that code between > users and they start from the same counter they'll get the same IDs. > > This is why I think that using a counter is better if you really need something > unique. Randoms only reduce predictability which helps avoiding collisions. Microsoft's spec tells us that they're giving us cryptographically random numbers. Where they're getting those from is not our problem. (And if even the hypervisor is not able to collect enough entropy to securely generate random numbers, worrying about RNG reseeding in the guest would be kinda pointless, we'd be fairly screwed anyway.) Also note that we don't actually need to *always* reinitialize RNG state on forks for functional correctness; it is fine if that fails with a probability of 2^-128, because functionally everything will be fine, and an attacker who is that lucky could also just guess an AES key (which has the same probability of being successful). (And also 2^-128 is such a tiny number that it doesn't matter anyway.) > And I'm saying this as someone who had on his external gateway the same SSH > host key as 89 other hosts on the net, each of them using randoms to provide > a universally unique one... If your SSH host key was shared with 89 other hosts, it evidently wasn't generated from cryptographically random numbers. :P Either because the key generator was not properly hooked up to the system's entropy pool (if you're talking about the Debian fiasco), or because the system simply did not have enough entropy available. (Or because the key generator is broken, but I don't think that ever happened with OpenSSH?)
On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote: > On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote: > > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > > > Microsoft's documentation > > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > > > Generation ID that we get after a fork "is a 128-bit, > > > cryptographically random integer value". If multiple people use the > > > same image, it guarantees that each use of the image gets its own, > > > fresh ID: > > > > No. It cannot be more unique than the source that feeds that cryptographic > > transformation. All it guarantees is that the entropy source is protected > > from being guessed based on the output. Applying cryptography on a simple > > counter provides apparently random numbers that will be unique for a long > > period for the same source, but as soon as you duplicate that code between > > users and they start from the same counter they'll get the same IDs. > > > > This is why I think that using a counter is better if you really need something > > unique. Randoms only reduce predictability which helps avoiding collisions. > > Microsoft's spec tells us that they're giving us cryptographically > random numbers. Where they're getting those from is not our problem. > (And if even the hypervisor is not able to collect enough entropy to > securely generate random numbers, worrying about RNG reseeding in the > guest would be kinda pointless, we'd be fairly screwed anyway.) Sorry if I sound annoying, but it's a matter of terminology and needs. Cryptographically random means safe for use with cryptography in that it is unguessable enough so that you can use it for encryption keys that nobody will be able to guess. It in no way guarantees uniqueness, just like you don't really care if the symmetric crypto key of your VPN has already been used once somewhere else as long as there's no way to know. However with the good enough distribution that a CSPRNG provides, collisions within the *same* generator are bound to a very low, predictable rate which is generally considered acceptable for all use cases. Something random (cryptographically or not) *cannot* be unique by definition, otherwise it's not random anymore, since each draw has an influence on the remaining list of possible draws, which is contrary to randomness. And conversely something unique cannot be completely random because if you know it's unique, you can already rule out all other known values from the candidates, thus it's more predictable than random. With this in mind, picking randoms from the same RNG is often sufficient to consider them highly likely unique within a long period. But it's not a guarantee. And it's even less one between two RNGs (e.g. if uniqueness is required between multiple hypervisors in case VMs are migrated or centrally managed, which I don't know). If what is sought here is a strong guarantee of uniqueness, using a counter as you first suggested is better. If what is sought is pure randomness (in the sense that it's unpredictable, which I don't think is needed here), then randoms are better. If both are required, just concatenate a counter and a random. And if you need them to be spatially unique, just include a node identifier. Now the initial needs in the forwarded message are not entirely clear to me, but I wanted to rule out the apparent mismatch between the expressed needs for uniqueness and the proposed solutions solely based on randomness. Cheers, Willy
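As a small illustration of the composition Willy describes (a counter for uniqueness within one generator, a node identifier for spatial uniqueness, random bits for unpredictability), a sketch could look like the following. The field widths, the struct name and the use of getrandom() are arbitrary choices for illustration, not anything proposed in this thread.

#include <stdint.h>
#include <sys/random.h>

/* Unique and unpredictable: the counter never repeats on this node (a real
 * implementation would persist it across restarts), the node id
 * distinguishes generators, and the random part makes the whole value
 * unguessable. */
struct unique_id {
    uint64_t counter;   /* strictly increasing on this node */
    uint64_t node_id;   /* identifies the generator/host */
    uint8_t  rnd[16];   /* unpredictable part */
};

static uint64_t next_counter;

int make_unique_id(uint64_t node_id, struct unique_id *out)
{
    out->counter = ++next_counter;
    out->node_id = node_id;
    if (getrandom(out->rnd, sizeof(out->rnd), 0) != sizeof(out->rnd))
        return -1;
    return 0;
}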
On Sat, Oct 17, 2020 at 8:44 AM Willy Tarreau <w@1wt.eu> wrote: > On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote: > > On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau <w@1wt.eu> wrote: > > > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > > > > Microsoft's documentation > > > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > > > > Generation ID that we get after a fork "is a 128-bit, > > > > cryptographically random integer value". If multiple people use the > > > > same image, it guarantees that each use of the image gets its own, > > > > fresh ID: > > > > > > No. It cannot be more unique than the source that feeds that cryptographic > > > transformation. All it guarantees is that the entropy source is protected > > > from being guessed based on the output. Applying cryptography on a simple > > > counter provides apparently random numbers that will be unique for a long > > > period for the same source, but as soon as you duplicate that code between > > > users and they start from the same counter they'll get the same IDs. > > > > > > This is why I think that using a counter is better if you really need something > > > unique. Randoms only reduce predictability which helps avoiding collisions. > > > > Microsoft's spec tells us that they're giving us cryptographically > > random numbers. Where they're getting those from is not our problem. > > (And if even the hypervisor is not able to collect enough entropy to > > securely generate random numbers, worrying about RNG reseeding in the > > guest would be kinda pointless, we'd be fairly screwed anyway.) > > Sorry if I sound annoying, but it's a matter of terminology and needs. > > Cryptograhically random means safe for use with cryptography in that it > is unguessable enough so that you can use it for encryption keys that > nobody will be able to guess. It in no ways guarantees uniqueness, just > like you don't really care if the symmetric crypto key of you VPN has > already been used once somewhere else as long as there's no way to know. > However with the good enough distribution that a CSPRNG provides, > collisions within a *same* generator are bound to a very low, predictable > rate which is by generally considered as acceptable for all use cases. Yes. > Something random (cryptographically or not) *cannot* be unique by > definition, otherwise it's not random anymore, since each draw has an > influence on the remaining list of possible draws, which is contrary to > randomness. And conversely something unique cannot be completely random > because if you know it's unique, you can already rule out all other known > values from the candidates, thus it's more predictable than random. Yes. > With this in mind, picking randoms from a same RNG is often highly > sufficient to consider they're highly likely unique within a long > period. But it's not a guarantee. And it's even less one between two > RNGs (e.g. if uniqueness is required between multiple hypervisors in > case VMs are migrated or centrally managed, which I don't know). > > If what is sought here is a strong guarantee of uniqueness, using a > counter as you first suggested is better. My suggestion is to use a counter *in the UAPI*, not in the hypervisor protocol. (And as long as that counter can only miss increments in a cryptographically negligible fraction of cases, everything's fine.) > If what is sought is pure > randomness (in the sense that it's unpredictable, which I don't think > is needed here), then randoms are better. 
And this is what *the hypervisor protocol* gives us (which could be very useful for reseeding the kernel RNG). > If both are required, just > concatenate a counter and a random. And if you need them to be spatially > unique, just include a node identifier. > > Now the initial needs in the forwarded message are not entirely clear > to me but I wanted to rule out the apparent mismatch between the expressed > needs for uniqueness and the proposed solutions solely based on randomness. Sure, from a theoretical standpoint, it would be a little bit nicer if the hypervisor protocol included a generation number along with the 128-bit random value. But AFAIU it doesn't, so if we want this to just work under Microsoft's existing hypervisor, we'll have to make do with checking whether the random value changed. :P
On Sat, Oct 17, 2020 at 08:55:34AM +0200, Jann Horn wrote: > My suggestion is to use a counter *in the UAPI*, not in the hypervisor > protocol. (And as long as that counter can only miss increments in a > cryptographically negligible fraction of cases, everything's fine.) OK I got it now and I agree. > > If what is sought is pure > > randomness (in the sense that it's unpredictable, which I don't think > > is needed here), then randoms are better. > > And this is what *the hypervisor protocol* gives us (which could be > very useful for reseeding the kernel RNG). As an external source, yes very likely, as long as it's not trivially observable by everyone under the same hypervisor :-) > > Now the initial needs in the forwarded message are not entirely clear > > to me but I wanted to rule out the apparent mismatch between the expressed > > needs for uniqueness and the proposed solutions solely based on randomness. > > Sure, from a theoretical standpoint, it would be a little bit nicer if > the hypervisor protocol included a generation number along with the > 128-bit random value. But AFAIU it doesn't, so if we want this to just > work under Microsoft's existing hypervisor, we'll have to make do with > checking whether the random value changed. :P OK got it, thanks for the explanation! Willy
After discussing this offline with Jann a bit, I have a few general comments on the design of this. First, the UUID communicated by the hypervisor should be consumed by the kernel -- added as another input to the rng -- and then userspace should be notified that it should reseed any userspace RNGs that it may have, without actually communicating that UUID to userspace. IOW, I agree with Jann there. Then, it's the functioning of this notification mechanism to userspace that is interesting to me. There are a few design goals of notifying userspace: it should be fast, because people who are using userspace RNGs are usually doing so in the first place to completely avoid syscall overhead for whatever high performance application they have - e.g. I recall conversations with Colm about his TLS implementation needing to make random IVs _really_ fast. It should also happen as early as possible, with no race, or as minimal a race window as possible, so that userspace doesn't begin using old randomness and then switch over after the damage is already done. I'm also not wedded to using Microsoft's proprietary hypervisor design for this. If we come up with a better interface, I don't think it's asking too much to implement that and reasonably expect for Microsoft to catch up. Maybe someone here will find that controversial, but whatever -- discussing ideal designs does not seem out of place or inappropriate for how we usually approach things in the kernel, and a closed source hypervisor coming along shouldn't disrupt that. So, anyway, here are a few options with some pros and cons for the kernel notifying userspace that its RNG should reseed. 1. SIGRND - a new signal. Lol. 2. Userspace opens a file descriptor that it can epoll on. Pros are that many notification mechanisms already use this. Cons are that this requires a syscall and might be more racy than we want. Another con is that this is a new thing for userspace programs to do. 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are that this is extremely fast, and also simple to use and implement. There are enough sequence points in typical crypto programs that checking to see whether this counter has changed before doing whatever operation seems easy enough. Cons are that typically we've been conservative about adding things to the vDSO, and this is also a new thing for userspace programs to do. 4. We already have a mechanism for this kind of thing, because the same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, where userspace marks a page to be zeroed when forking, for the purposes of the RNG being notified when its world gets split in two. This is basically the same thing as we're discussing here with guest snapshots, except it's on the system level rather than the process level, and a system has many processes. But the problem space is still almost the same, and we could simply reuse that same mechanism. There are a few implementation strategies for that: 4a. We mess with the PTEs of all processes' pages that are MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. Cons might be that this usually requires semaphores, and we're in irq context, so we'd have to hoist to a workqueue, which means either more wake up latency, or a larger race window. 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. 
4c. The guest kernel maintains an array of physical addresses that are MADV_WIPEONFORK. The hypervisor knows about this array and its location through whatever protocol, and before resuming a moved/snapshotted/duplicated VM, it takes the responsibility for memzeroing this memory. The huge pro here would be that this eliminates all races, and reduces complexity quite a bit, because the hypervisor can perfectly synchronize its bringup (and SMP bringup) with this, and it can even optimize things like on-disk memory snapshots to simply not write out those pages to disk. A 4c-like approach seems like it'd be a lot of bang for the buck -- we reuse the existing mechanism (MADV_WIPEONFORK), so there's no new userspace API to deal with, and it'd be race free, and eliminate a lot of kernel complexity. But 4b and 3 don't seem too bad either. Any thoughts on 4c? Is that utterly insane, or does that actually get us somewhere close to what we want? Jason
> After discussing this offline with Jann a bit, I have a few general > comments on the design of this. > > First, the UUID communicated by the hypervisor should be consumed by > the kernel -- added as another input to the rng -- and then userspace > should be notified that it should reseed any userspace RNGs that it > may have, without actually communicating that UUID to userspace. IOW, > I agree with Jann there. Then, it's the functioning of this > notification mechanism to userspace that is interesting to me. Agreed! The UUID/vmgenid is the glue to the hypervisor to be able to find out about VM snapshots/forks. The really interesting (and important) topic here is finding the right notification mechanism to userspace. ...In retrospect, I should have posted this as RFC instead of PATCH. > So, anyway, here are a few options with some pros and cons for the > kernel notifying userspace that its RNG should reseed. > > 1. SIGRND - a new signal. Lol. > > 2. Userspace opens a file descriptor that it can epoll on. Pros are > that many notification mechanisms already use this. Cons is that this > requires syscall and might be more racy than we want. Another con is > that this a new thing for userspace programs to do. > > 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are > that this is extremely fast, and also simple to use and implement. > There are enough sequence points in typical crypto programs that > checking to see whether this counter has changed before doing whatever > operation seems easy enough. Cons are that typically we've been > conservative about adding things to the vDSO, and this is also a new > thing for userspace programs to do. For each of options 1, 2 and 3, userspace programs _have to do something new_ anyway, so I wouldn't weigh that as a con. An atomic counter in the vDSO looks like the most bang for the buck to me. I'm really curious to hear more opinions on why we shouldn't do it. > 4. We already have a mechanism for this kind of thing, because the > same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, > where userspace marks a page to be zeroed when forking, for the > purposes of the RNG being notified when its world gets split in two. > This is basically the same thing as we're discussing here with guest > snapshots, except it's on the system level rather than the process > level, and a system has many processes. But the problem space is still > almost the same, and we could simply reuse that same mechanism. There > are a few implementation strategies for that: I don't think we can piggyback on MADV_WIPEONFORK. That madvise flag has a clear contract of only wiping _on fork_. Overloading it with wiping on VM-fork - while the process doesn't fork - might break some of its users. > 4a. We mess with the PTEs of all processes' pages that are > MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us > to do so. Then we wind up reusing the already existing logic for > userspace RNGs. Cons might be that this usually requires semaphores, > and we're in irq context, so we'd have to hoist to a workqueue, which > means either more wake up latency, or a larger race window. > > 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, > when the hypervisor notifies us to do so. Then we wind up reusing the > already existing logic for userspace RNGs. > > 4c. The guest kernel maintains an array of physical addresses that are > MADV_WIPEONFORK. 
> The hypervisor knows about this array and its > location through whatever protocol, and before resuming a > moved/snapshotted/duplicated VM, it takes the responsibility for > memzeroing this memory. The huge pro here would be that this > eliminates all races, and reduces complexity quite a bit, because the > hypervisor can perfectly synchronize its bringup (and SMP bringup) > with this, and it can even optimize things like on-disk memory > snapshots to simply not write out those pages to disk. I've previously proposed a similar path (in concept at least) with a combination of 4a, 4b and 4c - https://lwn.net/ml/linux-mm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/ - without reusing MADV_WIPEONFORK, but by adding a dedicated MADV_WIPEONSUSPEND. That proposal was clunky, however, with many people raising concerns that the interface was too subtle and hard to work with. A vmgenid driver offering a clean FS interface seemed cleaner, although, like some of you observed, it still allows a window of time between the actual VM fork and userspace handling of the event. One other direction that I would like to explore, and which I feel is similar to your 4c proposal, is to do something like: "mm: extend memfd with ability to create 'secret' memory" https://patchwork.kernel.org/project/linux-mm/patch/20200130162340.GA14232@rapoport-lnx/ Maybe we can combine ideas from the two patches into something like: instead of libraries using anonymous memory with MADV_WIPEONSUSPEND, they can use a secret_mem_fd with a (secretmem-specific) WIPEONSUSPEND flag; then instead of crawling PTEs in the core PM code looking for MADV_WIPEONSUSPEND mappings, we can register this secretmem device to wipe its own secrets during a pm callback. From a crypto-safety point of view, the ideal solution would be one where wiping happens before or during VM forks, not after. Having said that, I think a vDSO (updated by the guest kernel _after_ VM fork) still closes that gap enough to be practically safe. Thanks, Adrian.
Hi Jason, On 17.10.20 15:24, Jason A. Donenfeld wrote: > > After discussing this offline with Jann a bit, I have a few general > comments on the design of this. > > First, the UUID communicated by the hypervisor should be consumed by > the kernel -- added as another input to the rng -- and then userspace We definitely want a kernel internal notifier as well, yes :). > should be notified that it should reseed any userspace RNGs that it > may have, without actually communicating that UUID to userspace. IOW, I also tend to agree that it makes sense to disconnect the actual UUID we receive from the notification to user space. This would allow us to create a generic mechanism for VM save/restore cycles across different hypervisors. Let me add PPC and s390x people to the CC list to see whether they have anything remotely similar to the VmGenID mechanism. For x86 and aarch64, the ACPI and memory based VmGenID implemented here is the most obvious option to implement IMHO. It's also already implemented in all major hypervisors. > I agree with Jann there. Then, it's the functioning of this > notification mechanism to userspace that is interesting to me. Absolutely! Please have a look at the previous discussion here: https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/ The user space interface is absolutely what this is about. > There are a few design goals of notifying userspace: it should be > fast, because people who are using userspace RNGs are usually doing so > in the first place to completely avoid syscall overhead for whatever > high performance application they have - e.g. I recall conversations > with Colm about his TLS implementation needing to make random IVs > _really_ fast. It should also happen as early as possible, with no > race or as minimal as possible race window, so that userspace doesn't > begin using old randomness and then switch over after the damage is > already done. There are multiple facets and different types of consumers here. For a user space RNG, I agree that fast and as race free as possible is key. That's what the mmap interface is there for. There are applications way beyond that though. What do you do with applications that already consumed randomness? For example a cached pool of SSL keys. Or a higher level language primitive that consumes randomness and caches its seed somewhere in an internal data structure. Or even worse: your system's host ssh key. For those types of events, an mmap (or vDSO) interface does not work. We need to actively allow user space applications to readjust to the new environment - either internally (the language primitive case) or through a system event, maybe even as systemd trigger (the ssh host key case). To give everyone enough time before we consider a system as "updated to the new environment", we have the callback logic with the "Orchestrator" that can check whether all listeners to system wide updates confirms they adjusted themselves. That's what the rest of the logic is there for: A read+poll interface and all of the orchestration logic. It's not for the user space RNG case, it's for all of its downstream users. > I'm also not wedded to using Microsoft's proprietary hypervisor design > for this. If we come up with a better interface, I don't think it's > asking too much to implement that and reasonably expect for Microsoft > to catch up. 
Maybe someone here will find that controversial, but > whatever -- discussing ideal designs does not seem out of place or > inappropriate for how we usually approach things in the kernel, and a > closed source hypervisor coming along shouldn't disrupt that. The main bonus point of this interface is that Hyper-V, VMware and QEMU implement it already. It would be a very natural fit into the ecosystem. I agree though that we shouldn't have our user space interface necessarily dictated by it: Other hypervisors may implement different ways, such as a simple edge IRQ that gets triggered whenever the VM gets resumed. > So, anyway, here are a few options with some pros and cons for the > kernel notifying userspace that its RNG should reseed. I can only stress again that we should not be laser-focused on the RNG case. In a lot of cases, data has already been generated by the RNG before the snapshot and needs to be reinitialized after the snapshot. In other cases, such as system UUIDs, it's completely orthogonal to the RNG. > > 1. SIGRND - a new signal. Lol. Doable, but a lot of plumbing in user space. It's also not necessarily a good fit for event notification in most user space applications. > > 2. Userspace opens a file descriptor that it can epoll on. Pros are > that many notification mechanisms already use this. Cons is that this > requires syscall and might be more racy than we want. Another con is > that this a new thing for userspace programs to do. That's part of what this patch does, right? This patch implements read+poll as well as mmap() for high-speed reads. > 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are > that this is extremely fast, and also simple to use and implement. > There are enough sequence points in typical crypto programs that > checking to see whether this counter has changed before doing whatever > operation seems easy enough. Cons are that typically we've been > conservative about adding things to the vDSO, and this is also a new > thing for userspace programs to do. The big con is that its use is going to be super limited to applications that can be adapted to check their "vm generation" through a vDSO call / read every time they consume data that may potentially need to be regenerated. This probably works for the pure RNG case. It falls apart for more sophisticated things such as "redo my ssh host keys and restart the service" or "regenerate my samba machine uuid". > 4. We already have a mechanism for this kind of thing, because the > same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, > where userspace marks a page to be zeroed when forking, for the > purposes of the RNG being notified when its world gets split in two. > This is basically the same thing as we're discussing here with guest > snapshots, except it's on the system level rather than the process > level, and a system has many processes. But the problem space is still > almost the same, and we could simply reuse that same mechanism. There > are a few implementation strategies for that: Yup, that's where we started from :). And then we ran into resistance from the mm people (on CC here). And then we looked at the problem more in depth and checked what it would take to, for example, implement this for user-space RNGs in Java. It's ... more complicated than one may think at first. > 4a. We mess with the PTEs of all processes' pages that are > MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us > to do so. 
> Then we wind up reusing the already existing logic for > userspace RNGs. Cons might be that this usually requires semaphores, > and we're in irq context, so we'd have to hoist to a workqueue, which > means either more wake up latency, or a larger race window. > > 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, > when the hypervisor notifies us to do so. Then we wind up reusing the > already existing logic for userspace RNGs. > > 4c. The guest kernel maintains an array of physical addresses that are > MADV_WIPEONFORK. The hypervisor knows about this array and its > location through whatever protocol, and before resuming a > moved/snapshotted/duplicated VM, it takes the responsibility for > memzeroing this memory. The huge pro here would be that this > eliminates all races, and reduces complexity quite a bit, because the > hypervisor can perfectly synchronize its bringup (and SMP bringup) > with this, and it can even optimize things like on-disk memory > snapshots to simply not write out those pages to disk. > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > userspace API to deal with, and it'd be race free, and eliminate a lot > of kernel complexity. > > But 4b and 3 don't seem too bad either. > > Any thoughts on 4c? Is that utterly insane, or does that actually get > us somewhere close to what we want? All of the options for "4" are possible and have an RFC out. Please check out the discussion linked above :). The problem with anything that relies on closed-loop reads (options 3+4) is that it is not going to work well with the more sophisticated use case of derived data. IMHO it will boil down to "both". We will need a high-speed interface that, with close-to-0 overhead, tells you the generation ID or clears pages (options 3+4), as well as something bigger for applications that, either intrinsically (sshd) or by system design (Java), cannot easily adopt the mechanisms above. That said, we need to start somewhere. I don't mind which angle we start from. But this is a real-world problem and one that will only become more prevalent over time as VMs are used for more than only your traditional enterprise hardware consolidation. Alex
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > concerns some pretty fundamental API stuff related to RNG usage] > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > <acatan@amazon.com> wrote: > > - Background > > > > The VM Generation ID is a feature defined by Microsoft (paper: > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > > multiple hypervisor vendors. > > > > The feature is required in virtualized environments by apps that work > > with local copies/caches of world-unique data such as random values, > > uuids, monotonically increasing counters, etc. > > Such apps can be negatively affected by VM snapshotting when the VM > > is either cloned or returned to an earlier point in time. > > > > The VM Generation ID is a simple concept meant to alleviate the issue > > by providing a unique ID that changes each time the VM is restored > > from a snapshot. The hw provided UUID value can be used to > > differentiate between VMs or different generations of the same VM. > > > > - Problem > > > > The VM Generation ID is exposed through an ACPI device by multiple > > hypervisor vendors but neither the vendors or upstream Linux have no > > default driver for it leaving users to fend for themselves. > > > > Furthermore, simply finding out about a VM generation change is only > > the starting point of a process to renew internal states of possibly > > multiple applications across the system. This process could benefit > > from a driver that provides an interface through which orchestration > > can be easily done. > > > > - Solution > > > > This patch is a driver which exposes the Virtual Machine Generation ID > > via a char-dev FS interface that provides ID update sync and async > > notification, retrieval and confirmation mechanisms: > > > > When the device is 'open()'ed a copy of the current vm UUID is > > associated with the file handle. 'read()' operations block until the > > associated UUID is no longer up to date - until HW vm gen id changes - > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > updates result in 'EPOLLIN' events. > > > > Subsequent read()s following a UUID update no longer block, but return > > the updated UUID. The application needs to acknowledge the UUID update > > by confirming it through a 'write()'. > > Only on writing back to the driver the right/latest UUID, will the > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > 'mmap()' support allows mapping a single read-only shared page which > > will always contain the latest UUID value at offset 0. > > It would be nicer if that page just contained an incrementing counter, > instead of a UUID. It's not like the application cares *what* the UUID > changed to, just that it *did* change and all RNGs state now needs to > be reseeded from the kernel, right? And an application can't reliably > read the entire UUID from the memory mapping anyway, because the VM > might be forked in the middle. > > So I think your kernel driver should detect UUID changes and then turn > those into a monotonically incrementing counter. (Probably 64 bits > wide?) (That's probably also a little bit faster than comparing an > entire UUID.) 
> > An option might be to put that counter into the vDSO, instead of a > separate VMA; but I don't know how the other folks feel about that. > Andy, do you have opinions on this? That way, normal userspace code > that uses this infrastructure wouldn't have to mess around with a > special device at all. And it'd be usable in seccomp sandboxes and so > on without needing special plumbing. And libraries wouldn't have to > call open() and mess with file descriptor numbers. The vDSO might be annoyingly slow for this. Something like the rseq page might make sense. It could be a generic indication of "system went through some form of suspend".
On Sat, Oct 17, 2020 at 8:09 PM Alexander Graf <graf@amazon.de> wrote: > There are applications way beyond that though. What do you do with > applications that already consumed randomness? For example a cached pool > of SSL keys. Or a higher level language primitive that consumes > randomness and caches its seed somewhere in an internal data structure. For deterministic protection, those would also have to poll some memory location that tells them whether the VmGenID changed: 1. between reading entropy from their RNG pool and using it 2. between collecting data from external sources (user input, clock, ...) and encrypting it and synchronously shoot down the connection if a change happened. If e.g. an application inside the VM has an AES-GCM-encrypted TLS connection and, directly after the VM is restored, triggers an application-level timeout that sends some fixed message across the connection, then the TLS library must guarantee that either the VM was already committed to sending exactly that message before the VM was forked or the message will be blocked. If we don't do that, an attacker who captures both a single packet from the forked VM and traffic from the old VM can decrypt the next message from the old VM after the fork (because AES-GCM is like AES-CTR plus an authenticator, and CTR means that when keystream reuse occurs and one of the plaintexts is known, the attacker can simply recover the other plaintext using XOR). (Or maybe, in disaster failover environments, TLS 1.3 servers could get away with rekeying the connection instead of shooting it down? Ask your resident friendly cryptographer whether that would be secure, I am not one.) I don't think a mechanism based around asynchronously telling the application and waiting for it to confirm the rotation at a later point is going to cut it; we should have some hard semantics on when an application needs to poll this value. > Or even worse: your system's host ssh key. Mmmh... I think I normally would not want a VM to reset its host ssh key after merely restoring a snapshot though? And more importantly, Microsoft's docs say that they also change the VmGenID on disaster failover. I think you very much wouldn't want your server to lose its host key every time disaster failover happens. On the other hand, after importing a public VM image, it might be a good idea. I guess you could push that responsibility on the user, by adding an option to the sshd_config that tells OpenSSH whether the host key should be rotated on an ID change or not... but that still would not be particularly pretty. Ideally we would have the host tell us what type of events happened to the VM, or something like that... or maybe even get the host VM management software to ask the user whether they're importing a public image... I really feel like with Microsoft's current protocol, we don't get enough information to figure out what we should do about private long-term authentication keys.
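The check-around-use discipline Jann describes might look roughly like the sketch below. vm_generation(), next_gcm_iv() and encrypt_record() are placeholders for whatever cheap generation read and crypto code a library actually has, and, as Mathieu notes later in the thread, a plain userspace check like this only narrows the race window rather than eliminating it.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders: a cheap read of the system-wide generation, the caller's
 * per-record nonce scheme, and its AEAD code. */
extern uint64_t vm_generation(void);
extern void next_gcm_iv(uint8_t iv[12]);
extern int encrypt_record(const uint8_t iv[12], const uint8_t *msg,
                          size_t len, uint8_t *out);

int send_record_safely(const uint8_t *msg, size_t len, uint8_t *out)
{
    uint8_t iv[12];
    uint64_t gen_before = vm_generation();

    next_gcm_iv(iv);                 /* pick the per-record IV */
    if (encrypt_record(iv, msg, len, out))
        return -1;

    /* If the VM was forked in between, a clone may send ciphertext under
     * the same key and IV; refuse to transmit and let the caller tear the
     * session down rather than risk nonce reuse. */
    if (vm_generation() != gen_before)
        return -ECONNABORTED;
    return 0;
}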
On 17 Oct 2020, at 6:24, Jason A. Donenfeld wrote: > There are a few design goals of notifying userspace: it should be > fast, because people who are using userspace RNGs are usually doing so > in the first place to completely avoid syscall overhead for whatever > high performance application they have - e.g. I recall conversations > with Colm about his TLS implementation needing to make random IVs > _really_ fast. That’s our old friend TLS1.1 in CBC mode, which needs a random explicit IV for every record sent. Speed is still a reason at the margins in cases like that, but getrandom() is really fast. A stickier problem is that getrandom() is not certified for use with every compliance standard, and those often dictate precise use of some NIST DRBG or NRBG construction. That keeps people proliferating user-space RNGs even when speed isn’t as important. > It should also happen as early as possible, with no > race or as minimal as possible race window, so that userspace doesn't > begin using old randomness and then switch over after the damage is > already done. +1 to this, and I’d add that anyone making VM snapshots that they plan to restore from multiple times really needs to think this through top to bottom. The system would likely need to be put in to some kind of quiescent state when the snapshot is taken. > So, anyway, here are a few options with some pros and cons for the > kernel notifying userspace that its RNG should reseed. > > 1. SIGRND - a new signal. Lol. > > 2. Userspace opens a file descriptor that it can epoll on. Pros are > that many notification mechanisms already use this. Cons is that this > requires syscall and might be more racy than we want. Another con is > that this a new thing for userspace programs to do. A library like OpenSSL or BoringSSL also has to account for running inside a chroot, which also makes this hard. > Any thoughts on 4c? Is that utterly insane, or does that actually get > us somewhere close to what we want? I still like 4c, and as a user-space crypto-person, and a VM person, they have a lot of appeal. Alex and Adrian’s replies get into some of the sufficiency challenge. But for user-space libraries like the *SSLs, the JVMs, and other runtimes where RNGs show up, it could plug in easily enough. - Colm
On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > 4c. The guest kernel maintains an array of physical addresses that are > MADV_WIPEONFORK. The hypervisor knows about this array and its > location through whatever protocol, and before resuming a > moved/snapshotted/duplicated VM, it takes the responsibility for > memzeroing this memory. The huge pro here would be that this > eliminates all races, and reduces complexity quite a bit, because the > hypervisor can perfectly synchronize its bringup (and SMP bringup) > with this, and it can even optimize things like on-disk memory > snapshots to simply not write out those pages to disk. > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > userspace API to deal with, and it'd be race free, and eliminate a lot > of kernel complexity. Clearly this has a chance to break applications, right? If there's an app that uses this as a non-system-calls way to find out whether there was a fork, it will break when wipe triggers without a fork ... For example, imagine: MADV_WIPEONFORK copy secret data to MADV_DONTFORK fork used to work, with this change it gets 0s instead of the secret data. I am also not sure it's wise to expose each guest process to the hypervisor like this. E.g. each process needs a guest physical address of its own then. This is a finite resource. The mmap interface proposed here is somewhat baroque, but it is certainly simple to implement ...
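The breakage Michael describes can be made concrete with a sketch like the following, in which an application assumes the wipe-on-fork page can only be cleared by an actual fork(); if a VM restore were allowed to wipe it at an arbitrary point, the memcpy() could silently read zeroes. The helper and its arguments are hypothetical.

#include <string.h>
#include <unistd.h>

/* wipeonfork_page was set up with madvise(MADV_WIPEONFORK) and
 * dontfork_page with madvise(MADV_DONTFORK); both set-ups are assumptions
 * of this sketch, mirroring the sequence in the mail above. */
int stash_secret_then_fork(const unsigned char *wipeonfork_page,
                           unsigned char *dontfork_page, size_t secret_len)
{
    /* With today's semantics this copy is safe: the source can only be
     * wiped by an actual fork(), which has not happened yet. If a VM
     * restore wiped it asynchronously, this could copy zeroes instead of
     * the secret. */
    memcpy(dontfork_page, wipeonfork_page, secret_len);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: wipeonfork_page is zeroed here, and dontfork_page is not
         * mapped at all; the parent still holds the secret. */
        _exit(0);
    }
    return pid < 0 ? -1 : 0;
}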
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > 4c. The guest kernel maintains an array of physical addresses that are > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > location through whatever protocol, and before resuming a > > moved/snapshotted/duplicated VM, it takes the responsibility for > > memzeroing this memory. The huge pro here would be that this > > eliminates all races, and reduces complexity quite a bit, because the > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > with this, and it can even optimize things like on-disk memory > > snapshots to simply not write out those pages to disk. > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > userspace API to deal with, and it'd be race free, and eliminate a lot > > of kernel complexity. > > Clearly this has a chance to break applications, right? > If there's an app that uses this as a non-system-calls way > to find out whether there was a fork, it will break > when wipe triggers without a fork ... > For example, imagine: > > MADV_WIPEONFORK > copy secret data to MADV_DONTFORK > fork > > > used to work, with this change it gets 0s instead of the secret data. > > > I am also not sure it's wise to expose each guest process > to the hypervisor like this. E.g. each process needs a > guest physical address of its own then. This is a finite resource. > > > The mmap interface proposed here is somewhat baroque, but it is > certainly simple to implement ... Wipe of fork/vmgenid/whatever could end up being much more problematic than it naively appears -- it could be wiped in the middle of a read. Either the API needs to handle this cleanly, or we need something more aggressive like signal-on-fork. --Andy
On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > 4c. The guest kernel maintains an array of physical addresses that are > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > location through whatever protocol, and before resuming a > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > memzeroing this memory. The huge pro here would be that this > > > eliminates all races, and reduces complexity quite a bit, because the > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > with this, and it can even optimize things like on-disk memory > > > snapshots to simply not write out those pages to disk. > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > of kernel complexity. > > > > Clearly this has a chance to break applications, right? > > If there's an app that uses this as a non-system-calls way > > to find out whether there was a fork, it will break > > when wipe triggers without a fork ... > > For example, imagine: > > > > MADV_WIPEONFORK > > copy secret data to MADV_DONTFORK > > fork > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > I am also not sure it's wise to expose each guest process > > to the hypervisor like this. E.g. each process needs a > > guest physical address of its own then. This is a finite resource. > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > certainly simple to implement ... > > Wipe of fork/vmgenid/whatever could end up being much more problematic > than it naively appears -- it could be wiped in the middle of a read. > Either the API needs to handle this cleanly, or we need something more > aggressive like signal-on-fork. > > --Andy Right, it's not on fork, it's actually when process is snapshotted. If we assume it's CRIU we care about, then I wonder what's wrong with something like MADV_CHANGEONPTRACE_SEIZE and basically say it's X bytes which change the value...
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > > 4c. The guest kernel maintains an array of physical addresses that are > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > > location through whatever protocol, and before resuming a > > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > > memzeroing this memory. The huge pro here would be that this > > > > eliminates all races, and reduces complexity quite a bit, because the > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > > with this, and it can even optimize things like on-disk memory > > > > snapshots to simply not write out those pages to disk. > > > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > > of kernel complexity. > > > > > > Clearly this has a chance to break applications, right? > > > If there's an app that uses this as a non-system-calls way > > > to find out whether there was a fork, it will break > > > when wipe triggers without a fork ... > > > For example, imagine: > > > > > > MADV_WIPEONFORK > > > copy secret data to MADV_DONTFORK > > > fork > > > > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > > > > I am also not sure it's wise to expose each guest process > > > to the hypervisor like this. E.g. each process needs a > > > guest physical address of its own then. This is a finite resource. > > > > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > > certainly simple to implement ... > > > > Wipe of fork/vmgenid/whatever could end up being much more problematic > > than it naively appears -- it could be wiped in the middle of a read. > > Either the API needs to handle this cleanly, or we need something more > > aggressive like signal-on-fork. > > > > --Andy > > > Right, it's not on fork, it's actually when process is snapshotted. > > If we assume it's CRIU we care about, then I > wonder what's wrong with something like > MADV_CHANGEONPTRACE_SEIZE > and basically say it's X bytes which change the value... I feel like we may be approaching this from the wrong end. Rather than saying "what data structure can the kernel expose that might plausibly be useful", how about we try identifying some specific userspace needs and see what a good solution could look like. I can identify two major cryptographic use cases: 1. A userspace RNG. The API exposed by the userspace end is a function that generates random numbers. The userspace code in turn wants to know some things from the kernel: it wants some best-quality-available random seed data from the kernel (and possibly an indication of how good it is) as well as an indication of whether the userspace memory may have been cloned or rolled back, or, failing that, an indication of whether a reseed is needed. Userspace could implement a wide variety of algorithms on top depending on its goals and compliance requirements, but the end goal is for the userspace part to be very, very fast. 2. 
A userspace crypto stack that wants to avoid shooting itself in the foot due to inadvertently doing the same thing twice. For example, an AES-GCM stack does not want to reuse an IV, *especially* if there is even the slightest chance that it might reuse the IV for different data. This use case doesn't necessarily involve random numbers, but, if anything, it needs to be even faster than #1. The threats here are not really the same. For #1, a userspace RNG should be able to recover from a scenario in which an adversary clones the entire process *and gets to own the clone*. For example, in Android, an adversary can often gain complete control of a fork of the zygote -- this shouldn't adversely affect the security properties of other forks. Similarly, a server farm could operate by having one booted server that is cloned to create more workers. Those clones could be provisioned with secrets and permissions post-clone, and an attacker gaining control of a fresh clone could be considered acceptable. For #2, in contrast, if an adversary gains control of a clone of an AES-GCM session, they learn the key outright -- the relevant attack scenario is that the adversary gets to interact with two clones without compromising either clone per se. It's worth noting that, in both cases, there could possibly be more than one instance of an RNG or an AES-GCM session in the same process. This means that using signals is awkward but not necessarily impossible. (This is an area in which Linux, and POSIX in general, is much weaker than Windows.)
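A sketch of use case #1, reseeding a userspace DRBG when the system-wide generation changes, could look like the following. vm_generation() and the drbg_*() functions are placeholders for whatever check and DRBG a library actually uses, and the re-check after generation is only a best-effort narrowing of the race discussed above.

#include <stdint.h>
#include <string.h>
#include <sys/random.h>

/* Placeholders for whichever generation check ends up being exposed and
 * for the caller's DRBG of choice. */
extern uint64_t vm_generation(void);
extern void drbg_reseed(const uint8_t *seed, size_t len);
extern void drbg_generate(uint8_t *out, size_t len);

static uint64_t seen_generation;

int user_rng_bytes(uint8_t *out, size_t len)
{
    uint64_t gen = vm_generation();

    if (gen != seen_generation) {
        uint8_t seed[48];

        if (getrandom(seed, sizeof(seed), 0) != sizeof(seed))
            return -1;   /* caller falls back to the syscall path */
        drbg_reseed(seed, sizeof(seed));
        memset(seed, 0, sizeof(seed));   /* a real library would use explicit_bzero() */
        seen_generation = gen;
    }

    drbg_generate(out, len);

    /* If the VM forked while we were generating, a clone may hold the same
     * output; throw it away and start over from fresh seed material. */
    if (vm_generation() != gen) {
        memset(out, 0, len);
        return user_rng_bytes(out, len);
    }
    return 0;
}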
On Sun, Oct 18, 2020 at 09:14:00AM -0700, Andy Lutomirski wrote: > On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > > > 4c. The guest kernel maintains an array of physical addresses that are > > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > > > location through whatever protocol, and before resuming a > > > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > > > memzeroing this memory. The huge pro here would be that this > > > > > eliminates all races, and reduces complexity quite a bit, because the > > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > > > with this, and it can even optimize things like on-disk memory > > > > > snapshots to simply not write out those pages to disk. > > > > > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > > > of kernel complexity. > > > > > > > > Clearly this has a chance to break applications, right? > > > > If there's an app that uses this as a non-system-calls way > > > > to find out whether there was a fork, it will break > > > > when wipe triggers without a fork ... > > > > For example, imagine: > > > > > > > > MADV_WIPEONFORK > > > > copy secret data to MADV_DONTFORK > > > > fork > > > > > > > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > > > > > > > I am also not sure it's wise to expose each guest process > > > > to the hypervisor like this. E.g. each process needs a > > > > guest physical address of its own then. This is a finite resource. > > > > > > > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > > > certainly simple to implement ... > > > > > > Wipe of fork/vmgenid/whatever could end up being much more problematic > > > than it naively appears -- it could be wiped in the middle of a read. > > > Either the API needs to handle this cleanly, or we need something more > > > aggressive like signal-on-fork. > > > > > > --Andy > > > > > > Right, it's not on fork, it's actually when process is snapshotted. > > > > If we assume it's CRIU we care about, then I > > wonder what's wrong with something like > > MADV_CHANGEONPTRACE_SEIZE > > and basically say it's X bytes which change the value... > > I feel like we may be approaching this from the wrong end. Rather > than saying "what data structure can the kernel expose that might > plausibly be useful", how about we try identifying some specific > userspace needs and see what a good solution could look like. I can > identify two major cryptographic use cases: Well, I'm aware of a non-cryptographic use-case: https://bugzilla.redhat.com/show_bug.cgi?id=1118834 this seems to just ask for the guest to have a way to detect that a VM cloning triggered.
----- On Oct 17, 2020, at 2:10 PM, Andy Lutomirski luto@kernel.org wrote: > On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote: >> >> [adding some more people who are interested in RNG stuff: Andy, Jason, >> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this >> concerns some pretty fundamental API stuff related to RNG usage] >> >> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin >> <acatan@amazon.com> wrote: >> > - Background >> > >> > The VM Generation ID is a feature defined by Microsoft (paper: >> > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by >> > multiple hypervisor vendors. >> > >> > The feature is required in virtualized environments by apps that work >> > with local copies/caches of world-unique data such as random values, >> > uuids, monotonically increasing counters, etc. >> > Such apps can be negatively affected by VM snapshotting when the VM >> > is either cloned or returned to an earlier point in time. >> > >> > The VM Generation ID is a simple concept meant to alleviate the issue >> > by providing a unique ID that changes each time the VM is restored >> > from a snapshot. The hw provided UUID value can be used to >> > differentiate between VMs or different generations of the same VM. >> > >> > - Problem >> > >> > The VM Generation ID is exposed through an ACPI device by multiple >> > hypervisor vendors but neither the vendors or upstream Linux have no >> > default driver for it leaving users to fend for themselves. >> > >> > Furthermore, simply finding out about a VM generation change is only >> > the starting point of a process to renew internal states of possibly >> > multiple applications across the system. This process could benefit >> > from a driver that provides an interface through which orchestration >> > can be easily done. >> > >> > - Solution >> > >> > This patch is a driver which exposes the Virtual Machine Generation ID >> > via a char-dev FS interface that provides ID update sync and async >> > notification, retrieval and confirmation mechanisms: >> > >> > When the device is 'open()'ed a copy of the current vm UUID is >> > associated with the file handle. 'read()' operations block until the >> > associated UUID is no longer up to date - until HW vm gen id changes - >> > at which point the new UUID is provided/returned. Nonblocking 'read()' >> > uses EWOULDBLOCK to signal that there is no _new_ UUID available. >> > >> > 'poll()' is implemented to allow polling for UUID updates. Such >> > updates result in 'EPOLLIN' events. >> > >> > Subsequent read()s following a UUID update no longer block, but return >> > the updated UUID. The application needs to acknowledge the UUID update >> > by confirming it through a 'write()'. >> > Only on writing back to the driver the right/latest UUID, will the >> > driver mark this "watcher" as up to date and remove EPOLLIN status. >> > >> > 'mmap()' support allows mapping a single read-only shared page which >> > will always contain the latest UUID value at offset 0. >> >> It would be nicer if that page just contained an incrementing counter, >> instead of a UUID. It's not like the application cares *what* the UUID >> changed to, just that it *did* change and all RNGs state now needs to >> be reseeded from the kernel, right? And an application can't reliably >> read the entire UUID from the memory mapping anyway, because the VM >> might be forked in the middle. 
>> >> So I think your kernel driver should detect UUID changes and then turn >> those into a monotonically incrementing counter. (Probably 64 bits >> wide?) (That's probably also a little bit faster than comparing an >> entire UUID.) >> >> An option might be to put that counter into the vDSO, instead of a >> separate VMA; but I don't know how the other folks feel about that. >> Andy, do you have opinions on this? That way, normal userspace code >> that uses this infrastructure wouldn't have to mess around with a >> special device at all. And it'd be usable in seccomp sandboxes and so >> on without needing special plumbing. And libraries wouldn't have to >> call open() and mess with file descriptor numbers. > > The vDSO might be annoyingly slow for this. Something like the rseq > page might make sense. It could be a generic indication of "system > went through some form of suspend". This might indeed fit nicely as an extension of my KTLS prototype (extensible rseq): https://lore.kernel.org/lkml/20200925181518.4141-1-mathieu.desnoyers@efficios.com/ There are a few ways we could wire things up. One might be to add the UUID field into the extended KTLS structure (so it's always updated after it changes on next return to user-space). For this I assume that the Linux scheduler within the guest VM always preempts all threads before a VM is suspended (is that indeed true ?). This leads to one important question though: how is the UUID check vs commit operation made atomic with respect to suspend ? Unless we use rseq critical sections in assembly, where the kernel will abort the rseq critical section on preemption, I don't see how we can ensure that the UUID value does not change right after it has been checked, before the "commit" side-effect. And what is the expected "commit" side-effect ? Is it a store to a variable in user-space memory, or is it issuing a system call which sends a packet over the network ? Thanks, Mathieu
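To make the torn-read point concrete: a single 64-bit generation counter can be sampled with one atomic load, which a 16-byte UUID cannot. A sketch, assuming a hypothetical counter word exposed through a shared page or the vDSO/rseq area:

    /* Sketch, assuming a hypothetical mapped 64-bit generation counter. */
    #include <stdatomic.h>
    #include <stdint.h>

    extern _Atomic uint64_t *vm_generation; /* hypothetical: word in a shared page / vDSO / rseq area */

    static uint64_t cached_gen;

    static int generation_changed(void)
    {
        uint64_t g = atomic_load_explicit(vm_generation, memory_order_acquire);

        if (g == cached_gen)
            return 0;
        cached_gen = g; /* caller must reseed before using any cached secrets */
        return 1;
    }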
On 17.10.20 20:09, Alexander Graf wrote: > Hi Jason, > > On 17.10.20 15:24, Jason A. Donenfeld wrote: >> >> After discussing this offline with Jann a bit, I have a few general >> comments on the design of this. >> >> First, the UUID communicated by the hypervisor should be consumed by >> the kernel -- added as another input to the rng -- and then userspace > > We definitely want a kernel internal notifier as well, yes :). > >> should be notified that it should reseed any userspace RNGs that it >> may have, without actually communicating that UUID to userspace. IOW, > > I also tend to agree that it makes sense to disconnect the actual UUID we receive from the notification to user space. This would allow us to create a generic mechanism for VM save/restore cycles across different hypervisors. Let me add PPC and s390x people to the CC list to see whether they have anything remotely similar to the VmGenID mechanism. For x86 and aarch64, the ACPI and memory based VmGenID implemented here is the most obvious option to implement IMHO. It's also already implemented in all major hypervisors. Hmm, what we do have configurations (e.g. stfle bits) and we do have a notification mechanism via sclp that notifies guests when things change. As of today neither KVM nor Linux implement the sclp change notification mechanism, but I do see value in such a thing. > >> I agree with Jann there. Then, it's the functioning of this >> notification mechanism to userspace that is interesting to me. > > Absolutely! Please have a look at the previous discussion here: > > > https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/ > > The user space interface is absolutely what this is about. Yes. Passing a notification to userspace is essential. Where I do not see a solution yet is the race between notification and already running with the old knowledge. > >> There are a few design goals of notifying userspace: it should be >> fast, because people who are using userspace RNGs are usually doing so >> in the first place to completely avoid syscall overhead for whatever >> high performance application they have - e.g. I recall conversations >> with Colm about his TLS implementation needing to make random IVs >> _really_ fast. It should also happen as early as possible, with no >> race or as minimal as possible race window, so that userspace doesn't >> begin using old randomness and then switch over after the damage is >> already done. > > There are multiple facets and different types of consumers here. For a user space RNG, I agree that fast and as race free as possible is key. That's what the mmap interface is there for. > > There are applications way beyond that though. What do you do with applications that already consumed randomness? For example a cached pool of SSL keys. Or a higher level language primitive that consumes randomness and caches its seed somewhere in an internal data structure. Or even worse: your system's host ssh key. > > For those types of events, an mmap (or vDSO) interface does not work. We need to actively allow user space applications to readjust to the new environment - either internally (the language primitive case) or through a system event, maybe even as systemd trigger (the ssh host key case). > > To give everyone enough time before we consider a system as "updated to the new environment", we have the callback logic with the "Orchestrator" that can check whether all listeners to system wide updates confirms they adjusted themselves. 
> > That's what the rest of the logic is there for: A read+poll interface and all of the orchestration logic. It's not for the user space RNG case, it's for all of its downstream users. > >> I'm also not wedded to using Microsoft's proprietary hypervisor design >> for this. If we come up with a better interface, I don't think it's >> asking too much to implement that and reasonably expect for Microsoft >> to catch up. Maybe someone here will find that controversial, but >> whatever -- discussing ideal designs does not seem out of place or >> inappropriate for how we usually approach things in the kernel, and a >> closed source hypervisor coming along shouldn't disrupt that. > > The main bonus point on this interface is that Hyper-V, VMware and QEMU implement it already. It would be a very natural for into the ecosystem. I agree though that we shouldn't have our user space interface necessarily dictated by it: Other hypervisors may implement different ways such as a simple edge IRQ that gets triggered whenever the VM gets resumed. > >> So, anyway, here are a few options with some pros and cons for the >> kernel notifying userspace that its RNG should reseed. > > I can only stress again that we should not be laser focused on the RNG case. In a lot of cases, data has already been generated by the RNG before the snapshot and needs to be reinitialized after the snapshot. In other cases such as system UUIDs, it's completely orthogonal to the RNG. > >> >> 1. SIGRND - a new signal. Lol. > > Doable, but a lot of plumbing in user space. It's also not necessarily a good for for event notification in most user space applications. > >> >> 2. Userspace opens a file descriptor that it can epoll on. Pros are >> that many notification mechanisms already use this. Cons is that this >> requires syscall and might be more racy than we want. Another con is >> that this a new thing for userspace programs to do. > > That's part of what this patch does, right? This patch implements read+poll as well as mmap() for high speed reads. > >> 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are >> that this is extremely fast, and also simple to use and implement. >> There are enough sequence points in typical crypto programs that >> checking to see whether this counter has changed before doing whatever >> operation seems easy enough. Cons are that typically we've been >> conservative about adding things to the vDSO, and this is also a new >> thing for userspace programs to do. > > The big con is that its use is going to be super limited to applications that can be adapted to check their "vm generation" through a vDSO call / read every time they consume data that may potentially need to be regenerated. > > This probably works for the pure RNG case. It falls apart for more sophisticated things such as "redo my ssh host keys and restart the service" or "regenerate my samba machine uuid". > >> 4. We already have a mechanism for this kind of thing, because the >> same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, >> where userspace marks a page to be zeroed when forking, for the >> purposes of the RNG being notified when its world gets split in two. >> This is basically the same thing as we're discussing here with guest >> snapshots, except it's on the system level rather than the process >> level, and a system has many processes. But the problem space is still >> almost the same, and we could simply reuse that same mechanism. 
There >> are a few implementation strategies for that: > > Yup, that's where we started from :). And then we ran into resistance by the mm people (on CC here). And then we looked at the problem more in depth and checked what it would take to for example implement this for user space RNGs in Java. It's ... more complicated than one may think at first. > >> 4a. We mess with the PTEs of all processes' pages that are >> MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us >> to do so. Then we wind up reusing the already existing logic for >> userspace RNGs. Cons might be that this usually requires semaphores, >> and we're in irq context, so we'd have to hoist to a workqueue, which >> means either more wake up latency, or a larger race window. >> >> 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, >> when the hypervisor notifies us to do so. Then we wind up reusing the >> already existing logic for userspace RNGs. >> >> 4c. The guest kernel maintains an array of physical addresses that are >> MADV_WIPEONFORK. The hypervisor knows about this array and its >> location through whatever protocol, and before resuming a >> moved/snapshotted/duplicated VM, it takes the responsibility for >> memzeroing this memory. The huge pro here would be that this >> eliminates all races, and reduces complexity quite a bit, because the >> hypervisor can perfectly synchronize its bringup (and SMP bringup) >> with this, and it can even optimize things like on-disk memory >> snapshots to simply not write out those pages to disk. >> >> A 4c-like approach seems like it'd be a lot of bang for the buck -- we >> reuse the existing mechanism (MADV_WIPEONFORK), so there's no new >> userspace API to deal with, and it'd be race free, and eliminate a lot >> of kernel complexity. >> >> But 4b and 3 don't seem too bad either. >> >> Any thoughts on 4c? Is that utterly insane, or does that actually get >> us somewhere close to what we want? > > All of the options for "4" are possible and have an RFC out. Please check out the discussion linked above :). > > The problem with anything that relies on close loop reads (options 3+4) is not going to work well with the more sophisticated use case of derived data. > > IMHO it will boil down to "both". We will need a high-speed interface that with close-to-0 overhead tells you either the generation ID or clears pages (options 3+4) as well as something that is bigger for applications that can either intrinsically (sshd) or by system design (Java) not adopt the mechanisms above easily. > > That said, we need to start somewhere. I don't mind which angle we start from. But this is a real world problem and one that will only become more prevalent over time as VMs are used for more than only your traditional enterprise hardware consolidation. > > > Alex > > > > Amazon Development Center Germany GmbH > Krausenstr. 38 > 10117 Berlin > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B > Sitz: Berlin > Ust-ID: DE 289 237 879 > >
On 20.10.20 11:35, Christian Borntraeger wrote: > On 17.10.20 20:09, Alexander Graf wrote: >> Hi Jason, >> >> On 17.10.20 15:24, Jason A. Donenfeld wrote: >>> >>> After discussing this offline with Jann a bit, I have a few general >>> comments on the design of this. >>> >>> First, the UUID communicated by the hypervisor should be consumed by >>> the kernel -- added as another input to the rng -- and then userspace >> >> We definitely want a kernel internal notifier as well, yes :). >> >>> should be notified that it should reseed any userspace RNGs that it >>> may have, without actually communicating that UUID to userspace. IOW, >> >> I also tend to agree that it makes sense to disconnect the actual UUID we receive from the notification to user space. This would allow us to create a generic mechanism for VM save/restore cycles across different hypervisors. Let me add PPC and s390x people to the CC list to see whether they have anything remotely similar to the VmGenID mechanism. For x86 and aarch64, the ACPI and memory based VmGenID implemented here is the most obvious option to implement IMHO. It's also already implemented in all major hypervisors. > > Hmm, what we do have configurations (e.g. stfle bits) and we do have a notification mechanism via sclp that notifies guests when things change. > As of today neither KVM nor Linux implement the sclp change notification mechanism, but I do see value in such a thing. stfle only stores architected CPU capabilities, no? The UUID is about uniquely identifying clones of the same base image, so they can reestablish their uniqueness. That said, your interest means that there may be a mechanism on s390 one day, so we should think about making it more generic. > >> >>> I agree with Jann there. Then, it's the functioning of this >>> notification mechanism to userspace that is interesting to me. >> >> Absolutely! Please have a look at the previous discussion here: >> >> >> https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/ >> >> The user space interface is absolutely what this is about. > > Yes. Passing a notification to userspace is essential. Where I do not see a solution yet is the race between notification and > already running with the old knowledge. With a post-mortem interface, we will always have a tiny race window. I'm not really convinced that this is a serious problem yet though. In order to do extract anything off a virtual machine that was cloned, you need to communicate with it. If you for example stop the network link until all of this device's users have indicated they are finished adjusting, the race should be small enough for any practical purposes I can think of. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
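A guest-side orchestrator along these lines could gate "going back online" on the VMGENID_WAIT_WATCHERS ioctl from this patch; a rough sketch (pause_network()/resume_network() are hypothetical helpers, and the timeout argument follows the proposed uapi header):

    /* Sketch of an orchestrator gating external traffic on watcher confirmation.
     * pause_network()/resume_network() are hypothetical helpers.
     * <linux/vmgenid.h> (from this patch) pulls in struct timespec via <linux/time.h>.
     */
    #include <sys/ioctl.h>
    #include <linux/vmgenid.h>

    extern void pause_network(void);  /* hypothetical: keep the clone offline */
    extern void resume_network(void); /* hypothetical */

    int quiesce_until_adjusted(int vmgenfd)
    {
        struct timespec timeout = { .tv_sec = 5 };
        int ret;

        pause_network();

        /* Block until every open watcher has acked the new UUID,
         * or give up after the timeout. */
        ret = ioctl(vmgenfd, VMGENID_WAIT_WATCHERS, &timeout);

        resume_network();
        return ret;
    }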
On 19.10.20 19:15, Mathieu Desnoyers wrote: > > > ----- On Oct 17, 2020, at 2:10 PM, Andy Lutomirski luto@kernel.org wrote: > >> On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@google.com> wrote: >>> >>> [adding some more people who are interested in RNG stuff: Andy, Jason, >>> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this >>> concerns some pretty fundamental API stuff related to RNG usage] >>> >>> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin >>> <acatan@amazon.com> wrote: >>>> - Background >>>> >>>> The VM Generation ID is a feature defined by Microsoft (paper: >>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by >>>> multiple hypervisor vendors. >>>> >>>> The feature is required in virtualized environments by apps that work >>>> with local copies/caches of world-unique data such as random values, >>>> uuids, monotonically increasing counters, etc. >>>> Such apps can be negatively affected by VM snapshotting when the VM >>>> is either cloned or returned to an earlier point in time. >>>> >>>> The VM Generation ID is a simple concept meant to alleviate the issue >>>> by providing a unique ID that changes each time the VM is restored >>>> from a snapshot. The hw provided UUID value can be used to >>>> differentiate between VMs or different generations of the same VM. >>>> >>>> - Problem >>>> >>>> The VM Generation ID is exposed through an ACPI device by multiple >>>> hypervisor vendors but neither the vendors or upstream Linux have no >>>> default driver for it leaving users to fend for themselves. >>>> >>>> Furthermore, simply finding out about a VM generation change is only >>>> the starting point of a process to renew internal states of possibly >>>> multiple applications across the system. This process could benefit >>>> from a driver that provides an interface through which orchestration >>>> can be easily done. >>>> >>>> - Solution >>>> >>>> This patch is a driver which exposes the Virtual Machine Generation ID >>>> via a char-dev FS interface that provides ID update sync and async >>>> notification, retrieval and confirmation mechanisms: >>>> >>>> When the device is 'open()'ed a copy of the current vm UUID is >>>> associated with the file handle. 'read()' operations block until the >>>> associated UUID is no longer up to date - until HW vm gen id changes - >>>> at which point the new UUID is provided/returned. Nonblocking 'read()' >>>> uses EWOULDBLOCK to signal that there is no _new_ UUID available. >>>> >>>> 'poll()' is implemented to allow polling for UUID updates. Such >>>> updates result in 'EPOLLIN' events. >>>> >>>> Subsequent read()s following a UUID update no longer block, but return >>>> the updated UUID. The application needs to acknowledge the UUID update >>>> by confirming it through a 'write()'. >>>> Only on writing back to the driver the right/latest UUID, will the >>>> driver mark this "watcher" as up to date and remove EPOLLIN status. >>>> >>>> 'mmap()' support allows mapping a single read-only shared page which >>>> will always contain the latest UUID value at offset 0. >>> >>> It would be nicer if that page just contained an incrementing counter, >>> instead of a UUID. It's not like the application cares *what* the UUID >>> changed to, just that it *did* change and all RNGs state now needs to >>> be reseeded from the kernel, right? And an application can't reliably >>> read the entire UUID from the memory mapping anyway, because the VM >>> might be forked in the middle. 
>>> >>> So I think your kernel driver should detect UUID changes and then turn >>> those into a monotonically incrementing counter. (Probably 64 bits >>> wide?) (That's probably also a little bit faster than comparing an >>> entire UUID.) >>> >>> An option might be to put that counter into the vDSO, instead of a >>> separate VMA; but I don't know how the other folks feel about that. >>> Andy, do you have opinions on this? That way, normal userspace code >>> that uses this infrastructure wouldn't have to mess around with a >>> special device at all. And it'd be usable in seccomp sandboxes and so >>> on without needing special plumbing. And libraries wouldn't have to >>> call open() and mess with file descriptor numbers. >> >> The vDSO might be annoyingly slow for this. Something like the rseq >> page might make sense. It could be a generic indication of "system >> went through some form of suspend". > > This might indeed fit nicely as an extension of my KTLS prototype (extensible rseq): > > https://lore.kernel.org/lkml/20200925181518.4141-1-mathieu.desnoyers@efficios.com/ > > There are a few ways we could wire things up. One might be to add the > UUID field into the extended KTLS structure (so it's always updated after it > changes on next return to user-space). For this I assume that the Linux scheduler I think one that that became apparent in the discussion in this thread was that we want a Linux internal generation counter rather than expose the UUID verbatim. That way, we don't give away potential secrets to user space and we can support other architectures more easily. > within the guest VM always preempts all threads before a VM is suspended (is that > indeed true ?). The VM does not know that it gets snapshotted. It only knows that it gets resumed (through this interface). > This leads to one important question though: how is the UUID check vs commit operation > made atomic with respect to suspend ? Unless we use rseq critical sections in assembly, > where the kernel will abort the rseq critical section on preemption, I don't see how we > can ensure that the UUID value does not change right after it has been checked, before > the "commit" side-effect. And what is the expected "commit" side-effect ? Is it a store > to a variable in user-space memory, or is it issuing a system call which sends a packet over > the network ? I think the easiest answer I could come up with here would be "make it a u32". Then you can just access it atomically anywhere, no? The burden on user space with such an interface is still pretty high though. All user space that wants to do a "transaction" based on secrets would now need to read the generation ID at the beginning of the transaction and double check whether it's still the same at the end of it (e.g. before sending out a network packet based on a key derived from randomness?). Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
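That read-then-double-check burden would look roughly like this on the application side (a sketch assuming a hypothetical shared u32 generation word; derive_key() and send_packet() are placeholder helpers):

    /* Sketch of the check / derive / re-check-before-commit pattern,
     * assuming a hypothetical shared 32-bit generation word.
     * derive_key() and send_packet() are placeholder helpers.
     */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    extern _Atomic uint32_t *vm_generation; /* hypothetical shared word */
    extern void derive_key(uint8_t *key, size_t len);
    extern int send_packet(const void *payload, size_t len,
                           const uint8_t *key, size_t keylen);

    int send_with_fresh_key(const void *payload, size_t len)
    {
        uint32_t gen;
        uint8_t key[32];

        do {
            gen = atomic_load_explicit(vm_generation, memory_order_acquire);
            derive_key(key, sizeof(key)); /* consumes cached randomness */
            /* If the VM was cloned while we derived the key, throw it away
             * and try again; the residual race is the window between this
             * re-check and the actual transmission. */
        } while (gen != atomic_load_explicit(vm_generation, memory_order_acquire));

        return send_packet(payload, len, key, sizeof(key));
    }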
Hi all, On 17/10/2020, 21:09, "Graf (AWS), Alexander" <graf@amazon.de> wrote: On 17.10.20 15:24, Jason A. Donenfeld wrote: > > After discussing this offline with Jann a bit, I have a few general > comments on the design of this. > > First, the UUID communicated by the hypervisor should be consumed by > the kernel -- added as another input to the rng -- and then userspace We definitely want a kernel internal notifier as well, yes :). What would be a suitable generic mechanism we could use from a kernel module/driver to trigger an RNG reseed (possibly adding the UUID to the entropy mix as well)? Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
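One option (a sketch, not part of this patch) would be to stir the fresh UUID into the input pool from the ACPI notify path via add_device_randomness(), which mixes data in without crediting entropy; it uses this patch's struct dev_data:

    /* Sketch only, not part of this patch: mix the new UUID into the
     * kernel entropy pool from the ACPI notify path. */
    #include <linux/random.h>

    static void vmgenid_notify_kernel(struct dev_data *priv)
    {
        /* No entropy is credited; the new generation UUID is only mixed in. */
        add_device_randomness(&priv->uuid, sizeof(priv->uuid));
    }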
diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst new file mode 100644 index 0000000..5224415 --- /dev/null +++ b/Documentation/virt/vmgenid.rst @@ -0,0 +1,211 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +VMGENID +============ + +The VM Generation ID is a feature defined by Microsoft (paper: +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by +multiple hypervisor vendors. + +The feature is required in virtualized environments by apps that work +with local copies/caches of world-unique data such as random values, +uuids, monotonically increasing counters, etc. +Such apps can be negatively affected by VM snapshotting when the VM +is either cloned or returned to an earlier point in time. + +The VM Generation ID is a simple concept meant to alleviate the issue +by providing a unique ID that changes each time the VM is restored +from a snapshot. The hw provided UUID value can be used to +differentiate between VMs or different generations of the same VM. + +The VM Generation ID is exposed through an ACPI device by multiple +hypervisor vendors. The driver for it lives at +``drivers/virt/vmgenid.c`` + +The driver exposes the Virtual Machine Generation ID via a char-dev FS +interface that provides ID update sync/async notification, retrieval +and confirmation mechanisms: + +``open()``: + When the device is opened, a copy of the current vm UUID is + associated with the file handle. The driver now tracks this file + handle as an independent *watcher*. The driver tracks how many + watchers are aware of the latest Vm-Gen-Id uuid and how many of + them are *outdated*, outdated being those that have lived through + a Vm-Gen-Id change but not yet confirmed the generation change event. + +``read()``: + Read is meant to provide the *new* Vm-Gen-Id when a generation change + takes place. The read operation blocks until the associated UUID is + no longer up to date - until HW vm gen id changes - at which point + the new UUID is provided/returned. Nonblocking ``read()`` + uses ``EAGAIN`` to signal that there is no *new* UUID available. + The hw UUID is considered *new* for each open file handle that hasn't + confirmed the new value, following a generation change. Therefore, + once a generation change takes place, all ``read()`` calls will + immediately return the new uuid and will continue to do so until the + new value is confirmed back to the driver through ``write()``. + Partial reads are not allowed - read buffer needs to be at least + ``sizeof(uuid_t)`` in size. + +``write()``: + Write is used to confirm the up-to-date Vm-Gen-Id back to the driver. + Following a VM generation change, all existing watchers are marked + as *outdated*. Each file handle will maintain the *outdated* status + until a ``write()`` confirms the up-to-date UUID back to the driver. + Partial writes are not allowed - write buffer should be exactly + ``sizeof(uuid_t)`` in size. + +``poll()``: + Poll is implemented to allow polling for UUID updates. Such + updates result in ``EPOLLIN`` polling status until the new up-to-date + UUID is confirmed back to the driver through a ``write()``. + +``ioctl()``: + The driver also adds support for tracking count of open file handles + that haven't acknowledged an UUID update. This is exposed through + two IOCTLs: + + - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of + *outdated* watchers - number of file handles that were open during + a VM generation change, and which have not yet confirmed the new + Vm-Gen-Id. 
+ - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated* + watchers, or if a ``timeout`` argument is provided, until the + timeout expires. + +``mmap()``: + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page + in size. The first 16 bytes of the mapped page will contain an + up-to-date copy of the VM generation UUID. + The mapped memory can be used as a low-latency UUID probe mechanism + in critical sections - see examples. + +``close()``: + Removes the file handle as a Vm-Gen-Id watcher. + +Example application workflows +----------------------------- + +1) Watchdog thread simplified example:: + + void watchdog_thread_handler(int *thread_active) + { + uuid_t uuid; + int fd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR); + + do { + // read new UUID - blocks until VM generation changes + read(fd, &uuid, sizeof(uuid)); + + // because of VM generation change, we need to rebuild world + reseed_app_env(); + + // confirm we're done handling UUID update + write(fd, &uuid, sizeof(uuid)); + } while (atomic_read(thread_active)); + + close(fd); + } + +2) ASYNC simplified example:: + + void handle_io_on_vmgenfd(int vmgenfd) + { + uuid_t uuid; + + // because of VM generation change, we need to rebuild world + reseed_app_env(); + + // read new UUID - we need it to confirm we've handled update + read(fd, &uuid, sizeof(uuid)); + + // confirm we're done handling UUID update + write(fd, &uuid, sizeof(uuid)); + } + + int main() { + int epfd, vmgenfd; + struct epoll_event ev; + + epfd = epoll_create(EPOLL_QUEUE_LEN); + + vmgenfd = open("/dev/vmgenid", O_RDWR, S_IRUSR | S_IWUSR); + + // register vmgenid for polling + ev.events = EPOLLIN; + ev.data.fd = vmgenfd; + epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev); + + // register other parts of your app for polling + // ... + + while (1) { + // wait for something to do... + int nfds = epoll_wait(epfd, events, + MAX_EPOLL_EVENTS_PER_RUN, + EPOLL_RUN_TIMEOUT); + if (nfds < 0) die("Error in epoll_wait!"); + + // for each ready fd + for(int i = 0; i < nfds; i++) { + int fd = events[i].data.fd; + + if (fd == vmgenfd) + handle_io_on_vmgenfd(vmgenfd); + else + handle_some_other_part_of_the_app(fd); + } + } + + return 0; + } + +3) Mapped memory polling simplified example:: + + /* + * app/library function that provides cached secrets + */ + char * safe_cached_secret(app_data_t *app) + { + char *secret; + volatile uuid_t *const uuid_ptr = get_vmgenid_mapping(app); + again: + secret = __cached_secret(app); + + if (unlikely(*uuid_ptr != app->cached_uuid)) { + app->cached_uuid = *uuid_ptr; + + // rebuild world then confirm the uuid update (thru write) + rebuild_caches(app); + ack_vmgenid_update(app); + + goto again; + } + + return secret; + } + +4) Orchestrator simplified example:: + + /* + * orchestrator - manages multiple apps and libraries used by a service + * and tries to make sure all sensitive components gracefully handle + * VM generation changes. + * Following function is called on detection of a VM generation change. 
+ */ + int handle_vmgen_update(int vmgenfd, uuid_t new_uuid) + { + // pause until all components have handled event + pause_service(); + + // confirm *this* watcher as up-to-date + write(fd, &new_uuid, sizeof(uuid_t)); + + // wait for all *others* + ioctl(fd, VMGENID_WAIT_WATCHERS, NULL); + + // all apps on the system have rebuilt worlds + resume_service(); + } diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig index 363af2e..c80f1ce 100644 --- a/drivers/virt/Kconfig +++ b/drivers/virt/Kconfig @@ -13,6 +13,19 @@ menuconfig VIRT_DRIVERS if VIRT_DRIVERS +config VMGENID + tristate "Virtual Machine Generation ID driver" + depends on ACPI + default M + help + This is a Virtual Machine Generation ID driver which provides + a virtual machine unique identifier. The provided UUID can be + watched through the FS interface exposed by this driver, and + thus can provide notifications for VM snapshot or cloning events. + This enables applications and libraries that store or cache + sensitive information, to know that they need to regenerate it + after process memory has been exposed to potential copying. + config FSL_HV_MANAGER tristate "Freescale hypervisor management driver" depends on FSL_SOC diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile index fd33124..a1f8dcc 100644 --- a/drivers/virt/Makefile +++ b/drivers/virt/Makefile @@ -4,4 +4,5 @@ # obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o +obj-$(CONFIG_VMGENID) += vmgenid.o obj-y += vboxguest/ diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c new file mode 100644 index 0000000..d314c72 --- /dev/null +++ b/drivers/virt/vmgenid.c @@ -0,0 +1,419 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Virtual Machine Generation ID driver + * + * Copyright (C) 2018 Red Hat Inc, Copyright (C) 2020 Amazon.com Inc + * All rights reserved. 
+ * Authors: + * Adrian Catangiu <acatan@amazon.com> + * Or Idgar <oridgar@gmail.com> + * Gal Hammer <ghammer@redhat.com> + * + */ +#include <linux/acpi.h> +#include <linux/cdev.h> +#include <linux/kernel.h> +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/poll.h> +#include <linux/uuid.h> +#include <linux/vmgenid.h> + +#define DEV_NAME "vmgenid" +ACPI_MODULE_NAME(DEV_NAME); + +struct dev_data { + struct cdev cdev; + dev_t dev_id; + unsigned long map_buf; + + void *uuid_iomap; + uuid_t uuid; + wait_queue_head_t read_wait; + + atomic_t watchers; + atomic_t outdated_watchers; + wait_queue_head_t outdated_wait; +}; + +struct file_data { + struct dev_data *dev_data; + uuid_t acked_uuid; +}; + +static bool vmgenid_uuid_matches(struct dev_data *priv, uuid_t *uuid) +{ + return !memcmp(uuid, &priv->uuid, sizeof(uuid_t)); +} + +static void vmgenid_put_outdated_watchers(struct dev_data *priv) +{ + if (atomic_dec_and_test(&priv->outdated_watchers)) + wake_up_interruptible(&priv->outdated_wait); +} + +static int vmgenid_open(struct inode *inode, struct file *file) +{ + struct dev_data *priv = + container_of(inode->i_cdev, struct dev_data, cdev); + struct file_data *file_data = + kzalloc(sizeof(struct file_data), GFP_KERNEL); + + if (!file_data) + return -ENOMEM; + + file_data->acked_uuid = priv->uuid; + file_data->dev_data = priv; + + file->private_data = file_data; + atomic_inc(&priv->watchers); + + return 0; +} + +static int vmgenid_close(struct inode *inode, struct file *file) +{ + struct file_data *file_data = (struct file_data *) file->private_data; + struct dev_data *priv = file_data->dev_data; + + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) + vmgenid_put_outdated_watchers(priv); + atomic_dec(&priv->watchers); + kfree(file->private_data); + + return 0; +} + +static ssize_t +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos) +{ + struct file_data *file_data = + (struct file_data *) file->private_data; + struct dev_data *priv = file_data->dev_data; + ssize_t ret; + + if (nbytes == 0) + return 0; + /* disallow partial UUID reads */ + if (nbytes < sizeof(uuid_t)) + return -EINVAL; + nbytes = sizeof(uuid_t); + + if (vmgenid_uuid_matches(priv, &file_data->acked_uuid)) { + if (file->f_flags & O_NONBLOCK) + return -EAGAIN; + ret = wait_event_interruptible( + priv->read_wait, + !vmgenid_uuid_matches(priv, &file_data->acked_uuid) + ); + if (ret) + return ret; + } + + ret = copy_to_user(ubuf, &priv->uuid, nbytes); + if (ret) + return -EFAULT; + + return nbytes; +} + +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct file_data *file_data = (struct file_data *) file->private_data; + struct dev_data *priv = file_data->dev_data; + uuid_t ack_uuid; + + /* disallow partial UUID writes */ + if (count != sizeof(uuid_t)) + return -EINVAL; + if (copy_from_user(&ack_uuid, ubuf, count)) + return -EFAULT; + /* wrong UUID acknowledged */ + if (!vmgenid_uuid_matches(priv, &ack_uuid)) + return -EINVAL; + + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) { + /* update local view of UUID */ + file_data->acked_uuid = ack_uuid; + vmgenid_put_outdated_watchers(priv); + } + + return (ssize_t)count; +} + +static __poll_t +vmgenid_poll(struct file *file, poll_table *wait) +{ + __poll_t mask = 0; + struct file_data *file_data = + (struct file_data *) file->private_data; + struct dev_data *priv = file_data->dev_data; + + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) + return EPOLLIN | 
EPOLLRDNORM; + + poll_wait(file, &priv->read_wait, wait); + + if (!vmgenid_uuid_matches(priv, &file_data->acked_uuid)) + mask = EPOLLIN | EPOLLRDNORM; + + return mask; +} + +static long vmgenid_ioctl(struct file *file, + unsigned int cmd, unsigned long arg) +{ + struct file_data *file_data = + (struct file_data *) file->private_data; + struct dev_data *priv = file_data->dev_data; + struct timespec __user *timeout = (void *) arg; + struct timespec kts; + ktime_t until; + int ret; + + switch (cmd) { + case VMGENID_GET_OUTDATED_WATCHERS: + ret = atomic_read(&priv->outdated_watchers); + break; + case VMGENID_WAIT_WATCHERS: + if (timeout) { + ret = copy_from_user(&kts, timeout, sizeof(kts)); + if (ret) + return -EFAULT; + until = timespec_to_ktime(kts); + } else { + until = KTIME_MAX; + } + + ret = wait_event_interruptible_hrtimeout( + priv->outdated_wait, + !atomic_read(&priv->outdated_watchers), + until + ); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} + +static vm_fault_t vmgenid_vm_fault(struct vm_fault *vmf) +{ + struct page *page; + struct file_data *file_data = + (struct file_data *) vmf->vma->vm_private_data; + struct dev_data *priv = file_data->dev_data; + + if (priv->map_buf) { + page = virt_to_page(priv->map_buf); + get_page(page); + vmf->page = page; + } + + return 0; +} + +static const struct vm_operations_struct vmgenid_vm_ops = { + .fault = vmgenid_vm_fault, +}; + +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma) +{ + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1) + return -EINVAL; + + if ((vma->vm_flags & VM_WRITE) != 0) + return -EPERM; + + vma->vm_ops = &vmgenid_vm_ops; + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_private_data = file->private_data; + + return 0; +} + +static const struct file_operations fops = { + .owner = THIS_MODULE, + .mmap = vmgenid_mmap, + .open = vmgenid_open, + .release = vmgenid_close, + .read = vmgenid_read, + .write = vmgenid_write, + .poll = vmgenid_poll, + .compat_ioctl = vmgenid_ioctl, + .unlocked_ioctl = vmgenid_ioctl, +}; + +static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle) +{ + int i; + phys_addr_t phys_addr; + struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; + acpi_status status; + union acpi_object *pss; + union acpi_object *element; + + status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer); + if (ACPI_FAILURE(status)) { + ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR")); + return -ENODEV; + } + pss = buffer.pointer; + if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2) + return -EINVAL; + + phys_addr = 0; + for (i = 0; i < pss->package.count; i++) { + element = &(pss->package.elements[i]); + if (element->type != ACPI_TYPE_INTEGER) + return -EINVAL; + phys_addr |= element->integer.value << i * 32; + } + + priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t)); + if (!priv->uuid_iomap) { + pr_err("Could not map memory at 0x%llx, size %u", + phys_addr, + (u32)sizeof(uuid_t)); + return -ENOMEM; + } + + memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t)); + memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t)); + + return 0; +} + +static int vmgenid_acpi_add(struct acpi_device *device) +{ + int ret; + struct dev_data *priv; + + if (!device) + return -EINVAL; + + priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->map_buf = get_zeroed_page(GFP_KERNEL); + if (!priv->map_buf) { + ret = -ENOMEM; + goto free; + } + + device->driver_data = priv; + + 
init_waitqueue_head(&priv->read_wait); + atomic_set(&priv->watchers, 0); + atomic_set(&priv->outdated_watchers, 0); + init_waitqueue_head(&priv->outdated_wait); + + ret = vmgenid_acpi_map(priv, device->handle); + if (ret < 0) + goto err; + + ret = alloc_chrdev_region(&priv->dev_id, 0, 1, DEV_NAME); + if (ret < 0) { + pr_err("alloc_chrdev_region() failed for vmgenid\n"); + goto err; + } + + cdev_init(&priv->cdev, &fops); + cdev_add(&priv->cdev, priv->dev_id, 1); + + return 0; + +err: + if (priv->uuid_iomap) + acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t)); + + free_pages(priv->map_buf, 0); + +free: + kfree(priv); + + return ret; +} + +static int vmgenid_acpi_remove(struct acpi_device *device) +{ + struct dev_data *priv; + + if (!device || !acpi_driver_data(device)) + return -EINVAL; + priv = acpi_driver_data(device); + + cdev_del(&priv->cdev); + unregister_chrdev_region(priv->dev_id, 1); + device->driver_data = NULL; + + if (priv->uuid_iomap) + acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t)); + free_pages(priv->map_buf, 0); + kfree(priv); + + return 0; +} + +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event) +{ + uuid_t old_uuid; + struct dev_data *priv; + + pr_debug("VMGENID notified, event %u", event); + + if (!device || !acpi_driver_data(device)) { + pr_err("VMGENID notify with NULL private data"); + return; + } + priv = acpi_driver_data(device); + + /* update VM Generation UUID */ + old_uuid = priv->uuid; + memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t)); + + if (!vmgenid_uuid_matches(priv, &old_uuid)) { + /* HW uuid updated */ + memcpy((void *) priv->map_buf, &priv->uuid, sizeof(uuid_t)); + atomic_set(&priv->outdated_watchers, + atomic_read(&priv->watchers)); + wake_up_interruptible(&priv->read_wait); + } +} + +static const struct acpi_device_id vmgenid_ids[] = { + {"QEMUVGID", 0}, + {"", 0}, +}; + +static struct acpi_driver acpi_vmgenid_driver = { + .name = "vm_generation_id", + .ids = vmgenid_ids, + .owner = THIS_MODULE, + .ops = { + .add = vmgenid_acpi_add, + .remove = vmgenid_acpi_remove, + .notify = vmgenid_acpi_notify, + } +}; + +static int __init vmgenid_init(void) +{ + return acpi_bus_register_driver(&acpi_vmgenid_driver); +} + +static void __exit vmgenid_exit(void) +{ + acpi_bus_unregister_driver(&acpi_vmgenid_driver); +} + +module_init(vmgenid_init); +module_exit(vmgenid_exit); + +MODULE_AUTHOR("Adrian Catangiu"); +MODULE_DESCRIPTION("Virtual Machine Generation ID"); +MODULE_LICENSE("GPL"); +MODULE_VERSION("0.1"); diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h new file mode 100644 index 0000000..f7fca7b --- /dev/null +++ b/include/uapi/linux/vmgenid.h @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +/* + * Copyright (c) 2020, Amazon.com Inc. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifndef _UAPI_LINUX_VMGENID_H +#define _UAPI_LINUX_VMGENID_H + +#include <linux/ioctl.h> +#include <linux/time.h> + +#define VMGENID_IOCTL 0x2d +#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1) +#define VMGENID_WAIT_WATCHERS _IOW(VMGENID_IOCTL, 2, struct timespec) + +#endif /* _UAPI_LINUX_VMGENID_H */ +
- Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. Furthermore, simply finding out about a VM generation change is only the starting point of a process to renew internal states of possibly multiple applications across the system. This process could benefit from a driver that provides an interface through which orchestration can be easily done. - Solution This patch is a driver which exposes the Virtual Machine Generation ID via a char-dev FS interface that provides ID update sync and async notification, retrieval and confirmation mechanisms: When the device is 'open()'ed a copy of the current vm UUID is associated with the file handle. 'read()' operations block until the associated UUID is no longer up to date - until HW vm gen id changes - at which point the new UUID is provided/returned. Nonblocking 'read()' uses EWOULDBLOCK to signal that there is no _new_ UUID available. 'poll()' is implemented to allow polling for UUID updates. Such updates result in 'EPOLLIN' events. Subsequent read()s following a UUID update no longer block, but return the updated UUID. The application needs to acknowledge the UUID update by confirming it through a 'write()'. Only on writing back to the driver the right/latest UUID, will the driver mark this "watcher" as up to date and remove EPOLLIN status. 'mmap()' support allows mapping a single read-only shared page which will always contain the latest UUID value at offset 0. The driver also adds support for tracking count of open file handles that haven't acknowledged an UUID update. This is exposed through two IOCTLs: * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of _outdated_ watchers - number of file handles that were open during a VM generation change, and which have not yet confirmed the new Vm-Gen-Id. * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_ watchers, or if a 'timeout' argument is provided, until the timeout expires. This patch builds on top of Or Idgar <oridgar@gmail.com>'s proposal https://lkml.org/lkml/2018/3/1/498 - Future improvements Ideally we would want the driver to register itself based on devices' _CID and not _HID, but unfortunately I couldn't find a way to do that. The problem is that ACPI device matching is done by '__acpi_match_device()' which exclusively looks at 'acpi_hardware_id *hwid'. There is a path for platform devices to match on _CID when _HID is 'PRP0001' - which is not the case for the Qemu vmgenid device. Guidance and help here would be greatly appreciated. 
Signed-off-by: Adrian Catangiu <acatan@amazon.com> --- Documentation/virt/vmgenid.rst | 211 +++++++++++++++++++++ drivers/virt/Kconfig | 13 ++ drivers/virt/Makefile | 1 + drivers/virt/vmgenid.c | 419 +++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/vmgenid.h | 22 +++ 5 files changed, 666 insertions(+) create mode 100644 Documentation/virt/vmgenid.rst create mode 100644 drivers/virt/vmgenid.c create mode 100644 include/uapi/linux/vmgenid.h