diff mbox series

[RFC,1/1] vhost-user: add shmem mmap request

Message ID 20240530152223.780232-2-aesteve@redhat.com (mailing list archive)
State New, archived
Headers show
Series vhost-user: Add SHMEM_MAP/UNMAP requests | expand

Commit Message

Albert Esteve May 30, 2024, 3:22 p.m. UTC
Add SHMEM_MAP/UNMAP requests to vhost-user.

This request allows backends to dynamically map
fds into a shared memory region identified by
its `shmid`. Then, the fd memory is advertised
to the frontend through a BAR+offset, so it can
be read by the driver while it's valid.

Then, the backend can munmap the memory range
in a given shared memory region (again, identified
by its `shmid`), to free it. After this, the
region becomes private and shall not be accessed
by the frontend anymore.

Initializing the memory region is the responsibility
of the PCI device that will be using it.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 docs/interop/vhost-user.rst |  23 ++++++++
 hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
 hw/virtio/virtio.c          |   2 +
 include/hw/virtio/virtio.h  |   3 +
 4 files changed, 134 insertions(+)

Comments

Stefan Hajnoczi June 4, 2024, 6:54 p.m. UTC | #1
On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> Add SHMEM_MAP/UNMAP requests to vhost-user.
> 
> This request allows backends to dynamically map
> fds into a shared memory region indentified by

Please call this "VIRTIO Shared Memory Region" everywhere (code,
vhost-user spec, commit description, etc) so it's clear that this is not
about vhost-user shared memory tables/regions.

> its `shmid`. Then, the fd memory is advertised
> to the frontend through a BAR+offset, so it can
> be read by the driver while its valid.

Why is a PCI BAR mentioned here? vhost-user does not know about the
VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
report VIRTIO Shared Memory Regions to the driver.

> 
> Then, the backend can munmap the memory range
> in a given shared memory region (again, identified
> by its `shmid`), to free it. After this, the
> region becomes private and shall not be accessed
> by the frontend anymore.

What does "private" mean?

The frontend must mmap PROT_NONE to reserve the virtual memory space
when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
unrelated mmap(NULL, ...) might use that address range and the guest
would have access to the host memory! This is a security issue and needs
to be mentioned explicitly in the spec.

> 
> Initializing the memory region is reponsiblity
> of the PCI device that will using it.

What does this mean?

> 
> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> ---
>  docs/interop/vhost-user.rst |  23 ++++++++
>  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
>  hw/virtio/virtio.c          |   2 +
>  include/hw/virtio/virtio.h  |   3 +
>  4 files changed, 134 insertions(+)

Two missing pieces:

1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
Shared Memory Regions from the vhost-user backend. vhost-user-device is
a generic vhost-user frontend without knowledge of the device type, so
it doesn't know what the valid shmids are and what size the regions
have.

2. Other backends don't see these mappings. If the guest submits a vring
descriptor referencing a mapping to another backend, then that backend
won't be able to access this memory. David Gilbert hit this problem when
working on the virtiofs DAX Window. Either the frontend needs to forward
all SHMEM_MAP/UNMAP messages to the other backends (inefficient and
maybe racy!) or a new "memcpy" message is needed as a fallback for when
vhost-user memory region translation fails.

> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d8419fd2f1..3caf2a290c 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -1859,6 +1859,29 @@ is sent by the front-end.
>    when the operation is successful, or non-zero otherwise. Note that if the
>    operation fails, no fd is sent to the backend.
>  
> +``VHOST_USER_BACKEND_SHMEM_MAP``
> +  :id: 9
> +  :equivalent ioctl: N/A
> +  :request payload: fd and ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends to advertise a new mapping
> +  to be made in a given shared memory region. Upon receiving the message,
> +  QEMU will mmap the given fd into the shared memory region with the

s/QEMU/the frontend/

> +  requested ``shmid``. A reply is generated indicating whether mapping
> +  succeeded.

Please document whether mapping over an existing mapping is allowed. I
think it should be allowed because it might be useful to atomically
update a mapping without a race where the driver sees unmapped memory.

If mapping over an existing mapping is allowed, does the new mapping
need to cover the old mapping exactly, or can it span multiple previous
mappings or a subset of an existing mapping?

From a security point of view we need to be careful here. A potentially
untrusted backend process now has the ability to mmap an arbitrary file
descriptor into the frontend process. The backend can cause
denial of service by creating many small mappings to exhaust the OS
limits on virtual memory areas. The backend can map memory to use as
part of a security compromise, so we need to be sure the virtual memory
addresses are not leaked to the backend and only read/write page
permissions are available.
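A hedged sketch of the kind of request validation this implies: reject unknown flag bits (so only read/write page permissions can ever be granted) and offset/length overflow, and never return host virtual addresses to the backend. The function name and `region_size` parameter are assumptions of this sketch, not part of the patch:

```c
#include <errno.h>
#include <stdint.h>

#define VHOST_USER_FLAG_MAP_R (1u << 0)
#define VHOST_USER_FLAG_MAP_W (1u << 1)

/* Returns 0 if the map request is acceptable, negative errno otherwise. */
static int validate_mmap_request(uint64_t shm_offset, uint64_t len,
                                 uint64_t flags, uint64_t region_size)
{
    /* Only R/W permission bits are defined; anything else is rejected. */
    if (flags & ~(uint64_t)(VHOST_USER_FLAG_MAP_R | VHOST_USER_FLAG_MAP_W)) {
        return -EINVAL;
    }
    /* Reject zero-length maps, offset+len wraparound, and out-of-bounds
     * ranges, mirroring the overflow check in the patch. */
    if (len == 0 || shm_offset + len < len || shm_offset + len > region_size) {
        return -EFAULT;
    }
    return 0;
}
```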

> +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> +  :id: 10
> +  :equivalent ioctl: N/A
> +  :request payload: ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends so that QEMU un-mmap

s/QEMU/the frontend/

> +  a given range (``offset``, ``len``) in the shared memory region with the
> +  requested ``shmid``.

Does the range need to correspond to a previously-mapped VhostUserMMap
or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
etc?

> +  A reply is generated indicating whether unmapping succeeded.
> +
>  .. _reply_ack:
>  
>  VHOST_USER_PROTOCOL_F_REPLY_ACK
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index cdf9af4a4b..9526b9d07f 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
>      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
>      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
>      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
>      VHOST_USER_BACKEND_MAX
>  }  VhostUserBackendRequest;
>  
> @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
>      unsigned char uuid[16];
>  } VhostUserShared;
>  
> +/* For the flags field of VhostUserMMap */
> +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> +
> +typedef struct {
> +    /* Shared memory region ID */
> +    uint8_t shmid;

There is a hole (padding) in the struct since the following fields are
naturally aligned to 8 bytes. I suggest moving shmid to the end to
reduce the chance of information leaks if padding is not zeroed.
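The padding issue can be made concrete with `offsetof`/`sizeof`. On an ABI that aligns `uint64_t` to 8 bytes (e.g. x86-64), the patch's layout leaves 7 hidden bytes after `shmid`, while moving `shmid` to the end confines padding to the tail. The two struct names below are invented for this comparison:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as proposed in the patch: 7 bytes of implicit padding follow
 * shmid before fd_offset, which may leak uninitialized memory if the
 * struct is sent over the socket without being zeroed. */
typedef struct {
    uint8_t  shmid;
    uint64_t fd_offset;
    uint64_t shm_offset;
    uint64_t len;
    uint64_t flags;
} VhostUserMMapPadded;

/* Suggested layout: shmid moved to the end, so padding only trails
 * the last field. */
typedef struct {
    uint64_t fd_offset;
    uint64_t shm_offset;
    uint64_t len;
    uint64_t flags;
    uint8_t  shmid;
} VhostUserMMapTail;
```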

> +    /* File offset */
> +    uint64_t fd_offset;
> +    /* Offset within the shared memory region */
> +    uint64_t shm_offset;
> +    /* Size of region to map */

To avoid giving "region" additional meanings:

s/Size of the region to map/Size of the mapping/

> +    uint64_t len;
> +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> +    uint64_t flags;
> +} VhostUserMMap;
> +
>  typedef struct {
>      VhostUserRequest request;
>  
> @@ -224,6 +243,7 @@ typedef union {
>          VhostUserInflight inflight;
>          VhostUserShared object;
>          VhostUserTransferDeviceState transfer_state;
> +        VhostUserMMap mmap;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
>      return 0;
>  }
>  
> +static int
> +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> +                                  VhostUserMMap *vu_mmap,
> +                                  int fd)
> +{
> +    void *addr = 0;
> +    MemoryRegion *mr = NULL;
> +
> +    if (fd < 0) {
> +        error_report("Bad fd for map");
> +        return -EBADF;
> +    }
> +
> +    if (!dev->vdev->shmem_list ||
> +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> +        error_report("Shared memory region at "
> +                     "ID %d uninitialized", vu_mmap->shmid);
> +        return -EFAULT;
> +    }
> +
> +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> +
> +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> +                     vu_mmap->shm_offset, vu_mmap->len);
> +        return -EFAULT;
> +    }
> +
> +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> +
> +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> +    if (addr == MAP_FAILED) {
> +        error_report("Failed to mmap mem fd");
> +        return -EFAULT;
> +    }
> +
> +    return 0;
> +}
> +
> +static int
> +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> +                                    VhostUserMMap *vu_mmap)
> +{
> +    void *addr = 0;
> +    MemoryRegion *mr = NULL;
> +
> +    if (!dev->vdev->shmem_list ||
> +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> +        error_report("Shared memory region at "
> +                     "ID %d uninitialized", vu_mmap->shmid);
> +        return -EFAULT;
> +    }
> +
> +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> +
> +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> +                     vu_mmap->shm_offset, vu_mmap->len);
> +        return -EFAULT;
> +    }
> +
> +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> +
> +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
> +
> +    if (addr == MAP_FAILED) {
> +        error_report("Failed to unmap memory");
> +        return -EFAULT;
> +    }
> +
> +    return 0;
> +}
> +
>  static void close_backend_channel(struct vhost_user *u)
>  {
>      g_source_destroy(u->backend_src);
> @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc, GIOCondition condition,
>          ret = vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
>                                                               &hdr, &payload);
>          break;
> +    case VHOST_USER_BACKEND_SHMEM_MAP:
> +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> +                                                fd ? fd[0] : -1);
> +        break;
> +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> +        break;
>      default:
>          error_report("Received unexpected msg type: %d.", hdr.request);
>          ret = -EINVAL;
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..59596370ec 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
>              virtio_vmstate_change, vdev);
>      vdev->device_endian = virtio_default_endian();
>      vdev->use_guest_notifier_mask = true;
> +    vdev->shmem_list = NULL;
> +    vdev->n_shmem_regions = 0;
>  }
>  
>  /*
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 7d5ffdc145..34bec26593 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -165,6 +165,9 @@ struct VirtIODevice
>       */
>      EventNotifier config_notifier;
>      bool device_iotlb_enabled;
> +    /* Shared memory region for vhost-user mappings. */
> +    MemoryRegion *shmem_list;
> +    int n_shmem_regions;
>  };
>  
>  struct VirtioDeviceClass {
> -- 
> 2.44.0
>
Albert Esteve June 5, 2024, 8:13 a.m. UTC | #2
On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > Add SHMEM_MAP/UNMAP requests to vhost-user.
> >
> > This request allows backends to dynamically map
> > fds into a shared memory region indentified by
>
> Please call this "VIRTIO Shared Memory Region" everywhere (code,
> vhost-user spec, commit description, etc) so it's clear that this is not
> about vhost-user shared memory tables/regions.
>
> > its `shmid`. Then, the fd memory is advertised
> > to the frontend through a BAR+offset, so it can
> > be read by the driver while its valid.
>
> Why is a PCI BAR mentioned here? vhost-user does not know about the
> VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> report VIRTIO Shared Memory Regions to the driver.
>
>
I will remove PCI BAR, as it is true that it depends on the
transport. I was trying to explain that the driver
will use the shm_base + shm_offset to access
the mapped memory.


> >
> > Then, the backend can munmap the memory range
> > in a given shared memory region (again, identified
> > by its `shmid`), to free it. After this, the
> > region becomes private and shall not be accessed
> > by the frontend anymore.
>
> What does "private" mean?
>
> The frontend must mmap PROT_NONE to reserve the virtual memory space
> when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> unrelated mmap(NULL, ...) might use that address range and the guest
> would have access to the host memory! This is a security issue and needs
> to be mentioned explicitly in the spec.
>

I mentioned private because it changes the mapping from MAP_SHARED
to MAP_PRIVATE. I will highlight PROT_NONE instead.


>
> >
> > Initializing the memory region is reponsiblity
> > of the PCI device that will using it.
>
> What does this mean?
>

The MemoryRegion is declared in `struct VirtIODevice`,
but it is uninitialized in this commit. So I was trying to say
that the initialization will happen in, e.g., vhost-user-gpu-pci.c
with something like `memory_region_init` , and later `pci_register_bar`.

I am testing that part still.


>
> >
> > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > ---
> >  docs/interop/vhost-user.rst |  23 ++++++++
> >  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
> >  hw/virtio/virtio.c          |   2 +
> >  include/hw/virtio/virtio.h  |   3 +
> >  4 files changed, 134 insertions(+)
>
> Two missing pieces:
>
> 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> Shared Memory Regions from the vhost-user backend. vhost-user-device is
> a generic vhost-user frontend without knowledge of the device type, so
> it doesn't know what the valid shmids are and what size the regions
> have.
>

Ok. I was assuming that if a device (backend) makes a request without a
valid shmid, or without enough size in the region to perform the mmap, the
request would just fail in the VMM performing the actual mmap operation.
So the frontend would not necessarily need to know about valid shmids or
region sizes.


>
> 2. Other backends don't see these mappings. If the guest submits a vring
> descriptor referencing a mapping to another backend, then that backend
> won't be able to access this memory. David Gilbert hit this problem when
> working on the virtiofs DAX Window. Either the frontend needs to forward
> all SHMEM_MAP/UNMAP messages to the other backends (inefficient and
> maybe racy!) or a new "memcpy" message is needed as a fallback for when
> vhost-user memory region translation fails.
>

Ok. In which scenario would another device want to access the same mapping?
If it is for a shared VIRTIO object, the device that receives the dmabuf fd
could just do a new mapping in its VIRTIO Shared Memory Region.


>
> >
> > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > index d8419fd2f1..3caf2a290c 100644
> > --- a/docs/interop/vhost-user.rst
> > +++ b/docs/interop/vhost-user.rst
> > @@ -1859,6 +1859,29 @@ is sent by the front-end.
> >    when the operation is successful, or non-zero otherwise. Note that if
> the
> >    operation fails, no fd is sent to the backend.
> >
> > +``VHOST_USER_BACKEND_SHMEM_MAP``
> > +  :id: 9
> > +  :equivalent ioctl: N/A
> > +  :request payload: fd and ``struct VhostUserMMap``
> > +  :reply payload: N/A
> > +
> > +  This message can be submitted by the backends to advertise a new
> mapping
> > +  to be made in a given shared memory region. Upon receiving the
> message,
> > +  QEMU will mmap the given fd into the shared memory region with the
>
> s/QEMU/the frontend/
>
> > +  requested ``shmid``. A reply is generated indicating whether mapping
> > +  succeeded.
>
> Please document whether mapping over an existing mapping is allowed. I
> think it should be allowed because it might be useful to atomically
> update a mapping without a race where the driver sees unmapped memory.
>
>
So in my understanding, the frontend (driver) in the guest would initiate
the mmap/munmap by sending a request to the backend (vhost-user device).
Then the vhost-user device sends a request to the VMM to perform the
mapping. We could enforce an ACK to ensure that the mmap operation finishes
before the vhost-user device responds to the driver, and thus avoid
races. This way, we support only the simple use case of not allowing
mmaps over an already-mapped region.


> If mapping over an existing mapping is allowed, does the new mapping
> need to cover the old mapping exactly, or can it span multiple previous
> mappings or a subset of an existing mapping?
>
> From a security point of view we need to be careful here. A potentially
> untrusted backend process now has the ability to mmap an arbitrary file
> descriptor into the frontend process. The backend can cause
> denial of service by creating many small mappings to exhaust the OS
> limits on virtual memory areas. The backend can map memory to use as
> part of a security compromise, so we need to be sure the virtual memory
> addresses are not leaked to the backend and only read/write page
> permissions are available.
>

Right, security from untrusted backends is usually the hardest part for me.
But vhost-user devices only see the shm_offset, so this should be safe, no?


>
> > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> > +  :id: 10
> > +  :equivalent ioctl: N/A
> > +  :request payload: ``struct VhostUserMMap``
> > +  :reply payload: N/A
> > +
> > +  This message can be submitted by the backends so that QEMU un-mmap
>
> s/QEMU/the frontend/
>

This is probably my bad, but I really thought the frontend was the driver,
i.e., frontend/backend as alternative terms for the VIRTIO driver/device.
In that case, should we keep QEMU here, or use VMM instead?


>
> > +  a given range (``offset``, ``len``) in the shared memory region with
> the
> > +  requested ``shmid``.
>
> Does the range need to correspond to a previously-mapped VhostUserMMap
> or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
> etc?
>

I would prefer to keep it simple and disallow mapping over a
previously-mapped region. The range needs to correspond to a valid
(within size bounds), free memory range, or else the request will fail.
I will add that to the text.

Nonetheless, I am open to discuss other options.


>
> > +  A reply is generated indicating whether unmapping succeeded.
> > +
> >  .. _reply_ack:
> >
> >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index cdf9af4a4b..9526b9d07f 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
> >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
> >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
> >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> >      VHOST_USER_BACKEND_MAX
> >  }  VhostUserBackendRequest;
> >
> > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
> >      unsigned char uuid[16];
> >  } VhostUserShared;
> >
> > +/* For the flags field of VhostUserMMap */
> > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> > +
> > +typedef struct {
> > +    /* Shared memory region ID */
> > +    uint8_t shmid;
>
> There is a hole (padding) in the struct since the following fields are
> naturally aligned to 8 bytes. I suggest moving shmid to the end to
> reduce the chance of information leaks if padding is not zeroed.
>

I see. I can move it to the end of the struct or add a padding field in
between. I'll do what you suggested, as it sounds like the simplest solution.


>
> > +    /* File offset */
> > +    uint64_t fd_offset;
> > +    /* Offset within the shared memory region */
> > +    uint64_t shm_offset;
> > +    /* Size of region to map */
>
> To avoid giving "region" additional meanings:
>
> s/Size of the region to map/Size of the mapping/
>

Ok, I will change it in the next drop. I will probably keep the RFC tag for
at least one more revision, seeing that there are a few things I wasn't
considering correctly.

Thanks for all the feedback!


>
> > +    uint64_t len;
> > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> > +    uint64_t flags;
> > +} VhostUserMMap;
> > +
> >  typedef struct {
> >      VhostUserRequest request;
> >
> > @@ -224,6 +243,7 @@ typedef union {
> >          VhostUserInflight inflight;
> >          VhostUserShared object;
> >          VhostUserTransferDeviceState transfer_state;
> > +        VhostUserMMap mmap;
> >  } VhostUserPayload;
> >
> >  typedef struct VhostUserMsg {
> > @@ -1748,6 +1768,85 @@
> vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
> >      return 0;
> >  }
> >
> > +static int
> > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> > +                                  VhostUserMMap *vu_mmap,
> > +                                  int fd)
> > +{
> > +    void *addr = 0;
> > +    MemoryRegion *mr = NULL;
> > +
> > +    if (fd < 0) {
> > +        error_report("Bad fd for map");
> > +        return -EBADF;
> > +    }
> > +
> > +    if (!dev->vdev->shmem_list ||
> > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > +        error_report("Shared memory region at "
> > +                     "ID %d unitialized", vu_mmap->shmid);
> > +        return -EFAULT;
> > +    }
> > +
> > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > +
> > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > +                     vu_mmap->shm_offset, vu_mmap->len);
> > +        return -EFAULT;
> > +    }
> > +
> > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > +
> > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> > +    if (addr == MAP_FAILED) {
> > +        error_report("Failed to mmap mem fd");
> > +        return -EFAULT;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int
> > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> > +                                    VhostUserMMap *vu_mmap)
> > +{
> > +    void *addr = 0;
> > +    MemoryRegion *mr = NULL;
> > +
> > +    if (!dev->vdev->shmem_list ||
> > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > +        error_report("Shared memory region at "
> > +                     "ID %d unitialized", vu_mmap->shmid);
> > +        return -EFAULT;
> > +    }
> > +
> > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > +
> > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > +                     vu_mmap->shm_offset, vu_mmap->len);
> > +        return -EFAULT;
> > +    }
> > +
> > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > +
> > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1,
> 0);
> > +
> > +    if (addr == MAP_FAILED) {
> > +        error_report("Failed to unmap memory");
> > +        return -EFAULT;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  static void close_backend_channel(struct vhost_user *u)
> >  {
> >      g_source_destroy(u->backend_src);
> > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc,
> GIOCondition condition,
> >          ret =
> vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
> >                                                               &hdr,
> &payload);
> >          break;
> > +    case VHOST_USER_BACKEND_SHMEM_MAP:
> > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> > +                                                fd ? fd[0] : -1);
> > +        break;
> > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> > +        break;
> >      default:
> >          error_report("Received unexpected msg type: %d.", hdr.request);
> >          ret = -EINVAL;
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index 893a072c9d..59596370ec 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t
> device_id, size_t config_size)
> >              virtio_vmstate_change, vdev);
> >      vdev->device_endian = virtio_default_endian();
> >      vdev->use_guest_notifier_mask = true;
> > +    vdev->shmem_list = NULL;
> > +    vdev->n_shmem_regions = 0;
> >  }
> >
> >  /*
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index 7d5ffdc145..34bec26593 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -165,6 +165,9 @@ struct VirtIODevice
> >       */
> >      EventNotifier config_notifier;
> >      bool device_iotlb_enabled;
> > +    /* Shared memory region for vhost-user mappings. */
> > +    MemoryRegion *shmem_list;
> > +    int n_shmem_regions;
> >  };
> >
> >  struct VirtioDeviceClass {
> > --
> > 2.44.0
> >
>
Stefan Hajnoczi June 5, 2024, 2:28 p.m. UTC | #3
On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > >
> > > This request allows backends to dynamically map
> > > fds into a shared memory region indentified by
> >
> > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > vhost-user spec, commit description, etc) so it's clear that this is not
> > about vhost-user shared memory tables/regions.
> >
> > > its `shmid`. Then, the fd memory is advertised
> > > to the frontend through a BAR+offset, so it can
> > > be read by the driver while its valid.
> >
> > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > report VIRTIO Shared Memory Regions to the driver.
> >
> >
> I will remove PCI BAR, as it is true that it depends on the
> transport. I was trying to explain that the driver
> will use the shm_base + shm_offset to access
> the mapped memory.
> 
> 
> > >
> > > Then, the backend can munmap the memory range
> > > in a given shared memory region (again, identified
> > > by its `shmid`), to free it. After this, the
> > > region becomes private and shall not be accessed
> > > by the frontend anymore.
> >
> > What does "private" mean?
> >
> > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > unrelated mmap(NULL, ...) might use that address range and the guest
> > would have access to the host memory! This is a security issue and needs
> > to be mentioned explicitly in the spec.
> >
> 
> I mentioned private because it changes the mapping from MAP_SHARED
> to MAP_PRIVATE. I will highlight PROT_NONE instead.

I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
meant mmap flags or something like the memory range no longer being
accessible to the driver.

> 
> 
> >
> > >
> > > Initializing the memory region is reponsiblity
> > > of the PCI device that will using it.
> >
> > What does this mean?
> >
> 
> The MemoryRegion is declared in `struct VirtIODevice`,
> but it is uninitialized in this commit. So I was trying to say
> that the initialization will happen in, e.g., vhost-user-gpu-pci.c
> with something like `memory_region_init` , and later `pci_register_bar`.

Okay. The device model needs to create MemoryRegion instances for the
device's Shared Memory Regions and add them to the VirtIODevice.

--device vhost-user-device will need to query the backend since, unlike
vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
device types. It will need to create MemoryRegions enumerated from the
backend.

By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
Regions so this work should not be tied to PCI.

> 
> I am testing that part still.
> 
> 
> >
> > >
> > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > ---
> > >  docs/interop/vhost-user.rst |  23 ++++++++
> > >  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
> > >  hw/virtio/virtio.c          |   2 +
> > >  include/hw/virtio/virtio.h  |   3 +
> > >  4 files changed, 134 insertions(+)
> >
> > Two missing pieces:
> >
> > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> > Shared Memory Regions from the vhost-user backend. vhost-user-device is
> > a generic vhost-user frontend without knowledge of the device type, so
> > it doesn't know what the valid shmids are and what size the regions
> > have.
> >
> 
> Ok. I was assuming that if a device (backend) makes a request without a
> valid shmid or not enough size in the region to perform the mmap, would
> just fail in the VMM performing the actual mmap operation. So it would
> not necessarily need to know about valid shmids or region sizes.

But then --device vhost-user-device wouldn't be able to support VIRTIO
Shared Memory Regions, which means this patch series is incomplete. New
vhost-user features need to support both --device vhost-user-<type>-*
and --device vhost-user-device.

What's needed is:
1. New vhost-user messages so the frontend can query the shmids and
   sizes from the backend.
2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
   Memory Regions from the vhost-user backend and then creates
   MemoryRegions for them.

> 
> 
> >
> > 2. Other backends don't see these mappings. If the guest submits a vring
> > descriptor referencing a mapping to another backend, then that backend
> > won't be able to access this memory. David Gilbert hit this problem when
> > working on the virtiofs DAX Window. Either the frontend needs to forward
> > all SHMAP_MAP/UNMAP messages to the other backends (inefficient and
> > maybe racy!) or a new "memcpy" message is needed as a fallback for when
> > vhost-user memory region translation fails.
> >
> 
> Ok. In which scenario would another device want to access the same mapping?
> If it is for a shared VIRTIO object, the device that receives the dmabuf fd
> could
> just do a new mapping in its VIRTIO shared memory region.

You can reproduce this with two virtiofs devices, where device A has the
DAX Window feature enabled. The test program mmaps the DAX file and then
opens a new file on device B and performs an O_DIRECT write(2) syscall.

I think that this issue must be addressed in this series, since the issue
is introduced by enabling VIRTIO Shared Memory Regions in vhost-user.

A bit more detail:

Device A has a VIRTIO Shared Memory Region. An application mmaps that
memory (examples: guest userspace driver using Linux VFIO, a guest
kernel driver that exposes the memory to userspace via mmap, or guest
kernel DAX). The application passes that memory as an I/O buffer to
device B (e.g. O_DIRECT disk I/O).

The result is that device B's vhost-user backend receives a vring
descriptor that points to a guest memory address in device A's VIRTIO
Shared Memory Region. Since device B does not have this memory in its
table, it cannot translate the address and the device breaks.
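
To make the failure mode concrete, here is a minimal sketch (the types,
table contents, and helper name are illustrative, not QEMU code) of the
translation step that breaks in device B's backend:

```c
/* Why device B breaks: each vhost-user backend can only translate guest
 * addresses that fall inside the memory regions it received from the
 * front-end via VHOST_USER_SET_MEM_TABLE. Types and the helper name are
 * illustrative, not QEMU code. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t guest_addr; /* guest physical address of the region */
    uint64_t size;       /* region size in bytes */
    void    *hva;        /* where this backend mapped the region */
} MemTableEntry;

/* Translate a guest address to a backend-local pointer, or NULL when the
 * address lies outside every region this backend knows about -- e.g. in
 * another device's VIRTIO Shared Memory Region. */
static void *translate(const MemTableEntry *table, size_t n, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= table[i].guest_addr &&
            gpa - table[i].guest_addr < table[i].size) {
            return (char *)table[i].hva + (gpa - table[i].guest_addr);
        }
    }
    return NULL; /* address not in this backend's table: device breaks */
}
```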

> 
> 
> >
> > >
> > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > index d8419fd2f1..3caf2a290c 100644
> > > --- a/docs/interop/vhost-user.rst
> > > +++ b/docs/interop/vhost-user.rst
> > > @@ -1859,6 +1859,29 @@ is sent by the front-end.
> > >    when the operation is successful, or non-zero otherwise. Note that if
> > the
> > >    operation fails, no fd is sent to the backend.
> > >
> > > +``VHOST_USER_BACKEND_SHMEM_MAP``
> > > +  :id: 9
> > > +  :equivalent ioctl: N/A
> > > +  :request payload: fd and ``struct VhostUserMMap``
> > > +  :reply payload: N/A
> > > +
> > > +  This message can be submitted by the backends to advertise a new
> > mapping
> > > +  to be made in a given shared memory region. Upon receiving the
> > message,
> > > +  QEMU will mmap the given fd into the shared memory region with the
> >
> > s/QEMU/the frontend/
> >
> > > +  requested ``shmid``. A reply is generated indicating whether mapping
> > > +  succeeded.
> >
> > Please document whether mapping over an existing mapping is allowed. I
> > think it should be allowed because it might be useful to atomically
> > update a mapping without a race where the driver sees unmapped memory.
> >
> >
> So in my understanding, the frontend (driver) in the guest would initiate
> the
> mmap/munmap by sending request to the backend (vhost-user device).
> Then the vhost-user device sends a request to the VMM to perform the
> mapping. We could enforce an ACK to ensure that the mmap operation finished
> before the vhost-user device responds to the driver, and thus avoid
> races. This way, we support only the simple usecase of not allowing
> mmaps over an already mapped region.

I think it's fine to start with the stricter model where no overlapping
mappings are allowed. If someone needs overlapping mappings in the
future, a feature bit could be added to negotiate support.

Please document that overlapping mappings are not supported.
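
A sketch of what the stricter check could look like on the front-end side
(the struct and helper names are illustrative, not proposed code):

```c
/* Stricter model: reject a SHMEM_MAP whose range overlaps any existing
 * mapping in the same VIRTIO Shared Memory Region. Struct and helper
 * names are illustrative only. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t shm_offset; /* offset within the shared memory region */
    uint64_t len;        /* length of the mapping */
} Mapping;

/* Half-open intervals [a, a+alen) and [b, b+blen) overlap iff each one
 * starts before the other ends. */
static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

static bool map_request_allowed(const Mapping *maps, size_t n,
                                uint64_t shm_offset, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (ranges_overlap(maps[i].shm_offset, maps[i].len,
                           shm_offset, len)) {
            return false; /* overlapping mappings are not supported */
        }
    }
    return true;
}
```

Adjacent mappings (touching but not overlapping) would still be allowed
under this check.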

> > If mapping over an existing mapping is allowed, does the new mapping
> > need to cover the old mapping exactly, or can it span multiple previous
> > mappings or a subset of an existing mapping?
> >
> > From a security point of view we need to be careful here. A potentially
> > untrusted backend process now has the ability to mmap an arbitrary file
> > descriptor into the frontend process. The backend can cause
> > denial of service by creating many small mappings to exhaust the OS
> > limits on virtual memory areas. The backend can map memory to use as
> > part of a security compromise, so we need to be sure the virtual memory
> > addresses are not leaked to the backend and only read/write page
> > permissions are available.
> >
> 
> Right, security from untrusted backends is usually the hardest part for me.
> But vhost-user devices do only see shm_offset, so this should be safe, no?

Yes, I think the current interface is safe, too.

> >
> > > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> > > +  :id: 10
> > > +  :equivalent ioctl: N/A
> > > +  :request payload: ``struct VhostUserMMap``
> > > +  :reply payload: N/A
> > > +
> > > +  This message can be submitted by the backends so that QEMU un-mmap
> >
> > s/QEMU/the frontend/
> >
> 
> This is probably my bad, but I really thought the frontend is the driver.
> So frontend/backend as alternative terms for vhost-user driver/device.

"vhost-user driver" is not a term that's used, as far as I know. The
vhost-user spec calls it the front-end (older code and docs may call it
the vhost-user master).

> And then here we would keep QEMU or use VMM instead?

The vhost-user spec uses both front-end and QEMU to mean the same thing.
VMM is not used in the vhost-user spec. I suggest using front-end in new
spec text because QEMU is not the only application that implements this
spec anymore.

> 
> 
> >
> > > +  a given range (``offset``, ``len``) in the shared memory region with
> > the
> > > +  requested ``shmid``.
> >
> > Does the range need to correspond to a previously-mapped VhostUserMMap
> > or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
> > etc?
> >
> 
> I would prefer to keep it simple and disallow mapping over a
> previously-mapped
> region. The range needs to correspond to a valid (within size bounds), free
> memory range, or else the request will fail. I will add that to the text.
> 
> Nonetheless, I am open to discuss other options.

That sounds good. I'm not aware of any use cases that require anything fancy.

> 
> 
> >
> > > +  A reply is generated indicating whether unmapping succeeded.
> > > +
> > >  .. _reply_ack:
> > >
> > >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > index cdf9af4a4b..9526b9d07f 100644
> > > --- a/hw/virtio/vhost-user.c
> > > +++ b/hw/virtio/vhost-user.c
> > > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
> > >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
> > >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
> > >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> > > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> > > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> > >      VHOST_USER_BACKEND_MAX
> > >  }  VhostUserBackendRequest;
> > >
> > > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
> > >      unsigned char uuid[16];
> > >  } VhostUserShared;
> > >
> > > +/* For the flags field of VhostUserMMap */
> > > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> > > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> > > +
> > > +typedef struct {
> > > +    /* Shared memory region ID */
> > > +    uint8_t shmid;
> >
> > There is a hole (padding) in the struct since the following fields are
> > naturally aligned to 8 bytes. I suggest moving shmid to the end to
> > reduce the chance of information leaks if padding is not zeroed.
> >
> 
> I see. I can move it to the end of the struct or add a padding field in
> between. I'll do what you suggested, as it sounds like the simplest solution.
> 
> 
> >
> > > +    /* File offset */
> > > +    uint64_t fd_offset;
> > > +    /* Offset within the shared memory region */
> > > +    uint64_t shm_offset;
> > > +    /* Size of region to map */
> >
> > To avoid giving "region" additional meanings:
> >
> > s/Size of the region to map/Size of the mapping/
> >
> 
> Ok, I will change it in the next drop. Probably will keep the RFC for
> at least one more patch, seeing that there are a few things I wasn't
> correctly considering.
> 
> Thanks for all the feedback!

Great, thanks!

Stefan

> 
> 
> >
> > > +    uint64_t len;
> > > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> > > +    uint64_t flags;
> > > +} VhostUserMMap;
> > > +
> > >  typedef struct {
> > >      VhostUserRequest request;
> > >
> > > @@ -224,6 +243,7 @@ typedef union {
> > >          VhostUserInflight inflight;
> > >          VhostUserShared object;
> > >          VhostUserTransferDeviceState transfer_state;
> > > +        VhostUserMMap mmap;
> > >  } VhostUserPayload;
> > >
> > >  typedef struct VhostUserMsg {
> > > @@ -1748,6 +1768,85 @@
> > vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
> > >      return 0;
> > >  }
> > >
> > > +static int
> > > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> > > +                                  VhostUserMMap *vu_mmap,
> > > +                                  int fd)
> > > +{
> > > +    void *addr = 0;
> > > +    MemoryRegion *mr = NULL;
> > > +
> > > +    if (fd < 0) {
> > > +        error_report("Bad fd for map");
> > > +        return -EBADF;
> > > +    }
> > > +
> > > +    if (!dev->vdev->shmem_list ||
> > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > +        error_report("Shared memory region at "
> > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > +
> > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > +
> > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> > > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> > > +    if (addr == MAP_FAILED) {
> > > +        error_report("Failed to mmap mem fd");
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int
> > > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> > > +                                    VhostUserMMap *vu_mmap)
> > > +{
> > > +    void *addr = 0;
> > > +    MemoryRegion *mr = NULL;
> > > +
> > > +    if (!dev->vdev->shmem_list ||
> > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > +        error_report("Shared memory region at "
> > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > +
> > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > +
> > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1,
> > 0);
> > > +
> > > +    if (addr == MAP_FAILED) {
> > > +        error_report("Failed to unmap memory");
> > > +        return -EFAULT;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > >  static void close_backend_channel(struct vhost_user *u)
> > >  {
> > >      g_source_destroy(u->backend_src);
> > > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc,
> > GIOCondition condition,
> > >          ret =
> > vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
> > >                                                               &hdr,
> > &payload);
> > >          break;
> > > +    case VHOST_USER_BACKEND_SHMEM_MAP:
> > > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> > > +                                                fd ? fd[0] : -1);
> > > +        break;
> > > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> > > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> > > +        break;
> > >      default:
> > >          error_report("Received unexpected msg type: %d.", hdr.request);
> > >          ret = -EINVAL;
> > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > index 893a072c9d..59596370ec 100644
> > > --- a/hw/virtio/virtio.c
> > > +++ b/hw/virtio/virtio.c
> > > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t
> > device_id, size_t config_size)
> > >              virtio_vmstate_change, vdev);
> > >      vdev->device_endian = virtio_default_endian();
> > >      vdev->use_guest_notifier_mask = true;
> > > +    vdev->shmem_list = NULL;
> > > +    vdev->n_shmem_regions = 0;
> > >  }
> > >
> > >  /*
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index 7d5ffdc145..34bec26593 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -165,6 +165,9 @@ struct VirtIODevice
> > >       */
> > >      EventNotifier config_notifier;
> > >      bool device_iotlb_enabled;
> > > +    /* Shared memory region for vhost-user mappings. */
> > > +    MemoryRegion *shmem_list;
> > > +    int n_shmem_regions;
> > >  };
> > >
> > >  struct VirtioDeviceClass {
> > > --
> > > 2.44.0
> > >
> >
Stefan Hajnoczi June 5, 2024, 3:19 p.m. UTC | #4
On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > > >
> > > > This request allows backends to dynamically map
> > > > fds into a shared memory region indentified by
> > >
> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > > vhost-user spec, commit description, etc) so it's clear that this is not
> > > about vhost-user shared memory tables/regions.
> > >
> > > > its `shmid`. Then, the fd memory is advertised
> > > > to the frontend through a BAR+offset, so it can
> > > > be read by the driver while its valid.
> > >
> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > > report VIRTIO Shared Memory Regions to the driver.
> > >
> > >
> > I will remove PCI BAR, as it is true that it depends on the
> > transport. I was trying to explain that the driver
> > will use the shm_base + shm_offset to access
> > the mapped memory.
> >
> >
> > > >
> > > > Then, the backend can munmap the memory range
> > > > in a given shared memory region (again, identified
> > > > by its `shmid`), to free it. After this, the
> > > > region becomes private and shall not be accessed
> > > > by the frontend anymore.
> > >
> > > What does "private" mean?
> > >
> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > > unrelated mmap(NULL, ...) might use that address range and the guest
> > > would have access to the host memory! This is a security issue and needs
> > > to be mentioned explicitly in the spec.
> > >
> >
> > I mentioned private because it changes the mapping from MAP_SHARED
> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>
> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> mean mmap flags or something like the memory range is no longer
> accessible to the driver.

One more thing: please check whether kvm.ko memory regions need to be
modified or split to match the SHMEM_MAP mapping's read/write
permissions.

The VIRTIO Shared Memory Region pages can have PROT_READ, PROT_WRITE,
PROT_READ|PROT_WRITE, or PROT_NONE.

kvm.ko memory regions are read/write or read-only. I'm not sure what
happens when the guest accesses a kvm.ko memory region containing
mappings with permissions more restrictive than its kvm.ko memory
region. In other words, the kvm.ko memory region would allow the
access but the Linux virtual memory configuration would cause a page
fault.

For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
mapping with PROT_READ. The kvm.ko memory region would be read/write
(unless extra steps were taken to tell kvm.ko about the permissions).
When the guest stores to the PROT_READ page, kvm.ko will process a
fault...and I'm not sure what happens next.

A similar scenario occurs when a PROT_NONE mapping exists within a
kvm.ko memory region. I don't remember how kvm.ko behaves when the
guest tries to access the pages.

It's worth figuring this out before going further because it could
become tricky if issues are discovered later. I have CCed David
Hildenbrand in case he knows.

Stefan
David Hildenbrand June 5, 2024, 4:02 p.m. UTC | #5
On 05.06.24 17:19, Stefan Hajnoczi wrote:
> On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
>>> On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>
>>>> On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
>>>>> Add SHMEM_MAP/UNMAP requests to vhost-user.
>>>>>
>>>>> This request allows backends to dynamically map
>>>>> fds into a shared memory region indentified by
>>>>
>>>> Please call this "VIRTIO Shared Memory Region" everywhere (code,
>>>> vhost-user spec, commit description, etc) so it's clear that this is not
>>>> about vhost-user shared memory tables/regions.
>>>>
>>>>> its `shmid`. Then, the fd memory is advertised
>>>>> to the frontend through a BAR+offset, so it can
>>>>> be read by the driver while its valid.
>>>>
>>>> Why is a PCI BAR mentioned here? vhost-user does not know about the
>>>> VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
>>>> report VIRTIO Shared Memory Regions to the driver.
>>>>
>>>>
>>> I will remove PCI BAR, as it is true that it depends on the
>>> transport. I was trying to explain that the driver
>>> will use the shm_base + shm_offset to access
>>> the mapped memory.
>>>
>>>
>>>>>
>>>>> Then, the backend can munmap the memory range
>>>>> in a given shared memory region (again, identified
>>>>> by its `shmid`), to free it. After this, the
>>>>> region becomes private and shall not be accessed
>>>>> by the frontend anymore.
>>>>
>>>> What does "private" mean?
>>>>
>>>> The frontend must mmap PROT_NONE to reserve the virtual memory space
>>>> when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
>>>> unrelated mmap(NULL, ...) might use that address range and the guest
>>>> would have access to the host memory! This is a security issue and needs
>>>> to be mentioned explicitly in the spec.
>>>>
>>>
>>> I mentioned private because it changes the mapping from MAP_SHARED
>>> to MAP_PRIVATE. I will highlight PROT_NONE instead.
>>
>> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
>> mean mmap flags or something like the memory range is no longer
>> accessible to the driver.
> 
> One more thing: please check whether kvm.ko memory regions need to be
> modified or split to match the SHMEM_MAP mapping's read/write
> permissions.
> 
> The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE,
> PROT_READ|PROT_WRITE, or PROT_NONE.
> 
> kvm.ko memory regions are read/write or read-only. I'm not sure what
> happens when the guest accesses a kvm.ko memory region containing
> mappings with permissions more restrictive than its kvm.ko memory
> region.

IIRC, the KVM R/O memory region requests could allow to further reduce 
permissions (assuming your mmap is R/W you could map it R/O into the KVM 
MMU), but I might remember things incorrectly.


> In other words, the kvm.ko memory region would allow the
> access but the Linux virtual memory configuration would cause a page
> fault.
> 
> For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
> mapping with PROT_READ. The kvm.ko memory region would be read/write
> (unless extra steps were taken to tell kvm.ko about the permissions).
> When the guest stores to the PROT_READ page, kvm.ko will process a
> fault...and I'm not sure what happens next.
> 
> A similar scenario occurs when a PROT_NONE mapping exists within a
> kvm.ko memory region. I don't remember how kvm.ko behaves when the
> guest tries to access the pages.
> 
> It's worth figuring this out before going further because it could
> become tricky if issues are discovered later. I have CCed David
> Hildenbrand in case he knows.


One relevant piece is likely:

"When the KVM_CAP_SYNC_MMU capability is available, changes in the 
backing of the memory region are automatically reflected into the guest. 
  For example, an mmap() that affects the region will be made visible 
immediately. "

We can already effectively get R/O or PROT_NONE PTEs in 
PROT_READ|PROT_WRITE mappings, and the KVM must be able to handle that 
properly -- triggering a page fault to let core-MM resolve that.

If we have a PROT_NONE VMA and the guest writes to it, we'd likely end 
up (to resolve the KVM MMU page fault) in 
hva_to_pfn_slow()->get_user_pages_unlocked(), which would return -EFAULT.

Not sure if we really inject a page fault into the guest or if the KVM 
run would effectively fail with -EFAULT and make user space unhappy. 
Certainly something to play with!
Stefan Hajnoczi June 5, 2024, 4:57 p.m. UTC | #6
On Wed, Jun 5, 2024, 12:02 David Hildenbrand <david@redhat.com> wrote:

> On 05.06.24 17:19, Stefan Hajnoczi wrote:
> > On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> >>
> >> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> >>> On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> >>>
> >>>> On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> >>>>> Add SHMEM_MAP/UNMAP requests to vhost-user.
> >>>>>
> >>>>> This request allows backends to dynamically map
> >>>>> fds into a shared memory region indentified by
> >>>>
> >>>> Please call this "VIRTIO Shared Memory Region" everywhere (code,
> >>>> vhost-user spec, commit description, etc) so it's clear that this is
> not
> >>>> about vhost-user shared memory tables/regions.
> >>>>
> >>>>> its `shmid`. Then, the fd memory is advertised
> >>>>> to the frontend through a BAR+offset, so it can
> >>>>> be read by the driver while its valid.
> >>>>
> >>>> Why is a PCI BAR mentioned here? vhost-user does not know about the
> >>>> VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> >>>> report VIRTIO Shared Memory Regions to the driver.
> >>>>
> >>>>
> >>> I will remove PCI BAR, as it is true that it depends on the
> >>> transport. I was trying to explain that the driver
> >>> will use the shm_base + shm_offset to access
> >>> the mapped memory.
> >>>
> >>>
> >>>>>
> >>>>> Then, the backend can munmap the memory range
> >>>>> in a given shared memory region (again, identified
> >>>>> by its `shmid`), to free it. After this, the
> >>>>> region becomes private and shall not be accessed
> >>>>> by the frontend anymore.
> >>>>
> >>>> What does "private" mean?
> >>>>
> >>>> The frontend must mmap PROT_NONE to reserve the virtual memory space
> >>>> when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> >>>> unrelated mmap(NULL, ...) might use that address range and the guest
> >>>> would have access to the host memory! This is a security issue and
> needs
> >>>> to be mentioned explicitly in the spec.
> >>>>
> >>>
> >>> I mentioned private because it changes the mapping from MAP_SHARED
> >>> to MAP_PRIVATE. I will highlight PROT_NONE instead.
> >>
> >> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> >> mean mmap flags or something like the memory range is no longer
> >> accessible to the driver.
> >
> > One more thing: please check whether kvm.ko memory regions need to be
> > modified or split to match the SHMEM_MAP mapping's read/write
> > permissions.
> >
> > The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE,
> > PROT_READ|PROT_WRITE, or PROT_NONE.
> >
> > kvm.ko memory regions are read/write or read-only. I'm not sure what
> > happens when the guest accesses a kvm.ko memory region containing
> > mappings with permissions more restrictive than its kvm.ko memory
> > region.
>
> IIRC, the KVM R/O memory region requests could allow to further reduce
> permissions (assuming your mmap is R/W you could map it R/O into the KVM
> MMU), but I might remember things incorrectly.
>

I'm thinking about the opposite case where KVM is configured for R/W but
the mmap is more restrictive. This patch series makes this scenario
possible.


>
> > In other words, the kvm.ko memory region would allow the
> > access but the Linux virtual memory configuration would cause a page
> > fault.
> >
> > For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
> > mapping with PROT_READ. The kvm.ko memory region would be read/write
> > (unless extra steps were taken to tell kvm.ko about the permissions).
> > When the guest stores to the PROT_READ page, kvm.ko will process a
> > fault...and I'm not sure what happens next.
> >
> > A similar scenario occurs when a PROT_NONE mapping exists within a
> > kvm.ko memory region. I don't remember how kvm.ko behaves when the
> > guest tries to access the pages.
> >
> > It's worth figuring this out before going further because it could
> > become tricky if issues are discovered later. I have CCed David
> > Hildenbrand in case he knows.
>
>
> One relevant piece is likely:
>
> "When the KVM_CAP_SYNC_MMU capability is available, changes in the
> backing of the memory region are automatically reflected into the guest.
>   For example, an mmap() that affects the region will be made visible
> immediately. "
>
> We can already effectively get R/O or PROT_NONE PTEs in
> PROT_READ|PROT_WRITE mappings, and the KVM must be able to handle that
> properly -- triggering a page fault to let core-MM resolve that.
>
> If we have a PROT_NONE VMA and the guest writes to it, we'd likely end
> up (to resolve the KVM MMU page fault) in
> hva_to_pfn_slow()->get_user_pages_unlocked(), which would return -EFAULT.
>
> Not sure if we really inject a page fault into the guest or if the KVM
> run would effectively fail with -EFAULT and make user space unhappy.
> Certainly something to play with!
>

I think KVM probably treats the fault as a misconfiguration that host user
space needs to resolve. There might be no way to inject a page fault into
the guest because the access is allowed according to the guest page tables
and the guest fault handlers aren't prepared to deal with a spurious fault.

Let's see what happens in practice. I think the cleanest solution would be
to create separate kvm.ko memory regions with the appropriate permissions
(i.e. might require creating multiple MemoryRegions inside QEMU for a
single VIRTIO Shared Memory Region when permissions vary between mmaps).

Stefan


> --
> Cheers,
>
> David / dhildenb
>
>
Albert Esteve June 26, 2024, 7:53 a.m. UTC | #7
Hi Stefan,

On Wed, Jun 5, 2024 at 4:28 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> >
> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > > >
> > > > This request allows backends to dynamically map
> > > > fds into a shared memory region indentified by
> > >
> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > > vhost-user spec, commit description, etc) so it's clear that this is
> not
> > > about vhost-user shared memory tables/regions.
> > >
> > > > its `shmid`. Then, the fd memory is advertised
> > > > to the frontend through a BAR+offset, so it can
> > > > be read by the driver while its valid.
> > >
> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > > report VIRTIO Shared Memory Regions to the driver.
> > >
> > >
> > I will remove PCI BAR, as it is true that it depends on the
> > transport. I was trying to explain that the driver
> > will use the shm_base + shm_offset to access
> > the mapped memory.
> >
> >
> > > >
> > > > Then, the backend can munmap the memory range
> > > > in a given shared memory region (again, identified
> > > > by its `shmid`), to free it. After this, the
> > > > region becomes private and shall not be accessed
> > > > by the frontend anymore.
> > >
> > > What does "private" mean?
> > >
> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > > unrelated mmap(NULL, ...) might use that address range and the guest
> > > would have access to the host memory! This is a security issue and
> needs
> > > to be mentioned explicitly in the spec.
> > >
> >
> > I mentioned private because it changes the mapping from MAP_SHARED
> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>
> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> mean mmap flags or something like the memory range is no longer
> accessible to the driver.
>
> >
> >
> > >
> > > >
> > > > Initializing the memory region is reponsiblity
> > > > of the PCI device that will using it.
> > >
> > > What does this mean?
> > >
> >
> > The MemoryRegion is declared in `struct VirtIODevice`,
> > but it is uninitialized in this commit. So I was trying to say
> > that the initialization will happen in, e.g., vhost-user-gpu-pci.c
> > with something like `memory_region_init` , and later `pci_register_bar`.
>
> Okay. The device model needs to create MemoryRegion instances for the
> device's Shared Memory Regions and add them to the VirtIODevice.
>
> --device vhost-user-device will need to query the backend since, unlike
> vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
> device types. It will need to create MemoryRegions enumerated from the
> backend.
>
> By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
> Regions so this work should not be tied to PCI.
>
> >
> > I am testing that part still.
> >
> >
> > >
> > > >
> > > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
> > > > ---
> > > >  docs/interop/vhost-user.rst |  23 ++++++++
> > > >  hw/virtio/vhost-user.c      | 106
> ++++++++++++++++++++++++++++++++++++
> > > >  hw/virtio/virtio.c          |   2 +
> > > >  include/hw/virtio/virtio.h  |   3 +
> > > >  4 files changed, 134 insertions(+)
> > >
> > > Two missing pieces:
> > >
> > > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> > > Shared Memory Regions from the vhost-user backend. vhost-user-device is
> > > a generic vhost-user frontend without knowledge of the device type, so
> > > it doesn't know what the valid shmids are and what size the regions
> > > have.
> > >
> >
> > Ok. I was assuming that if a device (backend) makes a request without a
> > valid shmid or not enough size in the region to perform the mmap, would
> > just fail in the VMM performing the actual mmap operation. So it would
> > not necessarily need to know about valid shmids or region sizes.
>
> But then --device vhost-user-device wouldn't be able to support VIRTIO
> Shared Memory Regions, which means this patch series is incomplete. New
> vhost-user features need to support both --device vhost-user-<type>-*
> and --device vhost-user-device.
>
> What's needed is:
> 1. New vhost-user messages so the frontend can query the shmids and
>    sizes from the backend.
> 2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
>    Memory Regions from the vhost-user backend and then creates
>    MemoryRegions for them.
>
> >
> >
> > >
> > > 2. Other backends don't see these mappings. If the guest submits a
> vring
> > > descriptor referencing a mapping to another backend, then that backend
> > > won't be able to access this memory. David Gilbert hit this problem when
> > > working on the virtiofs DAX Window. Either the frontend needs to forward
> > > all SHMEM_MAP/UNMAP messages to the other backends (inefficient and
> > > maybe racy!) or a new "memcpy" message is needed as a fallback for when
> > > vhost-user memory region translation fails.
> > >
> >
> > Ok. In which scenario would another device want to access the same mapping?
> > If it is for a shared VIRTIO object, the device that receives the dmabuf fd
> > could just do a new mapping in its VIRTIO shared memory region.
>
> You can reproduce this with two virtiofs devices. Where device A has the
> DAX Window feature enabled. The test program mmaps the DAX file and then
> opens a new file on device B and performs an O_DIRECT write(2) syscall.
>
> I think that this issue must be addressed in this series since this issue
> is introduced by enabling VIRTIO Shared Memory Regions in vhost-user.
>
> A bit more detail:
>
> Device A has a VIRTIO Shared Memory Region. An application mmaps that
> memory (examples: guest userspace driver using Linux VFIO, a guest
> kernel driver that exposes the memory to userspace via mmap, or guest
> kernel DAX). The application passes that memory as an I/O buffer to
> device B (e.g. O_DIRECT disk I/O).
>
> The result is that device B's vhost-user backend receives a vring
> descriptor that points to a guest memory address in device A's VIRTIO
> Shared Memory Region. Since device B does not have this memory in its
> table, it cannot translate the address and the device breaks.
>
>
I am working on the next version, and I am struggling with what to do
to solve this situation.
So IIUC it is device B (in your example) that will notice that it cannot
translate, and then its backend will send a message (let's say,
SHMEM_COPY) to the frontend so that it copies all mappings
done on device A to device B. Something like that?
Maybe it would be better to copy only the mapping that triggered the
message?

But the message would have to point to device A so that
the frontend knows the src and dest to perform the memcpy,
and indicate the memory range that we want to map (offset + len).
I am not sure how the frontend would know that it needs to copy
from device A to device B after receiving the new message,
or how it would get a pointer to both devices (device B is easy, as it is
the one that sent the message, but not device A).
Or maybe I misunderstood your suggested solution.

BR,
Albert.


> >
> >
> > >
> > > >
> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > index d8419fd2f1..3caf2a290c 100644
> > > > --- a/docs/interop/vhost-user.rst
> > > > +++ b/docs/interop/vhost-user.rst
> > > > @@ -1859,6 +1859,29 @@ is sent by the front-end.
> > > >    when the operation is successful, or non-zero otherwise. Note that if the
> > > >    operation fails, no fd is sent to the backend.
> > > >
> > > > +``VHOST_USER_BACKEND_SHMEM_MAP``
> > > > +  :id: 9
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: fd and ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends to advertise a new mapping
> > > > +  to be made in a given shared memory region. Upon receiving the message,
> > > > +  QEMU will mmap the given fd into the shared memory region with the
> > >
> > > s/QEMU/the frontend/
> > >
> > > > +  requested ``shmid``. A reply is generated indicating whether mapping
> > > > +  succeeded.
> > >
> > > Please document whether mapping over an existing mapping is allowed. I
> > > think it should be allowed because it might be useful to atomically
> > > update a mapping without a race where the driver sees unmapped memory.
> > >
> > >
> > So in my understanding, the frontend (driver) in the guest would initiate
> > the mmap/munmap by sending a request to the backend (vhost-user device).
> > Then the vhost-user device sends a request to the VMM to perform the
> > mapping. We could enforce an ACK to ensure that the mmap operation finished
> > before the vhost-user device responds to the driver, and thus avoid
> > races. This way, we support only the simple use case of not allowing
> > mmaps over an already mapped region.
>
> I think it's fine to start with the stricter model where no overlapping
> mappings are allowed. If someone needs overlapping mappings in the
> future, a feature bit could be added to negotiate support.
>
> Please document that overlapping mappings are not supported.
>
> > > If mapping over an existing mapping is allowed, does the new mapping
> > > need to cover the old mapping exactly, or can it span multiple previous
> > > mappings or a subset of an existing mapping?
> > >
> > > From a security point of view we need to be careful here. A potentially
> > > untrusted backend process now has the ability to mmap an arbitrary file
> > > descriptor into the frontend process. The backend can cause
> > > denial of service by creating many small mappings to exhaust the OS
> > > limits on virtual memory areas. The backend can map memory to use as
> > > part of a security compromise, so we need to be sure the virtual memory
> > > addresses are not leaked to the backend and only read/write page
> > > permissions are available.
> > >
> >
> > Right, security from untrusted backends is usually the hardest part for me.
> > But vhost-user devices only see shm_offset, so this should be safe, no?
>
> Yes, I think the current interface is safe, too.
>
> > >
> > > > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> > > > +  :id: 10
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends so that QEMU un-mmap
> > >
> > > s/QEMU/the frontend/
> > >
> >
> > This is probably my bad, but I really thought the frontend is the driver.
> > So frontend/backend as alternative terms for vhost-user driver/device.
>
> "vhost-user driver" is not a term that's used, as far as I know. The
> vhost-user spec calls it the front-end (older code and docs may call it
> the vhost-user master).
>
> > And then here we would keep QEMU or use VMM instead?
>
> The vhost-user spec uses both front-end and QEMU to mean the same thing.
> VMM is not used in the vhost-user spec. I suggest using front-end in new
> spec text because QEMU is not the only application that implements this
> spec anymore.
>
> >
> >
> > >
> > > > +  a given range (``offset``, ``len``) in the shared memory region with the
> > > > +  requested ``shmid``.
> > >
> > > Does the range need to correspond to a previously-mapped VhostUserMMap
> > > or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
> > > etc?
> > >
> >
> > I would prefer to keep it simple and disallow mapping over a
> > previously-mapped region. The range needs to correspond to a valid
> > (within size bound), free memory region, or else the request will fail.
> > I will add that to the text.
> >
> > Nonetheless, I am open to discuss other options.
>
> That sounds good. I'm not aware of any use cases that require anything
> fancy.
>
> >
> >
> > >
> > > > +  A reply is generated indicating whether unmapping succeeded.
> > > > +
> > > >  .. _reply_ack:
> > > >
> > > >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index cdf9af4a4b..9526b9d07f 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> > > > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> > > > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> > > >      VHOST_USER_BACKEND_MAX
> > > >  }  VhostUserBackendRequest;
> > > >
> > > > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
> > > >      unsigned char uuid[16];
> > > >  } VhostUserShared;
> > > >
> > > > +/* For the flags field of VhostUserMMap */
> > > > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> > > > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> > > > +
> > > > +typedef struct {
> > > > +    /* Shared memory region ID */
> > > > +    uint8_t shmid;
> > >
> > > There is a hole (padding) in the struct since the following fields are
> > > naturally aligned to 8 bytes. I suggest moving shmid to the end to
> > > reduce the chance of information leaks if padding is not zeroed.
> > >
> >
> > I see. I can move it to the end of the struct or add a padding field in
> > between. I'll do what you suggested, as it sounds like the simplest solution.
> >
> >
> > >
> > > > +    /* File offset */
> > > > +    uint64_t fd_offset;
> > > > +    /* Offset within the shared memory region */
> > > > +    uint64_t shm_offset;
> > > > +    /* Size of region to map */
> > >
> > > To avoid giving "region" additional meanings:
> > >
> > > s/Size of the region to map/Size of the mapping/
> > >
> >
> > Ok, I will change it in the next drop. Probably will keep the RFC for
> > at least one more patch, seeing that there are a few things I wasn't
> > correctly considering.
> >
> > Thanks for all the feedback!
>
> Great, thanks!
>
> Stefan
>
> >
> >
> > >
> > > > +    uint64_t len;
> > > > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> > > > +    uint64_t flags;
> > > > +} VhostUserMMap;
> > > > +
> > > >  typedef struct {
> > > >      VhostUserRequest request;
> > > >
> > > > @@ -224,6 +243,7 @@ typedef union {
> > > >          VhostUserInflight inflight;
> > > >          VhostUserShared object;
> > > >          VhostUserTransferDeviceState transfer_state;
> > > > +        VhostUserMMap mmap;
> > > >  } VhostUserPayload;
> > > >
> > > >  typedef struct VhostUserMsg {
> > > > @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
> > > >      return 0;
> > > >  }
> > > >
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> > > > +                                  VhostUserMMap *vu_mmap,
> > > > +                                  int fd)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (fd < 0) {
> > > > +        error_report("Bad fd for map");
> > > > +        return -EBADF;
> > > > +    }
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d uninitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> > > > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to mmap mem fd");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> > > > +                                    VhostUserMMap *vu_mmap)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d uninitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
> > > > +
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to unmap memory");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > >  static void close_backend_channel(struct vhost_user *u)
> > > >  {
> > > >      g_source_destroy(u->backend_src);
> > > > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc,
> > > GIOCondition condition,
> > > >          ret =
> > > vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
> > > >                                                               &hdr,
> > > &payload);
> > > >          break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_MAP:
> > > > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> > > > +                                                fd ? fd[0] : -1);
> > > > +        break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> > > > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> > > > +        break;
> > > >      default:
> > > >          error_report("Received unexpected msg type: %d.", hdr.request);
> > > >          ret = -EINVAL;
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > > index 893a072c9d..59596370ec 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
> > > >              virtio_vmstate_change, vdev);
> > > >      vdev->device_endian = virtio_default_endian();
> > > >      vdev->use_guest_notifier_mask = true;
> > > > +    vdev->shmem_list = NULL;
> > > > +    vdev->n_shmem_regions = 0;
> > > >  }
> > > >
> > > >  /*
> > > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > > index 7d5ffdc145..34bec26593 100644
> > > > --- a/include/hw/virtio/virtio.h
> > > > +++ b/include/hw/virtio/virtio.h
> > > > @@ -165,6 +165,9 @@ struct VirtIODevice
> > > >       */
> > > >      EventNotifier config_notifier;
> > > >      bool device_iotlb_enabled;
> > > > +    /* Shared memory region for vhost-user mappings. */
> > > > +    MemoryRegion *shmem_list;
> > > > +    int n_shmem_regions;
> > > >  };
> > > >
> > > >  struct VirtioDeviceClass {
> > > > --
> > > > 2.44.0
> > > >
> > >
>
Stefan Hajnoczi June 26, 2024, 1:31 p.m. UTC | #8
On Wed, 26 Jun 2024 at 03:54, Albert Esteve <aesteve@redhat.com> wrote:
>
> Hi Stefan,
>
> On Wed, Jun 5, 2024 at 4:28 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
>> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> >
>> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
>> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
>> > > >
>> > > > This request allows backends to dynamically map
>> > > > fds into a shared memory region identified by
>> > >
>> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
>> > > vhost-user spec, commit description, etc) so it's clear that this is not
>> > > about vhost-user shared memory tables/regions.
>> > >
>> > > > its `shmid`. Then, the fd memory is advertised
>> > > > to the frontend through a BAR+offset, so it can
>> > > > be read by the driver while its valid.
>> > >
>> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
>> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
>> > > report VIRTIO Shared Memory Regions to the driver.
>> > >
>> > >
>> > I will remove PCI BAR, as it is true that it depends on the
>> > transport. I was trying to explain that the driver
>> > will use the shm_base + shm_offset to access
>> > the mapped memory.
>> >
>> >
>> > > >
>> > > > Then, the backend can munmap the memory range
>> > > > in a given shared memory region (again, identified
>> > > > by its `shmid`), to free it. After this, the
>> > > > region becomes private and shall not be accessed
>> > > > by the frontend anymore.
>> > >
>> > > What does "private" mean?
>> > >
>> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
>> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
>> > > unrelated mmap(NULL, ...) might use that address range and the guest
>> > > would have access to the host memory! This is a security issue and needs
>> > > to be mentioned explicitly in the spec.
>> > >
>> >
>> > I mentioned private because it changes the mapping from MAP_SHARED
>> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>>
>> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
>> mean mmap flags or something like the memory range is no longer
>> accessible to the driver.
>>
>> >
>> >
>> > >
>> > > >
>> > > > Initializing the memory region is the responsibility
>> > > > of the PCI device that will be using it.
>> > >
>> > > What does this mean?
>> > >
>> >
>> > The MemoryRegion is declared in `struct VirtIODevice`,
>> > but it is uninitialized in this commit. So I was trying to say
>> > that the initialization will happen in, e.g., vhost-user-gpu-pci.c
>> > with something like `memory_region_init` , and later `pci_register_bar`.
>>
>> Okay. The device model needs to create MemoryRegion instances for the
>> device's Shared Memory Regions and add them to the VirtIODevice.
>>
>> --device vhost-user-device will need to query the backend since, unlike
>> vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
>> device types. It will need to create MemoryRegions enumerated from the
>> backend.
>>
>> By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
>> Regions so this work should not be tied to PCI.
>>
>> >
>> > I am testing that part still.
>> >
>> >
>> > >
>> > > >
>> > > > Signed-off-by: Albert Esteve <aesteve@redhat.com>
>> > > > ---
>> > > >  docs/interop/vhost-user.rst |  23 ++++++++
>> > > >  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
>> > > >  hw/virtio/virtio.c          |   2 +
>> > > >  include/hw/virtio/virtio.h  |   3 +
>> > > >  4 files changed, 134 insertions(+)
>> > >
>> > > Two missing pieces:
>> > >
>> > > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
>> > > Shared Memory Regions from the vhost-user backend. vhost-user-device is
>> > > a generic vhost-user frontend without knowledge of the device type, so
>> > > it doesn't know what the valid shmids are and what size the regions
>> > > have.
>> > >
>> >
>> > Ok. I was assuming that if a device (backend) makes a request without a
>> > valid shmid or not enough size in the region to perform the mmap, would
>> > just fail in the VMM performing the actual mmap operation. So it would
>> > not necessarily need to know about valid shmids or region sizes.
>>
>> But then --device vhost-user-device wouldn't be able to support VIRTIO
>> Shared Memory Regions, which means this patch series is incomplete. New
>> vhost-user features need to support both --device vhost-user-<type>-*
>> and --device vhost-user-device.
>>
>> What's needed is:
>> 1. New vhost-user messages so the frontend can query the shmids and
>>    sizes from the backend.
>> 2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
>>    Memory Regions from the vhost-user backend and then creates
>>    MemoryRegions for them.
>>
>> >
>> >
>> > >
>> > > 2. Other backends don't see these mappings. If the guest submits a vring
>> > > descriptor referencing a mapping to another backend, then that backend
>> > > won't be able to access this memory. David Gilbert hit this problem when
>> > > working on the virtiofs DAX Window. Either the frontend needs to forward
>> > > all SHMEM_MAP/UNMAP messages to the other backends (inefficient and
>> > > maybe racy!) or a new "memcpy" message is needed as a fallback for when
>> > > vhost-user memory region translation fails.
>> > >
>> >
>> > Ok. In which scenario would another device want to access the same mapping?
>> > If it is for a shared VIRTIO object, the device that receives the dmabuf fd
>> > could
>> > just do a new mapping in its VIRTIO shared memory region.
>>
>> You can reproduce this with two virtiofs devices. Where device A has the
>> DAX Window feature enabled. The test program mmaps the DAX file and then
>> opens a new file on device B and performs an O_DIRECT write(2) syscall.
>>
>> I think that this issue must be addressed in this series since this issue
>> is introduced by enabling VIRTIO Shared Memory Regions in vhost-user.
>>
>> A bit more detail:
>>
>> Device A has a VIRTIO Shared Memory Region. An application mmaps that
>> memory (examples: guest userspace driver using Linux VFIO, a guest
>> kernel driver that exposes the memory to userspace via mmap, or guest
>> kernel DAX). The application passes that memory as an I/O buffer to
>> device B (e.g. O_DIRECT disk I/O).
>>
>> The result is that device B's vhost-user backend receives a vring
>> descriptor that points to a guest memory address in device A's VIRTIO
>> Shared Memory Region. Since device B does not have this memory in its
>> table, it cannot translate the address and the device breaks.
>>
>
> I am working on the next version, and I am struggling with what to do
> to solve this situation.
> So IIUC it is device B (in your example) that will notice that it cannot
> translate, and then its backend will send a message (let's say,
> SHMEM_COPY) to the frontend so that it copies all mappings
> done on device A to device B. Something like that?
> Maybe it would be better to copy only the mapping that triggered the
> message?
>
> But the message would have to point to device A so that
> the frontend knows the src and dest to perform the memcpy,
> and indicate the memory range that we want to map (offset + len).
> I am not sure how the frontend would know that it needs to copy
> from device A to device B after receiving the new message,
> or how it would get a pointer to both devices (device B is easy, as it is
> the one that sent the message, but not device A).
> Or maybe I misunderstood your suggested solution.

I can see two types of solutions:

1. Propagating mmaps to devices, either eagerly or lazily. It sounds
like you're thinking about that when you say "copies all mappings done
on device A to device B".
2. A memcpy command where the frontend reads from the guest memory
space or writes to the guest memory space. Here the mmaps are never
transmitted to vhost-user device backends.

My preference is for #2 because it keeps the vhost-user protocol
simple. #1 is complex when done lazily and basically on par with the
complexity of an IOMMU interface.

The following shows the new vhost-user protocol commands:

Back-end message types
...
VHOST_USER_BACKEND_MEM_READ
Parameters:
- u64 guest address
- u32 size

The frontend replies with a buffer containing @size bytes at @guest_address.

VHOST_USER_BACKEND_MEM_WRITE
Parameters:
- u64 guest address
- u32 size
- u8 data[]

The frontend writes @size bytes from @data[] to guest memory at @guest_address.

The backend virtqueue code must be extended to handle guest address
translation failures by sending
VHOST_USER_BACKEND_MEM_READ/VHOST_USER_BACKEND_MEM_WRITE messages to
the frontend. This is invasive because any guest memory access could
fail, including vring fields as well as virtqueue element in/out data
buffers. It involves introducing bounce buffering when the backend
falls back to using these new protocol messages.

An even simpler alternative is an eager implementation of #1, where
the frontend sends mmaps to all other device backends so that they
always have a complete mem table. In this case no changes are needed
to the backend virtqueue code, but performance suffers when devices
send shmem map/unmap requests at runtime rather than only during
startup. An example of a device like this is virtiofs (the DAX Window
feature).

David Gilbert was implementing something like #2 but with protocol
messages that carry an fd so that the frontend performs a read(2) or
write(2) instead of transferring the contents of the data buffer. It's
kind of a zero-copy optimization but only works when the backend wants
to do read(2) or write(2), so it's less generic than memory read and
write protocol messages.

Finally, one thing I like about #2 is that it introduces an option for
doing vhost-user without shared memory altogether! A backend could
rely solely on the memory read and write messages. That can be useful
for portability and security, although it's bad for performance.

Stefan

>
> BR,
> Albert.
>
>>
>> >
>> >
>> > >
>> > > >
>> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
>> > > > index d8419fd2f1..3caf2a290c 100644
>> > > > --- a/docs/interop/vhost-user.rst
>> > > > +++ b/docs/interop/vhost-user.rst
>> > > > @@ -1859,6 +1859,29 @@ is sent by the front-end.
>> > > >    when the operation is successful, or non-zero otherwise. Note that if the
>> > > >    operation fails, no fd is sent to the backend.
>> > > >
>> > > > +``VHOST_USER_BACKEND_SHMEM_MAP``
>> > > > +  :id: 9
>> > > > +  :equivalent ioctl: N/A
>> > > > +  :request payload: fd and ``struct VhostUserMMap``
>> > > > +  :reply payload: N/A
>> > > > +
>> > > > +  This message can be submitted by the backends to advertise a new mapping
>> > > > +  to be made in a given shared memory region. Upon receiving the message,
>> > > > +  QEMU will mmap the given fd into the shared memory region with the
>> > >
>> > > s/QEMU/the frontend/
>> > >
>> > > > +  requested ``shmid``. A reply is generated indicating whether mapping
>> > > > +  succeeded.
>> > >
>> > > Please document whether mapping over an existing mapping is allowed. I
>> > > think it should be allowed because it might be useful to atomically
>> > > update a mapping without a race where the driver sees unmapped memory.
>> > >
>> > >
>> > So in my understanding, the frontend (driver) in the guest would initiate
>> > the mmap/munmap by sending a request to the backend (vhost-user device).
>> > Then the vhost-user device sends a request to the VMM to perform the
>> > mapping. We could enforce an ACK to ensure that the mmap operation finished
>> > before the vhost-user device responds to the driver, and thus avoid
>> > races. This way, we support only the simple use case of not allowing
>> > mmaps over an already mapped region.
>>
>> I think it's fine to start with the stricter model where no overlapping
>> mappings are allowed. If someone needs overlapping mappings in the
>> future, a feature bit could be added to negotiate support.
>>
>> Please document that overlapping mappings are not supported.
>>
>> > > If mapping over an existing mapping is allowed, does the new mapping
>> > > need to cover the old mapping exactly, or can it span multiple previous
>> > > mappings or a subset of an existing mapping?
>> > >
>> > > From a security point of view we need to be careful here. A potentially
>> > > untrusted backend process now has the ability to mmap an arbitrary file
>> > > descriptor into the frontend process. The backend can cause
>> > > denial of service by creating many small mappings to exhaust the OS
>> > > limits on virtual memory areas. The backend can map memory to use as
>> > > part of a security compromise, so we need to be sure the virtual memory
>> > > addresses are not leaked to the backend and only read/write page
>> > > permissions are available.
>> > >
>> >
>> > Right, security from untrusted backends is usually the hardest part for me.
>> > But vhost-user devices only see shm_offset, so this should be safe, no?
>>
>> Yes, I think the current interface is safe, too.
>>
>> > >
>> > > > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
>> > > > +  :id: 10
>> > > > +  :equivalent ioctl: N/A
>> > > > +  :request payload: ``struct VhostUserMMap``
>> > > > +  :reply payload: N/A
>> > > > +
>> > > > +  This message can be submitted by the backends so that QEMU un-mmap
>> > >
>> > > s/QEMU/the frontend/
>> > >
>> >
>> > This is probably my bad, but I really thought the frontend is the driver.
>> > So frontend/backend as alternative terms for vhost-user driver/device.
>>
>> "vhost-user driver" is not a term that's used, as far as I know. The
>> vhost-user spec calls it the front-end (older code and docs may call it
>> the vhost-user master).
>>
>> > And then here we would keep QEMU or use VMM instead?
>>
>> The vhost-user spec uses both front-end and QEMU to mean the same thing.
>> VMM is not used in the vhost-user spec. I suggest using front-end in new
>> spec text because QEMU is not the only application that implements this
>> spec anymore.
>>
>> >
>> >
>> > >
>> > > > +  a given range (``offset``, ``len``) in the shared memory region with the
>> > > > +  requested ``shmid``.
>> > >
>> > > Does the range need to correspond to a previously-mapped VhostUserMMap
>> > > or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
>> > > etc?
>> > >
>> >
>> > I would prefer to keep it simple and disallow mapping over a
>> > previously-mapped region. The range needs to correspond to a valid
>> > (within size bound), free memory region, or else the request will fail.
>> > I will add that to the text.
>> >
>> > Nonetheless, I am open to discuss other options.
>>
>> That sounds good. I'm not aware of any use cases that require anything fancy.
>>
>> >
>> >
>> > >
>> > > > +  A reply is generated indicating whether unmapping succeeded.
>> > > > +
>> > > >  .. _reply_ack:
>> > > >
>> > > >  VHOST_USER_PROTOCOL_F_REPLY_ACK
>> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> > > > index cdf9af4a4b..9526b9d07f 100644
>> > > > --- a/hw/virtio/vhost-user.c
>> > > > +++ b/hw/virtio/vhost-user.c
>> > > > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
>> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
>> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
>> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
>> > > > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
>> > > > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
>> > > >      VHOST_USER_BACKEND_MAX
>> > > >  }  VhostUserBackendRequest;
>> > > >
>> > > > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
>> > > >      unsigned char uuid[16];
>> > > >  } VhostUserShared;
>> > > >
>> > > > +/* For the flags field of VhostUserMMap */
>> > > > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
>> > > > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
>> > > > +
>> > > > +typedef struct {
>> > > > +    /* Shared memory region ID */
>> > > > +    uint8_t shmid;
>> > >
>> > > There is a hole (padding) in the struct since the following fields are
>> > > naturally aligned to 8 bytes. I suggest moving shmid to the end to
>> > > reduce the chance of information leaks if padding is not zeroed.
>> > >
>> >
>> > I see. I can move it to the end of the struct or add a padding field in
>> > between. I'll do what you suggested, as it sounds like the simplest solution.
>> >
>> >
>> > >
>> > > > +    /* File offset */
>> > > > +    uint64_t fd_offset;
>> > > > +    /* Offset within the shared memory region */
>> > > > +    uint64_t shm_offset;
>> > > > +    /* Size of region to map */
>> > >
>> > > To avoid giving "region" additional meanings:
>> > >
>> > > s/Size of the region to map/Size of the mapping/
>> > >
>> >
>> > Ok, I will change it in the next drop. Probably will keep the RFC for
>> > at least one more patch, seeing that there are a few things I wasn't
>> > correctly considering.
>> >
>> > Thanks for all the feedback!
>>
>> Great, thanks!
>>
>> Stefan
>>
>> >
>> >
>> > >
>> > > > +    uint64_t len;
>> > > > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
>> > > > +    uint64_t flags;
>> > > > +} VhostUserMMap;
>> > > > +
>> > > >  typedef struct {
>> > > >      VhostUserRequest request;
>> > > >
>> > > > @@ -224,6 +243,7 @@ typedef union {
>> > > >          VhostUserInflight inflight;
>> > > >          VhostUserShared object;
>> > > >          VhostUserTransferDeviceState transfer_state;
>> > > > +        VhostUserMMap mmap;
>> > > >  } VhostUserPayload;
>> > > >
>> > > >  typedef struct VhostUserMsg {
>> > > > @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
>> > > >      return 0;
>> > > >  }
>> > > >
>> > > > +static int
>> > > > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
>> > > > +                                  VhostUserMMap *vu_mmap,
>> > > > +                                  int fd)
>> > > > +{
>> > > > +    void *addr = 0;
>> > > > +    MemoryRegion *mr = NULL;
>> > > > +
>> > > > +    if (fd < 0) {
>> > > > +        error_report("Bad fd for map");
>> > > > +        return -EBADF;
>> > > > +    }
>> > > > +
>> > > > +    if (!dev->vdev->shmem_list ||
>> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
>> > > > +        error_report("Shared memory region at "
>> > > > +                     "ID %d unitialized", vu_mmap->shmid);
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
>> > > > +
>> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
>> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
>> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
>> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
>> > > > +
>> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
>> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
>> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
>> > > > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
>> > > > +    if (addr == MAP_FAILED) {
>> > > > +        error_report("Failed to mmap mem fd");
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    return 0;
>> > > > +}
>> > > > +
>> > > > +static int
>> > > > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
>> > > > +                                    VhostUserMMap *vu_mmap)
>> > > > +{
>> > > > +    void *addr = 0;
>> > > > +    MemoryRegion *mr = NULL;
>> > > > +
>> > > > +    if (!dev->vdev->shmem_list ||
>> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
>> > > > +        error_report("Shared memory region at "
>> > > > +                     "ID %d unitialized", vu_mmap->shmid);
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
>> > > > +
>> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
>> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
>> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
>> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
>> > > > +
>> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
>> > > > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
>> > > > +
>> > > > +    if (addr == MAP_FAILED) {
>> > > > +        error_report("Failed to unmap memory");
>> > > > +        return -EFAULT;
>> > > > +    }
>> > > > +
>> > > > +    return 0;
>> > > > +}
>> > > > +
>> > > >  static void close_backend_channel(struct vhost_user *u)
>> > > >  {
>> > > >      g_source_destroy(u->backend_src);
>> > > > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc, GIOCondition condition,
>> > > >          ret = vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
>> > > >                                                               &hdr, &payload);
>> > > >          break;
>> > > > +    case VHOST_USER_BACKEND_SHMEM_MAP:
>> > > > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
>> > > > +                                                fd ? fd[0] : -1);
>> > > > +        break;
>> > > > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
>> > > > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
>> > > > +        break;
>> > > >      default:
>> > > >          error_report("Received unexpected msg type: %d.", hdr.request);
>> > > >          ret = -EINVAL;
>> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> > > > index 893a072c9d..59596370ec 100644
>> > > > --- a/hw/virtio/virtio.c
>> > > > +++ b/hw/virtio/virtio.c
>> > > > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
>> > > >              virtio_vmstate_change, vdev);
>> > > >      vdev->device_endian = virtio_default_endian();
>> > > >      vdev->use_guest_notifier_mask = true;
>> > > > +    vdev->shmem_list = NULL;
>> > > > +    vdev->n_shmem_regions = 0;
>> > > >  }
>> > > >
>> > > >  /*
>> > > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> > > > index 7d5ffdc145..34bec26593 100644
>> > > > --- a/include/hw/virtio/virtio.h
>> > > > +++ b/include/hw/virtio/virtio.h
>> > > > @@ -165,6 +165,9 @@ struct VirtIODevice
>> > > >       */
>> > > >      EventNotifier config_notifier;
>> > > >      bool device_iotlb_enabled;
>> > > > +    /* Shared memory region for vhost-user mappings. */
>> > > > +    MemoryRegion *shmem_list;
>> > > > +    int n_shmem_regions;
>> > > >  };
>> > > >
>> > > >  struct VirtioDeviceClass {
>> > > > --
>> > > > 2.44.0
>> > > >
>> > >

Patch

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index d8419fd2f1..3caf2a290c 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1859,6 +1859,29 @@  is sent by the front-end.
   when the operation is successful, or non-zero otherwise. Note that if the
   operation fails, no fd is sent to the backend.
 
+``VHOST_USER_BACKEND_SHMEM_MAP``
+  :id: 9
+  :equivalent ioctl: N/A
+  :request payload: fd and ``struct VhostUserMMap``
+  :reply payload: N/A
+
+  This message can be submitted by a backend to request a new mapping
+  within a given shared memory region. Upon receiving the message,
+  QEMU will mmap the given fd into the shared memory region with the
+  requested ``shmid``. A reply is generated indicating whether the
+  mapping succeeded.
+
+``VHOST_USER_BACKEND_SHMEM_UNMAP``
+  :id: 10
+  :equivalent ioctl: N/A
+  :request payload: ``struct VhostUserMMap``
+  :reply payload: N/A
+
+  This message can be submitted by a backend so that QEMU unmaps
+  a given range (``offset``, ``len``) in the shared memory region with the
+  requested ``shmid``.
+  A reply is generated indicating whether unmapping succeeded.
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cdf9af4a4b..9526b9d07f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -115,6 +115,8 @@  typedef enum VhostUserBackendRequest {
     VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
     VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
     VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
+    VHOST_USER_BACKEND_SHMEM_MAP = 9,
+    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
     VHOST_USER_BACKEND_MAX
 }  VhostUserBackendRequest;
 
@@ -192,6 +194,23 @@  typedef struct VhostUserShared {
     unsigned char uuid[16];
 } VhostUserShared;
 
+/* For the flags field of VhostUserMMap */
+#define VHOST_USER_FLAG_MAP_R (1u << 0)
+#define VHOST_USER_FLAG_MAP_W (1u << 1)
+
+typedef struct {
+    /* Shared memory region ID */
+    uint8_t shmid;
+    /* File offset */
+    uint64_t fd_offset;
+    /* Offset within the shared memory region */
+    uint64_t shm_offset;
+    /* Size of the mapping */
+    uint64_t len;
+    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
+    uint64_t flags;
+} VhostUserMMap;
+
 typedef struct {
     VhostUserRequest request;
 
@@ -224,6 +243,7 @@  typedef union {
         VhostUserInflight inflight;
         VhostUserShared object;
         VhostUserTransferDeviceState transfer_state;
+        VhostUserMMap mmap;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1748,6 +1768,85 @@  vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
     return 0;
 }
 
+static int
+vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
+                                  VhostUserMMap *vu_mmap,
+                                  int fd)
+{
+    void *addr = 0;
+    MemoryRegion *mr = NULL;
+
+    if (fd < 0) {
+        error_report("Bad fd for map");
+        return -EBADF;
+    }
+
+    if (!dev->vdev->shmem_list ||
+        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
+        error_report("Shared memory region at "
+                     "ID %d uninitialized", vu_mmap->shmid);
+        return -EFAULT;
+    }
+
+    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
+
+    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
+        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
+        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
+                     vu_mmap->shm_offset, vu_mmap->len);
+        return -EFAULT;
+    }
+
+    void *shmem_ptr = memory_region_get_ram_ptr(mr);
+
+    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
+        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
+        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
+        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
+    if (addr == MAP_FAILED) {
+        error_report("Failed to mmap mem fd");
+        return -EFAULT;
+    }
+
+    return 0;
+}
+
+static int
+vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
+                                    VhostUserMMap *vu_mmap)
+{
+    void *addr = 0;
+    MemoryRegion *mr = NULL;
+
+    if (!dev->vdev->shmem_list ||
+        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
+        error_report("Shared memory region at "
+                     "ID %d uninitialized", vu_mmap->shmid);
+        return -EFAULT;
+    }
+
+    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
+
+    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
+        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
+        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
+                     vu_mmap->shm_offset, vu_mmap->len);
+        return -EFAULT;
+    }
+
+    void *shmem_ptr = memory_region_get_ram_ptr(mr);
+
+    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
+                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
+
+    if (addr == MAP_FAILED) {
+        error_report("Failed to unmap memory");
+        return -EFAULT;
+    }
+
+    return 0;
+}
+
 static void close_backend_channel(struct vhost_user *u)
 {
     g_source_destroy(u->backend_src);
@@ -1816,6 +1915,13 @@  static gboolean backend_read(QIOChannel *ioc, GIOCondition condition,
         ret = vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
                                                              &hdr, &payload);
         break;
+    case VHOST_USER_BACKEND_SHMEM_MAP:
+        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
+                                                fd ? fd[0] : -1);
+        break;
+    case VHOST_USER_BACKEND_SHMEM_UNMAP:
+        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
+        break;
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
         ret = -EINVAL;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 893a072c9d..59596370ec 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3264,6 +3264,8 @@  void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
             virtio_vmstate_change, vdev);
     vdev->device_endian = virtio_default_endian();
     vdev->use_guest_notifier_mask = true;
+    vdev->shmem_list = NULL;
+    vdev->n_shmem_regions = 0;
 }
 
 /*
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 7d5ffdc145..34bec26593 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -165,6 +165,9 @@  struct VirtIODevice
      */
     EventNotifier config_notifier;
     bool device_iotlb_enabled;
+    /* Shared memory regions for vhost-user mappings. */
+    MemoryRegion *shmem_list;
+    int n_shmem_regions;
 };
 
 struct VirtioDeviceClass {