
[v1,1/2] vhost: Defer filtering memory sections until building the vhost memory structure

Message ID 20230216114752.198627-2-david@redhat.com (mailing list archive)
State New, archived
Series vhost: memslot handling improvements

Commit Message

David Hildenbrand Feb. 16, 2023, 11:47 a.m. UTC
Having multiple devices, some filtering memslots and some not filtering
memslots, messes up the "used_memslot" accounting. If a device that
filters out fewer memory sections is added after a device that filters
out more, we are in trouble, because our memslot checks stop working
reliably. For example, hotplugging a device that filters out fewer
memslots might end up passing the checks based on max vs. used memslots,
but can run out of memslots when getting notified about all memory
sections.

Further, it will be helpful in the memory device context in the near
future to know that a RAM memory region section will consume a memslot
and be accounted for in the used vs. free memslots, such that we can
properly implement reservation of memslots for memory devices. Whether a
device filters this out and would theoretically still have a free memslot
is then hidden internally, making overall vhost memslot accounting easier.

Let's filter the memslots when creating the vhost memory array,
accounting all RAM && !ROM memory regions as "used_memslots", even if
vhost-user isn't interested in anonymous RAM regions because it needs
an fd.

When a device actually filters out regions (which should happen rarely
in practice), we might detect a layout change although only filtered
regions changed. We won't bother about optimizing that for now.

Note: we cannot simply filter out the regions and count them as
"filtered" to add them to the used count, because filtered regions could
get merged and result in a smaller effective number of memslots. Further,
we won't touch the hmp/qmp virtio introspection output.

Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
Cc: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 55 insertions(+), 24 deletions(-)
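
For context, the "checks based on max vs. used memslots" mentioned above boil
down to something like the following (a paraphrased sketch of the pre-existing
logic in hw/virtio/vhost.c, not the verbatim code; names and details may
differ slightly):

    /* Paraphrased sketch -- not verbatim QEMU code. */
    static unsigned int used_memslots;   /* global; updated in vhost_commit() */

    bool vhost_has_free_slot(void)
    {
        unsigned int slots_limit = ~0U;
        struct vhost_dev *hdev;

        /* the tightest backend limit wins, e.g. the vhost kernel module's
         * max_mem_regions or the vhost-user protocol's region limit */
        QLIST_FOREACH(hdev, &vhost_devices, entry) {
            unsigned int r = hdev->vhost_ops->vhost_backend_memslots_limit(hdev);
            slots_limit = MIN(slots_limit, r);
        }
        return slots_limit > used_memslots;
    }

Memory device hotplug consults this check, so if a filtering device has just
overwritten used_memslots with a smaller value, the check can pass even though
a non-filtering device will later run out of regions.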

Comments

Michael S. Tsirkin Feb. 16, 2023, 12:04 p.m. UTC | #1
On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> Having multiple devices, some filtering memslots and some not filtering
> memslots, messes up the "used_memslot" accounting. If we'd have a device
> the filters out less memory sections after a device that filters out more,
> we'd be in trouble, because our memslot checks stop working reliably.
> For example, hotplugging a device that filters out less memslots might end
> up passing the checks based on max vs. used memslots, but can run out of
> memslots when getting notified about all memory sections.
> 
> Further, it will be helpful in memory device context in the near future
> to know that a RAM memory region section will consume a memslot, and be
> accounted for in the used vs. free memslots, such that we can implement
> reservation of memslots for memory devices properly. Whether a device
> filters this out and would theoretically still have a free memslot is
> then hidden internally, making overall vhost memslot accounting easier.
> 
> Let's filter the memslots when creating the vhost memory array,
> accounting all RAM && !ROM memory regions as "used_memslots" even if
> vhost_user isn't interested in anonymous RAM regions, because it needs
> an fd.
> 
> When a device actually filters out regions (which should happen rarely
> in practice), we might detect a layout change although only filtered
> regions changed. We won't bother about optimizing that for now.

That caused trouble in the past when using VGA because it is playing
with mappings in weird ways.
I think we have to optimize it, sorry.


> Note: we cannot simply filter out the region and count them as
> "filtered" to add them to used, because filtered regions could get
> merged and result in a smaller effective number of memslots. Further,
> we won't touch the hmp/qmp virtio introspection output.
> 
> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
> Cc: Tiwei Bie <tiwei.bie@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I didn't review this yet but maybe you can answer:
will this create more slots for the backend?
Because some backends are limited in # of slots and breaking them is
not a good idea.

Thanks!

> ---
>  hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 55 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index eb8c4c378c..b7fb960fa9 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -219,8 +219,13 @@ static void vhost_log_sync_range(struct vhost_dev *dev,
>      int i;
>      /* FIXME: this is N^2 in number of sections */
>      for (i = 0; i < dev->n_mem_sections; ++i) {
> -        MemoryRegionSection *section = &dev->mem_sections[i];
> -        vhost_sync_dirty_bitmap(dev, section, first, last);
> +        MemoryRegionSection *mrs = &dev->mem_sections[i];
> +
> +        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> +            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
> +            continue;
> +        }
> +        vhost_sync_dirty_bitmap(dev, mrs, first, last);
>      }
>  }
>  
> @@ -503,12 +508,6 @@ static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section)
>              return false;
>          }
>  
> -        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> -            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) {
> -            trace_vhost_reject_section(mr->name, 2);
> -            return false;
> -        }
> -
>          trace_vhost_section(mr->name);
>          return true;
>      } else {
> @@ -525,6 +524,43 @@ static void vhost_begin(MemoryListener *listener)
>      dev->n_tmp_sections = 0;
>  }
>  
> +static void vhost_realloc_vhost_memory(struct vhost_dev *dev,
> +                                       unsigned int nregions)
> +{
> +    const size_t size = offsetof(struct vhost_memory, regions) +
> +                        nregions * sizeof dev->mem->regions[0];
> +
> +    dev->mem = g_realloc(dev->mem, size);
> +    dev->mem->nregions = nregions;
> +}
> +
> +static void vhost_rebuild_vhost_memory(struct vhost_dev *dev)
> +{
> +    unsigned int nregions = 0;
> +    int i;
> +
> +    vhost_realloc_vhost_memory(dev, dev->n_mem_sections);
> +    for (i = 0; i < dev->n_mem_sections; i++) {
> +        struct MemoryRegionSection *mrs = dev->mem_sections + i;
> +        struct vhost_memory_region *cur_vmr;
> +
> +        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> +            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
> +            continue;
> +        }
> +        cur_vmr = dev->mem->regions + nregions;
> +        nregions++;
> +
> +        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
> +        cur_vmr->memory_size     = int128_get64(mrs->size);
> +        cur_vmr->userspace_addr  =
> +            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
> +            mrs->offset_within_region;
> +        cur_vmr->flags_padding   = 0;
> +    }
> +    vhost_realloc_vhost_memory(dev, nregions);
> +}
> +
>  static void vhost_commit(MemoryListener *listener)
>  {
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> @@ -532,7 +568,6 @@ static void vhost_commit(MemoryListener *listener)
>      MemoryRegionSection *old_sections;
>      int n_old_sections;
>      uint64_t log_size;
> -    size_t regions_size;
>      int r;
>      int i;
>      bool changed = false;
> @@ -564,23 +599,19 @@ static void vhost_commit(MemoryListener *listener)
>          goto out;
>      }
>  
> -    /* Rebuild the regions list from the new sections list */
> -    regions_size = offsetof(struct vhost_memory, regions) +
> -                       dev->n_mem_sections * sizeof dev->mem->regions[0];
> -    dev->mem = g_realloc(dev->mem, regions_size);
> -    dev->mem->nregions = dev->n_mem_sections;
> +    /*
> +     * Globally track the used memslots *without* device specific
> +     * filtering. This way, we always know how many memslots are required
> +     * when devices with differing filtering requirements get mixed, and
> +     * all RAM memory regions of memory devices will consume memslots.
> +     */
>      used_memslots = dev->mem->nregions;
> -    for (i = 0; i < dev->n_mem_sections; i++) {
> -        struct vhost_memory_region *cur_vmr = dev->mem->regions + i;
> -        struct MemoryRegionSection *mrs = dev->mem_sections + i;
>  
> -        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
> -        cur_vmr->memory_size     = int128_get64(mrs->size);
> -        cur_vmr->userspace_addr  =
> -            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
> -            mrs->offset_within_region;
> -        cur_vmr->flags_padding   = 0;
> -    }
> +    /*
> +     * Rebuild the regions list from the new sections list, filtering out all
> +     * sections that this device is not interested in.
> +     */
> +    vhost_rebuild_vhost_memory(dev);
>  
>      if (!dev->started) {
>          goto out;
> -- 
> 2.39.1
David Hildenbrand Feb. 16, 2023, 12:10 p.m. UTC | #2
On 16.02.23 13:04, Michael S. Tsirkin wrote:
> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>> Having multiple devices, some filtering memslots and some not filtering
>> memslots, messes up the "used_memslot" accounting. If we'd have a device
>> the filters out less memory sections after a device that filters out more,
>> we'd be in trouble, because our memslot checks stop working reliably.
>> For example, hotplugging a device that filters out less memslots might end
>> up passing the checks based on max vs. used memslots, but can run out of
>> memslots when getting notified about all memory sections.
>>
>> Further, it will be helpful in memory device context in the near future
>> to know that a RAM memory region section will consume a memslot, and be
>> accounted for in the used vs. free memslots, such that we can implement
>> reservation of memslots for memory devices properly. Whether a device
>> filters this out and would theoretically still have a free memslot is
>> then hidden internally, making overall vhost memslot accounting easier.
>>
>> Let's filter the memslots when creating the vhost memory array,
>> accounting all RAM && !ROM memory regions as "used_memslots" even if
>> vhost_user isn't interested in anonymous RAM regions, because it needs
>> an fd.
>>
>> When a device actually filters out regions (which should happen rarely
>> in practice), we might detect a layout change although only filtered
>> regions changed. We won't bother about optimizing that for now.
> 
> That caused trouble in the past when using VGA because it is playing
> with mappings in weird ways.
> I think we have to optimize it, sorry.

We still filter them out, just later.

>> Note: we cannot simply filter out the region and count them as
>> "filtered" to add them to used, because filtered regions could get
>> merged and result in a smaller effective number of memslots. Further,
>> we won't touch the hmp/qmp virtio introspection output.
>>
>> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
>> Cc: Tiwei Bie <tiwei.bie@intel.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I didn't review this yet but maybe you can answer:
> will this create more slots for the backend?
> Because some backends are limited in # of slots and breaking them is
> not a good idea.

It restores the handling we had before 988a27754bbb. RAM without an fd 
should be rare for vhost-user setups (where we actually filter) I assume?
David Hildenbrand Feb. 16, 2023, 12:13 p.m. UTC | #3
On 16.02.23 13:10, David Hildenbrand wrote:
> On 16.02.23 13:04, Michael S. Tsirkin wrote:
>> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>>> Having multiple devices, some filtering memslots and some not filtering
>>> memslots, messes up the "used_memslot" accounting. If we'd have a device
>>> the filters out less memory sections after a device that filters out more,
>>> we'd be in trouble, because our memslot checks stop working reliably.
>>> For example, hotplugging a device that filters out less memslots might end
>>> up passing the checks based on max vs. used memslots, but can run out of
>>> memslots when getting notified about all memory sections.
>>>
>>> Further, it will be helpful in memory device context in the near future
>>> to know that a RAM memory region section will consume a memslot, and be
>>> accounted for in the used vs. free memslots, such that we can implement
>>> reservation of memslots for memory devices properly. Whether a device
>>> filters this out and would theoretically still have a free memslot is
>>> then hidden internally, making overall vhost memslot accounting easier.
>>>
>>> Let's filter the memslots when creating the vhost memory array,
>>> accounting all RAM && !ROM memory regions as "used_memslots" even if
>>> vhost_user isn't interested in anonymous RAM regions, because it needs
>>> an fd.
>>>
>>> When a device actually filters out regions (which should happen rarely
>>> in practice), we might detect a layout change although only filtered
>>> regions changed. We won't bother about optimizing that for now.
>>
>> That caused trouble in the past when using VGA because it is playing
>> with mappings in weird ways.
>> I think we have to optimize it, sorry.
> 
> We still filter them out, just later.

To be precise, we still filter out all DIRTY_MEMORY_VGA as we used to. 
Only the device-specific filtering (vhost-user) is modified.
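Concretely, the device-independent part of vhost_section() that keeps
rejecting such regions looks roughly like this (paraphrased from
hw/virtio/vhost.c; the exact constants and structure may differ):

    /* Paraphrased sketch of the generic checks that remain in vhost_section(). */
    if (!memory_region_is_ram(mr) || memory_region_is_rom(mr)) {
        return false;                 /* not RAM, or ROM: never mapped */
    }

    /* Reject regions with dirty tracking other than migration/TCG code,
     * e.g. DIRTY_MEMORY_VGA framebuffers. */
    if (memory_region_get_dirty_log_mask(mr) &
        ~((1 << DIRTY_MEMORY_MIGRATION) | (1 << DIRTY_MEMORY_CODE))) {
        return false;
    }

Only the backend-specific callback (vhost-user's fd requirement) moves from
this function into the commit path.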
Michael S. Tsirkin Feb. 16, 2023, 12:21 p.m. UTC | #4
On Thu, Feb 16, 2023 at 01:10:54PM +0100, David Hildenbrand wrote:
> On 16.02.23 13:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> > > Having multiple devices, some filtering memslots and some not filtering
> > > memslots, messes up the "used_memslot" accounting. If we'd have a device
> > > the filters out less memory sections after a device that filters out more,
> > > we'd be in trouble, because our memslot checks stop working reliably.
> > > For example, hotplugging a device that filters out less memslots might end
> > > up passing the checks based on max vs. used memslots, but can run out of
> > > memslots when getting notified about all memory sections.
> > > 
> > > Further, it will be helpful in memory device context in the near future
> > > to know that a RAM memory region section will consume a memslot, and be
> > > accounted for in the used vs. free memslots, such that we can implement
> > > reservation of memslots for memory devices properly. Whether a device
> > > filters this out and would theoretically still have a free memslot is
> > > then hidden internally, making overall vhost memslot accounting easier.
> > > 
> > > Let's filter the memslots when creating the vhost memory array,
> > > accounting all RAM && !ROM memory regions as "used_memslots" even if
> > > vhost_user isn't interested in anonymous RAM regions, because it needs
> > > an fd.
> > > 
> > > When a device actually filters out regions (which should happen rarely
> > > in practice), we might detect a layout change although only filtered
> > > regions changed. We won't bother about optimizing that for now.
> > 
> > That caused trouble in the past when using VGA because it is playing
> > with mappings in weird ways.
> > I think we have to optimize it, sorry.
> 
> We still filter them out, just later.


The issue is sending lots of unnecessary system calls to update the kernel which
goes through a slow RCU.

> > > Note: we cannot simply filter out the region and count them as
> > > "filtered" to add them to used, because filtered regions could get
> > > merged and result in a smaller effective number of memslots. Further,
> > > we won't touch the hmp/qmp virtio introspection output.
> > > 
> > > Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
> > > Cc: Tiwei Bie <tiwei.bie@intel.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > I didn't review this yet but maybe you can answer:
> > will this create more slots for the backend?
> > Because some backends are limited in # of slots and breaking them is
> > not a good idea.
> 
> It restores the handling we had before 988a27754bbb. RAM without an fd
> should be rare for vhost-user setups (where we actually filter) I assume?

Hmm, I guess so.

> -- 
> Thanks,
> 
> David / dhildenb
Michael S. Tsirkin Feb. 16, 2023, 12:22 p.m. UTC | #5
On Thu, Feb 16, 2023 at 01:13:07PM +0100, David Hildenbrand wrote:
> On 16.02.23 13:10, David Hildenbrand wrote:
> > On 16.02.23 13:04, Michael S. Tsirkin wrote:
> > > On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> > > > Having multiple devices, some filtering memslots and some not filtering
> > > > memslots, messes up the "used_memslot" accounting. If we'd have a device
> > > > the filters out less memory sections after a device that filters out more,
> > > > we'd be in trouble, because our memslot checks stop working reliably.
> > > > For example, hotplugging a device that filters out less memslots might end
> > > > up passing the checks based on max vs. used memslots, but can run out of
> > > > memslots when getting notified about all memory sections.
> > > > 
> > > > Further, it will be helpful in memory device context in the near future
> > > > to know that a RAM memory region section will consume a memslot, and be
> > > > accounted for in the used vs. free memslots, such that we can implement
> > > > reservation of memslots for memory devices properly. Whether a device
> > > > filters this out and would theoretically still have a free memslot is
> > > > then hidden internally, making overall vhost memslot accounting easier.
> > > > 
> > > > Let's filter the memslots when creating the vhost memory array,
> > > > accounting all RAM && !ROM memory regions as "used_memslots" even if
> > > > vhost_user isn't interested in anonymous RAM regions, because it needs
> > > > an fd.
> > > > 
> > > > When a device actually filters out regions (which should happen rarely
> > > > in practice), we might detect a layout change although only filtered
> > > > regions changed. We won't bother about optimizing that for now.
> > > 
> > > That caused trouble in the past when using VGA because it is playing
> > > with mappings in weird ways.
> > > I think we have to optimize it, sorry.
> > 
> > We still filter them out, just later.
> 
> To be precise, we still filter out all DIRTY_MEMORY_VGA as we used to. Only
> the device-specific filtering (vhost-user) is modified.


Oh good so the VGA use-case is unaffected.

> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Feb. 16, 2023, 12:39 p.m. UTC | #6
On 16.02.23 13:21, Michael S. Tsirkin wrote:
> On Thu, Feb 16, 2023 at 01:10:54PM +0100, David Hildenbrand wrote:
>> On 16.02.23 13:04, Michael S. Tsirkin wrote:
>>> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>>>> Having multiple devices, some filtering memslots and some not filtering
>>>> memslots, messes up the "used_memslot" accounting. If we'd have a device
>>>> the filters out less memory sections after a device that filters out more,
>>>> we'd be in trouble, because our memslot checks stop working reliably.
>>>> For example, hotplugging a device that filters out less memslots might end
>>>> up passing the checks based on max vs. used memslots, but can run out of
>>>> memslots when getting notified about all memory sections.
>>>>
>>>> Further, it will be helpful in memory device context in the near future
>>>> to know that a RAM memory region section will consume a memslot, and be
>>>> accounted for in the used vs. free memslots, such that we can implement
>>>> reservation of memslots for memory devices properly. Whether a device
>>>> filters this out and would theoretically still have a free memslot is
>>>> then hidden internally, making overall vhost memslot accounting easier.
>>>>
>>>> Let's filter the memslots when creating the vhost memory array,
>>>> accounting all RAM && !ROM memory regions as "used_memslots" even if
>>>> vhost_user isn't interested in anonymous RAM regions, because it needs
>>>> an fd.
>>>>
>>>> When a device actually filters out regions (which should happen rarely
>>>> in practice), we might detect a layout change although only filtered
>>>> regions changed. We won't bother about optimizing that for now.
>>>
>>> That caused trouble in the past when using VGA because it is playing
>>> with mappings in weird ways.
>>> I think we have to optimize it, sorry.
>>
>> We still filter them out, just later.
> 
> 
> The issue is sending lots of unnecessary system calls to update the kernel which
> goes through a slow RCU.

I don't think this is the case when deferring the device-specific 
filtering. As discussed, the generic filtering (ignore !ram, ignore rom, 
ignore VGA) remains in place because that is identical for all devices.

> 
>>>> Note: we cannot simply filter out the region and count them as
>>>> "filtered" to add them to used, because filtered regions could get
>>>> merged and result in a smaller effective number of memslots. Further,
>>>> we won't touch the hmp/qmp virtio introspection output.
>>>>
>>>> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
>>>> Cc: Tiwei Bie <tiwei.bie@intel.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> I didn't review this yet but maybe you can answer:
>>> will this create more slots for the backend?
>>> Because some backends are limited in # of slots and breaking them is
>>> not a good idea.
>>
>> It restores the handling we had before 988a27754bbb. RAM without an fd
>> should be rare for vhost-user setups (where we actually filter) I assume?
> 
> Hmm, I guess so.

At least on simplistic QEMU invocations with vhost-user (and proper 
shared memory as backend) I don't see any such filtering happening, 
because everything that is RAM is properly fd-based.

IMHO the chance of breaking a sane VM setup is very small.
Igor Mammedov March 7, 2023, 10:51 a.m. UTC | #7
On Thu, 16 Feb 2023 12:47:51 +0100
David Hildenbrand <david@redhat.com> wrote:

> Having multiple devices, some filtering memslots and some not filtering
> memslots, messes up the "used_memslot" accounting. If we'd have a device
> the filters out less memory sections after a device that filters out more,
> we'd be in trouble, because our memslot checks stop working reliably.
> For example, hotplugging a device that filters out less memslots might end
> up passing the checks based on max vs. used memslots, but can run out of
> memslots when getting notified about all memory sections.

A hypothetical example of such a case would be appreciated
(I don't really get how the above can happen; perhaps a more detailed
explanation would help)
 
> Further, it will be helpful in memory device context in the near future
> to know that a RAM memory region section will consume a memslot, and be
> accounted for in the used vs. free memslots, such that we can implement
> reservation of memslots for memory devices properly. Whether a device
> filters this out and would theoretically still have a free memslot is
> then hidden internally, making overall vhost memslot accounting easier.
> 
> Let's filter the memslots when creating the vhost memory array,
> accounting all RAM && !ROM memory regions as "used_memslots" even if
> vhost_user isn't interested in anonymous RAM regions, because it needs
> an fd.
> 
> When a device actually filters out regions (which should happen rarely
> in practice), we might detect a layout change although only filtered
> regions changed. We won't bother about optimizing that for now.
> 
> Note: we cannot simply filter out the region and count them as
> "filtered" to add them to used, because filtered regions could get
> merged and result in a smaller effective number of memslots. Further,
> we won't touch the hmp/qmp virtio introspection output.
What output exactly are you talking about?

PS:
If we drop vhost_dev::mem, the bulk of this patch would go away

side questions:
do we have MemorySection merging on qemu's kvm side?
also, does KVM merge memslots?
 
> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
> Cc: Tiwei Bie <tiwei.bie@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 55 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index eb8c4c378c..b7fb960fa9 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -219,8 +219,13 @@ static void vhost_log_sync_range(struct vhost_dev *dev,
>      int i;
>      /* FIXME: this is N^2 in number of sections */
>      for (i = 0; i < dev->n_mem_sections; ++i) {
> -        MemoryRegionSection *section = &dev->mem_sections[i];
> -        vhost_sync_dirty_bitmap(dev, section, first, last);
> +        MemoryRegionSection *mrs = &dev->mem_sections[i];
> +
> +        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> +            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
> +            continue;
> +        }
> +        vhost_sync_dirty_bitmap(dev, mrs, first, last);
>      }
>  }
>  
> @@ -503,12 +508,6 @@ static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section)
>              return false;
>          }
>  
> -        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> -            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) {
> -            trace_vhost_reject_section(mr->name, 2);
> -            return false;
> -        }
> -
>          trace_vhost_section(mr->name);
>          return true;
>      } else {
> @@ -525,6 +524,43 @@ static void vhost_begin(MemoryListener *listener)
>      dev->n_tmp_sections = 0;
>  }
>  
> +static void vhost_realloc_vhost_memory(struct vhost_dev *dev,
> +                                       unsigned int nregions)
> +{
> +    const size_t size = offsetof(struct vhost_memory, regions) +
> +                        nregions * sizeof dev->mem->regions[0];
> +
> +    dev->mem = g_realloc(dev->mem, size);
> +    dev->mem->nregions = nregions;
> +}
> +
> +static void vhost_rebuild_vhost_memory(struct vhost_dev *dev)
> +{
> +    unsigned int nregions = 0;
> +    int i;
> +
> +    vhost_realloc_vhost_memory(dev, dev->n_mem_sections);
> +    for (i = 0; i < dev->n_mem_sections; i++) {
> +        struct MemoryRegionSection *mrs = dev->mem_sections + i;
> +        struct vhost_memory_region *cur_vmr;
> +
> +        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
> +            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
> +            continue;
> +        }
> +        cur_vmr = dev->mem->regions + nregions;
> +        nregions++;
> +
> +        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
> +        cur_vmr->memory_size     = int128_get64(mrs->size);
> +        cur_vmr->userspace_addr  =
> +            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
> +            mrs->offset_within_region;
> +        cur_vmr->flags_padding   = 0;
> +    }
> +    vhost_realloc_vhost_memory(dev, nregions);
> +}
> +
>  static void vhost_commit(MemoryListener *listener)
>  {
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> @@ -532,7 +568,6 @@ static void vhost_commit(MemoryListener *listener)
>      MemoryRegionSection *old_sections;
>      int n_old_sections;
>      uint64_t log_size;
> -    size_t regions_size;
>      int r;
>      int i;
>      bool changed = false;
> @@ -564,23 +599,19 @@ static void vhost_commit(MemoryListener *listener)
>          goto out;
>      }
>  
> -    /* Rebuild the regions list from the new sections list */
> -    regions_size = offsetof(struct vhost_memory, regions) +
> -                       dev->n_mem_sections * sizeof dev->mem->regions[0];
> -    dev->mem = g_realloc(dev->mem, regions_size);
> -    dev->mem->nregions = dev->n_mem_sections;
> +    /*
> +     * Globally track the used memslots *without* device specific
> +     * filtering. This way, we always know how many memslots are required
> +     * when devices with differing filtering requirements get mixed, and
> +     * all RAM memory regions of memory devices will consume memslots.
> +     */
>      used_memslots = dev->mem->nregions;
> -    for (i = 0; i < dev->n_mem_sections; i++) {
> -        struct vhost_memory_region *cur_vmr = dev->mem->regions + i;
> -        struct MemoryRegionSection *mrs = dev->mem_sections + i;
>  
> -        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
> -        cur_vmr->memory_size     = int128_get64(mrs->size);
> -        cur_vmr->userspace_addr  =
> -            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
> -            mrs->offset_within_region;
> -        cur_vmr->flags_padding   = 0;
> -    }
> +    /*
> +     * Rebuild the regions list from the new sections list, filtering out all
> +     * sections that this device is not interested in.
> +     */
> +    vhost_rebuild_vhost_memory(dev);
>  
>      if (!dev->started) {
>          goto out;
David Hildenbrand March 7, 2023, 12:46 p.m. UTC | #8
On 07.03.23 11:51, Igor Mammedov wrote:
> On Thu, 16 Feb 2023 12:47:51 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Having multiple devices, some filtering memslots and some not filtering
>> memslots, messes up the "used_memslot" accounting. If we'd have a device
>> the filters out less memory sections after a device that filters out more,
>> we'd be in trouble, because our memslot checks stop working reliably.
>> For example, hotplugging a device that filters out less memslots might end
>> up passing the checks based on max vs. used memslots, but can run out of
>> memslots when getting notified about all memory sections.
> 
> an hypothetical example of such case would be appreciated
> (I don't really get how above can happen, perhaps more detailed explanation
> would help)

Thanks for asking! AFAICT, it's mostly about first hot-adding a vhost device
that filters (and messes up used_memslots), and then messing with memslots that
get filtered out:

$ sudo rmmod vhost
$ sudo modprobe vhost max_mem_regions=4

// startup guest with virtio-net device
...

// hotplug an NVDIMM, resulting in used_memslots=4
echo "object_add memory-backend-ram,id=mem0,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
echo "device_add nvdimm,id=nvdimm0,memdev=mem0" | sudo nc -U /var/tmp/mon_src

// hotplug vhost-user device that overwrites "used_memslots=3"
echo "device_add vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,bus=root" | sudo nc -U /var/tmp/mon_src

// hotplug another NVDIMM
echo "object_add memory-backend-ram,id=mem1,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
echo "device_add pc-dimm,id=nvdimm1,memdev=mem1" | sudo nc -U /var/tmp/mon_src

// vhost will fail to update the memslots
vhost_set_mem_table failed: Argument list too long (7)


So we tricked used_memslots into being smaller than it actually has to be, because
we're ignoring the memslots filtered out by the vhost-user device.
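
To make that concrete, here is a tiny self-contained toy model of the
accounting problem (not QEMU code, just an illustration of the scenario
described above):

    /* Toy model: "used_memslots" is a global overwritten by whichever
     * device committed its (possibly filtered) view last. Not QEMU code. */
    #include <stdbool.h>
    #include <stdio.h>

    #define KERNEL_VHOST_LIMIT 4            /* vhost max_mem_regions=4 */

    static unsigned int used_memslots;

    /* A committing device only counts the sections it did not filter out. */
    static void device_commit(unsigned int total_sections, unsigned int filtered)
    {
        used_memslots = total_sections - filtered;
    }

    /* Roughly the "max vs. used memslots" hotplug check. */
    static bool hotplug_check(void)
    {
        return used_memslots < KERNEL_VHOST_LIMIT;
    }

    int main(void)
    {
        device_commit(4, 0);   /* kernel vhost (virtio-net): sees all 4 slots  */
        device_commit(4, 1);   /* vhost-user-fs filters the non-fd NVDIMM slot */
        printf("used_memslots=%u, hotplug check passes: %d\n",
               used_memslots, hotplug_check());
        /* The check passes (3 < 4), another DIMM gets plugged, and the kernel
         * vhost device then needs 5 regions -> "Argument list too long". */
        return 0;
    }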


Now, this is all far from relevant in practice as of now I think, and
usually would indicate user errors already (memory that's not shared with
vhost-user?).

It might get more relevant when virtio-mem dynamically adds/removes memslots and
relies on precise tracking of used vs. free memslots.


But maybe I should just ignore that case and live a happy life instead; it's
certainly hard to even trigger right now :)

>   
>> Further, it will be helpful in memory device context in the near future
>> to know that a RAM memory region section will consume a memslot, and be
>> accounted for in the used vs. free memslots, such that we can implement
>> reservation of memslots for memory devices properly. Whether a device
>> filters this out and would theoretically still have a free memslot is
>> then hidden internally, making overall vhost memslot accounting easier.
>>
>> Let's filter the memslots when creating the vhost memory array,
>> accounting all RAM && !ROM memory regions as "used_memslots" even if
>> vhost_user isn't interested in anonymous RAM regions, because it needs
>> an fd.
>>
>> When a device actually filters out regions (which should happen rarely
>> in practice), we might detect a layout change although only filtered
>> regions changed. We won't bother about optimizing that for now.
>>
>> Note: we cannot simply filter out the region and count them as
>> "filtered" to add them to used, because filtered regions could get
>> merged and result in a smaller effective number of memslots. Further,
>> we won't touch the hmp/qmp virtio introspection output.
> What output exactly you are talking about?

hw/virtio/virtio-qmp.c:qmp_x_query_virtio_status

Prints hdev->n_mem_sections and hdev->n_tmp_sections. I won't be
touching that (debug) output.

> 
> PS:
> If we drop vhost_dev::memm the bulk of this patch would go away

Yes, unfortunately we can't I think.

> 
> side questions:
> do we have MemorySection merging on qemu's kvm side?

Yes, we properly merge in flatview_simplify(). It's all about handling holes
in huge pages IIUC.

> also does KVM merge merge memslots?

No, for good reasons not. Mapping more than we're instructed to map via a notifier
is kind-of hacky already. But I guess there is no easy way around it (e.g., if
mapping that part of memory doesn't work, we'd have to bounce the reads/writes
through QEMU instead).
Igor Mammedov March 8, 2023, 12:30 p.m. UTC | #9
On Tue, 7 Mar 2023 13:46:36 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 07.03.23 11:51, Igor Mammedov wrote:
> > On Thu, 16 Feb 2023 12:47:51 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> Having multiple devices, some filtering memslots and some not filtering
> >> memslots, messes up the "used_memslot" accounting. If we'd have a device
> >> the filters out less memory sections after a device that filters out more,
> >> we'd be in trouble,

It should say why/when this happens (in the example you've provided,
it's caused by a mix of in-kernel vhost and vhost-user devices)

> >> because our memslot checks stop working reliably.
> >> For example, hotplugging a device that filters out less memslots might end
> >> up passing the checks based on max vs. used memslots, but can run out of
> >> memslots when getting notified about all memory sections.  
> > 
> > an hypothetical example of such case would be appreciated
> > (I don't really get how above can happen, perhaps more detailed explanation
> > would help)  
> 
> Thanks for asking! AFAIKT, it's mostly about hot-adding first a vhost devices
> that filters (and messes up used_memslots), and then messing with memslots that
> get filtered out,
> 
> $ sudo rmmod vhost
> $ sudo modprobe vhost max_mem_regions=4
> 
> // startup guest with virtio-net device
> ...
> 
> // hotplug a NVDIMM, resulting in used_memslots=4
> echo "object_add memory-backend-ram,id=mem0,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add nvdimm,id=nvdimm0,memdev=mem0" | sudo nc -U /var/tmp/mon_src
> 
> // hotplug vhost-user device that overwrites "used_memslots=3"
> echo "device_add vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,bus=root" | sudo nc -U /var/tmp/mon_src
> 
> // hotplug another NVDIMM
> echo "object_add memory-backend-ram,id=mem1,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add pc-dimm,id=nvdimm1,memdev=mem1" | sudo nc -U /var/tmp/mon_src
> 
> // vvhost will fail to update the memslots
> vhost_set_mem_table failed: Argument list too long (7)
> 
> 
> So we tricked used_memslots to be smaller than it actually has to be, because
> we're ignoring the memslots filtered out by the vhost-user device.
> 
> 
> Now, this is all far from relevant in practice as of now I think, and
> usually would indicate user errors already (memory that's not shared with
> vhost-user?).

Well, vhost-user device_add should fail if it can't get its hands on all RAM
(if it doesn't, we have a bug somewhere else)

> 
> It might gets more relevant when virtio-mem dynamically adds/removes memslots and
> relies on precise tracking of used vs. free memslots.
> 
> 
> But maybe I should just ignore that case and live a happy life instead, it's
> certainly hard to even trigger right now :)
> >     
> >> Further, it will be helpful in memory device context in the near future
> >> to know that a RAM memory region section will consume a memslot, and be
> >> accounted for in the used vs. free memslots, such that we can implement
> >> reservation of memslots for memory devices properly. Whether a device
> >> filters this out and would theoretically still have a free memslot is
> >> then hidden internally, making overall vhost memslot accounting easier.
> >>

> >> Let's filter the memslots when creating the vhost memory array,
> >> accounting all RAM && !ROM memory regions as "used_memslots" even if
> >> vhost_user isn't interested in anonymous RAM regions, because it needs
> >> an fd.

That would regress existing setups where it was possible
to start with N DIMMs; after this patch, the same VM could fail to
start if N was close to vhost's limit in an otherwise perfectly working
configuration. So this approach doesn't seem right.

Perhaps redoing vhost's used_memslots accounting would be
a better approach, right down to introducing reservations you'd
like to have eventually.

Something like:
  1: s/vhost_has_free_slot/vhost_memory_region_limit/
     and maybe the same for kvm_has_free_slot
  then rewrite memory_device_check_addable() moving all
  used_memslots accounting into memory_device core.
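
A rough sketch of what that split could look like, with purely hypothetical
signatures (none of this exists; memory_devices_memslots_required() in
particular is made up for illustration):

    /* Hypothetical interface sketch -- not existing QEMU API. */
    unsigned int vhost_memory_region_limit(void);  /* MIN over all vhost backends */
    unsigned int kvm_memory_region_limit(void);    /* e.g. from KVM_CAP_NR_MEMSLOTS */

    static void memory_device_check_addable(MachineState *ms, uint64_t size,
                                            Error **errp)
    {
        /* the memory-device core does the used/reserved accounting itself */
        unsigned int required = memory_devices_memslots_required(ms) + 1;

        if (required > vhost_memory_region_limit() ||
            required > kvm_memory_region_limit()) {
            error_setg(errp, "not enough memory slots for a new memory device");
            return;
        }
        /* existing size / address-space checks would follow here */
    }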

> >> When a device actually filters out regions (which should happen rarely
> >> in practice), we might detect a layout change although only filtered
> >> regions changed. We won't bother about optimizing that for now.
> >>
> >> Note: we cannot simply filter out the region and count them as
> >> "filtered" to add them to used, because filtered regions could get
> >> merged and result in a smaller effective number of memslots. Further,
> >> we won't touch the hmp/qmp virtio introspection output.  
> > What output exactly you are talking about?  
> 
> hw/virtio/virtio-qmp.c:qmp_x_query_virtio_status
> 
> Prints hdev->n_mem_sections and hdev->n_tmp_sections. I won't be
> touching that (debug) output.
> 
> > 
> > PS:
> > If we drop vhost_dev::memm the bulk of this patch would go away  
> 
> Yes, unfortunately we can't I think.
> 
> > 
> > side questions:
> > do we have MemorySection merging on qemu's kvm side?  
> 
> Yes, we properly merge in flatview_simplify(). It's all about handling holes
> in huge pages IIUC.
> 
> > also does KVM merge merge memslots?  
> 
> No, for good reasons not. Mapping more than we're instructed to map via a notifier
> sounds is kind-of hacky already. But I guess there is no easy way around it (e.g., if
> mapping that part of memory doesn't work, we'd have to bounce the reads/writes
> through QEMU instead).
>
David Hildenbrand March 8, 2023, 3:21 p.m. UTC | #10
>>
>> So we tricked used_memslots to be smaller than it actually has to be, because
>> we're ignoring the memslots filtered out by the vhost-user device.
>>
>>
>> Now, this is all far from relevant in practice as of now I think, and
>> usually would indicate user errors already (memory that's not shared with
>> vhost-user?).
> 
> well vhost-user device_add should fail if it can't get hands on all RAM
> (if it doesn't we have a bug somewhere else)

There are some corner cases already. Like R/O NVDIMMs are completely 
ignored -- I assume because we wouldn't be able to use them for virtio 
queues either way, so it kind-of makes sense I guess.

I have not yet figured out *why* 988a27754bbb ("vhost: allow backends to 
filter memory sections") was included at all. Why should we have 
memory-backend-ram mapped into guest address space that vhost-user 
cannot handle?

Take my NVDIMM example: if we'd use that NVDIMM memory in the guest as 
ordinary RAM, it could eventually be used for virtio queues ... and we 
don't even warn the user.

So I agree that hotplugging that device should fail. But it could also 
break some existing setups (we could work around it using compat 
machines I guess).

But we also have to fail hotplugging such a vhost-user device, ... and I 
am not sure where to even put such checks.


> 
>>
>> It might gets more relevant when virtio-mem dynamically adds/removes memslots and
>> relies on precise tracking of used vs. free memslots.
>>
>>
>> But maybe I should just ignore that case and live a happy life instead, it's
>> certainly hard to even trigger right now :)
>>>      
>>>> Further, it will be helpful in memory device context in the near future
>>>> to know that a RAM memory region section will consume a memslot, and be
>>>> accounted for in the used vs. free memslots, such that we can implement
>>>> reservation of memslots for memory devices properly. Whether a device
>>>> filters this out and would theoretically still have a free memslot is
>>>> then hidden internally, making overall vhost memslot accounting easier.
>>>>
> 
>>>> Let's filter the memslots when creating the vhost memory array,
>>>> accounting all RAM && !ROM memory regions as "used_memslots" even if
>>>> vhost_user isn't interested in anonymous RAM regions, because it needs
>>>> an fd.
> 
> that would regress existing setups where it was possible
> to start with N DIMMs and after this patch the same VM could fail to
> start if N was close to vhost's limit in otherwise perfectly working
> configuration. So this approach doesn't seem right.

As discussed already with MST, this was the state before that change. So 
I strongly doubt that we would break anything because using 
memory-backend-ram in that setup would already be suspicious.

Again, I did not figure out *why* that change was even included and 
which setups even care about that.

Maybe Tiwei can comment.

> 
> Perhaps redoing vhost's used_memslots accounting would be
> a better approach, right down to introducing reservations you'd
> like to have eventually.

The question of what to do with memory-backend-ram for vhost-user still 
remains. It's independent of the "used_memslot" tracking, because used 
vs. reserved would depend on the vhost-backend filtering demands ... and 
I really wanted to avoid that with this commit. It just makes tracking 
much harder.

> 
> Something like:
>    1: s/vhost_has_free_slot/vhost_memory_region_limit/
>       and maybe the same for kvm_has_free_slot
>    then rewrite memory_device_check_addable() moving all
>    used_memslots accounting into memory_device core.

I do something similar already, however, by querying the free memslots 
from kvm and vhost, not the limits (limits are not very expressive).
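
For comparison, the query-based variant hinted at here could look something
like this (again only a hypothetical sketch; kvm_get_free_memslots() and
vhost_get_free_memslots() are illustrative names, not the current API):

    /* Hypothetical sketch of a free-slot query based check. */
    static void memory_device_check_addable(MachineState *ms, Error **errp)
    {
        /* each subsystem accounts for its own notion of "used" (and, later,
         * "reserved"), so the caller only asks how many slots are still free */
        unsigned int free_slots = MIN(kvm_get_free_memslots(),
                                      vhost_get_free_memslots());

        if (free_slots < 1) {
            error_setg(errp, "no free memory slot left for a new memory device");
        }
    }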

Thanks!

Patch

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eb8c4c378c..b7fb960fa9 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -219,8 +219,13 @@  static void vhost_log_sync_range(struct vhost_dev *dev,
     int i;
     /* FIXME: this is N^2 in number of sections */
     for (i = 0; i < dev->n_mem_sections; ++i) {
-        MemoryRegionSection *section = &dev->mem_sections[i];
-        vhost_sync_dirty_bitmap(dev, section, first, last);
+        MemoryRegionSection *mrs = &dev->mem_sections[i];
+
+        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
+            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
+            continue;
+        }
+        vhost_sync_dirty_bitmap(dev, mrs, first, last);
     }
 }
 
@@ -503,12 +508,6 @@  static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section)
             return false;
         }
 
-        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
-            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) {
-            trace_vhost_reject_section(mr->name, 2);
-            return false;
-        }
-
         trace_vhost_section(mr->name);
         return true;
     } else {
@@ -525,6 +524,43 @@  static void vhost_begin(MemoryListener *listener)
     dev->n_tmp_sections = 0;
 }
 
+static void vhost_realloc_vhost_memory(struct vhost_dev *dev,
+                                       unsigned int nregions)
+{
+    const size_t size = offsetof(struct vhost_memory, regions) +
+                        nregions * sizeof dev->mem->regions[0];
+
+    dev->mem = g_realloc(dev->mem, size);
+    dev->mem->nregions = nregions;
+}
+
+static void vhost_rebuild_vhost_memory(struct vhost_dev *dev)
+{
+    unsigned int nregions = 0;
+    int i;
+
+    vhost_realloc_vhost_memory(dev, dev->n_mem_sections);
+    for (i = 0; i < dev->n_mem_sections; i++) {
+        struct MemoryRegionSection *mrs = dev->mem_sections + i;
+        struct vhost_memory_region *cur_vmr;
+
+        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
+            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
+            continue;
+        }
+        cur_vmr = dev->mem->regions + nregions;
+        nregions++;
+
+        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
+        cur_vmr->memory_size     = int128_get64(mrs->size);
+        cur_vmr->userspace_addr  =
+            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
+            mrs->offset_within_region;
+        cur_vmr->flags_padding   = 0;
+    }
+    vhost_realloc_vhost_memory(dev, nregions);
+}
+
 static void vhost_commit(MemoryListener *listener)
 {
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
@@ -532,7 +568,6 @@  static void vhost_commit(MemoryListener *listener)
     MemoryRegionSection *old_sections;
     int n_old_sections;
     uint64_t log_size;
-    size_t regions_size;
     int r;
     int i;
     bool changed = false;
@@ -564,23 +599,19 @@  static void vhost_commit(MemoryListener *listener)
         goto out;
     }
 
-    /* Rebuild the regions list from the new sections list */
-    regions_size = offsetof(struct vhost_memory, regions) +
-                       dev->n_mem_sections * sizeof dev->mem->regions[0];
-    dev->mem = g_realloc(dev->mem, regions_size);
-    dev->mem->nregions = dev->n_mem_sections;
+    /*
+     * Globally track the used memslots *without* device specific
+     * filtering. This way, we always know how many memslots are required
+     * when devices with differing filtering requirements get mixed, and
+     * all RAM memory regions of memory devices will consume memslots.
+     */
     used_memslots = dev->mem->nregions;
-    for (i = 0; i < dev->n_mem_sections; i++) {
-        struct vhost_memory_region *cur_vmr = dev->mem->regions + i;
-        struct MemoryRegionSection *mrs = dev->mem_sections + i;
 
-        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
-        cur_vmr->memory_size     = int128_get64(mrs->size);
-        cur_vmr->userspace_addr  =
-            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
-            mrs->offset_within_region;
-        cur_vmr->flags_padding   = 0;
-    }
+    /*
+     * Rebuild the regions list from the new sections list, filtering out all
+     * sections that this device is not interested in.
+     */
+    vhost_rebuild_vhost_memory(dev);
 
     if (!dev->started) {
         goto out;