[RFC,0/9] kvm: implement atomic memslot updates

Message ID: 20220909104506.738478-1-eesposit@redhat.com

Message

Emanuele Giuseppe Esposito Sept. 9, 2022, 10:44 a.m. UTC
KVM is currently capable of receiving a single memslot update through
the KVM_SET_USER_MEMORY_REGION ioctl.
The problem arises when we want to atomically perform multiple updates,
so that readers of the memslot active list avoid seeing incomplete states.

For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
we see how non-atomic updates cause boot failure, because vCPUs
will see a partial update (old memslot deleted, new one not yet created)
and will crash.

In this series we introduce KVM_SET_USER_MEMORY_REGION_LIST, a new ioctl
that takes a kvm_userspace_memory_region_list, a list of memslot updates,
and performs them atomically.
"atomically" in KVM words just means "apply all modifications to the
inactive memslot list, and then perform a single swap to replace the
active list with the inactive".
It is slightly more complicated that that, since DELETE and MOVE operations
require 2 swaps, but the main idea is the above.
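
To illustrate the intended usage, here is a minimal userspace sketch of
batching a DELETE+CREATE pair through the new ioctl. Only the struct and
ioctl names come from this series; the list layout below (entry count plus
a flexible array of the standard region structs) is an assumption for
illustration, the real UAPI is defined in patch 2.

/*
 * Sketch only: assumed layout for the list argument of
 * KVM_SET_USER_MEMORY_REGION_LIST (introduced by this series).
 */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

struct kvm_userspace_memory_region_list {
    __u32 nent;
    __u32 flags;
    struct kvm_userspace_memory_region entries[];
};

/* Atomically replace slot 0: delete it at old_gpa, re-create it at new_gpa. */
static int move_slot0(int vm_fd, __u64 old_gpa, __u64 new_gpa,
                      __u64 size, __u64 hva)
{
    struct kvm_userspace_memory_region_list *list;
    int ret;

    list = calloc(1, sizeof(*list) + 2 * sizeof(list->entries[0]));
    if (!list)
        return -1;
    list->nent = 2;

    /* Entry 0: DELETE (in KVM, memory_size == 0 deletes the slot). */
    list->entries[0].slot = 0;
    list->entries[0].guest_phys_addr = old_gpa;
    list->entries[0].memory_size = 0;

    /* Entry 1: CREATE the replacement slot. */
    list->entries[1].slot = 0;
    list->entries[1].guest_phys_addr = new_gpa;
    list->entries[1].memory_size = size;
    list->entries[1].userspace_addr = hva;

    /* vCPUs never observe the "deleted but not yet re-created" window. */
    ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
    free(list);
    return ret;
}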

Patches 1-6 are just code movements, in preparation for the following patches.
Patch 7 allows the invalid slot to be in both the inactive and active memslot lists.
Patch 8 allows searching for the existing (old) memslot in the inactive list
instead of the active one, making it possible to perform multiple memslot
updates without swapping.
Patch 9 implements the ioctl logic.

QEMU userspace logic in preparation for the ioctl is here:
https://patchew.org/QEMU/20220909081150.709060-1-eesposit@redhat.com/
"[RFC PATCH v2 0/3] accel/kvm: extend kvm memory listener to support"

TODOs and ideas:
- limit the size of the ioctl arguments. Right now it is unbounded
- try to reduce the number of swaps necessary? i.e. every DELETE/MOVE
  requires an additional swap
- add selftests
- add documentation

Emanuele Giuseppe Esposito (9):
  kvm_main.c: move slot check in kvm_set_memory_region
  kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl
  kvm_main.c: introduce kvm_internal_memory_region_list
  kvm_main.c: split logic in kvm_set_memslots
  kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and
    kvm_prepare_batch
  kvm_main.c: simplify change-specific callbacks
  kvm_main.c: duplicate invalid memslot also in inactive list
  kvm_main.c: find memslots from the inactive memslot list
  kvm_main.c: handle atomic memslot update

 arch/x86/kvm/x86.c       |   3 +-
 include/linux/kvm_host.h |  21 +-
 include/uapi/linux/kvm.h |  21 +-
 virt/kvm/kvm_main.c      | 420 +++++++++++++++++++++++++++++++--------
 4 files changed, 377 insertions(+), 88 deletions(-)

Comments

Sean Christopherson Sept. 9, 2022, 2:30 p.m. UTC | #1
On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> KVM is currently capable of receiving a single memslot update through
> the KVM_SET_USER_MEMORY_REGION ioctl.
> The problem arises when we want to atomically perform multiple updates,
> so that readers of the memslot active list avoid seeing incomplete states.
> 
> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276

I don't have access.  Can you provide a TL;DR?

> we see how non-atomic updates cause boot failure, because vCPUs
> will see a partial update (old memslot deleted, new one not yet created)
> and will crash.

Why not simply pause vCPUs in this scenario?  This is an awful lot of complexity
to take on for something that appears to be solvable in userspace.

And if the issue is related to KVM disallowing the toggling of read-only (can't see
the bug), we can likely solve that without needing a new ioctl() that allows
userspace to batch an arbitrary number of updates.
Emanuele Giuseppe Esposito Sept. 18, 2022, 4:13 p.m. UTC | #2
On 09/09/2022 at 16:30, Sean Christopherson wrote:
> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>> KVM is currently capable of receiving a single memslot update through
>> the KVM_SET_USER_MEMORY_REGION ioctl.
>> The problem arises when we want to atomically perform multiple updates,
>> so that readers of the memslot active list avoid seeing incomplete states.
>>
>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
> 
> I don't have access.  Can you provide a TL;DR?

You should be able to have access to it now.

> 
>> we see how non-atomic updates cause boot failure, because vCPUs
>> will see a partial update (old memslot deleted, new one not yet created)
>> and will crash.
> 
> Why not simply pause vCPUs in this scenario?  This is an awful lot of complexity
> to take on for something that appears to be solvable in userspace. 
> 

I think it is not that easy to solve in userspace: see
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/


"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."

Probably @Paolo and @Maxim can add more to this.

Thank you,
Emanuele
Like Xu Sept. 19, 2022, 7:38 a.m. UTC | #3
On 19/9/2022 12:13 am, Emanuele Giuseppe Esposito wrote:
> 
> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of the memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276

Oh, thanks for stepping up to try to address it.

As it turns out, this issue was discovered "long" before
https://bugzilla.kernel.org/show_bug.cgi?id=213781

As a comment, relevant selftests are necessary.
David Hildenbrand Sept. 19, 2022, 7:53 a.m. UTC | #4
On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of the memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>
>> I don't have access.  Can you provide a TL;DR?
> 
> You should be able to have access to it now.
> 
>>
>>> we see how non-atomic updates cause boot failure, because vCPUs
>>> will see a partial update (old memslot deleted, new one not yet created)
>>> and will crash.
>>
>> Why not simply pause vCPUs in this scenario?  This is an awful lot of complexity
>> to take on for something that appears to be solvable in userspace.
>>
> 
> I think it is not that easy to solve in userspace: see
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
> 
> 
> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
> temporarily drop the BQL - something most callers can't handle (esp.
> when called from vcpu context e.g., in virtio code)."

Can you please comment on the bigger picture? The patch from me works 
around *exactly that*, and for that reason, contains that comment.
David Hildenbrand Sept. 19, 2022, 5:30 p.m. UTC | #5
On 19.09.22 09:53, David Hildenbrand wrote:
> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>> KVM is currently capable of receiving a single memslot update through
>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>> The problem arises when we want to atomically perform multiple updates,
>>>> so that readers of the memslot active list avoid seeing incomplete states.
>>>>
>>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>
>>> I don't have access.  Can you provide a TL;DR?
>>
>> You should be able to have access to it now.
>>
>>>
>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>> will see a partial update (old memslot deleted, new one not yet created)
>>>> and will crash.
>>>
>>> Why not simply pause vCPUs in this scenario?  This is an awful lot of complexity
>>> to take on for something that appears to be solvable in userspace.
>>>
>>
>> I think it is not that easy to solve in userspace: see
>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>
>>
>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>> temporarily drop the BQL - something most callers can't handle (esp.
>> when called from vcpu context e.g., in virtio code)."
> 
> Can you please comment on the bigger picture? The patch from me works
> around *exactly that*, and for that reason, contains that comment.
> 

FWIW, I hacked up my RFC to perform atomic updates on any memslot
transactions (not just resizes) where ranges to add overlap with ranges
to remove.

https://github.com/davidhildenbrand/qemu/tree/memslot


I only performed a simple boot check under x86-64 (where I can see region
resizes) and some make checks -- pretty sure it has some rough edges,
but it should indicate what's possible and what the possible price might
be. [one could wire up a new KVM ioctl and call it conditionally on
support if really required]
Emanuele Giuseppe Esposito Sept. 23, 2022, 1:10 p.m. UTC | #6
On 19/09/2022 at 19:30, David Hildenbrand wrote:
> On 19.09.22 09:53, David Hildenbrand wrote:
>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>> KVM is currently capable of receiving a single memslot update through
>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>> The problem arises when we want to atomically perform multiple
>>>>> updates,
>>>>> so that readers of the memslot active list avoid seeing incomplete states.
>>>>>
>>>>> For example, in RHBZ
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>
>>>> I don't have access.  Can you provide a TL;DR?
>>>
>>> You should be able to have access to it now.
>>>
>>>>
>>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>>> will see a partial update (old memslot deleted, new one not yet created)
>>>>> and will crash.
>>>>
>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>> of complexity
>>>> to take on for something that appears to be solvable in userspace.
>>>>
>>>
>>> I think it is not that easy to solve in userspace: see
>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>
>>>
>>>
>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>> temporarily drop the BQL - something most callers can't handle (esp.
>>> when called from vcpu context e.g., in virtio code)."
>>
>> Can you please comment on the bigger picture? The patch from me works
>> around *exactly that*, and for that reason, contains that comment.
>>
> 
> FWIW, I hacked up my RFC to perform atomic updates on any memslot
> transactions (not just resizes) where ranges to add overlap with ranges
> to remove.
> 
> https://github.com/davidhildenbrand/qemu/tree/memslot
> 
> 
> I only performed a simple boot check under x86-64 (where I can see region
> resizes) and some make checks -- pretty sure it has some rough edges,
> but it should indicate what's possible and what the possible price might
> be. [one could wire up a new KVM ioctl and call it conditionally on
> support if really required]
> 

A benefit of my ioctl implementation is that it could also be used by other
hypervisors, which then do not need to share this kind of "hack".
However, after also talking with Maxim and Paolo, we all agreed that the
main disadvantage of your approach is that it is not scalable with the
number of vCPUs. Stopping *all* vCPUs just to allow a memslot update is
also inferior (KVM instead only pauses the vCPUs that access the modified
memslots).

So I took some measurements, to see the performance difference
between my implementation and yours. I used a machine where I could
replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
Processor with kernel 5.19.0 (+ my patches).

Then I measured the time QEMU spends in kvm_commit (i.e. in
memslot updates) while booting a VM. In other words, if kvm_commit takes
10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
not called anymore after boot, so this measurement is easy to compare
over multiple invocations of QEMU.

I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24. The QEMU
command is the same one used to replicate the bug:
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none >> ~/smp_$v;

Each boot is repeated 100 times, and from the results I compute the
average and stddev (in milliseconds).

ioctl:
-smp 1:        Average: 2.1ms        Stdev: 0.8ms
-smp 2:        Average: 2.5ms        Stdev: 1.5ms
-smp 4:        Average: 2.2ms        Stdev: 1.1ms
-smp 8:        Average: 2.4ms        Stdev: 0.7ms
-smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
-smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)


pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
-smp 1:        Average: 2.2ms        Stdev: 1.2ms
-smp 2:        Average: 3.0ms        Stdev: 1.4ms
-smp 4:        Average: 3.1ms        Stdev: 1.3ms
-smp 8:        Average: 3.4ms        Stdev: 1.4ms
-smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
-smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)


Above 24 vCPUs performance gets worse quickly, but I think it's already
quite clear that the results for the ioctl scale better as the number of
vCPUs increases, while pausing the vCPUs becomes slower already with 16
vCPUs.

Thank you,
Emanuele
David Hildenbrand Sept. 23, 2022, 1:21 p.m. UTC | #7
On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 19/09/2022 at 19:30, David Hildenbrand wrote:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates,
>>>>>> so that readers of the memslot active list avoid seeing incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access.  Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>
>>>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>>>> will see a partial update (old memslot deleted, new one not yet created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>> of complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>
>>>>
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>>
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges to add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>>
>> I only performed a simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges,
>> but it should indicate what's possible and what the possible price might
>> be. [one could wire up a new KVM ioctl and call it conditionally on
>> support if really required]
>>
> 
> A benefit of my ioctl implementation is that it could also be used by other
> hypervisors, which then do not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that it is not scalable with the
> number of vCPUs. Stopping *all* vCPUs just to allow a memslot update is
> also inferior (KVM instead only pauses the vCPUs that access the modified
> memslots).
> 
> So I took some measurements, to see the performance difference
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
> Processor with kernel 5.19.0 (+ my patches).
> 
> Then I measured the time QEMU spends in kvm_commit (i.e. in
> memslot updates) while booting a VM. In other words, if kvm_commit takes
> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
> not called anymore after boot, so this measurement is easy to compare
> over multiple invocations of QEMU.
> 
> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24. The QEMU
> command is the same one used to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
> 
> Each boot is repeated 100 times, and from the results I compute the
> average and stddev (in milliseconds).
> 
> ioctl:
> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
> 
> 
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
> -smp 4:        Average: 3.1ms        Stdev: 1.3ms
> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
> 
> 
> Above 24 vCPUs performance gets worse quickly, but I think it's already
> quite clear that the results for the ioctl scale better as the number of
> vCPUs increases, while pausing the vCPUs becomes slower already with 16
> vCPUs.

Right, the question is if it happens often enough that we even
care, and if there are no ways to mitigate it.

It doesn't necessarily have to scale with the #VCPUs I think. What
should dominate the overall time, in theory, is how long it takes for one
VCPU (the slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all
VCPUs out of KVM instead of doing it per VCPU. That alone might already
speed things up considerably.

2) One thing I wondered is whether the biggest overhead is actually
taking the locks in QEMU rather than waiting for the VCPUs. Maybe
we could optimize that as well. (for now I use one lock per VCPU because
it felt like it would reduce the ioctl overhead; maybe there is a better
alternative to balance between both users)

So treat my patch as a completely unoptimized version.
Emanuele Giuseppe Esposito Sept. 23, 2022, 1:38 p.m. UTC | #8
On 23/09/2022 at 15:21, David Hildenbrand wrote:
> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 19/09/2022 at 19:30, David Hildenbrand wrote:
>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>> through
>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>> updates,
>>>>>>> so that readers of the memslot active list avoid seeing incomplete
>>>>>>> states.
>>>>>>>
>>>>>>> For example, in RHBZ
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>
>>>>>> I don't have access.  Can you provide a TL;DR?
>>>>>
>>>>> You should be able to have access to it now.
>>>>>
>>>>>>
>>>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>>>> will see a partial update (old memslot deleted, new one not yet
>>>>>> created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>> of complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>
>>>>>
>>>>> I think it is not that easy to solve in userspace: see
>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>> will
>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>> when called from vcpu context e.g., in virtio code)."
>>>>
>>>> Can you please comment on the bigger picture? The patch from me works
>>>> around *exactly that*, and for that reason, contains that comment.
>>>>
>>>
>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>> transactions (not just resizes) where ranges to add overlap with ranges
>>> to remove.
>>>
>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>
>>>
>>> I only performed a simple boot check under x86-64 (where I can see region
>>> resizes) and some make checks -- pretty sure it has some rough edges,
>>> but it should indicate what's possible and what the possible price might
>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>> support if really required]
>>>
>>
>> A benefit of my ioctl implementation is that it could also be used by other
>> hypervisors, which then do not need to share this kind of "hack".
>> However, after also talking with Maxim and Paolo, we all agreed that the
>> main disadvantage of your approach is that it is not scalable with the
>> number of vCPUs. Stopping *all* vCPUs just to allow a memslot update is
>> also inferior (KVM instead only pauses the vCPUs that access the modified
>> memslots).
>>
>> So I took some measurements, to see the performance difference
>> between my implementation and yours. I used a machine where I could
>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>> Processor with kernel 5.19.0 (+ my patches).
>>
>> Then I measured the time QEMU spends in kvm_commit (i.e. in
>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>> not called anymore after boot, so this measurement is easy to compare
>> over multiple invocations of QEMU.
>>
>> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24. The QEMU
>> command is the same one used to replicate the bug:
>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>> --display none >> ~/smp_$v;
>>
>> Each boot is repeated 100 times, and from the results I compute the
>> average and stddev (in milliseconds).
>>
>> ioctl:
>> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
>> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
>> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
>> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
>> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
>> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
>>
>>
>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
>> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
>> -smp 4:        Average: 3.1ms        Stdev: 1.3ms
>> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
>> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
>> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
>>
>>
>> Above 24 vCPUs performance gets worse quickly, but I think it's already
>> quite clear that the results for the ioctl scale better as the number of
>> vCPUs increases, while pausing the vCPUs becomes slower already with 16
>> vCPUs.
> 
> Right, the question is if it happens often enough that we even
> care, and if there are no ways to mitigate it.
> 
> It doesn't necessarily have to scale with the #VCPUs I think. What
> should dominate the overall time, in theory, is how long it takes for one
> VCPU (the slowest one) to leave the kernel.
> 
> I wondered if
> 
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU. That alone might already
> speed things up considerably.

So if I understand correctly, this implies creating a new ioctl in KVM
anyway? What would then be the difference with what I do? We are
affecting the kernel anyway.

> 
> 2) One thing I wondered is whether the biggest overhead is actually
> taking the locks in QEMU rather than waiting for the VCPUs. Maybe
> we could optimize that as well. (for now I use one lock per VCPU because
> it felt like it would reduce the ioctl overhead; maybe there is a better
> alternative to balance between both users)
> 
> So treat my patch as a completely unoptimized version.
> 
For what it's worth, my version also performs #invalidate+1 swaps, which
is not optimized.

Honestly, I don't see how the above is easier or simpler than what is
being proposed here.

Thank you,
Emanuele
David Hildenbrand Sept. 26, 2022, 9:03 a.m. UTC | #9
On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 23/09/2022 at 15:21, David Hildenbrand wrote:
>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> On 19/09/2022 at 19:30, David Hildenbrand wrote:
>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>
>>>>>>
>>>>>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>> through
>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>> updates,
>>>>>>>> so that readers of the memslot active list avoid seeing incomplete
>>>>>>>> states.
>>>>>>>>
>>>>>>>> For example, in RHBZ
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>>
>>>>>>> I don't have access.  Can you provide a TL;DR?
>>>>>>
>>>>>> You should be able to have access to it now.
>>>>>>
>>>>>>>
>>>>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>>>>> will see a partial update (old memslot deleted, new one not yet
>>>>>>> created)
>>>>>>> and will crash.
>>>>>>
>>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>>> of complexity
>>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>>
>>>>>>
>>>>>> I think it is not that easy to solve in userspace: see
>>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>>> will
>>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>>> when called from vcpu context e.g., in virtio code)."
>>>>>
>>>>> Can you please comment on the bigger picture? The patch from me works
>>>>> around *exactly that*, and for that reason, contains that comment.
>>>>>
>>>>
>>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>>> transactions (not just resizes) where ranges to add overlap with ranges
>>>> to remove.
>>>>
>>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>>
>>>>
>>>> I only performed a simple boot check under x86-64 (where I can see region
>>>> resizes) and some make checks -- pretty sure it has some rough edges,
>>>> but it should indicate what's possible and what the possible price might
>>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>>> support if really required]
>>>>
>>>
>>> A benefit of my ioctl implementation is that it could also be used by other
>>> hypervisors, which then do not need to share this kind of "hack".
>>> However, after also talking with Maxim and Paolo, we all agreed that the
>>> main disadvantage of your approach is that it is not scalable with the
>>> number of vCPUs. Stopping *all* vCPUs just to allow a memslot update is
>>> also inferior (KVM instead only pauses the vCPUs that access the modified
>>> memslots).
>>>
>>> So I took some measurements, to see the performance difference
>>> between my implementation and yours. I used a machine where I could
>>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>>> Processor with kernel 5.19.0 (+ my patches).
>>>
>>> Then I measured the time QEMU spends in kvm_commit (i.e. in
>>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>>> not called anymore after boot, so this measurement is easy to compare
>>> over multiple invocations of QEMU.
>>>
>>> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24. The QEMU
>>> command is the same one used to replicate the bug:
>>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>>> --display none >> ~/smp_$v;
>>>
>>> Each boot is repeated 100 times, and from the results I compute the
>>> average and stddev (in milliseconds).
>>>
>>> ioctl:
>>> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
>>> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
>>> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
>>> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
>>> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
>>> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
>>>
>>>
>>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>>> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
>>> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
>>> -smp 4:        Average: 3.1ms        Stdev: 1.3ms
>>> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
>>> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
>>> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
>>>
>>>
>>> Above 24 vCPUs performance gets worse quickly, but I think it's already
>>> quite clear that the results for the ioctl scale better as the number of
>>> vCPUs increases, while pausing the vCPUs becomes slower already with 16
>>> vCPUs.
>>
>> Right, the question is if it happens often enough that we even
>> care, and if there are no ways to mitigate it.
>>
>> It doesn't necessarily have to scale with the #VCPUs I think. What
>> should dominate the overall time, in theory, is how long it takes for one
>> VCPU (the slowest one) to leave the kernel.
>>
>> I wondered if
>>
>> 1) it might be easier to have a single KVM mechanism/call to kick all
>> VCPUs out of KVM instead of doing it per VCPU. That alone might already
>> speed things up considerably.
> 
> So if I understand correctly, this implies creating a new ioctl in KVM
> anyway? What would then be the difference with what I do? We are
> affecting the kernel anyway.
> 

If that turns out to be the actual bottleneck. I haven't analyzed what 
exactly is causing the delay with increasing # VCPUs, so I'm just guessing.

Right now, we primarily use pthread_kill(SIG_IPI) to send a signal to 
each VCPU thread.
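
To make the per-vCPU cost concrete, here is a minimal sketch of that
pattern; the struct and function names are illustrative, not QEMU's
actual ones, and only pthread_kill() mirrors what QEMU does:

#include <pthread.h>
#include <signal.h>

struct vcpu {
    pthread_t thread;
};

/* One signal, and thus one syscall, per vCPU: the cost of forcing all
 * vCPUs out of KVM_RUN grows linearly with the vCPU count, which is
 * what a single "kick all" KVM mechanism could amortize. */
static void kick_all_vcpus(struct vcpu *vcpus, int nr_vcpus, int sig_ipi)
{
    for (int i = 0; i < nr_vcpus; i++)
        pthread_kill(vcpus[i].thread, sig_ipi);
}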

>>
>> 2) One thing I wondered is whether the biggest overhead is actually
>> taking the locks in QEMU rather than waiting for the VCPUs. Maybe
>> we could optimize that as well. (for now I use one lock per VCPU because
>> it felt like it would reduce the ioctl overhead; maybe there is a better
>> alternative to balance between both users)
>>
>> So treat my patch as a completely unoptimized version.
>>
> For what it's worth, my version also performs #invalidate+1 swaps, which
> is not optimized.
> 
> Honestly, I don't see how the above is easier or simpler than what is
> being proposed here.

As Sean said "This is an awful lot of complexity to take on for
something that appears to be solvable in userspace."

I showed that it's possible. So from that point on, there has to be good 
justification why that complexity has to exist in kernel space to 
optimize this scenario.

Now, it's not my call to make. I'm just pointing out that it can be done 
and that there might be room for improvement in my quick prototype.
Sean Christopherson Sept. 26, 2022, 9:28 p.m. UTC | #10
On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> > 
> > 
> > On 23/09/2022 at 15:21, David Hildenbrand wrote:
> > > On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> > > > 
> > > > 
> > > > On 19/09/2022 at 19:30, David Hildenbrand wrote:
> > > > > On 19.09.22 09:53, David Hildenbrand wrote:
> > > > > > On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 09/09/2022 at 16:30, Sean Christopherson wrote:
> > > > > > > > On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> > > > > > > > > KVM is currently capable of receiving a single memslot update
> > > > > > > > > through
> > > > > > > > > the KVM_SET_USER_MEMORY_REGION ioctl.
> > > > > > > > > The problem arises when we want to atomically perform multiple
> > > > > > > > > updates,
> > > > > > > > > so that readers of the memslot active list avoid seeing incomplete
> > > > > > > > > states.
> > > > > > > > > 
> > > > > > > > > For example, in RHBZ
> > > > > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1979276

...

> As Sean said "This is an awful lot of complexity to take on for something
> that appears to be solvable in userspace."

And if the userspace solution is unpalatable for whatever reason, I'd like to
understand exactly what KVM behavior is problematic for userspace.  E.g. the
above RHBZ bug should no longer be an issue as the buggy commit has since been
reverted.

If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
the race _if_ userspace chooses not to pause vCPUs.
Emanuele Giuseppe Esposito Sept. 27, 2022, 7:38 a.m. UTC | #11
On 26/09/2022 at 23:28, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>> On 23/09/2022 at 15:21, David Hildenbrand wrote:
>>>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>> On 19/09/2022 at 19:30, David Hildenbrand wrote:
>>>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>>>
>>>>>>>>
>>>>>> On 09/09/2022 at 16:30, Sean Christopherson wrote:
>>>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>>>> through
>>>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>>>> updates,
>>>>>>>>>> so that readers of the memslot active list avoid seeing incomplete
>>>>>>>>>> states.
>>>>>>>>>>
>>>>>>>>>> For example, in RHBZ
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
> 
> ...
> 
>> As Sean said "This is an awful lot of complexity to take on for something
>> that appears to be solvable in userspace."
> 
> And if the userspace solution is unpalatable for whatever reason, I'd like to
> understand exactly what KVM behavior is problematic for userspace.  E.g. the
> above RHBZ bug should no longer be an issue as the buggy commit has since been
> reverted.

It still is because I can reproduce the bug, as also pointed out in
multiple comments below.

> 
> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> the race _if_ userspace chooses not to pause vCPUs.
> 

Also on the BZ they all seem (Paolo included) to agree that the issue is
non-atomic memslot updates.

To be more precise, what I did mostly follows what Peter explained in
comment 19: https://bugzilla.redhat.com/show_bug.cgi?id=1979276#c19
Sean Christopherson Sept. 27, 2022, 3:58 p.m. UTC | #12
On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> 
> On 26/09/2022 at 23:28, Sean Christopherson wrote:
> > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> >> As Sean said "This is an awful lot of complexity to take on for something
> >> that appears to be solvable in userspace."
> > 
> > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > understand exactly what KVM behavior is problematic for userspace.  E.g. the
> > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > reverted.
> 
> It still is because I can reproduce the bug, as also pointed out in
> multiple comments below.

You can reproduce _a_ bug, but it's obviously not the original bug, because the
last comment says:

  Second, indeed the patch was reverted and somehow accepted without generating
  too much noise:

  ...

  The underlying issue of course as we both know is still there.

  You might have luck reproducing it with this bug

  https://bugzilla.redhat.com/show_bug.cgi?id=1855298

  But for me it looks like it is 'working' as well, so you might have
  to write a unit test to trigger the issue.

> > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > the race _if_ userspace chooses not to pause vCPUs.
> > 
> 
> Also on the BZ they all seem (Paolo included) to agree that the issue is
> non-atomic memslot updates.

Yes, non-atomic memslot updates likely result in the guest fetching from a GPA without a
memslot.  I'm asking for an explanation of exactly what happens when that occurs,
because it should be possible to adjust KVM and/or QEMU to play nice with the
fetch, e.g. to resume the guest until the new memslot is installed, in which case
an atomic update isn't needed.

I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
the MMIO code fetch.

I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
otherwise exit with mmio.len==0.
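
For reference, the QEMU side boils down to the KVM_EXIT_MMIO dispatch in
its vCPU run loop; the sketch below paraphrases that handling (it is not
the verbatim accel/kvm/kvm-all.c source) to show why a zero-length "read"
degenerates to "resume the guest and retry":

#include <linux/kvm.h>

/* Paraphrased sketch of QEMU's KVM_EXIT_MMIO handling. */
static int handle_mmio_exit(struct kvm_run *run)
{
        /*
         * QEMU feeds run->mmio.{phys_addr,data,len,is_write} to
         * address_space_rw(); with len == 0 the transaction touches
         * nothing, so the run loop simply re-enters KVM_RUN and the
         * guest retries the fetch once the new memslot is installed.
         */
        if (run->mmio.len == 0)
                return 0;       /* nothing to emulate, resume the vCPU */

        /* ... normal MMIO emulation through the memory API ... */
        return 0;
}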

Compile tested only...

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 27 Sep 2022 08:16:03 -0700
Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
 MMIO fetch

Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
a memslot.  If userspace is manipulating memslots without pausing vCPUs,
e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
valid memslot installed.  Depending on guest context, KVM will either
exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
resume the guest (L2 or CPL>0), neither of which is desirable as exiting
with "emulation error" effectively kills the VM, and resuming the guest
doesn't provide userspace an opportunity to react to the fetch.

Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
there is no other way to communicate "fetch" to userspace without
defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
and not a flag, and there is no known use case for actually supporting
code fetches from MMIO, i.e. there's no need to allow userspace to fill
in the instruction bytes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/emulate.c     | 2 ++
 arch/x86/kvm/kvm_emulate.h | 1 +
 arch/x86/kvm/x86.c         | 9 ++++++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f092c54d1a2f..e141238d93b0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
 done:
 	if (rc == X86EMUL_PROPAGATE_FAULT)
 		ctxt->have_exception = true;
+	if (rc == X86EMUL_IO_NEEDED)
+		return EMULATION_IO_FETCH;
 	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }
 
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 89246446d6aa..3cb2e321fcd2 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_IO_FETCH 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa5ab0c620de..7eb72694c601 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 		bytes = (unsigned)PAGE_SIZE - offset;
 	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
 				       offset, bytes);
-	if (unlikely(ret < 0))
+	if (unlikely(ret < 0)) {
+		vcpu->run->mmio.phys_addr = gpa;
+		vcpu->run->mmio.len = 0;
+		vcpu->run->mmio.is_write = 0;
+		vcpu->run->exit_reason = KVM_EXIT_MMIO;
 		return X86EMUL_IO_NEEDED;
+	}
 
 	return X86EMUL_CONTINUE;
 }
@@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		r = x86_decode_emulated_instruction(vcpu, emulation_type,
 						    insn, insn_len);
 		if (r != EMULATION_OK)  {
+			if (r == EMULATION_IO_FETCH)
+				return 0;
 			if ((emulation_type & EMULTYPE_TRAP_UD) ||
 			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
 				kvm_queue_exception(vcpu, UD_VECTOR);

base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
--
Emanuele Giuseppe Esposito Sept. 28, 2022, 9:11 a.m. UTC | #13
On 27/09/2022 at 17:58, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>
>> On 26/09/2022 at 23:28, Sean Christopherson wrote:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
> 
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
> 
>   Second, indeed the patch was reverted and somehow accepted without generating
>   too much noise:
> 
>   ...
> 
>   The underlying issue of course as we both know is still there.
> 
>   You might have luck reproducing it with this bug
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> 
>   But for me it looks like it is 'working' as well, so you might have
>   to write a unit test to trigger the issue.
> 
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>>
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslot updates.
> 
> Yes, non-atomic memslot updates likely result in the guest fetching from a GPA without a
> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
> 
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
> 
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.
> 
> Compile tested only...

So basically you are just making KVM catch the failed
kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
basically ends up doing nothing and retrying execution of the
instruction?

I wonder if there are some performance implications in this, but it's
definitely simpler than what I did.

Tested on the same failing machine used for the BZ, fixes the bug.

Do you want me to re-send the patch on your behalf (and probably add
some small documentation in Documentation/virt/kvm/api.rst)?

Emanuele
> 
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 27 Sep 2022 08:16:03 -0700
> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>  MMIO fetch
> 
> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> a memslot.  If userspace is manipulating memslots without pausing vCPUs,
> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> valid memslot installed.  Depending on guest context, KVM will either
> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> with "emulation error" effectively kills the VM, and resuming the guest
> doesn't provide userspace an opportunity to react to the fetch.
> 
> Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
> there is no other way to communicate "fetch" to userspace without
> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> and not a flag, and there is no known use case for actually supporting
> code fetches from MMIO, i.e. there's no need to allow userspace to fill
> in the instruction bytes.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/emulate.c     | 2 ++
>  arch/x86/kvm/kvm_emulate.h | 1 +
>  arch/x86/kvm/x86.c         | 9 ++++++++-
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index f092c54d1a2f..e141238d93b0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>  done:
>  	if (rc == X86EMUL_PROPAGATE_FAULT)
>  		ctxt->have_exception = true;
> +	if (rc == X86EMUL_IO_NEEDED)
> +		return EMULATION_IO_FETCH;
>  	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
>  }
>  
> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> index 89246446d6aa..3cb2e321fcd2 100644
> --- a/arch/x86/kvm/kvm_emulate.h
> +++ b/arch/x86/kvm/kvm_emulate.h
> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>  #define EMULATION_OK 0
>  #define EMULATION_RESTART 1
>  #define EMULATION_INTERCEPTED 2
> +#define EMULATION_IO_FETCH 3
>  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index aa5ab0c620de..7eb72694c601 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
>  		bytes = (unsigned)PAGE_SIZE - offset;
>  	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
>  				       offset, bytes);
> -	if (unlikely(ret < 0))
> +	if (unlikely(ret < 0)) {
> +		vcpu->run->mmio.phys_addr = gpa;
> +		vcpu->run->mmio.len = 0;
> +		vcpu->run->mmio.is_write = 0;
> +		vcpu->run->exit_reason = KVM_EXIT_MMIO;
>  		return X86EMUL_IO_NEEDED;
> +	}
>  
>  	return X86EMUL_CONTINUE;
>  }
> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		r = x86_decode_emulated_instruction(vcpu, emulation_type,
>  						    insn, insn_len);
>  		if (r != EMULATION_OK)  {
> +			if (r == EMULATION_IO_FETCH)
> +				return 0;
>  			if ((emulation_type & EMULTYPE_TRAP_UD) ||
>  			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
>  				kvm_queue_exception(vcpu, UD_VECTOR);
> 
> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
>
Maxim Levitsky Sept. 28, 2022, 11:14 a.m. UTC | #14
On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
> 
> On 27/09/2022 at 17:58, Sean Christopherson wrote:
> > On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> > > On 26/09/2022 at 23:28, Sean Christopherson wrote:
> > > > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > > > > As Sean said "This is an awful lot of complexity to take on for something
> > > > > that appears to be solvable in userspace."
> > > > 
> > > > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > > > understand exactly what KVM behavior is problematic for userspace.  E.g. the
> > > > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > > > reverted.
> > > 
> > > It still is because I can reproduce the bug, as also pointed out in
> > > multiple comments below.
> > 
> > You can reproduce _a_ bug, but it's obviously not the original bug, because the
> > last comment says:
> > 
> >   Second, indeed the patch was reverted and somehow accepted without generating
> >   too much noise:
> > 
> >   ...
> > 
> >   The underlying issue of course as we both know is still there.
> > 
> >   You might have luck reproducing it with this bug
> > 
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> > 
> >   But for me it looks like it is 'working' as well, so you might have
> >   to write a unit test to trigger the issue.
> > 
> > > > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > > > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > > > the race _if_ userspace chooses not to pause vCPUs.
> > > > 
> > > 
> > > Also on the BZ they all seem (Paolo included) to agree that the issue is
> > > non-atomic memslot updates.
> > 
> > Yes, non-atomic memslot updates likely result in the guest fetching from a GPA without a
> > memslot.  I'm asking for an explanation of exactly what happens when that occurs,
> > because it should be possible to adjust KVM and/or QEMU to play nice with the
> > fetch, e.g. to resume the guest until the new memslot is installed, in which case
> > an atomic update isn't needed.
> > 
> > I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> > guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
> > then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> > of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> > the MMIO code fetch.
> > 
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> > 
> > Compile tested only...
> 
> So basically you are just making KVM catch the failed
> kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
> basically ends up doing nothing and retrying execution of the
> instruction?
> 
> I wonder if there are some performance implications in this, but it's
> definitely simpler than what I did.
> 
> Tested on the same failing machine used for the BZ, fixes the bug.
> 
> Do you want me to re-send the patch on your behalf (and probably add
> some small documentation in Documentation/virt/kvm/api.rst)?
> 
> Emanuele
> > ---
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Tue, 27 Sep 2022 08:16:03 -0700
> > Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
> >  MMIO fetch
> > 
> > Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> > able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> > a memslot.  If userspace is manipulating memslots without pausing vCPUs,
> > e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> > valid memslot installed.  Depending on guest context, KVM will either
> > exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> > resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> > with "emulation error" effectively kills the VM, and resuming the guest
> > doesn't provide userspace an opportunity to react to the fetch.
> > 
> > Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
> > there is no other way to communicate "fetch" to userspace without
> > defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> > and not a flag, and there is no known use case for actually supporting
> > code fetches from MMIO, i.e. there's no need to allow userspace to fill
> > in the instruction bytes.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/emulate.c     | 2 ++
> >  arch/x86/kvm/kvm_emulate.h | 1 +
> >  arch/x86/kvm/x86.c         | 9 ++++++++-
> >  3 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > index f092c54d1a2f..e141238d93b0 100644
> > --- a/arch/x86/kvm/emulate.c
> > +++ b/arch/x86/kvm/emulate.c
> > @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
> >  done:
> >  	if (rc == X86EMUL_PROPAGATE_FAULT)
> >  		ctxt->have_exception = true;
> > +	if (rc == X86EMUL_IO_NEEDED)
> > +		return EMULATION_IO_FETCH;
> >  	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
> >  }
> >  
> > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> > index 89246446d6aa..3cb2e321fcd2 100644
> > --- a/arch/x86/kvm/kvm_emulate.h
> > +++ b/arch/x86/kvm/kvm_emulate.h
> > @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> >  #define EMULATION_OK 0
> >  #define EMULATION_RESTART 1
> >  #define EMULATION_INTERCEPTED 2
> > +#define EMULATION_IO_FETCH 3
> >  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
> >  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
> >  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index aa5ab0c620de..7eb72694c601 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
> >  		bytes = (unsigned)PAGE_SIZE - offset;
> >  	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
> >  				       offset, bytes);
> > -	if (unlikely(ret < 0))
> > +	if (unlikely(ret < 0)) {
> > +		vcpu->run->mmio.phys_addr = gpa;
> > +		vcpu->run->mmio.len = 0;
> > +		vcpu->run->mmio.is_write = 0;
> > +		vcpu->run->exit_reason = KVM_EXIT_MMIO;
> >  		return X86EMUL_IO_NEEDED;
> > +	}
> >  
> >  	return X86EMUL_CONTINUE;
> >  }
> > @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  		r = x86_decode_emulated_instruction(vcpu, emulation_type,
> >  						    insn, insn_len);
> >  		if (r != EMULATION_OK)  {
> > +			if (r == EMULATION_IO_FETCH)
> > +				return 0;
> >  			if ((emulation_type & EMULTYPE_TRAP_UD) ||
> >  			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> >  				kvm_queue_exception(vcpu, UD_VECTOR);
> > 
> > base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
> > 

Note that AFAIK there is another case (and probably more): if TDP is disabled
and the MMU root is in MMIO, we kill the guest.


mmu_alloc_shadow_roots -> mmu_check_root
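
For reference, the check in question looked roughly like this at the time
(paraphrasing arch/x86/kvm/mmu/mmu.c):

static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
{
	int ret = 0;

	/*
	 * No memslot backs the root GPA: request a triple fault, which
	 * effectively kills the guest.
	 */
	if (!kvm_vcpu_is_visible_gfn(vcpu->kvm, root_gfn)) {
		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
		ret = 1;
	}

	return ret;
}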


I used to have a few hacks in KVM to cope with this, but AFAIK I gave up
on it, because the issue would show up again and again.

Best regards,
	Maxim Levitsky
David Hildenbrand Sept. 28, 2022, 12:52 p.m. UTC | #15
On 28.09.22 13:14, Maxim Levitsky wrote:
> On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
>>
>> Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
>>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>>>> that appears to be solvable in userspace."
>>>>>
>>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>>>> reverted.
>>>>
>>>> It still is because I can reproduce the bug, as also pointed out in
>>>> multiple comments below.
>>>
>>> You can reproduce _a_ bug, but it's obviously not the original bug, because the
>>> last comment says:
>>>
>>>    Second, indeed the patch was reverted and somehow accepted without generating
>>>    too much noise:
>>>
>>>    ...
>>>
>>>    The underlying issue of course as we both know is still there.
>>>
>>>    You might have luck reproducing it with this bug
>>>
>>>    https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>>>
>>>    But for me it looks like it is 'working' as well, so you might have
>>>    to write a unit test to trigger the issue.
>>>
>>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>>>> the race _if_ userspace chooses not to pause vCPUs.
>>>>>
>>>>
>>>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>>>> non-atomic memslots update.
>>>
>>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
>>> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
>>> because it should be possible to adjust KVM and/or QEMU to play nice with the
>>> fetch, e.g. to resume the guest until the new memslot is installed, in which case
>>> an atomic update isn't needed.
>>>
>>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
>>> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
>>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
>>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
>>> the MMIO code fetch.
>>>
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>>
>>> Compile tested only...
>>
>> So basically you are just making KVM catch the failed
>> kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
>> basically ends up doing nothing and retrying the instruction?
>>
>> I wonder if there are some performance implications in this, but it's
>> definitely simpler than what I did.
>>
>> Tested on the same failing machine used for the BZ, fixes the bug.
>>
>> Do you want me to re-send the patch on your behalf (and probably add some
>> documentation in Documentation/virt/kvm/api.rst)?
>>
>> Emanuele
>>> ---
>>> From: Sean Christopherson <seanjc@google.com>
>>> Date: Tue, 27 Sep 2022 08:16:03 -0700
>>> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>>>   MMIO fetch
>>> ...
> 
> Note that AFAIK there is another case (and probably more): if TDP is disabled
> and the MMU root is in MMIO, we kill the guest.
> 
> 
> mmu_alloc_shadow_roots -> mmu_check_root
> 
> 
> I used to have a few hacks in KVM to cope with this, but AFAIK I gave up
> on it, because the issue would show up again and again.

IIRC, s390x can have real problems if we temporarily remove a memslot, in
case the emulation/interpretation code tries accessing guest memory and
fails because there is no memslot describing that portion of guest RAM.

Note that resizing/merging/splitting currently shouldn't happen on 
s390x, though. But resizing of memslots might happen in the near future 
with virtio-mem in QEMU.
Paolo Bonzini Sept. 28, 2022, 3:07 p.m. UTC | #16
On 9/27/22 17:58, Sean Christopherson wrote:
> ...
>
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.

I think this patch is not a good idea for two reasons:

1) we don't know how userspace behaves if mmio.len is zero.  It is of
course reasonable to do nothing (see the QEMU sketch below), but an
assertion failure is also a valid behavior

2) more important, there is no way to distinguish a failure due to the 
guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from 
one due to the KVM_SET_USER_MEMORY_REGION race condition.  So this will 
cause a guest that correctly caused an internal error to loop forever.

While the former could be handled in a "wait and see" manner, the latter 
in particular is part of the KVM_RUN contract.  Of course it is possible 
for a guest to just loop forever, but in general all of KVM, QEMU and 
upper userspace layers want a crashed guest to be detected and stopped 
forever.
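
On the first point: QEMU today does happen to do nothing.  Its exit
dispatch in accel/kvm/kvm-all.c is roughly:

        case KVM_EXIT_MMIO:
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
            ret = 0;
            break;

and address_space_rw() with len == 0 never enters its copy loop, so the
vCPU silently re-enters.  But nothing in the ABI guarantees that every
VMM is as forgiving.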

Yes, QEMU could loop only if memslot updates are in progress, but 
honestly all the alternatives I have seen to atomic memslot updates are 
really *awful*.  David's patches even invent a new kind of mutex for 
which I have absolutely no idea what kind of deadlocks one should worry 
about and why they should not exist; QEMU's locking is already pretty 
crappy, it's certainly not on my wishlist to make it worse!

This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) 
the kernel is the only place where you can have a *good* fix.  It should 
have been fixed years ago.

Paolo
David Hildenbrand Sept. 28, 2022, 3:33 p.m. UTC | #17
On 28.09.22 17:07, Paolo Bonzini wrote:
> ...
>
> I think this patch is not a good idea for two reasons:
> 
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of
> course reasonable to do nothing, but an assertion failure is also a
> valid behavior
> 
> 2) more important, there is no way to distinguish a failure due to the
> guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from
> one due to the KVM_SET_USER_MEMORY_REGION race condition.  So this will
> cause a guest that correctly caused an internal error to loop forever.
> 
> While the former could be handled in a "wait and see" manner, the latter
> in particular is part of the KVM_RUN contract.  Of course it is possible
> for a guest to just loop forever, but in general all of KVM, QEMU and
> upper userspace layers want a crashed guest to be detected and stopped
> forever.
> 
> Yes, QEMU could loop only if memslot updates are in progress, but
> honestly all the alternatives I have seen to atomic memslot updates are
> really *awful*.  David's patches even invent a new kind of mutex for
> which I have absolutely no idea what kind of deadlocks one should worry
> about and why they should not exist; QEMU's locking is already pretty
> crappy, it's certainly not on my wishlist to make it worse!

Just to comment on that (I'm happy as long as this gets fixed), a simple 
mutex with trylock should get the thing done as well -- kicking the VCPU 
if the trylock fails. But I did not look further into locking alternatives.
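
A minimal sketch of what I mean (names are illustrative, this is not the
actual patches):

static QemuMutex memslot_update_lock;   /* initialised elsewhere; held
                                         * across the whole update */

static bool try_consume_memslots(void)
{
    /* qemu_mutex_trylock() returns 0 on success. */
    if (qemu_mutex_trylock(&memslot_update_lock)) {
        /* Update in flight: don't block, kick the vCPU and retry. */
        return false;
    }
    /* ... access the (now stable) memslot view ... */
    qemu_mutex_unlock(&memslot_update_lock);
    return true;
}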
Sean Christopherson Sept. 28, 2022, 3:58 p.m. UTC | #18
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> 
> I think this patch is not a good idea for two reasons:
> 
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
> reasonable to do nothing, but an assertion failure is also a valid behavior

Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in
L2, KVM injects #UD.  That's flat out architecturally invalid.  If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.

> 2) more important, there is no way to distinguish a failure due to the guest
> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
> guest that correctly caused an internal error to loop forever.

Userspace has the GPA and absolutely should be able to detect if the MMIO may have
been due to its memslot manipulation versus the guest jumping into the weeds.

> While the former could be handled in a "wait and see" manner, the latter in
> particular is part of the KVM_RUN contract.  Of course it is possible for a
> guest to just loop forever, but in general all of KVM, QEMU and upper
> userspace layers want a crashed guest to be detected and stopped forever.
> 
> Yes, QEMU could loop only if memslot updates are in progress, but honestly
> all the alternatives I have seen to atomic memslot updates are really
> *awful*.  David's patches even invent a new kind of mutex for which I have
> absolutely no idea what kind of deadlocks one should worry about and why
> they should not exist; QEMU's locking is already pretty crappy, it's
> certainly not on my wishlist to make it worse!
> 
> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
> kernel is the only place where you can have a *good* fix.  It should have
> been fixed years ago.

I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merged.

E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
do wholesale replacement?  That seems like it would be vastly simpler to handle
on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
API that allows 1->N or N->1 conversions but not arbitrary batching.

And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
releasing it...  I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.

  /*
   * Takes kvm->slots_arch_lock, and releases it only if
   * invalid_slot allocation, kvm_prepare_memory_region failed
   * or batch->is_move_delete is true.
   */
  static int kvm_prepare_memslot(struct kvm *kvm,
			         struct kvm_internal_memory_region_list *batch)
Paolo Bonzini Sept. 28, 2022, 4:38 p.m. UTC | #19
On 9/28/22 17:58, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/27/22 17:58, Sean Christopherson wrote:
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>
>> I think this patch is not a good idea for two reasons:
>>
>> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
>> reasonable to do nothing, but an assertion failure is also a valid behavior
> 
> Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in
> L2, KVM injects #UD.  That's flat out architecturally invalid.  If it's a sticking
> point, the mmio.len==0 hack can be avoided by defining a new exit reason.

I agree that doing this at CPL>0 or in L2 is invalid and makes little 
sense (because either way the invalid address cannot be reached without 
help from the supervisor or L1's page tables).

>> 2) more important, there is no way to distinguish a failure due to the guest
>> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
>> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
>> guest that correctly caused an internal error to loop forever.
> 
> Userspace has the GPA and absolutely should be able to detect if the MMIO may have
> been due to its memslot manipulation versus the guest jumping into the weeds.
> 
>> While the former could be handled in a "wait and see" manner, the latter in
>> particular is part of the KVM_RUN contract.  Of course it is possible for a
>> guest to just loop forever, but in general all of KVM, QEMU and upper
>> userspace layers want a crashed guest to be detected and stopped forever.
>>
>> Yes, QEMU could loop only if memslot updates are in progress, but honestly
>> all the alternatives I have seen to atomic memslot updates are really
>> *awful*.  David's patches even invent a new kind of mutex for which I have
>> absolutely no idea what kind of deadlocks one should worry about and why
>> they should not exist; QEMU's locking is already pretty crappy, it's
>> certainly not on my wishlist to make it worse!
>>
>> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
>> kernel is the only place where you can have a *good* fix.  It should have
>> been fixed years ago.
> 
> I don't disagree that the memslots API is lacking, but IMO that is somewhat
> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> like to have the luxury of being able to explore ideas beyond "let userspace
> batch memslot updates", and I really don't want to feel pressured to get this
> code reviewed and merged.

I absolutely agree that this is not a bugfix.  Most new features for KVM 
can be seen as bug fixes if you squint hard enough, but they're still 
features.

> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> do wholesale replacement?  That seems like it would be vastly simpler to handle
> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> API that allows 1->N or N->1 conversions but not arbitrary batching.

Wholesale replacement was my first idea when I looked at the issue, I 
think at the end of 2020.  I never got to a full implementation, but my 
impression was that allocating/deallocating dirty bitmaps, rmaps etc. 
wouldn't make it any easier than arbitrary batch updates.

> And just because QEMU's locking is "already pretty crappy", that's not a good
> reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
> releasing it...  I get that this is an RFC, but IMO anything that requires such
> shenanigans simply isn't acceptable.
> 
>    /*
>     * Takes kvm->slots_arch_lock, and releases it only if
>     * invalid_slot allocation, kvm_prepare_memory_region failed
>     * or batch->is_move_delete is true.
>     */
>    static int kvm_prepare_memslot(struct kvm *kvm,
> 			         struct kvm_internal_memory_region_list *batch)
> 

No objection about that. :)

Paolo
Sean Christopherson Sept. 28, 2022, 8:41 p.m. UTC | #20
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/28/22 17:58, Sean Christopherson wrote:
> > I don't disagree that the memslots API is lacking, but IMO that is somewhat
> > orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
> > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> > like to have the luxury of being able to explore ideas beyond "let userspace
> > batch memslot updates", and I really don't want to feel pressured to get this
> > code reviewed and merged.
> 
> I absolutely agree that this is not a bugfix.  Most new features for KVM can
> be seen as bug fixes if you squint hard enough, but they're still features.

I guess I'm complaining that there isn't sufficient justification for this new
feature.  The cover letter provides a bug that would be fixed by having batched
updates, but as above, that's really due to deficiencies in a different KVM ABI.

Beyond that, there's no explanation of why this exact API is necessary, i.e. there
are no requirements given.

  - Why can't this be solved in userspace?

  - Is performance a concern?  I.e. are updates that need to be batched going to
    be high frequency operations?

  - What operations does userspace truly need?  E.g. if the only use case is to
    split/truncate/hole punch an existing memslot, can KVM instead provide a
    memslot flag and exit reason that allows kicking vCPUs to userspace if the
    memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
    but KVM exits with a dedicated exit reason instead of generating MMIO semantics.

  - What precisely needs to be atomic?  If updates only need to be "atomic" for
    an address space, does the API allow mixing non-SMM and SMM memslots?

  - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
    and recreated with a different HVA, how does KVM ensure that there are no
    outstanding references to the old HVA without introducing non-deterministic
    behavior.  The current API works by forcing userspace to fully delete the
    memslot, i.e. KVM can ensure all references are gone in all TLBs before
    allowing userspace to create new, conflicting entries.  I don't see how this
    can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
    must never see an invalid/deleted memslot, but if the memslot is writable,
    how does KVM prevent some writes from hitting the old HVA and some from hitting
    the new HVA without a quiescent period?

  - If a memslot is truncated while dirty logging is enabled, what happens to
    the bitmap?  Is it preserved?  Dropped?

Again, I completely agree that the current memslots API is far from perfect, but
I'm not convinced that simply extending the existing API to batch updates is the
best solution from a KVM perspective.

> > E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> > do wholesale replacement?  That seems like it would be vastly simpler to handle
> > on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> > API that allows 1->N or N->1 conversions but not arbitrary batching.
> 
> Wholesale replacement was my first idea when I looked at the issue, I think
> at the end of 2020.  I never got to a full implementation, but my impression
> was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it any
> easier than arbitrary batch updates.

It's not obvious to me that the memslot metadata is going to be easy to handle
regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
Emanuele Giuseppe Esposito Sept. 29, 2022, 8:05 a.m. UTC | #21
Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> ...
> 
> I guess I'm complaining that there isn't sufficient justification for this new
> feature.  The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.
> 
> Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> are no requirements given.
> 
>   - Why can't this be solved in userspace?

Because this would provide the "feature" only to QEMU, leaving every
other hypervisor to implement its own.

In addition (maybe you already answered this question but I couldn't
find an answer in the email thread), does it make sense to stop all
vCPUs for a couple of memslot updates? What if we have 100 vCPUs?

> 
>   - Is performance a concern?  I.e. are updates that need to be batched going to
>     be high frequency operations?

Currently they are limited to run only at boot. In an unmodified
KVM/QEMU build, however, I count 86 memslot updates done at boot with

./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none

> 
>   - What operations does userspace truly need?  E.g. if the only use case is to
>     split/truncate/hole punch an existing memslot, can KVM instead provide a
>     memslot flag and exit reason that allows kicking vCPUs to userspace if the
>     memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
>     but KVM exits with a dedicated exit reason instead of generating MMIO semantics.

Could be an alternative solution I guess?
> 
>   - What precisely needs to be atomic?  If updates only need to be "atomic" for
>     an address space, does the API allow mixing non-SMM and SMM memslots?

Does QEMU pass a list of mixed non-SMM and SMM memslots?
Btw, is there a separate list for SMM memslots in KVM?
If not, and they are all thrown together, then for simplicity we make all
updates atomic.

> 
>   - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
>     and recreated with a different HVA, how does KVM ensure that there are no
>     outstanding references to the old HVA without introducing non-deterministic
>     behavior.  The current API works by forcing userspace to fully delete the
>     memslot, i.e. KVM can ensure all references are gone in all TLBs before
>     allowing userspace to create new, conflicting entries.  I don't see how this
>     can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
>     must never see an invalid/deleted memslot, but if the memslot is writable,
>     how does KVM prevent some writes from hitting the old HVA and some from hitting
>     the new HVA without a quiescent period?

Sorry, this might be my fault in providing definitions, even though I
think I gave plenty of examples. Delete/move operations are not really
"atomic" in the sense that the slot disappears immediately.

The slot(s) is/are first "atomically" invalidated, allowing the guest to
retry the page fault and preparing for the cleanup you mention above,
and then are "atomically" deleted.
The behavior is the same, just instead of doing
invalidate memslot X
swap()
delete memslot X
swap()
invalidate memslot Y
swap()
delete memslot Y
swap()
...

I would do:

invalidate
invalidate
invalidate
swap()
delete
delete
delete
swap()
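
In kvm_main.c terms, the idea is a single pair of swaps per batch,
roughly (pseudo-code; helper names are approximate, not the exact code
of this RFC):

	/* pass 1: invalidate every DELETE/MOVE target */
	for_each_update(batch, u)
		if (u->change == KVM_MR_DELETE || u->change == KVM_MR_MOVE)
			kvm_invalidate_memslot(kvm, u->old, u->invalid_slot);
	kvm_swap_active_memslots(kvm, as_id);	/* one swap for all */

	/* pass 2: apply the actual changes on the inactive set */
	for_each_update(batch, u)
		kvm_commit_memory_region(kvm, u->old, u->new, u->change);
	kvm_swap_active_memslots(kvm, as_id);	/* one swap for all */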
> 
>   - If a memslot is truncated while dirty logging is enabled, what happens to
>     the bitmap?  Is it preserved?  Dropped?

Can you explain what you mean by "truncated memslot"?
Regarding the bitmap, currently QEMU should (probably wrongly) update it
before even committing the changes to KVM. But I remember upstream
someone pointed out that this could be solved later.

> 
> Again, I completely agree that the current memslots API is far from perfect, but
> I'm not convinced that simply extending the existing API to batch updates is the
> best solution from a KVM perspective.
> 
>>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
>>> do wholesale replacement?  That seems like it would be vastly simpler to handle
>>> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
>>> API that allows 1->N or N->1 conversions but not arbitrary batching.
>>
>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020.  I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it any
>> easier than arbitrary batch updates.
> 
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> 

Could you provide an example?
I mean I am not an expert but to me it looks like I preserved the same
old functionality both here and in QEMU. I just batched and delayed
some operations, which in this case should cause no harm.

Thank you,
Emanuele
David Hildenbrand Sept. 29, 2022, 8:24 a.m. UTC | #22
On 29.09.22 10:05, Emanuele Giuseppe Esposito wrote:
> ...
>>
>>    - Is performance a concern?  I.e. are updates that need to be batched going to
>>      be high frequency operations?
> 
> Currently they are limited to run only at boot. In an unmodified
> KVM/QEMU build, however, I count 86 memslot updates done at boot with
> 
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none

I *think* there are only ~3 problematic ones (split/resize), where we 
temporarily delete something we will re-add. At least that's what I 
remember from working on my prototype.
Sean Christopherson Sept. 29, 2022, 3:18 p.m. UTC | #23
On Thu, Sep 29, 2022, Emanuele Giuseppe Esposito wrote:
> 
> Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
> > Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> > are no requirements given.
> > 
> >   - Why can't this be solved in userspace?
> 
> Because this would provide the "feature" only to QEMU, leaving every
> other hypervisor to implement its own.

But there's no evidence that any other VMM actually needs this feature.

> >   - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
> >     and recreated with a different HVA, how does KVM ensure that there are no
> >     outstanding references to the old HVA without introducing non-deterministic
> >     behavior.  The current API works by forcing userspace to fully delete the
> >     memslot, i.e. KVM can ensure all references are gone in all TLBs before
> >     allowing userspace to create new, conflicting entries.  I don't see how this
> >     can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
> >     must never see an invalid/deleted memslot, but if the memslot is writable,
> >     how does KVM prevent some writes from hitting the old HVA and some from hitting
> >     the new HVA without a quiescent period?
> 
> Sorry, this might be my fault in providing definitions, even though I
> think I gave plenty of examples. Delete/move operations are not really
> "atomic" in the sense that the slot disappears immediately.
> 
> The slot(s) is/are first "atomically" invalidated, allowing the guest to
> retry the page fault

I completely forgot that KVM x86 now retries on KVM_MEMSLOT_INVALID[*].  That is
somewhat arbitrary behavior, e.g. if x86 didn't do MMIO SPTE caching it wouldn't
be "necessary" and x86 would still treat INVALID as MMIO.  It's also x86 specific,
no other architecture retries on invalid memslots, i.e. relying on that behavior
to provide "atomicity" won't work without updating all other architectures.

That's obviously doable, but since these updates aren't actually atomic, I don't
see any meaningful benefit over adding e.g. KVM_MEM_DISABLED.

[*] e0c378684b65 ("KVM: x86/mmu: Retry page faults that hit an invalid memslot")
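
For reference, the x86 behavior boils down to this check in the page
fault path (roughly):

	/*
	 * Retry the page fault if the gfn hit a memslot that is being
	 * deleted or moved, so the vCPU never observes the invalid slot.
	 */
	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
		return RET_PF_RETRY;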

> >   - If a memslot is truncated while dirty logging is enabled, what happens to
> >     the bitmap?  Is it preserved?  Dropped?
> 
> Can you explain what you mean by "truncated memslot"?

Shrink the size of the memslot.

> Regarding the bitmap, currently QEMU should (probably wrongly) update it
> before even committing the changes to KVM. But I remember upstream
> someone pointed out that this could be solved later.

These details can't be punted, as they affect the ABI.  My interpretation of
"atomic" memslot updates was that KVM would truly honor the promise of atomic
updates, e.g. that a DELETE=>CREATE sequence would effectively allow truncating
a memslot without losing any data.  Those details aren't really something that
can be punted down the road to be fixed at a later point because they very much
matter from an ABI perspective.

> > Again, I completely agree that the current memslots API is far from perfect, but
> > I'm not convinced that simply extending the existing API to batch updates is the
> > best solution from a KVM perspective.
> > 
> >>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> >>> do wholesale replacement?  That seems like it would be vastly simpler to handle
> >>> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> >>> API that allows 1->N or N->1 conversions but not arbitrary batching.
> >>
> >> Wholesale replacement was my first idea when I looked at the issue, I think
> >> at the end of 2020.  I never got to a full implementation, but my impression
> >> was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it any
> >> easier than arbitrary batch updates.
> > 
> > It's not obvious to me that the memslot metadata is going to be easy to handle
> > regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> > the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> > 
> 
> Could you provide an example?
> I mean I am not an expert but to me it looks like I preserved the same
> old functionality both here and in QEMU. I just batched and delayed
> some operations, which in this case should cause no harm.

As above, my confusion is due to calling these updates "atomic".  The new API is
really just "allow batching memslot updates and subtly rely on restarting page
faults on KVM_MEMSLOT_INVALID".

IMO, KVM_MEM_DISABLED or similar is the way to go.  I.e. formalize the "restart
page faults" semantics without taking on the complexity of batched updates.
Paolo Bonzini Sept. 29, 2022, 3:28 p.m. UTC | #24
On 9/28/22 22:41, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/28/22 17:58, Sean Christopherson wrote:
>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>> batch memslot updates", and I really don't want to feel pressured to get this
>>> code reviewed and merged.
>>
>> I absolutely agree that this is not a bugfix.  Most new features for KVM can
>> be seen as bug fixes if you squint hard enough, but they're still features.
> 
> I guess I'm complaining that there isn't sufficient justification for this new
> feature.  The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.

I disagree.  Failure to fetch should be fixed but is otherwise a red 
herring.  It's the whole memslot API (including dirty logging) that is a 
mess.

If you think we should overhaul it even more than just providing atomic 
batched updates, that's fine.  But still, the impossibility to perform 
atomic updates in batches *is* a suboptimal part of the KVM API.

>    - Why can't this be solved in userspace?

I don't think *can't* is the right word.  If the metric of choice was 
"what can be solved in userspace", we'd all be using microkernels.  The 
question is why userspace would be a better place to solve it.

The only reason to do it in userspace would be if failure to fetch is 
something that is interesting to userspace, other than between two 
KVM_SET_USER_MEMORY_REGION.  Unless you provide an API to pass failures 
to fetch down to userspace, the locking in userspace is going to be 
inferior, because it would have to be unconditional.  This means worse 
performance and more complication, not to mention having to do it N 
times instead of 1 for N implementations.

Not forcing userspace to do "tricks" is in my opinion a strong part of 
deciding whether an API belongs in KVM.

>    - What operations does userspace truly need?  E.g. if the only use case is to
>      split/truncate/hole punch an existing memslot, can KVM instead provide a
>      memslot flag and exit reason that allows kicking vCPUs to userspace if the
>      memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
>      but KVM exits with a dedicated exit reason instead of generating MMIO semantics.

The main cases are:

- for the boot case, splitting and merging existing memslots (see the
example after this list).  QEMU likes to merge adjacent memory regions
into a single memslot, so if something goes from read-write to read-only
it has to be split and vice versa.  I guess a "don't merge this memory
region" flag would be the least hideous way to solve it in userspace.

- however, there is also the case of resizing an existing memslot which 
is what David would like to have for virtio-mem.  This is not really 
fixable because part of the appeal of virtio-mem is to have a single 
huge memslot instead of many smaller ones, in order to reduce the 
granularity of add/remove (David, correct me if I'm wrong).

(The latter _would_ be needed by other VMMs).
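
As an example of the boot case, think of BIOS shadowing toggling part of
a merged slot read-only (slot numbers made up):

  before:  slot 3: GPA 0xc0000-0xfffff  RW          (one merged slot)
  after:   slot 3: GPA 0xc0000-0xdffff  RW
           slot 7: GPA 0xe0000-0xfffff  READONLY    (split out)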

> If updates only need to be "atomic" for an address space, does the API allow
> mixing non-SMM and SMM memslots?

I agree that the address space should be moved out of the single entries 
and into the header if we follow through with this approach.
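
I.e. something of this shape (a sketch; field names are illustrative,
not necessarily the RFC's):

struct kvm_userspace_memory_region_list {
	__u32 nent;		/* number of entries that follow */
	__u16 as_id;		/* one address space for the whole batch */
	__u16 padding;
	struct kvm_userspace_memory_region entries[];
};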

> The update needs to be "atomic", i.e. vCPUs
> must never see an invalid/deleted memslot, but if the memslot is writable,
> how does KVM prevent some writes from hitting the old HVA and some from hitting
> the new HVA without a quiescent period?

(Heh, and I forgot likewise that non-x86 does not retry on 
KVM_MEMSLOT_INVALID.  Yes, that would be treated as a bug on other 
architectures).

>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020.  I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it any
>> easier than arbitrary batch updates.
> 
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.

Indeed; I would have thought that it is clear with the batch updates API 
(which requires the update to be split into delete and insert), but 
apparently it's not and it's by no means optimal.

Paolo
Maxim Levitsky Sept. 29, 2022, 3:40 p.m. UTC | #25
On Thu, 2022-09-29 at 17:28 +0200, Paolo Bonzini wrote:
> ...


I 100% agree with everything Paolo said.


Best regards,
	Maxim Levitsky
Paolo Bonzini Sept. 29, 2022, 3:41 p.m. UTC | #26
On 9/29/22 17:18, Sean Christopherson wrote:
> IMO, KVM_MEM_DISABLED or similar is the way to go.  I.e. formalize the "restart
> page faults" semantics without taking on the complexity of batched updates.

If userspace has to collaborate, KVM_MEM_DISABLED (or KVM_MEM_USER_EXIT 
would be a better name) is not needed either except as an optimization; 
you can just kick all CPUs unconditionally.

And in fact KVM_MEM_DISABLED is not particularly easy to implement
either: in order to allow split/merge, it should be possible for a new
memslot to replace multiple disabled memslots (to allow merging) and to
only partially overlap the first/last disabled memslot it replaces.

None of this is allowed for other memslots, so exactly the same metadata 
complications exist as for other options such as wholesale replacement 
or batched updates.  The only semantics with a sane implementation would 
be to destroy the dirty bitmap of disabled memslots when they are 
replaced.  At least it would be possible for userspace to set 
KVM_MEM_DISABLED, issue KVM_GET_DIRTY_LOG and then finally create the 
new memslot.  That would be _correct_, but still not very appealing.

I don't exclude suffering from tunnel vision, but batched updates to me 
still seem to be the least bad option.

Paolo
David Hildenbrand Sept. 29, 2022, 4 p.m. UTC | #27
> The main cases are:
> 
> - for the boot case, splitting and merging existing memslots.  QEMU
> likes to merge adjacent memory regions into a single memslot, so if
> something goes from read-write to read-only it has to be split and vice
> versa.  I guess a "don't merge this memory region" flag would be the
> least hideous way to solve it in userspace.
> 
> - however, there is also the case of resizing an existing memslot which
> is what David would like to have for virtio-mem.  This is not really
> fixable because part of the appeal of virtio-mem is to have a single
> huge memslot instead of many smaller ones, in order to reduce the
> granularity of add/remove (David, correct me if I'm wrong).

Yes, the most important case I am working on in that regard is reducing 
the memslot size/overhead when only exposing comparatively little memory 
towards a VM using virtio-mem (say, a virtio-mem device that could grow 
to 1 TiB, but we initially only expose 1 GiB to the VM).

One approach I prototyped in the past (where my RFC for atomic updates 
came into play because I ran into this issue) to achieve that was 
dynamically growing (and eventually shrinking) a single memslot on demand.

An alternative [1] uses multiple individual memslots, and exposes them 
on demand. There, I have to make sure that QEMU won't merge adjacent 
memslots into a bigger one -- because otherwise, we'd again need atomic 
memslot updates, which KVM, vhost-user, ... don't support. But in 
the future, I think we want to have that: if there is no fragmentation, 
we should only have a single large memslot and all memslot consumers 
should be able to cope with atomic updates.


So in any case, I will have good use for atomic memslot updates. Just 
like other hypervisors that (will) implement virtio-mem or something 
comparable :)

[1] https://lore.kernel.org/all/20211013103330.26869-1-david@redhat.com/T/#u
Sean Christopherson Sept. 29, 2022, 9:39 p.m. UTC | #28
On Thu, Sep 29, 2022, Paolo Bonzini wrote:
> 
> On 9/28/22 22:41, Sean Christopherson wrote:
> If you think we should overhaul it even more than just providing atomic
> batched updates, that's fine.

Or maybe solve the problem(s) with more precision?  What I don't like about the
batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly
becoming atomic.  When I hear "atomic" updates, I think "all transactions are
committed at once".  That is not what this proposed API does, and due to the TLB
flushing requirements, I don't see how it can be implemented.

> But still, the impossibility to perform
> atomic updates in batches *is* a suboptimal part of the KVM API.

I don't disagree, but I honestly don't see the difference between "batching" and
providing a KVM_MEM_USER_EXIT flag.  The batch API will provide better performance,
but I would be surprised if it's significant enough to matter.  I'm not saying
that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in
agreement that handling memslot metadata will suck regardless.  My point is that
it's simpler to implement in KVM, much easier to document, and for all intents and
purposes provides the same desired functionality: vCPUs that access in-flux memslots
are stalled and so userspace doesn't have to pause and rendezvous all vCPUs to
provide "atomic" updates.

> The main cases are:
> 
> - for the boot case, splitting and merging existing memslots.  QEMU likes to
> merge adjacent memory regions into a single memslot, so if something goes
> from read-write to read-only it has to be split and vice versa.  I guess a
> "don't merge this memory region" flag would be the less hideous way to solve
> it in userspace.
> 
> - however, there is also the case of resizing an existing memslot which is
> what David would like to have for virtio-mem.  This is not really fixable
> because part of the appeal of virtio-mem is to have a single huge memslot
> instead of many smaller ones, in order to reduce the granularity of
> add/remove (David, correct me if I'm wrong).
> 
> (The latter _would_ be needed by other VMMs).

If we really want to provide a better experience for userspace, why not provide
more primitives to address those specific use cases?  E.g. fix KVM's historic wart
of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:

  - Merge 2+ memory regions that are contiguous in GPA and HVA
  - Split a memory region into 2+ memory regions
  - Truncate a memory region
  - Grow a memory region

That would probably require more KVM code overall, but each operation would be more
tightly bounded and thus simpler to define.  And I think more precise APIs would
provide other benefits, e.g. growing a region wouldn't require first deleting the
current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
split, and truncate would have similar benefits, though they might be more
difficult to realize in practice.
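
To make that concrete, a rough sketch of what a dedicated resize
primitive could look like (purely illustrative: none of these names or
ioctl numbers exist in the uapi today):

  /*
   * Hypothetical uapi sketch: grow or truncate an existing memslot in
   * place.  The slot keeps its guest_phys_addr and userspace_addr; only
   * the size changes, so growing would not have to zap SPTEs or destroy
   * metadata for the existing range, and shrinking would only zap the
   * truncated tail.
   */
  struct kvm_resize_user_memory_region {
  	__u32 slot;
  	__u32 pad;
  	__u64 new_memory_size;	/* bytes, page aligned */
  };

  /* 0xcf is a placeholder, not a reserved ioctl number */
  #define KVM_RESIZE_USER_MEMORY_REGION \
  	_IOW(KVMIO, 0xcf, struct kvm_resize_user_memory_region)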

What I really don't like about the "batch" API, aside from the "atomic" name, is
that it's too open ended and creates conflicts.  E.g. to allow "atomic" merging
after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that
implies that each operation in the batch operates on the working state created by
the previous operations.

But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting
all "invalidations" to the front half of the "atomic" operations breaks the
ordering.  And there's a ridiculous amount of potential weirdness with MOVE=>MOVE
operations.  
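
To spell out the DELETE=>CREATE=>DELETE case with made-up slots:

  userspace batch:  DELETE slot X -> CREATE slot X' -> DELETE slot X'

  hoisted order:    invalidate X, invalidate X', then commit everything.
                    But X' is not in the active list when the
                    invalidation pass runs, so there is nothing to
                    invalidate; hoisting only works if every DELETE in
                    the batch targets a pre-existing slot.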

KVM could solve those issues by disallowing MOVE and disallowing DELETE after
CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating
on a single contiguous chunk of memslots.  At that point, it seems that providing
primitives to do each macro operation is an even better experience for userspace.

In other words, make memslots a bit more CISC ;-)
Emanuele Giuseppe Esposito Oct. 13, 2022, 7:43 a.m. UTC | #29
Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
> If we really want to provide a better experience for userspace, why not provide
> more primitives to address those specific use cases?  E.g. fix KVM's historic wart
> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
> 
>   - Merge 2+ memory regions that are contiguous in GPA and HVA
>   - Split a memory region into 2+ memory regions
>   - Truncate a memory region
>   - Grow a memory region

I looked again at the code and specifically at the use case that triggers
the crash in the bugzilla. I *think* (correct me if I am wrong) that the
only operation that QEMU performs right now is "grow/shrink".

So *if* we want to go this way, we could start with a simple grow/shrink
API.

Even though we need to consider that this could bring additional
complexity to QEMU. Currently, DELETE+CREATE (grow/shrink) is not
performed one right after the other (in my innocent fantasy I was
expecting to find 2 subsequent ioctls in the code), but happens in QEMU's
address_space_set_flatview(), which seems to first remove all regions
and then add them back when changing flatviews.

address_space_update_topology_pass(as, old_view2, new_view, adding=false);
address_space_update_topology_pass(as, old_view2, new_view, adding=true);

I don't think we can change this, as other listeners also rely on such
ordering, but we can still batch all callback requests like I currently
do and process them in kvm_commit(), figuring out there which operation
is which.

In other words, we would have something like:

region_del() --> DELETE memslot X -> add it to the queue of operations
region_del() --> DELETE memslot Y -> add it to the queue of operations
region_add() --> CREATE memslot X (size doubled) -> add it to the queue
of operations
region_add() --> CREATE memslot Y (size halved) -> add it to the queue
of operations
...
commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
Y (-size) -> issue 2 ioctls, GROW and SHRINK.
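
In code, that commit-stage scan could look roughly like the sketch below
(the types and the kvm_grow_slot()/kvm_shrink_slot() helpers are made up
for illustration, QEMU's real listener state looks different):

  #include <stdint.h>

  enum req_kind { REQ_DEL, REQ_ADD, REQ_DONE };

  struct slot_req {
      enum req_kind kind;
      uint64_t gpa;              /* guest-physical base of the region */
      uint64_t size;
      struct slot_req *next;
  };

  /* Hypothetical wrappers around an equally hypothetical resize API. */
  void kvm_grow_slot(uint64_t gpa, uint64_t new_size);
  void kvm_shrink_slot(uint64_t gpa, uint64_t new_size);

  /*
   * Pair each queued DELETE with a CREATE at the same base address and
   * turn the pair into a single grow/shrink request; anything left
   * unpaired goes through the usual DELETE/CREATE ioctls.
   */
  static void kvm_commit_scan(struct slot_req *queue)
  {
      for (struct slot_req *del = queue; del; del = del->next) {
          if (del->kind != REQ_DEL)
              continue;
          for (struct slot_req *add = queue; add; add = add->next) {
              if (add->kind != REQ_ADD || add->gpa != del->gpa)
                  continue;
              if (add->size == del->size)
                  continue;    /* not a resize, leave to the normal path */
              if (add->size > del->size)
                  kvm_grow_slot(del->gpa, add->size);
              else
                  kvm_shrink_slot(del->gpa, add->size);
              del->kind = add->kind = REQ_DONE;
              break;
          }
      }
  }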

> That would probably require more KVM code overall, but each operation would be more
> tightly bounded and thus simpler to define.  And I think more precise APIs would
> provide other benefits, e.g. growing a region wouldn't require first deleting the
> current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
> split, and truncate would have similar benefits, though they might be more
> difficult to realize in practice.

So essentially grow would not require INVALIDATE. Makes sense, but would
it work also with shrink? I guess so, as the memslot is still present
(but shrunk), right?

Paolo, would you be ok with this smaller API? Probably just starting
with grow and shrink first.

I am not against either of the two approaches:
- my approach has the disadvantage that the list could be arbitrarily
long, and it is difficult to roll back the intermediate changes if
something breaks during the request processing (but this could be
simplified by making kvm exit or crash).

- Sean's approach could potentially put more burden on userspace,
as we need to figure out which operation is which. Also from my
understanding split and merge won't be really straightforward to
implement, especially in userspace.

David, any concern from a userspace perspective on this "CISC" approach?

Thank you,
Emanuele
David Hildenbrand Oct. 13, 2022, 8:44 a.m. UTC | #30
On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>> If we really want to provide a better experience for userspace, why not provide
>> more primitives to address those specific use cases?  E.g. fix KVM's historic wart
>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>>
>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>    - Split a memory region into 2+ memory regions
>>    - Truncate a memory region
>>    - Grow a memory region
> 
> I looked again at the code and specifically the use case that triggers
> the crash in bugzilla. I *think* (correct me if I am wrong), that the
> only operation that QEMU performs right now is "grow/shrink".

I remember that there were BUG reports where we'd actually split and run 
into that problem. Just don't have them at hand. I think they happened 
during early boot when the OS re-configured some PCI thingies.

> 
> So *if* we want to go this way, we could start with a simple grow/shrink
> API.
> 
> Even though we need to consider that this could bring additional
> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
> performed one after the other (in my innocent fantasy I was expecting to
> find 2 subsequent ioctls in the code), but in QEMU's
> address_space_set_flatview(), which seems to first remove all regions
> and then add them when changing flatviews.
> 
> address_space_update_topology_pass(as, old_view2, new_view, adding=false);
> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
> 
> I don't think we can change this, as other listeners also rely on such
> ordering, but we can still batch all callback requests like I currently
> do and process them in kvm_commit(), figuring out there which operation
> is which.
> 
> In other words, we would have something like:
> 
> region_del() --> DELETE memslot X -> add it to the queue of operations
> region_del() --> DELETE memslot Y -> add it to the queue of operations
> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
> of operations
> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
> of operations
> ...
> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
> Y (-size) -> issue 2 ioctls, GROW and SHRINK.

I communicated resizes (region_resize()) to the notifier in patch #3 of 
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

You could use that independently of the remainder. It should be less 
controversial ;)

But I think it only tackles part of the more generic problem (split/merge).
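
The idea there is one listener callback for an in-place resize instead
of a region_del()+region_add() pair. On the KVM side that could then
look something like this sketch (the signature is guessed from the
description, see the linked series for the real interface, and
kvm_set_phys_mem_resize() is invented here):

  static void kvm_region_resize(MemoryListener *listener,
                                MemoryRegionSection *section,
                                Int128 new_size)
  {
      KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
                                            listener);

      /*
       * One place to decide between a grow and a shrink ioctl, with no
       * window where the slot is deleted but not yet re-created.
       */
      kvm_set_phys_mem_resize(kml, section, new_size);
  }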

> 
>> That would probably require more KVM code overall, but each operation would be more
>> tightly bounded and thus simpler to define.  And I think more precise APIs would
>> provide other benefits, e.g. growing a region wouldn't require first deleting the
>> current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
>> split, and truncate would have similar benefits, though they might be more
>> difficult to realize in practice.
> 
> So essentially grow would not require INVALIDATE. Makes sense, but would
> it work also with shrink? I guess so, as the memslot is still present
> (but shrunk), right?
> 
> Paolo, would you be ok with this smaller API? Probably just starting
> with grow and shrink first.
> 
> I am not against any of the two approaches:
> - my approach has the disadvantage that the list could be arbitrarily
> long, and it is difficult to rollback the intermediate changes if
> something breaks during the request processing (but could be simplified
> by making kvm exit or crash).
> 
> - Sean's approach could potentially put more burden on userspace,
> as we need to figure out which operation is which. Also from my
> understanding split and merge won't be really straightforward to
> implement, especially in userspace.
> 
> David, any concern from a userspace perspective on this "CISC" approach?

In contrast to resizes in QEMU that only affect a single memory 
region/slot, splitting/merging is harder to factor out and communicate 
to a notifier. As an alternative, we could handle it in the commit stage 
in the notifier itself, similar to what my prototype does, and figure 
out what needs to be done in there and how to call the proper KVM 
interface (and which interface to call).

With virtio-mem (in the future) we might see merges of 2 slots into a 
single one, by closing a gap in-between them. In "theory" we could 
combine some updates into a single transaction. But it won't be 100s ...

I think I'd prefer an API that doesn't require excessive ad-hoc 
extensions later once something in QEMU changes.


I think in essence, all we care about is performing atomic changes that 
*have to be atomic*, because something we add during a transaction 
overlaps with something we remove during a transaction. Not necessarily 
all updates in a transaction!

My prototype essentially detects that scenario, but doesn't call new KVM 
interfaces to deal with these.

I assume once we take that into consideration, we can mostly assume that 
any such atomic updates (only of the regions that really have to be part 
of an atomic update) won't involve more than a handful of memory 
regions. We could add a sane KVM API limit.

And if we run into that API limit in QEMU, we can print a warning and do 
it non-atomically.
Emanuele Giuseppe Esposito Oct. 13, 2022, 11:12 a.m. UTC | #31
Am 13/10/2022 um 10:44 schrieb David Hildenbrand:
> On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>>> If we really want to provide a better experience for userspace, why
>>> not provide
>>> more primitives to address those specific use cases?  E.g. fix KVM's
>>> historic wart
>>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more
>>> ioctls to:
>>>
>>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>>    - Split a memory region into 2+ memory regions
>>>    - Truncate a memory region
>>>    - Grow a memory region
>>
>> I looked again at the code and specifically the use case that triggers
>> the crash in bugzilla. I *think* (correct me if I am wrong), that the
>> only operation that QEMU performs right now is "grow/shrink".
> 
> I remember that there were BUG reports where we'd actually split and run
> into that problem. Just don't have them at hand. I think they happened
> during early boot when the OS re-configured some PCI thingies.

If you could point me to where this is happening, it would be nice. So far
I could not find or see any split/merge operation.

> 
>>
>> So *if* we want to go this way, we could start with a simple grow/shrink
>> API.
>>
>> Even though we need to consider that this could bring additional
>> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
>> performed one after the other (in my innocent fantasy I was expecting to
>> find 2 subsequent ioctls in the code), but in QEMU's
>> address_space_set_flatview(), which seems to first remove all regions
>> and then add them when changing flatviews.
>>
>> address_space_update_topology_pass(as, old_view2, new_view,
>> adding=false);
>> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>>
>> I don't think we can change this, as other listeners also rely on such
>> ordering, but we can still batch all callback requests like I currently
>> do and process them in kvm_commit(), figuring out there which operation
>> is which.
>>
>> In other words, we would have something like:
>>
>> region_del() --> DELETE memslot X -> add it to the queue of operations
>> region_del() --> DELETE memslot Y -> add it to the queue of operations
>> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
>> of operations
>> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
>> of operations
>> ...
>> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
>> Y (-size) -> issue 2 ioctls, GROW and SHRINK.
> 
> I communicated resizes (region_resize()) to the notifier in patch #3 of
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
> 
> You could use that independently of the remainder. It should be less
> controversial ;)
> 
> But I think it only tackles part of the more generic problem (split/merge).

Yes, very useful! Using that patch we would have a single place to
issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since
the operation will be atomic (this time in the true meaning of the word,
since we don't need INVALIDATE).

> 
>>
>>> That would probably require more KVM code overall, but each operation
>>> would be more
>>> tightly bounded and thus simpler to define.  And I think more precise
>>> APIs would
>>> provide other benefits, e.g. growing a region wouldn't require first
>>> deleting the
>>> current region, and so could avoid zapping SPTEs and destroying
>>> metadata.  Merge,
>>> split, and truncate would have similar benefits, though they might be
>>> more
>>> difficult to realize in practice.
>>
>> So essentially grow would not require INVALIDATE. Makes sense, but would
>> it work also with shrink? I guess so, as the memslot is still present
>> (but shrunk), right?
>>
>> Paolo, would you be ok with this smaller API? Probably just starting
>> with grow and shrink first.
>>
>> I am not against any of the two approaches:
>> - my approach has the disadvantage that the list could be arbitrarily
>> long, and it is difficult to rollback the intermediate changes if
>> something breaks during the request processing (but could be simplified
>> by making kvm exit or crash).
>>
>> - Sean's approach could potentially put more burden on userspace,
>> as we need to figure out which operation is which. Also from my
>> understanding split and merge won't be really straightforward to
>> implement, especially in userspace.
>>
>> David, any concern from a userspace perspective on this "CISC" approach?
> 
> In contrast to resizes in QEMU that only affect a single memory
> region/slot, splitting/merging is harder to factor out and communicate
> to a notifier. As an alternative, we could handle it in the commit stage
> in the notifier itself, similar to what my prototype does, and figure
> out what needs to be done in there and how to call the proper KVM
> interface (and which interface to call).
> 
> With virtio-mem (in the future) we might see merges of 2 slots into a
> single one, by closing a gap in-between them. In "theory" we could
> combine some updates into a single transaction. But it won't be 100s ...
> 
> I think I'd prefer an API that doesn't require excessive ad-hoc
> extensions later once something in QEMU changes.
> 
> 
> I think in essence, all we care about is performing atomic changes that
> *have to be atomic*, because something we add during a transaction
> overlaps with something we remove during a transaction. Not necessarily
> all updates in a transaction!
> 
> My prototype essentially detects that scenario, but doesn't call new KVM
> interfaces to deal with these.

With "prototype" I assume you mean the patch linked above
(region_resize), not the userspace-only proposal you sent initially, right?

> 
> I assume once we take that into consideration, we can mostly assume that
> any such atomic updates (only of the regions that really have to be part
> of an atomic update) won't involve more than a handful of memory
> regions. We could add a sane KVM API limit.
> 
> And if we run into that API limit in QEMU, we can print a warning and do
> it non-atomically.
> 
If we implement single operations (split/merge/grow/shrink), we don't
even need that limit. Except for merge, maybe.

Ok, if it's ok for you all I can try to use David's patch and implement
some simple grow/shrink. Then we need to figure out where and when
exactly QEMU performs split and merge operations, and maybe implement
something similar to what you did in your proposal?

Thank you,
Emanuele
David Hildenbrand Oct. 13, 2022, 2:45 p.m. UTC | #32
>> I remember that there were BUG reports where we'd actually split and run
>> into that problem. Just don't have them at hand. I think they happened
>> during early boot when the OS re-configured some PCI thingies.
> 
> If you could point me to where this is happening, it would be nice. So far
> I could not find or see any split/merge operation.

I remember some bugzilla, but it might be hard to find. Essentially, it 
happened so early during boot that it wasn't really a problem so far 
(only a single CPU was running when it triggered, IIRC).

But I might be messing up some details.

>>
>>>
>>>> That would probably require more KVM code overall, but each operation
>>>> would be more
>>>> tightly bounded and thus simpler to define.  And I think more precise
>>>> APIs would
>>>> provide other benefits, e.g. growing a region wouldn't require first
>>>> deleting the
>>>> current region, and so could avoid zapping SPTEs and destroying
>>>> metadata.  Merge,
>>>> split, and truncate would have similar benefits, though they might be
>>>> more
>>>> difficult to realize in practice.
>>>
>>> So essentially grow would not require INVALIDATE. Makes sense, but would
>>> it work also with shrink? I guess so, as the memslot is still present
>>> (but shrunk), right?
>>>
>>> Paolo, would you be ok with this smaller API? Probably just starting
>>> with grow and shrink first.
>>>
>>> I am not against any of the two approaches:
>>> - my approach has the disadvantage that the list could be arbitrarily
>>> long, and it is difficult to rollback the intermediate changes if
>>> something breaks during the request processing (but could be simplified
>>> by making kvm exit or crash).
>>>
>>> - Sean's approach could potentially put more burden on userspace,
>>> as we need to figure out which operation is which. Also from my
>>> understanding split and merge won't be really straightforward to
>>> implement, especially in userspace.
>>>
>>> David, any concern from a userspace perspective on this "CISC" approach?
>>
>> In contrast to resizes in QEMU that only affect a single memory
>> region/slot, splitting/merging is harder to factor out and communicate
>> to a notifier. As an alternative, we could handle it in the commit stage
>> in the notifier itself, similar to what my prototype does, and figure
>> out what needs to be done in there and how to call the proper KVM
>> interface (and which interface to call).
>>
>> With virtio-mem (in the future) we might see merges of 2 slots into a
>> single one, by closing a gap in-between them. In "theory" we could
>> combine some updates into a single transaction. But it won't be 100s ...
>>
>> I think I'd prefer an API that doesn't require excessive ad-hoc
>> extensions later once something in QEMU changes.
>>
>>
>> I think in essence, all we care about is performing atomic changes that
>> *have to be atomic*, because something we add during a transaction
>> overlaps with something we remove during a transaction. Not necessarily
>> all updates in a transaction!
>>
>> My prototype essentially detects that scenario, but doesn't call new KVM
>> interfaces to deal with these.
> 
> With "prototype" I assume you mean the patch linked above
> (region_resize), not the userspace-only proposal you sent initially, right?

I meant from the userspace-only proposal the part where we collect all 
add/remove callbacks in a list, and in the commit stage try to detect if 
there is an overlap. That way we could isolate the memblocks that really 
only need an atomic operation. Anything that doesn't overlap can just go 
via the old interface.
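
The overlap check itself doesn't need to be fancy; a sketch with made-up
types (not the prototype's actual code):

  #include <stdbool.h>
  #include <stdint.h>

  struct range {
      uint64_t start;
      uint64_t size;
  };

  static bool ranges_overlap(struct range a, struct range b)
  {
      return a.start < b.start + b.size && b.start < a.start + a.size;
  }

  /*
   * A queued add that overlaps any queued remove must be handled as
   * part of an atomic transaction; everything else can go through the
   * old one-slot KVM_SET_USER_MEMORY_REGION path.
   */
  static bool needs_atomic_update(struct range add,
                                  const struct range *removes, int n)
  {
      for (int i = 0; i < n; i++) {
          if (ranges_overlap(add, removes[i]))
              return true;
      }
      return false;
  }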

Not saying that it wins a beauty prize, but we wouldn't have to adjust 
QEMU core even more to teach it these special cases (but maybe we should, 
and other notifiers might benefit from that in the long term).

> 
>>
>> I assume once we take that into consideration, we can mostly assume that
>> any such atomic updates (only of the regions that really have to be part
>> of an atomic update) won't involve more than a handful of memory
>> regions. We could add a sane KVM API limit.
>>
>> And if we run into that API limit in QEMU, we can print a warning and do
>> it non-atomically.
>>
> If we implement single operations (split/merge/grow/shrink), we don't
> even need that limit. Except for merge, maybe.

Right, in theory you could have multiple splits/merges in a single 
commit that all interact.

Like cutting 8 pieces out of an existing larger memblock.

> 
> Ok, if it's ok for you all I can try to use David's patch and implement
> some simple grow/shrink. Then we need to figure out where and when
> exactly QEMU performs split and merge operations, and maybe implement
> something similar to what you did in your proposal?

Looking at grow/shrink first is certainly not a waste of time.