Message ID | 20220909104506.738478-1-eesposit@redhat.com (mailing list archive)
---|---
Series | kvm: implement atomic memslot updates
On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> KVM is currently capable of receiving a single memslot update through
> the KVM_SET_USER_MEMORY_REGION ioctl.
> The problem arises when we want to atomically perform multiple updates,
> so that readers of the memslot active list avoid seeing incomplete states.
>
> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276

I don't have access.  Can you provide a TL;DR?

> we see how non-atomic updates cause boot failure, because vcpus
> will see a partial update (old memslot deleted, new one not yet created)
> and will crash.

Why not simply pause vCPUs in this scenario?  This is an awful lot of
complexity to take on for something that appears to be solvable in
userspace.

And if the issue is related to KVM disallowing the toggling of read-only
(can't see the bug), we can likely solve that without needing a new ioctl()
that allows userspace to batch an arbitrary number of updates.
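[For illustration, a batched-update ABI of the kind being discussed might look
roughly like the sketch below.  The struct and ioctl names are hypothetical
and not necessarily the ABI this series actually proposes; only
KVM_SET_USER_MEMORY_REGION and struct kvm_userspace_memory_region are taken
from the existing KVM API.]

	/* Hypothetical shape of an ioctl that batches memslot updates so
	 * they commit atomically; names and layout are illustrative only.
	 */
	struct kvm_userspace_memory_region_list {
		__u32 nent;                                   /* number of entries */
		__u32 flags;                                  /* reserved for future use */
		struct kvm_userspace_memory_region entries[]; /* applied as one transaction */
	};

	/*
	 * Usage: userspace fills entries[0] with the old slot (memory_size = 0,
	 * i.e. delete) and entries[1] with its replacement, then commits both
	 * in one call, so no vCPU ever observes the intermediate "slot
	 * deleted" state:
	 *
	 *	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
	 */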
Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>> [...]
>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>
> I don't have access.  Can you provide a TL;DR?

You should be able to have access to it now.

>> we see how non-atomic updates cause boot failure, because vcpus
>> will see a partial update (old memslot deleted, new one not yet created)
>> and will crash.
>
> Why not simply pause vCPUs in this scenario?  This is an awful lot of
> complexity to take on for something that appears to be solvable in
> userspace.

I think it is not that easy to solve in userspace: see
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."

Probably @Paolo and @Maxim can add more to this.

Thank you,
Emanuele
On 19/9/2022 12:13 am, Emanuele Giuseppe Esposito wrote:
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> [...]
>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276

Oh, thanks for stepping up to try to address it.  As it turns out, this
issue was discovered "long" before:
https://bugzilla.kernel.org/show_bug.cgi?id=213781

As a comment, relevant selftests are necessary and required.
On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> [...]
>> Why not simply pause vCPUs in this scenario?  This is an awful lot of
>> complexity to take on for something that appears to be solvable in
>> userspace.
>
> I think it is not that easy to solve in userspace: see
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>
> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
> temporarily drop the BQL - something most callers can't handle (esp.
> when called from vcpu context e.g., in virtio code)."

Can you please comment on the bigger picture?  The patch from me works
around *exactly that*, and for that reason, contains that comment.
On 19.09.22 09:53, David Hildenbrand wrote:
> [...]
> Can you please comment on the bigger picture?  The patch from me works
> around *exactly that*, and for that reason, contains that comment.

FWIW, I hacked up my RFC to perform atomic updates on any memslot
transactions (not just resizes) where ranges to add overlap with ranges
to remove:

https://github.com/davidhildenbrand/qemu/tree/memslot

I only performed a simple boot check under x86-64 (where I can see region
resizes) and some make checks -- pretty sure it has some rough edges, but
it should indicate what's possible and what the possible price might be.
[one could wire up a new KVM ioctl and call it conditionally on support
if really required]
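[The general shape of the userspace-only idea is sketched below under stated
assumptions.  This is NOT the code from the branch above:
pause_all_vcpus()/resume_all_vcpus() are real QEMU helpers, but, as quoted
earlier in the thread, they cannot be called from vCPU context, which is
exactly why the branch uses a more careful per-vCPU mechanism instead.]

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Minimal sketch: make the delete + create pair invisible to vCPUs
	 * by pausing them around the two KVM_SET_USER_MEMORY_REGION calls.
	 */
	static void memslot_replace_atomic(int vm_fd,
	                                   struct kvm_userspace_memory_region *del,
	                                   struct kvm_userspace_memory_region *add)
	{
		pause_all_vcpus();                             /* no vCPU can fault on the gap */

		del->memory_size = 0;                          /* size 0 == delete slot */
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, del);
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, add); /* install replacement */

		resume_all_vcpus();                            /* vCPUs re-enter KVM_RUN */
	}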
Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
> [...]
> FWIW, I hacked up my RFC to perform atomic updates on any memslot
> transactions (not just resizes) where ranges to add overlap with ranges
> to remove:
>
> https://github.com/davidhildenbrand/qemu/tree/memslot

A benefit of my ioctl implementation is that it could also be used by
other hypervisors, which then do not need to share this kind of "hack".
However, after also talking with Maxim and Paolo, we all agreed that the
main disadvantage of your approach is that it does not scale with the
number of vcpus.  Stopping *all* vcpus just to allow a memslot update is
also inferior (KVM only pauses vCPUs that access the modified memslots
instead).

So I took some measurements, to see what the performance difference is
between my implementation and yours.  I used a machine where I could
replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
Processor with kernel 5.19.0 (+ my patches).

Then I measured the time QEMU spends in kvm_commit (ie in memslot
updates) while booting a VM.  In other words, if kvm_commit takes 10 ms
and QEMU calls it 20 times, "time to boot" is 200 ms.  kvm_commit is not
called anymore after boot, so this measurement is easy to compare over
multiple invocations of QEMU.

I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 32.
The QEMU command is the same one used to replicate the bug:
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none >> ~/smp_$v;

Each boot is reproduced 100 times, and then from the results I compute
average and stddev (in milliseconds).

ioctl:
-smp 1:  Average: 2.1ms   Stdev: 0.8ms
-smp 2:  Average: 2.5ms   Stdev: 1.5ms
-smp 4:  Average: 2.2ms   Stdev: 1.1ms
-smp 8:  Average: 2.4ms   Stdev: 0.7ms
-smp 16: Average: 3.6ms   Stdev: 2.4ms  (1000 repetitions)
-smp 24: Average: 12.5ms  Stdev: 0.9ms  (1000 repetitions)

pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
-smp 1:  Average: 2.2ms   Stdev: 1.2ms
-smp 2:  Average: 3.0ms   Stdev: 1.4ms
-smp 4:  Average: 3.1ms   Stdev: 1.3ms
-smp 8:  Average: 3.4ms   Stdev: 1.4ms
-smp 16: Average: 12ms    Stdev: 7.0ms  (1000 repetitions)
-smp 24: Average: 20ms    Stdev: 7.3ms  (1000 repetitions)

Above 24 vCPUs performance gets worse quickly, but I think it's already
quite clear that the results for the ioctl scale better as the number of
vcpus increases, while pausing the vCPUs becomes slower already with 16
vcpus.

Thank you,
Emanuele
On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> [...]
> ioctl:
> -smp 16: Average: 3.6ms   Stdev: 2.4ms  (1000 repetitions)
> -smp 24: Average: 12.5ms  Stdev: 0.9ms  (1000 repetitions)
>
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 16: Average: 12ms    Stdev: 7.0ms  (1000 repetitions)
> -smp 24: Average: 20ms    Stdev: 7.3ms  (1000 repetitions)
>
> Above 24 vCPUs performance gets worse quickly, but I think it's already
> quite clear that the results for the ioctl scale better as the number of
> vcpus increases, while pausing the vCPUs becomes slower already with 16
> vcpus.

Right, the question is if it happens sufficiently often that we even
care, and if there are no ways to mitigate.

It doesn't necessarily have to scale with the #VCPUs I think.  What
should dominate the overall time in theory is how long it takes for one
VCPU (the slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all
VCPUs out of KVM instead of doing it per VCPU.  That might already speed
up things heavily.

2) the biggest overhead is actually taking the locks in QEMU and not
actually waiting for the VCPUs.  Maybe we could optimize that as well.
(for now I use one lock per VCPU because it felt like it would reduce
the ioctl overhead; maybe there is a better alternative to balance
between both users)

So treat my patch as a completely unoptimized version.
Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> [...]
> I wondered if
>
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU.  That might already speed
> up things heavily.

So if I understand correctly, this implies creating a new ioctl in KVM
anyways?  What would then be the difference with what I do?  We are
affecting the kernel anyways.

> 2) the biggest overhead is actually taking the locks in QEMU and not
> actually waiting for the VCPUs.  Maybe we could optimize that as well.
> (for now I use one lock per VCPU because it felt like it would reduce
> the ioctl overhead; maybe there is a better alternative to balance
> between both users)
>
> So treat my patch as a completely unoptimized version.

For what it's worth, my version also performs #invalidate+1 swaps, which
is not optimized either.

Honestly, I don't see how the above is easier or simpler than what is
being proposed here.

Thank you,
Emanuele
On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> [...]
> So if I understand correctly, this implies creating a new ioctl in KVM
> anyways?  What would then be the difference with what I do?  We are
> affecting the kernel anyways.

If that turns out to be the actual bottleneck.  I haven't analyzed what
exactly is causing the delay with increasing # VCPUs, so I'm just
guessing.  Right now, we primarily use pthread_kill(SIG_IPI) to send a
signal to each VCPU thread.

> For what it's worth, my version also performs #invalidate+1 swaps, which
> is not optimized either.
>
> Honestly, I don't see how the above is easier or simpler than what is
> being proposed here.

As Sean said "This is an awful lot of complexity to take on for
something that appears to be solvable in userspace."

I showed that it's possible.  So from that point on, there has to be good
justification why that complexity has to exist in kernel space to
optimize this scenario.

Now, it's not my call to make.  I'm just pointing out that it can be done
and that there might be room for improvement in my quick prototype.
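[For context, the per-vCPU kick David refers to follows roughly the pattern
below.  This is a schematic sketch, not QEMU's exact code (the real logic
lives around qemu_cpu_kick_thread()); CPU_FOREACH, CPUState and QemuThread
are real QEMU constructs, while the function name and the use of cpu->stop
here are assumptions for illustration.]

	/* Schematic: one signal per vCPU thread, so the cost of kicking
	 * everyone out of KVM_RUN grows linearly with the number of vCPUs.
	 */
	static void kick_all_vcpus_out_of_kvm_run(void)
	{
	    CPUState *cpu;

	    CPU_FOREACH(cpu) {
	        cpu->stop = true;                            /* request a pause */
	        pthread_kill(cpu->thread->thread, SIG_IPI);  /* interrupt KVM_RUN */
	    }
	    /* ...then wait for every vCPU to acknowledge, one by one. */
	}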
On Mon, Sep 26, 2022, David Hildenbrand wrote:
> [...]
> As Sean said "This is an awful lot of complexity to take on for
> something that appears to be solvable in userspace."

And if the userspace solution is unpalatable for whatever reason, I'd like to
understand exactly what KVM behavior is problematic for userspace.  E.g. the
above RHBZ bug should no longer be an issue as the buggy commit has since been
reverted.

If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
the race _if_ userspace chooses not to pause vCPUs.
Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> [...]
> And if the userspace solution is unpalatable for whatever reason, I'd like to
> understand exactly what KVM behavior is problematic for userspace.  E.g. the
> above RHBZ bug should no longer be an issue as the buggy commit has since been
> reverted.

It still is, because I can reproduce the bug, as also pointed out in
multiple comments below.

> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> the race _if_ userspace chooses not to pause vCPUs.

Also on the BZ they all seem (Paolo included) to agree that the issue is
the non-atomic memslot update.

To be more precise, what I did mostly follows what Peter explained in
comment 19:
https://bugzilla.redhat.com/show_bug.cgi?id=1979276#c19
On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> > [...]
> > E.g. the above RHBZ bug should no longer be an issue as the buggy commit
> > has since been reverted.
>
> It still is, because I can reproduce the bug, as also pointed out in
> multiple comments below.

You can reproduce _a_ bug, but it's obviously not the original bug, because the
last comment says:

  Second, indeed the patch was reverted and somehow accepted without generating
  too much noise:

  ...

  The underlying issue of course as we both know is still there.

  You might have luck reproducing it with this bug

  https://bugzilla.redhat.com/show_bug.cgi?id=1855298

  But for me it looks like it is 'working' as well, so you might have
  to write a unit test to trigger the issue.

> > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > the race _if_ userspace chooses not to pause vCPUs.
>
> Also on the BZ they all seem (Paolo included) to agree that the issue is
> the non-atomic memslot update.

Yes, a non-atomic memslot update likely results in the guest fetching from a
GPA without a memslot.  I'm asking for an explanation of exactly what happens
when that occurs, because it should be possible to adjust KVM and/or QEMU to
play nice with the fetch, e.g. to resume the guest until the new memslot is
installed, in which case an atomic update isn't needed.

I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
the MMIO code fetch.

I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
otherwise exit with mmio.len==0.

Compile tested only...

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 27 Sep 2022 08:16:03 -0700
Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
 MMIO fetch

Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
a memslot.  If userspace is manipulating memslots without pausing vCPUs,
e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
valid memslot installed.  Depending on guest context, KVM will either
exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
resume the guest (L2 or CPL>0), neither of which is desirable as exiting
with "emulation error" effectively kills the VM, and resuming the guest
doesn't provide userspace an opportunity to react to the fetch.

Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
there is no other way to communicate "fetch" to userspace without
defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
and not a flag, and there is no known use case for actually supporting
code fetches from MMIO, i.e. there's no need to allow userspace to fill
in the instruction bytes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/emulate.c     | 2 ++
 arch/x86/kvm/kvm_emulate.h | 1 +
 arch/x86/kvm/x86.c         | 9 ++++++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f092c54d1a2f..e141238d93b0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
 done:
 	if (rc == X86EMUL_PROPAGATE_FAULT)
 		ctxt->have_exception = true;
+	if (rc == X86EMUL_IO_NEEDED)
+		return EMULATION_IO_FETCH;
 	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }

diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 89246446d6aa..3cb2e321fcd2 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_IO_FETCH 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa5ab0c620de..7eb72694c601 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 		bytes = (unsigned)PAGE_SIZE - offset;
 		ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
 					       offset, bytes);
-		if (unlikely(ret < 0))
+		if (unlikely(ret < 0)) {
+			vcpu->run->mmio.phys_addr = gpa;
+			vcpu->run->mmio.len = 0;
+			vcpu->run->mmio.is_write = 0;
+			vcpu->run->exit_reason = KVM_EXIT_MMIO;
 			return X86EMUL_IO_NEEDED;
+		}

 	return X86EMUL_CONTINUE;
 }
@@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	r = x86_decode_emulated_instruction(vcpu, emulation_type,
 					    insn, insn_len);
 	if (r != EMULATION_OK)  {
+		if (r == EMULATION_IO_FETCH)
+			return 0;
 		if ((emulation_type & EMULTYPE_TRAP_UD) ||
 		    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
 			kvm_queue_exception(vcpu, UD_VECTOR);

base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
--
Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
> [...]
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.
>
> Compile tested only...
>
> [patch snipped]

So basically you are just making KVM catch the failed
kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
basically ends up doing nothing and retries executing the instruction?

I wonder if there are some performance implications in this, but it's
definitely simpler than what I did.

Tested on the same failing machine used for the BZ, fixes the bug.

Do you want me to re-send the patch on your behalf (and probably add a
small documentation note in Documentation/virt/kvm/api.rst)?

Emanuele
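[For reference, the retry behavior Emanuele describes corresponds to a vCPU
run loop along the lines sketched below.  This is a sketch of a raw-KVM-API
consumer, not QEMU's actual kvm_cpu_exec(); handle_mmio() is a hypothetical
dispatch helper introduced only for illustration.]

	#include <err.h>
	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	extern void handle_mmio(struct kvm_run *run);  /* hypothetical MMIO dispatch */

	/* Sketch of a run loop that tolerates the zero-length MMIO exit:
	 * mmio.len == 0 means "code fetch raced a memslot update", so the
	 * right response is to re-enter the guest and retry the fetch.
	 */
	static void run_vcpu(int vcpu_fd, struct kvm_run *run)
	{
	    for (;;) {
	        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
	            err(1, "KVM_RUN");

	        switch (run->exit_reason) {
	        case KVM_EXIT_MMIO:
	            if (run->mmio.len == 0)
	                continue;        /* retry: new memslot will be visible */
	            handle_mmio(run);    /* normal MMIO read/write dispatch */
	            break;
	        case KVM_EXIT_INTERNAL_ERROR:
	            errx(1, "KVM internal error");  /* effectively kills the VM today */
	        default:
	            break;
	        }
	    }
	}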
On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
> [...]
> So basically you are just making KVM catch the failed
> kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
> basically ends up doing nothing and retries executing the instruction?
>
> I wonder if there are some performance implications in this, but it's
> definitely simpler than what I did.
>
> Tested on the same failing machine used for the BZ, fixes the bug.
>
> Do you want me to re-send the patch on your behalf (and probably add a
> small documentation note in Documentation/virt/kvm/api.rst)?

Note that AFAIK there is another case (and probably more): if TDP is
disabled and the MMU root is in mmio, we kill the guest.

mmu_alloc_shadow_roots -> mmu_check_root

I used to have a few hacks in KVM to cope with this, but AFAIK I gave up
on it, because the issue would show up again and again.

Best regards,
	Maxim Levitsky
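[For context, the check Maxim points at looks roughly like the snippet below,
paraphrased from memory of arch/x86/kvm/mmu code; the exact code differs
across kernel versions.  The point is that with shadow paging, a guest CR3
pointing at a GPA with no memslot results in an injected triple fault, so a
transiently missing memslot kills the guest.]

	/* Paraphrased sketch, not an exact quote of any particular kernel. */
	static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
	{
		int ret = 0;

		if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
			/* Root GPA has no memslot: guest takes a triple fault. */
			kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
			ret = 1;
		}

		return ret;
	}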
On 28.09.22 13:14, Maxim Levitsky wrote: > On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote: >> >> Am 27/09/2022 um 17:58 schrieb Sean Christopherson: >>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote: >>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson: >>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote: >>>>>> As Sean said "This is an awful lot of a complexity to take on for something >>>>>> that appears to be solvable in userspace." >>>>> >>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to >>>>> understand exactly what KVM behavior is problematic for userspace. E.g. the >>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been >>>>> reverted. >>>> >>>> It still is because I can reproduce the bug, as also pointed out in >>>> multiple comments below. >>> >>> You can reproduce _a_ bug, but it's obviously not the original bug, because the >>> last comment says: >>> >>> Second, indeed the patch was reverted and somehow accepted without generating >>> too much noise: >>> >>> ... >>> >>> The underlying issue of course as we both know is still there. >>> >>> You might have luck reproducing it with this bug >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1855298 >>> >>> But for me it looks like it is 'working' as well, so you might have >>> to write a unit test to trigger the issue. >>> >>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd >>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle >>>>> the race _if_ userspace chooses not to pause vCPUs. >>>>> >>>> >>>> Also on the BZ they all seem (Paolo included) to agree that the issue is >>>> non-atomic memslots update. >>> >>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a >>> memslot. I'm asking for an explanation of exactly what happens when that occurs, >>> because it should be possible to adjust KVM and/or QEMU to play nice with the >>> fetch, e.g. to resume the guest until the new memslot is installed, in which case >>> an atomic update isn't needed. >>> >>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the >>> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct, >>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead >>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to >>> the MMIO code fetch. >>> >>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes >>> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's >>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't >>> otherwise exit with mmio.len==0. >>> >>> Compile tested only... >> >> So basically you are just making KVM catch the failed >> kvm_vcpu_read_guest_page() by retuning mmio.len = 0 to QEMU which >> basically ends up in doing nothing and retry again executing the >> instruction? >> >> I wonder if there are some performance implications in this, but it's >> definitely simpler than what I did. >> >> Tested on the same failing machine used for the BZ, fixes the bug. >> >> Do you want me to re-send the patch on your behalf (and add probably a >> small documentation on Documentation/virt/kvm/api.rst)? 
>> >> Emanuele

>>> ---
>>> From: Sean Christopherson <seanjc@google.com>
>>> Date: Tue, 27 Sep 2022 08:16:03 -0700
>>> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>>>  MMIO fetch
>>>
>>> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
>>> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
>>> a memslot.  If userspace is manipulating memslots without pausing vCPUs,
>>> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
>>> valid memslot installed.  Depending on guest context, KVM will either
>>> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
>>> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
>>> with "emulation error" effectively kills the VM, and resuming the guest
>>> doesn't provide userspace an opportunity to react to the fetch.
>>>
>>> Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
>>> there is no other way to communicate "fetch" to userspace without
>>> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
>>> and not a flag, and there is no known use case for actually supporting
>>> code fetches from MMIO, i.e. there's no need to allow userspace to fill
>>> in the instruction bytes.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>  arch/x86/kvm/emulate.c     | 2 ++
>>>  arch/x86/kvm/kvm_emulate.h | 1 +
>>>  arch/x86/kvm/x86.c         | 9 ++++++++-
>>>  3 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>>> index f092c54d1a2f..e141238d93b0 100644
>>> --- a/arch/x86/kvm/emulate.c
>>> +++ b/arch/x86/kvm/emulate.c
>>> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>>>  done:
>>>  	if (rc == X86EMUL_PROPAGATE_FAULT)
>>>  		ctxt->have_exception = true;
>>> +	if (rc == X86EMUL_IO_NEEDED)
>>> +		return EMULATION_IO_FETCH;
>>>  	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
>>>  }
>>>
>>> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
>>> index 89246446d6aa..3cb2e321fcd2 100644
>>> --- a/arch/x86/kvm/kvm_emulate.h
>>> +++ b/arch/x86/kvm/kvm_emulate.h
>>> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>>>  #define EMULATION_OK 0
>>>  #define EMULATION_RESTART 1
>>>  #define EMULATION_INTERCEPTED 2
>>> +#define EMULATION_IO_FETCH 3
>>>  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>>>  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>>>  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index aa5ab0c620de..7eb72694c601 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
>>>  		bytes = (unsigned)PAGE_SIZE - offset;
>>>  		ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
>>>  					       offset, bytes);
>>> -		if (unlikely(ret < 0))
>>> +		if (unlikely(ret < 0)) {
>>> +			vcpu->run->mmio.phys_addr = gpa;
>>> +			vcpu->run->mmio.len = 0;
>>> +			vcpu->run->mmio.is_write = 0;
>>> +			vcpu->run->exit_reason = KVM_EXIT_MMIO;
>>>  			return X86EMUL_IO_NEEDED;
>>> +		}
>>>
>>>  		return X86EMUL_CONTINUE;
>>>  }
>>>
>>> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>>  		r = x86_decode_emulated_instruction(vcpu, emulation_type,
>>>  						    insn, insn_len);
>>>  		if (r != EMULATION_OK)  {
>>> +			if (r == EMULATION_IO_FETCH)
>>> +				return 0;
>>>  			if ((emulation_type & EMULTYPE_TRAP_UD) ||
>>>  			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
>>>  				kvm_queue_exception(vcpu, UD_VECTOR);
>>>
>>> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
>>>

> Note that AFAIK there is another case (and probably more): if TDP is disabled and the MMU root is in MMIO, we kill the guest.
>
> mmu_alloc_shadow_roots -> mmu_check_root
>
> I used to have a few hacks in KVM to cope with this, but AFAIK I gave up on it, because the issue would show up again and again.

IIRC, s390x can have real problems if we temporarily remove a memslot and the emulation/interpretation code tries accessing guest memory, failing because there is no memslot describing that portion of guest RAM. Note that resizing/merging/splitting currently shouldn't happen on s390x, though. But resizing of memslots might happen in the near future with virtio-mem in QEMU.
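For completeness, the userspace half of this scheme is tiny. Below is a sketch of a vCPU exit handler that tolerates the proposed zero-length exit; handle_mmio() and the surrounding dispatch are illustrative, not QEMU's actual code:

#include <linux/kvm.h>

extern int handle_mmio(struct kvm_run *run);	/* illustrative helper */

static int handle_kvm_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_MMIO:
		/*
		 * len == 0 is the proposed "code fetch hit a hole" exit:
		 * there is no data to emulate, so just re-enter the guest
		 * and let the vCPU retry; by then the new memslot is
		 * (hopefully) installed.
		 */
		if (run->mmio.len == 0)
			return 0;		/* resume the vCPU */
		return handle_mmio(run);	/* normal MMIO emulation */
	case KVM_EXIT_INTERNAL_ERROR:
		return -1;			/* genuinely crashed guest */
	default:
		return 0;
	}
}

The key point is that a zero-length exit carries no data, so besides consulting mmio.phys_addr to decide whether the fetch raced with a memslot change, "resume" is the only sensible action.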
On 9/27/22 17:58, Sean Christopherson wrote: > On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote: >> >> Am 26/09/2022 um 23:28 schrieb Sean Christopherson: >>> On Mon, Sep 26, 2022, David Hildenbrand wrote: >>>> As Sean said "This is an awful lot of a complexity to take on for something >>>> that appears to be solvable in userspace." >>> >>> And if the userspace solution is unpalatable for whatever reason, I'd like to >>> understand exactly what KVM behavior is problematic for userspace. E.g. the >>> above RHBZ bug should no longer be an issue as the buggy commit has since been >>> reverted. >> >> It still is because I can reproduce the bug, as also pointed out in >> multiple comments below. > > You can reproduce _a_ bug, but it's obviously not the original bug, because the > last comment says: > > Second, indeed the patch was reverted and somehow accepted without generating > too much noise: > > ... > > The underlying issue of course as we both know is still there. > > You might have luck reproducing it with this bug > > https://bugzilla.redhat.com/show_bug.cgi?id=1855298 > > But for me it looks like it is 'working' as well, so you might have > to write a unit test to trigger the issue. > >>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd >>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle >>> the race _if_ userspace chooses not to pause vCPUs. >>> >> >> Also on the BZ they all seem (Paolo included) to agree that the issue is >> non-atomic memslots update. > > Yes, non-atomic memslot likely results in the guest fetching from a GPA without a > memslot. I'm asking for an explanation of exactly what happens when that occurs, > because it should be possible to adjust KVM and/or QEMU to play nice with the > fetch, e.g. to resume the guest until the new memslot is installed, in which case > an atomic update isn't needed. > > I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the > guest is running at CPL=0, and QEMU kills the guest in response. If that's correct, > then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead > of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to > the MMIO code fetch. > > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes > the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't > otherwise exit with mmio.len==0. I think this patch is not a good idea for two reasons: 1) we don't know how userspace behaves if mmio.len is zero. It is of course reasonable to do nothing, but an assertion failure is also a valid behavior 2) more important, there is no way to distinguish a failure due to the guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a guest that correctly caused an internal error to loop forever. While the former could be handled in a "wait and see" manner, the latter in particular is part of the KVM_RUN contract. Of course it is possible for a guest to just loop forever, but in general all of KVM, QEMU and upper userspace layers want a crashed guest to be detected and stopped forever. Yes, QEMU could loop only if memslot updates are in progress, but honestly all the alternatives I have seen to atomic memslot updates are really *awful*. 
David's patches even invent a new kind of mutex for which I have absolutely no idea what kind of deadlocks one should worry about and why they should not exist; QEMU's locking is already pretty crappy, it's certainly not on my wishlist to make it worse! This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the kernel is the only place where you can have a *good* fix. It should have been fixed years ago. Paolo
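For readers wondering why SRCU makes the kernel the natural place for this: memslot readers run inside an SRCU read-side section, so publishing a new memslot array is a single pointer swap plus a grace period. A simplified sketch of the pattern KVM already uses for single-slot updates (modulo naming; the real code is install_new_memslots()/kvm_swap_active_memslots() depending on kernel version):

static void install_new_memslots(struct kvm *kvm, int as_id,
				 struct kvm_memslots *slots)
{
	/* Publish the new array; readers hold srcu_read_lock(&kvm->srcu). */
	rcu_assign_pointer(kvm->memslots[as_id], slots);

	/* Wait until no vCPU can still be walking the old array. */
	synchronize_srcu_expedited(&kvm->srcu);
}

Readers see either the old or the new array, never a half-updated one, which is exactly the property an in-kernel batched update could extend to N slots at once.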
On 28.09.22 17:07, Paolo Bonzini wrote: > On 9/27/22 17:58, Sean Christopherson wrote: >> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote: >>> >>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson: >>>> On Mon, Sep 26, 2022, David Hildenbrand wrote: >>>>> As Sean said "This is an awful lot of a complexity to take on for something >>>>> that appears to be solvable in userspace." >>>> >>>> And if the userspace solution is unpalatable for whatever reason, I'd like to >>>> understand exactly what KVM behavior is problematic for userspace. E.g. the >>>> above RHBZ bug should no longer be an issue as the buggy commit has since been >>>> reverted. >>> >>> It still is because I can reproduce the bug, as also pointed out in >>> multiple comments below. >> >> You can reproduce _a_ bug, but it's obviously not the original bug, because the >> last comment says: >> >> Second, indeed the patch was reverted and somehow accepted without generating >> too much noise: >> >> ... >> >> The underlying issue of course as we both know is still there. >> >> You might have luck reproducing it with this bug >> >> https://bugzilla.redhat.com/show_bug.cgi?id=1855298 >> >> But for me it looks like it is 'working' as well, so you might have >> to write a unit test to trigger the issue. >> >>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd >>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle >>>> the race _if_ userspace chooses not to pause vCPUs. >>>> >>> >>> Also on the BZ they all seem (Paolo included) to agree that the issue is >>> non-atomic memslots update. >> >> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a >> memslot. I'm asking for an explanation of exactly what happens when that occurs, >> because it should be possible to adjust KVM and/or QEMU to play nice with the >> fetch, e.g. to resume the guest until the new memslot is installed, in which case >> an atomic update isn't needed. >> >> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the >> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct, >> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead >> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to >> the MMIO code fetch. >> >> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes >> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's >> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't >> otherwise exit with mmio.len==0. > > I think this patch is not a good idea for two reasons: > > 1) we don't know how userspace behaves if mmio.len is zero. It is of > course reasonable to do nothing, but an assertion failure is also a > valid behavior > > 2) more important, there is no way to distinguish a failure due to the > guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from > one due to the KVM_SET_USER_MEMORY_REGION race condition. So this will > cause a guest that correctly caused an internal error to loop forever. > > While the former could be handled in a "wait and see" manner, the latter > in particular is part of the KVM_RUN contract. Of course it is possible > for a guest to just loop forever, but in general all of KVM, QEMU and > upper userspace layers want a crashed guest to be detected and stopped > forever. 
> > Yes, QEMU could loop only if memslot updates are in progress, but > honestly all the alternatives I have seen to atomic memslot updates are > really *awful*. David's patches even invent a new kind of mutex for > which I have absolutely no idea what kind of deadlocks one should worry > about and why they should not exist; QEMU's locking is already pretty > crappy, it's certainly not on my wishlist to make it worse! Just to comment on that (I'm happy as long as this gets fixed), a simple mutex with trylock should get the thing done as well -- kicking the VCPU if the trylock fails. But I did not look further into locking alternatives.
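A minimal sketch of the trylock idea, assuming QEMU's existing qemu_mutex_trylock()/qemu_cpu_kick() primitives; the lock itself and the helper name are illustrative, not existing QEMU APIs:

static QemuMutex memslot_update_lock;	/* illustrative; held across an update */

static bool memslots_stable_or_kick(CPUState *cpu)
{
	/* The updater holds the lock for the whole multi-slot update. */
	if (qemu_mutex_trylock(&memslot_update_lock) == 0) {
		qemu_mutex_unlock(&memslot_update_lock);
		return true;		/* no update in flight */
	}
	qemu_cpu_kick(cpu);		/* bail out of KVM_RUN, retry later */
	return false;
}

The point of trylock is that vCPU-context callers never block while holding the BQL; they back off and retry instead.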
On Wed, Sep 28, 2022, Paolo Bonzini wrote: > On 9/27/22 17:58, Sean Christopherson wrote: > > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes > > the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's > > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't > > otherwise exit with mmio.len==0. > > I think this patch is not a good idea for two reasons: > > 1) we don't know how userspace behaves if mmio.len is zero. It is of course > reasonable to do nothing, but an assertion failure is also a valid behavior Except that KVM currently does neither. If the fetch happens at CPL>0 and/or in L2, KVM injects #UD. That's flat out architecturally invalid. If it's a sticking point, the mmio.len==0 hack can be avoided by defining a new exit reason. > 2) more important, there is no way to distinguish a failure due to the guest > going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due > to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a > guest that correctly caused an internal error to loop forever. Userspace has the GPA and absolutely should be able to detect if the MMIO may have been due to its memslot manipulation versus the guest jumping into the weeds. > While the former could be handled in a "wait and see" manner, the latter in > particular is part of the KVM_RUN contract. Of course it is possible for a > guest to just loop forever, but in general all of KVM, QEMU and upper > userspace layers want a crashed guest to be detected and stopped forever. > > Yes, QEMU could loop only if memslot updates are in progress, but honestly > all the alternatives I have seen to atomic memslot updates are really > *awful*. David's patches even invent a new kind of mutex for which I have > absolutely no idea what kind of deadlocks one should worry about and why > they should not exist; QEMU's locking is already pretty crappy, it's > certainly not on my wishlist to make it worse! > > This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the > kernel is the only place where you can have a *good* fix. It should have > been fixed years ago. I don't disagree that the memslots API is lacking, but IMO that is somewhat orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd like to have the luxury of being able to explore ideas beyond "let userspace batch memslot updates", and I really don't want to feel pressured to get this code reviewed and merged. E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to do wholesale replacement? That seems like it would be vastly simpler to handle on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an API that allows 1->N or N->1 conversions but not arbitrary batching. And just because QEMU's locking is "already pretty crappy", that's not a good reason to drag KVM down into the mud. E.g. taking a lock and conditionally releasing it... I get that this is an RFC, but IMO anything that requires such shenanigans simply isn't acceptable.

/*
 * Takes kvm->slots_arch_lock, and releases it only if
 * invalid_slot allocation, kvm_prepare_memory_region failed
 * or batch->is_move_delete is true.
 */
static int kvm_prepare_memslot(struct kvm *kvm,
			       struct kvm_internal_memory_region_list *batch)
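For illustration, wholesale replacement could be as small an ABI as the following; the struct and ioctl are purely hypothetical and exist nowhere in KVM:

/*
 * Hypothetical "replace everything" API sketching the
 * KVM_SET_ALL_USER_MEMORY_REGIONS idea: userspace hands KVM the
 * complete desired layout and KVM swaps it in as a whole.
 */
struct kvm_all_user_memory_regions {
	__u32 nregions;
	__u32 padding;
	struct kvm_userspace_memory_region regions[];
};

#define KVM_SET_ALL_USER_MEMORY_REGIONS \
	_IOW(KVMIO, 0xd1, struct kvm_all_user_memory_regions)	/* hypothetical number */

Because the new layout fully describes the end state, KVM would not need to reason about the ordering of intermediate operations at all.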
On 9/28/22 17:58, Sean Christopherson wrote: > On Wed, Sep 28, 2022, Paolo Bonzini wrote: >> On 9/27/22 17:58, Sean Christopherson wrote: >>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes >>> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's >>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't >>> otherwise exit with mmio.len==0. >> >> I think this patch is not a good idea for two reasons: >> >> 1) we don't know how userspace behaves if mmio.len is zero. It is of course >> reasonable to do nothing, but an assertion failure is also a valid behavior > > Except that KVM currently does neither. If the fetch happens at CPL>0 and/or in > L2, KVM injects #UD. That's flat out architecturally invalid. If it's a sticking > point, the mmio.len==0 hack can be avoided by defining a new exit reason. I agree that doing this at CPL>0 or in L2 is invalid and makes little sense (because either way the invalid address cannot be reached without help from the supervisor or L1's page tables). >> 2) more important, there is no way to distinguish a failure due to the guest >> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due >> to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a >> guest that correctly caused an internal error to loop forever. > > Userspace has the GPA and absolutely should be able to detect if the MMIO may have > been due to its memslot manipulation versus the guest jumping into the weeds. > >> While the former could be handled in a "wait and see" manner, the latter in >> particular is part of the KVM_RUN contract. Of course it is possible for a >> guest to just loop forever, but in general all of KVM, QEMU and upper >> userspace layers want a crashed guest to be detected and stopped forever. >> >> Yes, QEMU could loop only if memslot updates are in progress, but honestly >> all the alternatives I have seen to atomic memslot updates are really >> *awful*. David's patches even invent a new kind of mutex for which I have >> absolutely no idea what kind of deadlocks one should worry about and why >> they should not exist; QEMU's locking is already pretty crappy, it's >> certainly not on my wishlist to make it worse! >> >> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the >> kernel is the only place where you can have a *good* fix. It should have >> been fixed years ago. > > I don't disagree that the memslots API is lacking, but IMO that is somewhat > orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd > like to have the luxury of being able to explore ideas beyond "let userspace > batch memslot updates", and I really don't want to feel pressured to get this > code reviewed and merge. I absolutely agree that this is not a bugfix. Most new features for KVM can be seen as bug fixes if you squint hard enough, but they're still features. > E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to > do wholesale replacement? That seems like it would be vastly simpler to handle > on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an > API that allows 1->N or N->1 conversions but not arbitrary batching. Wholesale replacement was my first idea when I looked at the issue, I think at the end of 2020. 
I never got to a full implementation, but my impression was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it any easier than arbitrary batch updates.

> And just because QEMU's locking is "already pretty crappy", that's not a good reason to drag KVM down into the mud. E.g. taking a lock and conditionally releasing it... I get that this is an RFC, but IMO anything that requires such shenanigans simply isn't acceptable.
>
> /*
>  * Takes kvm->slots_arch_lock, and releases it only if
>  * invalid_slot allocation, kvm_prepare_memory_region failed
>  * or batch->is_move_delete is true.
>  */
> static int kvm_prepare_memslot(struct kvm *kvm,
>                                struct kvm_internal_memory_region_list *batch)

No objection about that. :)

Paolo
On Wed, Sep 28, 2022, Paolo Bonzini wrote: > On 9/28/22 17:58, Sean Christopherson wrote: >> I don't disagree that the memslots API is lacking, but IMO that is somewhat >> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API >> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd >> like to have the luxury of being able to explore ideas beyond "let userspace >> batch memslot updates", and I really don't want to feel pressured to get this >> code reviewed and merged. > > I absolutely agree that this is not a bugfix. Most new features for KVM can > be seen as bug fixes if you squint hard enough, but they're still features. I guess I'm complaining that there isn't sufficient justification for this new feature. The cover letter provides a bug that would be fixed by having batched updates, but as above, that's really due to deficiencies in a different KVM ABI. Beyond that, there's no explanation of why this exact API is necessary, i.e. there are no requirements given.

- Why can't this be solved in userspace?

- Is performance a concern? I.e. are updates that need to be batched going to be high frequency operations?

- What operations does userspace truly need? E.g. if the only use case is to split/truncate/hole punch an existing memslot, can KVM instead provide a memslot flag and exit reason that allows kicking vCPUs to userspace if the memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot, but KVM exits with a dedicated exit reason instead of generating MMIO semantics.

- What precisely needs to be atomic? If updates only need to be "atomic" for an address space, does the API allow mixing non-SMM and SMM memslots?

- When is KVM required to invalidate and flush? E.g. if a memslot is deleted and recreated with a different HVA, how does KVM ensure that there are no outstanding references to the old HVA without introducing non-deterministic behavior? The current API works by forcing userspace to fully delete the memslot, i.e. KVM can ensure all references are gone in all TLBs before allowing userspace to create new, conflicting entries. I don't see how this can work with batched updates. The update needs to be "atomic", i.e. vCPUs must never see an invalid/deleted memslot, but if the memslot is writable, how does KVM prevent some writes from hitting the old HVA and some from hitting the new HVA without a quiescent period?

- If a memslot is truncated while dirty logging is enabled, what happens to the bitmap? Is it preserved? Dropped?

Again, I completely agree that the current memslots API is far from perfect, but I'm not convinced that simply extending the existing API to batch updates is the best solution from a KVM perspective. >> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to >> do wholesale replacement? That seems like it would be vastly simpler to handle >> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an >> API that allows 1->N or N->1 conversions but not arbitrary batching. > > Wholesale replacement was my first idea when I looked at the issue, I think > at the end of 2020. I never got to a full implementation, but my impression > was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it > any easier than arbitrary batch updates. It's not obvious to me that the memslot metadata is going to be easy to handle regardless of what we do. E.g.
I'm pretty sure that batching updates will "corrupt" the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
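To see why hole punching is awkward: the dirty bitmap is indexed by gfn relative to the slot's base, so splitting one slot into two invalidates every bit position. A sketch of the rebasing that a "preserving" implementation would have to do (kernel-style, simplified; allocation, locking and error handling omitted):

/*
 * Punching a hole in a dirty-logged slot turns one bitmap into two,
 * each re-based on its new slot's base gfn. Skipping this rebasing,
 * as naive batching would, reports dirty state for the wrong pages.
 */
static void split_dirty_bitmap(const unsigned long *old, u64 old_base,
			       unsigned long *low, u64 low_base, u64 low_npages,
			       unsigned long *high, u64 high_base, u64 high_npages)
{
	u64 i;

	/* Each new slot's bitmap is indexed from its own base gfn. */
	for (i = 0; i < low_npages; i++)
		if (test_bit(low_base - old_base + i, old))
			set_bit(i, low);
	for (i = 0; i < high_npages; i++)
		if (test_bit(high_base - old_base + i, old))
			set_bit(i, high);
}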
On 28/09/2022 22:41, Sean Christopherson wrote: > On Wed, Sep 28, 2022, Paolo Bonzini wrote: >> On 9/28/22 17:58, Sean Christopherson wrote: >>> I don't disagree that the memslots API is lacking, but IMO that is somewhat >>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API >>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd >>> like to have the luxury of being able to explore ideas beyond "let userspace >>> batch memslot updates", and I really don't want to feel pressured to get this >>> code reviewed and merged. >> >> I absolutely agree that this is not a bugfix. Most new features for KVM can >> be seen as bug fixes if you squint hard enough, but they're still features. > > I guess I'm complaining that there isn't sufficient justification for this new > feature. The cover letter provides a bug that would be fixed by having batched > updates, but as above, that's really due to deficiencies in a different KVM ABI. > > Beyond that, there's no explanation of why this exact API is necessary, i.e. there > are no requirements given. > > - Why can't this be solved in userspace? Because this would provide the "feature" only to QEMU, leaving every other hypervisor to implement its own. In addition (maybe you already answered this question but I couldn't find an answer in the email thread), does it make sense to stop all vcpus for a couple of memslot updates? What if we have 100 cpus? > > - Is performance a concern? I.e. are updates that need to be batched going to > be high frequency operations? Currently they are limited to run only at boot. In an unmodified KVM/QEMU build, however, I count 86 memslot updates done at boot with ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm --display none > > - What operations does userspace truly need? E.g. if the only use case is to > split/truncate/hole punch an existing memslot, can KVM instead provide a > memslot flag and exit reason that allows kicking vCPUs to userspace if the > memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot, > but KVM exits with a dedicated exit reason instead of generating MMIO semantics. That could be an alternative solution, I guess? > > - What precisely needs to be atomic? If updates only need to be "atomic" for > an address space, does the API allow mixing non-SMM and SMM memslots? Does QEMU pass a list of mixed non-SMM and SMM memslots? Btw, is there a separate list for SMM memslots in KVM? If not, and they are all thrown together, then for simplicity we make all updates atomic. > > - When is KVM required to invalidate and flush? E.g. if a memslot is deleted > and recreated with a different HVA, how does KVM ensure that there are no > outstanding references to the old HVA without introducing non-deterministic > behavior? The current API works by forcing userspace to fully delete the > memslot, i.e. KVM can ensure all references are gone in all TLBs before > allowing userspace to create new, conflicting entries. I don't see how this > can work with batched updates. The update needs to be "atomic", i.e. vCPUs > must never see an invalid/deleted memslot, but if the memslot is writable, > how does KVM prevent some writes from hitting the old HVA and some from hitting > the new HVA without a quiescent period? Sorry, this might be my fault in providing definitions, even though I think I gave plenty of examples. Delete/move operations are not really "atomic" in the sense that the slot disappears immediately.
The slot(s) is/are first "atomically" invalidated, allowing the guest to retry the page fault and preparing for the cleanup you mention above, and only then "atomically" deleted. The behavior is the same, just instead of doing

invalidate memslot X
swap()
delete memslot X
swap()
invalidate memslot Y
swap()
delete memslot Y
swap()
...

I would do:

invalidate
invalidate
invalidate
swap()
delete
delete
delete
swap()

> > - If a memslot is truncated while dirty logging is enabled, what happens to > the bitmap? Is it preserved? Dropped? Can you explain what you mean by "truncated memslot"? Regarding the bitmap, currently QEMU should (probably wrongly) update it before even committing the changes to KVM. But I remember that upstream someone pointed out that this could be solved later. > > Again, I completely agree that the current memslots API is far from perfect, but > I'm not convinced that simply extending the existing API to batch updates is the > best solution from a KVM perspective. > >>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to >>> do wholesale replacement? That seems like it would be vastly simpler to handle >>> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an >>> API that allows 1->N or N->1 conversions but not arbitrary batching. >> >> Wholesale replacement was my first idea when I looked at the issue, I think >> at the end of 2020. I never got to a full implementation, but my impression >> was that allocating/deallocating dirty bitmaps, rmaps etc. wouldn't make it >> any easier than arbitrary batch updates. > > It's not obvious to me that the memslot metadata is going to be easy to handle > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt" > the dirty bitmap if a hole is punched in a memslot that's being dirty logged. Could you provide an example? I mean, I am not an expert, but to me it looks like I preserved the same old functionality both here and in QEMU. I just batched and delayed some operations, which in this case should cause no harm. Thank you, Emanuele
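As a sketch, the batched two-phase flow described above would look roughly like this in the kernel; all three helpers are hypothetical stand-ins for the RFC's internals:

extern void mark_slot_invalid(struct kvm *kvm,
			      struct kvm_userspace_memory_region *req);	/* hypothetical */
extern void apply_slot_change(struct kvm *kvm,
			      struct kvm_userspace_memory_region *req);	/* hypothetical */
extern void swap_active_memslots(struct kvm *kvm);	/* hypothetical: SRCU publish + sync */

static void kvm_batch_update(struct kvm *kvm,
			     struct kvm_userspace_memory_region *reqs, int n)
{
	int i;

	/* Phase 1: one publish flags every touched slot KVM_MEMSLOT_INVALID. */
	for (i = 0; i < n; i++)
		mark_slot_invalid(kvm, &reqs[i]);
	swap_active_memslots(kvm);

	/* Phase 2: one publish installs the final layout. */
	for (i = 0; i < n; i++)
		apply_slot_change(kvm, &reqs[i]);
	swap_active_memslots(kvm);
}

vCPUs thus only ever observe "old", "all-invalid", or "new" states, never a partially applied mix, which is the property the series calls "atomic".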
On 29.09.22 10:05, Emanuele Giuseppe Esposito wrote: > > > Am 28/09/2022 um 22:41 schrieb Sean Christopherson: >> On Wed, Sep 28, 2022, Paolo Bonzini wrote: >>> On 9/28/22 17:58, Sean Christopherson wrote: >>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat >>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API >>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd >>>> like to have the luxury of being able to explore ideas beyond "let userspace >>>> batch memslot updates", and I really don't want to feel pressured to get this >>>> code reviewed and merge. >>> >>> I absolutely agree that this is not a bugfix. Most new features for KVM can >>> be seen as bug fixes if you squint hard enough, but they're still features. >> >> I guess I'm complaining that there isn't sufficient justification for this new >> feature. The cover letter provides a bug that would be fixed by having batched >> updates, but as above, that's really due to deficiencies in a different KVM ABI. >> >> Beyond that, there's no explanation of why this exact API is necessary, i.e. there >> are no requirements given. >> >> - Why can't this be solved in userspace? > > Because this would provide the "feature" only to QEMU, leaving each > other hypervisor to implement its own. > > In addition (maybe you already answered this question but I couldn't > find an answer in the email thread), does it make sense to stop all > vcpus for a couple of memslot update? What if we have 100 cpus? > >> >> - Is performance a concern? I.e. are updates that need to be batched going to >> be high frequency operations? > > Currently they are limited to run only at boot. In an unmodified > KVM/QEMU build, however, I count 86 memslot updates done at boot with > > ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm > --display none I *think* there are only ~3 problematic ones (split/resize), where we temporarily delete something we will re-add. At least that's what I remember from working on my prototype.
On Thu, Sep 29, 2022, Emanuele Giuseppe Esposito wrote: > > Am 28/09/2022 um 22:41 schrieb Sean Christopherson: > > Beyond that, there's no explanation of why this exact API is necessary, i.e. there > > are no requirements given. > > > > - Why can't this be solved in userspace? > > Because this would provide the "feature" only to QEMU, leaving each > other hypervisor to implement its own. But there's no evidence that any other VMM actually needs this feature. > > - When is KVM required to invalidate and flush? E.g. if a memslot is deleted > > and recreated with a different HVA, how does KVM ensure that there are no > > outstanding references to the old HVA without introducing non-determinstic > > behavior. The current API works by forcing userspace to fully delete the > > memslot, i.e. KVM can ensure all references are gone in all TLBs before > > allowing userspace to create new, conflicting entries. I don't see how this > > can work with batched updates. The update needs to be "atomic", i.e. vCPUs > > must never see an invalid/deleted memslot, but if the memslot is writable, > > how does KVM prevent some writes from hitting the old HVA and some from hitting > > the new HVA without a quiescent period? > > Sorry this might be my fault in providing definitions, even though I > think I made plenty of examples. Delete/move operations are not really > "atomic" in the sense that the slot disappears immediately. > > The slot(s) is/are first "atomically" invalidated, allowing the guest to > retry the page fault I completely forgot that KVM x86 now retries on KVM_MEMSLOT_INVALID[*]. That is somewhat arbitrary behavior, e.g. if x86 didn't do MMIO SPTE caching it wouldn't be "necessary" and x86 would still treat INVALID as MMIO. It's also x86 specific, no other architecture retries on invalid memslots, i.e. relying on that behavior to provide "atomicity" won't work without updating all other architectures. That's obviously doable, but since these updates aren't actually atomic, I don't see any meaningful benefit over adding e.g. KVM_MEM_DISABLED. [*] e0c378684b65 ("KVM: x86/mmu: Retry page faults that hit an invalid memslot") > > - If a memslot is truncated while dirty logging is enabled, what happens to > > the bitmap? Is it preserved? Dropped? > > Can you explain what you mean with "truncated memslot"? Shrink the size of the memslot. > Regarding the bitmap, currently QEMU should (probably wrongly) update it > before even committing the changes to KVM. But I remember upstream > someone pointed that this could be solved later. These details can't be punted, as they affect the ABI. My interpretation of "atomic" memslot updates was that KVM would truly honor the promise of atomic updates, e.g. that a DELETE=>CREATE sequence would effectively allow truncating a memslot without losing any data. Those details aren't really something that can be punted down the road to be fixed at a later point because they very much matter from an ABI perspective. > > Again, I completely agree that the current memslots API is far from perfect, but > > I'm not convinced that simply extending the existing API to batch updates is the > > best solution from a KVM perspective. > > > >>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to > >>> do wholesale replacement? That seems like it would be vastly simpler to handle > >>> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an > >>> API that allows 1->N or N->1 conversions but not arbitrary batching. 
> >> > >> Wholesale replacement was my first idea when I looked at the issue, I think > >> at the end of 2020. I never got to a full implementation, but my impression > >> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any > >> easier than arbitrary batch updates. > > > > It's not obvious to me that the memslot metadata is going to be easy to handle > > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt" > > the dirty bitmap if a hole is punched in a memslot that's being dirty logged. > > > > Could you provide an example? > I mean I am not an expert but to me it looks like I preserved the same > old functionalities both here and in QEMU. I just batched and delayed > some operations, which in this case should cause no harm. As above, my confusion is due to calling these updates "atomic". The new API is really just "allow batching memslot updates and subtly rely on restarting page faults on KVM_MEMSLOT_INVALID". IMO, KVM_MEM_DISABLED or similar is the way to go. I.e. formalize the "restart page faults" semantics without taking on the complexity of batched updates.
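To make the two alternatives concrete, here is a sketch contrasting today's x86-only retry rule with a formalized flag; fault_should_retry() paraphrases the logic added by e0c378684b65, while KVM_MEM_DISABLED and KVM_EXIT_MEMSLOT_DISABLED are hypothetical:

static bool fault_should_retry(struct kvm_memory_slot *slot)
{
	/* x86 today: a fault on an invalid slot is simply replayed. */
	return slot && (slot->flags & KVM_MEMSLOT_INVALID);
}

static bool fault_goes_to_userspace(struct kvm_vcpu *vcpu,
				    struct kvm_memory_slot *slot, gpa_t gpa)
{
	/* Hypothetical: a formalized "disabled" slot exits instead. */
	if (!slot || !(slot->flags & KVM_MEM_DISABLED))
		return false;
	vcpu->run->exit_reason = KVM_EXIT_MEMSLOT_DISABLED;	/* hypothetical */
	vcpu->run->mmio.phys_addr = gpa;	/* sketch: reuse mmio for the GPA */
	return true;
}

The difference is that the first variant is invisible, arch-specific behavior that userspace cannot rely on, whereas the second is a documented contract.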
On 9/28/22 22:41, Sean Christopherson wrote: > On Wed, Sep 28, 2022, Paolo Bonzini wrote: >> On 9/28/22 17:58, Sean Christopherson wrote: >>> I don't disagree that the memslots API is lacking, but IMO that is somewhat >>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API >>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd >>> like to have the luxury of being able to explore ideas beyond "let userspace >>> batch memslot updates", and I really don't want to feel pressured to get this >>> code reviewed and merge. >> >> I absolutely agree that this is not a bugfix. Most new features for KVM can >> be seen as bug fixes if you squint hard enough, but they're still features. > > I guess I'm complaining that there isn't sufficient justification for this new > feature. The cover letter provides a bug that would be fixed by having batched > updates, but as above, that's really due to deficiencies in a different KVM ABI. I disagree. Failure to fetch should be fixed but is otherwise a red herring. It's the whole memslot API (including dirty logging) that is a mess. If you think we should overhaul it even more than just providing atomic batched updates, that's fine. But still, the impossibility to perform atomic updates in batches *is* a suboptimal part of the KVM API. > - Why can't this be solved in userspace? I don't think *can't* is the right word. If the metric of choice was "what can be solved in userspace", we'd all be using microkernels. The question is why userspace would be a better place to solve it. The only reason to do it in userspace would be if failure to fetch is something that is interesting to userspace, other than between two KVM_SET_USER_MEMORY_REGION. Unless you provide an API to pass failures to fetch down to userspace, the locking in userspace is going to be inferior, because it would have to be unconditional. This means worse performance and more complication, not to mention having to do it N times instead of 1 for N implementations. Not forcing userspace to do "tricks" is in my opinion a strong part of deciding whether an API belongs in KVM. > - What operations does userspace truly need? E.g. if the only use case is to > split/truncate/hole punch an existing memslot, can KVM instead provide a > memslot flag and exit reason that allows kicking vCPUs to userspace if the > memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot, > but KVM exists with a dedicated exit reason instead of generating MMIO semantics. The main cases are: - for the boot case, splitting and merging existing memslots. QEMU likes to merge adjacent memory regions into a single memslot, so if something goes from read-write to read-only it has to be split and vice versa. I guess a "don't merge this memory region" flag would be the less hideous way to solve it in userspace. - however, there is also the case of resizing an existing memslot which is what David would like to have for virtio-mem. This is not really fixable because part of the appeal of virtio-mem is to have a single huge memslot instead of many smaller ones, in order to reduce the granularity of add/remove (David, correct me if I'm wrong). (The latter _would_ be needed by other VMMs). > If updates only need to be "atomic" for an address space, does the API allowing > mixing non-SMM and SMM memslots? I agree that the address space should be moved out of the single entries and into the header if we follow through with this approach. 
> The update needs to be "atomic", i.e. vCPUs > must never see an invalid/deleted memslot, but if the memslot is writable, > how does KVM prevent some writes from hitting the old HVA and some from hitting > the new HVA without a quiescent period? (Heh, and I forgot likewise that non-x86 does not retry on KVM_MEMSLOT_INVALID. Yes, that would be treated as a bug on other architectures). >> Wholesale replacement was my first idea when I looked at the issue, I think >> at the end of 2020. I never got to a full implementation, but my impression >> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any >> easier than arbitrary batch updates. > > It's not obvious to me that the memslot metadata is going to be easy to handle > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt" > the dirty bitmap if a hole is punched in a memslot that's being dirty logged. Indeed; I would have thought that it is clear with the batch updates API (which requires the update to be split into delete and insert), but apparently it's not and it's by no means optimal. Paolo
On Thu, 2022-09-29 at 17:28 +0200, Paolo Bonzini wrote: > On 9/28/22 22:41, Sean Christopherson wrote: > > On Wed, Sep 28, 2022, Paolo Bonzini wrote: > > > On 9/28/22 17:58, Sean Christopherson wrote: > > > > I don't disagree that the memslots API is lacking, but IMO that is somewhat > > > > orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API > > > > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd > > > > like to have the luxury of being able to explore ideas beyond "let userspace > > > > batch memslot updates", and I really don't want to feel pressured to get this > > > > code reviewed and merge. > > > > > > I absolutely agree that this is not a bugfix. Most new features for KVM can > > > be seen as bug fixes if you squint hard enough, but they're still features. > > > > I guess I'm complaining that there isn't sufficient justification for this new > > feature. The cover letter provides a bug that would be fixed by having batched > > updates, but as above, that's really due to deficiencies in a different KVM ABI. > > I disagree. Failure to fetch should be fixed but is otherwise a red > herring. It's the whole memslot API (including dirty logging) that is a > mess. > > If you think we should overhaul it even more than just providing atomic > batched updates, that's fine. But still, the impossibility to perform > atomic updates in batches *is* a suboptimal part of the KVM API. > > > - Why can't this be solved in userspace? > > I don't think *can't* is the right word. If the metric of choice was > "what can be solved in userspace", we'd all be using microkernels. The > question is why userspace would be a better place to solve it. > > The only reason to do it in userspace would be if failure to fetch is > something that is interesting to userspace, other than between two > KVM_SET_USER_MEMORY_REGION. Unless you provide an API to pass failures > to fetch down to userspace, the locking in userspace is going to be > inferior, because it would have to be unconditional. This means worse > performance and more complication, not to mention having to do it N > times instead of 1 for N implementations. > > Not forcing userspace to do "tricks" is in my opinion a strong part of > deciding whether an API belongs in KVM. > > > - What operations does userspace truly need? E.g. if the only use case is to > > split/truncate/hole punch an existing memslot, can KVM instead provide a > > memslot flag and exit reason that allows kicking vCPUs to userspace if the > > memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot, > > but KVM exists with a dedicated exit reason instead of generating MMIO semantics. > > The main cases are: > > - for the boot case, splitting and merging existing memslots. QEMU > likes to merge adjacent memory regions into a single memslot, so if > something goes from read-write to read-only it has to be split and vice > versa. I guess a "don't merge this memory region" flag would be the > less hideous way to solve it in userspace. > > - however, there is also the case of resizing an existing memslot which > is what David would like to have for virtio-mem. This is not really > fixable because part of the appeal of virtio-mem is to have a single > huge memslot instead of many smaller ones, in order to reduce the > granularity of add/remove (David, correct me if I'm wrong). > > (The latter _would_ be needed by other VMMs). 
> > > If updates only need to be "atomic" for an address space, does the API allowing > > mixing non-SMM and SMM memslots? > > I agree that the address space should be moved out of the single entries > and into the header if we follow through with this approach. > > > The update needs to be "atomic", i.e. vCPUs > > must never see an invalid/deleted memslot, but if the memslot is writable, > > how does KVM prevent some writes from hitting the old HVA and some from hitting > > the new HVA without a quiescent period? > > (Heh, and I forgot likewise that non-x86 does not retry on > KVM_MEMSLOT_INVALID. Yes, that would be treated as a bug on other > architectures). > > > > Wholesale replacement was my first idea when I looked at the issue, I think > > > at the end of 2020. I never got to a full implementation, but my impression > > > was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any > > > easier than arbitrary batch updates. > > > > It's not obvious to me that the memslot metadata is going to be easy to handle > > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt" > > the dirty bitmap if a hole is punched in a memslot that's being dirty logged. > > Indeed; I would have thought that it is clear with the batch updates API > (which requires the update to be split into delete and insert), but > apparently it's not and it's by no means optimal. I 100% agree with everything Paolo said. Best regards, Maxim Levitsky > > Paolo >
On 9/29/22 17:18, Sean Christopherson wrote: > IMO, KVM_MEM_DISABLED or similar is the way to go. I.e. formalize the "restart > page faults" semantics without taking on the complexity of batched updates. If userspace has to collaborate, KVM_MEM_DISABLED (or KVM_MEM_USER_EXIT would be a better name) is not needed either, except as an optimization; you can just kick all CPUs unconditionally. And in fact KVM_MEM_DISABLED is not particularly easy to implement either: in order to allow split/merge, it should be possible for a new memslot to replace multiple disabled memslots, and to only partially overlap the first/last disabled memslot it replaces. None of this is allowed for other memslots, so exactly the same metadata complications exist as for other options such as wholesale replacement or batched updates. The only semantics with a sane implementation would be to destroy the dirty bitmap of disabled memslots when they are replaced. At least it would be possible for userspace to set KVM_MEM_DISABLED, issue KVM_GET_DIRTY_LOG and then finally create the new memslot. That would be _correct_, but still not very appealing. I don't rule out that I'm suffering from tunnel vision, but batched updates to me still seem to be the least bad option. Paolo
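Spelled out, the "correct but not appealing" sequence would be something like the following from userspace; KVM_MEM_DISABLED and the in-place replacement of a disabled slot are both hypothetical semantics (today a slot must be deleted with memory_size = 0 and re-created):

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_MEM_DISABLED (1UL << 2)	/* hypothetical, not in linux/kvm.h */

static int truncate_logged_slot(int vm_fd,
				struct kvm_userspace_memory_region *region,
				struct kvm_dirty_log *log, __u64 new_size)
{
	region->flags |= KVM_MEM_DISABLED;		/* hypothetical flag */
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region) < 0)
		return -1;
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, log) < 0)	/* salvage the bitmap */
		return -1;
	region->flags &= ~KVM_MEM_DISABLED;
	region->memory_size = new_size;		/* hypothetical in-place replace */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);
}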
> The main cases are: > > - for the boot case, splitting and merging existing memslots. QEMU > likes to merge adjacent memory regions into a single memslot, so if > something goes from read-write to read-only it has to be split and vice > versa. I guess a "don't merge this memory region" flag would be the > less hideous way to solve it in userspace. > > - however, there is also the case of resizing an existing memslot which > is what David would like to have for virtio-mem. This is not really > fixable because part of the appeal of virtio-mem is to have a single > huge memslot instead of many smaller ones, in order to reduce the > granularity of add/remove (David, correct me if I'm wrong). Yes, the most important case I am working on in that regard is reducing the memslot size/overhead when only exposing comparatively little memory towards a VM using virtio-mem (say, a virtio-mem device that could grow to 1 TiB, but we initially only expose 1 GiB to the VM). One approach I prototyped in the past (where my RFC for atomic updates came into play because I ran into this issue) to achieve that was dynamically growing (and eventually shrinking) a single memslot on demand. An alternative [1] uses multiple individual memslots, and exposes them on demand. There, I have to make sure that QEMU won't merge adjacent memslots into a bigger one -- because otherwise, we'd again need atomic memslot updates again, which KVM, vhost-user, ... don't support. But in the future, I think we want to have that: if there is no fragmentation, we should only have a single large memslot and all memslot consumers should be able to cope with atomic updates. So in any case, I will have good use for atomic memslot updates. Just like other hypervisors that (will) implement virtio-mem or something comparable :) [1] https://lore.kernel.org/all/20211013103330.26869-1-david@redhat.com/T/#u
On Thu, Sep 29, 2022, Paolo Bonzini wrote: > On 9/28/22 22:41, Sean Christopherson wrote: > If you think we should overhaul it even more than just providing atomic > batched updates, that's fine. Or maybe solve the problem(s) with more precision? What I don't like about the batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly becoming atomic. When I hear "atomic" updates, I think "all transactions are committed at once". That is not what this proposed API does, and due to the TLB flushing requirements, I don't see how it can be implemented. > But still, the impossibility to perform > atomic updates in batches *is* a suboptimal part of the KVM API. I don't disagree, but I honestly don't see the difference between "batching" and providing a KVM_MEM_USER_EXIT flag. The batch API will provide better performance, but I would be surprised if it's significant enough to matter. I'm not saying that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in agreement that handling memslot metadata will suck regardless. My point is that it's simpler to implement in KVM, much easier to document, and for all intents and purposes provides the same desired functionality: vCPUs that access in-flux memslots are stalled, and so userspace doesn't have to pause and rendezvous all vCPUs to provide "atomic" updates. > The main cases are: > > - for the boot case, splitting and merging existing memslots. QEMU likes to > merge adjacent memory regions into a single memslot, so if something goes > from read-write to read-only it has to be split and vice versa. I guess a > "don't merge this memory region" flag would be the less hideous way to solve > it in userspace. > > - however, there is also the case of resizing an existing memslot which is > what David would like to have for virtio-mem. This is not really fixable > because part of the appeal of virtio-mem is to have a single huge memslot > instead of many smaller ones, in order to reduce the granularity of > add/remove (David, correct me if I'm wrong). > > (The latter _would_ be needed by other VMMs). If we really want to provide a better experience for userspace, why not provide more primitives to address those specific use cases? E.g. fix KVM's historic wart of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:

- Merge 2+ memory regions that are contiguous in GPA and HVA
- Split a memory region into 2+ memory regions
- Truncate a memory region
- Grow a memory region

That would probably require more KVM code overall, but each operation would be more tightly bounded and thus simpler to define. And I think more precise APIs would provide other benefits, e.g. growing a region wouldn't require first deleting the current region, and so could avoid zapping SPTEs and destroying metadata. Merge, split, and truncate would have similar benefits, though they might be more difficult to realize in practice. What I really don't like about the "batch" API, aside from the "atomic" name, is that it's too open ended and creates conflicts. E.g. to allow "atomic" merging after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that implies that each operation in the batch operates on the working state created by the previous operations. But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting all "invalidations" to the front half of the "atomic" operations breaks the ordering. And there's a ridiculous amount of potential weirdness with MOVE=>MOVE operations.
KVM could solve those issues by disallowing MOVE and disallowing DELETE after CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating on a single contiguous chunk of memslots. At that point, it seems that providing primitives to do each macro operation is an even better experience for userspace. In other words, make memslots a bit more CISC ;-)
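As a strawman for the "CISC" direction, a per-operation ABI could be as small as the following; everything below, including the ioctl number, is hypothetical API design, not code that exists in KVM:

/*
 * One ioctl per macro operation, each individually atomic. Merge,
 * split and truncate would get similar, equally narrow structs.
 */
struct kvm_memory_region_resize {	/* hypothetical */
	__u32 slot;
	__u32 padding;
	__u64 new_memory_size;		/* grow or truncate in place */
};

#define KVM_RESIZE_USER_MEMORY_REGION \
	_IOW(KVMIO, 0xd2, struct kvm_memory_region_resize)	/* hypothetical number */

Each primitive is tightly bounded, so its invalidation, flushing and metadata rules can be specified exactly, instead of having to hold for arbitrary operation sequences.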
On 29/09/2022 23:39, Sean Christopherson wrote: > If we really want to provide a better experience for userspace, why not provide > more primitives to address those specific use cases? E.g. fix KVM's historic wart > of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to: > > - Merge 2+ memory regions that are contiguous in GPA and HVA > - Split a memory region into 2+ memory regions > - Truncate a memory region > - Grow a memory region I looked again at the code and specifically at the use case that triggers the crash in bugzilla. I *think* (correct me if I am wrong) that the only operation QEMU performs right now is "grow/shrink". So *if* we want to go this way, we could start with a simple grow/shrink API. Even though we need to consider that this could bring additional complexity to QEMU. Currently, DELETE+CREATE (grow/shrink) is not performed one after the other (in my innocent fantasy I was expecting to find 2 subsequent ioctls in the code), but in QEMU's address_space_set_flatview(), which seems to first remove all regions and then add them when changing flatviews.

address_space_update_topology_pass(as, old_view2, new_view, adding=false);
address_space_update_topology_pass(as, old_view2, new_view, adding=true);

I don't think we can change this, as other listeners also rely on such ordering, but we can still batch all callback requests like I currently do and process them in kvm_commit(), figuring out there which operation is which. In other words, we would have something like:

region_del() --> DELETE memslot X -> add it to the queue of operations
region_del() --> DELETE memslot Y -> add it to the queue of operations
region_add() --> CREATE memslot X (size doubled) -> add it to the queue of operations
region_add() --> CREATE memslot Y (size halved) -> add it to the queue of operations
...
commit() --> scan the queue and figure out what to do -> GROW X (+size), SHRINK Y (-size) -> issue 2 ioctls, GROW and SHRINK

> That would probably require more KVM code overall, but each operation would be more > tightly bounded and thus simpler to define. And I think more precise APIs would > provide other benefits, e.g. growing a region wouldn't require first deleting the > current region, and so could avoid zapping SPTEs and destroying metadata. Merge, > split, and truncate would have similar benefits, though they might be more > difficult to realize in practice. So essentially grow would not require INVALIDATE. Makes sense, but would it work also with shrink? I guess so, as the memslot is still present (but shrunk), right? Paolo, would you be ok with this smaller API? Probably just starting with grow and shrink first. I am not against either of the two approaches:

- my approach has the disadvantage that the list could be arbitrarily long, and it is difficult to roll back the intermediate changes if something breaks during the request processing (but this could be simplified by making KVM exit or crash).

- Sean's approach could potentially put more burden on userspace, as we need to figure out which operation is which. Also, from my understanding, split and merge won't be really straightforward to implement, especially in userspace.

David, any concern from a userspace perspective on this "CISC" approach? Thank you, Emanuele
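The commit()-time scan described above boils down to pairing a DELETE with the CREATE that follows it for the same slot; a sketch, with the queue layout and both helpers being illustrative:

#include <stdint.h>

enum { OP_DELETE, OP_CREATE };

typedef struct QueuedOp {	/* illustrative queue entry */
	int op;
	int slot_id;
	uint64_t start;
	uint64_t size;
} QueuedOp;

extern void kvm_resize_slot(int slot_id, uint64_t new_size);	/* hypothetical GROW/SHRINK ioctl */
extern void kvm_apply_op(const QueuedOp *op);			/* plain delete or create */

static void kvm_commit_queued_ops(const QueuedOp *q, int n)
{
	for (int i = 0; i < n; i++) {
		/* A DELETE followed by a CREATE of the same slot at the
		 * same base address is really a resize in disguise. */
		if (q[i].op == OP_DELETE && i + 1 < n &&
		    q[i + 1].op == OP_CREATE &&
		    q[i + 1].slot_id == q[i].slot_id &&
		    q[i + 1].start == q[i].start) {
			kvm_resize_slot(q[i].slot_id, q[i + 1].size);
			i++;			/* consume the matching CREATE */
		} else {
			kvm_apply_op(&q[i]);
		}
	}
}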
On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>
> On 29/09/2022 at 23:39, Sean Christopherson wrote:
>> If we really want to provide a better experience for userspace, why not provide
>> more primitives to address those specific use cases? E.g. fix KVM's historic wart
>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>>
>>  - Merge 2+ memory regions that are contiguous in GPA and HVA
>>  - Split a memory region into 2+ memory regions
>>  - Truncate a memory region
>>  - Grow a memory region
>
> I looked again at the code and specifically at the use case that triggers
> the crash in the bugzilla. I *think* (correct me if I am wrong) that the
> only operation QEMU performs right now is grow/shrink.

I remember that there were BUG reports where we'd actually split and run
into that problem. I just don't have them at hand. I think they happened
during early boot, when the OS re-configured some PCI thingies.

> So *if* we want to go this way, we could start with a simple grow/shrink
> API.
>
> Even though we need to consider that this could bring additional
> complexity to QEMU: currently, DELETE+CREATE (grow/shrink) is not
> performed one ioctl after the other (in my innocent fantasy I was
> expecting to find two consecutive ioctls in the code), but in QEMU's
> address_space_set_flatview(), which seems to first remove all regions
> and then re-add them when changing flatviews:
>
>     address_space_update_topology_pass(as, old_view2, new_view, adding=false);
>     address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>
> I don't think we can change this, as other listeners also rely on this
> ordering, but we can still batch all callback requests like I currently
> do and process them in kvm_commit(), figuring out there which operation
> is which.
>
> In other words, we would have something like:
>
> region_del() --> DELETE memslot X -> add it to the queue of operations
> region_del() --> DELETE memslot Y -> add it to the queue of operations
> region_add() --> CREATE memslot X (size doubled) -> add it to the queue of operations
> region_add() --> CREATE memslot Y (size halved) -> add it to the queue of operations
> ...
> commit() --> scan the queue and figure out what to do -> GROW X (+size),
> SHRINK Y (-size) -> issue 2 ioctls, GROW and SHRINK.

I communicated resizes (region_resize()) to the notifier in patch #3 of
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

You could use that independently of the remainder. It should be less
controversial ;)

But I think it only tackles part of the more generic problem (split/merge).

>> That would probably require more KVM code overall, but each operation would be more
>> tightly bounded and thus simpler to define. And I think more precise APIs would
>> provide other benefits, e.g. growing a region wouldn't require first deleting the
>> current region, and so could avoid zapping SPTEs and destroying metadata. Merge,
>> split, and truncate would have similar benefits, though they might be more
>> difficult to realize in practice.
>
> So essentially grow would not require INVALIDATE. Makes sense, but would
> it also work with shrink? I guess so, as the memslot is still present
> (just shrunk), right?
>
> Paolo, would you be ok with this smaller API? Probably just starting
> with grow and shrink first.
>
> I am not against either of the two approaches:
> - my approach has the disadvantage that the list could be arbitrarily
> long, and it is difficult to roll back the intermediate changes if
> something breaks during the request processing (though this could be
> simplified by making KVM exit or crash);
>
> - Sean's approach could put more burden on userspace, as we need to
> figure out which operation is which. Also, from my understanding, split
> and merge won't be really straightforward to implement, especially in
> userspace.
>
> David, any concern from a userspace perspective on this "CISC" approach?

In contrast to resizes in QEMU, which only affect a single memory
region/slot, splitting/merging is harder to factor out and communicate
to a notifier. As an alternative, we could handle it in the commit stage
in the notifier itself, similar to what my prototype does, and figure
out what needs to be done in there and how to call the proper KVM
interface (and which interface to call).

With virtio-mem (in the future) we might see merges of 2 slots into a
single one, by closing a gap in-between them. In "theory" we could
combine some updates into a single transaction. But it won't be 100s ...

I think I'd prefer an API that doesn't require excessive ad-hoc
extensions later once something in QEMU changes.

I think, in essence, all we care about is performing atomic changes that
*have to be atomic*, because something we add during a transaction
overlaps with something we remove during the same transaction. Not
necessarily all updates in a transaction!

My prototype essentially detects that scenario, but doesn't call new KVM
interfaces to deal with it.

I assume that once we take that into consideration, we can mostly assume
that any such atomic update (only of the regions that really have to be
part of an atomic update) won't involve more than a handful of memory
regions. We could add a sane KVM API limit.

And if we run into that API limit in QEMU, we can print a warning and do
it non-atomically.
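The "only what overlaps has to be atomic" observation above reduces, at
commit time, to an interval-intersection check between the batched removals
and additions. A rough sketch follows, with invented types and helper names;
a real version would walk the listener callbacks collected during the
transaction instead of plain arrays.

    #include <stdbool.h>
    #include <stdint.h>

    struct range {
        uint64_t gpa;       /* guest physical base address */
        uint64_t size;      /* length in bytes */
    };

    /* Half-open interval intersection test. */
    static bool ranges_overlap(struct range a, struct range b)
    {
        return a.gpa < b.gpa + b.size && b.gpa < a.gpa + a.size;
    }

    /*
     * An added region only has to go through the atomic path if it
     * intersects something removed in the same transaction; everything
     * else can keep using plain KVM_SET_USER_MEMORY_REGION.
     */
    static bool needs_atomic_update(struct range add,
                                    const struct range *removed,
                                    int nr_removed)
    {
        for (int i = 0; i < nr_removed; i++) {
            if (ranges_overlap(add, removed[i])) {
                return true;
            }
        }
        return false;
    }

Partitioning a transaction this way keeps the atomic subset down to the
handful of regions that truly conflict, which is what makes a small KVM API
limit plausible.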
On 13/10/2022 at 10:44, David Hildenbrand wrote:
> On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>>
>> On 29/09/2022 at 23:39, Sean Christopherson wrote:
>>> If we really want to provide a better experience for userspace, why
>>> not provide more primitives to address those specific use cases?
>>> E.g. fix KVM's historic wart of disallowing toggling of
>>> KVM_MEM_READONLY, and then add one or more ioctls to:
>>>
>>>  - Merge 2+ memory regions that are contiguous in GPA and HVA
>>>  - Split a memory region into 2+ memory regions
>>>  - Truncate a memory region
>>>  - Grow a memory region
>>
>> I looked again at the code and specifically at the use case that triggers
>> the crash in the bugzilla. I *think* (correct me if I am wrong) that the
>> only operation QEMU performs right now is grow/shrink.
>
> I remember that there were BUG reports where we'd actually split and run
> into that problem. I just don't have them at hand. I think they happened
> during early boot, when the OS re-configured some PCI thingies.

If you could point me to where this is happening, that would be nice. So
far I could not find or see any split/merge operation.

>> So *if* we want to go this way, we could start with a simple grow/shrink
>> API.
>>
>> Even though we need to consider that this could bring additional
>> complexity to QEMU: currently, DELETE+CREATE (grow/shrink) is not
>> performed one ioctl after the other (in my innocent fantasy I was
>> expecting to find two consecutive ioctls in the code), but in QEMU's
>> address_space_set_flatview(), which seems to first remove all regions
>> and then re-add them when changing flatviews:
>>
>>     address_space_update_topology_pass(as, old_view2, new_view, adding=false);
>>     address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>>
>> I don't think we can change this, as other listeners also rely on this
>> ordering, but we can still batch all callback requests like I currently
>> do and process them in kvm_commit(), figuring out there which operation
>> is which.
>>
>> In other words, we would have something like:
>>
>> region_del() --> DELETE memslot X -> add it to the queue of operations
>> region_del() --> DELETE memslot Y -> add it to the queue of operations
>> region_add() --> CREATE memslot X (size doubled) -> add it to the queue of operations
>> region_add() --> CREATE memslot Y (size halved) -> add it to the queue of operations
>> ...
>> commit() --> scan the queue and figure out what to do -> GROW X (+size),
>> SHRINK Y (-size) -> issue 2 ioctls, GROW and SHRINK.
>
> I communicated resizes (region_resize()) to the notifier in patch #3 of
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>
> You could use that independently of the remainder. It should be less
> controversial ;)
>
> But I think it only tackles part of the more generic problem (split/merge).

Yes, very useful! Using that patch, we would have a single place from
which to issue grow/shrink ioctls. I don't think we need to inhibit
ioctls, since the operation will be atomic (this time in the true meaning
of the word, since we don't need INVALIDATE).

>>> That would probably require more KVM code overall, but each operation
>>> would be more tightly bounded and thus simpler to define. And I think
>>> more precise APIs would provide other benefits, e.g. growing a region
>>> wouldn't require first deleting the current region, and so could avoid
>>> zapping SPTEs and destroying metadata. Merge, split, and truncate would
>>> have similar benefits, though they might be more difficult to realize
>>> in practice.
>>
>> So essentially grow would not require INVALIDATE. Makes sense, but would
>> it also work with shrink? I guess so, as the memslot is still present
>> (just shrunk), right?
>>
>> Paolo, would you be ok with this smaller API? Probably just starting
>> with grow and shrink first.
>>
>> I am not against either of the two approaches:
>> - my approach has the disadvantage that the list could be arbitrarily
>> long, and it is difficult to roll back the intermediate changes if
>> something breaks during the request processing (though this could be
>> simplified by making KVM exit or crash);
>>
>> - Sean's approach could put more burden on userspace, as we need to
>> figure out which operation is which. Also, from my understanding, split
>> and merge won't be really straightforward to implement, especially in
>> userspace.
>>
>> David, any concern from a userspace perspective on this "CISC" approach?
>
> In contrast to resizes in QEMU, which only affect a single memory
> region/slot, splitting/merging is harder to factor out and communicate
> to a notifier. As an alternative, we could handle it in the commit stage
> in the notifier itself, similar to what my prototype does, and figure
> out what needs to be done in there and how to call the proper KVM
> interface (and which interface to call).
>
> With virtio-mem (in the future) we might see merges of 2 slots into a
> single one, by closing a gap in-between them. In "theory" we could
> combine some updates into a single transaction. But it won't be 100s ...
>
> I think I'd prefer an API that doesn't require excessive ad-hoc
> extensions later once something in QEMU changes.
>
> I think, in essence, all we care about is performing atomic changes that
> *have to be atomic*, because something we add during a transaction
> overlaps with something we remove during the same transaction. Not
> necessarily all updates in a transaction!
>
> My prototype essentially detects that scenario, but doesn't call new KVM
> interfaces to deal with it.

With "prototype" I assume you mean the patch linked above
(region_resize), not the userspace-only proposal you sent initially,
right?

> I assume that once we take that into consideration, we can mostly assume
> that any such atomic update (only of the regions that really have to be
> part of an atomic update) won't involve more than a handful of memory
> regions. We could add a sane KVM API limit.
>
> And if we run into that API limit in QEMU, we can print a warning and do
> it non-atomically.

If we implement single operations (split/merge/grow/shrink), we don't
even need that limit. Except for merge, maybe.

Ok, if it's ok for you all, I can try to use David's patch and implement
some simple grow/shrink. Then we need to figure out where and when
exactly QEMU performs split and merge operations, and maybe implement
something similar to what you did in your proposal?

Thank you,
Emanuele
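If KVM gained a resize primitive, the region_resize() hook would give QEMU
exactly one call site for it. The sketch below invents
KVM_RESIZE_USER_MEMORY_REGION, its payload struct, and the ioctl number
purely for illustration; no such ioctl exists in KVM today, and the
region_resize() notifier is itself still an RFC.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Hypothetical payload; 0xff is a placeholder ioctl number. */
    struct kvm_resize_memory_region {
        uint32_t slot;
        uint32_t pad;
        uint64_t new_memory_size;   /* grow if larger, shrink if smaller */
    };
    #define KVM_RESIZE_USER_MEMORY_REGION \
        _IOW(KVMIO, 0xff, struct kvm_resize_memory_region)

    /* Would be called from the (proposed) region_resize() listener hook. */
    static int kvm_resize_slot(int vm_fd, uint32_t slot, uint64_t new_size)
    {
        struct kvm_resize_memory_region r = {
            .slot = slot,
            .new_memory_size = new_size,
        };

        return ioctl(vm_fd, KVM_RESIZE_USER_MEMORY_REGION, &r);
    }

The point of the single-ioctl shape is that there is no DELETE+CREATE
window, so vCPUs can never observe the slot missing from the active
memslot list.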
>> I remember that there were BUG reports where we'd actually split and run
>> into that problem. I just don't have them at hand. I think they happened
>> during early boot, when the OS re-configured some PCI thingies.
>
> If you could point me to where this is happening, that would be nice. So
> far I could not find or see any split/merge operation.

I remember some bugzilla, but it might be hard to find. Essentially, it
happened so early during boot that it wasn't really a problem so far (a
single running CPU triggers it, IIRC). But I might be messing up some
details.

>>>> That would probably require more KVM code overall, but each operation
>>>> would be more tightly bounded and thus simpler to define. And I think
>>>> more precise APIs would provide other benefits, e.g. growing a region
>>>> wouldn't require first deleting the current region, and so could avoid
>>>> zapping SPTEs and destroying metadata. Merge, split, and truncate would
>>>> have similar benefits, though they might be more difficult to realize
>>>> in practice.
>>>
>>> So essentially grow would not require INVALIDATE. Makes sense, but would
>>> it also work with shrink? I guess so, as the memslot is still present
>>> (just shrunk), right?
>>>
>>> Paolo, would you be ok with this smaller API? Probably just starting
>>> with grow and shrink first.
>>>
>>> I am not against either of the two approaches:
>>> - my approach has the disadvantage that the list could be arbitrarily
>>> long, and it is difficult to roll back the intermediate changes if
>>> something breaks during the request processing (though this could be
>>> simplified by making KVM exit or crash);
>>>
>>> - Sean's approach could put more burden on userspace, as we need to
>>> figure out which operation is which. Also, from my understanding, split
>>> and merge won't be really straightforward to implement, especially in
>>> userspace.
>>>
>>> David, any concern from a userspace perspective on this "CISC" approach?
>>
>> In contrast to resizes in QEMU, which only affect a single memory
>> region/slot, splitting/merging is harder to factor out and communicate
>> to a notifier. As an alternative, we could handle it in the commit stage
>> in the notifier itself, similar to what my prototype does, and figure
>> out what needs to be done in there and how to call the proper KVM
>> interface (and which interface to call).
>>
>> With virtio-mem (in the future) we might see merges of 2 slots into a
>> single one, by closing a gap in-between them. In "theory" we could
>> combine some updates into a single transaction. But it won't be 100s ...
>>
>> I think I'd prefer an API that doesn't require excessive ad-hoc
>> extensions later once something in QEMU changes.
>>
>> I think, in essence, all we care about is performing atomic changes that
>> *have to be atomic*, because something we add during a transaction
>> overlaps with something we remove during the same transaction. Not
>> necessarily all updates in a transaction!
>>
>> My prototype essentially detects that scenario, but doesn't call new KVM
>> interfaces to deal with it.
>
> With "prototype" I assume you mean the patch linked above
> (region_resize), not the userspace-only proposal you sent initially,
> right?

I meant, from the userspace-only proposal, the part where we collect all
add/remove callbacks in a list and, in the commit stage, try to detect
whether there is an overlap. That way we could isolate the memblocks that
really only need an atomic operation.

Anything that doesn't overlap can just go via the old interface. Not
saying that it wins a beauty prize, but we wouldn't have to adjust QEMU
core even more to teach it these special cases (though maybe we should,
and other notifiers might benefit from that in the long term).

>> I assume that once we take that into consideration, we can mostly assume
>> that any such atomic update (only of the regions that really have to be
>> part of an atomic update) won't involve more than a handful of memory
>> regions. We could add a sane KVM API limit.
>>
>> And if we run into that API limit in QEMU, we can print a warning and do
>> it non-atomically.
>
> If we implement single operations (split/merge/grow/shrink), we don't
> even need that limit. Except for merge, maybe.

Right, in theory you could have multiple splits/merges in a single commit
that all interact, like cutting 8 pieces out of an existing larger
memblock.

> Ok, if it's ok for you all, I can try to use David's patch and implement
> some simple grow/shrink. Then we need to figure out where and when
> exactly QEMU performs split and merge operations, and maybe implement
> something similar to what you did in your proposal?

Looking at grow/shrink first is certainly not a waste of time.
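The warn-and-fall-back behavior quoted above could look roughly like the
sketch below. The limit, the trimmed struct, and both apply helpers are
invented for illustration; the batched path stands in for whatever atomic
interface is eventually merged.

    #include <stdio.h>

    /* Assumed kernel-side cap on ops per atomic transaction (invented). */
    #define KVM_MAX_ATOMIC_MEMSLOT_OPS 8

    struct memslot_op {     /* trimmed copy of the earlier sketch's struct */
        int type;
        unsigned slot;
    };

    static void apply_batch_atomically(struct memslot_op *ops, int nr)
    {
        (void)ops; (void)nr; /* would issue the single batched ioctl */
    }

    static void apply_one(struct memslot_op *op)
    {
        (void)op;            /* would issue one KVM_SET_USER_MEMORY_REGION */
    }

    static void commit_transaction(struct memslot_op *ops, int nr)
    {
        if (nr <= KVM_MAX_ATOMIC_MEMSLOT_OPS) {
            apply_batch_atomically(ops, nr);
            return;
        }
        fprintf(stderr,
                "kvm: %d memslot updates exceed the atomic limit (%d), "
                "applying them non-atomically\n",
                nr, KVM_MAX_ATOMIC_MEMSLOT_OPS);
        for (int i = 0; i < nr; i++) {
            apply_one(&ops[i]);
        }
    }

Combined with the overlap detection sketched earlier, the fallback should
be rare in practice: only the conflicting subset of a transaction has to
fit under the limit, not the whole flatview update.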