Message ID: 20220819005601.198436-2-gshan@redhat.com (mailing list archive)
State: New
Series: KVM: arm64: Enable ring-based dirty memory tracking
On Fri, 19 Aug 2022 01:55:57 +0100, Gavin Shan <gshan@redhat.com> wrote: > > The ring-based dirty memory tracking has been available and enabled > on x86 for a while. The feature is beneficial when the number of > dirty pages is small in a checkpointing system or live migration > scenario. More details can be found from fb04a1eddb1a ("KVM: X86: > Implement ring-based dirty memory tracking"). > > This enables the ring-based dirty memory tracking on ARM64. It's > notable that no extra reserved ring entries are needed on ARM64 > because the huge pages are always split into base pages when page > dirty tracking is enabled. Can you please elaborate on this? Adding a per-CPU ring of course results in extra memory allocation, so there must be a subtle x86-specific detail that I'm not aware of... > > Signed-off-by: Gavin Shan <gshan@redhat.com> > --- > Documentation/virt/kvm/api.rst | 2 +- > arch/arm64/include/uapi/asm/kvm.h | 1 + > arch/arm64/kvm/Kconfig | 1 + > arch/arm64/kvm/arm.c | 8 ++++++++ > 4 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index abd7c32126ce..19fa1ac017ed 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. > 8.29 KVM_CAP_DIRTY_LOG_RING > --------------------------- > > -:Architectures: x86 > +:Architectures: x86, arm64 > :Parameters: args[0] - size of the dirty log ring > > KVM is capable of tracking dirty memory using ring buffers that are > diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h > index 3bb134355874..7e04b0b8d2b2 100644 > --- a/arch/arm64/include/uapi/asm/kvm.h > +++ b/arch/arm64/include/uapi/asm/kvm.h > @@ -43,6 +43,7 @@ > #define __KVM_HAVE_VCPU_EVENTS > > #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 > +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 For context, the documentation says: <quote> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] </quote> What is the reason for picking this particular value? > > #define KVM_REG_SIZE(id) \ > (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig > index 815cc118c675..0309b2d0f2da 100644 > --- a/arch/arm64/kvm/Kconfig > +++ b/arch/arm64/kvm/Kconfig > @@ -32,6 +32,7 @@ menuconfig KVM > select KVM_VFIO > select HAVE_KVM_EVENTFD > select HAVE_KVM_IRQFD > + select HAVE_KVM_DIRTY_RING > select HAVE_KVM_MSI > select HAVE_KVM_IRQCHIP > select HAVE_KVM_IRQ_ROUTING > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 986cee6fbc7f..3de6b9b39db7 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > if (!ret) > ret = 1; > > + /* Force vcpu exit if its dirty ring is soft-full */ > + if (unlikely(vcpu->kvm->dirty_ring_size && > + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) { > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > + trace_kvm_dirty_ring_exit(vcpu); > + ret = 0; > + } > + Why can't this be moved to kvm_vcpu_exit_request() instead? I would also very much like the check to be made a common helper with x86. A seemingly approach would be to make this a request on dirty log insertion, and avoid the whole "check the log size" on every run, which adds pointless overhead to unsuspecting users (aka everyone). Thanks, M.
Hi Marc, On 8/19/22 6:00 PM, Marc Zyngier wrote: > On Fri, 19 Aug 2022 01:55:57 +0100, > Gavin Shan <gshan@redhat.com> wrote: >> >> The ring-based dirty memory tracking has been available and enabled >> on x86 for a while. The feature is beneficial when the number of >> dirty pages is small in a checkpointing system or live migration >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86: >> Implement ring-based dirty memory tracking"). >> >> This enables the ring-based dirty memory tracking on ARM64. It's >> notable that no extra reserved ring entries are needed on ARM64 >> because the huge pages are always split into base pages when page >> dirty tracking is enabled. > > Can you please elaborate on this? Adding a per-CPU ring of course > results in extra memory allocation, so there must be a subtle > x86-specific detail that I'm not aware of... > Sure. I guess it's helpful to explain how it works in next revision. Something like below: This enables the ring-based dirty memory tracking on ARM64. The feature is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is pushed by host when page becomes dirty and pulled by userspace. A vcpu exit is forced when the ring buffer becomes full. The ring buffers on all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS. Yes, I think so. Adding a per-CPU ring results in extra memory allocation. However, it's avoiding synchronization among multiple vcpus when dirty pages happen on multiple vcpus. More discussion can be found from [1] [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/ (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC) >> >> Signed-off-by: Gavin Shan <gshan@redhat.com> >> --- >> Documentation/virt/kvm/api.rst | 2 +- >> arch/arm64/include/uapi/asm/kvm.h | 1 + >> arch/arm64/kvm/Kconfig | 1 + >> arch/arm64/kvm/arm.c | 8 ++++++++ >> 4 files changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst >> index abd7c32126ce..19fa1ac017ed 100644 >> --- a/Documentation/virt/kvm/api.rst >> +++ b/Documentation/virt/kvm/api.rst >> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. >> 8.29 KVM_CAP_DIRTY_LOG_RING >> --------------------------- >> >> -:Architectures: x86 >> +:Architectures: x86, arm64 >> :Parameters: args[0] - size of the dirty log ring >> >> KVM is capable of tracking dirty memory using ring buffers that are >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h >> index 3bb134355874..7e04b0b8d2b2 100644 >> --- a/arch/arm64/include/uapi/asm/kvm.h >> +++ b/arch/arm64/include/uapi/asm/kvm.h >> @@ -43,6 +43,7 @@ >> #define __KVM_HAVE_VCPU_EVENTS >> >> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 >> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 > > For context, the documentation says: > > <quote> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at > KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] > </quote> > > What is the reason for picking this particular value? > It's inherited from x86. I don't think it has to be this particular value. The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET. How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? 
The virtual area is cheap, I guess it's also nice to use x86's pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET. #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 2 >> >> #define KVM_REG_SIZE(id) \ >> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig >> index 815cc118c675..0309b2d0f2da 100644 >> --- a/arch/arm64/kvm/Kconfig >> +++ b/arch/arm64/kvm/Kconfig >> @@ -32,6 +32,7 @@ menuconfig KVM >> select KVM_VFIO >> select HAVE_KVM_EVENTFD >> select HAVE_KVM_IRQFD >> + select HAVE_KVM_DIRTY_RING >> select HAVE_KVM_MSI >> select HAVE_KVM_IRQCHIP >> select HAVE_KVM_IRQ_ROUTING >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c >> index 986cee6fbc7f..3de6b9b39db7 100644 >> --- a/arch/arm64/kvm/arm.c >> +++ b/arch/arm64/kvm/arm.c >> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) >> if (!ret) >> ret = 1; >> >> + /* Force vcpu exit if its dirty ring is soft-full */ >> + if (unlikely(vcpu->kvm->dirty_ring_size && >> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) { >> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; >> + trace_kvm_dirty_ring_exit(vcpu); >> + ret = 0; >> + } >> + > > Why can't this be moved to kvm_vcpu_exit_request() instead? I would > also very much like the check to be made a common helper with x86. > > A seemingly approach would be to make this a request on dirty log > insertion, and avoid the whole "check the log size" on every run, > which adds pointless overhead to unsuspecting users (aka everyone). > I though of having the check in kvm_vcpu_exit_request(). The various exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the highest priority and ARM64 is just to follow. I don't think it really matters. I will improve it accordingly in next revision: - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu) { struct kvm *kvm = vcpu->vcpu; struct kvm_dirty_ring *ring = &vcpu->dirty_ring; if (unlikely(kvm->dirty_ring_size && kvm_dirty_ring_used(ring) >= ring->soft_limit)) { vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; trace_kvm_dirty_ring_exit(vcpu); return true; } return false; } - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request(). Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. Otherwise, the dirty log in the ring buffer will be overwritten. I'm not sure if anything else I missed? Thanks, Gavin
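For readers following along, here is a minimal sketch of what this interface looks like from userspace: the capability is enabled per-VM (before any vcpu is created) with the ring size in args[0], and each vcpu's ring of struct kvm_dirty_gfn entries is then mmap'ed at KVM_DIRTY_LOG_PAGE_OFFSET pages into the vcpu fd. The helper names and the lack of error handling are assumptions of this sketch, not code from the series:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Enable the ring on the VM; must happen before any vcpu is created. */
static int enable_dirty_ring(int vm_fd, __u32 ring_bytes)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DIRTY_LOG_RING,
                .args[0] = ring_bytes,  /* power-of-two number of bytes */
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Map one vcpu's dirty ring from the vcpu "file". */
static struct kvm_dirty_gfn *map_dirty_ring(int vcpu_fd, __u32 ring_bytes)
{
        long psz = sysconf(_SC_PAGESIZE);

        return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);
}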
Hi, Gavin, On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote: > > For context, the documentation says: > > > > <quote> > > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at > > KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] > > </quote> > > > > What is the reason for picking this particular value? > > > > It's inherited from x86. I don't think it has to be this particular value. > The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, > KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET. > > How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? > The virtual area is cheap, I guess it's also nice to use x86's > pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET. > > #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 > #define KVM_DIRTY_LOG_PAGE_OFFSET 2 It was chosen not to be continuous of previous used offset because it'll be the 1st vcpu region that can cover multiple & dynamic number of pages. I wanted to leave the 3-63 (x86 used offset 2 already) for small fields so they can be continuous, which looks a little bit cleaner. Currently how many pages it'll use depends on the size set by the user, though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which is a maximum of 1MB memory. So I think setting it to 2 is okay, as long as we keep the rest 1MB address space for the per-vcpu ring structure, so any new vcpu fields (even if only 1 page will be needed) need to be after that maximum size of the ring. Thanks,
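As a sanity check on the 1MB figure above: each entry is a 16-byte struct kvm_dirty_gfn, so the maximum of 65536 entries works out to 1 MiB, i.e. 256 pages of the vcpu file with 4 KiB pages. The constants below only restate that arithmetic and are not new API:

#include <linux/kvm.h>

/* struct kvm_dirty_gfn is { __u32 flags; __u32 slot; __u64 offset; } */
_Static_assert(sizeof(struct kvm_dirty_gfn) == 16, "16 bytes per entry");

#define DIRTY_RING_MAX_ENTRIES  65536   /* KVM_DIRTY_RING_MAX_ENTRIES         */
#define DIRTY_RING_MAX_BYTES    (DIRTY_RING_MAX_ENTRIES * 16UL)   /* 1 MiB    */
#define DIRTY_RING_MAX_PAGES    (DIRTY_RING_MAX_BYTES / 4096)     /* 256 pages */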
Hi Gavin, On Mon, 22 Aug 2022 02:58:20 +0100, Gavin Shan <gshan@redhat.com> wrote: > > Hi Marc, > > On 8/19/22 6:00 PM, Marc Zyngier wrote: > > On Fri, 19 Aug 2022 01:55:57 +0100, > > Gavin Shan <gshan@redhat.com> wrote: > >> > >> The ring-based dirty memory tracking has been available and enabled > >> on x86 for a while. The feature is beneficial when the number of > >> dirty pages is small in a checkpointing system or live migration > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86: > >> Implement ring-based dirty memory tracking"). > >> > >> This enables the ring-based dirty memory tracking on ARM64. It's > >> notable that no extra reserved ring entries are needed on ARM64 > >> because the huge pages are always split into base pages when page > >> dirty tracking is enabled. > > > > Can you please elaborate on this? Adding a per-CPU ring of course > > results in extra memory allocation, so there must be a subtle > > x86-specific detail that I'm not aware of... > > > > Sure. I guess it's helpful to explain how it works in next revision. > Something like below: > > This enables the ring-based dirty memory tracking on ARM64. The feature > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is > pushed by host when page becomes dirty and pulled by userspace. A vcpu > exit is forced when the ring buffer becomes full. The ring buffers on > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS. > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation. > However, it's avoiding synchronization among multiple vcpus when dirty > pages happen on multiple vcpus. More discussion can be found from [1] Oh, I totally buy the relaxation of the synchronisation (though I doubt this will have any visible effect until we have something like Oliver's patches to allow parallel faulting). But it is the "no extra reserved ring entries are needed on ARM64" argument that I don't get yet. > > [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/ > (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC) > > > >> > >> Signed-off-by: Gavin Shan <gshan@redhat.com> > >> --- > >> Documentation/virt/kvm/api.rst | 2 +- > >> arch/arm64/include/uapi/asm/kvm.h | 1 + > >> arch/arm64/kvm/Kconfig | 1 + > >> arch/arm64/kvm/arm.c | 8 ++++++++ > >> 4 files changed, 11 insertions(+), 1 deletion(-) > >> > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > >> index abd7c32126ce..19fa1ac017ed 100644 > >> --- a/Documentation/virt/kvm/api.rst > >> +++ b/Documentation/virt/kvm/api.rst > >> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. 
> >> 8.29 KVM_CAP_DIRTY_LOG_RING > >> --------------------------- > >> -:Architectures: x86 > >> +:Architectures: x86, arm64 > >> :Parameters: args[0] - size of the dirty log ring > >> KVM is capable of tracking dirty memory using ring buffers that > >> are > >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h > >> index 3bb134355874..7e04b0b8d2b2 100644 > >> --- a/arch/arm64/include/uapi/asm/kvm.h > >> +++ b/arch/arm64/include/uapi/asm/kvm.h > >> @@ -43,6 +43,7 @@ > >> #define __KVM_HAVE_VCPU_EVENTS > >> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 > >> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 > > > > For context, the documentation says: > > > > <quote> > > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at > > KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] > > </quote> > > > > What is the reason for picking this particular value? > > > > It's inherited from x86. I don't think it has to be this particular > value. The value is used to distinguish the region's owners like > kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and > KVM_DIRTY_LOG_PAGE_OFFSET. > > How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? > The virtual area is cheap, I guess it's also nice to use x86's > pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET. > > #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 > #define KVM_DIRTY_LOG_PAGE_OFFSET 2 Given that this is just an offset in the vcpu "file", I don't think it matters that much. 64 definitely allows for some struct vcpu growth, and it doesn't hurt to be compatible with x86 (for once...). > > >> #define KVM_REG_SIZE(id) > >> \ > >> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) > >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig > >> index 815cc118c675..0309b2d0f2da 100644 > >> --- a/arch/arm64/kvm/Kconfig > >> +++ b/arch/arm64/kvm/Kconfig > >> @@ -32,6 +32,7 @@ menuconfig KVM > >> select KVM_VFIO > >> select HAVE_KVM_EVENTFD > >> select HAVE_KVM_IRQFD > >> + select HAVE_KVM_DIRTY_RING > >> select HAVE_KVM_MSI > >> select HAVE_KVM_IRQCHIP > >> select HAVE_KVM_IRQ_ROUTING > >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > >> index 986cee6fbc7f..3de6b9b39db7 100644 > >> --- a/arch/arm64/kvm/arm.c > >> +++ b/arch/arm64/kvm/arm.c > >> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) > >> if (!ret) > >> ret = 1; > >> + /* Force vcpu exit if its dirty ring is soft-full */ > >> + if (unlikely(vcpu->kvm->dirty_ring_size && > >> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) { > >> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > >> + trace_kvm_dirty_ring_exit(vcpu); > >> + ret = 0; > >> + } > >> + > > > > Why can't this be moved to kvm_vcpu_exit_request() instead? I would > > also very much like the check to be made a common helper with x86. > > > > A seemingly approach would be to make this a request on dirty log > > insertion, and avoid the whole "check the log size" on every run, > > which adds pointless overhead to unsuspecting users (aka everyone). > > > > I though of having the check in kvm_vcpu_exit_request(). The various > exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the > highest priority and ARM64 is just to follow. I don't think it really > matters. 
I will improve it accordingly in next revision: > > - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c > > bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu) > { > struct kvm *kvm = vcpu->vcpu; > struct kvm_dirty_ring *ring = &vcpu->dirty_ring; > > if (unlikely(kvm->dirty_ring_size && > kvm_dirty_ring_used(ring) >= ring->soft_limit)) { > vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > trace_kvm_dirty_ring_exit(vcpu); > return true; > } > > return false; > } > > - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request(). > > Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. > Otherwise, the dirty log in the ring buffer will be overwritten. I'm not > sure if anything else I missed? I'm fine with the above, but what I really wanted is a request from the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request. Something like this (which obviously doesn't compile, but you'll get the idea): diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..0b41feb6fb7d 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) return kvm_vcpu_suspend(vcpu); + + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) { + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; + trace_kvm_dirty_ring_exit(vcpu); + return 0; + } } return 1; diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c index f4c2a6eb1666..08b2f01164fa 100644 --- a/virt/kvm/dirty_ring.c +++ b/virt/kvm/dirty_ring.c @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) { + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring); struct kvm_dirty_gfn *entry; /* It should never get full */ @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) kvm_dirty_gfn_set_dirtied(entry); ring->dirty_index++; trace_kvm_dirty_ring_push(ring, slot, offset); + + if (kvm_dirty_ring_soft_full(vcpu)) + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); } struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset) Thanks, M.
Hi Peter, On 8/23/22 4:55 AM, Peter Xu wrote: > On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote: >>> For context, the documentation says: >>> >>> <quote> >>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at >>> KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] >>> </quote> >>> >>> What is the reason for picking this particular value? >>> >> >> It's inherited from x86. I don't think it has to be this particular value. >> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET, >> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET. >> >> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? >> The virtual area is cheap, I guess it's also nice to use x86's >> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET. >> >> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 >> #define KVM_DIRTY_LOG_PAGE_OFFSET 2 > > It was chosen not to be continuous of previous used offset because it'll be > the 1st vcpu region that can cover multiple & dynamic number of pages. I > wanted to leave the 3-63 (x86 used offset 2 already) for small fields so > they can be continuous, which looks a little bit cleaner. > > Currently how many pages it'll use depends on the size set by the user, > though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which is a > maximum of 1MB memory. > > So I think setting it to 2 is okay, as long as we keep the rest 1MB address > space for the per-vcpu ring structure, so any new vcpu fields (even if only > 1 page will be needed) need to be after that maximum size of the ring. > Thanks for the details. I'd rather keep the current layout since the virtual area is really cheap, so let's keep it as it is if Marc doesn't object. That way, any new per-vcpu area, which usually needs only a single page, can go right after KVM_COALESCED_MMIO_PAGE_OFFSET, while the ring stays at: #define KVM_DIRTY_LOG_PAGE_OFFSET 64 Thanks, Gavin
Hi Marc, On 8/23/22 7:42 AM, Marc Zyngier wrote: > On Mon, 22 Aug 2022 02:58:20 +0100, > Gavin Shan <gshan@redhat.com> wrote: >> On 8/19/22 6:00 PM, Marc Zyngier wrote: >>> On Fri, 19 Aug 2022 01:55:57 +0100, >>> Gavin Shan <gshan@redhat.com> wrote: >>>> >>>> The ring-based dirty memory tracking has been available and enabled >>>> on x86 for a while. The feature is beneficial when the number of >>>> dirty pages is small in a checkpointing system or live migration >>>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86: >>>> Implement ring-based dirty memory tracking"). >>>> >>>> This enables the ring-based dirty memory tracking on ARM64. It's >>>> notable that no extra reserved ring entries are needed on ARM64 >>>> because the huge pages are always split into base pages when page >>>> dirty tracking is enabled. >>> >>> Can you please elaborate on this? Adding a per-CPU ring of course >>> results in extra memory allocation, so there must be a subtle >>> x86-specific detail that I'm not aware of... >>> >> >> Sure. I guess it's helpful to explain how it works in next revision. >> Something like below: >> >> This enables the ring-based dirty memory tracking on ARM64. The feature >> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by >> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and >> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is >> pushed by host when page becomes dirty and pulled by userspace. A vcpu >> exit is forced when the ring buffer becomes full. The ring buffers on >> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS. >> >> Yes, I think so. Adding a per-CPU ring results in extra memory allocation. >> However, it's avoiding synchronization among multiple vcpus when dirty >> pages happen on multiple vcpus. More discussion can be found from [1] > > Oh, I totally buy the relaxation of the synchronisation (though I > doubt this will have any visible effect until we have something like > Oliver's patches to allow parallel faulting). > > But it is the "no extra reserved ring entries are needed on ARM64" > argument that I don't get yet. > Ok. The extra reserved ring entries are x86 specific. When x86's PML (Page Modification Logging) hardware capability is enabled, the vcpu exits due to full PML buffer, which is 512 entries. All the information in PML buffer is pushed to the dirty ring buffer in one shoot. To avoid overrunning the dirty ring buffer, there are 512 entries are reserved. === include/linux/kvm_host.h #define KVM_DIRTY_RING_RSVD_ENTRIES 64 // fixed and reserved ring entries === virt/kvm/dirty_ring.c int __weak kvm_cpu_dirty_log_size(void) { return 0; } u32 kvm_dirty_ring_get_rsvd_entries(void) { return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size(); } === arch/x86/kvm/mmu/mmu.c int kvm_cpu_dirty_log_size(void) { return kvm_x86_ops.cpu_dirty_log_size; // Set to 512 when PML is enabled } kvm_cpu_dirty_log_size() isn't be overrided by ARM64, meaning it returns zero on ARM64. On x86, it returns 512 when PML is enabled. >> >> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/ >> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. 
UTC) >> >> >>>> >>>> Signed-off-by: Gavin Shan <gshan@redhat.com> >>>> --- >>>> Documentation/virt/kvm/api.rst | 2 +- >>>> arch/arm64/include/uapi/asm/kvm.h | 1 + >>>> arch/arm64/kvm/Kconfig | 1 + >>>> arch/arm64/kvm/arm.c | 8 ++++++++ >>>> 4 files changed, 11 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst >>>> index abd7c32126ce..19fa1ac017ed 100644 >>>> --- a/Documentation/virt/kvm/api.rst >>>> +++ b/Documentation/virt/kvm/api.rst >>>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf. >>>> 8.29 KVM_CAP_DIRTY_LOG_RING >>>> --------------------------- >>>> -:Architectures: x86 >>>> +:Architectures: x86, arm64 >>>> :Parameters: args[0] - size of the dirty log ring >>>> KVM is capable of tracking dirty memory using ring buffers that >>>> are >>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h >>>> index 3bb134355874..7e04b0b8d2b2 100644 >>>> --- a/arch/arm64/include/uapi/asm/kvm.h >>>> +++ b/arch/arm64/include/uapi/asm/kvm.h >>>> @@ -43,6 +43,7 @@ >>>> #define __KVM_HAVE_VCPU_EVENTS >>>> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 >>>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 >>> >>> For context, the documentation says: >>> >>> <quote> >>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at >>> KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...] >>> </quote> >>> >>> What is the reason for picking this particular value? >>> >> >> It's inherited from x86. I don't think it has to be this particular >> value. The value is used to distinguish the region's owners like >> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and >> KVM_DIRTY_LOG_PAGE_OFFSET. >> >> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision? >> The virtual area is cheap, I guess it's also nice to use x86's >> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET. >> >> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 >> #define KVM_DIRTY_LOG_PAGE_OFFSET 2 > > Given that this is just an offset in the vcpu "file", I don't think it > matters that much. 64 definitely allows for some struct vcpu growth, > and it doesn't hurt to be compatible with x86 (for once...). > Sure, thanks. I think it'd better to have same pattern as x86 either. >> >>>> #define KVM_REG_SIZE(id) >>>> \ >>>> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) >>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig >>>> index 815cc118c675..0309b2d0f2da 100644 >>>> --- a/arch/arm64/kvm/Kconfig >>>> +++ b/arch/arm64/kvm/Kconfig >>>> @@ -32,6 +32,7 @@ menuconfig KVM >>>> select KVM_VFIO >>>> select HAVE_KVM_EVENTFD >>>> select HAVE_KVM_IRQFD >>>> + select HAVE_KVM_DIRTY_RING >>>> select HAVE_KVM_MSI >>>> select HAVE_KVM_IRQCHIP >>>> select HAVE_KVM_IRQ_ROUTING >>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c >>>> index 986cee6fbc7f..3de6b9b39db7 100644 >>>> --- a/arch/arm64/kvm/arm.c >>>> +++ b/arch/arm64/kvm/arm.c >>>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) >>>> if (!ret) >>>> ret = 1; >>>> + /* Force vcpu exit if its dirty ring is soft-full */ >>>> + if (unlikely(vcpu->kvm->dirty_ring_size && >>>> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) { >>>> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; >>>> + trace_kvm_dirty_ring_exit(vcpu); >>>> + ret = 0; >>>> + } >>>> + >>> >>> Why can't this be moved to kvm_vcpu_exit_request() instead? I would >>> also very much like the check to be made a common helper with x86. 
>>> >>> A seemingly approach would be to make this a request on dirty log >>> insertion, and avoid the whole "check the log size" on every run, >>> which adds pointless overhead to unsuspecting users (aka everyone). >>> >> >> I though of having the check in kvm_vcpu_exit_request(). The various >> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the >> highest priority and ARM64 is just to follow. I don't think it really >> matters. I will improve it accordingly in next revision: >> >> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c >> >> bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu) >> { >> struct kvm *kvm = vcpu->vcpu; >> struct kvm_dirty_ring *ring = &vcpu->dirty_ring; >> >> if (unlikely(kvm->dirty_ring_size && >> kvm_dirty_ring_used(ring) >= ring->soft_limit)) { >> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; >> trace_kvm_dirty_ring_exit(vcpu); >> return true; >> } >> >> return false; >> } >> >> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request(). >> >> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time. >> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not >> sure if anything else I missed? > > I'm fine with the above, but what I really wanted is a request from > the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request. > Something like this (which obviously doesn't compile, but you'll get > the idea): > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 986cee6fbc7f..0b41feb6fb7d 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > return kvm_vcpu_suspend(vcpu); > + > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) { > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > + trace_kvm_dirty_ring_exit(vcpu); > + return 0; > + } > } > > return 1; > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > index f4c2a6eb1666..08b2f01164fa 100644 > --- a/virt/kvm/dirty_ring.c > +++ b/virt/kvm/dirty_ring.c > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > { > + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring); > struct kvm_dirty_gfn *entry; > > /* It should never get full */ > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > kvm_dirty_gfn_set_dirtied(entry); > ring->dirty_index++; > trace_kvm_dirty_ring_push(ring, slot, offset); > + > + if (kvm_dirty_ring_soft_full(vcpu)) > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > } > > struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset) > Ok, thanks for the details, Marc. I will adopt your code in next revision :) Thanks, Gavin
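As a note for anyone trying the diff above: Marc flags it as a non-compiling sketch, and roughly two more pieces would be needed. Both snippets below are assumptions about the final shape rather than code from the series (in particular the request number is made up), and the helper deliberately uses vcpu->kvm rather than the vcpu->vcpu slip in the earlier draft:

/* arch/arm64/include/asm/kvm_host.h: a free arch request number (hypothetical) */
#define KVM_REQ_RING_SOFT_FULL  KVM_ARCH_REQ(7)

/* virt/kvm/dirty_ring.c: one possible vcpu-based helper */
bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
{
        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

        return vcpu->kvm->dirty_ring_size &&
               kvm_dirty_ring_used(ring) >= ring->soft_limit;
}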
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote: > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index 986cee6fbc7f..0b41feb6fb7d 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > return kvm_vcpu_suspend(vcpu); > > + > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) { > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > + trace_kvm_dirty_ring_exit(vcpu); > > + return 0; > > + } > > } > > return 1; > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > > index f4c2a6eb1666..08b2f01164fa 100644 > > --- a/virt/kvm/dirty_ring.c > > +++ b/virt/kvm/dirty_ring.c > > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > > { > > + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring); > > struct kvm_dirty_gfn *entry; > > /* It should never get full */ > > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > > kvm_dirty_gfn_set_dirtied(entry); > > ring->dirty_index++; > > trace_kvm_dirty_ring_push(ring, slot, offset); > > + > > + if (kvm_dirty_ring_soft_full(vcpu)) > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > } > > struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset) > > > > Ok, thanks for the details, Marc. I will adopt your code in next revision :) Note that there can be a slight difference with the old/new code, in that an (especially malicious) userapp can logically ignore the DIRTY_RING_FULL vmexit and keep kicking VCPU_RUN with the new code. Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code until the next dirty pfn being pushed to the ring, then it'll request ring full exit again. Each time it exits the ring grows 1. At last iiuc it can easily hit the ring full and trigger the warning at the entry of kvm_dirty_ring_push(): /* It should never get full */ WARN_ON_ONCE(kvm_dirty_ring_full(ring)); We did that because kvm_dirty_ring_push() was previously designed to not be able to fail at all (e.g., in the old bitmap world we never will fail too). We can't because we can't lose any dirty page or migration could silently fail too (consider when we do user exit due to ring full and migration just completed; there could be unsynced pages on src/dst). So even though the old approach will need to read kvm->dirty_ring_size for every entrance which is a pity, it will avoid issue above. Side note: for x86 the dirty ring check was put at the entrance not because it needs to be the highest priority - it should really be the same when check kvm requests. It's just that it'll be the fastest way to fail properly if needed before loading mmu, disabling preemption/irq, etc. Thanks,
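To make the soft-full accounting Peter describes concrete: the ring is sized by userspace, the soft limit sits kvm_dirty_ring_get_rsvd_entries() below the hard size, and that gap is what absorbs pushes which cannot fail (the PML flush case on x86, nothing on arm64). The allocation path below is abridged from virt/kvm/dirty_ring.c from memory, so treat it as a sketch rather than a verbatim quote:

int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
{
        ring->dirty_gfns = vzalloc(size * sizeof(struct kvm_dirty_gfn));
        if (!ring->dirty_gfns)
                return -ENOMEM;

        ring->size = size;
        /*
         * KVM_EXIT_DIRTY_RING_FULL is signalled once 'soft_limit' entries
         * are in use; the reserved tail keeps kvm_dirty_ring_push() from
         * ever hitting the hard limit before userspace exits to harvest.
         */
        ring->soft_limit = size - kvm_dirty_ring_get_rsvd_entries();
        ring->dirty_index = 0;
        ring->reset_index = 0;
        ring->index = index;

        return 0;
}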
On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote: > Hi Gavin, > > On Mon, 22 Aug 2022 02:58:20 +0100, > Gavin Shan <gshan@redhat.com> wrote: > > > > Hi Marc, > > > > On 8/19/22 6:00 PM, Marc Zyngier wrote: > > > On Fri, 19 Aug 2022 01:55:57 +0100, > > > Gavin Shan <gshan@redhat.com> wrote: > > >> > > >> The ring-based dirty memory tracking has been available and enabled > > >> on x86 for a while. The feature is beneficial when the number of > > >> dirty pages is small in a checkpointing system or live migration > > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86: > > >> Implement ring-based dirty memory tracking"). > > >> > > >> This enables the ring-based dirty memory tracking on ARM64. It's > > >> notable that no extra reserved ring entries are needed on ARM64 > > >> because the huge pages are always split into base pages when page > > >> dirty tracking is enabled. > > > > > > Can you please elaborate on this? Adding a per-CPU ring of course > > > results in extra memory allocation, so there must be a subtle > > > x86-specific detail that I'm not aware of... > > > > > > > Sure. I guess it's helpful to explain how it works in next revision. > > Something like below: > > > > This enables the ring-based dirty memory tracking on ARM64. The feature > > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by > > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and > > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is > > pushed by host when page becomes dirty and pulled by userspace. A vcpu > > exit is forced when the ring buffer becomes full. The ring buffers on > > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS. > > > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation. > > However, it's avoiding synchronization among multiple vcpus when dirty > > pages happen on multiple vcpus. More discussion can be found from [1] > > Oh, I totally buy the relaxation of the synchronisation (though I > doubt this will have any visible effect until we have something like > Oliver's patches to allow parallel faulting). > Heh, yeah I need to get that out the door. I'll also note that Gavin's changes are still relevant without that series, as we do write unprotect in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging"). -- Thanks, Oliver
On Tue, 23 Aug 2022 14:58:19 +0100, Peter Xu <peterx@redhat.com> wrote: > > On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote: > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > index 986cee6fbc7f..0b41feb6fb7d 100644 > > > --- a/arch/arm64/kvm/arm.c > > > +++ b/arch/arm64/kvm/arm.c > > > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > > return kvm_vcpu_suspend(vcpu); > > > + > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) { > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > > + trace_kvm_dirty_ring_exit(vcpu); > > > + return 0; > > > + } > > > } > > > return 1; > > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > > > index f4c2a6eb1666..08b2f01164fa 100644 > > > --- a/virt/kvm/dirty_ring.c > > > +++ b/virt/kvm/dirty_ring.c > > > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) > > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > > > { > > > + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring); > > > struct kvm_dirty_gfn *entry; > > > /* It should never get full */ > > > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > > > kvm_dirty_gfn_set_dirtied(entry); > > > ring->dirty_index++; > > > trace_kvm_dirty_ring_push(ring, slot, offset); > > > + > > > + if (kvm_dirty_ring_soft_full(vcpu)) > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > > } > > > struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset) > > > > > > > Ok, thanks for the details, Marc. I will adopt your code in next revision :) > > Note that there can be a slight difference with the old/new code, in that > an (especially malicious) userapp can logically ignore the DIRTY_RING_FULL > vmexit and keep kicking VCPU_RUN with the new code. > > Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code > until the next dirty pfn being pushed to the ring, then it'll request ring > full exit again. > > Each time it exits the ring grows 1. > > At last iiuc it can easily hit the ring full and trigger the warning at the > entry of kvm_dirty_ring_push(): > > /* It should never get full */ > WARN_ON_ONCE(kvm_dirty_ring_full(ring)); Hmmm, yes. Well spotted. > We did that because kvm_dirty_ring_push() was previously designed to not be > able to fail at all (e.g., in the old bitmap world we never will fail too). > We can't because we can't lose any dirty page or migration could silently > fail too (consider when we do user exit due to ring full and migration just > completed; there could be unsynced pages on src/dst). > > So even though the old approach will need to read kvm->dirty_ring_size for > every entrance which is a pity, it will avoid issue above. I don't think we really need this check on the hot path. All we need is to make the request sticky until userspace gets their act together and consumes elements in the ring. 
Something like: diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 986cee6fbc7f..e8ed5e1af159 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) return kvm_vcpu_suspend(vcpu); + + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && + kvm_dirty_ring_soft_full(vcpu)) { + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; + trace_kvm_dirty_ring_exit(vcpu); + return 0; + } } return 1; However, I'm a bit concerned by the reset side of things. It iterates over the vcpus and expects the view of each ring to be consistent, even if userspace is hacking at it from another CPU. For example, I can't see what guarantees that the kernel observes the writes from userspace in the order they are being performed (the documentation provides no requirements other than "it must collect the dirty GFNs in sequence", which doesn't mean much from an ordering perspective). I can see that working on a strongly ordered architecture, but on something as relaxed as ARM, the CPUs may^Wwill aggressively reorder stuff that isn't explicitly ordered. I have the feeling that a CAS operation on both sides would be enough, but someone who actually understands how this works should have a look... Thanks, M.
On Tue, 23 Aug 2022 15:44:42 +0100, Oliver Upton <oliver.upton@linux.dev> wrote: > > On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote: > > Hi Gavin, > > > > On Mon, 22 Aug 2022 02:58:20 +0100, > > Gavin Shan <gshan@redhat.com> wrote: > > > > > > Hi Marc, > > > > > > On 8/19/22 6:00 PM, Marc Zyngier wrote: > > > > On Fri, 19 Aug 2022 01:55:57 +0100, > > > > Gavin Shan <gshan@redhat.com> wrote: > > > >> > > > >> The ring-based dirty memory tracking has been available and enabled > > > >> on x86 for a while. The feature is beneficial when the number of > > > >> dirty pages is small in a checkpointing system or live migration > > > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86: > > > >> Implement ring-based dirty memory tracking"). > > > >> > > > >> This enables the ring-based dirty memory tracking on ARM64. It's > > > >> notable that no extra reserved ring entries are needed on ARM64 > > > >> because the huge pages are always split into base pages when page > > > >> dirty tracking is enabled. > > > > > > > > Can you please elaborate on this? Adding a per-CPU ring of course > > > > results in extra memory allocation, so there must be a subtle > > > > x86-specific detail that I'm not aware of... > > > > > > > > > > Sure. I guess it's helpful to explain how it works in next revision. > > > Something like below: > > > > > > This enables the ring-based dirty memory tracking on ARM64. The feature > > > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by > > > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and > > > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is > > > pushed by host when page becomes dirty and pulled by userspace. A vcpu > > > exit is forced when the ring buffer becomes full. The ring buffers on > > > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS. > > > > > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation. > > > However, it's avoiding synchronization among multiple vcpus when dirty > > > pages happen on multiple vcpus. More discussion can be found from [1] > > > > Oh, I totally buy the relaxation of the synchronisation (though I > > doubt this will have any visible effect until we have something like > > Oliver's patches to allow parallel faulting). > > > > Heh, yeah I need to get that out the door. I'll also note that Gavin's > changes are still relevant without that series, as we do write unprotect > in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: > Add fast path to handle permission relaxation during dirty logging"). Ah, true. Now if only someone could explain how the whole producer-consumer thing works without a trace of a barrier, that'd be great... Thanks, M.
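Part of the answer to the aside above is that the producer-to-consumer direction does carry a barrier: kvm_dirty_ring_push() issues an smp_wmb() between filling in the entry and setting the DIRTY flag that userspace polls. The function below is abridged from the existing virt/kvm/dirty_ring.c from memory (its tail also appears in Marc's quoted diff), so the exact comments may differ; the open question in the thread is really the consumer-to-producer (reset) direction:

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
        struct kvm_dirty_gfn *entry;

        /* It should never get full */
        WARN_ON_ONCE(kvm_dirty_ring_full(ring));

        entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

        entry->slot = slot;
        entry->offset = offset;

        /* Publish slot/offset before the DIRTY flag userspace polls on. */
        smp_wmb();
        kvm_dirty_gfn_set_dirtied(entry);
        ring->dirty_index++;
        trace_kvm_dirty_ring_push(ring, slot, offset);
}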
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote: > I don't think we really need this check on the hot path. All we need > is to make the request sticky until userspace gets their act together > and consumes elements in the ring. Something like: > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 986cee6fbc7f..e8ed5e1af159 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > return kvm_vcpu_suspend(vcpu); > + > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && > + kvm_dirty_ring_soft_full(vcpu)) { > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > + trace_kvm_dirty_ring_exit(vcpu); > + return 0; > + } > } > > return 1; Right, this seems working. We can also use kvm_test_request() here. > > > However, I'm a bit concerned by the reset side of things. It iterates > over the vcpus and expects the view of each ring to be consistent, > even if userspace is hacking at it from another CPU. For example, I > can't see what guarantees that the kernel observes the writes from > userspace in the order they are being performed (the documentation > provides no requirements other than "it must collect the dirty GFNs in > sequence", which doesn't mean much from an ordering perspective). > > I can see that working on a strongly ordered architecture, but on > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder > stuff that isn't explicitly ordered. I have the feeling that a CAS > operation on both sides would be enough, but someone who actually > understands how this works should have a look... I definitely don't think I 100% understand all the ordering things since they're complicated.. but my understanding is that the reset procedure didn't need memory barrier (unlike pushing, where we have explicit wmb), because we assumed the userapp is not hostile so logically it should only modify the flags which is a 32bit field, assuming atomicity guaranteed. IIRC we used to discuss similar questions on "what if the user is hostile and wants to hack the process by messing up with the ring", and our conclusion was as long as the process wouldn't mess up anything outside itself it should be okay. E.g. It should not be able to either cause the host to misfunction, or trigger kernel warnings in dmesg, etc.. Thanks,
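For reference, a sketch of the userspace half of steps (3)-(5) above, loosely modeled on what the x86 users of this ABI do; mark_page_dirty_in_vmm() and the 'fetch' cursor are hypothetical names for this example, and the barrier comment anticipates the ordering discussion that follows:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical VMM-side hook for this example. */
extern void mark_page_dirty_in_vmm(__u32 slot, __u64 offset);

static inline int dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
{
        return gfn->flags & KVM_DIRTY_GFN_F_DIRTY;
}

/* Harvest one vcpu's ring; 'fetch' tracks how far this thread has read. */
static void collect_and_reset_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                                   __u32 nentries, __u32 *fetch)
{
        while (1) {
                struct kvm_dirty_gfn *gfn = &ring[*fetch % nentries];

                if (!dirty_gfn_is_dirtied(gfn))         /* step (3) */
                        break;

                /*
                 * A read barrier belongs here on weakly ordered hosts,
                 * pairing with the kernel's publish barrier.
                 */
                mark_page_dirty_in_vmm(gfn->slot, gfn->offset);

                gfn->flags = KVM_DIRTY_GFN_F_RESET;     /* step (4) */
                (*fetch)++;
        }

        /* Step (5): ask the kernel to recycle the harvested entries. */
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
}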
On Tue, 23 Aug 2022 22:20:32 +0100, Peter Xu <peterx@redhat.com> wrote: > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote: > > I don't think we really need this check on the hot path. All we need > > is to make the request sticky until userspace gets their act together > > and consumes elements in the ring. Something like: > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index 986cee6fbc7f..e8ed5e1af159 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > return kvm_vcpu_suspend(vcpu); > > + > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && > > + kvm_dirty_ring_soft_full(vcpu)) { > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > + trace_kvm_dirty_ring_exit(vcpu); > > + return 0; > > + } > > } > > > > return 1; > > Right, this seems working. We can also use kvm_test_request() here. > > > > > > > However, I'm a bit concerned by the reset side of things. It iterates > > over the vcpus and expects the view of each ring to be consistent, > > even if userspace is hacking at it from another CPU. For example, I > > can't see what guarantees that the kernel observes the writes from > > userspace in the order they are being performed (the documentation > > provides no requirements other than "it must collect the dirty GFNs in > > sequence", which doesn't mean much from an ordering perspective). > > > > I can see that working on a strongly ordered architecture, but on > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder > > stuff that isn't explicitly ordered. I have the feeling that a CAS > > operation on both sides would be enough, but someone who actually > > understands how this works should have a look... > > I definitely don't think I 100% understand all the ordering things since > they're complicated.. but my understanding is that the reset procedure > didn't need memory barrier (unlike pushing, where we have explicit wmb), > because we assumed the userapp is not hostile so logically it should only > modify the flags which is a 32bit field, assuming atomicity guaranteed. Atomicity doesn't guarantee ordering, unfortunately. Take the following example: CPU0 is changing a bunch of flags for GFNs A, B, C, D that exist in the ring in that order, and CPU1 performs an ioctl to reset the page state. CPU0: write_flag(A, KVM_DIRTY_GFN_F_RESET) write_flag(B, KVM_DIRTY_GFN_F_RESET) write_flag(C, KVM_DIRTY_GFN_F_RESET) write_flag(D, KVM_DIRTY_GFN_F_RESET) [...] CPU1: ioctl(KVM_RESET_DIRTY_RINGS) Since CPU0 writes do not have any ordering, CPU1 can observe the writes in a sequence that have nothing to do with program order, and could for example observe that GFN A and D have been reset, but not B and C. This in turn breaks the logic in the reset code (B, C, and D don't get reset), despite userspace having followed the spec to the letter. If each was a store-release (which is the case on x86), it wouldn't be a problem, but nothing calls it in the documentation. Maybe that's not a big deal if it is expected that each CPU will issue a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own writes. But expecting this to work across CPUs without any barrier is wishful thinking. 
> IIRC we used to discuss similar questions on "what if the user is hostile > and wants to hack the process by messing up with the ring", and our > conclusion was as long as the process wouldn't mess up anything outside > itself it should be okay. E.g. It should not be able to either cause the > host to misfunction, or trigger kernel warnings in dmesg, etc.. I'm not even discussing safety here. I'm purely discussing the interactions between userspace and kernel based on the documentation and the existing kernel code. Thanks, M.
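To see why the reset logic "breaks" in the scenario above, it helps to look at the shape of the reset walk: it recycles entries in ring order and stops at the first one it does not observe as harvested, so a not-yet-visible RESET on B stalls C and D as well. The loop below is abridged from virt/kvm/dirty_ring.c from memory and elides the slot/offset coalescing, so take it as a sketch:

int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
{
        struct kvm_dirty_gfn *entry;
        int count = 0;

        while (true) {
                entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];

                /*
                 * A plain (unordered) read: if userspace's RESET write to
                 * GFN B is not visible yet, the scan stops here even though
                 * C and D may already appear harvested.
                 */
                if (!kvm_dirty_gfn_harvested(entry))
                        break;

                /* ... record slot/offset, then free the entry ... */
                kvm_dirty_gfn_set_invalid(entry);
                ring->reset_index++;
                count++;
        }
        /* ... clear dirty state for the collected ranges ... */

        return count;
}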
On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote: > On Tue, 23 Aug 2022 22:20:32 +0100, > Peter Xu <peterx@redhat.com> wrote: > > > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote: > > > I don't think we really need this check on the hot path. All we need > > > is to make the request sticky until userspace gets their act together > > > and consumes elements in the ring. Something like: > > > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > index 986cee6fbc7f..e8ed5e1af159 100644 > > > --- a/arch/arm64/kvm/arm.c > > > +++ b/arch/arm64/kvm/arm.c > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > > return kvm_vcpu_suspend(vcpu); > > > + > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && > > > + kvm_dirty_ring_soft_full(vcpu)) { > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > > + trace_kvm_dirty_ring_exit(vcpu); > > > + return 0; > > > + } > > > } > > > > > > return 1; > > > > Right, this seems working. We can also use kvm_test_request() here. > > > > > > > > > > > However, I'm a bit concerned by the reset side of things. It iterates > > > over the vcpus and expects the view of each ring to be consistent, > > > even if userspace is hacking at it from another CPU. For example, I > > > can't see what guarantees that the kernel observes the writes from > > > userspace in the order they are being performed (the documentation > > > provides no requirements other than "it must collect the dirty GFNs in > > > sequence", which doesn't mean much from an ordering perspective). > > > > > > I can see that working on a strongly ordered architecture, but on > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder > > > stuff that isn't explicitly ordered. I have the feeling that a CAS > > > operation on both sides would be enough, but someone who actually > > > understands how this works should have a look... > > > > I definitely don't think I 100% understand all the ordering things since > > they're complicated.. but my understanding is that the reset procedure > > didn't need memory barrier (unlike pushing, where we have explicit wmb), > > because we assumed the userapp is not hostile so logically it should only > > modify the flags which is a 32bit field, assuming atomicity guaranteed. > > Atomicity doesn't guarantee ordering, unfortunately. Right, sorry to be misleading. The "atomicity" part I was trying to say the kernel will always see consistent update on the fields. The ordering should also be guaranteed, because things must happen with below sequence: (1) kernel publish dirty GFN data (slot, offset) (2) kernel publish dirty GFN flag (set to DIRTY) (3) user sees DIRTY, collects (slots, offset) (4) user sets it to RESET (5) kernel reads RESET So the ordering of single-entry is guaranteed in that when (5) happens it must be after stablized (1+2). > Take the > following example: CPU0 is changing a bunch of flags for GFNs A, B, C, > D that exist in the ring in that order, and CPU1 performs an ioctl to > reset the page state. > > CPU0: > write_flag(A, KVM_DIRTY_GFN_F_RESET) > write_flag(B, KVM_DIRTY_GFN_F_RESET) > write_flag(C, KVM_DIRTY_GFN_F_RESET) > write_flag(D, KVM_DIRTY_GFN_F_RESET) > [...] 
> > CPU1: > ioctl(KVM_RESET_DIRTY_RINGS) > > Since CPU0 writes do not have any ordering, CPU1 can observe the > writes in a sequence that have nothing to do with program order, and > could for example observe that GFN A and D have been reset, but not B > and C. This in turn breaks the logic in the reset code (B, C, and D > don't get reset), despite userspace having followed the spec to the > letter. If each was a store-release (which is the case on x86), it > wouldn't be a problem, but nothing calls it in the documentation. > > Maybe that's not a big deal if it is expected that each CPU will issue > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own > writes. But expecting this to work across CPUs without any barrier is > wishful thinking. I see what you meant... Firstly, I'm actually curious whether that will really happen if the gfns are collected in something like a for loop: for (i = 0; i < N; i++) collect_dirty_gfn(ring, i); Since all the gfns to be read depend on the variable "i", IIUC no reordering should happen, but I'm not really sure, so this is more of a pure question. Besides, the other thing to mention is that I think it is fine if the RESET ioctl doesn't recycle all the gfns that were set to the reset state. Taking the above example of GFNs A-D: if, when the RESET ioctl runs, only A's and D's flag updates are visible, the ioctl will recycle gfn A but stop at gfn B, assuming B-D are not reset. But IMHO that's okay, because it only means we reset some of the gfns rather than all of them, and that is safe. The next ring-full event can come earlier because we recycled less, but that's still functionally safe to me. > > > IIRC we used to discuss similar questions on "what if the user is hostile > > and wants to hack the process by messing up with the ring", and our > > conclusion was as long as the process wouldn't mess up anything outside > > itself it should be okay. E.g. It should not be able to either cause the > > host to misfunction, or trigger kernel warnings in dmesg, etc.. > I'm not even discussing safety here. I'm purely discussing the > interactions between userspace and kernel based on the documentation > and the existing kernel code. I see. Please let me know if the above makes sense. Thanks!
On Wed, 24 Aug 2022 00:19:04 +0100, Peter Xu <peterx@redhat.com> wrote: > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote: > > On Tue, 23 Aug 2022 22:20:32 +0100, > > Peter Xu <peterx@redhat.com> wrote: > > > > > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote: > > > > I don't think we really need this check on the hot path. All we need > > > > is to make the request sticky until userspace gets their act together > > > > and consumes elements in the ring. Something like: > > > > > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > > index 986cee6fbc7f..e8ed5e1af159 100644 > > > > --- a/arch/arm64/kvm/arm.c > > > > +++ b/arch/arm64/kvm/arm.c > > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > > > > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > > > return kvm_vcpu_suspend(vcpu); > > > > + > > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && > > > > + kvm_dirty_ring_soft_full(vcpu)) { > > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > > > + trace_kvm_dirty_ring_exit(vcpu); > > > > + return 0; > > > > + } > > > > } > > > > > > > > return 1; > > > > > > Right, this seems working. We can also use kvm_test_request() here. > > > > > > > > > > > > > > > However, I'm a bit concerned by the reset side of things. It iterates > > > > over the vcpus and expects the view of each ring to be consistent, > > > > even if userspace is hacking at it from another CPU. For example, I > > > > can't see what guarantees that the kernel observes the writes from > > > > userspace in the order they are being performed (the documentation > > > > provides no requirements other than "it must collect the dirty GFNs in > > > > sequence", which doesn't mean much from an ordering perspective). > > > > > > > > I can see that working on a strongly ordered architecture, but on > > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder > > > > stuff that isn't explicitly ordered. I have the feeling that a CAS > > > > operation on both sides would be enough, but someone who actually > > > > understands how this works should have a look... > > > > > > I definitely don't think I 100% understand all the ordering things since > > > they're complicated.. but my understanding is that the reset procedure > > > didn't need memory barrier (unlike pushing, where we have explicit wmb), > > > because we assumed the userapp is not hostile so logically it should only > > > modify the flags which is a 32bit field, assuming atomicity guaranteed. > > > > Atomicity doesn't guarantee ordering, unfortunately. > > Right, sorry to be misleading. The "atomicity" part I was trying to say > the kernel will always see consistent update on the fields. > > The ordering should also be guaranteed, because things must happen with > below sequence: > > (1) kernel publish dirty GFN data (slot, offset) > (2) kernel publish dirty GFN flag (set to DIRTY) > (3) user sees DIRTY, collects (slots, offset) > (4) user sets it to RESET > (5) kernel reads RESET Maybe. Maybe not. The reset could well be sitting in the CPU write buffer for as long as it wants and not be seen by the kernel if the read occurs on another CPU. And that's the crucial bit: single-CPU is fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU on collection, and global on reset (this seems like a bad decision, but it is too late to fix this). 
> > So the ordering of single-entry is guaranteed in that when (5) happens it > must be after stablized (1+2). > > > Take the > > following example: CPU0 is changing a bunch of flags for GFNs A, B, C, > > D that exist in the ring in that order, and CPU1 performs an ioctl to > > reset the page state. > > > > CPU0: > > write_flag(A, KVM_DIRTY_GFN_F_RESET) > > write_flag(B, KVM_DIRTY_GFN_F_RESET) > > write_flag(C, KVM_DIRTY_GFN_F_RESET) > > write_flag(D, KVM_DIRTY_GFN_F_RESET) > > [...] > > > > CPU1: > > ioctl(KVM_RESET_DIRTY_RINGS) > > > > Since CPU0 writes do not have any ordering, CPU1 can observe the > > writes in a sequence that have nothing to do with program order, and > > could for example observe that GFN A and D have been reset, but not B > > and C. This in turn breaks the logic in the reset code (B, C, and D > > don't get reset), despite userspace having followed the spec to the > > letter. If each was a store-release (which is the case on x86), it > > wouldn't be a problem, but nothing calls it in the documentation. > > > > Maybe that's not a big deal if it is expected that each CPU will issue > > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own > > writes. But expecting this to work across CPUs without any barrier is > > wishful thinking. > > I see what you meant... > > Firstly I'm actually curious whether that'll really happen if the gfns are > collected in something like a for loop: > > for(i = 0; i < N; i++) > collect_dirty_gfn(ring, i); > > Because since all the gfps to be read will depend on variable "i", IIUC no > reordering should happen, but I'm not really sure, so more of a pure > question. 'i' has no influence on the write ordering. Each write targets a different address, there is no inter-write dependencies (this concept doesn't exist other than for writes to the same address), so they can be reordered at will. If you want a proof of this, head to http://diy.inria.fr/www/ and run the MP.litmus test (which conveniently gives you a reduction of this problem) on both the x86 and AArch64 models. You will see that the reordering isn't allowed on x86, but definitely allowed on arm64. > Besides, the other thing to mention is that I think it is fine the RESET > ioctl didn't recycle all the gfns got set to reset state. Taking above > example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags > are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D > are not reset. But IMHO it's okay because it means we reset partial of the > gfns not all of them, and it's safe to do so. It means the next ring full > event can come earlier because we recycled less, but that's functionally > safe to me. It may be safe, but it isn't what the userspace API promises. In other words, without further straightening of the API, this doesn't work as expected on relaxed memory architectures. So before this gets enabled on arm64, this whole ordering issue must be addressed. Thanks, M.
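For readers who want to see the reduction Marc points at without running the herd tools, here is a minimal standalone sketch of the same message-passing pattern, written with C11 atomics and pthreads. It is not KVM code and every name in it is invented for the illustration: with a relaxed store to the flag, an arm64 CPU may make the flag visible before the payload; turning the flag store into a release and the flag load into an acquire rules that out, which is the behaviour a store-release/load-acquire pairing on the ring entries would provide.

/* mp_sketch.c -- illustration only; not part of KVM or this series. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;            /* stands in for a dirty entry's slot/offset */
static atomic_int flag;        /* stands in for the entry's flags word      */

static void *writer(void *arg)
{
        payload = 42;          /* plain store */
        /*
         * A relaxed store of the flag would be allowed to become visible
         * before the payload on arm64 (the MP.litmus outcome).  The
         * release store, paired with the acquire load in reader(),
         * forbids that: once the flag is seen, the payload is seen too.
         */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
}

static void *reader(void *arg)
{
        /* Pairs with the release store above. */
        while (!atomic_load_explicit(&flag, memory_order_acquire))
                ;
        printf("payload = %d\n", payload);   /* always 42 with acq/rel */
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
}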
On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote: > On Wed, 24 Aug 2022 00:19:04 +0100, > Peter Xu <peterx@redhat.com> wrote: > > > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote: > > > On Tue, 23 Aug 2022 22:20:32 +0100, > > > Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote: > > > > > I don't think we really need this check on the hot path. All we need > > > > > is to make the request sticky until userspace gets their act together > > > > > and consumes elements in the ring. Something like: > > > > > > > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > > > index 986cee6fbc7f..e8ed5e1af159 100644 > > > > > --- a/arch/arm64/kvm/arm.c > > > > > +++ b/arch/arm64/kvm/arm.c > > > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu) > > > > > > > > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) > > > > > return kvm_vcpu_suspend(vcpu); > > > > > + > > > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) && > > > > > + kvm_dirty_ring_soft_full(vcpu)) { > > > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu); > > > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL; > > > > > + trace_kvm_dirty_ring_exit(vcpu); > > > > > + return 0; > > > > > + } > > > > > } > > > > > > > > > > return 1; > > > > > > > > Right, this seems working. We can also use kvm_test_request() here. > > > > > > > > > > > > > > > > > > > However, I'm a bit concerned by the reset side of things. It iterates > > > > > over the vcpus and expects the view of each ring to be consistent, > > > > > even if userspace is hacking at it from another CPU. For example, I > > > > > can't see what guarantees that the kernel observes the writes from > > > > > userspace in the order they are being performed (the documentation > > > > > provides no requirements other than "it must collect the dirty GFNs in > > > > > sequence", which doesn't mean much from an ordering perspective). > > > > > > > > > > I can see that working on a strongly ordered architecture, but on > > > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder > > > > > stuff that isn't explicitly ordered. I have the feeling that a CAS > > > > > operation on both sides would be enough, but someone who actually > > > > > understands how this works should have a look... > > > > > > > > I definitely don't think I 100% understand all the ordering things since > > > > they're complicated.. but my understanding is that the reset procedure > > > > didn't need memory barrier (unlike pushing, where we have explicit wmb), > > > > because we assumed the userapp is not hostile so logically it should only > > > > modify the flags which is a 32bit field, assuming atomicity guaranteed. > > > > > > Atomicity doesn't guarantee ordering, unfortunately. > > > > Right, sorry to be misleading. The "atomicity" part I was trying to say > > the kernel will always see consistent update on the fields. > > > > The ordering should also be guaranteed, because things must happen with > > below sequence: > > > > (1) kernel publish dirty GFN data (slot, offset) > > (2) kernel publish dirty GFN flag (set to DIRTY) > > (3) user sees DIRTY, collects (slots, offset) > > (4) user sets it to RESET > > (5) kernel reads RESET > > Maybe. Maybe not. The reset could well be sitting in the CPU write > buffer for as long as it wants and not be seen by the kernel if the > read occurs on another CPU. 
And that's the crucial bit: single-CPU is > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU > on collection, and global on reset (this seems like a bad decision, > but it is too late to fix this). Regarding the last statement, that's something I had questions about too and discussed with Paolo, even though at that time it wasn't an outcome of discussing memory ordering issues. IIUC the initial design was trying to avoid a tlb flush flood when the vcpu number is large (each per-ring RESET, even for one page, will need all vcpus to flush, so O(N^2) flushing is needed). With a global RESET it's O(N). So it's kind of a trade-off, and indeed until now I'm not sure which one is better. E.g., with per-ring reset, we can have locality in userspace too, e.g. the vcpu thread might be able to recycle without holding global locks. Regarding safety, I hope I covered that below in the previous reply. > > > > > So the ordering of single-entry is guaranteed in that when (5) happens it > > must be after stablized (1+2). > > > > > Take the > > > following example: CPU0 is changing a bunch of flags for GFNs A, B, C, > > > D that exist in the ring in that order, and CPU1 performs an ioctl to > > > reset the page state. > > > > > > CPU0: > > > write_flag(A, KVM_DIRTY_GFN_F_RESET) > > > write_flag(B, KVM_DIRTY_GFN_F_RESET) > > > write_flag(C, KVM_DIRTY_GFN_F_RESET) > > > write_flag(D, KVM_DIRTY_GFN_F_RESET) > > > [...] > > > > > > CPU1: > > > ioctl(KVM_RESET_DIRTY_RINGS) > > > > > > Since CPU0 writes do not have any ordering, CPU1 can observe the > > > writes in a sequence that have nothing to do with program order, and > > > could for example observe that GFN A and D have been reset, but not B > > > and C. This in turn breaks the logic in the reset code (B, C, and D > > > don't get reset), despite userspace having followed the spec to the > > > letter. If each was a store-release (which is the case on x86), it > > > wouldn't be a problem, but nothing calls it in the documentation. > > > > > > Maybe that's not a big deal if it is expected that each CPU will issue > > > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own > > > writes. But expecting this to work across CPUs without any barrier is > > > wishful thinking. > > > > I see what you meant... > > > > Firstly I'm actually curious whether that'll really happen if the gfns are > > collected in something like a for loop: > > > > for(i = 0; i < N; i++) > > collect_dirty_gfn(ring, i); > > > > Because since all the gfps to be read will depend on variable "i", IIUC no > > reordering should happen, but I'm not really sure, so more of a pure > > question. > > 'i' has no influence on the write ordering. Each write targets a > different address, there is no inter-write dependencies (this concept > doesn't exist other than for writes to the same address), so they can > be reordered at will. > > If you want a proof of this, head to http://diy.inria.fr/www/ and run > the MP.litmus test (which conveniently gives you a reduction of this > problem) on both the x86 and AArch64 models. You will see that the > reordering isn't allowed on x86, but definitely allowed on arm64. > > > Besides, the other thing to mention is that I think it is fine the RESET > > ioctl didn't recycle all the gfns got set to reset state. Taking above > > example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags > > are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D > > are not reset.
But IMHO it's okay because it means we reset partial of the > > gfns not all of them, and it's safe to do so. It means the next ring full > > event can come earlier because we recycled less, but that's functionally > > safe to me. > > It may be safe, but it isn't what the userspace API promises. The document says: After processing one or more entries in the ring buffer, userspace calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that the kernel will reprotect those collected GFNs. Therefore, the ioctl must be called *before* reading the content of the dirty pages. I'd say it's not an explicit promise, but I think I agree with you that the behavior is at least unclear. Since we have a global recycle mechanism, most likely the app (e.g. the current qemu impl) will use the same thread to collect/reset dirty GFNs and trigger the RESET ioctl(). In that case it's safe, IIUC, because there are no cross-core ops. QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked): if (total) { ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS); assert(ret == total); } I think the assert() should never trigger, as mentioned above. But ideally maybe it should just be a loop until the number of cleared gfns matches the total. > In other words, without further straightening of the API, this doesn't > work as expected on relaxed memory architectures. So before this gets > enabled on arm64, this whole ordering issue must be addressed. How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the possibility of recycling only some of the pages, especially when the collection and the ioctl() aren't done from the same thread? Any suggestions are very welcome. Thanks,
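As a concrete illustration of the "loop until the cleared gfns match the total" idea above, here is a sketch written against the raw ioctl interface rather than QEMU internals; the function, its retry bound and its error handling are all invented for this example and are not part of QEMU or of this series.

/* Sketch only: retry KVM_RESET_DIRTY_RINGS until the expected number of
 * harvested entries has actually been reset, instead of asserting that a
 * single call did the whole job. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static long reset_dirty_rings(int vm_fd, long total)
{
        long cleared = 0;
        int retries = 10;               /* arbitrary bound for the sketch */

        while (cleared < total) {
                int ret = ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);

                if (ret < 0) {
                        perror("KVM_RESET_DIRTY_RINGS");
                        break;
                }
                cleared += ret;

                /* Some RESET flags may not be visible to the kernel yet;
                 * give up after a few empty rounds and let the next
                 * harvest pick up the leftovers. */
                if (ret == 0 && --retries == 0)
                        break;
        }
        return cleared;
}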
On Wed, 24 Aug 2022 17:21:50 +0100, Peter Xu <peterx@redhat.com> wrote: > > On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote: > > On Wed, 24 Aug 2022 00:19:04 +0100, > > Peter Xu <peterx@redhat.com> wrote: > > > > > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote: > > > > Atomicity doesn't guarantee ordering, unfortunately. > > > > > > Right, sorry to be misleading. The "atomicity" part I was trying to say > > > the kernel will always see consistent update on the fields. > > > > > > The ordering should also be guaranteed, because things must happen with > > > below sequence: > > > > > > (1) kernel publish dirty GFN data (slot, offset) > > > (2) kernel publish dirty GFN flag (set to DIRTY) > > > (3) user sees DIRTY, collects (slots, offset) > > > (4) user sets it to RESET > > > (5) kernel reads RESET > > > > Maybe. Maybe not. The reset could well be sitting in the CPU write > > buffer for as long as it wants and not be seen by the kernel if the > > read occurs on another CPU. And that's the crucial bit: single-CPU is > > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU > > on collection, and global on reset (this seems like a bad decision, > > but it is too late to fix this). > > Regarding the last statement, that's something I had question too and > discussed with Paolo, even though at that time it's not a outcome of > discussing memory ordering issues. > > IIUC the initial design was trying to avoid tlb flush flood when vcpu > number is large (each RESET per ring even for one page will need all vcpus > to flush, so O(N^2) flushing needed). With global RESET it's O(N). So > it's kind of a trade-off, and indeed until now I'm not sure which one is > better. E.g., with per-ring reset, we can have locality too in userspace, > e.g. the vcpu thread might be able to recycle without holding global locks. I don't get that. On x86, each CPU must perform the TLB invalidation (there is an IPI for that). So whether you do a per-CPU scan of the ring or a global scan is irrelevant: each entry you find in any of the rings must result in a global invalidation, since you've updated the PTE to make the page RO. The same is true on ARM, except that the broadcast is done in HW instead of being tracked in SW. But anyway, this is all moot. The API is what it is, and it isn't going to change any time soon. All we can do is add some clarifications to the API for the more relaxed architectures, and make sure the kernel behaves accordingly. [...] > > It may be safe, but it isn't what the userspace API promises. > > The document says: > > After processing one or more entries in the ring buffer, userspace calls > the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that > the kernel will reprotect those collected GFNs. Therefore, the ioctl > must be called *before* reading the content of the dirty pages. > > I'd say it's not an explicit promise, but I think I agree with you that at > least it's unclear on the behavior. This is the least problematic part of the documentation. The bit I literally choke on is this: <quote> It's not necessary for userspace to harvest the all dirty GFNs at once. However it must collect the dirty GFNs in sequence, i.e., the userspace program cannot skip one dirty GFN to collect the one next to it. </quote> This is the core of the issue.
Of course, the kernel itself must do the right thing (I guess it amounts to the kernel doing a load-acquire, and userspace doing a store-release -- effectively emulating x86...). > Since we have a global recycle mechanism, most likely the app (e.g. current > qemu impl) will use the same thread to collect/reset dirty GFNs, and > trigger the RESET ioctl(). In that case it's safe, IIUC, because no > cross-core ops. > > QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked): > > if (total) { > ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS); > assert(ret == total); > } > > I think the assert() should never trigger as mentioned above. But ideally > maybe it should just be a loop until cleared gfns match total. Right. If userspace calls the ioctl on every vcpu, then things should work correctly. It is only that the overhead is higher than what it should be if multiple vcpus perform a reset at the same time. > > > In other words, without further straightening of the API, this doesn't > > work as expected on relaxed memory architectures. So before this gets > > enabled on arm64, this whole ordering issue must be addressed. > > How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the > possibility of recycling partial of the pages, especially when collection > and the ioctl() aren't done from the same thread? I'd rather tell people about the ordering rules. That will come at zero cost on x86. > Any suggestions will be greatly welcomed. I'll write a couple of patch when I get the time, most likely next week. Gavin will hopefully be able to take them as part of his series. M.
Hi Marc, On 8/25/22 6:57 AM, Marc Zyngier wrote: > On Wed, 24 Aug 2022 17:21:50 +0100, > Peter Xu <peterx@redhat.com> wrote: >> >> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote: >>> On Wed, 24 Aug 2022 00:19:04 +0100, >>> Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote: >>>>> Atomicity doesn't guarantee ordering, unfortunately. >>>> >>>> Right, sorry to be misleading. The "atomicity" part I was trying to say >>>> the kernel will always see consistent update on the fields. >>>> >>>> The ordering should also be guaranteed, because things must happen with >>>> below sequence: >>>> >>>> (1) kernel publish dirty GFN data (slot, offset) >>>> (2) kernel publish dirty GFN flag (set to DIRTY) >>>> (3) user sees DIRTY, collects (slots, offset) >>>> (4) user sets it to RESET >>>> (5) kernel reads RESET >>> >>> Maybe. Maybe not. The reset could well be sitting in the CPU write >>> buffer for as long as it wants and not be seen by the kernel if the >>> read occurs on another CPU. And that's the crucial bit: single-CPU is >>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU >>> on collection, and global on reset (this seems like a bad decision, >>> but it is too late to fix this). >> >> Regarding the last statement, that's something I had question too and >> discussed with Paolo, even though at that time it's not a outcome of >> discussing memory ordering issues. >> >> IIUC the initial design was trying to avoid tlb flush flood when vcpu >> number is large (each RESET per ring even for one page will need all vcpus >> to flush, so O(N^2) flushing needed). With global RESET it's O(N). So >> it's kind of a trade-off, and indeed until now I'm not sure which one is >> better. E.g., with per-ring reset, we can have locality too in userspace, >> e.g. the vcpu thread might be able to recycle without holding global locks. > > I don't get that. On x86, each CPU must perform the TLB invalidation > (there is an IPI for that). So whether you do a per-CPU scan of the > ring or a global scan is irrelevant: each entry you find in any of the > rings must result in a global invalidation, since you've updated the > PTE to make the page RO. > > The same is true on ARM, except that the broadcast is done in HW > instead of being tracked in SW. > > Buy anyway, this is all moot. The API is what it is, and it isn't > going to change any time soon. All we can do is add some > clarifications to the API for the more relaxed architectures, and make > sure the kernel behaves accordingly. > > [...] > >>> It may be safe, but it isn't what the userspace API promises. >> >> The document says: >> >> After processing one or more entries in the ring buffer, userspace calls >> the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that >> the kernel will reprotect those collected GFNs. Therefore, the ioctl >> must be called *before* reading the content of the dirty pages. >> >> I'd say it's not an explicit promise, but I think I agree with you that at >> least it's unclear on the behavior. > > This is the least problematic part of the documentation. The bit I > literally choke on is this: > > <quote> > It's not necessary for userspace to harvest the all dirty GFNs at once. > However it must collect the dirty GFNs in sequence, i.e., the userspace > program cannot skip one dirty GFN to collect the one next to it. > </quote> > > This is the core of the issue. 
Without ordering rules, the consumer on > the other side cannot observe the updates correctly, even if userspace > follows the above to the letter. Of course, the kernel itself must do > the right thing (I guess it amounts to the kernel doing a > load-acquire, and userspace doing a store-release -- effectively > emulating x86...). > >> Since we have a global recycle mechanism, most likely the app (e.g. current >> qemu impl) will use the same thread to collect/reset dirty GFNs, and >> trigger the RESET ioctl(). In that case it's safe, IIUC, because no >> cross-core ops. >> >> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked): >> >> if (total) { >> ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS); >> assert(ret == total); >> } >> >> I think the assert() should never trigger as mentioned above. But ideally >> maybe it should just be a loop until cleared gfns match total. > > Right. If userspace calls the ioctl on every vcpu, then things should > work correctly. It is only that the overhead is higher than what it > should be if multiple vcpus perform a reset at the same time. > >> >>> In other words, without further straightening of the API, this doesn't >>> work as expected on relaxed memory architectures. So before this gets >>> enabled on arm64, this whole ordering issue must be addressed. >> >> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the >> possibility of recycling partial of the pages, especially when collection >> and the ioctl() aren't done from the same thread? > > I'd rather tell people about the ordering rules. That will come at > zero cost on x86. > >> Any suggestions will be greatly welcomed. > > I'll write a couple of patch when I get the time, most likely next > week. Gavin will hopefully be able to take them as part of his series. > Thanks, Marc. Please let me know where I can check out the patches when they're ready. I can include the patches into this series in next revision :) Thanks, Gavin
On 8/24/22 00:47, Marc Zyngier wrote: >> I definitely don't think I 100% understand all the ordering things since >> they're complicated.. but my understanding is that the reset procedure >> didn't need memory barrier (unlike pushing, where we have explicit wmb), >> because we assumed the userapp is not hostile so logically it should only >> modify the flags which is a 32bit field, assuming atomicity guaranteed. > Atomicity doesn't guarantee ordering, unfortunately. Take the > following example: CPU0 is changing a bunch of flags for GFNs A, B, C, > D that exist in the ring in that order, and CPU1 performs an ioctl to > reset the page state. > > CPU0: > write_flag(A, KVM_DIRTY_GFN_F_RESET) > write_flag(B, KVM_DIRTY_GFN_F_RESET) > write_flag(C, KVM_DIRTY_GFN_F_RESET) > write_flag(D, KVM_DIRTY_GFN_F_RESET) > [...] > > CPU1: > ioctl(KVM_RESET_DIRTY_RINGS) > > Since CPU0 writes do not have any ordering, CPU1 can observe the > writes in a sequence that have nothing to do with program order, and > could for example observe that GFN A and D have been reset, but not B > and C. This in turn breaks the logic in the reset code (B, C, and D > don't get reset), despite userspace having followed the spec to the > letter. If each was a store-release (which is the case on x86), it > wouldn't be a problem, but nothing calls it in the documentation. > > Maybe that's not a big deal if it is expected that each CPU will issue > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own > writes. But expecting this to work across CPUs without any barrier is > wishful thinking. Agreed, but that's a problem for userspace to solve. If userspace wants to reset the fields on different CPUs, it has to synchronize with its own invocation of the ioctl. That is, CPU0 must ensure that an ioctl(KVM_RESET_DIRTY_RINGS) is done after (in the memory-ordering sense) its last write_flag(D, KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no guarantee that the write_flag will have any effect. The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was because it takes kvm->slots_lock so the execution would be serialized anyway. Turning slots_lock into an rwsem would be even worse because it also takes kvm->mmu_lock (since slots_lock is a mutex, at least two concurrent invocations won't clash with each other on the mmu_lock). Paolo
On 8/23/22 22:35, Marc Zyngier wrote: >> Heh, yeah I need to get that out the door. I'll also note that Gavin's >> changes are still relevant without that series, as we do write unprotect >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: >> Add fast path to handle permission relaxation during dirty logging"). > > Ah, true. Now if only someone could explain how the whole > producer-consumer thing works without a trace of a barrier, that'd be > great... Do you mean this? void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) { struct kvm_dirty_gfn *entry; /* It should never get full */ WARN_ON_ONCE(kvm_dirty_ring_full(ring)); entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)]; entry->slot = slot; entry->offset = offset; /* * Make sure the data is filled in before we publish this to * the userspace program. There's no paired kernel-side reader. */ smp_wmb(); kvm_dirty_gfn_set_dirtied(entry); ring->dirty_index++; trace_kvm_dirty_ring_push(ring, slot, offset); } The matching smp_rmb() is in userspace. Paolo
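To make the pairing Paolo describes concrete, a userspace harvest step might look like the following sketch. The helper name is invented, an acquire load stands in for the explicit read barrier he mentions, and the struct layout comes from the linux/kvm.h UAPI header; this is an illustration, not the actual QEMU code.

/* Sketch: userspace reading one published entry.  The acquire load of
 * flags pairs with the kernel's smp_wmb() + kvm_dirty_gfn_set_dirtied()
 * shown above, so slot/offset are only read once the entry is known to
 * be fully filled in. */
#include <stdbool.h>
#include <linux/kvm.h>

static bool dirty_gfn_harvest_one(struct kvm_dirty_gfn *gfn,
                                  __u32 *slot, __u64 *offset)
{
        __u32 flags = __atomic_load_n(&gfn->flags, __ATOMIC_ACQUIRE);

        if (!(flags & KVM_DIRTY_GFN_F_DIRTY))
                return false;          /* not published yet: stop here */

        *slot = gfn->slot;             /* ordered after the flag check */
        *offset = gfn->offset;
        return true;
}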
On Fri, 26 Aug 2022 11:58:08 +0100, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 8/23/22 22:35, Marc Zyngier wrote: > >> Heh, yeah I need to get that out the door. I'll also note that Gavin's > >> changes are still relevant without that series, as we do write unprotect > >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: > >> Add fast path to handle permission relaxation during dirty logging"). > > > > Ah, true. Now if only someone could explain how the whole > > producer-consumer thing works without a trace of a barrier, that'd be > > great... > > Do you mean this? > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) Of course not. I mean this: static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm) { unsigned long i; struct kvm_vcpu *vcpu; int cleared = 0; if (!kvm->dirty_ring_size) return -EINVAL; mutex_lock(&kvm->slots_lock); kvm_for_each_vcpu(i, vcpu, kvm) cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring); [...] } and this int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) { u32 cur_slot, next_slot; u64 cur_offset, next_offset; unsigned long mask; int count = 0; struct kvm_dirty_gfn *entry; bool first_round = true; /* This is only needed to make compilers happy */ cur_slot = cur_offset = mask = 0; while (true) { entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)]; if (!kvm_dirty_gfn_harvested(entry)) break; [...] } which provides no ordering whatsoever when a ring is updated from one CPU and reset from another. M.
On Fri, 26 Aug 2022 11:50:24 +0100, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 8/24/22 00:47, Marc Zyngier wrote: > >> I definitely don't think I 100% understand all the ordering things since > >> they're complicated.. but my understanding is that the reset procedure > >> didn't need memory barrier (unlike pushing, where we have explicit wmb), > >> because we assumed the userapp is not hostile so logically it should only > >> modify the flags which is a 32bit field, assuming atomicity guaranteed. > > Atomicity doesn't guarantee ordering, unfortunately. Take the > > following example: CPU0 is changing a bunch of flags for GFNs A, B, C, > > D that exist in the ring in that order, and CPU1 performs an ioctl to > > reset the page state. > > > > CPU0: > > write_flag(A, KVM_DIRTY_GFN_F_RESET) > > write_flag(B, KVM_DIRTY_GFN_F_RESET) > > write_flag(C, KVM_DIRTY_GFN_F_RESET) > > write_flag(D, KVM_DIRTY_GFN_F_RESET) > > [...] > > > > CPU1: > > ioctl(KVM_RESET_DIRTY_RINGS) > > > > Since CPU0 writes do not have any ordering, CPU1 can observe the > > writes in a sequence that have nothing to do with program order, and > > could for example observe that GFN A and D have been reset, but not B > > and C. This in turn breaks the logic in the reset code (B, C, and D > > don't get reset), despite userspace having followed the spec to the > > letter. If each was a store-release (which is the case on x86), it > > wouldn't be a problem, but nothing calls it in the documentation. > > > > Maybe that's not a big deal if it is expected that each CPU will issue > > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own > > writes. But expecting this to work across CPUs without any barrier is > > wishful thinking. > > Agreed, but that's a problem for userspace to solve. If userspace > wants to reset the fields in different CPUs, it has to synchronize > with its own invoking of the ioctl. userspace has no choice. It cannot order on its own the reads that the kernel will do to *other* rings. > That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done > after (in the memory-ordering sense) its last write_flag(D, > KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no > guarantee that the write_flag will have any effect. The problem isn't on CPU0. The problem is that CPU1 does observe inconsistent data on arm64, and I don't think this difference in behaviour is acceptable. Nothing documents this, and there is a baked-in assumption that there is a strong ordering between writes as well as between writes and reads. > The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl > was because it takes kvm->slots_lock so the execution would be > serialized anyway. Turning slots_lock into an rwsem would be even > worse because it also takes kvm->mmu_lock (since slots_lock is a > mutex, at least two concurrent invocations won't clash with each other > on the mmu_lock). Whatever the reason, the behaviour should be identical on all architectures. As it is, it only really works on x86, and I contend this is a bug that needs fixing. Thankfully, this can be done at zero cost for x86, and at that of a set of load-acquires on other architectures. M.
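The userspace half of that contract could be as small as the sketch below: marking a collected entry with a store-release so that a kernel-side load-acquire (the "set of load-acquires" Marc refers to) is guaranteed to observe the entry's other fields as well. The helper name and the use of the GCC/Clang __atomic builtins are choices made for this illustration, not something prescribed by the thread.

/* Sketch: userspace handing a harvested entry back for reset. */
#include <linux/kvm.h>

static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
{
        /*
         * Release store: everything userspace did with this entry before
         * the flag update (reading slot/offset, copying the page, ...) is
         * ordered before the RESET flag becomes visible, pairing with a
         * load-acquire in the kernel's reset path.
         */
        __atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
}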
On 8/26/22 17:49, Marc Zyngier wrote: >> Agreed, but that's a problem for userspace to solve. If userspace >> wants to reset the fields in different CPUs, it has to synchronize >> with its own invoking of the ioctl. > > userspace has no choice. It cannot order on its own the reads that the > kernel will do to *other* rings. Those reads will never see KVM_DIRTY_GFN_F_RESET in the flags, however, if userspace has never interacted with the ring. So there will be exactly one read on those rings, and there's nothing to reorder. If that's too tricky and you want to add a load-acquire, I have no objection though. It also helps avoid read-read reordering between one entry's flags and the next one's, so it's a good idea to have it anyway. >> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl >> was because it takes kvm->slots_lock so the execution would be >> serialized anyway. Turning slots_lock into an rwsem would be even >> worse because it also takes kvm->mmu_lock (since slots_lock is a >> mutex, at least two concurrent invocations won't clash with each other >> on the mmu_lock). > > Whatever the reason, the behaviour should be identical on all > architectures. As is is, it only really works on x86, and I contend > this is a bug that needs fixing. > > Thankfully, this can be done at zero cost for x86, and at that of a > set of load-acquires on other architectures. Yes, the global-ness of the API is orthogonal to the memory ordering issue. I just wanted to explain why a per-vCPU API probably isn't going to work great. Paolo
On Fri, Aug 26, 2022 at 04:28:51PM +0100, Marc Zyngier wrote: > On Fri, 26 Aug 2022 11:58:08 +0100, > Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > On 8/23/22 22:35, Marc Zyngier wrote: > > >> Heh, yeah I need to get that out the door. I'll also note that Gavin's > > >> changes are still relevant without that series, as we do write unprotect > > >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64: > > >> Add fast path to handle permission relaxation during dirty logging"). > > > > > > Ah, true. Now if only someone could explain how the whole > > > producer-consumer thing works without a trace of a barrier, that'd be > > > great... > > > > Do you mean this? > > > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset) > > Of course not. I mean this: > > static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm) > { > unsigned long i; > struct kvm_vcpu *vcpu; > int cleared = 0; > > if (!kvm->dirty_ring_size) > return -EINVAL; > > mutex_lock(&kvm->slots_lock); > > kvm_for_each_vcpu(i, vcpu, kvm) > cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring); > [...] > } > > and this > > int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) > { > u32 cur_slot, next_slot; > u64 cur_offset, next_offset; > unsigned long mask; > int count = 0; > struct kvm_dirty_gfn *entry; > bool first_round = true; > > /* This is only needed to make compilers happy */ > cur_slot = cur_offset = mask = 0; > > while (true) { > entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)]; > > if (!kvm_dirty_gfn_harvested(entry)) > break; > [...] > > } > > which provides no ordering whatsoever when a ring is updated from one > CPU and reset from another. Marc, I thought we wouldn't hit this as long as we properly take care of the other orderings of (a) gfn push and (b) gfn collect, but on second thought I think it's indeed logically possible that, with a reversed ordering here, we can be reading some garbage gfn before (a) happens but also read the valid flag after (b). It seems we must always have all the barriers correctly applied. If that's correct, do you perhaps mean something like this, to just add the last missing barrier? ===8<=== diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c index f4c2a6eb1666..ea620bfb012d 100644 --- a/virt/kvm/dirty_ring.c +++ b/virt/kvm/dirty_ring.c @@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn) static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn) { - return gfn->flags & KVM_DIRTY_GFN_F_RESET; + return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET; } int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) ===8<=== Thanks,
On 8/30/22 16:42, Peter Xu wrote: > Marc, > > I thought we won't hit this as long as we properly take care of other > orderings of (a) gfn push, and (b) gfn collect, but after a second thought > I think it's indeed logically possible that with a reversed ordering here > we can be reading some garbage gfn before (a) happens butt also read the > valid flag after (b). > > It seems we must have all the barriers correctly applied always. If that's > correct, do you perhaps mean something like this to just add the last piece > of barrier? Okay, so I thought about it some more and it's quite tricky. Strictly speaking, the synchronization is just between userspace and kernel. The fact that the actual producer of dirty pages is in another CPU is a red herring, because reset only cares about harvested pages. In other words, the dirty page ring is essentially two ring buffers in one and we only care about the "harvested ring", not the "produced ring". On the other hand, it may happen that userspace has set more RESET flags while the ioctl is ongoing:

   CPU0                       CPU1                     CPU2
   fill gfn0
   store-rel flags for gfn0
   fill gfn1
   store-rel flags for gfn1
                                                       load-acq flags for gfn0
                                                       set RESET for gfn0
                                                       load-acq flags for gfn1
                                                       set RESET for gfn1
                                                       do ioctl! ----------->
                              ioctl(RESET_RINGS)
   fill gfn2
   store-rel flags for gfn2
                                                       load-acq flags for gfn2
                                                       set RESET for gfn2
                              process gfn0
                              process gfn1
                              process gfn2
                                                       do ioctl!
                              etc.

The three load-acquire in CPU2 synchronize with the three store-release in CPU0, but CPU0 and CPU1 are only synchronized up to gfn1 and CPU1 may miss gfn2's fields other than flags. The kernel must be able to cope with invalid values of the fields, and userspace will invoke the ioctl once more. However, once the RESET flag is cleared on gfn2, it is lost forever, therefore in the above scenario CPU1 must read the correct value of gfn2's fields. Therefore RESET must be set with a store-release, which will synchronize with a load-acquire in CPU1 as you suggested. Paolo > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > index f4c2a6eb1666..ea620bfb012d 100644 > --- a/virt/kvm/dirty_ring.c > +++ b/virt/kvm/dirty_ring.c > @@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn) > > static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn) > { > - return gfn->flags & KVM_DIRTY_GFN_F_RESET; > + return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET; > } > > int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) > ===8<=== > > Thanks, > > -- > Peter Xu >
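Putting Peter's hunk and Paolo's conclusion together, the kernel-side accessors could end up looking roughly like the sketch below. This is only an illustration of the direction the discussion converges on, not the actual patches Marc offered to write; with a store-release on the publish side, kvm_dirty_ring_push() would no longer need its separate smp_wmb().

/* Sketch of virt/kvm/dirty_ring.c helpers with acquire/release semantics. */

static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
{
        /*
         * Pairs with an acquire load of gfn->flags in userspace: the
         * slot/offset written by kvm_dirty_ring_push() are visible
         * before the entry is seen as DIRTY.
         */
        smp_store_release(&gfn->flags, KVM_DIRTY_GFN_F_DIRTY);
}

static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
{
        /*
         * Pairs with userspace's release store of KVM_DIRTY_GFN_F_RESET:
         * if the flag is observed here, the collection that preceded it
         * in userspace is observable too.
         */
        return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
}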