| Message ID | 20240809194335.1726916-20-seanjc@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: x86/mmu: Allow yielding on mmu_notifier zap |
On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote: > + > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > +{ > + unsigned long old_val, new_val; > + > + old_val = READ_ONCE(rmap_head->val); > + if (!old_val) > + return 0; > + > + do { > + /* > + * If the rmap is locked, wait for it to be unlocked before > + * trying acquire the lock, e.g. to bounce the cache line. > + */ > + while (old_val & KVM_RMAP_LOCKED) { > + old_val = READ_ONCE(rmap_head->val); > + cpu_relax(); The sequence of these two lines of code can be improved. > + } > + > + /* > + * Recheck for an empty rmap, it may have been purged by the > + * task that held the lock. > + */ > + if (!old_val) > + return 0; > + > + new_val = old_val | KVM_RMAP_LOCKED; > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > + > + /* Return the old value, i.e. _without_ the LOCKED bit set. */ > + return old_val; > +}
On Mon, Aug 12, 2024, Lai Jiangshan wrote: > On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote: > > > + > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > +{ > > + unsigned long old_val, new_val; > > + > > + old_val = READ_ONCE(rmap_head->val); > > + if (!old_val) > > + return 0; > > + > > + do { > > + /* > > + * If the rmap is locked, wait for it to be unlocked before > > + * trying acquire the lock, e.g. to bounce the cache line. > > + */ > > + while (old_val & KVM_RMAP_LOCKED) { > > + old_val = READ_ONCE(rmap_head->val); > > + cpu_relax(); > > The sequence of these two lines of code can be improved. Oh yeah, duh, re-read after PAUSE, not before. Definitely holler if you have any alternative ideas for walking rmaps without taking mmu_lock, I guarantee you've spent more time than me thinking about the shadow MMU :-)
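For reference, the fix Sean acks here is simply swapping the two statements so the PAUSE happens before the re-read; a minimal sketch of the corrected wait loop (not a quote from any later revision of the patch):

		/*
		 * If the rmap is locked, wait for it to be unlocked before
		 * trying to acquire the lock, e.g. to bounce the cache line.
		 * Pause first, then re-read, so the value handed to the
		 * cmpxchg is as fresh as possible.
		 */
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = READ_ONCE(rmap_head->val);
		}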
On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Oh yeah, duh, re-read after PAUSE, not before.
>
> Definitely holler if you have any alternative ideas for walking rmaps
> without taking mmu_lock, I guarantee you've spent more time than me
> thinking about the shadow MMU :-)

In our implementation, we use the same bit, in the same way, for the rmap lock.

We just use bit_spin_lock(), and the optimization for an empty rmap_head is
handled outside of kvm_rmap_lock().

bit_spin_lock() also provides the much-needed preempt_disable(). I'm not sure
whether the new kvm_rmap_age_gfn_range_lockless() is called in a
preempt-disabled region.

Thanks
Lai
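For readers unfamiliar with the alternative Lai is describing, a bit_spin_lock()-based rmap lock might look roughly like the sketch below. This is only an illustration of the suggestion, not code from any posted series; the empty-rmap_head optimization would live in the callers, and the function names here are made up.

#include <linux/bit_spinlock.h>

/* Hypothetical alternative: treat bit 1 of rmap_head->val as a bit spinlock. */
static void kvm_rmap_lock_alternative(struct kvm_rmap_head *rmap_head)
{
	/* Spins until bit 1 is clear, sets it, and disables preemption. */
	bit_spin_lock(1, &rmap_head->val);
}

static void kvm_rmap_unlock_alternative(struct kvm_rmap_head *rmap_head)
{
	/* Clears bit 1 and re-enables preemption. */
	bit_spin_unlock(1, &rmap_head->val);
}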
On Tue, Aug 13, 2024, Lai Jiangshan wrote:
> On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Oh yeah, duh, re-read after PAUSE, not before.
> >
> > Definitely holler if you have any alternative ideas for walking rmaps
> > without taking mmu_lock, I guarantee you've spent more time than me
> > thinking about the shadow MMU :-)
>
> In our implementation, we use the same bit, in the same way, for the rmap lock.
>
> We just use bit_spin_lock(), and the optimization for an empty rmap_head is
> handled outside of kvm_rmap_lock().

Hmm, I'm leaning towards keeping the custom locking. There are a handful of
benefits, none of which are all that meaningful on their own, but they do add up.

 - Callers don't need to manually check for an empty rmap_head.

 - Can avoid the redundant preempt_{disable,enable}() entirely in the common
   case of being called while mmu_lock is held.

 - Handles the (likely super rare) edge case where a read-only walker
   encounters an rmap that was just emptied (rmap_head->val goes to zero after
   the initial check to elide the lock).

 - Avoids an atomic when releasing the lock, and any extra instructions
   entirely for writers since they always write the full rmap_head->val when
   releasing.

> bit_spin_lock() also provides the much-needed preempt_disable(). I'm not sure
> whether the new kvm_rmap_age_gfn_range_lockless() is called in a
> preempt-disabled region.

Oof, it doesn't. Disabling IRQs crossed my mind, but I completely forgot about
preemption. Thanks much!
On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > Steal another bit from rmap entries (which are word aligned pointers, i.e. > have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use > the bit to implement a *very* rudimentary per-rmap spinlock. The only > anticipated usage of the lock outside of mmu_lock is for aging gfns, and > collisions between aging and other MMU rmap operations are quite rare, > e.g. unless userspace is being silly and aging a tiny range over and over > in a tight loop, time between contention when aging an actively running VM > is O(seconds). In short, a more sophisticated locking scheme shouldn't be > necessary. > > Note, the lock only protects the rmap structure itself, SPTEs that are > pointed at by a locked rmap can still be modified and zapped by another > task (KVM drops/zaps SPTEs before deleting the rmap entries) > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++----- > 1 file changed, 71 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 8ca7f51c2da3..a683b5fc4026 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu > * About rmap_head encoding: > * > * If the bit zero of rmap_head->val is clear, then it points to the only spte > - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct > + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct > * pte_list_desc containing more mappings. > */ > #define KVM_RMAP_MANY BIT(0) > > +/* > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always > + * operates with mmu_lock held for write), but rmaps can be walked without > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain > + * being zapped/dropped _while the rmap is locked_. > + * > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be > + * done while holding mmu_lock for write. This allows a task walking rmaps > + * without holding mmu_lock to concurrently walk the same entries as a task > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify > + * the rmaps, thus the walks are stable. > + * > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED, > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock > + * ensures all "struct pte_list_desc" fields are stable. > + */ > +#define KVM_RMAP_LOCKED BIT(1) > + > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > +{ > + unsigned long old_val, new_val; > + > + old_val = READ_ONCE(rmap_head->val); > + if (!old_val) > + return 0; I'm having trouble understanding how this bit works. What exactly is stopping the rmap from being populated while we have it "locked"? We aren't holding the MMU lock at all in the lockless case, and given this bit, it is impossible (I think?) for the MMU-write-lock-holding, rmap-modifying side to tell that this rmap is locked. Concretely, my immediate concern is that we will still unconditionally write 0 back at unlock time even if the value has changed. I expect that this works and I'm just not getting it... :) > + > + do { > + /* > + * If the rmap is locked, wait for it to be unlocked before > + * trying acquire the lock, e.g. to bounce the cache line. 
> + */ > + while (old_val & KVM_RMAP_LOCKED) { > + old_val = READ_ONCE(rmap_head->val); > + cpu_relax(); > + } > + > + /* > + * Recheck for an empty rmap, it may have been purged by the > + * task that held the lock. > + */ > + if (!old_val) > + return 0; > + > + new_val = old_val | KVM_RMAP_LOCKED; > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > + > + /* Return the old value, i.e. _without_ the LOCKED bit set. */ > + return old_val; > +} > + > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, > + unsigned long new_val) > +{ > + WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); > + WRITE_ONCE(rmap_head->val, new_val); > +} > + > +static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head) > +{ > + return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED; > +} > + > /* > * Returns the number of pointers in the rmap chain, not counting the new one. > */ > @@ -924,7 +986,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, > struct pte_list_desc *desc; > int count = 0; > > - old_val = rmap_head->val; > + old_val = kvm_rmap_lock(rmap_head); > > if (!old_val) { > new_val = (unsigned long)spte; > @@ -956,7 +1018,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, > desc->sptes[desc->spte_count++] = spte; > } > > - rmap_head->val = new_val; > + kvm_rmap_unlock(rmap_head, new_val); > > return count; > } > @@ -1004,7 +1066,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, > unsigned long rmap_val; > int i; > > - rmap_val = rmap_head->val; > + rmap_val = kvm_rmap_lock(rmap_head); > if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm)) > goto out; > > @@ -1030,7 +1092,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, > } > > out: > - rmap_head->val = rmap_val; > + kvm_rmap_unlock(rmap_head, rmap_val); > } > > static void kvm_zap_one_rmap_spte(struct kvm *kvm, > @@ -1048,7 +1110,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, > unsigned long rmap_val; > int i; > > - rmap_val = rmap_head->val; > + rmap_val = kvm_rmap_lock(rmap_head); > if (!rmap_val) > return false; > > @@ -1067,13 +1129,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, > } > out: > /* rmap_head is meaningless now, remember to reset it */ > - rmap_head->val = 0; > + kvm_rmap_unlock(rmap_head, 0); > return true; > } > > unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) > { > - unsigned long rmap_val = rmap_head->val; > + unsigned long rmap_val = kvm_rmap_get(rmap_head); > struct pte_list_desc *desc; > > if (!rmap_val) > @@ -1139,7 +1201,7 @@ struct rmap_iterator { > static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head, > struct rmap_iterator *iter) > { > - unsigned long rmap_val = rmap_head->val; > + unsigned long rmap_val = kvm_rmap_get(rmap_head); > u64 *sptep; > > if (!rmap_val) > -- > 2.46.0.76.ge559c4bf1a-goog >
On Tue, Sep 03, 2024, James Houghton wrote: > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > +/* > > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always > > + * operates with mmu_lock held for write), but rmaps can be walked without > > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain > > + * being zapped/dropped _while the rmap is locked_. > > + * > > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be > > + * done while holding mmu_lock for write. This allows a task walking rmaps > > + * without holding mmu_lock to concurrently walk the same entries as a task > > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify > > + * the rmaps, thus the walks are stable. > > + * > > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED, > > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock > > + * ensures all "struct pte_list_desc" fields are stable. > > + */ > > +#define KVM_RMAP_LOCKED BIT(1) > > + > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > +{ > > + unsigned long old_val, new_val; > > + > > + old_val = READ_ONCE(rmap_head->val); > > + if (!old_val) > > + return 0; > > I'm having trouble understanding how this bit works. What exactly is > stopping the rmap from being populated while we have it "locked"? Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for aging only cares about existing mappings. The key to correctness is that this helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do nothing and thus will never see any rmaps that come along in the future. > aren't holding the MMU lock at all in the lockless case, and given > this bit, it is impossible (I think?) for the MMU-write-lock-holding, > rmap-modifying side to tell that this rmap is locked. > > Concretely, my immediate concern is that we will still unconditionally > write 0 back at unlock time even if the value has changed. The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty, i.e. wasn't actually locked. +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head, + unsigned long old_val) +{ + if (!old_val) + return; + + KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED)); + WRITE_ONCE(rmap_head->val, old_val); +} The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the "normal", non-readonly versions if and only if mmu_lock is held for write. +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) +{ + /* + * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is + * held for write. + */ + return __kvm_rmap_lock(rmap_head); +}
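To illustrate why returning '0' is sufficient, here is a rough sketch of how a lockless walker is expected to use the lock, reusing the kvm_rmap_lock_readonly()/kvm_rmap_unlock_readonly() names from the follow-up patch. The walk body is elided and the wrapper function name is made up for illustration; it is not code from the series.

static bool kvm_rmap_walk_lockless_sketch(struct kvm_rmap_head *rmap_head)
{
	unsigned long rmap_val = kvm_rmap_lock_readonly(rmap_head);
	bool flush = false;

	/*
	 * An empty rmap returns 0 and the lock is not actually taken, so the
	 * walker bails and can never observe entries that are added later.
	 */
	if (!rmap_val)
		return false;

	/* ... walk the SPTEs reachable from rmap_val ... */

	kvm_rmap_unlock_readonly(rmap_head, rmap_val);
	return flush;
}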
On Tue, Sep 3, 2024 at 2:27 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Sep 03, 2024, James Houghton wrote: > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > > +/* > > > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always > > > + * operates with mmu_lock held for write), but rmaps can be walked without > > > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain > > > + * being zapped/dropped _while the rmap is locked_. > > > + * > > > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be > > > + * done while holding mmu_lock for write. This allows a task walking rmaps > > > + * without holding mmu_lock to concurrently walk the same entries as a task > > > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify > > > + * the rmaps, thus the walks are stable. > > > + * > > > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED, > > > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock > > > + * ensures all "struct pte_list_desc" fields are stable. > > > + */ > > > +#define KVM_RMAP_LOCKED BIT(1) > > > + > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > > +{ > > > + unsigned long old_val, new_val; > > > + > > > + old_val = READ_ONCE(rmap_head->val); > > > + if (!old_val) > > > + return 0; > > > > I'm having trouble understanding how this bit works. What exactly is > > stopping the rmap from being populated while we have it "locked"? > > Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for > aging only cares about existing mappings. The key to correctness is that this > helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do > nothing and thus will never see any rmaps that come along in the future. > > > aren't holding the MMU lock at all in the lockless case, and given > > this bit, it is impossible (I think?) for the MMU-write-lock-holding, > > rmap-modifying side to tell that this rmap is locked. > > > > Concretely, my immediate concern is that we will still unconditionally > > write 0 back at unlock time even if the value has changed. > > The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty, > i.e. wasn't actually locked. Ah, that's what I was missing. Thanks! This all makes sense now. > > +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head, > + unsigned long old_val) > +{ > + if (!old_val) > + return; > + > + KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED)); > + WRITE_ONCE(rmap_head->val, old_val); > +} > > The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the > "normal", non-readonly versions if and only if mmu_lock is held for write. > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > +{ > + /* > + * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is > + * held for write. > + */ > + return __kvm_rmap_lock(rmap_head); > +}
On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > Steal another bit from rmap entries (which are word aligned pointers, i.e. > have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use > the bit to implement a *very* rudimentary per-rmap spinlock. The only > anticipated usage of the lock outside of mmu_lock is for aging gfns, and > collisions between aging and other MMU rmap operations are quite rare, > e.g. unless userspace is being silly and aging a tiny range over and over > in a tight loop, time between contention when aging an actively running VM > is O(seconds). In short, a more sophisticated locking scheme shouldn't be > necessary. > > Note, the lock only protects the rmap structure itself, SPTEs that are > pointed at by a locked rmap can still be modified and zapped by another > task (KVM drops/zaps SPTEs before deleting the rmap entries) > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++----- > 1 file changed, 71 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 8ca7f51c2da3..a683b5fc4026 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu > * About rmap_head encoding: > * > * If the bit zero of rmap_head->val is clear, then it points to the only spte > - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct > + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct > * pte_list_desc containing more mappings. > */ > #define KVM_RMAP_MANY BIT(0) > > +/* > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always > + * operates with mmu_lock held for write), but rmaps can be walked without > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain > + * being zapped/dropped _while the rmap is locked_. > + * > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be > + * done while holding mmu_lock for write. This allows a task walking rmaps > + * without holding mmu_lock to concurrently walk the same entries as a task > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify > + * the rmaps, thus the walks are stable. > + * > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED, > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock > + * ensures all "struct pte_list_desc" fields are stable. This last sentence makes me think we need to be careful about memory ordering. > + */ > +#define KVM_RMAP_LOCKED BIT(1) > + > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > +{ > + unsigned long old_val, new_val; > + > + old_val = READ_ONCE(rmap_head->val); > + if (!old_val) > + return 0; > + > + do { > + /* > + * If the rmap is locked, wait for it to be unlocked before > + * trying acquire the lock, e.g. to bounce the cache line. > + */ > + while (old_val & KVM_RMAP_LOCKED) { > + old_val = READ_ONCE(rmap_head->val); > + cpu_relax(); > + } > + > + /* > + * Recheck for an empty rmap, it may have been purged by the > + * task that held the lock. > + */ > + if (!old_val) > + return 0; > + > + new_val = old_val | KVM_RMAP_LOCKED; > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); I think we (technically) need an smp_rmb() here. 
I think cmpxchg implicitly has that on x86 (and this code is x86-only), but
should we nonetheless document that we need smp_rmb() (if it is indeed
required)? Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

kvm_rmap_lock_readonly() should have an smp_rmb(), but it seems like adding it
here will do the right thing for the read-only lock side.

> +
> +	/* Return the old value, i.e. _without_ the LOCKED bit set. */
> +	return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +			    unsigned long new_val)
> +{
> +	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);

Same goes for having an smp_wmb() here. Is it necessary? If so, should it at
least be documented?

And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.

> +	WRITE_ONCE(rmap_head->val, new_val);
> +}
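Concretely, the barrier placement James is asking about would be something like the following; this is a sketch of the question being raised, not a change anyone has posted:

static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
			    unsigned long new_val)
{
	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
	/*
	 * Hypothetical: order the pte_list_desc updates made while the lock
	 * was held before the store that makes the lock appear free, pairing
	 * with an smp_rmb() (or the full ordering of try_cmpxchg() on x86)
	 * on the lock side.
	 */
	smp_wmb();
	WRITE_ONCE(rmap_head->val, new_val);
}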
On Mon, Sep 09, 2024, James Houghton wrote: > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > +/* > > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always > > + * operates with mmu_lock held for write), but rmaps can be walked without > > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain > > + * being zapped/dropped _while the rmap is locked_. > > + * > > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be > > + * done while holding mmu_lock for write. This allows a task walking rmaps > > + * without holding mmu_lock to concurrently walk the same entries as a task > > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify > > + * the rmaps, thus the walks are stable. > > + * > > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED, > > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock > > + * ensures all "struct pte_list_desc" fields are stable. > > This last sentence makes me think we need to be careful about memory ordering. > > > + */ > > +#define KVM_RMAP_LOCKED BIT(1) > > + > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > +{ > > + unsigned long old_val, new_val; > > + > > + old_val = READ_ONCE(rmap_head->val); > > + if (!old_val) > > + return 0; > > + > > + do { > > + /* > > + * If the rmap is locked, wait for it to be unlocked before > > + * trying acquire the lock, e.g. to bounce the cache line. > > + */ > > + while (old_val & KVM_RMAP_LOCKED) { > > + old_val = READ_ONCE(rmap_head->val); > > + cpu_relax(); > > + } > > + > > + /* > > + * Recheck for an empty rmap, it may have been purged by the > > + * task that held the lock. > > + */ > > + if (!old_val) > > + return 0; > > + > > + new_val = old_val | KVM_RMAP_LOCKED; > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > > I think we (technically) need an smp_rmb() here. I think cmpxchg > implicitly has that on x86 (and this code is x86-only), but should we > nonetheless document that we need smp_rmb() (if it indeed required)? > Perhaps we could/should condition the smp_rmb() on `if (old_val)`. Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably even wrong. For the !old_val case, there is a address/data dependency that can't be broken by the CPU without violating the x86 memory model (all future actions with relevant memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux kernel memory model, READ_ONCE() is responsible for ensuring that the address dependency can't be morphed into a control dependency by the compiler and subsequently reordered by the CPU. I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of __READ_ONCE() promotes the operation to an acquire. Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock, hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock, the rmaps are fully protected by the critical section. And for the SPTEs, there is no required ordering. The reader (aging thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for either. I.e. there is no requirement that the reader observe a PRESENT SPTE if there is a valid rmap. 
So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(), even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests an ordering requirement that doesn't exist. > kvm_rmap_lock_readonly() should have an smb_rmb(), but it seems like > adding it here will do the right thing for the read-only lock side. > > > + > > + /* Return the old value, i.e. _without_ the LOCKED bit set. */ > > + return old_val; > > +} > > + > > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, > > + unsigned long new_val) > > +{ > > + WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); > > Same goes with having an smp_wmb() here. Is it necessary? If so, > should it at least be documented? > > And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC. > > > + WRITE_ONCE(rmap_head->val, new_val); > > +}
On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Sep 09, 2024, James Houghton wrote: > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > > + */ > > > +#define KVM_RMAP_LOCKED BIT(1) > > > + > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > > +{ > > > + unsigned long old_val, new_val; > > > + > > > + old_val = READ_ONCE(rmap_head->val); > > > + if (!old_val) > > > + return 0; > > > + > > > + do { > > > + /* > > > + * If the rmap is locked, wait for it to be unlocked before > > > + * trying acquire the lock, e.g. to bounce the cache line. > > > + */ > > > + while (old_val & KVM_RMAP_LOCKED) { > > > + old_val = READ_ONCE(rmap_head->val); > > > + cpu_relax(); > > > + } > > > + > > > + /* > > > + * Recheck for an empty rmap, it may have been purged by the > > > + * task that held the lock. > > > + */ > > > + if (!old_val) > > > + return 0; > > > + > > > + new_val = old_val | KVM_RMAP_LOCKED; > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > > > > I think we (technically) need an smp_rmb() here. I think cmpxchg > > implicitly has that on x86 (and this code is x86-only), but should we > > nonetheless document that we need smp_rmb() (if it indeed required)? > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`. > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably > even wrong. I don't think smp_mb__after_spinlock() is right either. This seems to be used following the acquisition of a spinlock to promote the memory ordering from an acquire barrier (that is implicit with the lock acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a stronger-than-usual barrier. But I guess I'm not really sure. In this case, I'm complaining that we don't have the usual memory ordering restrictions that come with a spinlock. > For the !old_val case, there is a address/data dependency that can't be broken by > the CPU without violating the x86 memory model (all future actions with relevant > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux > kernel memory model, READ_ONCE() is responsible for ensuring that the address > dependency can't be morphed into a control dependency by the compiler and > subsequently reordered by the CPU. > > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of > __READ_ONCE() promotes the operation to an acquire. > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock, > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock, > the rmaps are fully protected by the critical section. I feel like a spinlock must include the appropriate barriers for it to correctly function as a spinlock, so I'm not sure I fully understand what you mean here. > And for the SPTEs, there > is no required ordering. The reader (aging thread) can observe a !PRESENT or a > PRESENT SPTE, and must be prepared for either. I.e. there is no requirement that > the reader observe a PRESENT SPTE if there is a valid rmap. This makes sense. > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(), > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests > an ordering requirement that doesn't exist. 
So we have: the general kvm_rmap_lock() and the read-only
kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use those
names (sorry if it's confusing).

For kvm_rmap_lock(), we are always holding mmu_lock for writing. So any changes
we make to the rmap will be properly published to other threads that
subsequently grab kvm_rmap_lock(), because we had to properly release and then
re-acquire mmu_lock, which comes with the barriers I'm saying we need.

For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no smp_rmb()
or equivalent. Without an smp_rmb() somewhere, I claim that it is possible that
we may observe external changes to the pte_list_desc while we are in this
critical section (for a sufficiently weak architecture). The changes that the
kvm_rmap_lock() (mmu_lock) side made were half-published with an smp_wmb()
(really [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
hasn't held up its end of the deal.

I don't think READ_ONCE() has the guarantees we need to be a sufficient
replacement for smp_rmb() or a load-acquire that a real lock would use,
although I agree with you that, on arm64, it apparently *is* a sufficient
replacement.

Now this isn't a problem if the kvm_rmap_lock_readonly() side can tolerate
changes to pte_list_desc while in the critical section. I don't think this is
true (given for_each_rmap_spte_lockless), therefore an smp_rmb() or equivalent
is (technically) needed.

Am I confused?

(Though all of this works just fine as written on x86.)

[1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
[2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
[3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
On Mon, Sep 09, 2024, James Houghton wrote: > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Sep 09, 2024, James Houghton wrote: > > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > > > + */ > > > > +#define KVM_RMAP_LOCKED BIT(1) > > > > + > > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > > > +{ > > > > + unsigned long old_val, new_val; > > > > + > > > > + old_val = READ_ONCE(rmap_head->val); > > > > + if (!old_val) > > > > + return 0; > > > > + > > > > + do { > > > > + /* > > > > + * If the rmap is locked, wait for it to be unlocked before > > > > + * trying acquire the lock, e.g. to bounce the cache line. > > > > + */ > > > > + while (old_val & KVM_RMAP_LOCKED) { > > > > + old_val = READ_ONCE(rmap_head->val); > > > > + cpu_relax(); > > > > + } > > > > + > > > > + /* > > > > + * Recheck for an empty rmap, it may have been purged by the > > > > + * task that held the lock. > > > > + */ > > > > + if (!old_val) > > > > + return 0; > > > > + > > > > + new_val = old_val | KVM_RMAP_LOCKED; > > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > > > > > > I think we (technically) need an smp_rmb() here. I think cmpxchg > > > implicitly has that on x86 (and this code is x86-only), but should we > > > nonetheless document that we need smp_rmb() (if it indeed required)? > > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`. > > > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably > > even wrong. > > I don't think smp_mb__after_spinlock() is right either. This seems to > be used following the acquisition of a spinlock to promote the memory > ordering from an acquire barrier (that is implicit with the lock > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a > stronger-than-usual barrier. But I guess I'm not really sure. > > In this case, I'm complaining that we don't have the usual memory > ordering restrictions that come with a spinlock. What makes you think that? > > For the !old_val case, there is a address/data dependency that can't be broken by > > the CPU without violating the x86 memory model (all future actions with relevant > > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux > > kernel memory model, READ_ONCE() is responsible for ensuring that the address > > dependency can't be morphed into a control dependency by the compiler and > > subsequently reordered by the CPU. > > > > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I > > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of > > __READ_ONCE() promotes the operation to an acquire. > > > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock, > > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock, > > the rmaps are fully protected by the critical section. > > I feel like a spinlock must include the appropriate barriers for it to > correctly function as a spinlock, so I'm not sure I fully understand > what you mean here. On TSO architectures, the atomic _is_ the barrier. E.g. atomic_try_cmpxchg_acquire() eventually resolves to atomic_try_cmpxchg() on x86. And jumping back to the "we don't have the usual memory ordering restrictions that come with a spinlock", x86's virt_spin_lock() uses atomic_try_cmpxchg(). 
So while using the acquire variant here is obviously not wrong, it also feels somewhat weird. Though some of that is undoubtedly due to explicit "acquire" semantics being rather rare in x86. > > And for the SPTEs, there is no required ordering. The reader (aging > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for > > either. I.e. there is no requirement that the reader observe a PRESENT > > SPTE if there is a valid rmap. > > This makes sense. > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(), > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests > > an ordering requirement that doesn't exist. > > So we have: the general kvm_rmap_lock() and the read-only > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use > those names (sorry if it's confusing). > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So > any changes we make to the rmap will be properly published to other > threads that subsequently grab kvm_rmap_lock() because we had to > properly release and then re-acquire mmu_lock, which comes with the > barriers I'm saying we need. > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that > it is possible that there may observe external changes to the > pte_list_desc while we are in this critical section (for a > sufficiently weak architecture). The changes that the kvm_rmap_lock() > (mmu_lock) side made were half-published with an smp_wmb() (really > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it > hasn't held up its end of the deal. > > I don't think READ_ONCE() has the guarantees we need to be a > sufficient replacement for smp_rmb() or a load-acquire that a real > lock would use, although I agree with you that, on arm64, it > apparently *is* a sufficient replacement. > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can > tolerate changes to pte_list_desc while in the critical section. I > don't think this is true (given for_each_rmap_spte_lockless), > therefore an smp_rmb() or equivalent is (technically) needed. > > Am I confused? Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change. The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[] can change, but the pointers themselves cannot. And with aging, the code is completely tolerant of an instable SPTE _value_ because test-only doesn't care about false negatives/positives, and test-and-clear is itself an atomic access i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives). > > (Though all of this works just fine as written on x86.) > > [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62 > [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/ > [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Sep 09, 2024, James Houghton wrote: > > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Mon, Sep 09, 2024, James Houghton wrote: > > > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > + */ > > > > > +#define KVM_RMAP_LOCKED BIT(1) > > > > > + > > > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) > > > > > +{ > > > > > + unsigned long old_val, new_val; > > > > > + > > > > > + old_val = READ_ONCE(rmap_head->val); > > > > > + if (!old_val) > > > > > + return 0; > > > > > + > > > > > + do { > > > > > + /* > > > > > + * If the rmap is locked, wait for it to be unlocked before > > > > > + * trying acquire the lock, e.g. to bounce the cache line. > > > > > + */ > > > > > + while (old_val & KVM_RMAP_LOCKED) { > > > > > + old_val = READ_ONCE(rmap_head->val); > > > > > + cpu_relax(); > > > > > + } > > > > > + > > > > > + /* > > > > > + * Recheck for an empty rmap, it may have been purged by the > > > > > + * task that held the lock. > > > > > + */ > > > > > + if (!old_val) > > > > > + return 0; > > > > > + > > > > > + new_val = old_val | KVM_RMAP_LOCKED; > > > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); > > > > > > > > I think we (technically) need an smp_rmb() here. I think cmpxchg > > > > implicitly has that on x86 (and this code is x86-only), but should we > > > > nonetheless document that we need smp_rmb() (if it indeed required)? > > > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`. > > > > > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably > > > even wrong. > > > > I don't think smp_mb__after_spinlock() is right either. This seems to > > be used following the acquisition of a spinlock to promote the memory > > ordering from an acquire barrier (that is implicit with the lock > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a > > stronger-than-usual barrier. But I guess I'm not really sure. > > > > In this case, I'm complaining that we don't have the usual memory > > ordering restrictions that come with a spinlock. > > What makes you think that? Ok I was under the impression that try_cmpxchg() did not carry memory ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()). Sorry.... So the way I would write this is with try_cmpxchg_relaxed() in the loop and then smp_rmb() after we break out of the loop, at least for the `old_val != 0` case. Kinda like this [4]. Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()? Just comparing against bit_spin_lock, test_and_set_bit_lock() documents the requirement for acquire barrier semantics[5]. [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253 [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51 > > > > For the !old_val case, there is a address/data dependency that can't be broken by > > > the CPU without violating the x86 memory model (all future actions with relevant > > > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux > > > kernel memory model, READ_ONCE() is responsible for ensuring that the address > > > dependency can't be morphed into a control dependency by the compiler and > > > subsequently reordered by the CPU. > > > > > > I.e. 
even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I > > > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of > > > __READ_ONCE() promotes the operation to an acquire. > > > > > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock, > > > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock, > > > the rmaps are fully protected by the critical section. > > > > I feel like a spinlock must include the appropriate barriers for it to > > correctly function as a spinlock, so I'm not sure I fully understand > > what you mean here. > > On TSO architectures, the atomic _is_ the barrier. E.g. atomic_try_cmpxchg_acquire() > eventually resolves to atomic_try_cmpxchg() on x86. Yeah I'm with you here. > And jumping back to the > "we don't have the usual memory ordering restrictions that come with a spinlock", > x86's virt_spin_lock() uses atomic_try_cmpxchg(). So while using the acquire > variant here is obviously not wrong, it also feels somewhat weird. Yeah that's fine. atomic_try_cmpxchg() is at least as strong as atomic_try_cmpxchg_acquire(), so there is no issue. But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() (and nothing else) instead, then you'd complain right? It would work on x86 (I think?), but it's not written properly! That's basically what I'm saying in this thread. > Though some > of that is undoubtedly due to explicit "acquire" semantics being rather rare in > x86. > > > > And for the SPTEs, there is no required ordering. The reader (aging > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for > > > either. I.e. there is no requirement that the reader observe a PRESENT > > > SPTE if there is a valid rmap. > > > > This makes sense. > > > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(), > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests > > > an ordering requirement that doesn't exist. > > > > So we have: the general kvm_rmap_lock() and the read-only > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use > > those names (sorry if it's confusing). > > > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So > > any changes we make to the rmap will be properly published to other > > threads that subsequently grab kvm_rmap_lock() because we had to > > properly release and then re-acquire mmu_lock, which comes with the > > barriers I'm saying we need. > > > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that > > it is possible that there may observe external changes to the > > pte_list_desc while we are in this critical section (for a > > sufficiently weak architecture). The changes that the kvm_rmap_lock() > > (mmu_lock) side made were half-published with an smp_wmb() (really > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it > > hasn't held up its end of the deal. > > > > I don't think READ_ONCE() has the guarantees we need to be a > > sufficient replacement for smp_rmb() or a load-acquire that a real > > lock would use, although I agree with you that, on arm64, it > > apparently *is* a sufficient replacement. > > > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can > > tolerate changes to pte_list_desc while in the critical section. 
I
> > don't think this is true (given for_each_rmap_spte_lockless),
> > therefore an smp_rmb() or equivalent is (technically) needed.
> >
> > Am I confused?
>
> Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents
> any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed
> at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

I take back what I said about this working on x86. I think it's possible for
there to be a race.

Say...

1. T1 modifies pte_list_desc then unlocks with kvm_rmap_unlock().
2. T2 then locks kvm_rmap_lock_readonly().

The modifications that T1 has made are not guaranteed to be visible to T2
unless T1 has an smp_wmb() (or equivalent) after the modification and T2 has an
smp_rmb() before reading the data.

Now the way you had it, T2, because it uses try_cmpxchg() with full ordering,
will effectively do an smp_rmb(). But T1 only does an smp_wmb() *after dropping
the mmu_lock*, so there is a race. While T1 still holds the mmu_lock but after
releasing the kvm_rmap_lock(), T2 may enter its critical section and then
*later* observe the changes that T1 made.

Now this is impossible on x86 (IIUC) if, in the compiled list of instructions,
T1's writes occur in the same order that we have written them in C. I'm not
sure if WRITE_ONCE guarantees that this reordering at compile time is
forbidden.

So what I'm saying is:

1. kvm_rmap_unlock() must have an smp_wmb().
2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which is what I
   think you want), then you must also have an smp_rmb() following a successful
   cmpxchg/acquisition (at least for the case where we then follow the
   pte_list_desc pointer).

> The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
> can change, but the pointers themselves cannot. And with aging, the code is
> completely tolerant of an instable SPTE _value_ because test-only doesn't care
> about false negatives/positives, and test-and-clear is itself an atomic access
> i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

I think we're on the same page for the rest of this.

> >
> > (Though all of this works just fine as written on x86.)
> >
> > [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> > [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> > [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
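Putting James's point 2 into code, the lock side of the proposal would be roughly the following, pairing with the smp_wmb() in the kvm_rmap_unlock() sketch earlier in the thread. Again, this is a sketch of the suggestion under discussion, not the code that was ultimately applied:

static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	old_val = READ_ONCE(rmap_head->val);
	if (!old_val)
		return 0;

	do {
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = READ_ONCE(rmap_head->val);
		}

		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;
	} while (!try_cmpxchg_relaxed(&rmap_head->val, &old_val, new_val));

	/*
	 * Hypothetical: the relaxed cmpxchg provides no ordering, so pair
	 * with the smp_wmb() in kvm_rmap_unlock() before dereferencing any
	 * pte_list_desc pointers.
	 */
	smp_rmb();

	/* Return the old value, i.e. _without_ the LOCKED bit set. */
	return old_val;
}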
On Mon, Sep 09, 2024, James Houghton wrote: > On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Sep 09, 2024, James Houghton wrote: > > > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote: > > > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be > > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably > > > > even wrong. > > > > > > I don't think smp_mb__after_spinlock() is right either. This seems to > > > be used following the acquisition of a spinlock to promote the memory > > > ordering from an acquire barrier (that is implicit with the lock > > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a > > > stronger-than-usual barrier. But I guess I'm not really sure. > > > > > > In this case, I'm complaining that we don't have the usual memory > > > ordering restrictions that come with a spinlock. > > > > What makes you think that? > > Ok I was under the impression that try_cmpxchg() did not carry memory > ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()). > Sorry.... > > So the way I would write this is with try_cmpxchg_relaxed() in the > loop and then smp_rmb() after we break out of the loop, at least for > the `old_val != 0` case. Kinda like this [4]. > > Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()? Heh, why would I care? On x86, they are the same thing. Yeah, I could tease out some subtle difference to super precisely describe KVM's behavior, but that more or less defeats the value proposition of total-store ordered architectures: take on more complexity in hardware (and possibly loss in performance) to allow for simpler software. > Just comparing against bit_spin_lock, test_and_set_bit_lock() > documents the requirement for acquire barrier semantics[5]. Again, that's generic library code. > [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253 > [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51 ... > > And jumping back to the "we don't have the usual memory ordering > > restrictions that come with a spinlock", x86's virt_spin_lock() uses > > atomic_try_cmpxchg(). So while using the acquire variant here is obviously > > not wrong, it also feels somewhat weird. > > Yeah that's fine. atomic_try_cmpxchg() is at least as strong as > atomic_try_cmpxchg_acquire(), so there is no issue. > > But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() (and > nothing else) instead, then you'd complain right? I'd complain because someone made me think hard for no reason. :-) > It would work on x86 (I think?), but it's not written properly! That's > basically what I'm saying in this thread. > > > Though some of that is undoubtedly due to explicit "acquire" semantics > > being rather rare in x86. > > > > > > And for the SPTEs, there is no required ordering. The reader (aging > > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for > > > > either. I.e. there is no requirement that the reader observe a PRESENT > > > > SPTE if there is a valid rmap. > > > > > > This makes sense. > > > > > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(), > > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests > > > > an ordering requirement that doesn't exist. 
> > > > > > So we have: the general kvm_rmap_lock() and the read-only > > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use > > > those names (sorry if it's confusing). > > > > > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So > > > any changes we make to the rmap will be properly published to other > > > threads that subsequently grab kvm_rmap_lock() because we had to > > > properly release and then re-acquire mmu_lock, which comes with the > > > barriers I'm saying we need. > > > > > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no > > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that > > > it is possible that there may observe external changes to the > > > pte_list_desc while we are in this critical section (for a > > > sufficiently weak architecture). The changes that the kvm_rmap_lock() > > > (mmu_lock) side made were half-published with an smp_wmb() (really > > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it > > > hasn't held up its end of the deal. > > > > > > I don't think READ_ONCE() has the guarantees we need to be a > > > sufficient replacement for smp_rmb() or a load-acquire that a real > > > lock would use, although I agree with you that, on arm64, it > > > apparently *is* a sufficient replacement. > > > > > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can > > > tolerate changes to pte_list_desc while in the critical section. I > > > don't think this is true (given for_each_rmap_spte_lockless), > > > therefore an smp_rmb() or equivalent is (technically) needed. > > > > > > Am I confused? > > > > Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents > > any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed > > at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change. > > I take back what I said about this working on x86. I think it's > possible for there to be a race. > > Say... > > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock(). > 2. T2 then locks kvm_rmap_lock_readonly(). > > The modifications that T1 has made are not guaranteed to be visible to > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication > and T2 has an smp_rmb() before reading the data. > > Now the way you had it, T2, because it uses try_cmpxchg() with full > ordering, will effectively do a smp_rmb(). But T1 only does an > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1 > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2 > may enter its critical section and then *later* observe the changes > that T1 made. > > Now this is impossible on x86 (IIUC) if, in the compiled list of > instructions, T1's writes occur in the same order that we have written > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering > at compile time is forbidden. > > So what I'm saying is: > > 1. kvm_rmap_unlock() must have an smp_wmb(). No, because beating a dead horse, this is not generic code, this is x86. > 2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which > is what I think you want), No, I don't. If this were generic code, then it would be a different conversation, but using a "relaxed" atomic in x86 specific code is nonsensical, because such an operation simply does not exist in the world of x86. There are memory operations that have relaxed ordering, but none of them are atomic; they are very much the exact opposite of atomic. 
E.g. the only references to _relaxed that you'll find in x86 are to override the generics to x86's non-relaxed semantics. $ git grep _relaxed arch/x86 arch/x86/include/asm/io.h:#define readb_relaxed(a) __readb(a) arch/x86/include/asm/io.h:#define readw_relaxed(a) __readw(a) arch/x86/include/asm/io.h:#define readl_relaxed(a) __readl(a) arch/x86/include/asm/io.h:#define writeb_relaxed(v, a) __writeb(v, a) arch/x86/include/asm/io.h:#define writew_relaxed(v, a) __writew(v, a) arch/x86/include/asm/io.h:#define writel_relaxed(v, a) __writel(v, a) arch/x86/include/asm/io.h:#define readq_relaxed(a) __readq(a) arch/x86/include/asm/io.h:#define writeq_relaxed(v, a) __writeq(v, a) There are a few places in KVM x86 where _acquire() and _release() variants are used, but that's done purely to document the roles and who is doing what, and almost always (maybe always?) when there are lockless programming shenanigans in play. E.g. kvm_recalculate_apic_map() is probably the closest example, where KVM uses atomic_read_acquire(), atomic_cmpxchg_acquire(), and atomic_cmpxchg_release() to document the ordering between marking the map as dirty and actually processing the map. But that is still quite different, as apic_map_dirty is not a spinlock, there isn't a direct data/address dependency between apic_map_dirty and all of the data it consumes, _and_ the data that is guarded by apic_map_dirty can be modified by other vCPUs _while it is being processed_. Hmm, actually, sev_lock_two_vms() is arguably a better comparison. That's also a rudimentary spinlock (try-lock only behavior). If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then I wouldn't be opposed to doing this in the locking code to document things: s/READ_ONCE/atomic_read_acquire s/WRITE_ONCE/atomic_set_release s/try_cmpxchg/atomic_cmpxchg_acquire But it's an "unsigned long", and so trying to massage it into the above isn't a net positive for me. This is gnarly x86 specific code; it's very unlikely that someone will be able to understand the rest of the rmap code, but won't be able to understand that the code is safe due to x86's strong memory ordering. > then you must also have an smp_rmb() following a successful > cmpxchg/acquisition (at least for the case where we then follow the > pte_list_desc pointer).
On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Sep 09, 2024, James Houghton wrote: > > I take back what I said about this working on x86. I think it's > > possible for there to be a race. > > > > Say... > > > > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock(). > > 2. T2 then locks kvm_rmap_lock_readonly(). > > > > The modifications that T1 has made are not guaranteed to be visible to > > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication > > and T2 has an smp_rmb() before reading the data. > > > > Now the way you had it, T2, because it uses try_cmpxchg() with full > > ordering, will effectively do a smp_rmb(). But T1 only does an > > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1 > > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2 > > may enter its critical section and then *later* observe the changes > > that T1 made. > > > > Now this is impossible on x86 (IIUC) if, in the compiled list of > > instructions, T1's writes occur in the same order that we have written > > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering > > at compile time is forbidden. > > > > So what I'm saying is: > > > > 1. kvm_rmap_unlock() must have an smp_wmb(). > > No, because beating a dead horse, this is not generic code, this is x86. What prevents the compiler from reordering (non-atomic, non-volatile) stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after the WRITE_ONCE()? IMV, such a reordering is currently permitted[1] (i.e., a barrier() is missing), and should the compiler choose to do this, the lock will not function correctly. > If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then > I wouldn't be opposed to doing this in the locking code to document things: > > s/READ_ONCE/atomic_read_acquire > s/WRITE_ONCE/atomic_set_release > s/try_cmpxchg/atomic_cmpxchg_acquire I think we can use atomic_long_t. It would be really great if we did a substitution like this. That would address my above concern about barrier() (atomic_set_release, for example, implies a barrier() that we otherwise need to include). [1]: https://www.kernel.org/doc/Documentation/memory-barriers.txt (GUARANTEES + COMPILER BARRIER)
On Tue, Sep 10, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > I take back what I said about this working on x86. I think it's
> > > possible for there to be a race.
> > >
> > > Say...
> > >
> > > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
> > > 2. T2 then locks kvm_rmap_lock_readonly().
> > >
> > > The modifications that T1 has made are not guaranteed to be visible to
> > > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
> > > and T2 has an smp_rmb() before reading the data.
> > >
> > > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > > ordering, will effectively do a smp_rmb(). But T1 only does an
> > > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > > may enter its critical section and then *later* observe the changes
> > > that T1 made.
> > >
> > > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > > instructions, T1's writes occur in the same order that we have written
> > > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > > at compile time is forbidden.
> > >
> > > So what I'm saying is:
> > >
> > > 1. kvm_rmap_unlock() must have an smp_wmb().
> >
> > No, because beating a dead horse, this is not generic code, this is x86.
>
> What prevents the compiler from reordering (non-atomic, non-volatile)
> stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
> the WRITE_ONCE()?

Oof, right, nothing.  Which is why __smp_store_release() has an explicit
barrier() before its WRITE_ONCE().

> IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
> missing), and should the compiler choose to do this, the lock will not
> function correctly.
>
> > If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> > I wouldn't be opposed to doing this in the locking code to document things:
> >
> >   s/READ_ONCE/atomic_read_acquire
> >   s/WRITE_ONCE/atomic_set_release
> >   s/try_cmpxchg/atomic_cmpxchg_acquire
>
> I think we can use atomic_long_t.

Duh.  That's a /facepalm moment.

> It would be really great if we did a substitution like this. That
> would address my above concern about barrier() (atomic_set_release,
> for example, implies a barrier() that we otherwise need to include).

Ya, agreed, not just for warm fuzzies, but because it's necessary to prevent
the compiler from being clever.
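Assuming the conversion lands roughly as discussed, i.e. with
kvm_rmap_head.val becoming an atomic_long_t, the locking helpers might end up
looking something like the untested sketch below (which also re-reads after
cpu_relax(), per the earlier feedback).  This is a sketch of the direction
agreed on in the thread, not the final code.

struct kvm_rmap_head {
	atomic_long_t val;	/* assumes the field is converted as discussed */
};

static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	old_val = atomic_long_read(&rmap_head->val);
	if (!old_val)
		return 0;

	do {
		/* Wait for the lock holder before retrying the cmpxchg. */
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = atomic_long_read(&rmap_head->val);
		}

		/* The rmap may have been emptied by the previous holder. */
		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;

		/* _acquire pairs with the _release in kvm_rmap_unlock(). */
	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));

	/* Return the old value, i.e. _without_ the LOCKED bit set. */
	return old_val;
}

static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, unsigned long new_val)
{
	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);

	/* _release keeps the caller's pte_list_desc stores ahead of the unlock. */
	atomic_long_set_release(&rmap_head->val, new_val);
}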
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8ca7f51c2da3..a683b5fc4026 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
 #define KVM_RMAP_MANY	BIT(0)
 
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write.  This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	old_val = READ_ONCE(rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		/*
+		 * If the rmap is locked, wait for it to be unlocked before
+		 * trying acquire the lock, e.g. to bounce the cache line.
+		 */
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = READ_ONCE(rmap_head->val);
+			cpu_relax();
+		}
+
+		/*
+		 * Recheck for an empty rmap, it may have been purged by the
+		 * task that held the lock.
+		 */
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
+
+	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	return old_val;
+}
+
+static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	WRITE_ONCE(rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+	return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
@@ -924,7 +986,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = rmap_head->val;
+	old_val = kvm_rmap_lock(rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -956,7 +1018,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	rmap_head->val = new_val;
+	kvm_rmap_unlock(rmap_head, new_val);
 
 	return count;
 }
@@ -1004,7 +1066,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -1030,7 +1092,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	rmap_head->val = rmap_val;
+	kvm_rmap_unlock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1048,7 +1110,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1067,13 +1129,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	struct pte_list_desc *desc;
 
 	if (!rmap_val)
@@ -1139,7 +1201,7 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	u64 *sptep;
 
 	if (!rmap_val)
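As a usage illustration only (not part of the patch), a lockless consumer of
the API above might look like the sketch below; the function name and the
test_fn callback are made up, and, as noted in the thread, a caller running
outside mmu_lock also needs to account for preemption.

/*
 * Hypothetical consumer of the per-rmap lock: walk every SPTE in the rmap
 * chain while holding the lock, then release the lock by writing back the
 * unchanged value that kvm_rmap_lock() returned.
 */
static bool kvm_rmap_test_sptes_lockless(struct kvm_rmap_head *rmap_head,
					 bool (*test_fn)(u64 *sptep))
{
	struct pte_list_desc *desc;
	unsigned long rmap_val;
	bool found = false;
	int i;

	rmap_val = kvm_rmap_lock(rmap_head);
	if (!rmap_val)
		return false;

	if (!(rmap_val & KVM_RMAP_MANY)) {
		/* A lone SPTE pointer is stored directly in the rmap entry. */
		found = test_fn((u64 *)rmap_val);
	} else {
		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
		for ( ; desc && !found; desc = desc->more) {
			for (i = 0; i < desc->spte_count; i++) {
				if (test_fn(desc->sptes[i])) {
					found = true;
					break;
				}
			}
		}
	}

	/* Unlock by restoring the value, which already lacks the LOCKED bit. */
	kvm_rmap_unlock(rmap_head, rmap_val);
	return found;
}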
Steal another bit from rmap entries (which are word-aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the
bit to implement a *very* rudimentary per-rmap spinlock.  The only anticipated
usage of the lock outside of mmu_lock is for aging gfns, and collisions
between aging and other MMU rmap operations are quite rare, e.g. unless
userspace is being silly and aging a tiny range over and over in a tight loop,
time between contention when aging an actively running VM is O(seconds).  In
short, a more sophisticated locking scheme shouldn't be necessary.

Note, the lock only protects the rmap structure itself; SPTEs that are pointed
at by a locked rmap can still be modified and zapped by another task (KVM
drops/zaps SPTEs before deleting the rmap entries).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 9 deletions(-)
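For what it's worth, the "free bits" claim in the changelog can be spelled
out with a couple of compile-time checks; this is an illustration only, not
part of the patch, and assumes the pte_list_desc layout from mmu.c.

/*
 * A multi-SPTE rmap entry is a pointer to a pte_list_desc, whose alignment
 * guarantees the low bits of the pointer are zero and thus free for flags:
 * at least 2 bits on 32-bit kernels and 3 bits on 64-bit kernels.
 */
static_assert(__alignof__(struct pte_list_desc) >= 4);
#ifdef CONFIG_64BIT
static_assert(__alignof__(struct pte_list_desc) >= 8);
#endif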