
[19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

Message ID 20240809194335.1726916-20-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Allow yielding on mmu_notifier zap

Commit Message

Sean Christopherson Aug. 9, 2024, 7:43 p.m. UTC
Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock.  The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
necessary.

Note, the lock only protects the rmap structure itself; SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 9 deletions(-)

Comments

Lai Jiangshan Aug. 12, 2024, 8:39 a.m. UTC | #1
On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote:

> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;
> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();

The sequence of these two lines of code can be improved.

> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
Sean Christopherson Aug. 12, 2024, 3:22 p.m. UTC | #2
On Mon, Aug 12, 2024, Lai Jiangshan wrote:
> On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> > +
> > +       do {
> > +               /*
> > +                * If the rmap is locked, wait for it to be unlocked before
> > +                * trying acquire the lock, e.g. to bounce the cache line.
> > +                */
> > +               while (old_val & KVM_RMAP_LOCKED) {
> > +                       old_val = READ_ONCE(rmap_head->val);
> > +                       cpu_relax();
> 
> The sequence of these two lines of code can be improved.

Oh yeah, duh, re-read after PAUSE, not before.

Definitely holler if you have any alternative ideas for walking rmaps without
taking mmu_lock, I guarantee you've spent more time than me thinking about the
shadow MMU :-)
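
For reference, a minimal sketch of the reordered wait loop being agreed to here,
i.e. with the re-read moved after cpu_relax() (illustrative only, not the posted
code):

	do {
		/* PAUSE first, then re-read, per Lai's feedback. */
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = READ_ONCE(rmap_head->val);
		}

		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;
	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));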
Lai Jiangshan Aug. 13, 2024, 6:35 a.m. UTC | #3
On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:

>
> Oh yeah, duh, re-read after PAUSE, not before.
>
> Definitely holler if you have any alternative ideas for walking rmaps
> without taking mmu_lock, I guarantee you've spent more time than me
> thinking about the shadow MMU :-)

We use the same bit and the same way for the rmap lock.

We just use bit_spin_lock() and the optimization for empty rmap_head is
handled out of kvm_rmap_lock().

bit_spin_lock() has the most-needed preempt_disable(). I'm not sure if the
new kvm_rmap_age_gfn_range_lockless() is called in a preempt disabled region.

Thanks
Lai
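
For illustration, a bit_spin_lock()-based variant along the lines Lai describes
might look like the sketch below; the helper names are hypothetical and the
empty-rmap optimization would live in the callers:

#include <linux/bit_spinlock.h>

/* Reuse bit 1 of rmap_head->val (KVM_RMAP_LOCKED) as a bit spinlock. */
static void kvm_rmap_bit_lock(struct kvm_rmap_head *rmap_head)
{
	/* bit_spin_lock() also disables preemption across the critical section. */
	bit_spin_lock(1, &rmap_head->val);
}

static void kvm_rmap_bit_unlock(struct kvm_rmap_head *rmap_head)
{
	bit_spin_unlock(1, &rmap_head->val);
}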
Sean Christopherson Aug. 13, 2024, 3:19 p.m. UTC | #4
On Tue, Aug 13, 2024, Lai Jiangshan wrote:
> On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> >
> > Oh yeah, duh, re-read after PAUSE, not before.
> >
> > Definitely holler if you have any alternative ideas for walking rmaps
> > without taking mmu_lock, I guarantee you've spent more time than me
> > thinking about the shadow MMU :-)
> 
> We use the same bit and the same way for the rmap lock.
> 
> We just use bit_spin_lock() and the optimization for empty rmap_head is
> handled out of kvm_rmap_lock().

Hmm, I'm leaning towards keeping the custom locking.  There are a handful of
benefits, none of which are all that meaningful on their own, but do add up.

 - Callers don't need to manually check for an empty rmap_head.
 - Can avoid the redundant preempt_{disable,enable}() entirely in the common case
   of being called while mmu_lock is held.
 - Handles the (likely super rare) edge case where a read-only walker encounters
   an rmap that was just emptied (rmap_head->val goes to zero after the initial
   check to elide the lock).
 - Avoids an atomic when releasing the lock, and any extra instructions entirely
   for writers since they always write the full rmap_head->val when releasing.

> bit_spin_lock() has the most-needed preempt_disable(). I'm not sure if the
> new kvm_rmap_age_gfn_range_lockless() is called in a preempt disabled region.

Oof, it doesn't.  Disabling IRQs crossed my mind, but I completely forgot about
preemption.

Thanks much!
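
A rough sketch of the fix being discussed (hypothetical; kvm_rmap_age_gfn_range_lockless()
comes later in the series and its real shape may differ, and this assumes
kvm_rmap_lock_readonly() mirrors kvm_rmap_lock() and returns the unlocked value):

static bool kvm_rmap_age_gfn_range_lockless_sketch(struct kvm_rmap_head *rmap_head)
{
	unsigned long rmap_val;
	bool young = false;

	/* bit_spin_lock() would do this implicitly; the custom lock must not forget it. */
	preempt_disable();

	rmap_val = kvm_rmap_lock_readonly(rmap_head);
	if (rmap_val) {
		/* ... walk the now-stable rmap chain, test/clear Accessed bits ... */
		kvm_rmap_unlock_readonly(rmap_head, rmap_val);
	}

	preempt_enable();
	return young;
}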
James Houghton Sept. 3, 2024, 8:17 p.m. UTC | #5
On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Steal another bit from rmap entries (which are word aligned pointers, i.e.
> have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
> the bit to implement a *very* rudimentary per-rmap spinlock.  The only
> anticipated usage of the lock outside of mmu_lock is for aging gfns, and
> collisions between aging and other MMU rmap operations are quite rare,
> e.g. unless userspace is being silly and aging a tiny range over and over
> in a tight loop, time between contention when aging an actively running VM
> is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
> necessary.
>
> Note, the lock only protects the rmap structure itself, SPTEs that are
> pointed at by a locked rmap can still be modified and zapped by another
> task (KVM drops/zaps SPTEs before deleting the rmap entries)
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8ca7f51c2da3..a683b5fc4026 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
>   * About rmap_head encoding:
>   *
>   * If the bit zero of rmap_head->val is clear, then it points to the only spte
> - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
> + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
>   * pte_list_desc containing more mappings.
>   */
>  #define KVM_RMAP_MANY  BIT(0)
>
> +/*
> + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> + * operates with mmu_lock held for write), but rmaps can be walked without
> + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> + * being zapped/dropped _while the rmap is locked_.
> + *
> + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> + * done while holding mmu_lock for write.  This allows a task walking rmaps
> + * without holding mmu_lock to concurrently walk the same entries as a task
> + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> + * the rmaps, thus the walks are stable.
> + *
> + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> + * ensures all "struct pte_list_desc" fields are stable.
> + */
> +#define KVM_RMAP_LOCKED        BIT(1)
> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;

I'm having trouble understanding how this bit works. What exactly is
stopping the rmap from being populated while we have it "locked"? We
aren't holding the MMU lock at all in the lockless case, and given
this bit, it is impossible (I think?) for the MMU-write-lock-holding,
rmap-modifying side to tell that this rmap is locked.

Concretely, my immediate concern is that we will still unconditionally
write 0 back at unlock time even if the value has changed.

I expect that this works and I'm just not getting it... :)


> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();
> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +                           unsigned long new_val)
> +{
> +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
> +       WRITE_ONCE(rmap_head->val, new_val);
> +}
> +
> +static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
> +{
> +       return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
> +}
> +
>  /*
>   * Returns the number of pointers in the rmap chain, not counting the new one.
>   */
> @@ -924,7 +986,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>         struct pte_list_desc *desc;
>         int count = 0;
>
> -       old_val = rmap_head->val;
> +       old_val = kvm_rmap_lock(rmap_head);
>
>         if (!old_val) {
>                 new_val = (unsigned long)spte;
> @@ -956,7 +1018,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>                 desc->sptes[desc->spte_count++] = spte;
>         }
>
> -       rmap_head->val = new_val;
> +       kvm_rmap_unlock(rmap_head, new_val);
>
>         return count;
>  }
> @@ -1004,7 +1066,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
>         unsigned long rmap_val;
>         int i;
>
> -       rmap_val = rmap_head->val;
> +       rmap_val = kvm_rmap_lock(rmap_head);
>         if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
>                 goto out;
>
> @@ -1030,7 +1092,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
>         }
>
>  out:
> -       rmap_head->val = rmap_val;
> +       kvm_rmap_unlock(rmap_head, rmap_val);
>  }
>
>  static void kvm_zap_one_rmap_spte(struct kvm *kvm,
> @@ -1048,7 +1110,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
>         unsigned long rmap_val;
>         int i;
>
> -       rmap_val = rmap_head->val;
> +       rmap_val = kvm_rmap_lock(rmap_head);
>         if (!rmap_val)
>                 return false;
>
> @@ -1067,13 +1129,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
>         }
>  out:
>         /* rmap_head is meaningless now, remember to reset it */
> -       rmap_head->val = 0;
> +       kvm_rmap_unlock(rmap_head, 0);
>         return true;
>  }
>
>  unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
>  {
> -       unsigned long rmap_val = rmap_head->val;
> +       unsigned long rmap_val = kvm_rmap_get(rmap_head);
>         struct pte_list_desc *desc;
>
>         if (!rmap_val)
> @@ -1139,7 +1201,7 @@ struct rmap_iterator {
>  static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
>                            struct rmap_iterator *iter)
>  {
> -       unsigned long rmap_val = rmap_head->val;
> +       unsigned long rmap_val = kvm_rmap_get(rmap_head);
>         u64 *sptep;
>
>         if (!rmap_val)
> --
> 2.46.0.76.ge559c4bf1a-goog
>
Sean Christopherson Sept. 3, 2024, 9:27 p.m. UTC | #6
On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
> > + */
> > +#define KVM_RMAP_LOCKED        BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> 
> I'm having trouble understanding how this bit works. What exactly is
> stopping the rmap from being populated while we have it "locked"?

Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for
aging only cares about existing mappings.  The key to correctness is that this
helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do
nothing and thus will never see any rmaps that come along in the future.

> aren't holding the MMU lock at all in the lockless case, and given
> this bit, it is impossible (I think?) for the MMU-write-lock-holding,
> rmap-modifying side to tell that this rmap is locked.
> 
> Concretely, my immediate concern is that we will still unconditionally
> write 0 back at unlock time even if the value has changed.

The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty,
i.e. wasn't actually locked.

+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				     unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
+	WRITE_ONCE(rmap_head->val, old_val);
+}

The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the
"normal", non-readonly versions if and only if mmu_lock is held for write.

+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	/*
+	 * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
+	 *       held for write.
+	 */
+	return __kvm_rmap_lock(rmap_head);
+}
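
Presumably the TODO resolves to something like the sketch below once @kvm is
plumbed through (not part of the posted series):

static unsigned long kvm_rmap_lock(struct kvm *kvm,
				   struct kvm_rmap_head *rmap_head)
{
	lockdep_assert_held_write(&kvm->mmu_lock);

	return __kvm_rmap_lock(rmap_head);
}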
James Houghton Sept. 3, 2024, 9:40 p.m. UTC | #7
On Tue, Sep 3, 2024 at 2:27 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Sep 03, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +/*
> > > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > > + * operates with mmu_lock held for write), but rmaps can be walked without
> > > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > > + * being zapped/dropped _while the rmap is locked_.
> > > + *
> > > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > > + * without holding mmu_lock to concurrently walk the same entries as a task
> > > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > > + * the rmaps, thus the walks are stable.
> > > + *
> > > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > > + * ensures all "struct pte_list_desc" fields are stable.
> > > + */
> > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > +       unsigned long old_val, new_val;
> > > +
> > > +       old_val = READ_ONCE(rmap_head->val);
> > > +       if (!old_val)
> > > +               return 0;
> >
> > I'm having trouble understanding how this bit works. What exactly is
> > stopping the rmap from being populated while we have it "locked"?
>
> Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for
> aging only cares about existing mappings.  The key to correctness is that this
> helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do
> nothing and thus will never see any rmaps that come along in the future.
>
> > aren't holding the MMU lock at all in the lockless case, and given
> > this bit, it is impossible (I think?) for the MMU-write-lock-holding,
> > rmap-modifying side to tell that this rmap is locked.
> >
> > Concretely, my immediate concern is that we will still unconditionally
> > write 0 back at unlock time even if the value has changed.
>
> The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty,
> i.e. wasn't actually locked.

Ah, that's what I was missing. Thanks! This all makes sense now.

>
> +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> +                                    unsigned long old_val)
> +{
> +       if (!old_val)
> +               return;
> +
> +       KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
> +       WRITE_ONCE(rmap_head->val, old_val);
> +}
>
> The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the
> "normal", non-readonly versions if and only if mmu_lock is held for write.
>
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       /*
> +        * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
> +        *       held for write.
> +        */
> +       return __kvm_rmap_lock(rmap_head);
> +}
James Houghton Sept. 9, 2024, 7 p.m. UTC | #8
On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Steal another bit from rmap entries (which are word aligned pointers, i.e.
> have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
> the bit to implement a *very* rudimentary per-rmap spinlock.  The only
> anticipated usage of the lock outside of mmu_lock is for aging gfns, and
> collisions between aging and other MMU rmap operations are quite rare,
> e.g. unless userspace is being silly and aging a tiny range over and over
> in a tight loop, time between contention when aging an actively running VM
> is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
> necessary.
>
> Note, the lock only protects the rmap structure itself, SPTEs that are
> pointed at by a locked rmap can still be modified and zapped by another
> task (KVM drops/zaps SPTEs before deleting the rmap entries)
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8ca7f51c2da3..a683b5fc4026 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
>   * About rmap_head encoding:
>   *
>   * If the bit zero of rmap_head->val is clear, then it points to the only spte
> - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
> + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
>   * pte_list_desc containing more mappings.
>   */
>  #define KVM_RMAP_MANY  BIT(0)
>
> +/*
> + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> + * operates with mmu_lock held for write), but rmaps can be walked without
> + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> + * being zapped/dropped _while the rmap is locked_.
> + *
> + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> + * done while holding mmu_lock for write.  This allows a task walking rmaps
> + * without holding mmu_lock to concurrently walk the same entries as a task
> + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> + * the rmaps, thus the walks are stable.
> + *
> + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> + * ensures all "struct pte_list_desc" fields are stable.

This last sentence makes me think we need to be careful about memory ordering.

> + */
> +#define KVM_RMAP_LOCKED        BIT(1)
> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;
> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();
> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

I think we (technically) need an smp_rmb() here. I think cmpxchg
implicitly has that on x86 (and this code is x86-only), but should we
nonetheless document that we need smp_rmb() (if it indeed required)?
Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

kvm_rmap_lock_readonly() should have an smp_rmb(), but it seems like
adding it here will do the right thing for the read-only lock side.

> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +                           unsigned long new_val)
> +{
> +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);

Same goes with having an smp_wmb() here. Is it necessary? If so,
should it at least be documented?

And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.

> +       WRITE_ONCE(rmap_head->val, new_val);
> +}
Sean Christopherson Sept. 9, 2024, 8:02 p.m. UTC | #9
On Mon, Sep 09, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
> 
> This last sentence makes me think we need to be careful about memory ordering.
> 
> > + */
> > +#define KVM_RMAP_LOCKED        BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> > +
> > +       do {
> > +               /*
> > +                * If the rmap is locked, wait for it to be unlocked before
> > +                * trying acquire the lock, e.g. to bounce the cache line.
> > +                */
> > +               while (old_val & KVM_RMAP_LOCKED) {
> > +                       old_val = READ_ONCE(rmap_head->val);
> > +                       cpu_relax();
> > +               }
> > +
> > +               /*
> > +                * Recheck for an empty rmap, it may have been purged by the
> > +                * task that held the lock.
> > +                */
> > +               if (!old_val)
> > +                       return 0;
> > +
> > +               new_val = old_val | KVM_RMAP_LOCKED;
> > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> 
> I think we (technically) need an smp_rmb() here. I think cmpxchg
> implicitly has that on x86 (and this code is x86-only), but should we
> nonetheless document that we need smp_rmb() (if it indeed required)?
> Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
even wrong.

For the !old_val case, there is an address/data dependency that can't be broken by
the CPU without violating the x86 memory model (all future actions with relevant
memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
kernel memory model, READ_ONCE() is responsible for ensuring that the address
dependency can't be morphed into a control dependency by the compiler and
subsequently reordered by the CPU.

I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
__READ_ONCE() promotes the operation to an acquire.

Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
the rmaps are fully protected by the critical section.  And for the SPTEs, there
is no required ordering.  The reader (aging thread) can observe a !PRESENT or a
PRESENT SPTE, and must be prepared for either.  I.e. there is no requirement that
the reader observe a PRESENT SPTE if there is a valid rmap.

So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
an ordering requirement that doesn't exist.

> kvm_rmap_lock_readonly() should have an smp_rmb(), but it seems like
> adding it here will do the right thing for the read-only lock side.
> 
> > +
> > +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> > +       return old_val;
> > +}
> > +
> > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> > +                           unsigned long new_val)
> > +{
> > +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
> 
> Same goes with having an smp_wmb() here. Is it necessary? If so,
> should it at least be documented?
> 
> And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.
> 
> > +       WRITE_ONCE(rmap_head->val, new_val);
> > +}
James Houghton Sept. 9, 2024, 8:46 p.m. UTC | #10
On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > + */
> > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > +       unsigned long old_val, new_val;
> > > +
> > > +       old_val = READ_ONCE(rmap_head->val);
> > > +       if (!old_val)
> > > +               return 0;
> > > +
> > > +       do {
> > > +               /*
> > > +                * If the rmap is locked, wait for it to be unlocked before
> > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > +                */
> > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > +                       old_val = READ_ONCE(rmap_head->val);
> > > +                       cpu_relax();
> > > +               }
> > > +
> > > +               /*
> > > +                * Recheck for an empty rmap, it may have been purged by the
> > > +                * task that held the lock.
> > > +                */
> > > +               if (!old_val)
> > > +                       return 0;
> > > +
> > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> >
> > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > implicitly has that on x86 (and this code is x86-only), but should we
> > nonetheless document that we need smp_rmb() (if it indeed required)?
> > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
>
> Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> even wrong.

I don't think smp_mb__after_spinlock() is right either. This seems to
be used following the acquisition of a spinlock to promote the memory
ordering from an acquire barrier (that is implicit with the lock
acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
stronger-than-usual barrier. But I guess I'm not really sure.

In this case, I'm complaining that we don't have the usual memory
ordering restrictions that come with a spinlock.

> For the !old_val case, there is an address/data dependency that can't be broken by
> the CPU without violating the x86 memory model (all future actions with relevant
> memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> kernel memory model, READ_ONCE() is responsible for ensuring that the address
> dependency can't be morphed into a control dependency by the compiler and
> subsequently reordered by the CPU.
>
> I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> __READ_ONCE() promotes the operation to an acquire.
>
> Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> the rmaps are fully protected by the critical section.

I feel like a spinlock must include the appropriate barriers for it to
correctly function as a spinlock, so I'm not sure I fully understand
what you mean here.

> And for the SPTEs, there
> is no required ordering.  The reader (aging thread) can observe a !PRESENT or a
> PRESENT SPTE, and must be prepared for either.  I.e. there is no requirement that
> the reader observe a PRESENT SPTE if there is a valid rmap.

This makes sense.

> So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> an ordering requirement that doesn't exist.

So we have: the general kvm_rmap_lock() and the read-only
kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
those names (sorry if it's confusing).

For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
any changes we make to the rmap will be properly published to other
threads that subsequently grab kvm_rmap_lock() because we had to
properly release and then re-acquire mmu_lock, which comes with the
barriers I'm saying we need.

For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
it is possible that we may observe external changes to the
pte_list_desc while we are in this critical section (for a
sufficiently weak architecture). The changes that the kvm_rmap_lock()
(mmu_lock) side made were half-published with an smp_wmb() (really
[3]), but the read side didn't use a load-acquire or smp_rmb(), so it
hasn't held up its end of the deal.

I don't think READ_ONCE() has the guarantees we need to be a
sufficient replacement for smp_rmb() or a load-acquire that a real
lock would use, although I agree with you that, on arm64, it
apparently *is* a sufficient replacement.

Now this isn't a problem if the kvm_rmap_lock_readonly() side can
tolerate changes to pte_list_desc while in the critical section. I
don't think this is true (given for_each_rmap_spte_lockless),
therefore an smp_rmb() or equivalent is (technically) needed.

Am I confused?

(Though all of this works just fine as written on x86.)

[1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
[2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
[3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
Sean Christopherson Sept. 9, 2024, 10:02 p.m. UTC | #11
On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > + */
> > > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > > +
> > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > +{
> > > > +       unsigned long old_val, new_val;
> > > > +
> > > > +       old_val = READ_ONCE(rmap_head->val);
> > > > +       if (!old_val)
> > > > +               return 0;
> > > > +
> > > > +       do {
> > > > +               /*
> > > > +                * If the rmap is locked, wait for it to be unlocked before
> > > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > > +                */
> > > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > > +                       old_val = READ_ONCE(rmap_head->val);
> > > > +                       cpu_relax();
> > > > +               }
> > > > +
> > > > +               /*
> > > > +                * Recheck for an empty rmap, it may have been purged by the
> > > > +                * task that held the lock.
> > > > +                */
> > > > +               if (!old_val)
> > > > +                       return 0;
> > > > +
> > > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > >
> > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > implicitly has that on x86 (and this code is x86-only), but should we
> > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> >
> > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > even wrong.
> 
> I don't think smp_mb__after_spinlock() is right either. This seems to
> be used following the acquisition of a spinlock to promote the memory
> ordering from an acquire barrier (that is implicit with the lock
> acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> stronger-than-usual barrier. But I guess I'm not really sure.
> 
> In this case, I'm complaining that we don't have the usual memory
> ordering restrictions that come with a spinlock.

What makes you think that?

> > For the !old_val case, there is an address/data dependency that can't be broken by
> > the CPU without violating the x86 memory model (all future actions with relevant
> > memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > dependency can't be morphed into a control dependency by the compiler and
> > subsequently reordered by the CPU.
> >
> > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > __READ_ONCE() promotes the operation to an acquire.
> >
> > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> > the rmaps are fully protected by the critical section.
> 
> I feel like a spinlock must include the appropriate barriers for it to
> correctly function as a spinlock, so I'm not sure I fully understand
> what you mean here.

On TSO architectures, the atomic _is_ the barrier.  E.g. atomic_try_cmpxchg_acquire()
eventually resolves to atomic_try_cmpxchg() on x86.  And jumping back to the
"we don't have the usual memory ordering restrictions that come with a spinlock",
x86's virt_spin_lock() uses atomic_try_cmpxchg().  So while using the acquire
variant here is obviously not wrong, it also feels somewhat weird.  Though some
of that is undoubtedly due to explicit "acquire" semantics being rather rare in
x86.

> > And for the SPTEs, there is no required ordering.  The reader (aging
> > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > either.  I.e. there is no requirement that the reader observe a PRESENT
> > SPTE if there is a valid rmap.
> 
> This makes sense.
> 
> > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > an ordering requirement that doesn't exist.
> 
> So we have: the general kvm_rmap_lock() and the read-only
> kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> those names (sorry if it's confusing).
> 
> For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> any changes we make to the rmap will be properly published to other
> threads that subsequently grab kvm_rmap_lock() because we had to
> properly release and then re-acquire mmu_lock, which comes with the
> barriers I'm saying we need.
> 
> For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> it is possible that there may observe external changes to the
> pte_list_desc while we are in this critical section (for a
> sufficiently weak architecture). The changes that the kvm_rmap_lock()
> (mmu_lock) side made were half-published with an smp_wmb() (really
> [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> hasn't held up its end of the deal.
> 
> I don't think READ_ONCE() has the guarantees we need to be a
> sufficient replacement for smp_rmb() or a load-acquire that a real
> lock would use, although I agree with you that, on arm64, it
> apparently *is* a sufficient replacement.
> 
> Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> tolerate changes to pte_list_desc while in the critical section. I
> don't think this is true (given for_each_rmap_spte_lockless),
> therefore an smp_rmb() or equivalent is (technically) needed.
> 
> Am I confused?

Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
can change, but the pointers themselves cannot.  And with aging, the code is
completely tolerant of an unstable SPTE _value_ because test-only doesn't care
about false negatives/positives, and test-and-clear is itself an atomic access
i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

> 
> (Though all of this works just fine as written on x86.)
> 
> [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
James Houghton Sept. 10, 2024, 12:28 a.m. UTC | #12
On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Sep 09, 2024, James Houghton wrote:
> > > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > + */
> > > > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > > > +
> > > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > > +{
> > > > > +       unsigned long old_val, new_val;
> > > > > +
> > > > > +       old_val = READ_ONCE(rmap_head->val);
> > > > > +       if (!old_val)
> > > > > +               return 0;
> > > > > +
> > > > > +       do {
> > > > > +               /*
> > > > > +                * If the rmap is locked, wait for it to be unlocked before
> > > > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > > > +                */
> > > > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > > > +                       old_val = READ_ONCE(rmap_head->val);
> > > > > +                       cpu_relax();
> > > > > +               }
> > > > > +
> > > > > +               /*
> > > > > +                * Recheck for an empty rmap, it may have been purged by the
> > > > > +                * task that held the lock.
> > > > > +                */
> > > > > +               if (!old_val)
> > > > > +                       return 0;
> > > > > +
> > > > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > > >
> > > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > > implicitly has that on x86 (and this code is x86-only), but should we
> > > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> > >
> > > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > even wrong.
> >
> > I don't think smp_mb__after_spinlock() is right either. This seems to
> > be used following the acquisition of a spinlock to promote the memory
> > ordering from an acquire barrier (that is implicit with the lock
> > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > stronger-than-usual barrier. But I guess I'm not really sure.
> >
> > In this case, I'm complaining that we don't have the usual memory
> > ordering restrictions that come with a spinlock.
>
> What makes you think that?

Ok I was under the impression that try_cmpxchg() did not carry memory
ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
Sorry....

So the way I would write this is with try_cmpxchg_relaxed() in the
loop and then smp_rmb() after we break out of the loop, at least for
the `old_val != 0` case. Kinda like this [4].

Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Just comparing against bit_spin_lock, test_and_set_bit_lock()
documents the requirement for acquire barrier semantics[5].

[4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
[5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

>
> > > For the !old_val case, there is an address/data dependency that can't be broken by
> > > the CPU without violating the x86 memory model (all future actions with relevant
> > > memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> > > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > > dependency can't be morphed into a control dependency by the compiler and
> > > subsequently reordered by the CPU.
> > >
> > > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > > __READ_ONCE() promotes the operation to an acquire.
> > >
> > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > > hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> > > the rmaps are fully protected by the critical section.
> >
> > I feel like a spinlock must include the appropriate barriers for it to
> > correctly function as a spinlock, so I'm not sure I fully understand
> > what you mean here.
>
> On TSO architectures, the atomic _is_ the barrier.  E.g. atomic_try_cmpxchg_acquire()
> eventually resolves to atomic_try_cmpxchg() on x86.

Yeah I'm with you here.

> And jumping back to the
> "we don't have the usual memory ordering restrictions that come with a spinlock",
> x86's virt_spin_lock() uses atomic_try_cmpxchg().  So while using the acquire
> variant here is obviously not wrong, it also feels somewhat weird.

Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
atomic_try_cmpxchg_acquire(), so there is no issue.

But if virt_spin_lock() were written to use
atomic_try_cmpxchg_relaxed() (and nothing else) instead, then you'd
complain right? It would work on x86 (I think?), but it's not written
properly! That's basically what I'm saying in this thread.

> Though some
> of that is undoubtedly due to explicit "acquire" semantics being rather rare in
> x86.
>
> > > And for the SPTEs, there is no required ordering.  The reader (aging
> > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > either.  I.e. there is no requirement that the reader observe a PRESENT
> > > SPTE if there is a valid rmap.
> >
> > This makes sense.
> >
> > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > an ordering requirement that doesn't exist.
> >
> > So we have: the general kvm_rmap_lock() and the read-only
> > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > those names (sorry if it's confusing).
> >
> > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > any changes we make to the rmap will be properly published to other
> > threads that subsequently grab kvm_rmap_lock() because we had to
> > properly release and then re-acquire mmu_lock, which comes with the
> > barriers I'm saying we need.
> >
> > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > it is possible that there may observe external changes to the
> > pte_list_desc while we are in this critical section (for a
> > sufficiently weak architecture). The changes that the kvm_rmap_lock()
> > (mmu_lock) side made were half-published with an smp_wmb() (really
> > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > hasn't held up its end of the deal.
> >
> > I don't think READ_ONCE() has the guarantees we need to be a
> > sufficient replacement for smp_rmb() or a load-acquire that a real
> > lock would use, although I agree with you that, on arm64, it
> > apparently *is* a sufficient replacement.
> >
> > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > tolerate changes to pte_list_desc while in the critical section. I
> > don't think this is true (given for_each_rmap_spte_lockless),
> > therefore an smp_rmb() or equivalent is (technically) needed.
> >
> > Am I confused?
>
> Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
> any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
> at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

I take back what I said about this working on x86. I think it's
possible for there to be a race.

Say...

1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
2. T2 then locks kvm_rmap_lock_readonly().

The modifications that T1 has made are not guaranteed to be visible to
T2 unless T1 has an smp_wmb() (or equivalent) after the modification
and T2 has an smp_rmb() before reading the data.

Now the way you had it, T2, because it uses try_cmpxchg() with full
ordering, will effectively do a smp_rmb(). But T1 only does an
smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
may enter its critical section and then *later* observe the changes
that T1 made.

Now this is impossible on x86 (IIUC) if, in the compiled list of
instructions, T1's writes occur in the same order that we have written
them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
at compile time is forbidden.

So what I'm saying is:

1. kvm_rmap_unlock() must have an smp_wmb().
2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
is what I think you want), then you must also have an smp_rmb()
following a successful cmpxchg/acquisition (at least for the case
where we then follow the pte_list_desc pointer).
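
Concretely, the barrier placement James is proposing would look something like
the sketch below (hypothetical; as Sean argues below, the posted x86-only code
doesn't need the explicit barriers):

static unsigned long kvm_rmap_lock_sketch(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	old_val = READ_ONCE(rmap_head->val);
	if (!old_val)
		return 0;

	do {
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = READ_ONCE(rmap_head->val);
		}

		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;
	} while (!try_cmpxchg_relaxed(&rmap_head->val, &old_val, new_val));

	/* Point 2: pairs with the smp_wmb() in kvm_rmap_unlock_sketch(). */
	smp_rmb();
	return old_val;
}

static void kvm_rmap_unlock_sketch(struct kvm_rmap_head *rmap_head,
				   unsigned long new_val)
{
	/* Point 1: publish rmap/pte_list_desc updates before dropping the lock bit. */
	smp_wmb();
	WRITE_ONCE(rmap_head->val, new_val);
}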

> The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
> can change, but the pointers themselves cannot.  And with aging, the code is
> completely tolerant of an unstable SPTE _value_ because test-only doesn't care
> about false negatives/positives, and test-and-clear is itself an atomic access
> i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

I think we're on the same page for the rest of this.

>
> >
> > (Though all of this works just fine as written on x86.)
> >
> > [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> > [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> > [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
Sean Christopherson Sept. 10, 2024, 1:42 a.m. UTC | #13
On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > > even wrong.
> > >
> > > I don't think smp_mb__after_spinlock() is right either. This seems to
> > > be used following the acquisition of a spinlock to promote the memory
> > > ordering from an acquire barrier (that is implicit with the lock
> > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > > stronger-than-usual barrier. But I guess I'm not really sure.
> > >
> > > In this case, I'm complaining that we don't have the usual memory
> > > ordering restrictions that come with a spinlock.
> >
> > What makes you think that?
> 
> Ok I was under the impression that try_cmpxchg() did not carry memory
> ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
> Sorry....
> 
> So the way I would write this is with try_cmpxchg_relaxed() in the
> loop and then smp_rmb() after we break out of the loop, at least for
> the `old_val != 0` case. Kinda like this [4].
> 
> Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Heh, why would I care?  On x86, they are the same thing.  Yeah, I could tease out
some subtle difference to super precisely describe KVM's behavior, but that more
or less defeats the value proposition of total-store ordered architectures: take
on more complexity in hardware (and possibly loss in performance) to allow for
simpler software.

> Just comparing against bit_spin_lock, test_and_set_bit_lock()
> documents the requirement for acquire barrier semantics[5].

Again, that's generic library code.

> [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
> [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

...

> > And jumping back to the "we don't have the usual memory ordering
> > restrictions that come with a spinlock", x86's virt_spin_lock() uses
> > atomic_try_cmpxchg().  So while using the acquire variant here is obviously
> > not wrong, it also feels somewhat weird.
> 
> Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
> atomic_try_cmpxchg_acquire(), so there is no issue.
> 
> But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() (and
> nothing else) instead, then you'd complain right?

I'd complain because someone made me think hard for no reason. :-)

> It would work on x86 (I think?), but it's not written properly! That's
> basically what I'm saying in this thread.
> 
> > Though some of that is undoubtedly due to explicit "acquire" semantics
> > being rather rare in x86.
> >
> > > > And for the SPTEs, there is no required ordering.  The reader (aging
> > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > > either.  I.e. there is no requirement that the reader observe a PRESENT
> > > > SPTE if there is a valid rmap.
> > >
> > > This makes sense.
> > >
> > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > > an ordering requirement that doesn't exist.
> > >
> > > So we have: the general kvm_rmap_lock() and the read-only
> > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > > those names (sorry if it's confusing).
> > >
> > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > > any changes we make to the rmap will be properly published to other
> > > threads that subsequently grab kvm_rmap_lock() because we had to
> > > properly release and then re-acquire mmu_lock, which comes with the
> > > barriers I'm saying we need.
> > >
> > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > it is possible that we may observe external changes to the
> > > pte_list_desc while we are in this critical section (for a
> > > sufficiently weak architecture). The changes that the kvm_rmap_lock()
> > > (mmu_lock) side made were half-published with an smp_wmb() (really
> > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > > hasn't held up its end of the deal.
> > >
> > > I don't think READ_ONCE() has the guarantees we need to be a
> > > sufficient replacement for smp_rmb() or a load-acquire that a real
> > > lock would use, although I agree with you that, on arm64, it
> > > apparently *is* a sufficient replacement.
> > >
> > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > > tolerate changes to pte_list_desc while in the critical section. I
> > > don't think this is true (given for_each_rmap_spte_lockless),
> > > therefore an smp_rmb() or equivalent is (technically) needed.
> > >
> > > Am I confused?
> >
> > Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
> > any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
> > at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.
> 
> I take back what I said about this working on x86. I think it's
> possible for there to be a race.
> 
> Say...
> 
> 1. T1 modifies pte_list_desc then drops the rmap lock with kvm_rmap_unlock().
> 2. T2 then locks kvm_rmap_lock_readonly().
> 
> The modifications that T1 has made are not guaranteed to be visible to
> T2 unless T1 has an smp_wmb() (or equivalent) after the modification
> and T2 has an smp_rmb() before reading the data.
> 
> Now the way you had it, T2, because it uses try_cmpxchg() with full
> ordering, will effectively do a smp_rmb(). But T1 only does an
> smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> may enter its critical section and then *later* observe the changes
> that T1 made.
> 
> Now this is impossible on x86 (IIUC) if, in the compiled list of
> instructions, T1's writes occur in the same order that we have written
> them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> at compile time is forbidden.
> 
> So what I'm saying is:
> 
> 1. kvm_rmap_unlock() must have an smp_wmb().

No, because beating a dead horse, this is not generic code, this is x86. 

> 2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
> is what I think you want), 

No, I don't.  If this were generic code, then it would be a different conversation,
but using a "relaxed" atomic in x86 specific code is nonsensical, because such an
operation simply does not exist in the world of x86.  There are memory operations
that have relaxed ordering, but none of them are atomic; they are very much the
exact opposite of atomic.

E.g. the only references to _relaxed that you'll find in x86 are to override the
generics to x86's non-relaxed semantics.

$ git grep _relaxed arch/x86
arch/x86/include/asm/io.h:#define readb_relaxed(a) __readb(a)
arch/x86/include/asm/io.h:#define readw_relaxed(a) __readw(a)
arch/x86/include/asm/io.h:#define readl_relaxed(a) __readl(a)
arch/x86/include/asm/io.h:#define writeb_relaxed(v, a) __writeb(v, a)
arch/x86/include/asm/io.h:#define writew_relaxed(v, a) __writew(v, a)
arch/x86/include/asm/io.h:#define writel_relaxed(v, a) __writel(v, a)
arch/x86/include/asm/io.h:#define readq_relaxed(a)      __readq(a)
arch/x86/include/asm/io.h:#define writeq_relaxed(v, a)  __writeq(v, a)

There are a few places in KVM x86 where _acquire() and _release() variants are
used, but that's done purely to document the roles and who is doing what, and
almost always (maybe always?) when there are lockless programming shenanigans
in play.

E.g. kvm_recalculate_apic_map() is probably the closest example, where KVM uses
atomic_read_acquire(), atomic_cmpxchg_acquire(), and atomic_cmpxchg_release() to
document the ordering between marking the map as dirty and actually processing
the map.  But that is still quite different, as apic_map_dirty is not a spinlock,
there isn't a direct data/address dependency between apic_map_dirty and all of
the data it consumes, _and_ the data that is guarded by apic_map_dirty can be
modified by other vCPUs _while it is being processed_.
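
Purely for illustration (invented names, not the actual apic map code), the
shape of that pairing is a release store to publish the data and an acquire
load to consume it; the cmpxchg variants only come into play for claiming and
clearing the dirty flag, which is elided here:

#include <linux/atomic.h>
#include <linux/kernel.h>

static int my_map[16];			/* data guarded by the dirty flag */
static atomic_t my_map_dirty;		/* 0 == clean, 1 == dirty */

/* Writer: update the data, then publish it with release semantics. */
static void my_mark_map_dirty(int idx, int val)
{
	my_map[idx] = val;
	atomic_set_release(&my_map_dirty, 1);
}

/* Reader: acquire the flag before consuming the data it guards. */
static int my_process_map(void)
{
	int i, sum = 0;

	if (!atomic_read_acquire(&my_map_dirty))
		return 0;

	for (i = 0; i < ARRAY_SIZE(my_map); i++)
		sum += my_map[i];	/* guaranteed to see the writer's stores */

	return sum;
}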

Hmm, actually, sev_lock_two_vms() is arguably a better comparison.  That's also
a rudimentary spinlock (try-lock only behavior).

If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
I wouldn't be opposed to doing this in the locking code to document things:

 s/READ_ONCE/atomic_read_acquire
 s/WRITE_ONCE/atomic_set_release
 s/try_cmpxchg/atomic_cmpxchg_acquire

But it's an "unsigned long", and so trying to massage it into the above isn't a
net positive for me.  This is gnarly x86 specific code; it's very unlikely that
someone will be able to understand the rest of the rmap code, but won't be able
to understand that the code is safe due to x86's strong memory ordering.

> then you must also have an smp_rmb() following a successful
> cmpxchg/acquisition (at least for the case where we then follow the
> pte_list_desc pointer).
James Houghton Sept. 10, 2024, 9:11 p.m. UTC | #14
On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > I take back what I said about this working on x86. I think it's
> > possible for there to be a race.
> >
> > Say...
> >
> > 1. T1 modifies pte_list_desc then drops the rmap lock with kvm_rmap_unlock().
> > 2. T2 then locks kvm_rmap_lock_readonly().
> >
> > The modifications that T1 has made are not guaranteed to be visible to
> > T2 unless T1 has an smp_wmb() (or equivalent) after the modification
> > and T2 has an smp_rmb() before reading the data.
> >
> > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > ordering, will effectively do a smp_rmb(). But T1 only does an
> > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > may enter its critical section and then *later* observe the changes
> > that T1 made.
> >
> > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > instructions, T1's writes occur in the same order that we have written
> > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > at compile time is forbidden.
> >
> > So what I'm saying is:
> >
> > 1. kvm_rmap_unlock() must have an smp_wmb().
>
> No, because beating a dead horse, this is not generic code, this is x86.

What prevents the compiler from reordering (non-atomic, non-volatile)
stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
the WRITE_ONCE()?

IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
missing), and should the compiler choose to do this, the lock will not
function correctly.

> If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> I wouldn't be opposed to doing this in the locking code to document things:
>
>  s/READ_ONCE/atomic_read_acquire
>  s/WRITE_ONCE/atomic_set_release
>  s/try_cmpxchg/atomic_cmpxchg_acquire

I think we can use atomic_long_t.

It would be really great if we did a substitution like this. That
would address my above concern about barrier() (atomic_set_release,
for example, implies a barrier() that we otherwise need to include).
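
The barrier() I'm saying is missing would slot in like this (sketch only, not
a proposed patch; switching to atomic_long_set_release() would subsume it):

static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
			    unsigned long new_val)
{
	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
	/*
	 * Keep the stores done while the rmap was locked (pte_list_desc
	 * updates, etc.) above the store that drops the lock; without a
	 * compiler barrier, nothing guarantees that ordering.
	 */
	barrier();
	WRITE_ONCE(rmap_head->val, new_val);
}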

[1]: https://www.kernel.org/doc/Documentation/memory-barriers.txt
(GUARANTEES + COMPILER BARRIER)
Sean Christopherson Sept. 11, 2024, 3:33 p.m. UTC | #15
On Tue, Sep 10, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > I take back what I said about this working on x86. I think it's
> > > possible for there to be a race.
> > >
> > > Say...
> > >
> > > 1. T1 modifies pte_list_desc then drops the rmap lock with kvm_rmap_unlock().
> > > 2. T2 then locks kvm_rmap_lock_readonly().
> > >
> > > The modifications that T1 has made are not guaranteed to be visible to
> > > T2 unless T1 has an smp_wmb() (or equivalent) after the modification
> > > and T2 has an smp_rmb() before reading the data.
> > >
> > > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > > ordering, will effectively do a smp_rmb(). But T1 only does an
> > > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > > may enter its critical section and then *later* observe the changes
> > > that T1 made.
> > >
> > > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > > instructions, T1's writes occur in the same order that we have written
> > > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > > at compile time is forbidden.
> > >
> > > So what I'm saying is:
> > >
> > > 1. kvm_rmap_unlock() must have an smp_wmb().
> >
> > No, because beating a dead horse, this is not generic code, this is x86.
> 
> What prevents the compiler from reordering (non-atomic, non-volatile)
> stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
> the WRITE_ONCE()?

Oof, right, nothing.  Which is why __smp_store_release() has an explicit
barrier() before its WRITE_ONCE().
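
(For reference, x86's definition is, roughly:

	#define __smp_store_release(p, v)			\
	do {							\
		compiletime_assert_atomic_type(*p);		\
		barrier();					\
		WRITE_ONCE(*p, v);				\
	} while (0)

i.e. on x86 the "release" part is purely a compiler barrier.)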

> IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
> missing), and should the compiler choose to do this, the lock will not
> function correctly.
> 
> > If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> > I wouldn't be opposed to doing this in the locking code to document things:
> >
> >  s/READ_ONCE/atomic_read_acquire
> >  s/WRITE_ONCE/atomic_set_release
> >  s/try_cmpxchg/atomic_cmpxchg_acquire
> 
> I think we can use atomic_long_t.

Duh.  That's a /facepalm moment.

> It would be really great if we did a substitution like this. That
> would address my above concern about barrier() (atomic_set_release,
> for example, implies a barrier() that we otherwise need to include).

Ya, agreed, not just for warm fuzzies, but because it's necessary to prevent
the compiler from being clever.
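
Something along these lines, i.e. an untested sketch that assumes
kvm_rmap_head.val becomes an atomic_long_t (everything else unchanged from
the patch below):

static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	old_val = atomic_long_read(&rmap_head->val);
	if (!old_val)
		return 0;

	do {
		/* Wait for the holder to drop the lock before retrying. */
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = atomic_long_read(&rmap_head->val);
		}

		/* Recheck for an empty rmap, the holder may have purged it. */
		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;
	/* Acquire pairs with the release in kvm_rmap_unlock() below. */
	} while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));

	/* Return the old value, i.e. _without_ the LOCKED bit set. */
	return old_val;
}

static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
			    unsigned long new_val)
{
	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
	/* Release orders the critical section's stores before the unlock. */
	atomic_long_set_release(&rmap_head->val, new_val);
}
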
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8ca7f51c2da3..a683b5fc4026 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -909,11 +909,73 @@  static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
 #define KVM_RMAP_MANY	BIT(0)
 
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write.  This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	old_val = READ_ONCE(rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		/*
+		 * If the rmap is locked, wait for it to be unlocked before
+		 * trying acquire the lock, e.g. to bounce the cache line.
+		 */
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = READ_ONCE(rmap_head->val);
+			cpu_relax();
+		}
+
+		/*
+		 * Recheck for an empty rmap, it may have been purged by the
+		 * task that held the lock.
+		 */
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
+
+	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	return old_val;
+}
+
+static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	WRITE_ONCE(rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+	return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
@@ -924,7 +986,7 @@  static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = rmap_head->val;
+	old_val = kvm_rmap_lock(rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -956,7 +1018,7 @@  static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	rmap_head->val = new_val;
+	kvm_rmap_unlock(rmap_head, new_val);
 
 	return count;
 }
@@ -1004,7 +1066,7 @@  static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -1030,7 +1092,7 @@  static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	rmap_head->val = rmap_val;
+	kvm_rmap_unlock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1048,7 +1110,7 @@  static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1067,13 +1129,13 @@  static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	struct pte_list_desc *desc;
 
 	if (!rmap_val)
@@ -1139,7 +1201,7 @@  struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	u64 *sptep;
 
 	if (!rmap_val)