From patchwork Fri Jan 12 20:38:20 2024
X-Patchwork-Submitter: "Woodhouse, David"
X-Patchwork-Id: 13518702
From: "Woodhouse, David"
To: "kvm@vger.kernel.org"
CC: "pbonzini@redhat.com", "seanjc@google.com", "Durrant, Paul"
Subject: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
Date: Fri, 12 Jan 2024 20:38:20 +0000
Message-ID: <9a82db197449bdb97ee889d2f3cdd7998abd9692.camel@amazon.co.uk>

__kvm_gpc_refresh() can race with kvm_gpc_deactivate(). Since that function
does not take the ->refresh_lock, it can wipe and unmap the pfn and khva
while hva_to_pfn_retry() has dropped its write lock on gpc->lock.

Then if hva_to_pfn_retry() determines that the PFN hasn't changed and that
it can reuse the old pfn and khva, they get assigned back to gpc->pfn and
gpc->khva even though the khva was already unmapped by kvm_gpc_deactivate().
This leaves the cache in an apparently valid state, but with ->khva pointing
to an address which has been unmapped, which in turn leads to oopses in
e.g. __kvm_xen_has_interrupt() and set_shinfo_evtchn_pending().

It may be possible to fix this just by making kvm_gpc_deactivate() take the
->refresh_lock, but that still leaves ->refresh_lock being basically
redundant with the write lock on ->lock, which frankly makes my skin itch,
given the way that hva_to_pfn_retry() operates on fields in the gpc without
holding ->lock.

Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It no
longer operates on the gpc object at all; it is called with a uhva and
returns the corresponding pfn (pinned), and a mapped khva for it.

The calling function __kvm_gpc_refresh() now drops ->lock before calling
hva_to_pfn_retry(), then retakes the lock before checking for changes, and
discards the new mapping if it lost a race. It also correctly notes the old
pfn/khva to be unmapped at the right time, instead of preserving them in a
local variable while dropping the lock.

The optimisation in hva_to_pfn_retry() where it attempts to reuse the old
mapping if the pfn doesn't change is dropped, since it makes the pinning
more complex. It's a pointless optimisation anyway, since the odds of the
pfn ending up the same when the uhva has changed (i.e. the odds of the two
userspace addresses both pointing to the same underlying physical page)
are negligible.

I remain slightly confused because, although this is clearly a race in the
gfn_to_pfn_cache code, I don't quite know how the Xen support code actually
managed to trigger it. We've seen oopses from dereferencing a valid-looking
->khva in both __kvm_xen_has_interrupt() (the vcpu_info) and in
set_shinfo_evtchn_pending() (the shared_info). But surely the race
shouldn't happen for the vcpu_info gpc because all calls to both refresh
and deactivate hold the vcpu mutex, and it shouldn't happen for the
shared_info gpc because all calls to both will hold the
kvm->arch.xen.xen_lock mutex.

Signed-off-by: David Woodhouse
---
This is based on (and in) my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
which has all the other outstanding KVM/Xen fixes.
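
The shape of the new refresh path, for reference: invalidate, drop ->lock,
do the (possibly sleeping) lookup and map, retake ->lock, then either
install the new mapping or mark it for discard. Below is a minimal sketch
of that pattern, not the actual pfncache code: demo_cache, map_uhva() and
unmap_khva() are simplified stand-ins, and the refresh_lock and the
mmu_notifier retry loop are omitted.

#include <linux/spinlock.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical stand-ins for hva_to_pfn()+gpc_map() and gpc_unmap(). */
void *map_uhva(unsigned long uhva);
void unmap_khva(void *khva);

struct demo_cache {
        rwlock_t lock;          /* protects uhva, khva, valid */
        unsigned long uhva;
        void *khva;
        bool valid;
};

static int demo_refresh(struct demo_cache *c, unsigned long uhva)
{
        void *new_khva, *discard = NULL;
        int ret = 0;

        write_lock_irq(&c->lock);
        c->uhva = uhva;
        c->valid = false;               /* concurrent check() must fail now */
        write_unlock_irq(&c->lock);     /* mapping may sleep, so drop the lock */

        new_khva = map_uhva(uhva);

        write_lock_irq(&c->lock);
        if (!new_khva) {
                ret = -EFAULT;
        } else if (c->uhva != uhva) {
                /* Something changed while the lock was dropped: discard ours. */
                discard = new_khva;
        } else {
                discard = c->khva;      /* old mapping, unmapped below */
                c->khva = new_khva;
                c->valid = true;
        }
        write_unlock_irq(&c->lock);

        if (discard)
                unmap_khva(discard);    /* unmapping may sleep, do it unlocked */

        return ret;
}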
 virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
 1 file changed, 85 insertions(+), 96 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 70394d7c9a38..adca709a5884 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -135,110 +135,67 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
        return kvm->mmu_invalidate_seq != mmu_seq;
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+/*
+ * Given a user virtual address, obtain a pinned host PFN and kernel mapping
+ * for it. The caller will release the PFN after installing it into the GPC
+ * so that the MMU notifier invalidation mechanism is active.
+ */
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva,
+                                 kvm_pfn_t *pfn, void **khva)
 {
        /* Note, the new page offset may be different than the old! */
-       void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
        kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
        void *new_khva = NULL;
        unsigned long mmu_seq;
 
-       lockdep_assert_held(&gpc->refresh_lock);
-
-       lockdep_assert_held_write(&gpc->lock);
-
-       /*
-        * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva
-        * assets have already been updated and so a concurrent check() from a
-        * different task may not fail the gpa/uhva/generation checks.
-        */
-       gpc->valid = false;
-
-       do {
-               mmu_seq = gpc->kvm->mmu_invalidate_seq;
+       for (;;) {
+               mmu_seq = kvm->mmu_invalidate_seq;
                smp_rmb();
 
-               write_unlock_irq(&gpc->lock);
-
-               /*
-                * If the previous iteration "failed" due to an mmu_notifier
-                * event, release the pfn and unmap the kernel virtual address
-                * from the previous attempt.  Unmapping might sleep, so this
-                * needs to be done after dropping the lock.  Opportunistically
-                * check for resched while the lock isn't held.
-                */
-               if (new_pfn != KVM_PFN_ERR_FAULT) {
-                       /*
-                        * Keep the mapping if the previous iteration reused
-                        * the existing mapping and didn't create a new one.
-                        */
-                       if (new_khva != old_khva)
-                               gpc_unmap(new_pfn, new_khva);
-
-                       kvm_release_pfn_clean(new_pfn);
-
-                       cond_resched();
-               }
-
                /* We always request a writeable mapping */
-               new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+               new_pfn = hva_to_pfn(uhva, false, false, NULL, true, NULL);
                if (is_error_noslot_pfn(new_pfn))
-                       goto out_error;
+                       return -EFAULT;
 
                /*
-                * Obtain a new kernel mapping if KVM itself will access the
-                * pfn.  Note, kmap() and memremap() can both sleep, so this
-                * too must be done outside of gpc->lock!
+                * Always obtain a new kernel mapping. Trying to reuse an
+                * existing one is more complex than it's worth.
                 */
-               if (new_pfn == gpc->pfn)
-                       new_khva = old_khva;
-               else
-                       new_khva = gpc_map(new_pfn);
-
+               new_khva = gpc_map(new_pfn);
                if (!new_khva) {
                        kvm_release_pfn_clean(new_pfn);
-                       goto out_error;
+                       return -EFAULT;
                }
 
-               write_lock_irq(&gpc->lock);
+               if (!mmu_notifier_retry_cache(kvm, mmu_seq))
+                       break;
 
                /*
-                * Other tasks must wait for _this_ refresh to complete before
-                * attempting to refresh.
+                * If this iteration "failed" due to an mmu_notifier event,
+                * release the pfn and unmap the kernel virtual address, and
+                * loop around again.
                 */
-               WARN_ON_ONCE(gpc->valid);
-       } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
-
-       gpc->valid = true;
-       gpc->pfn = new_pfn;
-       gpc->khva = new_khva + offset_in_page(gpc->uhva);
+               if (new_pfn != KVM_PFN_ERR_FAULT) {
+                       gpc_unmap(new_pfn, new_khva);
+                       kvm_release_pfn_clean(new_pfn);
+               }
+       }
 
-       /*
-        * Put the reference to the _new_ pfn.  The pfn is now tracked by the
-        * cache and can be safely migrated, swapped, etc... as the cache will
-        * invalidate any mappings in response to relevant mmu_notifier events.
-        */
-       kvm_release_pfn_clean(new_pfn);
+       *pfn = new_pfn;
+       *khva = new_khva;
 
        return 0;
-
-out_error:
-       write_lock_irq(&gpc->lock);
-
-       return -EFAULT;
 }
 
-static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
-                            unsigned long len)
+static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
+                            unsigned long uhva, unsigned long len)
 {
        struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
        unsigned long page_offset = (gpa != KVM_XEN_INVALID_GPA) ?
                offset_in_page(gpa) :
                offset_in_page(uhva);
-       bool unmap_old = false;
        unsigned long old_uhva;
-       kvm_pfn_t old_pfn;
-       bool hva_change = false;
+       kvm_pfn_t old_pfn = KVM_PFN_ERR_FAULT;
        void *old_khva;
        int ret;
 
@@ -251,8 +208,9 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
        /*
         * If another task is refreshing the cache, wait for it to complete.
-        * There is no guarantee that concurrent refreshes will see the same
-        * gpa, memslots generation, etc..., so they must be fully serialized.
+        * This is purely an optimisation, to avoid concurrent mappings from
+        * hva_to_pfn_retry(), all but one of which will be discarded after
+        * losing a race to install them in the GPC.
         */
        mutex_lock(&gpc->refresh_lock);
 
@@ -272,7 +230,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                gpc->uhva = PAGE_ALIGN_DOWN(uhva);
 
                if (gpc->uhva != old_uhva)
-                       hva_change = true;
+                       gpc->valid = false;
        } else if (gpc->gpa != gpa ||
                   gpc->generation != slots->generation ||
                   kvm_is_error_hva(gpc->uhva)) {
@@ -285,7 +243,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
                if (kvm_is_error_hva(gpc->uhva)) {
                        ret = -EFAULT;
-                       goto out;
+
+                       gpc->valid = false;
+                       gpc->pfn = KVM_PFN_ERR_FAULT;
+                       gpc->khva = NULL;
+                       goto out_unlock;
                }
 
                /*
@@ -293,7 +255,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                 * HVA may still be the same.
                 */
                if (gpc->uhva != old_uhva)
-                       hva_change = true;
+                       gpc->valid = false;
        } else {
                gpc->uhva = old_uhva;
        }
@@ -305,9 +267,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
         * If the userspace HVA changed or the PFN was already invalid,
         * drop the lock and do the HVA to PFN lookup again.
         */
-       if (!gpc->valid || hva_change) {
-               ret = hva_to_pfn_retry(gpc);
-       } else {
+       if (gpc->valid) {
                /*
                 * If the HVA→PFN mapping was already valid, don't unmap it.
                 * But do update gpc->khva because the offset within the page
@@ -315,30 +275,59 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                 */
                gpc->khva = old_khva + page_offset;
                ret = 0;
-               goto out_unlock;
-       }
 
- out:
-       /*
-        * Invalidate the cache and purge the pfn/khva if the refresh failed.
-        * Some/all of the uhva, gpa, and memslot generation info may still be
-        * valid, leave it as is.
-        */
-       if (ret) {
+               /* old_pfn must not be unmapped because it was reused. */
+               old_pfn = KVM_PFN_ERR_FAULT;
+       } else {
+               kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+               unsigned long new_uhva = gpc->uhva;
+               void *new_khva = NULL;
+
+               /*
+                * Invalidate the cache prior to dropping gpc->lock; the
+                * gpa=>uhva assets have already been updated and so a
+                * concurrent check() from a different task may not fail
+                * the gpa/uhva/generation checks as it should.
+                */
                gpc->valid = false;
-               gpc->pfn = KVM_PFN_ERR_FAULT;
-               gpc->khva = NULL;
-       }
 
-       /* Detect a pfn change before dropping the lock! */
-       unmap_old = (old_pfn != gpc->pfn);
+               write_unlock_irq(&gpc->lock);
+
+               ret = hva_to_pfn_retry(gpc->kvm, new_uhva, &new_pfn, &new_khva);
+
+               write_lock_irq(&gpc->lock);
+
+               if (ret || gpc->uhva != new_uhva) {
+                       /*
+                        * On failure or if another update occurred while the
+                        * lock was dropped, just purge the new mapping. */
+                       old_pfn = new_pfn;
+                       old_khva = new_khva;
+               } else {
+                       old_pfn = gpc->pfn;
+                       old_khva = gpc->khva;
+
+                       gpc->pfn = new_pfn;
+                       gpc->khva = new_khva + offset_in_page(gpc->uhva);
+                       gpc->valid = true;
+               }
+
+               /*
+                * Put the reference to the _new_ pfn. On success, the
+                * pfn is now tracked by the cache and can safely be
+                * migrated, swapped, etc. as the cache will invalidate
+                * any mappings in response to relevant mmu_notifier
+                * events.
+                */
+               kvm_release_pfn_clean(new_pfn);
+       }
 
  out_unlock:
        write_unlock_irq(&gpc->lock);
 
        mutex_unlock(&gpc->refresh_lock);
 
-       if (unmap_old)
+       if (old_pfn != KVM_PFN_ERR_FAULT)
                gpc_unmap(old_pfn, old_khva);
 
        return ret;