[v2,0/7] Improve gfn-to-memslot performance during page faults

Message ID: 20210804222844.1419481-1-dmatlack@google.com

Message

David Matlack Aug. 4, 2021, 10:28 p.m. UTC
This series improves the performance of gfn-to-memslot lookups during
page faults. Ben Gardon originally identified this performance gap and
addressed it in Google's kernel by reading the memslot once at the
beginning of the page fault and passing around the pointer.

This series takes an alternative approach by introducing a per-vCPU
cache of the most recently used memslot index. This avoids having to
binary search the existing memslots multiple times during a page fault.
Unlike passing around the pointer, the cache has an additional benefit
in that it speeds up gfn-to-memslot lookups *across* faults and during
SPTE prefetching, where the gfn changes.
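
To make the idea concrete, here is a condensed, self-contained sketch
of the cached lookup. The struct and function names are simplified
stand-ins for illustration, not the kernel's (the real code lives in
include/linux/kvm_host.h):

#include <stdbool.h>

typedef unsigned long long gfn_t;

struct memslot {
        gfn_t base_gfn;
        unsigned long npages;
};

struct memslots {
        int used_slots;
        struct memslot slots[512];      /* sorted by base_gfn, descending */
};

struct vcpu {
        int last_used_slot;             /* per-vCPU cache of a slot index */
        struct memslots *slots;
};

static bool slot_contains(struct memslot *slot, gfn_t gfn)
{
        return gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages;
}

static struct memslot *vcpu_gfn_to_memslot(struct vcpu *vcpu, gfn_t gfn)
{
        struct memslots *ms = vcpu->slots;
        int i = vcpu->last_used_slot;
        int start = 0, end = ms->used_slots;

        /* Fast path: during a page fault the cached index usually hits. */
        if (i >= 0 && i < ms->used_slots && slot_contains(&ms->slots[i], gfn))
                return &ms->slots[i];

        /* Slow path: binary search for the first slot with base_gfn <= gfn. */
        while (start < end) {
                i = start + (end - start) / 2;
                if (gfn >= ms->slots[i].base_gfn)
                        end = i;
                else
                        start = i + 1;
        }

        if (start < ms->used_slots && slot_contains(&ms->slots[start], gfn)) {
                vcpu->last_used_slot = start;   /* refresh the cache */
                return &ms->slots[start];
        }

        return NULL;    /* gfn is not backed by any memslot */
}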

This difference can be seen clearly when looking at the performance of
fast_page_fault when multiple slots are in play:

Metric                        | Baseline     | Pass*    | Cache**
----------------------------- | ------------ | -------- | ----------
Iteration 2 dirty memory time | 2.8s         | 1.6s     | 0.30s

* Pass: Look up the memslot once per fault and pass it around.
** Cache: Cache the last used slot per vCPU (i.e. this series).

(Collected via ./dirty_log_perf_test -v64 -x64, i.e. 64 vCPUs with the
guest memory split across 64 memslots.)

I also plan to send a follow-up series with a version of Ben's patches
to pass the memslot pointer through the page fault handling code rather
than looking it up multiple times. Even when applied on top of this
cache series, it yields a further performance improvement by avoiding a
few extra memory accesses (mainly kvm->memslots[as_id] and
slots->used_slots). But it will be a judgement call whether that is
worth the code churn and complexity.
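
To illustrate the difference, the pass-the-pointer approach boils down
to something like the following, reusing the condensed types from the
sketch above; handle_page_fault(), try_fast_page_fault() and map_page()
are hypothetical names for illustration, not the kernel's:

/* Hypothetical helpers, standing in for the real fault-handling paths. */
static bool try_fast_page_fault(struct vcpu *vcpu, struct memslot *slot,
                                gfn_t gfn);
static int map_page(struct vcpu *vcpu, struct memslot *slot, gfn_t gfn);

static int handle_page_fault(struct vcpu *vcpu, gfn_t gfn)
{
        /* One lookup at the top of the fault path... */
        struct memslot *slot = vcpu_gfn_to_memslot(vcpu, gfn);

        if (!slot)
                return -1;      /* no slot backs this gfn (e.g. MMIO) */

        /*
         * ...then the pointer is threaded through every helper, so
         * nothing below needs to search the memslots again.
         */
        if (try_fast_page_fault(vcpu, slot, gfn))
                return 0;

        return map_page(vcpu, slot, gfn);
}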

v2:
 * Rename lru to last_used [Paolo]
 * Tree-wide replace search_memslots with __gfn_to_memslot [Paolo]
 * Avoid speculation when accessing slots->memslots [Paolo] (see the
   sketch after this list)
 * Refactor tdp_set_spte_atomic to leverage vcpu->last_used_slot [Paolo]
 * Add Paolo's Reviewed-by tags
 * Fix build failures in mmu_audit.c [kernel test robot]
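
A note on the speculation item above: a cached index is only
bounds-checked against slots->used_slots before use, so a mispredicted
branch could speculatively index past the end of the memslots array.
The fix clamps the index with the kernel's array_index_nospec() helper;
a rough sketch, using the simplified types from the first example:

#include <linux/nospec.h>

/* Sketch: safely try a possibly stale cached slot index. */
static struct memslot *try_get_slot(struct memslots *ms, int idx, gfn_t gfn)
{
        struct memslot *slot;

        if (idx < 0 || idx >= ms->used_slots)
                return NULL;

        /*
         * idx may be a stale vcpu->last_used_slot value; clamp it so the
         * CPU cannot speculatively read past used_slots even if the
         * bounds check above is mispredicted.
         */
        idx = array_index_nospec(idx, ms->used_slots);
        slot = &ms->slots[idx];

        return slot_contains(slot, gfn) ? slot : NULL;
}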

v1: https://lore.kernel.org/kvm/20210730223707.4083785-1-dmatlack@google.com/

David Matlack (7):
  KVM: Rename lru_slot to last_used_slot
  KVM: Move last_used_slot logic out of search_memslots
  KVM: Cache the last used slot index per vCPU
  KVM: x86/mmu: Leverage vcpu->last_used_slot in
    tdp_mmu_map_handle_target_level
  KVM: x86/mmu: Leverage vcpu->last_used_slot for rmap_add and
    rmap_recycle
  KVM: x86/mmu: Rename __gfn_to_rmap to gfn_to_rmap
  KVM: selftests: Support multiple slots in dirty_log_perf_test

 arch/powerpc/kvm/book3s_64_vio.c              |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c           |  2 +-
 arch/s390/kvm/kvm-s390.c                      |  4 +-
 arch/x86/kvm/mmu/mmu.c                        | 54 +++++++------
 arch/x86/kvm/mmu/mmu_audit.c                  |  4 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 42 +++++++---
 include/linux/kvm_host.h                      | 80 +++++++++++++++----
 .../selftests/kvm/access_tracking_perf_test.c |  2 +-
 .../selftests/kvm/demand_paging_test.c        |  2 +-
 .../selftests/kvm/dirty_log_perf_test.c       | 76 +++++++++++++++---
 .../selftests/kvm/include/perf_test_util.h    |  2 +-
 .../selftests/kvm/lib/perf_test_util.c        | 20 +++--
 .../kvm/memslot_modification_stress_test.c    |  2 +-
 virt/kvm/kvm_main.c                           | 26 +++++-
 14 files changed, 238 insertions(+), 80 deletions(-)

Comments

Paolo Bonzini Aug. 5, 2021, 8:11 a.m. UTC
On 05/08/21 00:28, David Matlack wrote:
> This series improves the performance of gfn-to-memslot lookups during
> page faults. [...]

Queued, thanks.

Paolo
