Message ID: 1603331829-33879-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State: New, archived
Series: [RFC] KVM: arm64: Don't force broadcast tlbi when guest is running
On 2020-10-22 02:57, Shaokun Zhang wrote:
> From: Nianyao Tang <tangnianyao@huawei.com>
>
> Now HCR_EL2.FB is set to force broadcast of tlbi to all online pcpus,
> even those the vcpus do not reside on. This gets worse as the system
> grows larger and more pcpus come online.
> Let's disable force-broadcast. We flush the TLB when moving a vcpu to
> a new pcpu, in case an old page mapping still exists on the new pcpu.
>
> Cc: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>  arch/arm64/kvm/arm.c             | 4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 64ce29378467..f85ea9c649cb 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -75,7 +75,7 @@
>   * PTW:		Take a stage2 fault if a stage1 walk steps in device memory
>   */
>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
> -			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
> +			 HCR_BSU_IS | HCR_TAC | \
>  			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>  			 HCR_FMO | HCR_IMO | HCR_PTW )
>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index acf9a993dfb6..845be911f885 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	/*
>  	 * We might get preempted before the vCPU actually runs, but
>  	 * over-invalidation doesn't affect correctness.
> +	 * Dirty tlb might still exist when vcpu ran on other pcpu
> +	 * and modified page mapping.
>  	 */
> -	if (*last_ran != vcpu->vcpu_id) {
> +	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>  		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>  		*last_ran = vcpu->vcpu_id;
>  	}

This breaks uniprocessor semantics for I-cache invalidation. What could
possibly go wrong? You also fail to provide any data that would back up
your claim, as usual.

        M.
Hi Marc,

On 2020/10/22 20:03, Marc Zyngier wrote:
> On 2020-10-22 02:57, Shaokun Zhang wrote:
>> [...]
>
> This breaks uniprocessor semantics for I-cache invalidation. What could

Oops, we shall consider this.

> possibly go wrong? You also fail to provide any data that would back up
> your claim, as usual.

We tested Unixbench spawn in each 4U8G VM on Kunpeng 920:
(1) 24 4U8G VMs, each vcpu tasksetted to one pcpu to make sure
    the VMs are separated.
(2) 1 4U8G VM
Result:
(1) 800 * 24
(2) 3200

By restricting DVM broadcasting across NUMA, result (1) can be
improved to 2300 * 24. In the spawn testcase, tlbi is used when
creating processes.

Further, we consider restricting the tlbi broadcast range to make
tlbi more accurate. One scheme is replacing tlbi-is with
ipi + tlbi + HCR_EL2.FB=0.

Considering I-cache invalidation, KVM also needs to invalidate the
icache when moving a vcpu to a new pcpu.
Do we miss any cases that need HCR_EL2.FB == 1?
Eventually we expect HCR_EL2.FB == 0 if it is possible.

Thanks,
Shaokun

>
> M.
On 2020-10-23 02:26, Shaokun Zhang wrote:
> Hi Marc,
>
> [...]
>
> We tested Unixbench spawn in each 4U8G VM on Kunpeng 920:
> (1) 24 4U8G VMs, each vcpu tasksetted to one pcpu to make sure
>     the VMs are separated.

That's a very narrow use case.

> (2) 1 4U8G VM
> Result:
> (1) 800 * 24
> (2) 3200

I don't know what these numbers are.

> By restricting DVM broadcasting across NUMA, result (1) can be
> improved to 2300 * 24. In the spawn testcase, tlbi is used when
> creating processes.

Linux always uses TLBI IS, except in some very rare cases.
If your HW broadcasts invalidations across the whole system
without any filtering, it is bound to be bad.

> Further, we consider restricting the tlbi broadcast range to make
> tlbi more accurate. One scheme is replacing tlbi-is with
> ipi + tlbi + HCR_EL2.FB=0.

Ah. That old thing again. No, thank you. The right thing to do
is to build CPUs and interconnects that properly deal with
IS invalidations by not sending DVM messages to places where
things don't run (snoop filters?). I also doubt the arm64
maintainers would be sympathetic to the idea anyway.

> Considering I-cache invalidation, KVM also needs to invalidate the
> icache when moving a vcpu to a new pcpu.
> Do we miss any cases that need HCR_EL2.FB == 1?
> Eventually we expect HCR_EL2.FB == 0 if it is possible.

I have no intention to ever set FB to 0, as this would result
in over-invalidation in the general case, and worse performance
on well designed systems.

        M.
Hi Marc,

Apologies for the late reply; I was out of office for some days.

On 2020/10/23 17:17, Marc Zyngier wrote:
> On 2020-10-23 02:26, Shaokun Zhang wrote:
>> Hi Marc,
>>
>> [...]
>>
>> (2) 1 4U8G VM
>> Result:
>> (1) 800 * 24
>> (2) 3200
>
> I don't know what these numbers are.

These are Unixbench scores; the higher the better.

>> By restricting DVM broadcasting across NUMA, result (1) can be
>> improved to 2300 * 24. In the spawn testcase, tlbi is used when
>> creating processes.
>
> Linux always uses TLBI IS, except in some very rare cases.
> If your HW broadcasts invalidations across the whole system
> without any filtering, it is bound to be bad.
>
>> Further, we consider restricting the tlbi broadcast range to make
>> tlbi more accurate. One scheme is replacing tlbi-is with
>> ipi + tlbi + HCR_EL2.FB=0.
>
> Ah. That old thing again. No, thank you. The right thing to do
> is to build CPUs and interconnects that properly deal with
> IS invalidations by not sending DVM messages to places where
> things don't run (snoop filters?). I also doubt the arm64
> maintainers would be sympathetic to the idea anyway.

We also ran the same test on Intel 6248 and got a better result:
less performance drop in the multi-VM case compared to the single-VM
case. Intel uses ipi + tlb flush; do you think this approach is
meaningful on Arm architecture hardware?

>> Considering I-cache invalidation, KVM also needs to invalidate the
>> icache when moving a vcpu to a new pcpu.
>> Do we miss any cases that need HCR_EL2.FB == 1?
>> Eventually we expect HCR_EL2.FB == 0 if it is possible.
>
> I have no intention to ever set FB to 0, as this would result
> in over-invalidation in the general case, and worse performance
> on well designed systems.

The reason that we want to disable FB is that an IPI solution
is needed, if it is accepted.

Thanks,
Shaokun

>
> M.
On 2020-11-03 11:31, Shaokun Zhang wrote:

[...]

>>> Further, we consider restricting the tlbi broadcast range to make
>>> tlbi more accurate. One scheme is replacing tlbi-is with
>>> ipi + tlbi + HCR_EL2.FB=0.
>>
>> Ah. That old thing again. No, thank you. The right thing to do
>> is to build CPUs and interconnects that properly deal with
>> IS invalidations by not sending DVM messages to places where
>> things don't run (snoop filters?). I also doubt the arm64
>> maintainers would be sympathetic to the idea anyway.
>
> We also ran the same test on Intel 6248 and got a better result:
> less performance drop in the multi-VM case compared to the single-VM
> case. Intel uses ipi + tlb flush; do you think this approach is
> meaningful on Arm architecture hardware?

I can't see how you can compare the two architectures: one only has
local TLB invalidation and *must* rely on IPIs, while the other
supports broadcast TLB invalidation, which can lead to better
performance and lower latencies if the HW is well designed.

>>> Considering I-cache invalidation, KVM also needs to invalidate the
>>> icache when moving a vcpu to a new pcpu.
>>> Do we miss any cases that need HCR_EL2.FB == 1?
>>> Eventually we expect HCR_EL2.FB == 0 if it is possible.
>>
>> I have no intention to ever set FB to 0, as this would result
>> in over-invalidation in the general case, and worse performance
>> on well designed systems.
>
> The reason that we want to disable FB is that an IPI solution
> is needed, if it is accepted.

Let me repeat it: setting FB=0 may improve bad implementations, but
will have a negative impact on *good* HW implementations. It results
in extra TLB and I-cache invalidations in the general case.

You are only considering your narrow use case, where vcpus are pinned
to physical CPUs; there you indeed don't need invalidations to be
broadcast. But not everybody has your use case.

*IF* you come up with a way to find that a vcpu is pinned and cannot
be migrated, then that's an avenue for optimisation. But otherwise, no.

        M.
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 64ce29378467..f85ea9c649cb 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -75,7 +75,7 @@
  * PTW:		Take a stage2 fault if a stage1 walk steps in device memory
  */
 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
-			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
+			 HCR_BSU_IS | HCR_TAC | \
 			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
 			 HCR_FMO | HCR_IMO | HCR_PTW )
 #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index acf9a993dfb6..845be911f885 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	/*
 	 * We might get preempted before the vCPU actually runs, but
 	 * over-invalidation doesn't affect correctness.
+	 * Dirty tlb might still exist when vcpu ran on other pcpu
+	 * and modified page mapping.
 	 */
-	if (*last_ran != vcpu->vcpu_id) {
+	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
 		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
 		*last_ran = vcpu->vcpu_id;
 	}