
[RFC] KVM: arm64: Don't force broadcast tlbi when guest is running

Message ID 1603331829-33879-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State New, archived
Series: [RFC] KVM: arm64: Don't force broadcast tlbi when guest is running

Commit Message

Shaokun Zhang Oct. 22, 2020, 1:57 a.m. UTC
From: Nianyao Tang <tangnianyao@huawei.com>

Currently HCR_EL2.FB is set, forcing broadcast of TLB invalidations to all
online pCPUs, even those no vCPU of the guest resides on. This gets worse
as the system grows and more pCPUs come online.
Let's disable force-broadcast and instead flush the local TLB when a vCPU
moves to a new pCPU, in case stale page mappings still exist on the new pCPU.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
 arch/arm64/include/asm/kvm_arm.h | 2 +-
 arch/arm64/kvm/arm.c             | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

Marc Zyngier Oct. 22, 2020, 12:03 p.m. UTC | #1
On 2020-10-22 02:57, Shaokun Zhang wrote:
> From: Nianyao Tang <tangnianyao@huawei.com>
> 
> Currently HCR_EL2.FB is set, forcing broadcast of TLB invalidations to all
> online pCPUs, even those no vCPU of the guest resides on. This gets worse
> as the system grows and more pCPUs come online.
> Let's disable force-broadcast and instead flush the local TLB when a vCPU
> moves to a new pCPU, in case stale page mappings still exist on the new pCPU.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>  arch/arm64/kvm/arm.c             | 4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 64ce29378467..f85ea9c649cb 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -75,7 +75,7 @@
>   * PTW:		Take a stage2 fault if a stage1 walk steps in device memory
>   */
>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | 
> HCR_VM | \
> -			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
> +			 HCR_BSU_IS | HCR_TAC | \
>  			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>  			 HCR_FMO | HCR_IMO | HCR_PTW )
>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index acf9a993dfb6..845be911f885 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int 
> cpu)
>  	/*
>  	 * We might get preempted before the vCPU actually runs, but
>  	 * over-invalidation doesn't affect correctness.
> +	 * Dirty tlb might still exist when vcpu ran on other pcpu
> +	 * and modified page mapping.
>  	 */
> -	if (*last_ran != vcpu->vcpu_id) {
> +	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>  		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>  		*last_ran = vcpu->vcpu_id;
>  	}

This breaks uniprocessor semantics for I-cache invalidation. What could
possibly go wrong? You also fail to provide any data that would back up
your claim, as usual.

         M.
Shaokun Zhang Oct. 23, 2020, 1:26 a.m. UTC | #2
Hi Marc,

On 2020/10/22 20:03, Marc Zyngier wrote:
> On 2020-10-22 02:57, Shaokun Zhang wrote:
>> From: Nianyao Tang <tangnianyao@huawei.com>
>>
>> Currently HCR_EL2.FB is set, forcing broadcast of TLB invalidations to all
>> online pCPUs, even those no vCPU of the guest resides on. This gets worse
>> as the system grows and more pCPUs come online.
>> Let's disable force-broadcast and instead flush the local TLB when a vCPU
>> moves to a new pCPU, in case stale page mappings still exist on the new pCPU.
>>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>  arch/arm64/kvm/arm.c             | 4 +++-
>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 64ce29378467..f85ea9c649cb 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -75,7 +75,7 @@
>>   * PTW:        Take a stage2 fault if a stage1 walk steps in device memory
>>   */
>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>> -             HCR_BSU_IS | HCR_FB | HCR_TAC | \
>> +             HCR_BSU_IS | HCR_TAC | \
>>               HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>               HCR_FMO | HCR_IMO | HCR_PTW )
>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index acf9a993dfb6..845be911f885 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>      /*
>>       * We might get preempted before the vCPU actually runs, but
>>       * over-invalidation doesn't affect correctness.
>> +     * Dirty tlb might still exist when vcpu ran on other pcpu
>> +     * and modified page mapping.
>>       */
>> -    if (*last_ran != vcpu->vcpu_id) {
>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>          *last_ran = vcpu->vcpu_id;
>>      }
> 
> This breaks uniprocessor semantics for I-cache invalidation. What could

Oops, we shall take that into account.

> possibly go wrong? You also fail to provide any data that would back up
> your claim, as usual.

We tested UnixBench spawn in 4U8G VMs on a Kunpeng 920:
(1) 24 4U8G VMs, each vCPU pinned with taskset to one pCPU so that
the VMs are separated.
(2) 1 4U8G VM
Result:
(1) 800 * 24
(2) 3200
By restricting DVM broadcasts across NUMA nodes, result (1) can
be improved to 2300 * 24. In the spawn testcase, TLBI IS is used
heavily when creating processes.
Going further, we are considering restricting the TLBI broadcast range
to make invalidation more targeted.
One scheme is replacing TLBI IS with IPI + local TLBI, with HCR_EL2.FB=0.
Considering I-cache invalidation, KVM would also need to invalidate the
I-cache when moving a vCPU to a new pCPU.
Do we miss any other cases that need HCR_EL2.FB == 1?
Eventually we would like HCR_EL2.FB == 0 if at all possible.

Thanks,
Shaokun

> 
>         M.
Marc Zyngier Oct. 23, 2020, 9:17 a.m. UTC | #3
On 2020-10-23 02:26, Shaokun Zhang wrote:
> Hi Marc,
> 
> On 2020/10/22 20:03, Marc Zyngier wrote:
>> On 2020-10-22 02:57, Shaokun Zhang wrote:
>>> From: Nianyao Tang <tangnianyao@huawei.com>
>>> 
>>> Currently HCR_EL2.FB is set, forcing broadcast of TLB invalidations to all
>>> online pCPUs, even those no vCPU of the guest resides on. This gets worse
>>> as the system grows and more pCPUs come online.
>>> Let's disable force-broadcast and instead flush the local TLB when a vCPU
>>> moves to a new pCPU, in case stale page mappings still exist on the new pCPU.
>>> 
>>> Cc: Marc Zyngier <maz@kernel.org>
>>> Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
>>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>  arch/arm64/kvm/arm.c             | 4 +++-
>>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h 
>>> b/arch/arm64/include/asm/kvm_arm.h
>>> index 64ce29378467..f85ea9c649cb 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -75,7 +75,7 @@
>>>   * PTW:        Take a stage2 fault if a stage1 walk steps in device 
>>> memory
>>>   */
>>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | 
>>> HCR_VM | \
>>> -             HCR_BSU_IS | HCR_FB | HCR_TAC | \
>>> +             HCR_BSU_IS | HCR_TAC | \
>>>               HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>>               HCR_FMO | HCR_IMO | HCR_PTW )
>>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index acf9a993dfb6..845be911f885 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, 
>>> int cpu)
>>>      /*
>>>       * We might get preempted before the vCPU actually runs, but
>>>       * over-invalidation doesn't affect correctness.
>>> +     * Dirty tlb might still exist when vcpu ran on other pcpu
>>> +     * and modified page mapping.
>>>       */
>>> -    if (*last_ran != vcpu->vcpu_id) {
>>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>>          *last_ran = vcpu->vcpu_id;
>>>      }
>> 
>> This breaks uniprocessor semantics for I-cache invalidation. What 
>> could
> 
> Oops, we shall take that into account.
> 
>> possibly go wrong? You also fail to provide any data that would back 
>> up
>> your claim, as usual.
> 
> We tested UnixBench spawn in 4U8G VMs on a Kunpeng 920:
> (1) 24 4U8G VMs, each vCPU pinned with taskset to one pCPU so that
> the VMs are separated.

That's a very narrow use case.

> (2) 1 4U8G VM
> Result:
> (1) 800 * 24
> (2) 3200

I don't know what these numbers are.

> By restricting DVM broadcasts across NUMA nodes, result (1) can
> be improved to 2300 * 24. In the spawn testcase, TLBI IS is used
> heavily when creating processes.

Linux always uses TLBI IS, except in some very rare cases.
If your HW broadcasts invalidations across the whole system
without any filtering, it is bound to be bad.

> Going further, we are considering restricting the TLBI broadcast range
> to make invalidation more targeted.
> One scheme is replacing TLBI IS with IPI + local TLBI, with HCR_EL2.FB=0.

Ah. That old thing again. No, thank you. The right thing to do
is to build CPUs and interconnects that properly deal with
IS invalidations by not sending DVM messages to places where
things don't run (snoop filters?). I also doubt the arm64
maintainers would be sympathetic to the idea anyway.

> Considering I-cache invalidation, KVM would also need to invalidate the
> I-cache when moving a vCPU to a new pCPU.
> Do we miss any other cases that need HCR_EL2.FB == 1?
> Eventually we would like HCR_EL2.FB == 0 if at all possible.

I have no intention to ever set FB to 0, as this would result
in over-invalidation in the general case, and worse performance
on well designed systems.

         M.
Shaokun Zhang Nov. 3, 2020, 11:31 a.m. UTC | #4
Hi Marc,

Apologies for the late reply; I was out of the office for a few days.

On 2020/10/23 17:17, Marc Zyngier wrote:
> On 2020-10-23 02:26, Shaokun Zhang wrote:
>> Hi Marc,
>>
>> On 2020/10/22 20:03, Marc Zyngier wrote:
>>> On 2020-10-22 02:57, Shaokun Zhang wrote:
>>>> From: Nianyao Tang <tangnianyao@huawei.com>
>>>>
>>>> Currently HCR_EL2.FB is set, forcing broadcast of TLB invalidations to all
>>>> online pCPUs, even those no vCPU of the guest resides on. This gets worse
>>>> as the system grows and more pCPUs come online.
>>>> Let's disable force-broadcast and instead flush the local TLB when a vCPU
>>>> moves to a new pCPU, in case stale page mappings still exist on the new pCPU.
>>>>
>>>> Cc: Marc Zyngier <maz@kernel.org>
>>>> Signed-off-by: Nianyao Tang <tangnianyao@huawei.com>
>>>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>>  arch/arm64/kvm/arm.c             | 4 +++-
>>>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>>> index 64ce29378467..f85ea9c649cb 100644
>>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>>> @@ -75,7 +75,7 @@
>>>>   * PTW:        Take a stage2 fault if a stage1 walk steps in device memory
>>>>   */
>>>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>>>> -             HCR_BSU_IS | HCR_FB | HCR_TAC | \
>>>> +             HCR_BSU_IS | HCR_TAC | \
>>>>               HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>>>               HCR_FMO | HCR_IMO | HCR_PTW )
>>>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index acf9a993dfb6..845be911f885 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>      /*
>>>>       * We might get preempted before the vCPU actually runs, but
>>>>       * over-invalidation doesn't affect correctness.
>>>> +     * Dirty tlb might still exist when vcpu ran on other pcpu
>>>> +     * and modified page mapping.
>>>>       */
>>>> -    if (*last_ran != vcpu->vcpu_id) {
>>>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>>>          *last_ran = vcpu->vcpu_id;
>>>>      }
>>>
>>> This breaks uniprocessor semantics for I-cache invalidation. What could
>>
>> Oops, we shall take that into account.
>>
>>> possibly go wrong? You also fail to provide any data that would back up
>>> your claim, as usual.
>>
>> We tested UnixBench spawn in 4U8G VMs on a Kunpeng 920:
>> (1) 24 4U8G VMs, each vCPU pinned with taskset to one pCPU so that
>> the VMs are separated.
> 
> That's a very narrow use case.
> 
>> (2) 1 4U8G VM
>> Result:
>> (1) 800 * 24
>> (2) 3200
> 
> I don't know what these numbers are.

These are UnixBench scores; higher is better.

> 
>> By restricting DVM broadcasts across NUMA nodes, result (1) can
>> be improved to 2300 * 24. In the spawn testcase, TLBI IS is used
>> heavily when creating processes.
> 
> Linux always uses TLBI IS, except in some very rare cases.
> If your HW broadcasts invalidations across the whole system
> without any filtering, it is bound to be bad.
> 
>> Going further, we are considering restricting the TLBI broadcast range
>> to make invalidation more targeted.
>> One scheme is replacing TLBI IS with IPI + local TLBI, with HCR_EL2.FB=0.
> 
> Ah. That old thing again. No, thank you. The right thing to do
> is to build CPUs and interconnects that properly deal with
> IS invalidations by not sending DVM messages to places where
> things don't run (snoop filters?). I also doubt the arm64
> maintainers would be sympathetic to the idea anyway.

We also ran the same test on an Intel Xeon 6248 and got a better result:
less performance drop in the multi-VM case compared to the single-VM case.
Intel uses the IPI + local TLB flush scheme; do you think this scheme is
meaningful on Arm architecture hardware?

> 
>> Considering I-cache invalidation, KVM would also need to invalidate the
>> I-cache when moving a vCPU to a new pCPU.
>> Do we miss any other cases that need HCR_EL2.FB == 1?
>> Eventually we would like HCR_EL2.FB == 0 if at all possible.
> 
> I have no intention to ever set FB to 0, as this would result
> in over-invalidation in the general case, and worse performance

The reason we want to disable FB is that the IPI solution
requires it, if that approach is accepted.

Thanks,
Shaokun

> on well designed systems.
> 
>         M.
Marc Zyngier Nov. 3, 2020, 1:26 p.m. UTC | #5
On 2020-11-03 11:31, Shaokun Zhang wrote:

[...]

>>> Going further, we are considering restricting the TLBI broadcast range
>>> to make invalidation more targeted.
>>> One scheme is replacing TLBI IS with IPI + local TLBI, with HCR_EL2.FB=0.
>> 
>> Ah. That old thing again. No, thank you. The right thing to do
>> is to build CPUs and interconnects that properly deal with
>> IS invalidations by not sending DVM messages to places where
>> things don't run (snoop filters?). I also doubt the arm64
>> maintainers would be sympathetic to the idea anyway.
> 
> We also ran the same test on an Intel Xeon 6248 and got a better result:
> less performance drop in the multi-VM case compared to the single-VM case.
> Intel uses the IPI + local TLB flush scheme; do you think this scheme is
> meaningful on Arm architecture hardware?

I can't see how you can compare the two architectures: one only
has local TLB invalidation and *must* rely on IPI, while the other
supports broadcast TLB invalidation which can lead to better performance
and lower latencies if the HW is well designed.

>>> Considering I-cache invalidation, KVM would also need to invalidate the
>>> I-cache when moving a vCPU to a new pCPU.
>>> Do we miss any other cases that need HCR_EL2.FB == 1?
>>> Eventually we would like HCR_EL2.FB == 0 if at all possible.
>> 
>> I have no intention to ever set FB to 0, as this would result in
>> over-invalidation in the general case, and worse performance
> 
> The reason we want to disable FB is that the IPI solution
> requires it, if that approach is accepted.

Let me repeat it: setting FB=0 may improve bad implementations, but
it will have a negative impact on *good* HW implementations.
It results in extra TLB and I-cache invalidations in the general case.

You are only considering your narrow use case where vCPUs are pinned
to physical CPUs, where you indeed don't need invalidations to be
broadcast. But not everybody has your use case.

*IF* you come up with a way to find that a vcpu is pinned and cannot
be migrated, then that's an avenue for optimisation. But otherwise, no.

         M.

Patch

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 64ce29378467..f85ea9c649cb 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -75,7 +75,7 @@ 
  * PTW:		Take a stage2 fault if a stage1 walk steps in device memory
  */
 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
-			 HCR_BSU_IS | HCR_FB | HCR_TAC | \
+			 HCR_BSU_IS | HCR_TAC | \
 			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
 			 HCR_FMO | HCR_IMO | HCR_PTW )
 #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index acf9a993dfb6..845be911f885 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -334,8 +334,10 @@  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	/*
 	 * We might get preempted before the vCPU actually runs, but
 	 * over-invalidation doesn't affect correctness.
+	 * Dirty tlb might still exist when vcpu ran on other pcpu
+	 * and modified page mapping.
 	 */
-	if (*last_ran != vcpu->vcpu_id) {
+	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
 		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
 		*last_ran = vcpu->vcpu_id;
 	}