[RFC,0/4] KVM: arm64: Improve efficiency of stage2 page table

Message ID: 20210208112250.163568-1-wangyanan55@huawei.com

Message

Yanan Wang Feb. 8, 2021, 11:22 a.m. UTC
Hi,

This series makes some efficiency improvements to the stage2 page table code,
and includes test results that show the performance changes, measured with a
kvm selftest [1] that I have posted:
[1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/

About patch 1:
We currently clean the dcache unconditionally in user_mem_abort() before
calling the fault handlers, whenever we take a translation fault and the pfn
is cacheable. But if there are concurrent translation faults on the same page
or block, only the first dcache clean is necessary; the later ones are not.

By moving the dcache clean into the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
Since performing CMOs is time consuming, especially when flushing a block
range, this solution greatly reduces the load on KVM and improves the
efficiency of creating mappings.

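To make this more concrete, here is a rough sketch of what the map handler
does after the change. It is only an illustration of the idea, not the actual
diff: the function name stage2_map_set_leaf() and helpers such as
stage2_has_fwb() and clean_dcache_guest_page() are assumed names here.

/*
 * Rough sketch only: clean the dcache from the stage2 map handler, and
 * only when a new cacheable leaf is really about to be installed. A vCPU
 * that loses the race and finds the same valid mapping already in place
 * returns early and skips the CMO.
 */
static int stage2_map_set_leaf(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new,
                               u64 phys, u32 level)
{
        /* Another vCPU has already handled this fault: nothing to do. */
        if (kvm_pte_valid(old) && old == new)
                return 0;

        /* Clean the dcache only for the mapping that is actually created. */
        if (stage2_pte_cacheable(new) && !stage2_has_fwb())
                clean_dcache_guest_page(phys, kvm_granule_size(level));

        WRITE_ONCE(*ptep, new);
        return 0;
}
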
Test results:
(1) when 20 vCPUs concurrently access 20G of RAM (all 1G hugepages):
KVM creating block mappings time: 52.83s -> 3.70s
KVM recovering block mappings time (after dirty logging): 52.0s -> 2.87s

(2) when 40 vCPUs concurrently access 20G of RAM (all 1G hugepages):
KVM creating block mappings time: 104.56s -> 3.70s
KVM recovering block mappings time (after dirty logging): 103.93s -> 2.96s

About patches 2 and 3:
When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first, then invalidate the TLB,
then unmap the page mappings, and finally install the block entry.

Unmapping the numerous page mappings costs a lot of time, which means the
table entry is left invalid for a long time before the block entry is
installed, and this causes many spurious translation faults.

So let's install the block entry first, to ensure uninterrupted memory access
for the other vCPUs, and then unmap the page mappings after installation.
This removes most of the window in which the table entry is invalid, and
avoids most of the unnecessary translation faults.

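To illustrate the new ordering, here is a rough sketch of the coalescing path.
Again this is not the actual patch; the function name and helpers such as
kvm_clear_pte() and free_unlinked_table() are assumed names for illustration.

/*
 * Rough sketch only: break-before-make ordering when coalescing page
 * mappings into a block. The block entry is installed right after the
 * TLB invalidation, and the old page-table page is only torn down
 * afterwards, so the entry stays invalid for a much shorter window.
 */
static void stage2_coalesce_tables_into_block(kvm_pte_t *ptep, kvm_pte_t block,
                                              struct kvm_s2_mmu *mmu,
                                              u64 addr, u32 level)
{
        kvm_pte_t old = *ptep;

        /* Break: invalidate the old table entry and its TLB entries. */
        kvm_clear_pte(ptep);
        kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);

        /* Make: install the block entry right away for the other vCPUs. */
        WRITE_ONCE(*ptep, block);

        /* Only now unmap and free the old page mappings, off the hot path. */
        free_unlinked_table(kvm_pte_follow(old), level);
}
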
Test results based on patch 1:
(1) when 20 vCPUs concurrently access 20G of RAM (all 1G hugepages):
KVM recovering block mappings time (after dirty logging): 2.87s -> 0.30s

(2) when 40 vCPUs concurrently access 20G of RAM (all 1G hugepages):
KVM recovering block mappings time (after dirty logging): 2.96s -> 0.35s

So combined with patch 1, this series makes a big difference to the time KVM
spends creating mappings and recovering block mappings, with not much code change.

About patch 4:
A new method is introduced to distinguish the cases of memcache allocations.
By comparing fault_granule and vma_pagesize, the cases that require allocations
from the memcache and the cases that don't can be distinguished completely.

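As a sketch of the resulting check in user_mem_abort() (only an illustration
of the condition described above, the exact code may differ):

        /*
         * Pages from the memcache are needed only when the fault handler
         * may have to create new page-table levels, i.e. when the granule
         * of the faulting lookup level is larger than the mapping size
         * that KVM is about to install.
         */
        if (fault_granule > vma_pagesize) {
                ret = kvm_mmu_topup_memory_cache(memcache,
                                                 kvm_mmu_cache_min_pages(kvm));
                if (ret)
                        return ret;
        }
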
---

Details of test results
platform: HiSilicon Kunpeng920 (FWB not supported)
host kernel: Linux mainline (v5.11-rc6)

(1) performance change of patch 1
cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
	   (20 vCPUs, 20G memory, block mappings (granule 1G))
Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s

Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
	   (40 vCPUs, 20G memory, block mappings (granule 1G))
Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s

Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s

(2) performance change of patches 2 and 3 (based on patch 1)
cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
	   (1 vCPU, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
	   (20 vCPUs, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
	   (40 vCPUs, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s

---

Yanan Wang (4):
  KVM: arm64: Move the clean of dcache to the map handler
  KVM: arm64: Add an independent API for coalescing tables
  KVM: arm64: Install the block entry before unmapping the page mappings
  KVM: arm64: Distinguish cases of memcache allocations completely

 arch/arm64/include/asm/kvm_mmu.h | 16 -------
 arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
 arch/arm64/kvm/mmu.c             | 39 ++++++---------
 3 files changed, 69 insertions(+), 68 deletions(-)

Comments

Alexandru Elisei Feb. 23, 2021, 3:55 p.m. UTC | #1
Hi Yanan,

I wanted to review the patches, but unfortunately I get an error when trying to
apply the first patch in the series:

Applying: KVM: arm64: Move the clean of dcache to the map handler
error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
error: patch failed: arch/arm64/kvm/mmu.c:882
error: arch/arm64/kvm/mmu.c: patch does not apply
Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
mmu.c from your patch are different from what is found on upstream master. Did you
use another branch as the base for your patches?

Thanks,

Alex

On 2/8/21 11:22 AM, Yanan Wang wrote:
> [...]
Yanan Wang Feb. 24, 2021, 2:35 a.m. UTC | #2
Hi Alex,

On 2021/2/23 23:55, Alexandru Elisei wrote:
> Hi Yanan,
>
> I wanted to review the patches, but unfortunately I get an error when trying to
> apply the first patch in the series:
>
> Applying: KVM: arm64: Move the clean of dcache to the map handler
> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
> error: patch failed: arch/arm64/kvm/mmu.c:882
> error: arch/arm64/kvm/mmu.c: patch does not apply
> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
> mmu.c from your patch are different from what is found on upstream master. Did you
> use another branch as the base for your patches?
Thanks for your attention.
Indeed, this series was more or less based on the patches I posted before
(Link:
https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
They have already been merged into the up-to-date upstream master
(commit: 509552e65ae8287178a5cdea2d734dcd2d6380ab), but are not in tags
v5.11-rc1 to v5.11-rc7.
Could you please try the newest upstream master (since commit:
509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
apply errors occur.

Thanks,

Yanan.

> Thanks,
>
> Alex
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> [...]
Alexandru Elisei Feb. 24, 2021, 5:20 p.m. UTC | #3
Hi,

On 2/24/21 2:35 AM, wangyanan (Y) wrote:

> Hi Alex,
>
> On 2021/2/23 23:55, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> I wanted to review the patches, but unfortunately I get an error when trying to
>> apply the first patch in the series:
>>
>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>> error: patch failed: arch/arm64/kvm/mmu.c:882
>> error: arch/arm64/kvm/mmu.c: patch does not apply
>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>> When you have resolved this problem, run "git am --continue".
>> If you prefer to skip this patch, run "git am --skip" instead.
>> To restore the original branch and stop patching, run "git am --abort".
>>
>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>> mmu.c from your patch are different from what is found on upstream master. Did you
>> use another branch as the base for your patches?
> Thanks for your attention.
> Indeed, this series was more or less based on the patches I posted before (Link:
> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
> They have already been merged into the up-to-date upstream master (commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but are not in tags v5.11-rc1 to
> v5.11-rc7.
> Could you please try the newest upstream master (since commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
> apply errors occur.

That worked for me, thank you for the quick reply.

Just to double check, when you run the benchmarks, the before results are for a
kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
the fault is handled successfully"), and the after results are with this series on
top, right?

Thanks,

Alex

>
> Thanks,
>
> Yanan.
>
>> Thanks,
>>
>> Alex
>>
>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>> [...]
Yanan Wang Feb. 25, 2021, 6:13 a.m. UTC | #4
On 2021/2/25 1:20, Alexandru Elisei wrote:
> Hi,
>
> On 2/24/21 2:35 AM, wangyanan (Y) wrote:
>
>> Hi Alex,
>>
>> On 2021/2/23 23:55, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> I wanted to review the patches, but unfortunately I get an error when trying to
>>> apply the first patch in the series:
>>>
>>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>>> error: patch failed: arch/arm64/kvm/mmu.c:882
>>> error: arch/arm64/kvm/mmu.c: patch does not apply
>>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>>> When you have resolved this problem, run "git am --continue".
>>> If you prefer to skip this patch, run "git am --skip" instead.
>>> To restore the original branch and stop patching, run "git am --abort".
>>>
>>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>>> mmu.c from your patch are different from what is found on upstream master. Did you
>>> use another branch as the base for your patches?
>> Thanks for your attention.
>> Indeed, this series was more or less based on the patches I posted before (Link:
>> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
>> They have already been merged into the up-to-date upstream master (commit:
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but are not in tags v5.11-rc1 to
>> v5.11-rc7.
>> Could you please try the newest upstream master (since commit:
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
>> apply errors occur.
> That worked for me, thank you for the quick reply.
>
> Just to double check, when you run the benchmarks, the before results are for a
> kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
> the fault is handled successfully"), and the after results are with this series on
> top, right?

Yes, that's right. So the performance changes have nothing to do with the
earlier series ending at commit 509552e65ae8.

Thanks,

Yanan

>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan.
>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>> [...]