[0/2] Support Armv8.9/v9.4 FEAT_HAFT

Message ID 20240802093458.32683-1-yangyicong@huawei.com (mailing list archive)

Message

Yicong Yang Aug. 2, 2024, 9:34 a.m. UTC
From: Yicong Yang <yangyicong@hisilicon.com>

This series adds basic support for FEAT_HAFT, introduced in Armv8.9/v9.4,
and enables ARCH_HAS_NONLEAF_PMD_YOUNG. The latter is used in lru-gen
aging. Tested with lru-gen using the steps below:
1. Generate a 1GiB working set with `stress-ng --vm 1`, then stop the task
   so that it no longer accesses the memory (the AF bit won't be updated).
2. Try to age the memory through /sys/kernel/debug/lru_gen.

Run the above steps with and without LRU_GEN_NONLEAF_YOUNG (0x4),
switching it through /sys/kernel/mm/lru_gen/enabled. With
LRU_GEN_NONLEAF_YOUNG set, the page walk tests and clears the PMD AF bit
for aging; otherwise it tests and clears the PTE AF bit. In this test
LRU_GEN_NONLEAF_YOUNG improves the efficiency of page scanning, since the
pages are not accessed and we don't need to scan each PTE.
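
As a rough illustration of why the non-leaf AF bit helps here, the toy
model below counts how many PTEs an aging pass has to look at. It is a
sketch, not kernel code: the class and function names are made up for
this example, 4KiB pages with 512 PTEs per last-level table are assumed,
and the count says nothing about wall-clock time on real hardware.

```python
PTES_PER_TABLE = 512  # assumption: 4KiB pages, 512 PTEs per last-level table


class PteTable:
    """One last-level page table plus the AF bit of the PMD entry pointing to it."""

    def __init__(self):
        self.table_young = False                 # models the non-leaf (PMD) AF bit
        self.pte_young = [False] * PTES_PER_TABLE  # models the per-PTE AF bits

    def touch(self, idx):
        # With FEAT_HAFT the hardware would set both levels on an access.
        self.pte_young[idx] = True
        self.table_young = True


def age(tables, nonleaf_young):
    """Test and clear AF bits; return (young_pages, ptes_scanned)."""
    young = 0
    scanned = 0
    for table in tables:
        if nonleaf_young:
            was_young = table.table_young
            table.table_young = False
            if not was_young:
                continue  # nothing below was accessed: skip all 512 PTEs
        for i in range(PTES_PER_TABLE):
            scanned += 1
            if table.pte_young[i]:
                table.pte_young[i] = False
                young += 1
    return young, scanned


# 512 tables * 512 PTEs * 4KiB = 1GiB of idle memory
tables = [PteTable() for _ in range(512)]
print(age(tables, nonleaf_young=False))  # (0, 262144)
print(age(tables, nonleaf_young=True))   # (0, 0)
```

In the idle case every PMD-level check fails, so the per-PTE loop is never
entered. That is the best case for this feature; a working set that is
still being touched would see less benefit.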

For lru-gen aging:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/mm/multigen_lru.rst?h=v6.11-rc1#n94
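
The two test configurations could be driven along the following lines.
This is only a sketch of the procedure, not the script used for the
numbers in this thread. The 0x1 and 0x2 bits and the `+ memcg_id node_id
max_gen_nr` command are taken from the admin guide linked above; the
helper names are hypothetical, and memcg_id, node_id and max_gen_nr have
to be read from /sys/kernel/debug/lru_gen on the system under test.

```python
from pathlib import Path

LRU_GEN_ENABLED = Path("/sys/kernel/mm/lru_gen/enabled")
LRU_GEN_DEBUG = Path("/sys/kernel/debug/lru_gen")

# Component bits of /sys/kernel/mm/lru_gen/enabled (per the admin guide)
LRU_GEN_CORE = 0x1           # main switch
LRU_GEN_MM_WALK = 0x2        # clear AF in leaf entries in large batches
LRU_GEN_NONLEAF_YOUNG = 0x4  # clear AF in non-leaf entries as well


def enabled_mask(nonleaf_young):
    """Value to write to the 'enabled' file, as a decimal string."""
    mask = LRU_GEN_CORE | LRU_GEN_MM_WALK
    if nonleaf_young:
        mask |= LRU_GEN_NONLEAF_YOUNG
    return str(mask)


def aging_command(memcg_id, node_id, max_gen_nr):
    """Debugfs command asking lru-gen to create generation max_gen_nr + 1."""
    return f"+ {memcg_id} {node_id} {max_gen_nr}"


def run_aging(memcg_id, node_id, max_gen_nr, nonleaf_young):
    """Needs root and a mounted debugfs; writes to the real kernel files."""
    LRU_GEN_ENABLED.write_text(enabled_mask(nonleaf_young))
    LRU_GEN_DEBUG.write_text(aging_command(memcg_id, node_id, max_gen_nr))
```

Calling run_aging() once with nonleaf_young=False and once with
nonleaf_young=True, and timing each call, would give the kind of
comparison described above.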

Yicong Yang (2):
  arm64: Add support for FEAT_HAFT
  arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG

 arch/arm64/Kconfig                     | 21 ++++++++++++++
 arch/arm64/include/asm/pgtable-hwdef.h |  5 ++++
 arch/arm64/include/asm/pgtable.h       | 14 ++++++++--
 arch/arm64/kernel/cpufeature.c         | 38 ++++++++++++++++++++++++++
 arch/arm64/tools/cpucaps               |  1 +
 arch/arm64/tools/sysreg                |  1 +
 6 files changed, 78 insertions(+), 2 deletions(-)

Comments

Marc Zyngier Aug. 2, 2024, 10:40 a.m. UTC | #1
On Fri, 02 Aug 2024 10:34:56 +0100,
Yicong Yang <yangyicong@huawei.com> wrote:
> 
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> This series adds basic support for FEAT_HAFT introduced in Armv8.9/v9.4
> and enable ARCH_HAS_NONLEAF_PMD_YOUNG. The latter will be used in
> lru-gen aging. Tested with lru-gen in below steps:
> 1. Generate a 1GiB workingset by `stress-ng --vm 1`. Then hang the task to
>    stop accessing the memory. (AF bit won't be updated)
> 2. try to age the memory by /sys/kernel/debug/lru_gen
> 
> Run above steps with LRU_GEN_NONLEAF_YOUNG(0x4) and not respectively
> (switching by /sys/kernel/mm/lru_gen/enabled). LRU_GEN_NONLEAF_YOUNG
> will clear and test the PMD AF bit on page walking for aging,
> otherwise will clear and test the PTE AF bit for aging. In this case
> LRU_GEN_NONLEAF_YOUNG will improve the efficiency of page scanning
> since pages won't be accessed and we don't need to scan each PTE.

Improve by how much? Can you please publish numbers that demonstrate
the effect of this feature?

Thanks,

	M.

Yicong Yang Aug. 6, 2024, 3:43 a.m. UTC | #2
On 2024/8/2 18:40, Marc Zyngier wrote:
> On Fri, 02 Aug 2024 10:34:56 +0100,
> Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> From: Yicong Yang <yangyicong@hisilicon.com>
>>
>> This series adds basic support for FEAT_HAFT introduced in Armv8.9/v9.4
>> and enable ARCH_HAS_NONLEAF_PMD_YOUNG. The latter will be used in
>> lru-gen aging. Tested with lru-gen in below steps:
>> 1. Generate a 1GiB workingset by `stress-ng --vm 1`. Then hang the task to
>>    stop accessing the memory. (AF bit won't be updated)
>> 2. try to age the memory by /sys/kernel/debug/lru_gen
>>
>> Run above steps with LRU_GEN_NONLEAF_YOUNG(0x4) and not respectively
>> (switching by /sys/kernel/mm/lru_gen/enabled). LRU_GEN_NONLEAF_YOUNG
>> will clear and test the PMD AF bit on page walking for aging,
>> otherwise will clear and test the PTE AF bit for aging. In this case
>> LRU_GEN_NONLEAF_YOUNG will improve the efficiency of page scanning
>> since pages won't be accessed and we don't need to scan each PTE.
> 
> Improve by how much? Can you please publish numbers that demonstrate
> the effect of this feature?
> 

With LRU_GEN_NONLEAF_YOUNG we observed ~40% time saved for 1GiB of memory
on our emulated platform.

Thanks.

Marc Zyngier Aug. 6, 2024, 8:06 a.m. UTC | #3
On Tue, 06 Aug 2024 04:43:52 +0100,
Yicong Yang <yangyicong@huawei.com> wrote:
> 
> On 2024/8/2 18:40, Marc Zyngier wrote:
> > On Fri, 02 Aug 2024 10:34:56 +0100,
> > Yicong Yang <yangyicong@huawei.com> wrote:
> >>
> >> From: Yicong Yang <yangyicong@hisilicon.com>
> >>
> >> This series adds basic support for FEAT_HAFT introduced in Armv8.9/v9.4
> >> and enable ARCH_HAS_NONLEAF_PMD_YOUNG. The latter will be used in
> >> lru-gen aging. Tested with lru-gen in below steps:
> >> 1. Generate a 1GiB workingset by `stress-ng --vm 1`. Then hang the task to
> >>    stop accessing the memory. (AF bit won't be updated)
> >> 2. try to age the memory by /sys/kernel/debug/lru_gen
> >>
> >> Run above steps with LRU_GEN_NONLEAF_YOUNG(0x4) and not respectively
> >> (switching by /sys/kernel/mm/lru_gen/enabled). LRU_GEN_NONLEAF_YOUNG
> >> will clear and test the PMD AF bit on page walking for aging,
> >> otherwise will clear and test the PTE AF bit for aging. In this case
> >> LRU_GEN_NONLEAF_YOUNG will improve the efficiency of page scanning
> >> since pages won't be accessed and we don't need to scan each PTE.
> > 
> > Improve by how much? Can you please publish numbers that demonstrate
> > the effect of this feature?
> > 
> 
> With LRU_GEN_NONLEAF_YOUNG ~40% time saved for 1GiB memory observed on our
> emulated platform.

This certainly looks impressive, but it is a very ad-hoc benchmark,
and emulation numbers don't necessarily result in similar improvement
on actual HW.

How does this translate for a more realistic/useful workload? Even
numbers obtained on another architecture would be useful.

Thanks,

	M.

Yicong Yang Aug. 6, 2024, 1:35 p.m. UTC | #4
On 2024/8/6 16:06, Marc Zyngier wrote:
> On Tue, 06 Aug 2024 04:43:52 +0100,
> Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> On 2024/8/2 18:40, Marc Zyngier wrote:
>>> On Fri, 02 Aug 2024 10:34:56 +0100,
>>> Yicong Yang <yangyicong@huawei.com> wrote:
>>>>
>>>> From: Yicong Yang <yangyicong@hisilicon.com>
>>>>
>>>> This series adds basic support for FEAT_HAFT introduced in Armv8.9/v9.4
>>>> and enable ARCH_HAS_NONLEAF_PMD_YOUNG. The latter will be used in
>>>> lru-gen aging. Tested with lru-gen in below steps:
>>>> 1. Generate a 1GiB workingset by `stress-ng --vm 1`. Then hang the task to
>>>>    stop accessing the memory. (AF bit won't be updated)
>>>> 2. try to age the memory by /sys/kernel/debug/lru_gen
>>>>
>>>> Run above steps with LRU_GEN_NONLEAF_YOUNG(0x4) and not respectively
>>>> (switching by /sys/kernel/mm/lru_gen/enabled). LRU_GEN_NONLEAF_YOUNG
>>>> will clear and test the PMD AF bit on page walking for aging,
>>>> otherwise will clear and test the PTE AF bit for aging. In this case
>>>> LRU_GEN_NONLEAF_YOUNG will improve the efficiency of page scanning
>>>> since pages won't be accessed and we don't need to scan each PTE.
>>>
>>> Improve by how much? Can you please publish numbers that demonstrate
>>> the effect of this feature?
>>>
>>
>> With LRU_GEN_NONLEAF_YOUNG ~40% time saved for 1GiB memory observed on our
>> emulated platform.
> 
> This certainly looks impressive, but it is a very ad-hoc benchmark,
> and emulation numbers don't necessarily result in similar improvement
> on actual HW.
> 

Yes, indeed. I designed this case only to test that the feature works. Real
cases may be more complex and less ideal, and may also involve other things
such as THP (with THP we may already be using PMD block mappings, so the
advantage of HAFT may not show up).

> How does this translate for a more realistic/useful workload? Even
> numbers obtained on another architecture would be useful.
> 

I have no numbers for real workloads yet. That may be the next step, once a
platform is available (an x86 or arm64 one that can run real workloads).

Thanks.