
arm64: cache: Update cache_line_size for certain HiSilicon platforms

Message ID 1553581690-24753-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State New, archived
Series arm64: cache: Update cache_line_size for certain HiSilicon platforms

Commit Message

Shaokun Zhang March 26, 2019, 6:28 a.m. UTC
Certain HiSilicon platforms, such as the Kunpeng920 server SoC, use tsv110
CPUs whose L1 cache line size is 64 bytes, while the cache line size of the
L3C is 128 bytes.
cache_line_size() is used mostly by IO device drivers, so report the correct
value so that the drivers can match it accurately and get good performance.

When testing mlx5 on the Kunpeng920 SoC, ib_send_bw is run with a 4-byte
packet length and a single queue and CPU core:
Without this patch: 1.67 Mpps
With this patch:    2.40 Mpps

Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Zhenfa Qiu <qiuzhenfa@hisilicon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Reported-by: Zhenfa Qiu <qiuzhenfa@hisilicon.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
 arch/arm64/include/asm/cache.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Catalin Marinas March 26, 2019, 2:55 p.m. UTC | #1
On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
> of L3C is 128-Byte.
> cache_line_size is used mostly for IO device drivers, so we shall correct
> the right value and the device drivers can match it accurately to get good
> performance.

The CWG is supposed to report the write-back granule for your system,
which usually is the largest cache line size visible to software. If
this is indeed 128-byte for your system, CTR_EL0.CWG should say so. If
the L3C is completely transparent to software (not affected by the
D-cache operations), then it may not influence the CWG value.

What's the CTR_EL0.CWG value on your SoC?
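
(For reference, cache_line_size() is derived from that field; a sketch
closely following arch/arm64/include/asm/cache.h of this era, so treat the
exact constants as illustrative rather than authoritative source:)

  #define CTR_CWG_SHIFT	24
  #define CTR_CWG_MASK	15

  static inline u32 cache_type_cwg(void)
  {
  	/* CWG is CTR_EL0[27:24], read via the cache type register */
  	return (read_cpuid_cachetype() >> CTR_CWG_SHIFT) & CTR_CWG_MASK;
  }

  static inline int cache_line_size(void)
  {
  	u32 cwg = cache_type_cwg();

  	/* CWG encodes log2(words); a word is 4 bytes, so 4 << 4 = 64 */
  	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
  }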

> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
> that the length of the packet is 4-Byte and only one queue and cpu core:
> Without this patch: 1.67 Mpps
> with this patch   : 2.40 Mpps

This needs a better explanation. How does cache_line_size() affect the
4-byte packet? Does it send more packets at once?

I've seen in the mlx5 code assumptions about cache_line_size() being
128. It looks to me more like some driver hand-tuning for specific
system configuration. Can the driver be changed to be more generic
instead?
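
(Illustrative only -- not the actual mlx5 code -- but the kind of
hand-tuning meant here, where a completion-queue entry stride is chosen
from cache_line_size():)

  /* Hypothetical driver setup: pick a CQE stride matching the reported
   * cache line so that adjacent CQEs do not share (or straddle) a line. */
  int cqe_size = (cache_line_size() == 128) ? 128 : 64;

With a 128-byte L3 line but a reported 64-byte value, two 64-byte CQEs can
end up in one L3 line, which would be consistent with the cacheline
bouncing discussed later in the thread.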

In conclusion, I don't think this patch is the right solution.
Shaokun Zhang March 27, 2019, 7:16 a.m. UTC | #2
Hi Catalin,

On 2019/3/26 22:55, Catalin Marinas wrote:
> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
>> of L3C is 128-Byte.
>> cache_line_size is used mostly for IO device drivers, so we shall correct
>> the right value and the device drivers can match it accurately to get good
>> performance.
> 
> The CWG is supposed to report the write-back granule for your system,
> which usually is the largest cache line size visible to software. If
> this is indeed 128-byte for your system, CTR_EL0.CWG should say so. If

Agreed.
For certain reasons, there are two different cache line sizes on our SoC.

> the L3C is completely transparent to software (not affected by the
> D-cache operations), than it may not influence the CWG value.
> 
> What's the CTR_EL0.CWG value on your SoC?
> 

It's 4'b0100, and the cache line size is 64-byte.

>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>> that the length of the packet is 4-Byte and only one queue and cpu core:
>> Without this patch: 1.67 Mpps
>> with this patch   : 2.40 Mpps
> 
> This needs a better explanation. How does cache_line_size() affect the
> 4-byte packet? Does it send more packets at once?
> 
> I've seen in the mlx5 code assumptions about cache_line_size() being
> 128. It looks to me more like some driver hand-tuning for specific
> system configuration. Can the driver be changed to be more generic

I'm not sure whether mlx5 takes different actions for different cache line
sizes on different architectures or platforms, so the driver needs to read
the right cache_line_size.

Originally, I thought this interface was used mainly by IO drivers and was
harmless anywhere else.

Thanks,
Shaokun

> instead?
> 
> In conclusion, I don't think this patch is the right solution.
>
Catalin Marinas March 29, 2019, 6:52 p.m. UTC | #3
On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
> On 2019/3/26 22:55, Catalin Marinas wrote:
> > On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
> >> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
> >> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
> >> of L3C is 128-Byte.
> >> cache_line_size is used mostly for IO device drivers, so we shall correct
> >> the right value and the device drivers can match it accurately to get good
> >> performance.
[...]
> > What's the CTR_EL0.CWG value on your SoC?
> 
> It's 4'b0100 and cache line size is 64-byte.
> 
> >> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
> >> that the length of the packet is 4-Byte and only one queue and cpu core:
> >> Without this patch: 1.67 Mpps
> >> with this patch   : 2.40 Mpps
> > 
> > This needs a better explanation. How does cache_line_size() affect the
> > 4-byte packet? Does it send more packets at once?
> > 
> > I've seen in the mlx5 code assumptions about cache_line_size() being
> > 128. It looks to me more like some driver hand-tuning for specific
> > system configuration. Can the driver be changed to be more generic
> 
> I'm not sure that mlx5 may implement some actions for different cache line
> size from different arch or platforms, so the driver needs to read the
> right cache_line_size.

We need to better understand why there is a performance hit, but at a quick grep
for "128" in the mlx5 code, I can see different code paths executed when
cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
can somehow disable the L3C, do you still see the same performance
difference?

> Originally, I thought this interface was used mainly for IO drivers and no
> harm to any other places.

Looking through the slab code, cache_line_size() is used when
SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
is 128 on arm64).
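
(A minimal illustration of that slab use, with made-up names:)

  /* Objects in this cache are aligned to cache_line_size() purely for
   * performance; non-coherent DMA safety comes from ARCH_DMA_MINALIGN
   * instead. */
  struct kmem_cache *cache = kmem_cache_create("example_cache",
  					       sizeof(struct example_obj),
  					       0, SLAB_HWCACHE_ALIGN, NULL);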

Anyway, if the performance drop is indeed caused by more L3C cacheline
bouncing, we can look at fixing cache_line_size() for your CPU to return
128 but I'd like it done using the cpufeature/errata framework. We have
an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128
byte cacheline and read this in cache_line_size() but I think this may
cause user accesses to CTR_EL0 to trap into the kernel to be emulated.
This may cause more performance issues for the user than just
misreporting the CWG.

Alternatively, we store the cache_line_size in a variable and return it.
Cc'ing Suzuki for comments on cpu errata approach.
Shaokun Zhang April 2, 2019, 7:51 a.m. UTC | #4
Hi Catalin,

On 2019/3/30 2:52, Catalin Marinas wrote:
> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
>> On 2019/3/26 22:55, Catalin Marinas wrote:
>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
>>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
>>>> of L3C is 128-Byte.
>>>> cache_line_size is used mostly for IO device drivers, so we shall correct
>>>> the right value and the device drivers can match it accurately to get good
>>>> performance.
> [...]
>>> What's the CTR_EL0.CWG value on your SoC?
>>
>> It's 4'b0100 and cache line size is 64-byte.
>>
>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
>>>> Without this patch: 1.67 Mpps
>>>> with this patch   : 2.40 Mpps
>>>
>>> This needs a better explanation. How does cache_line_size() affect the
>>> 4-byte packet? Does it send more packets at once?
>>>
>>> I've seen in the mlx5 code assumptions about cache_line_size() being
>>> 128. It looks to me more like some driver hand-tuning for specific
>>> system configuration. Can the driver be changed to be more generic
>>
>> I'm not sure that mlx5 may implement some actions for different cache line
>> size from different arch or platforms, so the driver needs to read the
>> right cache_line_size.
> 
> We need to better understand why the performance hit but at a quick grep
> for "128" in the mlx5 code, I can see different code paths executed when
> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
> can somehow disable the L3C, do you still see the same performance
> difference?
> 

Unfortunately, we can't disable the L3C separately to check the performance.
On this platform, cache-able IO operations look up the L3C; if we return
128-byte, the performance is better.

One more question about CWG: if the L1$/L2$ cache line size on the A76 is
64-byte and it is used on our platform (the L3C cache line is 128-byte),
should CWG be 4'b0100 (64-byte) or 4'b0101 (128-byte)?

>> Originally, I thought this interface was used mainly for IO drivers and no
>> harm to any other places.
> 
> Looking through the slab code, cache_line_size() is used when
> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
> is 128 on arm64).
> 

Yeah, I missed that.
ARCH_DMA_MINALIGN is only relevant for non-DMA-coherent devices, right? It's
only defined and not used directly in the drivers.

> Anyway, if the performance drop is indeed caused by more L3C cacheline
> bouncing, we can look at fixing cache_line_size() for your CPU to return
> 128 but I'd like it done using the cpufeature/errata framework. We have
> an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128
> byte cacheline and read this in cache_line_size() but I think this may
> cause user accesses to CTR_EL0 to trap into the kernel to be emulated.
> This may cause more performance issues for the user than just
> misreporting the CWG.
> 

If there is another good approach to do this, I am glad to try it. You are
quite right that we don't want to cause more performance issues for users
by fixing this.

> Alternatively, we store the cache_line_size in a variable and return it.
> Cc'ing Suzuki for comments on cpu errata approach.
> 

Suzuki, any comments are very appreciated.

Thanks for your guidance,
Shaokun
Suzuki K Poulose April 2, 2019, 1:02 p.m. UTC | #5
On 29/03/2019 18:52, Catalin Marinas wrote:
> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
>> On 2019/3/26 22:55, Catalin Marinas wrote:
>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
>>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
>>>> of L3C is 128-Byte.
>>>> cache_line_size is used mostly for IO device drivers, so we shall correct
>>>> the right value and the device drivers can match it accurately to get good
>>>> performance.
> [...]
>>> What's the CTR_EL0.CWG value on your SoC?
>>
>> It's 4'b0100 and cache line size is 64-byte.
>>
>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
>>>> Without this patch: 1.67 Mpps
>>>> with this patch   : 2.40 Mpps
>>>
>>> This needs a better explanation. How does cache_line_size() affect the
>>> 4-byte packet? Does it send more packets at once?
>>>
>>> I've seen in the mlx5 code assumptions about cache_line_size() being
>>> 128. It looks to me more like some driver hand-tuning for specific
>>> system configuration. Can the driver be changed to be more generic
>>
>> I'm not sure that mlx5 may implement some actions for different cache line
>> size from different arch or platforms, so the driver needs to read the
>> right cache_line_size.
> 
> We need to better understand why the performance hit but at a quick grep
> for "128" in the mlx5 code, I can see different code paths executed when
> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
> can somehow disable the L3C, do you still see the same performance
> difference?
> 
>> Originally, I thought this interface was used mainly for IO drivers and no
>> harm to any other places.
> 
> Looking through the slab code, cache_line_size() is used when
> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
> is 128 on arm64).
> 
> Anyway, if the performance drop is indeed caused by more L3C cacheline
> bouncing, we can look at fixing cache_line_size() for your CPU to return
> 128 but I'd like it done using the cpufeature/errata framework. We have
> an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128
> byte cacheline and read this in cache_line_size() but I think this may
> cause user accesses to CTR_EL0 to trap into the kernel to be emulated.
> This may cause more performance issues for the user than just
> misreporting the CWG.

As Catalin rightly pointed out there are two separate issues here :

1) cache_line_size() should always provide the system-wide safe value,
which we don't do at the moment for the "kernel", as we read the raw
value of CTR_EL0.

2) Handling the specific case for tsv110 CPU. We have two options.

    a) We could have a separate erratum entry to fix up the system-wide
       CWG to 128, thus affecting only the kernel (without any user
       traps); a rough sketch follows below.

      OR

    b) Reuse the existing mismatched cache type erratum to trigger for
       tsv110. This would enable the trapping of CTR_EL0 access from user.
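
(A rough sketch of what option a) might look like in cpu_errata.c -- purely
illustrative; the capability name and enable hook are made up:)

  { /* hypothetical entry, not in mainline */
  	.desc = "HiSilicon TSV110 128-byte L3 cache line",
  	.capability = ARM64_WORKAROUND_HISI_TSV110_CWG, /* made-up cap */
  	ERRATA_MIDR_ALL_VERSIONS(MIDR_HISI_TSV110),
  	.cpu_enable = cpu_enable_tsv110_cwg, /* would fix up the reported CWG */
  },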


Cheers
Suzuki
Catalin Marinas April 3, 2019, 12:57 p.m. UTC | #6
On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
> On 2019/3/30 2:52, Catalin Marinas wrote:
> > On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
> >> On 2019/3/26 22:55, Catalin Marinas wrote:
> >>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
> >>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
> >>>> that the length of the packet is 4-Byte and only one queue and cpu core:
> >>>> Without this patch: 1.67 Mpps
> >>>> with this patch   : 2.40 Mpps
> >>>
> >>> This needs a better explanation. How does cache_line_size() affect the
> >>> 4-byte packet? Does it send more packets at once?
> >>>
> >>> I've seen in the mlx5 code assumptions about cache_line_size() being
> >>> 128. It looks to me more like some driver hand-tuning for specific
> >>> system configuration. Can the driver be changed to be more generic
> >>
> >> I'm not sure that mlx5 may implement some actions for different cache line
> >> size from different arch or platforms, so the driver needs to read the
> >> right cache_line_size.
> > 
> > We need to better understand why the performance hit but at a quick grep
> > for "128" in the mlx5 code, I can see different code paths executed when
> > cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
> > can somehow disable the L3C, do you still see the same performance
> > difference?
> 
> Unfortunately, we can't disable L3C separately to check the performance.
> In this platform, if IO operation is cache-able, it can lookup L3C. If we
> return 128-byte, the performance is better.

I presume the device affected (mlx5) is cache coherent (can look up the
L3C).

> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?

The ARM ARM description of CWG states:

  Log 2 of the number of words of the maximum size of memory that can be
  overwritten as a result of the eviction of a cache entry that has had
  a memory location in it modified.

If you have DMA-capable devices on your platform that do not snoop the
L3C, I would say CWG should report 128-byte. Otherwise it probably
doesn't matter from a correctness perspective.
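
(Worked through with the kernel's decoding, CWG is log2 of the number of
4-byte words, so the line size is 4 << CWG: 4'b0100 gives 4 << 4 = 64 bytes
and 4'b0101 gives 4 << 5 = 128 bytes.)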

> >> Originally, I thought this interface was used mainly for IO drivers and no
> >> harm to any other places.
> > 
> > Looking through the slab code, cache_line_size() is used when
> > SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
> > rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
> > is 128 on arm64).
> 
> Yeah, I miss it.
> ARCH_DMA_MINALIGN is only used for non DMA-coherent, right? It's only defined
> and not used in the drivers.

It's not used by drivers directly but this sets the minimum slab
allocation so that a device can do non-coherent DMA to a kmalloc'ed
buffer for example without the risk of corrupting adjacent objects.
Catalin Marinas April 4, 2019, 10:27 a.m. UTC | #7
On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?

What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
say on your CPU? This would be information provided by firmware.

Since cache_line_size() in Linux seems to be used for performance rather
than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there
may be a use-case for setting it based on the cache topology description
in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep
since he contributed the cacheinfo code).

But before we change anything in the arch code regarding
cache_line_size(), we need a better understanding of why 128-byte makes
a significant difference (does 256 make it even better?).
John Garry April 5, 2019, 8:29 a.m. UTC | #8
On 04/04/2019 11:27, Catalin Marinas wrote:
> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
>
> What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
> say on your CPU?
> This would be information provided by firmware.

Hi Catalin,

Our HQ is on vacation now, but this may answer your question:

https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1620/Pptt/Pptt.c#L129

i.e. 128

John

>
> Since cache_line_size() in Linux seems to be used for performance rather
> than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there
> may be a use-case for setting it based on the cache topology description
> in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep
> since he contributed the cacheinfo code).
>
> But before we change anything in the arch code regarding
> cache_line_size(), we need a better understanding of why 128-byte makes
> a significant difference (does 256 make it even better?).
>
Shaokun Zhang April 8, 2019, 7:51 a.m. UTC | #9
Hi Catalin,

Sorry for the late reply; I was on a short vacation.

On 2019/4/3 20:57, Catalin Marinas wrote:
> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
>> On 2019/3/30 2:52, Catalin Marinas wrote:
>>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
>>>> On 2019/3/26 22:55, Catalin Marinas wrote:
>>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>>>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
>>>>>> Without this patch: 1.67 Mpps
>>>>>> with this patch   : 2.40 Mpps
>>>>>
>>>>> This needs a better explanation. How does cache_line_size() affect the
>>>>> 4-byte packet? Does it send more packets at once?
>>>>>
>>>>> I've seen in the mlx5 code assumptions about cache_line_size() being
>>>>> 128. It looks to me more like some driver hand-tuning for specific
>>>>> system configuration. Can the driver be changed to be more generic
>>>>
>>>> I'm not sure that mlx5 may implement some actions for different cache line
>>>> size from different arch or platforms, so the driver needs to read the
>>>> right cache_line_size.
>>>
>>> We need to better understand why the performance hit but at a quick grep
>>> for "128" in the mlx5 code, I can see different code paths executed when
>>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
>>> can somehow disable the L3C, do you still see the same performance
>>> difference?
>>
>> Unfortunately, we can't disable L3C separately to check the performance.
>> In this platform, if IO operation is cache-able, it can lookup L3C. If we
>> return 128-byte, the performance is better.
> 
> I presume the device affected (mlx5) is cache coherent (can look up the
> L3C).
> 
>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
> 
> The ARM ARM description of CWG states:
> 
>   Log 2 of the number of words of the maximum size of memory that can be
>   overwritten as a result of the eviction of a cache entry that has had
>   a memory location in it modified.
> 
> If you have DMA-capable devices on your platform that do not snoop the
> L3C, I would say CWG should report 128-byte. Otherwise it probably
> doesn't matter from a correctness perspective.
> 

Thanks for your more detailed explanation.

If a device doesn't snoop the L3C, the L3C is transparent to the IO device,
right? Then we should report the L1$/L2$ cache line size in the CWG field,
not the L3$ cache line size. Am I missing something?

Here we mainly care about performance, under the condition that cache
coherence is fine.

Thanks,
Shaokun

>>>> Originally, I thought this interface was used mainly for IO drivers and no
>>>> harm to any other places.
>>>
>>> Looking through the slab code, cache_line_size() is used when
>>> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
>>> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
>>> is 128 on arm64).
>>
>> Yeah, I miss it.
>> ARCH_DMA_MINALIGN is only used for non DMA-coherent, right? It's only defined
>> and not used in the drivers.
> 
> It's not used by drivers directly but this sets the minimum slab
> allocation so that a device can do non-coherent DMA to a kmalloc'ed
> buffer for example without the risk of corrupting adjacent objects.
>
Shaokun Zhang April 8, 2019, 8:24 a.m. UTC | #10
Hi Catalin,

On 2019/4/4 18:27, Catalin Marinas wrote:
> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
> 
> What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
> say on your CPU? This would be information provided by firmware.
> 

As John replied, it is 128-byte; today I double-checked it:
linux-ucdrqS:~ # cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
128
linux-ucdrqS:~ #

> Since cache_line_size() in Linux seems to be used for performance rather
> than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there
> may be a use-case for setting it based on the cache topology description
> in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep
> since he contributed the cacheinfo code).
> 

I have checked drivers/acpi/pptt.c: @coherency_line_size is assigned but
never used.
Sudeep, should it be used to update cache_line_size()?

> But before we change anything in the arch code regarding
> cache_line_size(), we need a better understanding of why 128-byte makes
> a significant difference (does 256 make it even better?).
> 

I appreciate this question. As I understand it, the L3 hit rate may be
better with 128-byte than with 64-byte, and I will check it.
As for 256-byte, I'm not sure and will try it as well.

Thanks,
Shaokun
Shaokun Zhang April 8, 2019, 8:33 a.m. UTC | #11
Hi Suzuki,

On 2019/4/2 21:02, Suzuki K Poulose wrote:
> 
> 
> On 29/03/2019 18:52, Catalin Marinas wrote:
>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
>>> On 2019/3/26 22:55, Catalin Marinas wrote:
>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>>>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
>>>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
>>>>> of L3C is 128-Byte.
>>>>> cache_line_size is used mostly for IO device drivers, so we shall correct
>>>>> the right value and the device drivers can match it accurately to get good
>>>>> performance.
>> [...]
>>>> What's the CTR_EL0.CWG value on your SoC?
>>>
>>> It's 4'b0100 and cache line size is 64-byte.
>>>
>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
>>>>> Without this patch: 1.67 Mpps
>>>>> with this patch   : 2.40 Mpps
>>>>
>>>> This needs a better explanation. How does cache_line_size() affect the
>>>> 4-byte packet? Does it send more packets at once?
>>>>
>>>> I've seen in the mlx5 code assumptions about cache_line_size() being
>>>> 128. It looks to me more like some driver hand-tuning for specific
>>>> system configuration. Can the driver be changed to be more generic
>>>
>>> I'm not sure that mlx5 may implement some actions for different cache line
>>> size from different arch or platforms, so the driver needs to read the
>>> right cache_line_size.
>>
>> We need to better understand why the performance hit but at a quick grep
>> for "128" in the mlx5 code, I can see different code paths executed when
>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
>> can somehow disable the L3C, do you still see the same performance
>> difference?
>>
>>> Originally, I thought this interface was used mainly for IO drivers and no
>>> harm to any other places.
>>
>> Looking through the slab code, cache_line_size() is used when
>> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons
>> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which
>> is 128 on arm64).
>>
>> Anyway, if the performance drop is indeed caused by more L3C cacheline
>> bouncing, we can look at fixing cache_line_size() for your CPU to return
>> 128 but I'd like it done using the cpufeature/errata framework. We have
>> an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128
>> byte cacheline and read this in cache_line_size() but I think this may
>> cause user accesses to CTR_EL0 to trap into the kernel to be emulated.
>> This may cause more performance issues for the user than just
>> misreporting the CWG.
> 
> As Catalin rightly pointed out there are two separate issues here :
> 
> 1) cache_line_size() should always provide the system wide safe value.
> Which we don't do at the moment for the "kernel", as we read the raw
> value of CTR_EL0.
> 
> 2) Handling the specific case for tsv110 CPU. We have two options.
> 
>    a) We could have a separate erratum entry to fixup the system wide
>       CWG to 128, and thus affecting only the kernel (without any user
>       traps).
> 

Comparing the two choices, a) is the better option since it won't cause
unnecessary trapping, though it does need a separate erratum entry. Let's
first follow up on what Catalin asked about the difference the two cache
line sizes make.

Thanks,
Shaokun

>      OR
> 
>    b) Reuse the existing mismatched cache type erratum to trigger for
>       tsv110. This would enable the trapping of CTR_EL0 access from user.
> 
> 
> Cheers
> Suzuki
> 
> .
>
Will Deacon April 16, 2019, 1:51 p.m. UTC | #12
On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote:
> On 2019/4/3 20:57, Catalin Marinas wrote:
> > On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
> >> On 2019/3/30 2:52, Catalin Marinas wrote:
> >>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
> >>>> On 2019/3/26 22:55, Catalin Marinas wrote:
> >>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
> >>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
> >>>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
> >>>>>> Without this patch: 1.67 Mpps
> >>>>>> with this patch   : 2.40 Mpps
> >>>>>
> >>>>> This needs a better explanation. How does cache_line_size() affect the
> >>>>> 4-byte packet? Does it send more packets at once?
> >>>>>
> >>>>> I've seen in the mlx5 code assumptions about cache_line_size() being
> >>>>> 128. It looks to me more like some driver hand-tuning for specific
> >>>>> system configuration. Can the driver be changed to be more generic
> >>>>
> >>>> I'm not sure that mlx5 may implement some actions for different cache line
> >>>> size from different arch or platforms, so the driver needs to read the
> >>>> right cache_line_size.
> >>>
> >>> We need to better understand why the performance hit but at a quick grep
> >>> for "128" in the mlx5 code, I can see different code paths executed when
> >>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
> >>> can somehow disable the L3C, do you still see the same performance
> >>> difference?
> >>
> >> Unfortunately, we can't disable L3C separately to check the performance.
> >> In this platform, if IO operation is cache-able, it can lookup L3C. If we
> >> return 128-byte, the performance is better.
> > 
> > I presume the device affected (mlx5) is cache coherent (can look up the
> > L3C).
> > 
> >> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
> >> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
> >> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
> > 
> > The ARM ARM description of CWG states:
> > 
> >   Log 2 of the number of words of the maximum size of memory that can be
> >   overwritten as a result of the eviction of a cache entry that has had
> >   a memory location in it modified.
> > 
> > If you have DMA-capable devices on your platform that do not snoop the
> > L3C, I would say CWG should report 128-byte. Otherwise it probably
> > doesn't matter from a correctness perspective.
> > 
> 
> Thanks for your more detailed explanation.
> 
> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall
> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I
> miss something?

Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't
you run into coherency problems? For example, if the device is trying to
read a line that is dirty in L3.

Will
Shaokun Zhang April 16, 2019, 2:23 p.m. UTC | #13
Hi Will,

On 2019/4/16 21:51, Will Deacon wrote:
> On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote:
>> On 2019/4/3 20:57, Catalin Marinas wrote:
>>> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
>>>> On 2019/3/30 2:52, Catalin Marinas wrote:
>>>>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote:
>>>>>> On 2019/3/26 22:55, Catalin Marinas wrote:
>>>>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
>>>>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
>>>>>>>> that the length of the packet is 4-Byte and only one queue and cpu core:
>>>>>>>> Without this patch: 1.67 Mpps
>>>>>>>> with this patch   : 2.40 Mpps
>>>>>>>
>>>>>>> This needs a better explanation. How does cache_line_size() affect the
>>>>>>> 4-byte packet? Does it send more packets at once?
>>>>>>>
>>>>>>> I've seen in the mlx5 code assumptions about cache_line_size() being
>>>>>>> 128. It looks to me more like some driver hand-tuning for specific
>>>>>>> system configuration. Can the driver be changed to be more generic
>>>>>>
>>>>>> I'm not sure that mlx5 may implement some actions for different cache line
>>>>>> size from different arch or platforms, so the driver needs to read the
>>>>>> right cache_line_size.
>>>>>
>>>>> We need to better understand why the performance hit but at a quick grep
>>>>> for "128" in the mlx5 code, I can see different code paths executed when
>>>>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you
>>>>> can somehow disable the L3C, do you still see the same performance
>>>>> difference?
>>>>
>>>> Unfortunately, we can't disable L3C separately to check the performance.
>>>> In this platform, if IO operation is cache-able, it can lookup L3C. If we
>>>> return 128-byte, the performance is better.
>>>
>>> I presume the device affected (mlx5) is cache coherent (can look up the
>>> L3C).
>>>
>>>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
>>>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
>>>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
>>>
>>> The ARM ARM description of CWG states:
>>>
>>>   Log 2 of the number of words of the maximum size of memory that can be
>>>   overwritten as a result of the eviction of a cache entry that has had
>>>   a memory location in it modified.
>>>
>>> If you have DMA-capable devices on your platform that do not snoop the
>>> L3C, I would say CWG should report 128-byte. Otherwise it probably
>>> doesn't matter from a correctness perspective.
>>>
>>
>> Thanks for your more detailed explanation.
>>
>> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall
>> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I
>> miss something?
> 
> Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't
> you run into coherency problems? For example, if the device is trying to
> read a line that is dirty in L3.
> 

Apologies for not explaining it clearly.

Firstly, the L3$ is not transparent to IO devices on this platform: IO
cache-able operations must look up the L3$, whose cache line is 128-byte,
so we want to update cache_line_size to 128-byte for the IO device drivers.

What I said above was an attempt to understand Catalin's meaning, because I
am a little confused about how to determine the CWG size when the cache line
sizes of L1/L2$ and L3$ differ.

Thanks,
Shaokun

> Will
> 
> .
>
Catalin Marinas April 16, 2019, 2:59 p.m. UTC | #14
Hi Shaokun,

(catching up with emails after holiday)

On Tue, Apr 16, 2019 at 10:23:09PM +0800, Zhangshaokun wrote:
> On 2019/4/16 21:51, Will Deacon wrote:
> > On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote:
> >> On 2019/4/3 20:57, Catalin Marinas wrote:
> >>> The ARM ARM description of CWG states:
> >>>
> >>>   Log 2 of the number of words of the maximum size of memory that can be
> >>>   overwritten as a result of the eviction of a cache entry that has had
> >>>   a memory location in it modified.
> >>>
> >>> If you have DMA-capable devices on your platform that do not snoop the
> >>> L3C, I would say CWG should report 128-byte. Otherwise it probably
> >>> doesn't matter from a correctness perspective.
> >>
> >> Thanks for your more detailed explanation.
> >>
> >> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall
> >> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I
> >> miss something?

CWG is used by software and my comment was whether the L3C is
transparent to _software_ rather than to the I/O device. This means that
the OS doesn't need to be aware of the L3C presence for correct
functionality (e.g. no cache maintenance required).

If the I/O device doesn't snoop the L3C (a non-coherent device) and the
kernel needs to perform cache maintenance, then CWG needs to report the
maximum granule among all cache levels. In practice, we hardcoded
ARCH_DMA_MINALIGN in the kernel to 128-byte.

> > Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't
> > you run into coherency problems? For example, if the device is trying to
> > read a line that is dirty in L3.
> 
> Apologies that I don't explain it clearly,
> 
> Firstly, L3$ is not transparent to IO device on our this platform and IO cache-able
> operations must look up L3$ while its cache line is 128-byte, so we want to update
> cache_line_size with 128-byte for IO device driver.
> 
> What I said above was that I tried to understand Catalin's meaning, because I
> am a little confused how to determine the CWG size if cache line size are
> different between L1/L2$ and L3$.

In Linux cache_line_size() currently reports CWG and it is used for
performance rather than correctness. I think for your platform reporting
64-byte in CWG is still correct as this value doesn't affect the I/O
coherency. However, since it affects performance, we may want to change
what cache_line_size() reports.

My proposal is to use the maximum cacheline size as detected from the
PPTT information with a fallback to CWG if such information is absent.
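
(A sketch of that shape, assuming cacheinfo exports the largest line size
seen while parsing PPTT/DT -- the coherency_max_size variable here is an
assumption, not something mainline provides at this point:)

  /* drivers/base/cacheinfo.c (assumed): largest coherency_line_size found
   * in the ACPI PPTT or DT cache description, 0 if firmware says nothing. */
  extern unsigned int coherency_max_size;

  int cache_line_size(void)
  {
  	u32 cwg = cache_type_cwg();
  	int cwg_line_size = cwg ? 4 << cwg : ARCH_DMA_MINALIGN;

  	/* Prefer the topology-described maximum, fall back to CWG. */
  	return max_t(int, coherency_max_size, cwg_line_size);
  }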
Shaokun Zhang April 17, 2019, 3:41 a.m. UTC | #15
Hi Catalin,

On 2019/4/16 22:59, Catalin Marinas wrote:
> Hi Shaokun,
> 
> (catching up with emails after holiday)
> 
> On Tue, Apr 16, 2019 at 10:23:09PM +0800, Zhangshaokun wrote:
>> On 2019/4/16 21:51, Will Deacon wrote:
>>> On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote:
>>>> On 2019/4/3 20:57, Catalin Marinas wrote:
>>>>> The ARM ARM description of CWG states:
>>>>>
>>>>>   Log 2 of the number of words of the maximum size of memory that can be
>>>>>   overwritten as a result of the eviction of a cache entry that has had
>>>>>   a memory location in it modified.
>>>>>
>>>>> If you have DMA-capable devices on your platform that do not snoop the
>>>>> L3C, I would say CWG should report 128-byte. Otherwise it probably
>>>>> doesn't matter from a correctness perspective.
>>>>
>>>> Thanks for your more detailed explanation.
>>>>
>>>> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall
>>>> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I
>>>> miss something?
> 
> CWG is used by software and my comment was whether the L3C is
> transparent to _software_ rather than to the I/O device. This means that

Oh, sorry, my misunderstanding.

> the OS doesn't need to be aware of the L3C presence for correct
> functionality (e.g. no cache maintenance required).
> 
> If the I/O device doesn't snoop the L3C (a non-coherent device) and the
> kernel needs to perform cache maintenance, then CWG needs to report the
> maximum granule among all cache levels. In practice, we hardcoded
> ARCH_DMA_MINALIGN in the kernel to 128-byte.
> 

Ok, got it.

>>> Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't
>>> you run into coherency problems? For example, if the device is trying to
>>> read a line that is dirty in L3.
>>
>> Apologies that I don't explain it clearly,
>>
>> Firstly, L3$ is not transparent to IO device on our this platform and IO cache-able
>> operations must look up L3$ while its cache line is 128-byte, so we want to update
>> cache_line_size with 128-byte for IO device driver.
>>
>> What I said above was that I tried to understand Catalin's meaning, because I
>> am a little confused how to determine the CWG size if cache line size are
>> different between L1/L2$ and L3$.
> 
> In Linux cache_line_size() currently reports CWG and it is used for
> performance rather than correctness. I think for your platform reporting
> 64-byte in CWG is still correct as this value doesn't affect the I/O
> coherency. However, since it affects performance, we may want to change
> what cache_line_size() reports.

Right, there is no correctness issue; we want to fix the cache_line_size
misreported to the IO driver, mainly for performance.

> 
> My proposal is to use the maximum cacheline size as detected from the
> PPTT information with a fallback to CWG if such information is absent.
> 

Ok, got it; I will try to do it soon.

Thanks,
Shaokun

Patch

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 926434f413fa..0f7d9581e0b9 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -93,7 +93,12 @@  static inline u32 cache_type_cwg(void)
 
 static inline int cache_line_size(void)
 {
-	u32 cwg = cache_type_cwg();
+	u32 cwg;
+
+	if (read_cpuid_part_number() == HISI_CPU_PART_TSV110)
+		return ARCH_DMA_MINALIGN;
+
+	cwg = cache_type_cwg();
 	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
 }