| Message ID | 1553581690-24753-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | arm64: cache: Update cache_line_size for HiSilicon certain platform |
On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote:
> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the
> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size
> of L3C is 128-Byte.
> cache_line_size is used mostly for IO device drivers, so we shall correct
> the right value and the device drivers can match it accurately to get good
> performance.

The CWG is supposed to report the write-back granule for your system, which usually is the largest cache line size visible to software. If this is indeed 128-byte for your system, CTR_EL0.CWG should say so. If the L3C is completely transparent to software (not affected by the D-cache operations), then it may not influence the CWG value.

What's the CTR_EL0.CWG value on your SoC?

> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition
> that the length of the packet is 4-Byte and only one queue and cpu core:
> Without this patch: 1.67 Mpps
> with this patch : 2.40 Mpps

This needs a better explanation. How does cache_line_size() affect the 4-byte packet? Does it send more packets at once?

I've seen in the mlx5 code assumptions about cache_line_size() being 128. It looks to me more like some driver hand-tuning for a specific system configuration. Can the driver be changed to be more generic instead?

In conclusion, I don't think this patch is the right solution.
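For context on the question just asked: CTR_EL0.CWG is bits [27:24] of the cache type register and encodes the write-back granule in 4-byte words. A minimal userspace sketch (not part of the patch under discussion) to read it on arm64:

```c
/*
 * Print CTR_EL0 and its Cache Writeback Granule (CWG) field.
 * CWG is bits [27:24] and encodes the granule in 4-byte words, so the
 * granule is (4 << CWG) bytes: 0b0100 -> 64 bytes, 0b0101 -> 128 bytes.
 */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	uint64_t ctr;
	unsigned int cwg;

	/* arm64 Linux either permits or traps-and-emulates this EL0 read. */
	__asm__ volatile("mrs %0, ctr_el0" : "=r" (ctr));

	cwg = (ctr >> 24) & 0xf;
	printf("CTR_EL0 = 0x%016" PRIx64 ", CWG = %u (%u bytes)\n",
	       ctr, cwg, cwg ? 4u << cwg : 0u);
	return 0;
}
```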
Hi Catalin, On 2019/3/26 22:55, Catalin Marinas wrote: > On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the >> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size >> of L3C is 128-Byte. >> cache_line_size is used mostly for IO device drivers, so we shall correct >> the right value and the device drivers can match it accurately to get good >> performance. > > The CWG is supposed to report the write-back granule for your system, > which usually is the largest cache line size visible to software. If > this is indeed 128-byte for your system, CTR_EL0.CWG should say so. If Agree. For some reasons, there are two kinds of cache line types on our SoC. > the L3C is completely transparent to software (not affected by the > D-cache operations), than it may not influence the CWG value. > > What's the CTR_EL0.CWG value on your SoC? > It's 4'b0100 and cache line size is 64-byte. >> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >> that the length of the packet is 4-Byte and only one queue and cpu core: >> Without this patch: 1.67 Mpps >> with this patch : 2.40 Mpps > > This needs a better explanation. How does cache_line_size() affect the > 4-byte packet? Does it send more packets at once? > > I've seen in the mlx5 code assumptions about cache_line_size() being > 128. It looks to me more like some driver hand-tuning for specific > system configuration. Can the driver be changed to be more generic I'm not sure that mlx5 may implement some actions for different cache line size from different arch or platforms, so the driver needs to read the right cache_line_size. Originally, I thought this interface was used mainly for IO drivers and no harm to any other places. Thanks, Shaokun > instead? > > In conclusion, I don't think this patch is the right solution. >
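To make the mlx5 point concrete, the kind of hand-tuning being discussed looks roughly like the following. This is a hypothetical sketch: pick_cqe_size() and its policy are invented for illustration and are not the actual mlx5 code.

```c
#include <linux/cache.h>

/*
 * Hypothetical illustration of driver tuning keyed off cache_line_size().
 * Drivers such as mlx5 select completion-queue-entry sizes and strides
 * along these lines, which is why reporting 64 instead of 128 bytes can
 * cost throughput even though correctness is unaffected.
 */
static unsigned int pick_cqe_size(void)
{
	/* Use the larger CQE format only when a full 128-byte line is written back. */
	return cache_line_size() >= 128 ? 128 : 64;
}
```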
On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: > On 2019/3/26 22:55, Catalin Marinas wrote: > > On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: > >> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the > >> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size > >> of L3C is 128-Byte. > >> cache_line_size is used mostly for IO device drivers, so we shall correct > >> the right value and the device drivers can match it accurately to get good > >> performance. [...] > > What's the CTR_EL0.CWG value on your SoC? > > It's 4'b0100 and cache line size is 64-byte. > > >> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition > >> that the length of the packet is 4-Byte and only one queue and cpu core: > >> Without this patch: 1.67 Mpps > >> with this patch : 2.40 Mpps > > > > This needs a better explanation. How does cache_line_size() affect the > > 4-byte packet? Does it send more packets at once? > > > > I've seen in the mlx5 code assumptions about cache_line_size() being > > 128. It looks to me more like some driver hand-tuning for specific > > system configuration. Can the driver be changed to be more generic > > I'm not sure that mlx5 may implement some actions for different cache line > size from different arch or platforms, so the driver needs to read the > right cache_line_size. We need to better understand why the performance hit but at a quick grep for "128" in the mlx5 code, I can see different code paths executed when cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you can somehow disable the L3C, do you still see the same performance difference? > Originally, I thought this interface was used mainly for IO drivers and no > harm to any other places. Looking through the slab code, cache_line_size() is used when SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which is 128 on arm64). Anyway, if the performance drop is indeed caused by more L3C cacheline bouncing, we can look at fixing cache_line_size() for your CPU to return 128 but I'd like it done using the cpufeature/errata framework. We have an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128 byte cacheline and read this in cache_line_size() but I think this may cause user accesses to CTR_EL0 to trap into the kernel to be emulated. This may cause more performance issues for the user than just misreporting the CWG. Alternatively, we store the cache_line_size in a variable and return it. Cc'ing Suzuki for comments on cpu errata approach.
Hi Catalin, On 2019/3/30 2:52, Catalin Marinas wrote: > On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: >> On 2019/3/26 22:55, Catalin Marinas wrote: >>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the >>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size >>>> of L3C is 128-Byte. >>>> cache_line_size is used mostly for IO device drivers, so we shall correct >>>> the right value and the device drivers can match it accurately to get good >>>> performance. > [...] >>> What's the CTR_EL0.CWG value on your SoC? >> >> It's 4'b0100 and cache line size is 64-byte. >> >>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >>>> that the length of the packet is 4-Byte and only one queue and cpu core: >>>> Without this patch: 1.67 Mpps >>>> with this patch : 2.40 Mpps >>> >>> This needs a better explanation. How does cache_line_size() affect the >>> 4-byte packet? Does it send more packets at once? >>> >>> I've seen in the mlx5 code assumptions about cache_line_size() being >>> 128. It looks to me more like some driver hand-tuning for specific >>> system configuration. Can the driver be changed to be more generic >> >> I'm not sure that mlx5 may implement some actions for different cache line >> size from different arch or platforms, so the driver needs to read the >> right cache_line_size. > > We need to better understand why the performance hit but at a quick grep > for "128" in the mlx5 code, I can see different code paths executed when > cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you > can somehow disable the L3C, do you still see the same performance > difference? > Unfortunately, we can't disable L3C separately to check the performance. In this platform, if IO operation is cache-able, it can lookup L3C. If we return 128-byte, the performance is better. One more question about CWG, if cache line size of L1$ and L2$ on A76 is 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? >> Originally, I thought this interface was used mainly for IO drivers and no >> harm to any other places. > > Looking through the slab code, cache_line_size() is used when > SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons > rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which > is 128 on arm64). > Yeah, I miss it. ARCH_DMA_MINALIGN is only used for non DMA-coherent, right? It's only defined and not used in the drivers. > Anyway, if the performance drop is indeed caused by more L3C cacheline > bouncing, we can look at fixing cache_line_size() for your CPU to return > 128 but I'd like it done using the cpufeature/errata framework. We have > an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128 > byte cacheline and read this in cache_line_size() but I think this may > cause user accesses to CTR_EL0 to trap into the kernel to be emulated. > This may cause more performance issues for the user than just > misreporting the CWG. > It is nice that if there is other good approach to do it, I am glad to try it. Your are pretty right and our guys don't want to get more performance issues because of fixing this. > Alternatively, we store the cache_line_size in a variable and return it. > Cc'ing Suzuki for comments on cpu errata approach. > Suzuki, any comments are very appreciated. Thanks for your guidance, Shaokun
On 29/03/2019 18:52, Catalin Marinas wrote: > On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: >> On 2019/3/26 22:55, Catalin Marinas wrote: >>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the >>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size >>>> of L3C is 128-Byte. >>>> cache_line_size is used mostly for IO device drivers, so we shall correct >>>> the right value and the device drivers can match it accurately to get good >>>> performance. > [...] >>> What's the CTR_EL0.CWG value on your SoC? >> >> It's 4'b0100 and cache line size is 64-byte. >> >>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >>>> that the length of the packet is 4-Byte and only one queue and cpu core: >>>> Without this patch: 1.67 Mpps >>>> with this patch : 2.40 Mpps >>> >>> This needs a better explanation. How does cache_line_size() affect the >>> 4-byte packet? Does it send more packets at once? >>> >>> I've seen in the mlx5 code assumptions about cache_line_size() being >>> 128. It looks to me more like some driver hand-tuning for specific >>> system configuration. Can the driver be changed to be more generic >> >> I'm not sure that mlx5 may implement some actions for different cache line >> size from different arch or platforms, so the driver needs to read the >> right cache_line_size. > > We need to better understand why the performance hit but at a quick grep > for "128" in the mlx5 code, I can see different code paths executed when > cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you > can somehow disable the L3C, do you still see the same performance > difference? > >> Originally, I thought this interface was used mainly for IO drivers and no >> harm to any other places. > > Looking through the slab code, cache_line_size() is used when > SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons > rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which > is 128 on arm64). > > Anyway, if the performance drop is indeed caused by more L3C cacheline > bouncing, we can look at fixing cache_line_size() for your CPU to return > 128 but I'd like it done using the cpufeature/errata framework. We have > an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128 > byte cacheline and read this in cache_line_size() but I think this may > cause user accesses to CTR_EL0 to trap into the kernel to be emulated. > This may cause more performance issues for the user than just > misreporting the CWG. As Catalin rightly pointed out there are two separate issues here : 1) cache_line_size() should always provide the system wide safe value. Which we don't do at the moment for the "kernel", as we read the raw value of CTR_EL0. 2) Handling the specific case for tsv110 CPU. We have two options. a) We could have a separate erratum entry to fixup the system wide CWG to 128, and thus affecting only the kernel (without any user traps). OR b) Reuse the existing mismatched cache type erratum to trigger for tsv110. This would enable the trapping of CTR_EL0 access from user. Cheers Suzuki
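A rough sketch of what option (a) might look like, for discussion only. The ARM64_WORKAROUND_HISI_CWG capability, the override variable and the enable hook are all hypothetical names, not an existing or tested implementation:

```c
#include <linux/cache.h>
#include <asm/cpufeature.h>
#include <asm/cputype.h>

/*
 * Hypothetical erratum entry (option a): on affected CPUs, raise the
 * kernel's idea of the cache line size to 128 bytes without trapping
 * userspace CTR_EL0 accesses.  ARM64_WORKAROUND_HISI_CWG would need a
 * new slot in cpucaps.h; cache_line_size() would return
 * arm64_cache_line_override when it is non-zero, and fall back to
 * CTR_EL0.CWG otherwise.
 */
unsigned int __read_mostly arm64_cache_line_override;	/* 0 = use CTR_EL0.CWG */

static void cpu_enable_large_cache_line(const struct arm64_cpu_capabilities *cap)
{
	arm64_cache_line_override = 128;	/* L3C write-back granule on Kunpeng920 */
}

static const struct arm64_cpu_capabilities tsv110_cwg_erratum = {
	.desc = "HiSilicon TSV110: L3C line size larger than CTR_EL0.CWG",
	.capability = ARM64_WORKAROUND_HISI_CWG,	/* hypothetical */
	ERRATA_MIDR_ALL_VERSIONS(MIDR_HISI_TSV110),
	.cpu_enable = cpu_enable_large_cache_line,
};
```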
On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote: > On 2019/3/30 2:52, Catalin Marinas wrote: > > On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: > >> On 2019/3/26 22:55, Catalin Marinas wrote: > >>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: > >>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition > >>>> that the length of the packet is 4-Byte and only one queue and cpu core: > >>>> Without this patch: 1.67 Mpps > >>>> with this patch : 2.40 Mpps > >>> > >>> This needs a better explanation. How does cache_line_size() affect the > >>> 4-byte packet? Does it send more packets at once? > >>> > >>> I've seen in the mlx5 code assumptions about cache_line_size() being > >>> 128. It looks to me more like some driver hand-tuning for specific > >>> system configuration. Can the driver be changed to be more generic > >> > >> I'm not sure that mlx5 may implement some actions for different cache line > >> size from different arch or platforms, so the driver needs to read the > >> right cache_line_size. > > > > We need to better understand why the performance hit but at a quick grep > > for "128" in the mlx5 code, I can see different code paths executed when > > cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you > > can somehow disable the L3C, do you still see the same performance > > difference? > > Unfortunately, we can't disable L3C separately to check the performance. > In this platform, if IO operation is cache-able, it can lookup L3C. If we > return 128-byte, the performance is better. I presume the device affected (mlx5) is cache coherent (can look up the L3C). > One more question about CWG, if cache line size of L1$ and L2$ on A76 is > 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall > CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? The ARM ARM description of CWG states: Log 2 of the number of words of the maximum size of memory that can be overwritten as a result of the eviction of a cache entry that has had a memory location in it modified. If you have DMA-capable devices on your platform that do not snoop the L3C, I would say CWG should report 128-byte. Otherwise it probably doesn't matter from a correctness perspective. > >> Originally, I thought this interface was used mainly for IO drivers and no > >> harm to any other places. > > > > Looking through the slab code, cache_line_size() is used when > > SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons > > rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which > > is 128 on arm64). > > Yeah, I miss it. > ARCH_DMA_MINALIGN is only used for non DMA-coherent, right? It's only defined > and not used in the drivers. It's not used by drivers directly but this sets the minimum slab allocation so that a device can do non-coherent DMA to a kmalloc'ed buffer for example without the risk of corrupting adjacent objects.
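To illustrate the distinction just drawn (performance alignment vs. DMA-safety alignment), a hedged sketch; the struct and cache name are invented for the example:

```c
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/cache.h>
#include <linux/types.h>

struct example_obj {		/* invented for illustration */
	u32 flags;
	u8 payload[60];
};

static struct kmem_cache *example_cache;

static int __init example_init(void)
{
	/*
	 * SLAB_HWCACHE_ALIGN aligns objects to cache_line_size(): a
	 * performance measure (avoid false sharing), not a DMA guarantee.
	 */
	example_cache = kmem_cache_create("example_obj",
					  sizeof(struct example_obj), 0,
					  SLAB_HWCACHE_ALIGN, NULL);
	if (!example_cache)
		return -ENOMEM;

	/*
	 * kmalloc() buffers, by contrast, are aligned to at least
	 * ARCH_DMA_MINALIGN (128 on arm64), so a non-coherent device can
	 * DMA into them without corrupting adjacent allocations.
	 */
	return 0;
}
```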
On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?

What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size say on your CPU? This would be information provided by firmware.

Since cache_line_size() in Linux seems to be used for performance rather than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there may be a use-case for setting it based on the cache topology description in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep since he contributed the cacheinfo code).

But before we change anything in the arch code regarding cache_line_size(), we need a better understanding of why 128-byte makes a significant difference (does 256 make it even better?).
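The firmware-provided value can simply be catted from that path; for completeness, a trivial C equivalent (index3 assumed to be the L3, as in the question above):

```c
/* Read the firmware-described L3 line size exported via sysfs. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size";
	unsigned int line_size;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%u", &line_size) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("L3 coherency_line_size: %u bytes\n", line_size);
	return 0;
}
```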
On 04/04/2019 11:27, Catalin Marinas wrote: > On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote: >> One more question about CWG, if cache line size of L1$ and L2$ on A76 is >> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall >> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? > > What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size > say on your CPU? > This would be information provided by firmware. Hi Catalin, Our HQ is on vacation now, but this may answer your question: https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1620/Pptt/Pptt.c#L129 i.e. 128 John > > Since cache_line_size() in Linux seems to be used for performance rather > than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there > may be a use-case for setting it based on the cache topology description > in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep > since he contributed the cacheinfo code). > > But before we change anything in the arch code regarding > cache_line_size(), we need a better understanding of why 128-byte makes > a significant difference (does 256 make it even better?). >
Hi Catalin, Sorry to reply later because of a short vacation. On 2019/4/3 20:57, Catalin Marinas wrote: > On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote: >> On 2019/3/30 2:52, Catalin Marinas wrote: >>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: >>>> On 2019/3/26 22:55, Catalin Marinas wrote: >>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >>>>>> that the length of the packet is 4-Byte and only one queue and cpu core: >>>>>> Without this patch: 1.67 Mpps >>>>>> with this patch : 2.40 Mpps >>>>> >>>>> This needs a better explanation. How does cache_line_size() affect the >>>>> 4-byte packet? Does it send more packets at once? >>>>> >>>>> I've seen in the mlx5 code assumptions about cache_line_size() being >>>>> 128. It looks to me more like some driver hand-tuning for specific >>>>> system configuration. Can the driver be changed to be more generic >>>> >>>> I'm not sure that mlx5 may implement some actions for different cache line >>>> size from different arch or platforms, so the driver needs to read the >>>> right cache_line_size. >>> >>> We need to better understand why the performance hit but at a quick grep >>> for "128" in the mlx5 code, I can see different code paths executed when >>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you >>> can somehow disable the L3C, do you still see the same performance >>> difference? >> >> Unfortunately, we can't disable L3C separately to check the performance. >> In this platform, if IO operation is cache-able, it can lookup L3C. If we >> return 128-byte, the performance is better. > > I presume the device affected (mlx5) is cache coherent (can look up the > L3C). > >> One more question about CWG, if cache line size of L1$ and L2$ on A76 is >> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall >> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? > > The ARM ARM description of CWG states: > > Log 2 of the number of words of the maximum size of memory that can be > overwritten as a result of the eviction of a cache entry that has had > a memory location in it modified. > > If you have DMA-capable devices on your platform that do not snoop the > L3C, I would say CWG should report 128-byte. Otherwise it probably > doesn't matter from a correctness perspective. > Thanks for your more detailed explanation. If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I miss something? Here we want to pay more attention on performance under the condition that cache coherence is OK. Thanks, Shaokun >>>> Originally, I thought this interface was used mainly for IO drivers and no >>>> harm to any other places. >>> >>> Looking through the slab code, cache_line_size() is used when >>> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons >>> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which >>> is 128 on arm64). >> >> Yeah, I miss it. >> ARCH_DMA_MINALIGN is only used for non DMA-coherent, right? It's only defined >> and not used in the drivers. > > It's not used by drivers directly but this sets the minimum slab > allocation so that a device can do non-coherent DMA to a kmalloc'ed > buffer for example without the risk of corrupting adjacent objects. >
Hi Catalin,

On 2019/4/4 18:27, Catalin Marinas wrote:
> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote:
>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is
>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall
>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)?
>
> What does /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
> say on your CPU? This would be information provided by firmware.

As John has replied, it is 128-byte; today I double-checked it:

linux-ucdrqS:~ # cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
128
linux-ucdrqS:~ #

> Since cache_line_size() in Linux seems to be used for performance rather
> than DMA coherency (ARCH_DMA_MINALIGN is the one ensuring this), there
> may be a use-case for setting it based on the cache topology description
> in ACPI (PPTT) or DT (see drivers/base/cacheinfo.c; also cc'ing Sudeep
> since he contributed the cacheinfo code).

I have checked drivers/acpi/pptt.c and @coherency_line_size is assigned but not used. Sudeep, shall it be updated with cache_line_size()?

> But before we change anything in the arch code regarding
> cache_line_size(), we need a better understanding of why 128-byte makes
> a significant difference (does 256 make it even better?).

I appreciate this question. As I understand it, the L3 hit rate may be better with 128-byte than with 64-byte, and I will check it. About 256-byte, I'm not sure and will try it as well.

Thanks,
Shaokun
Hi Suzukiļ¼ On 2019/4/2 21:02, Suzuki K Poulose wrote: > > > On 29/03/2019 18:52, Catalin Marinas wrote: >> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: >>> On 2019/3/26 22:55, Catalin Marinas wrote: >>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >>>>> For HiSilicon's certain platform, like Kunpeng920 server SoC, it uses the >>>>> tsv110 CPUs whose L1 cache line size is 64-Byte, while the cache line size >>>>> of L3C is 128-Byte. >>>>> cache_line_size is used mostly for IO device drivers, so we shall correct >>>>> the right value and the device drivers can match it accurately to get good >>>>> performance. >> [...] >>>> What's the CTR_EL0.CWG value on your SoC? >>> >>> It's 4'b0100 and cache line size is 64-byte. >>> >>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >>>>> that the length of the packet is 4-Byte and only one queue and cpu core: >>>>> Without this patch: 1.67 Mpps >>>>> with this patch : 2.40 Mpps >>>> >>>> This needs a better explanation. How does cache_line_size() affect the >>>> 4-byte packet? Does it send more packets at once? >>>> >>>> I've seen in the mlx5 code assumptions about cache_line_size() being >>>> 128. It looks to me more like some driver hand-tuning for specific >>>> system configuration. Can the driver be changed to be more generic >>> >>> I'm not sure that mlx5 may implement some actions for different cache line >>> size from different arch or platforms, so the driver needs to read the >>> right cache_line_size. >> >> We need to better understand why the performance hit but at a quick grep >> for "128" in the mlx5 code, I can see different code paths executed when >> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you >> can somehow disable the L3C, do you still see the same performance >> difference? >> >>> Originally, I thought this interface was used mainly for IO drivers and no >>> harm to any other places. >> >> Looking through the slab code, cache_line_size() is used when >> SLAB_HWCACHE_ALIGN is passed and IIUC this is for performance reasons >> rather than I/O (the DMA alignment is given by ARCH_DMA_MINALIGN which >> is 128 on arm64). >> >> Anyway, if the performance drop is indeed caused by more L3C cacheline >> bouncing, we can look at fixing cache_line_size() for your CPU to return >> 128 but I'd like it done using the cpufeature/errata framework. We have >> an arm64_ftr_reg_ctrel0.sys_val that we could update to report a 128 >> byte cacheline and read this in cache_line_size() but I think this may >> cause user accesses to CTR_EL0 to trap into the kernel to be emulated. >> This may cause more performance issues for the user than just >> misreporting the CWG. > > As Catalin rightly pointed out there are two separate issues here : > > 1) cache_line_size() should always provide the system wide safe value. > Which we don't do at the moment for the "kernel", as we read the raw > value of CTR_EL0. > > 2) Handling the specific case for tsv110 CPU. We have two options. > > a) We could have a separate erratum entry to fixup the system wide > CWG to 128, and thus affecting only the kernel (without any user > traps). > Compared with the two choices, a) is a better option which won't cause unnecessary trapping, but it is a separate entry after all, Let's do what Catalin said about the difference between the two cache line types firstly. Thanks, Shaokun > OR > > b) Reuse the existing mismatched cache type erratum to trigger for > tsv110. 
> This would enable the trapping of CTR_EL0 access from user.
>
> Cheers
> Suzuki
On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote: > On 2019/4/3 20:57, Catalin Marinas wrote: > > On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote: > >> On 2019/3/30 2:52, Catalin Marinas wrote: > >>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: > >>>> On 2019/3/26 22:55, Catalin Marinas wrote: > >>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: > >>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition > >>>>>> that the length of the packet is 4-Byte and only one queue and cpu core: > >>>>>> Without this patch: 1.67 Mpps > >>>>>> with this patch : 2.40 Mpps > >>>>> > >>>>> This needs a better explanation. How does cache_line_size() affect the > >>>>> 4-byte packet? Does it send more packets at once? > >>>>> > >>>>> I've seen in the mlx5 code assumptions about cache_line_size() being > >>>>> 128. It looks to me more like some driver hand-tuning for specific > >>>>> system configuration. Can the driver be changed to be more generic > >>>> > >>>> I'm not sure that mlx5 may implement some actions for different cache line > >>>> size from different arch or platforms, so the driver needs to read the > >>>> right cache_line_size. > >>> > >>> We need to better understand why the performance hit but at a quick grep > >>> for "128" in the mlx5 code, I can see different code paths executed when > >>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you > >>> can somehow disable the L3C, do you still see the same performance > >>> difference? > >> > >> Unfortunately, we can't disable L3C separately to check the performance. > >> In this platform, if IO operation is cache-able, it can lookup L3C. If we > >> return 128-byte, the performance is better. > > > > I presume the device affected (mlx5) is cache coherent (can look up the > > L3C). > > > >> One more question about CWG, if cache line size of L1$ and L2$ on A76 is > >> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall > >> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? > > > > The ARM ARM description of CWG states: > > > > Log 2 of the number of words of the maximum size of memory that can be > > overwritten as a result of the eviction of a cache entry that has had > > a memory location in it modified. > > > > If you have DMA-capable devices on your platform that do not snoop the > > L3C, I would say CWG should report 128-byte. Otherwise it probably > > doesn't matter from a correctness perspective. > > > > Thanks for your more detailed explanation. > > If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall > report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I > miss something? Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't you run into coherency problems? For example, if the device is trying to read a line that is dirty in L3. Will
Hi Will, On 2019/4/16 21:51, Will Deacon wrote: > On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote: >> On 2019/4/3 20:57, Catalin Marinas wrote: >>> On Tue, Apr 02, 2019 at 03:51:33PM +0800, Zhangshaokun wrote: >>>> On 2019/3/30 2:52, Catalin Marinas wrote: >>>>> On Wed, Mar 27, 2019 at 03:16:34PM +0800, Zhangshaokun wrote: >>>>>> On 2019/3/26 22:55, Catalin Marinas wrote: >>>>>>> On Tue, Mar 26, 2019 at 02:28:10PM +0800, Shaokun Zhang wrote: >>>>>>>> When test mlx5 with Kunpeng920 SoC, ib_send_bw is run under the condition >>>>>>>> that the length of the packet is 4-Byte and only one queue and cpu core: >>>>>>>> Without this patch: 1.67 Mpps >>>>>>>> with this patch : 2.40 Mpps >>>>>>> >>>>>>> This needs a better explanation. How does cache_line_size() affect the >>>>>>> 4-byte packet? Does it send more packets at once? >>>>>>> >>>>>>> I've seen in the mlx5 code assumptions about cache_line_size() being >>>>>>> 128. It looks to me more like some driver hand-tuning for specific >>>>>>> system configuration. Can the driver be changed to be more generic >>>>>> >>>>>> I'm not sure that mlx5 may implement some actions for different cache line >>>>>> size from different arch or platforms, so the driver needs to read the >>>>>> right cache_line_size. >>>>> >>>>> We need to better understand why the performance hit but at a quick grep >>>>> for "128" in the mlx5 code, I can see different code paths executed when >>>>> cache_line_size() returned 128 (saved in cqe_size). IOW, presuming you >>>>> can somehow disable the L3C, do you still see the same performance >>>>> difference? >>>> >>>> Unfortunately, we can't disable L3C separately to check the performance. >>>> In this platform, if IO operation is cache-able, it can lookup L3C. If we >>>> return 128-byte, the performance is better. >>> >>> I presume the device affected (mlx5) is cache coherent (can look up the >>> L3C). >>> >>>> One more question about CWG, if cache line size of L1$ and L2$ on A76 is >>>> 64-byte and it is used on our platform (L3C cache line is 128-byte), Shall >>>> CWG is 4b'0100(64-byte) or 4'b0101(128-byte)? >>> >>> The ARM ARM description of CWG states: >>> >>> Log 2 of the number of words of the maximum size of memory that can be >>> overwritten as a result of the eviction of a cache entry that has had >>> a memory location in it modified. >>> >>> If you have DMA-capable devices on your platform that do not snoop the >>> L3C, I would say CWG should report 128-byte. Otherwise it probably >>> doesn't matter from a correctness perspective. >>> >> >> Thanks for your more detailed explanation. >> >> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall >> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I >> miss something? > > Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't > you run into coherency problems? For example, if the device is trying to > read a line that is dirty in L3. > Apologies that I don't explain it clearly, Firstly, L3$ is not transparent to IO device on our this platform and IO cache-able operations must look up L3$ while its cache line is 128-byte, so we want to update cache_line_size with 128-byte for IO device driver. What I said above was that I tried to understand Catalin's meaning, because I am a little confused how to determine the CWG size if cache line size are different between L1/L2$ and L3$. Thanks, Shaokun > Will > > . >
Hi Shaokun, (catching up with emails after holiday) On Tue, Apr 16, 2019 at 10:23:09PM +0800, Zhangshaokun wrote: > On 2019/4/16 21:51, Will Deacon wrote: > > On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote: > >> On 2019/4/3 20:57, Catalin Marinas wrote: > >>> The ARM ARM description of CWG states: > >>> > >>> Log 2 of the number of words of the maximum size of memory that can be > >>> overwritten as a result of the eviction of a cache entry that has had > >>> a memory location in it modified. > >>> > >>> If you have DMA-capable devices on your platform that do not snoop the > >>> L3C, I would say CWG should report 128-byte. Otherwise it probably > >>> doesn't matter from a correctness perspective. > >> > >> Thanks for your more detailed explanation. > >> > >> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall > >> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I > >> miss something? CWG is used by software and my comment was whether the L3C is transparent to _software_ rather than to the I/O device. This means that the OS doesn't need to be aware of the L3C presence for correct functionality (e.g. no cache maintenance required). If the I/O device doesn't snoop the L3C (a non-coherent device) and the kernel needs to perform cache maintenance, then CWG needs to report the maximum granule among all cache levels. In practice, we hardcoded ARCH_DMA_MINALIGN in the kernel to 128-byte. > > Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't > > you run into coherency problems? For example, if the device is trying to > > read a line that is dirty in L3. > > Apologies that I don't explain it clearly, > > Firstly, L3$ is not transparent to IO device on our this platform and IO cache-able > operations must look up L3$ while its cache line is 128-byte, so we want to update > cache_line_size with 128-byte for IO device driver. > > What I said above was that I tried to understand Catalin's meaning, because I > am a little confused how to determine the CWG size if cache line size are > different between L1/L2$ and L3$. In Linux cache_line_size() currently reports CWG and it is used for performance rather than correctness. I think for your platform reporting 64-byte in CWG is still correct as this value doesn't affect the I/O coherency. However, since it affects performance, we may want to change what cache_line_size() reports. My proposal is to use the maximum cacheline size as detected from the PPTT information with a fallback to CWG if such information is absent.
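A minimal sketch of that proposal, assuming cacheinfo records the largest firmware-described line size in a variable (called coherency_max_size here — an assumed name, nothing agreed in this thread), with the existing CWG calculation kept as the fallback:

```c
/*
 * Sketch only, not an agreed implementation.  coherency_max_size is
 * assumed to be populated by drivers/base/cacheinfo.c from ACPI PPTT
 * or DT; when firmware describes no cache topology, fall back to
 * CTR_EL0.CWG as today.
 */
static inline int cache_line_size(void)
{
	u32 cwg = cache_type_cwg();

	if (coherency_max_size)		/* e.g. 128 for the L3C on Kunpeng920 */
		return coherency_max_size;

	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
}
```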
Hi Catalin, On 2019/4/16 22:59, Catalin Marinas wrote: > Hi Shaokun, > > (catching up with emails after holiday) > > On Tue, Apr 16, 2019 at 10:23:09PM +0800, Zhangshaokun wrote: >> On 2019/4/16 21:51, Will Deacon wrote: >>> On Mon, Apr 08, 2019 at 03:51:07PM +0800, Zhangshaokun wrote: >>>> On 2019/4/3 20:57, Catalin Marinas wrote: >>>>> The ARM ARM description of CWG states: >>>>> >>>>> Log 2 of the number of words of the maximum size of memory that can be >>>>> overwritten as a result of the eviction of a cache entry that has had >>>>> a memory location in it modified. >>>>> >>>>> If you have DMA-capable devices on your platform that do not snoop the >>>>> L3C, I would say CWG should report 128-byte. Otherwise it probably >>>>> doesn't matter from a correctness perspective. >>>> >>>> Thanks for your more detailed explanation. >>>> >>>> If devices doesn't snoop L3C, L3C is transparent to IO device, right? We shall >>>> report L1$/L2$ cache line size to CWG field, not L3$ cache line size. May I >>>> miss something? > > CWG is used by software and my comment was whether the L3C is > transparent to _software_ rather than to the I/O device. This means that Oh, sorry, my misunderstanding. > the OS doesn't need to be aware of the L3C presence for correct > functionality (e.g. no cache maintenance required). > > If the I/O device doesn't snoop the L3C (a non-coherent device) and the > kernel needs to perform cache maintenance, then CWG needs to report the > maximum granule among all cache levels. In practice, we hardcoded > ARCH_DMA_MINALIGN in the kernel to 128-byte. > Ok, got it. >>> Sorry if I'm missing something, but if a device doesn't snoop the L3C, don't >>> you run into coherency problems? For example, if the device is trying to >>> read a line that is dirty in L3. >> >> Apologies that I don't explain it clearly, >> >> Firstly, L3$ is not transparent to IO device on our this platform and IO cache-able >> operations must look up L3$ while its cache line is 128-byte, so we want to update >> cache_line_size with 128-byte for IO device driver. >> >> What I said above was that I tried to understand Catalin's meaning, because I >> am a little confused how to determine the CWG size if cache line size are >> different between L1/L2$ and L3$. > > In Linux cache_line_size() currently reports CWG and it is used for > performance rather than correctness. I think for your platform reporting > 64-byte in CWG is still correct as this value doesn't affect the I/O > coherency. However, since it affects performance, we may want to change > what cache_line_size() reports. Right, no correctness issue, we want to correct the misreporting cache_line_size to IO driver mainly for performance. > > My proposal is to use the maximum cacheline size as detected from the > PPTT information with a fallback to CWG if such information is absent. > Ok, got it and I will try to do it recently. Thanks, Shaokun
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 926434f413fa..0f7d9581e0b9 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -93,7 +93,12 @@ static inline u32 cache_type_cwg(void)
 
 static inline int cache_line_size(void)
 {
-	u32 cwg = cache_type_cwg();
+	u32 cwg;
+
+	if (read_cpuid_part_number() == HISI_CPU_PART_TSV110)
+		return ARCH_DMA_MINALIGN;
+
+	cwg = cache_type_cwg();
 	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
 }
 
Certain HiSilicon platforms, such as the Kunpeng920 server SoC, use tsv110 CPUs whose L1 cache line size is 64 bytes, while the cache line size of the L3C is 128 bytes.

cache_line_size() is used mostly by I/O device drivers, so we should report the correct value so that device drivers can match it accurately and get good performance.

When testing mlx5 on the Kunpeng920 SoC, ib_send_bw is run with 4-byte packets and only one queue and CPU core:
Without this patch: 1.67 Mpps
With this patch:    2.40 Mpps

Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Zhenfa Qiu <qiuzhenfa@hisilicon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Reported-by: Zhenfa Qiu <qiuzhenfa@hisilicon.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
 arch/arm64/include/asm/cache.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)