Message ID: 20241219083237.265419-1-zhao1.liu@intel.com (mailing list archive)
Series: i386: Support SMP Cache Topology
On 12/19/24 09:32, Zhao Liu wrote:
> Hi folks,
>
> This is my v6. Since Philippe has already merged the general smp cache
> part, v6 just includes the remaining i386-specific changes to support
> SMP cache topology for the PC machine (currently all patches have got
> Reviewed-by from the previous review).
>
> Compared with v5 [1], there's no change; the series just picks up the
> unmerged patches and rebases on the master branch (based on the commit
> 8032c78e556c "Merge tag 'firmware-20241216-pull-request' of
> https://gitlab.com/kraxel/qemu into staging").
>
> Patch 4 ("i386/cpu: add has_caches flag to check smp_cache"), which
> introduced a has_caches flag, is also wanted on the ARM side.
>
> Though this series now targets i386, to help review, I still include
> the previous introduction of the smp cache topology feature.
>
>
> Background
> ==========
>
> x86 and ARM (and RISC-V) need to allow the user to configure cache
> properties (currently only topology):
> * For x86, the default cache topology model (of the max/host CPU) does
>   not always match the host's real physical cache topology. Performance
>   can increase when the configured virtual topology is closer to the
>   physical topology than a default topology would be.
> * For ARM, QEMU can't get the cache topology information from the CPU
>   registers, so user configuration is necessary. Additionally, the
>   cache information is also needed for MPAM emulation (for TCG) to
>   build the right PPTT. (Originally from Jonathan)
>
>
> About smp-cache
> ===============
>
> The API design has been discussed heavily in [3].
>
> Now, smp-cache is implemented as an array integrated in -machine.
> Though -machine currently can't support JSON format, this is one of
> the future directions.
>
> An example is as follows:
>
> smp_cache=smp-cache.0.cache=l1i,smp-cache.0.topology=core,smp-cache.1.cache=l1d,smp-cache.1.topology=core,smp-cache.2.cache=l2,smp-cache.2.topology=module,smp-cache.3.cache=l3,smp-cache.3.topology=die
>
> "cache" specifies the cache that the properties will be applied to.
> This field is the combination of cache level and cache type. Now it
> supports "l1d" (L1 data cache), "l1i" (L1 instruction cache), "l2"
> (L2 unified cache) and "l3" (L3 unified cache).
>
> The "topology" field accepts CPU topology levels including "thread",
> "core", "module", "cluster", "die", "socket", "book", "drawer" and a
> special value "default".

Looks good; just one thing, does "thread" make sense? I think that it's
almost by definition that threads within a core share all caches, but
maybe I'm missing some hardware configurations.

Paolo

> The "default" is introduced to make it easier for libvirt to set a
> default parameter value without having to care about the specific
> machine (because currently there is no proper way for the machine to
> expose supported topology levels and caches).
>
> If "default" is set, then the cache topology will follow the
> architecture's default cache topology model. If another CPU topology
> level is set, the cache will be shared at the corresponding CPU
> topology level.
>
>
> [1]: Patch v5: https://lore.kernel.org/qemu-devel/20241101083331.340178-1-zhao1.liu@intel.com/
> [2]: ARM smp-cache: https://lore.kernel.org/qemu-devel/20241010111822.345-1-alireza.sanaee@huawei.com/
> [3]: API discussion: https://lore.kernel.org/qemu-devel/8734ndj33j.fsf@pond.sub.org/
>
> Thanks and Best Regards,
> Zhao
> ---
> Alireza Sanaee (1):
>   i386/cpu: add has_caches flag to check smp_cache configuration
>
> Zhao Liu (3):
>   i386/cpu: Support thread and module level cache topology
>   i386/cpu: Update cache topology with machine's configuration
>   i386/pc: Support cache topology in -machine for PC machine
>
>  hw/core/machine-smp.c |  2 ++
>  hw/i386/pc.c          |  4 +++
>  include/hw/boards.h   |  3 ++
>  qemu-options.hx       | 31 +++++++++++++++++-
>  target/i386/cpu.c     | 76 ++++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 111 insertions(+), 5 deletions(-)
> > About smp-cache
> > ===============
> >
> > [...]
> >
> > The "topology" field accepts CPU topology levels including "thread",
> > "core", "module", "cluster", "die", "socket", "book", "drawer" and a
> > special value "default".
>
> Looks good; just one thing, does "thread" make sense? I think that it's
> almost by definition that threads within a core share all caches, but
> maybe I'm missing some hardware configurations.

Hi Paolo, merry Christmas. Yes, AFAIK, there's no hardware that has a
thread-level cache.

The thread case I considered is that it could be used for vCPU
scheduling optimization (although I haven't rigorously tested the
actual impact).

Without CPU affinity, tasks in Linux are generally distributed evenly
across different cores (for example, vCPU0 on core 0, vCPU1 on core 1).
In this case, the thread-level cache settings are closer to the actual
situation, with vCPU0 occupying the L1/L2 of host core 0 and vCPU1
occupying the L1/L2 of host core 1.

     ┌───┐        ┌───┐
     vCPU0        vCPU1
     │   │        │   │
     └───┘        └───┘
 ┌┌───┐┌───┐┐ ┌┌───┐┌───┐┐
 ││T0 ││T1 ││ ││T2 ││T3 ││
 │└───┘└───┘│ │└───┘└───┘│
 └────C0────┘ └────C1────┘

The L2 cache topology affects performance, and the cluster-aware
scheduling feature in the Linux kernel will try to schedule tasks on
the same L2 cache. So, in cases like the figure above, setting the L2
cache to be per thread should, in principle, be better.

Thanks,
Zhao
On Wed, 25 Dec 2024 11:03:42 +0800
Zhao Liu <zhao1.liu@intel.com> wrote:

> > > [...]
> > >
> > > The "topology" field accepts CPU topology levels including
> > > "thread", "core", "module", "cluster", "die", "socket", "book",
> > > "drawer" and a special value "default".
> >
> > Looks good; just one thing, does "thread" make sense? I think that
> > it's almost by definition that threads within a core share all
> > caches, but maybe I'm missing some hardware configurations.
>
> Hi Paolo, merry Christmas. Yes, AFAIK, there's no hardware that has a
> thread-level cache.

Hi Zhao and Paolo,

While the example looks OK to me and makes sense, I would be curious to
know more scenarios where I can legitimately see a benefit there.

I am wrestling with this point on ARM too. If I were to have device
trees describing caches in a way that threads get their own private
caches, then this would not be possible to describe via device tree due
to spec limitations (+CCed Rob), if I understood correctly.

Thanks,
Alireza

> The thread case I considered is that it could be used for vCPU
> scheduling optimization (although I haven't rigorously tested the
> actual impact). [...]
On Thu, Jan 2, 2025 at 8:57 AM Alireza Sanaee <alireza.sanaee@huawei.com> wrote:
>
> [...]
>
> I am wrestling with this point on ARM too. If I were to have device
> trees describing caches in a way that threads get their own private
> caches, then this would not be possible to describe via device tree
> due to spec limitations (+CCed Rob), if I understood correctly.

You asked me for the opposite though, and I described how you can share
the cache. If you want a cache per thread, then you probably want a
node per thread.

Rob
On Thu, 2 Jan 2025 11:09:51 -0600
Rob Herring <robh@kernel.org> wrote:

> On Thu, Jan 2, 2025 at 8:57 AM Alireza Sanaee
> <alireza.sanaee@huawei.com> wrote:
> >
> > [...]
> >
> > I am wrestling with this point on ARM too. If I were to have device
> > trees describing caches in a way that threads get their own private
> > caches, then this would not be possible to describe via device tree
> > due to spec limitations (+CCed Rob), if I understood correctly.
>
> You asked me for the opposite though, and I described how you can
> share the cache. If you want a cache per thread, then you probably
> want a node per thread.
>
> Rob

Hi Rob,

That's right, I made the mistake in my prior message, and you recalled
correctly. I wanted shared caches between two threads; I had missed
your answer before and only just found it.

Thanks,
Alireza
On Thu, Jan 02, 2025 at 06:01:41PM +0000, Alireza Sanaee wrote:
> Date: Thu, 2 Jan 2025 18:01:41 +0000
> From: Alireza Sanaee <alireza.sanaee@huawei.com>
> Subject: Re: [PATCH v6 0/4] i386: Support SMP Cache Topology
> X-Mailer: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)
>
> [...]
>
> > You asked me for the opposite though, and I described how you can
> > share the cache. If you want a cache per thread, then you probably
> > want a node per thread.
> >
> > Rob
>
> Hi Rob,
>
> That's right, I made the mistake in my prior message, and you recalled
> correctly. I wanted shared caches between two threads; I had missed
> your answer before and only just found it.

Thank you all!

Alireza, do you know how to configure an ARM node through QEMU options?

However, IIUC, ARM needs more effort to configure cache per thread (by
configuring node topology)... In that case, since no one has explicitly
requested the need for cache per thread, I can disable cache per thread
for now. I can return an error for this scenario during the general
smp-cache option parsing (in the future, if there is a real need, it
can be easily re-enabled).

Will drop cache per thread in the next version.

Thanks,
Zhao
On Fri, 3 Jan 2025 16:25:58 +0800
Zhao Liu <zhao1.liu@intel.com> wrote:

> [...]
>
> Thank you all!
>
> Alireza, do you know how to configure an ARM node through QEMU
> options?

Hi Zhao, do you mean the -smp param?

> However, IIUC, ARM needs more effort to configure cache per thread (by
> configuring node topology)... In that case, since no one has
> explicitly requested the need for cache per thread, I can disable
> cache per thread for now. I can return an error for this scenario
> during the general smp-cache option parsing (in the future, if there
> is a real need, it can be easily re-enabled).
>
> Will drop cache per thread in the next version.
>
> Thanks,
> Zhao
> > [...]
> >
> > Alireza, do you know how to configure an ARM node through QEMU
> > options?
>
> Hi Zhao, do you mean the -smp param?

I mean: do you know how to configure something like "a node per thread"
via a QEMU option? :-) I'm curious about the relationship between
"node" and the SMP topology on the ARM side in current QEMU. I'm not
sure if this "node" refers to the NUMA node.

Thanks,
Zhao