Message ID | 20230208073533.715-1-bharata@amd.com (mailing list archive) |
---|---|
Series | Memory access profiler(IBS) driven NUMA balancing |
On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
> - Perf uses IBS and we are using the same IBS for access profiling here.
>   There needs to be a proper way to make the use mutually exclusive.

No, IFF this lives it needs to use in-kernel perf.

> - Is tying this up with NUMA balancing a reasonable approach or
>   should we look at a completely new approach?

Is it giving sufficient win to be worth it, afaict it doesn't come even
close to justifying it.

> - Hardware provided access information could be very useful for driving
>   hot page promotion in tiered memory systems. Need to check if this
>   requires different tuning/heuristics apart from what NUMA balancing
>   already does.

I think Huang Ying looked at that from the Intel POV and I think the
conclusion was that it doesn't really work out. What you need is
frequency information, but the PMU doesn't really give you that. You
need to process a *ton* of PMU data in-kernel.
On 2/8/23 10:03, Peter Zijlstra wrote:
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
> I think Huang Ying looked at that from the Intel POV and I think the
> conclusion was that it doesn't really work out. What you need is
> frequency information, but the PMU doesn't really give you that. You
> need to process a *ton* of PMU data in-kernel.

Yeah, there were two big problems.

First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
the PEBS records. They had to be translated back to host addresses to
be usable. That was extra expensive.

Second, it *did* take a lot of processing to turn raw memory accesses
into actionable frequency data. That meant that we started in a hole
performance-wise and had to make *REALLY* good decisions about page
migration to make up for it.

The performance data here don't look awful, but they don't seem to add
up to a clear win either. I'm having a hard time imagining who would
turn this on and how widely it would get used in practice.
On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>   There needs to be a proper way to make the use mutually exclusive.
>
> No, IFF this lives it needs to use in-kernel perf.

In fact I started out with in-kernel perf by using the
perf_event_create_kernel_counter() API. However there are issues with
using in-kernel perf:

- We want to reprogram the counter potentially during every context
  switch. The IBS hardware sample counter needs to be reprogrammed based
  on the incoming thread's view of the sample period. Additionally,
  sampling needs to be disabled for kernel threads. So I wanted to use
  perf_event_enable/disable() and perf_event_period(). However they take
  mutexes and hence it is not possible to use them from the sched switch
  atomic context.

- In-kernel perf gives a per-cpu counter, but we want it to count based
  on the task that is currently running. I.e., the period should be
  modified on a per-task basis. I don't see how an in-kernel perf event
  counter can be associated with a task like this.

Hence I didn't see an easy option other than making the use of IBS in
perf and NUMA balancing mutually exclusive.

>
>> - Is tying this up with NUMA balancing a reasonable approach or
>>   should we look at a completely new approach?
>
> Is it giving sufficient win to be worth it, afaict it doesn't come even
> close to justifying it.
>
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
>
> I think Huang Ying looked at that from the Intel POV and I think the
> conclusion was that it doesn't really work out. What you need is
> frequency information, but the PMU doesn't really give you that. You
> need to process a *ton* of PMU data in-kernel.

What I am doing here is to feed the access data into NUMA balancing which
already has the logic to aggregate that at task and numa group level and
decide if that access is actionable in terms of migrating the page. In this
context, I am not sure about the frequency information that you and Dave
are mentioning. AFAIU, existing NUMA balancing takes care of taking
action, IBS becomes an alternative source of access information to NUMA
hint faults.

Thanks for your inputs.

Regards,
Bharata.
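To make the idea of feeding access samples into NUMA balancing as fault-equivalents more concrete, here is a small user-space Python sketch. It is only an illustration under stated assumptions: the class `TaskAccessStats`, its method names, and the simple "node with the most sampled accesses" rule are hypothetical and do not reproduce the kernel's actual heuristics (which also track shared/private faults and numa groups).

```python
from collections import Counter


class TaskAccessStats:
    """Hypothetical per-task tally of sampled accesses, keyed by page node."""

    def __init__(self):
        self.by_node = Counter()  # page node -> number of sampled accesses
        self.local = 0            # samples where CPU node == page node
        self.remote = 0           # samples where CPU node != page node

    def record_access(self, cpu_node, page_node):
        """Account one hardware sample the way a hint fault would be counted."""
        self.by_node[page_node] += 1
        if cpu_node == page_node:
            self.local += 1
        else:
            self.remote += 1

    def local_percent(self):
        """Share of samples that hit memory on the accessing CPU's node."""
        total = self.local + self.remote
        return 100.0 * self.local / total if total else 0.0

    def preferred_node(self):
        """Node with the most sampled accesses (lowest id wins ties)."""
        if not self.by_node:
            return None
        return max(self.by_node, key=lambda n: (self.by_node[n], -n))


stats = TaskAccessStats()
for cpu_node, page_node in [(0, 0), (0, 1), (0, 1), (1, 1)]:
    stats.record_access(cpu_node, page_node)
```

In this toy run the task mostly touches pages on node 1, so a placement policy built on these counters would favour node 1 for the task or its pages.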
On 2/8/2023 11:42 PM, Dave Hansen wrote:
> On 2/8/23 10:03, Peter Zijlstra wrote:
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>> I think Huang Ying looked at that from the Intel POV and I think the
>> conclusion was that it doesn't really work out. What you need is
>> frequency information, but the PMU doesn't really give you that. You
>> need to process a *ton* of PMU data in-kernel.
>
> Yeah, there were two big problems.
>
> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
> the PEBS records. They had to be translated back to host addresses to
> be usable. That was extra expensive.

Just to be clear, I am using IBS in host only and it can give both virtual
and physical address.

>
> Second, it *did* take a lot of processing to turn raw memory accesses
> into actionable frequency data. That meant that we started in a hole
> performance-wise and had to make *REALLY* good decisions about page
> migration to make up for it.

I touched upon the frequency aspect in reply to Peter, but please let me
know if I am missing something.

>
> The performance data here don't look awful, but they don't seem to add
> up to a clear win either. I'm having a hard time imagining who would
> turn this on and how widely it would get used in practice.

I am hopeful with more appropriate tuning of NUMA balancing logic to work
with hardware-provided access info (as against scan based NUMA hint
faults), we should be able to see a clear win. At least theoretically we
wouldn't have the overheads of address space scanning and hint faults
handling.

Thanks for your inputs.

Regards,
Bharata.
On 2/8/23 22:04, Bharata B Rao wrote:
>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>> the PEBS records. They had to be translated back to host addresses to
>> be usable. That was extra expensive.
> Just to be clear, I am using IBS in host only and it can give both virtual
> and physical address.

Could you talk a little bit about how IBS might get used for NUMA
balancing guest memory?
On 2/9/2023 7:58 PM, Dave Hansen wrote:
> On 2/8/23 22:04, Bharata B Rao wrote:
>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>> the PEBS records. They had to be translated back to host addresses to
>>> be usable. That was extra expensive.
>> Just to be clear, I am using IBS in host only and it can give both virtual
>> and physical address.
>
> Could you talk a little bit about how IBS might get used for NUMA
> balancing guest memory?

IBS can work for guest, but will not provide physical address. Also
the support for virtualized IBS isn't upstream yet.

However I am not sure how effective or useful NUMA balancing within a guest
is, as the actual physical addresses are transparent to the guest.

Additionally when using IBS in host, it is possible to prevent collection
of samples from secure guests by using the PreventHostIBS feature.
(https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)

Regards,
Bharata.
On 2/9/23 20:28, Bharata B Rao wrote:
> On 2/9/2023 7:58 PM, Dave Hansen wrote:
>> On 2/8/23 22:04, Bharata B Rao wrote:
>>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>>> the PEBS records. They had to be translated back to host addresses to
>>>> be usable. That was extra expensive.
>>> Just to be clear, I am using IBS in host only and it can give both virtual
>>> and physical address.
>> Could you talk a little bit about how IBS might get used for NUMA
>> balancing guest memory?
> IBS can work for guest, but will not provide physical address. Also
> the support for virtualized IBS isn't upstream yet.
>
> However I am not sure how effective or useful NUMA balancing within a guest
> is, as the actual physical addresses are transparent to the guest.
>
> Additionally when using IBS in host, it is possible to prevent collection
> of samples from secure guests by using the PreventHostIBS feature.
> (https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)

I was wondering specifically about how a host might use IBS to balance
guest memory transparently to the guest. Not how a guest might use IBS
to balance its own memory.
On 2/10/2023 10:10 AM, Dave Hansen wrote:
> On 2/9/23 20:28, Bharata B Rao wrote:
>> On 2/9/2023 7:58 PM, Dave Hansen wrote:
>>> On 2/8/23 22:04, Bharata B Rao wrote:
>>>>> First, IIRC, Intel PEBS at the time only gave guest virtual addresses in
>>>>> the PEBS records. They had to be translated back to host addresses to
>>>>> be usable. That was extra expensive.
>>>> Just to be clear, I am using IBS in host only and it can give both virtual
>>>> and physical address.
>>> Could you talk a little bit about how IBS might get used for NUMA
>>> balancing guest memory?
>> IBS can work for guest, but will not provide physical address. Also
>> the support for virtualized IBS isn't upstream yet.
>>
>> However I am not sure how effective or useful NUMA balancing within a guest
>> is, as the actual physical addresses are transparent to the guest.
>>
>> Additionally when using IBS in host, it is possible to prevent collection
>> of samples from secure guests by using the PreventHostIBS feature.
>> (https://lore.kernel.org/linux-perf-users/20230206060545.628502-1-manali.shukla@amd.com/T/#)
> I was wondering specifically about how a host might use IBS to balance
> guest memory transparently to the guest. Not how a guest might use IBS
> to balance its own memory.

When guest memory accesses are captured by IBS in the host, IBS provides
the host physical address. Hence IBS based NUMA balancing in the host
should be able to balance guest memory transparently to the guest.

Regards,
Bharata.
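As a rough illustration of why a host physical address is enough for the host to act on a guest access, the sketch below maps a sampled physical address to a NUMA node and checks whether the access was remote. Everything here is assumed for the example: the two-node layout in `NODE_RANGES`, and the helper names `phys_to_node` and `is_remote_access`, are hypothetical and not taken from the patchset.

```python
import bisect

# Hypothetical memory layout: (start physical address, node id), sorted by
# start address. Node 1 is assumed to begin at 16 GiB.
NODE_RANGES = [(0x0, 0), (0x4_0000_0000, 1)]
_STARTS = [start for start, _ in NODE_RANGES]


def phys_to_node(phys_addr):
    """Return the node whose range contains phys_addr."""
    idx = bisect.bisect_right(_STARTS, phys_addr) - 1
    if idx < 0:
        raise ValueError("address below the first known range")
    return NODE_RANGES[idx][1]


def is_remote_access(phys_addr, cpu_node):
    """True if the sampled page lives on a different node than the CPU."""
    return phys_to_node(phys_addr) != cpu_node
```

The guest never appears in this lookup: the host only needs the sampled host physical address and the node of the CPU that took the sample.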
Bharata B Rao <bharata@amd.com> writes:

> On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>>
>>
>>> - Hardware provided access information could be very useful for driving
>>>   hot page promotion in tiered memory systems. Need to check if this
>>>   requires different tuning/heuristics apart from what NUMA balancing
>>>   already does.
>>
>> I think Huang Ying looked at that from the Intel POV and I think the
>> conclusion was that it doesn't really work out. What you need is
>> frequency information, but the PMU doesn't really give you that. You
>> need to process a *ton* of PMU data in-kernel.
>
> What I am doing here is to feed the access data into NUMA balancing which
> already has the logic to aggregate that at task and numa group level and
> decide if that access is actionable in terms of migrating the page. In this
> context, I am not sure about the frequency information that you and Dave
> are mentioning. AFAIU, existing NUMA balancing takes care of taking
> action, IBS becomes an alternative source of access information to NUMA
> hint faults.

We do need frequency information to determine whether a page is hot
enough to be migrated to the fast memory (promotion). What the PMU
provides is just "recently" accessed pages, not "frequently" accessed
pages. For the current NUMA balancing implementation, please check
NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory(). In
general, it estimates the page access frequency by measuring the
latency between page table scanning and page fault: the shorter the
latency, the higher the frequency. This isn't perfect, but provides a
starting point. You need to consider how to get frequency information
via the PMU. For example, you may count the number of accesses to each
page, age the counts periodically, and derive a hot threshold from some
statistics.

Best Regards,
Huang, Ying
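The counting-and-aging suggestion above could look roughly like the following user-space Python sketch. It is a minimal illustration, not kernel code: the class `PageHeatTracker`, the decay factor, the pruning cutoff, and the nearest-rank percentile used as the "hot threshold" are all assumptions chosen for the example.

```python
import math
from collections import defaultdict


class PageHeatTracker:
    """Hypothetical per-page access counter with periodic aging."""

    def __init__(self, decay=0.5):
        self.decay = decay                # factor applied on each aging pass
        self.counts = defaultdict(float)  # page frame number -> aged count

    def record_sample(self, pfn):
        """Account one sampled access to the page."""
        self.counts[pfn] += 1.0

    def age(self):
        """Scale all counts down so old accesses fade; drop near-zero pages."""
        for pfn in list(self.counts):
            self.counts[pfn] *= self.decay
            if self.counts[pfn] < 0.01:
                del self.counts[pfn]

    def hot_threshold(self, percentile=90):
        """Nearest-rank percentile of the current counts."""
        values = sorted(self.counts.values())
        if not values:
            return float("inf")
        rank = max(1, math.ceil(percentile / 100 * len(values)))
        return values[rank - 1]

    def hot_pages(self, percentile=90):
        """Pages whose aged count reaches the threshold, sorted by pfn."""
        threshold = self.hot_threshold(percentile)
        return sorted(p for p, c in self.counts.items() if c >= threshold)


tracker = PageHeatTracker()
for pfn, hits in [(1, 10), (2, 2), (3, 1)]:
    for _ in range(hits):
        tracker.record_sample(pfn)
```

With these samples only page 1 clears the 90th-percentile threshold. The aging pass is what turns "recently accessed" into an approximation of "frequently accessed": a page must keep being sampled to stay above the threshold.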
On 2/13/2023 8:26 AM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: > >> On 2/8/2023 11:33 PM, Peter Zijlstra wrote: >>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote: >>> >>> >>>> - Hardware provided access information could be very useful for driving >>>> hot page promotion in tiered memory systems. Need to check if this >>>> requires different tuning/heuristics apart from what NUMA balancing >>>> already does. >>> >>> I think Huang Ying looked at that from the Intel POV and I think the >>> conclusion was that it doesn't really work out. What you need is >>> frequency information, but the PMU doesn't really give you that. You >>> need to process a *ton* of PMU data in-kernel. >> >> What I am doing here is to feed the access data into NUMA balancing which >> already has the logic to aggregate that at task and numa group level and >> decide if that access is actionable in terms of migrating the page. In this >> context, I am not sure about the frequency information that you and Dave >> are mentioning. AFAIU, existing NUMA balancing takes care of taking >> action, IBS becomes an alternative source of access information to NUMA >> hint faults. > > We do need frequency information to determine whether a page is hot > enough to be migrated to the fast memory (promotion). What PMU provided > is just "recently" accessed pages, not "frequently" accessed pages. For > current NUMA balancing implementation, please check > NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory(). In > general, it estimates the page access frequency via measuring the > latency between page table scanning and page fault, the shorter the > latency, the higher the frequency. This isn't perfect, but provides a > starting point. You need to consider how to get frequency information > via PMU. For example, you may count access number for each page, aging > them periodically, and get hot threshold via some statistics. 
For the tiered memory hot page promotion case of NUMA balancing, we will
have to maintain frequency information in software when such information
isn't available from the hardware.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> Hi,
>
> Some hardware platforms can provide information about memory accesses
> that can be used to do optimal page and task placement on NUMA
> systems. AMD processors have a hardware facility called Instruction-
> Based Sampling (IBS) that can be used to gather specific metrics
> related to instruction fetch and execution activity. This facility
> can be used to perform memory access profiling based on statistical
> sampling.
>
> This RFC is a proof-of-concept implementation where the access
> information obtained from the hardware is used to drive NUMA balancing.
> With this it is no longer necessary to scan the address space and
> introduce NUMA hint faults to build task-to-page association. Hence
> the approach taken here is to replace the address space scanning plus
> hint faults with the access information provided by the hardware.

Your method can avoid the address space scanning, but in fact it cannot
avoid the memory access fault. The PMU will raise an NMI and then
task_work to process the sampled memory accesses. The overhead depends
on the frequency of the memory access sampling. Please measure the
overhead of your method in detail.

> The access samples obtained from hardware are fed to NUMA balancing
> as fault-equivalents. The rest of the NUMA balancing logic that
> collects/aggregates the shared/private/local/remote faults and does
> pages/task migrations based on the faults is retained except that
> accesses replace faults.
>
> This early implementation is an attempt to get a working solution
> only and as such a lot of TODOs exist:
>
> - Perf uses IBS and we are using the same IBS for access profiling here.
>   There needs to be a proper way to make the use mutually exclusive.
> - Is tying this up with NUMA balancing a reasonable approach or
>   should we look at a completely new approach?
> - When accesses replace faults in NUMA balancing, a few things have
>   to be tuned differently. All such decision points need to be
>   identified and appropriate tuning needs to be done.
> - Hardware provided access information could be very useful for driving
>   hot page promotion in tiered memory systems. Need to check if this
>   requires different tuning/heuristics apart from what NUMA balancing
>   already does.
> - Some of the values used to program the IBS counters like the sampling
>   period etc may not be the optimal or ideal values. The sample period
>   adjustment follows the same logic as scan period modification which
>   may not be ideal. More experimentation is required to fine-tune all
>   these aspects.
> - Currently I am acting (i.e., attempt to migrate a page) on each sampled
>   access. Need to check if it makes sense to delay it and do batched page
>   migration.

Your current implementation is tied to AMD IBS. You will need an
architecture/vendor independent framework for upstreaming.

BTW: can IBS sample memory writes too, or just memory reads?

> This RFC is mainly about showing how hardware provided access
> information could be used for NUMA balancing but I have run a
> few basic benchmarks from mmtests to check if this is any severe
> regression/overhead to any of those. Some benchmarks show some
> improvement, some show no significant change and a few regress.
> I am hopeful that with more appropriate tuning there is scope for
> further improvement here especially for workloads for which NUMA
> matters.

What improvement do you expect from PMU based NUMA balancing? Should it
come from reduced overhead? Higher accuracy? Quicker response? I think
that it may be better to prove that with appropriate statistics for at
least one workload.
>
> FWIW, here are the numbers in brief:
> (1st column is default kernel, 2nd column is with this patchset)
>
> kernbench
> =========
>                                    6.2.0-rc5          6.2.0-rc5-ibs
> Amean     user-512        19385.27 (   0.00%)    18140.69 *   6.42%*
> Amean     syst-512        21620.40 (   0.00%)    19984.87 *   7.56%*
> Amean     elsp-512           95.91 (   0.00%)       88.60 *   7.62%*
>
> Duration User       19385.45    18140.89
> Duration System     21620.90    19985.37
> Duration Elapsed       96.52       89.20
>
> Ops NUMA alloc hit                  552153976.00    499596610.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                552152782.00    499595620.00
> Ops NUMA base-page range updates       758004.00            0.00
> Ops NUMA PTE updates                   758004.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                   215654.00      1797848.00
> Ops NUMA hint local faults %             2054.00      1775103.00
> Ops NUMA hint local percent                 0.95           98.73
> Ops NUMA pages migrated                213600.00        22745.00
> Ops AutoNUMA cost                        1087.63         8989.67
>
> autonumabench
> =============
> Amean     syst-NUMA01                90516.91 (   0.00%)    65272.04 *  27.89%*
> Amean     syst-NUMA01_THREADLOCAL        0.26 (   0.00%)        0.27 *  -3.80%*
> Amean     syst-NUMA02                    1.10 (   0.00%)        1.02 *   7.24%*
> Amean     syst-NUMA02_SMT                0.74 (   0.00%)        0.90 * -21.77%*
> Amean     elsp-NUMA01                  747.73 (   0.00%)      625.29 *  16.37%*
> Amean     elsp-NUMA01_THREADLOCAL        1.07 (   0.00%)        1.07 *  -0.13%*
> Amean     elsp-NUMA02                    1.75 (   0.00%)        1.72 *   1.96%*
> Amean     elsp-NUMA02_SMT                3.03 (   0.00%)        3.04 *  -0.47%*
>
> Duration User     1312937.34  1148196.94
> Duration System    633634.59   456921.29
> Duration Elapsed     5289.47     4427.82
>
> Ops NUMA alloc hit                 1115625106.00    704004226.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                599879745.00    459968338.00
> Ops NUMA base-page range updates     74310268.00            0.00
> Ops NUMA PTE updates                 74310268.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                110504178.00     27624054.00
> Ops NUMA hint local faults %         54257985.00     17310888.00
> Ops NUMA hint local percent                49.10           62.67
> Ops NUMA pages migrated              11399016.00      7983717.00
> Ops AutoNUMA cost                      553257.64       138271.96
>
> tbench4 Latency
> ===============
> Amean     latency-1           0.08 (   0.00%)        0.08 *   1.43%*
> Amean     latency-2           0.10 (   0.00%)        0.11 *  -2.75%*
> Amean     latency-4           0.14 (   0.00%)        0.13 *   4.31%*
> Amean     latency-8           0.14 (   0.00%)        0.14 *  -0.94%*
> Amean     latency-16          0.20 (   0.00%)        0.19 *   8.01%*
> Amean     latency-32          0.24 (   0.00%)        0.20 *  12.92%*
> Amean     latency-64          0.34 (   0.00%)        0.28 *  18.30%*
> Amean     latency-128         1.71 (   0.00%)        1.44 *  16.04%*
> Amean     latency-256         0.52 (   0.00%)        0.69 * -32.26%*
> Amean     latency-512         3.27 (   0.00%)        5.32 * -62.62%*
> Amean     latency-1024        0.00 (   0.00%)        0.00 *   0.00%*
> Amean     latency-2048        0.00 (   0.00%)        0.00 *   0.00%*
>
> tbench4 Throughput
> ==================
> Hmean     1          504.57 (   0.00%)      496.80 *  -1.54%*
> Hmean     2         1006.46 (   0.00%)      990.04 *  -1.63%*
> Hmean     4         1855.11 (   0.00%)     1933.76 *   4.24%*
> Hmean     8         3711.49 (   0.00%)     3582.32 *  -3.48%*
> Hmean     16        6707.58 (   0.00%)     6674.46 *  -0.49%*
> Hmean     32       13146.81 (   0.00%)    12649.49 *  -3.78%*
> Hmean     64       20922.72 (   0.00%)    22605.55 *   8.04%*
> Hmean     128      33637.07 (   0.00%)    37870.35 *  12.59%*
> Hmean     256      54083.12 (   0.00%)    50257.25 *  -7.07%*
> Hmean     512      72455.66 (   0.00%)    53141.88 * -26.66%*
> Hmean     1024    124413.95 (   0.00%)   117398.40 *  -5.64%*
> Hmean     2048    124481.61 (   0.00%)   124892.12 *   0.33%*
>
> Ops NUMA alloc hit                 2092196681.00   2007852353.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local               2092193601.00   2007849231.00
> Ops NUMA base-page range updates       298999.00            0.00
> Ops NUMA PTE updates                   298999.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                   287539.00      4499166.00
> Ops NUMA hint local faults %            98931.00      4349685.00
> Ops NUMA hint local percent                34.41           96.68
> Ops NUMA pages migrated                169086.00       149476.00
> Ops AutoNUMA cost                        1443.00        22498.67
>
> Duration User       23999.54    24476.30
> Duration System    160480.07   164366.91
> Duration Elapsed     2685.19     2685.69
>
> netperf-udp
> ===========
> Hmean     send-64         226.57 (   0.00%)      225.41 *  -0.51%*
> Hmean     send-128        450.89 (   0.00%)      448.90 *  -0.44%*
> Hmean     send-256        899.63 (   0.00%)      898.02 *  -0.18%*
> Hmean     send-1024      3510.63 (   0.00%)     3526.24 *   0.44%*
> Hmean     send-2048      6493.15 (   0.00%)     6493.27 *   0.00%*
> Hmean     send-3312      9778.22 (   0.00%)     9801.03 *   0.23%*
> Hmean     send-4096     11523.43 (   0.00%)    11490.57 *  -0.29%*
> Hmean     send-8192     18666.11 (   0.00%)    18686.99 *   0.11%*
> Hmean     send-16384    28112.56 (   0.00%)    28223.81 *   0.40%*
> Hmean     recv-64         226.57 (   0.00%)      225.41 *  -0.51%*
> Hmean     recv-128        450.88 (   0.00%)      448.90 *  -0.44%*
> Hmean     recv-256        899.63 (   0.00%)      898.01 *  -0.18%*
> Hmean     recv-1024      3510.61 (   0.00%)     3526.21 *   0.44%*
> Hmean     recv-2048      6493.07 (   0.00%)     6493.15 *   0.00%*
> Hmean     recv-3312      9777.95 (   0.00%)     9800.85 *   0.23%*
> Hmean     recv-4096     11522.87 (   0.00%)    11490.47 *  -0.28%*
> Hmean     recv-8192     18665.83 (   0.00%)    18686.56 *   0.11%*
> Hmean     recv-16384    28112.13 (   0.00%)    28223.73 *   0.40%*
>
> Duration User          48.52       48.74
> Duration System       931.24      925.83
> Duration Elapsed     1934.05     1934.79
>
> Ops NUMA alloc hit                   60042365.00     60144256.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                 60042305.00     60144228.00
> Ops NUMA base-page range updates         6630.00            0.00
> Ops NUMA PTE updates                     6630.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                     5709.00        26249.00
> Ops NUMA hint local faults %             3030.00        25130.00
> Ops NUMA hint local percent                53.07           95.74
> Ops NUMA pages migrated                  2500.00         1119.00
> Ops AutoNUMA cost                          28.64          131.27
>
> netperf-udp-rr
> ==============
> Hmean     1    132319.16 (   0.00%)   130621.99 *  -1.28%*
>
> Duration User           9.92        9.97
> Duration System       118.02      119.26
> Duration Elapsed      432.12      432.91
>
> Ops NUMA alloc hit                     289650.00       289222.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                   289642.00       289222.00
> Ops NUMA base-page range updates            1.00            0.00
> Ops NUMA PTE updates                        1.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                        1.00           51.00
> Ops NUMA hint local faults %                0.00           45.00
> Ops NUMA hint local percent                 0.00           88.24
> Ops NUMA pages migrated                     1.00            6.00
> Ops AutoNUMA cost                           0.01            0.26
>
> netperf-tcp-rr
> ==============
> Hmean     1    118141.46 (   0.00%)   115515.41 *  -2.22%*
>
> Duration User           9.59        9.52
> Duration System       120.32      121.66
> Duration Elapsed      432.20      432.49
>
> Ops NUMA alloc hit                     291257.00       290927.00
> Ops NUMA alloc miss                         0.00            0.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                   291233.00       290923.00
> Ops NUMA base-page range updates            2.00            0.00
> Ops NUMA PTE updates                        2.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                        2.00           46.00
> Ops NUMA hint local faults %                0.00           42.00
> Ops NUMA hint local percent                 0.00           91.30
> Ops NUMA pages migrated                     2.00            4.00
> Ops AutoNUMA cost                           0.01            0.23
>
> dbench
> ======
> dbench4 Latency
>
> Amean     latency-1          2.13 (   0.00%)       10.92 *-411.44%*
> Amean     latency-2         12.03 (   0.00%)        8.17 *  32.07%*
> Amean     latency-4         21.12 (   0.00%)        9.60 *  54.55%*
> Amean     latency-8         41.20 (   0.00%)       33.59 *  18.45%*
> Amean     latency-16        76.85 (   0.00%)       75.84 *   1.31%*
> Amean     latency-32        91.68 (   0.00%)       90.26 *   1.55%*
> Amean     latency-64       124.61 (   0.00%)      113.31 *   9.07%*
> Amean     latency-128      140.14 (   0.00%)      126.29 *   9.89%*
> Amean     latency-256      155.63 (   0.00%)      142.11 *   8.69%*
> Amean     latency-512      258.60 (   0.00%)      243.13 *   5.98%*
>
> dbench4 Throughput (misleading but traditional)
>
> Hmean     1        987.47 (   0.00%)      938.07 *  -5.00%*
> Hmean     2       1750.10 (   0.00%)     1697.08 *  -3.03%*
> Hmean     4       2990.33 (   0.00%)     3023.23 *   1.10%*
> Hmean     8       3557.38 (   0.00%)     3863.32 *   8.60%*
> Hmean     16      2705.90 (   0.00%)     2660.48 *  -1.68%*
> Hmean     32      2954.08 (   0.00%)     3101.59 *   4.99%*
> Hmean     64      3061.68 (   0.00%)     3206.15 *   4.72%*
> Hmean     128     2867.74 (   0.00%)     3080.21 *   7.41%*
> Hmean     256     2585.58 (   0.00%)     2875.44 *  11.21%*
> Hmean     512     1777.80 (   0.00%)     1777.79 *  -0.00%*
>
> Duration User        2359.02     2246.44
> Duration System     18927.83    16856.91
> Duration Elapsed     1901.54     1901.44
>
> Ops NUMA alloc hit                  240556255.00    255283721.00
> Ops NUMA alloc miss                    408851.00        62903.00
> Ops NUMA interleave hit                     0.00            0.00
> Ops NUMA alloc local                240547816.00    255264974.00
> Ops NUMA base-page range updates       204316.00            0.00
> Ops NUMA PTE updates                   204316.00            0.00
> Ops NUMA PMD updates                        0.00            0.00
> Ops NUMA hint faults                   201101.00       287642.00
> Ops NUMA hint local faults %           104199.00       153547.00
> Ops NUMA hint local percent                51.81           53.38
> Ops NUMA pages migrated                 96158.00       134083.00
> Ops AutoNUMA cost                        1008.76         1440.76
>
> Bharata B Rao (5):
>   x86/ibs: In-kernel IBS driver for page access profiling
>   x86/ibs: Drive NUMA balancing via IBS access data
>   x86/ibs: Enable per-process IBS from sched switch path
>   x86/ibs: Adjust access faults sampling period
>   x86/ibs: Delay the collection of HW-provided access info
>
>  arch/x86/events/amd/ibs.c        |   6 +
>  arch/x86/include/asm/msr-index.h |  12 ++
>  arch/x86/mm/Makefile             |   1 +
>  arch/x86/mm/ibs.c                | 250 +++++++++++++++++++++++++++++++
>  include/linux/migrate.h          |   1 +
>  include/linux/mm.h               |   2 +
>  include/linux/mm_types.h         |   3 +
>  include/linux/sched.h            |   4 +
>  include/linux/vm_event_item.h    |  12 ++
>  kernel/sched/core.c              |   1 +
>  kernel/sched/debug.c             |  10 ++
>  kernel/sched/fair.c              | 142 ++++++++++++++++--
>  kernel/sched/sched.h             |   9 ++
>  mm/memory.c                      |  92 ++++++++++++
>  mm/vmstat.c                      |  12 ++
>  15 files changed, 544 insertions(+), 13 deletions(-)
>  create mode 100644 arch/x86/mm/ibs.c

Best Regards,
Huang, Ying
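For readers checking the quoted tables, the derived columns can be recomputed from the raw counters. The Python sketch below is only a reading aid; the helper names `hint_local_percent` and `improvement_percent` are made up for this example, and the sign convention shown (positive means the patched value is lower) matches the Amean time/latency rows, not the Hmean throughput rows where higher is better.

```python
def hint_local_percent(local_faults, hint_faults):
    """'NUMA hint local percent' = local hint faults / all hint faults."""
    if hint_faults == 0:
        return 0.0
    return round(100.0 * local_faults / hint_faults, 2)


def improvement_percent(baseline, patched):
    """Relative reduction versus baseline; positive when patched is lower."""
    return round(100.0 * (baseline - patched) / baseline, 2)


# kernbench rows from the quoted tables (default kernel, then patchset).
kernbench_local_default = hint_local_percent(2054, 215654)
kernbench_local_ibs = hint_local_percent(1775103, 1797848)
kernbench_elapsed_gain = improvement_percent(95.91, 88.60)
```

These reproduce the 0.95 and 98.73 "hint local percent" entries and the 7.62% elapsed-time change reported for kernbench.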
Bharata B Rao <bharata@amd.com> writes: > On 2/13/2023 8:26 AM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: >> >>> On 2/8/2023 11:33 PM, Peter Zijlstra wrote: >>>> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote: >>>> >>>> >>>>> - Hardware provided access information could be very useful for driving >>>>> hot page promotion in tiered memory systems. Need to check if this >>>>> requires different tuning/heuristics apart from what NUMA balancing >>>>> already does. >>>> >>>> I think Huang Ying looked at that from the Intel POV and I think the >>>> conclusion was that it doesn't really work out. What you need is >>>> frequency information, but the PMU doesn't really give you that. You >>>> need to process a *ton* of PMU data in-kernel. >>> >>> What I am doing here is to feed the access data into NUMA balancing which >>> already has the logic to aggregate that at task and numa group level and >>> decide if that access is actionable in terms of migrating the page. In this >>> context, I am not sure about the frequency information that you and Dave >>> are mentioning. AFAIU, existing NUMA balancing takes care of taking >>> action, IBS becomes an alternative source of access information to NUMA >>> hint faults. >> >> We do need frequency information to determine whether a page is hot >> enough to be migrated to the fast memory (promotion). What PMU provided >> is just "recently" accessed pages, not "frequently" accessed pages. For >> current NUMA balancing implementation, please check >> NUMA_BALANCING_MEMORY_TIERING in should_numa_migrate_memory(). In >> general, it estimates the page access frequency via measuring the >> latency between page table scanning and page fault, the shorter the >> latency, the higher the frequency. This isn't perfect, but provides a >> starting point. You need to consider how to get frequency information >> via PMU. 
For example, you may count access number for each page, aging
>> them periodically, and get hot threshold via some statistics.
>
> For the tiered memory hot page promotion case of NUMA balancing, we will
> have to maintain frequency information in software when such information
> isn't available from the hardware.

Yes. It's challenging to calculate frequency information. Please
consider how to do that.

Best Regards,
Huang, Ying
On 2/13/2023 8:56 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
>> Hi,
>>
>> Some hardware platforms can provide information about memory accesses
>> that can be used to do optimal page and task placement on NUMA
>> systems. AMD processors have a hardware facility called Instruction-
>> Based Sampling (IBS) that can be used to gather specific metrics
>> related to instruction fetch and execution activity. This facility
>> can be used to perform memory access profiling based on statistical
>> sampling.
>>
>> This RFC is a proof-of-concept implementation where the access
>> information obtained from the hardware is used to drive NUMA balancing.
>> With this it is no longer necessary to scan the address space and
>> introduce NUMA hint faults to build task-to-page association. Hence
>> the approach taken here is to replace the address space scanning plus
>> hint faults with the access information provided by the hardware.
>
> You method can avoid the address space scanning, but cannot avoid memory
> access fault in fact. PMU will raise NMI and then task_work to process
> the sampled memory accesses. The overhead depends on the frequency of
> the memory access sampling. Please measure the overhead of your method
> in details.

Yes, the address space scanning is avoided. I will measure the overhead
of the hint fault vs NMI handling path. The actual processing of the
access from task_work context is pretty much similar to the stats
processing from hint faults. As you note, the overhead depends on the
frequency of sampling. In this current approach, the sampling period is
per-task and it varies based on the same logic that NUMA balancing uses
to vary the scan period.

>
>> The access samples obtained from hardware are fed to NUMA balancing
>> as fault-equivalents. The rest of the NUMA balancing logic that
>> collects/aggregates the shared/private/local/remote faults and does
>> pages/task migrations based on the faults is retained except that
>> accesses replace faults.
>>
>> This early implementation is an attempt to get a working solution
>> only and as such a lot of TODOs exist:
>>
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>   There needs to be a proper way to make the use mutually exclusive.
>> - Is tying this up with NUMA balancing a reasonable approach or
>>   should we look at a completely new approach?
>> - When accesses replace faults in NUMA balancing, a few things have
>>   to be tuned differently. All such decision points need to be
>>   identified and appropriate tuning needs to be done.
>> - Hardware provided access information could be very useful for driving
>>   hot page promotion in tiered memory systems. Need to check if this
>>   requires different tuning/heuristics apart from what NUMA balancing
>>   already does.
>> - Some of the values used to program the IBS counters like the sampling
>>   period etc may not be the optimal or ideal values. The sample period
>>   adjustment follows the same logic as scan period modification which
>>   may not be ideal. More experimentation is required to fine-tune all
>>   these aspects.
>> - Currently I am acting (i.e., attempt to migrate a page) on each sampled
>>   access. Need to check if it makes sense to delay it and do batched page
>>   migration.
>
> You current implementation is tied with AMD IBS. You will need a
> architecture/vendor independent framework for upstreaming.

I have tried to keep it vendor and arch neutral as far as possible, and
will of course re-look into this to make the interfaces more robust and
useful.

I have defined a static key (hw_access_hints=false) which will be set
only by the platform driver when it detects the hardware capability to
provide memory access information. NUMA balancing code skips the address
space scanning when it sees this capability. The platform driver (access
fault handler) will call into the NUMA balancing API with linear and
physical address information of the accessed sample. Hence any
equivalent hardware functionality could plug into this scheme in its
current form. There are checks for this static key in the NUMA balancing
logic at a few points to decide if it should work based on access faults
or hint faults.

>
> BTW: can IBS sampling memory writing too? Or just memory reading?

IBS can tag both store and load operations.

>
>> This RFC is mainly about showing how hardware provided access
>> information could be used for NUMA balancing but I have run a
>> few basic benchmarks from mmtests to check if this is any severe
>> regression/overhead to any of those. Some benchmarks show some
>> improvement, some show no significant change and a few regress.
>> I am hopeful that with more appropriate tuning there is scope for
>> further improvement here especially for workloads for which NUMA
>> matters.
>
> What's your expected improvement of the PMU based NUMA balancing? It
> should come from reduced overhead? higher accuracy? Quicker response?
> I think that it may be better to prove that with appropriate statistics
> for at least one workload.

Just to clarify, unlike PEBS, IBS works independently of PMU.

I believe the improvement will come from reduced overhead due to
sampling of relevant accesses only.

I have a microbenchmark where two sets of threads bound to two NUMA
nodes access the two different halves of memory which is initially
allocated on the first node.

On a two-node Zen4 system, with 64 threads in each set accessing 8G of
memory each from the initial allocation of 16G, I see that IBS driven
NUMA balancing (i.e., this patchset) takes 50% less time to complete a
fixed number of memory accesses. This could well be the best case and
real workloads/benchmarks may not get this much uplift, but it does show
the potential gain to be had.

Thanks for your inputs.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> On 2/13/2023 8:56 AM, Huang, Ying wrote:

[...]

>> BTW: can IBS sample memory writes too? Or just memory reads?
>
> IBS can tag both store and load operations.

Thanks for your information!

>> What's your expected improvement of the PMU based NUMA balancing? It
>> should come from reduced overhead? higher accuracy? Quicker response?
>> I think that it may be better to prove that with appropriate statistics
>> for at least one workload.
>
> Just to clarify, unlike PEBS, IBS works independently of PMU.

Good to know this, thanks!

> I believe the improvement will come from reduced overhead due to
> sampling of relevant accesses only.
>
> I have a microbenchmark where two sets of threads bound to two
> NUMA nodes access the two different halves of memory which is
> initially allocated on the 1st node.
>
> On a two node Zen4 system, with 64 threads in each set accessing
> 8G of memory each from the initial allocation of 16G, I see that
> IBS driven NUMA balancing (i.e., this patchset) takes 50% less time
> to complete a fixed number of memory accesses. This could well
> be the best case and real workloads/benchmarks may not get this much
> uplift, but it does show the potential gain to be had.

Can you find a way to show the overhead of the original implementation
and of your method? Then we can compare them, because you think
the improvement comes from the reduced overhead.

I am also interested in the page migration throughput per second during
the test, because I suspect your method can migrate pages faster.

Best Regards,
Huang, Ying
On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>> I have a microbenchmark where two sets of threads bound to two
>> NUMA nodes access the two different halves of memory which is
>> initially allocated on the 1st node.
>>
>> On a two node Zen4 system, with 64 threads in each set accessing
>> 8G of memory each from the initial allocation of 16G, I see that
>> IBS driven NUMA balancing (i.e., this patchset) takes 50% less time
>> to complete a fixed number of memory accesses. This could well
>> be the best case and real workloads/benchmarks may not get this much
>> uplift, but it does show the potential gain to be had.
>
> Can you find a way to show the overhead of the original implementation
> and your method? Then we can compare between them? Because you think
> the improvement comes from the reduced overhead.

Sure, will measure the overhead.

>
> I also have interest in the pages migration throughput per second during
> the test, because I suspect your method can migrate pages faster.

I have some data on pages migrated over time for the benchmark I
mentioned above.

[Graph: Pages migrated vs Time (s). Series: Default, IBS.
 X axis: Time (s), 0 to 160. Y axis: Pages migrated, 0 to 2500000.]

So acting upon the relevant accesses early enough seems to result in
pages migrating faster in the beginning.

Here is the actual data behind the graph:

numa_pages_migrated vs time in seconds
======================================

Time    Default     IBS
---------------------------
5       2639        511
10      2639        17724
15      2699        134632
20      2699        253485
25      2699        386296
30      159805      524651
35      450678      667622
40      741762      811603
45      971848      950691
50      1108475     1084537
55      1246229     1215265
60      1385920     1336521
65      1508354     1446950
70      1624068     1544890
75      1739311     1629162
80      1854639     1700068
85      1979906     1759025
90      2099857     <end>
95      2099857
100     2099857
105     2099859
110     2099859
115     2099859
120     2099859
125     2099859
130     2099859
135     2099859
140     2099859
145     2099859
150     2099859
155     2099859
160     2099859

Regards,
Bharata.
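The table above can be turned into per-interval migration rates to see when each run starts migrating in earnest. The following is a minimal sketch, not part of the patchset; the helper names `first_time_reaching` and `interval_rates` are made up for illustration, and only the 5 s to 85 s window, where both runs have samples, is used.

```python
# Cumulative numa_pages_migrated samples copied from the table above.
# Only the window where both runs have data (5 s .. 85 s) is used.
TIMES = list(range(5, 90, 5))
DEFAULT = [2639, 2639, 2699, 2699, 2699, 159805, 450678, 741762, 971848,
           1108475, 1246229, 1385920, 1508354, 1624068, 1739311, 1854639,
           1979906]
IBS = [511, 17724, 134632, 253485, 386296, 524651, 667622, 811603, 950691,
       1084537, 1215265, 1336521, 1446950, 1544890, 1629162, 1700068,
       1759025]


def first_time_reaching(times, cumulative, threshold):
    """Return the first sample time at which cumulative >= threshold."""
    for t, pages in zip(times, cumulative):
        if pages >= threshold:
            return t
    return None


def interval_rates(times, cumulative):
    """Return (end_time, pages_per_second) for each sampling interval."""
    rates = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        rates.append((times[i], (cumulative[i] - cumulative[i - 1]) / dt))
    return rates


if __name__ == "__main__":
    for name, series in (("Default", DEFAULT), ("IBS", IBS)):
        start = first_time_reaching(TIMES, series, 100_000)
        peak = max(rate for _, rate in interval_rates(TIMES, series))
        print(f"{name}: 100000 pages reached by {start} s, "
              f"peak rate {peak:.0f} pages/s")
```

Read this way, the data is consistent with the observation above that the IBS run begins migrating earlier; the rate numbers are derived only from the 5-second samples and say nothing about behaviour between samples.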
Bharata B Rao <bharata@amd.com> writes:

> On 13-Feb-23 12:00 PM, Huang, Ying wrote:

[...]

>> I also have interest in the pages migration throughput per second during
>> the test, because I suspect your method can migrate pages faster.
>
> I have some data on pages migrated over time for the benchmark I mentioned
> above.
>
> [Graph: Pages migrated vs Time (s). Series: Default, IBS.]
>
> So acting upon the relevant accesses early enough seems to result in
> pages migrating faster in the beginning.

One way to prove this is to output the benchmark performance
periodically, so that we can see how the benchmark score changes over
time.

Best Regards,
Huang, Ying

> Here is the actual data in case the above ascii graph gets jumbled up:
>
> numa_pages_migrated vs time in seconds

[...]

> Regards,
> Bharata.
On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
> On 13-Feb-23 12:00 PM, Huang, Ying wrote:

[...]

>> Can you find a way to show the overhead of the original implementation
>> and your method? Then we can compare between them? Because you think
>> the improvement comes from the reduced overhead.
>
> Sure, will measure the overhead.

I used the ftrace function_graph tracer to measure the amount of time
(in us) spent in fault handling and task_work handling in both methods
while the above mentioned benchmark was running.

                          Default         IBS
Fault handling            29879668.71     1226770.84
Task work handling        24878.894       10635593.82
Sched switch handling                     78159.846

Total                     29904547.6      11940524.51

In the default case, the fault handling duration is measured
by tracing do_numa_page() and the task_work duration is tracked
by task_numa_work().

In the IBS case, the fault handling is tracked by the NMI handler
ibs_overflow_handler(), the task_work is tracked by
task_ibs_access_work() and the sched switch overhead is tracked by
hw_access_sched_in(). Note that in the IBS case, not much is done in
the NMI handler; the bulk of the work (page migration etc) happens in
task_work context, unlike the default case.

The breakdown of the numbers is given below:

Default
=======
                        Duration       Min     Max       Avg
do_numa_page            29879668.71    0.08    317.166   17.16
task_numa_work          24878.894      0.2     3424.19   388.73
Total                   29904547.6

IBS
===
                        Duration       Min     Max       Avg
ibs_overflow_handler    1226770.84     0.15    104.918   1.26
task_ibs_access_work    10635593.82    0.21    398.428   29.81
hw_access_sched_in      78159.846      0.15    247.922   1.29
Total                   11940524.51

Regards,
Bharata.
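The totals in the tables above are simple sums of the per-function durations. As a small cross-check, and assuming the figures are transcribed correctly, the following sketch recomputes the totals, the Default/IBS overhead ratio and each function's share of its total. The dictionary and helper names are illustrative only.

```python
# Per-function durations in microseconds, copied from the tables above.
DEFAULT_US = {
    "do_numa_page": 29879668.71,
    "task_numa_work": 24878.894,
}
IBS_US = {
    "ibs_overflow_handler": 1226770.84,
    "task_ibs_access_work": 10635593.82,
    "hw_access_sched_in": 78159.846,
}


def total_us(parts):
    """Sum the per-function durations of one configuration."""
    return sum(parts.values())


def shares(parts):
    """Return each function's fraction of the configuration's total."""
    total = total_us(parts)
    return {name: value / total for name, value in parts.items()}


if __name__ == "__main__":
    default_total = total_us(DEFAULT_US)
    ibs_total = total_us(IBS_US)
    print(f"Default total: {default_total:.2f} us")
    print(f"IBS total:     {ibs_total:.2f} us")
    print(f"Default / IBS: {default_total / ibs_total:.2f}x")
    for label, parts in (("Default", DEFAULT_US), ("IBS", IBS_US)):
        for name, frac in shares(parts).items():
            print(f"  {label:8s} {name:22s} {frac:6.1%}")
```

The shares make the point in the text visible: in the default case nearly all of the time is in the fault path, while in the IBS case most of it is in task_work context.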
Bharata B Rao <bharata@amd.com> writes:

> On 14-Feb-23 10:25 AM, Bharata B Rao wrote:

[...]

> I used ftrace function_graph tracer to measure the amount of time (in us)
> spent in fault handling and task_work handling in both the methods when
> the above mentioned benchmark was running.
>
>                           Default         IBS
> Fault handling            29879668.71     1226770.84
> Task work handling        24878.894       10635593.82
> Sched switch handling                     78159.846
>
> Total                     29904547.6      11940524.51

Thanks! You have shown the large overhead difference between the
original method and your method. Can you show the number of pages
migrated too? I think the overhead per page would be a good overhead
indicator as well.

Can it be translated into the performance improvement? As I understand
it, the total overhead is small compared with the total run time.

Best Regards,
Huang, Ying

> In the default case, the fault handling duration is measured
> by tracing do_numa_page() and the task_work duration is tracked
> by task_numa_work().

[...]

> Regards,
> Bharata.
On 15-Feb-23 11:37 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:

[...]

>> I have some data on pages migrated over time for the benchmark I mentioned
>> above.
>>
>> [Graph: Pages migrated vs Time (s). Series: Default, IBS.]
>>
>> So acting upon the relevant accesses early enough seems to result in
>> pages migrating faster in the beginning.
>
> One way to prove this is to output the benchmark performance
> periodically. So we can find how the benchmark score change over time.

Here is the data from a different run that captures the benchmark
scores periodically. The benchmark touches a fixed amount of memory a
fixed number of times iteratively. I am capturing the iteration number
for one of the threads, which starts by touching memory that is
completely remote at the beginning. A higher iteration number means
the thread is making progress quickly, which eventually shows up in
the benchmark score.

[Graph: Access iterations vs Time. Series: Default, IBS.
 X axis: Time (s), 0 to 180. Y axis: Access iterations, 0 to 500.]

The way the number of migrated pages varies for the above runs is
shown in the graph below:

[Graph: Pages migrated vs Time. Series: Default, IBS.
 X axis: Time (s), 0 to 200. Y axis: Pages migrated, 0 to 2500000.]

The final benchmark scores for the above runs compare like this:

                Default         IBS
Time (us)       174459192.0     54710778.0

Regards,
Bharata.
On 17-Feb-23 11:33 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:

[...]

>>                           Default         IBS
>> Fault handling            29879668.71     1226770.84
>> Task work handling        24878.894       10635593.82
>> Sched switch handling                     78159.846
>>
>> Total                     29904547.6      11940524.51
>
> Thanks! You have shown the large overhead difference between the
> original method and your method. Can you show the number of the pages
> migrated too? I think the overhead / page can be a good overhead
> indicator too.
>
> Can it be translated to the performance improvement? Per my
> understanding, the total overhead is small compared with total run time.

I captured some of the numbers that you wanted for two different runs.
The first case shows the data for a short run (fewer memory access
iterations) and the second one is for a long run (more iterations).

Short-run
=========
Time taken or overhead (us) for fault, task_work and sched_switch
handling

                          Default         IBS
Fault handling            29017953.99     1196828.67
Task work handling        10354.40        10356778.53
Sched switch handling                     56572.21
Total overhead            29028308.39     11610179.41

Benchmark score(us)       194050290       53963650
numa_pages_migrated       2097256         662755
Overhead / page           13.84           17.51
Pages migrated per sec    72248.64        57083.95

Default
-------
                        Total          Min     Max        Avg
do_numa_page            29017953.99    0.1     307.63     15.97
task_numa_work          10354.40       2.86    4573.60    175.50
Total                   29028308.39

IBS
---
                        Total          Min     Max        Avg
ibs_overflow_handler    1196828.67     0.15    100.28     1.26
task_ibs_access_work    10356778.53    0.21    10504.14   28.42
hw_access_sched_in      56572.21       0.15    16.94      1.45
Total                   11610179.41


Long-run
========
Time taken or overhead (us) for fault, task_work and sched_switch
handling

                          Default         IBS
Fault handling            27437756.73     901406.37
Task work handling        1741.66         4902935.32
Sched switch handling                     100590.33
Total overhead            27439498.38     5904932.02

Benchmark score(us)       306786210.0     153422489.0
numa_pages_migrated       2097218         1746099
Overhead / page           13.08           3.38
Pages migrated per sec    6836.08         11380.98

Default
-------
                        Total          Min     Max        Avg
do_numa_page            27437756.73    0.08    363.475    15.03
task_numa_work          1741.66        3.294   1200.71    42.48
Total                   27439498.38

IBS
---
                        Total          Min     Max        Avg
ibs_overflow_handler    901406.37      0.15    95.51      1.06
task_ibs_access_work    4902935.32     0.22    11013.68   9.64
hw_access_sched_in      100590.33      0.14    91.97      1.52
Total                   5904932.02

Regards,
Bharata.
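The two derived rows in the tables above can be recomputed from the raw rows. The sketch below is an illustration under the assumption that "Overhead / page" is total overhead divided by numa_pages_migrated; the function names and the `RUNS` layout are made up for the example. Working backwards from the reported values, the short-run "Pages migrated per sec" figures appear to match pages divided by the total overhead time, whereas the long-run figures appear to match pages divided by the benchmark score, so the two rows may not be directly comparable.

```python
def overhead_per_page(total_overhead_us, pages_migrated):
    """Microseconds of balancing overhead spent per migrated page."""
    return total_overhead_us / pages_migrated


def pages_per_sec(pages_migrated, duration_us):
    """Pages migrated per second over a duration given in microseconds."""
    return pages_migrated / (duration_us / 1_000_000)


# (total overhead us, benchmark score us, numa_pages_migrated), copied
# from the short-run and long-run tables above.
RUNS = {
    ("short", "Default"): (29028308.39, 194050290.0, 2097256),
    ("short", "IBS"): (11610179.41, 53963650.0, 662755),
    ("long", "Default"): (27439498.38, 306786210.0, 2097218),
    ("long", "IBS"): (5904932.02, 153422489.0, 1746099),
}

if __name__ == "__main__":
    for (run, mode), (overhead, score, pages) in RUNS.items():
        print(f"{run:5s} {mode:7s} "
              f"overhead/page={overhead_per_page(overhead, pages):6.2f} us  "
              f"pages/s over overhead time="
              f"{pages_per_sec(pages, overhead):9.2f}  "
              f"pages/s over benchmark time="
              f"{pages_per_sec(pages, score):9.2f}")
```

Printing both rates side by side makes it easy to see which denominator a reported figure corresponds to.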
Bharata B Rao <bharata@amd.com> writes:

> On 17-Feb-23 11:33 AM, Huang, Ying wrote:

[...]

> I captured some of the numbers that you wanted for two different runs.
> The first case shows the data for a short run (less number of memory access
> iterations) and the second one is for a long run (more number of iterations)
>
> Short-run
> =========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                           Default         IBS
> Fault handling            29017953.99     1196828.67
> Task work handling        10354.40        10356778.53
> Sched switch handling                     56572.21
> Total overhead            29028308.39     11610179.41
>
> Benchmark score(us)       194050290       53963650
> numa_pages_migrated       2097256         662755
> Overhead / page           13.84           17.51

From the above, the overhead/page is similar.

> Pages migrated per sec    72248.64        57083.95

[...]

> Long-run
> ========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                           Default         IBS
> Fault handling            27437756.73     901406.37
> Task work handling        1741.66         4902935.32
> Sched switch handling                     100590.33
> Total overhead            27439498.38     5904932.02
>
> Benchmark score(us)       306786210.0     153422489.0
> numa_pages_migrated       2097218         1746099
> Overhead / page           13.08           3.38

But from this, the overhead/page is quite different.

One possibility is that there are more "local" hint page faults in the
original implementation. We can check "numa_hint_faults" and
"numa_hint_faults_local" in /proc/vmstat for that.

If numa_hint_faults_local / numa_hint_faults is similar, then for each
page migrated the number of hint page faults is similar, and the run
time of each hint page fault handler is similar too. Or did I make
some mistake in the analysis?

> Pages migrated per sec    6836.08         11380.98

[...]

Thank you very much for the detailed data. Can you provide some
analysis of your data?

Best Regards,
Huang, Ying
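The check suggested above, comparing numa_hint_faults_local against numa_hint_faults, could be scripted roughly as follows. This is a minimal sketch, assuming a kernel built with NUMA balancing so that both counters are present in /proc/vmstat; since the counters are cumulative, it takes a snapshot before and after the benchmark and works on the difference. The function names are illustrative.

```python
def parse_vmstat(text):
    """Parse "name value" lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def counter_delta(before, after):
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def local_hint_fault_ratio(counters):
    """numa_hint_faults_local / numa_hint_faults, or None if no faults."""
    total = counters.get("numa_hint_faults", 0)
    if total == 0:
        return None
    return counters.get("numa_hint_faults_local", 0) / total


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    before = read_vmstat()
    input("Run the benchmark, then press Enter... ")
    after = read_vmstat()
    ratio = local_hint_fault_ratio(counter_delta(before, after))
    print("no hint faults recorded" if ratio is None
          else f"local hint fault ratio: {ratio:.1%}")
```

Taking deltas keeps activity from before the benchmark out of the ratio, although other tasks running at the same time still contribute to the system-wide counters.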
On 27-Feb-23 1:24 PM, Huang, Ying wrote:
> Thank you very much for the detailed data.  Can you provide some analysis
> for your data?

The overhead numbers I shared earlier weren't correct as I
realized that while obtaining those numbers from function_graph
tracing, the trace buffer was silently getting overrun. I had to
reduce the number of memory access iterations to ensure that I get
the full trace buffer. I will be summarizing the findings
based on these new numbers below.

Just to recap - The microbenchmark is run on an AMD Genoa
two node system. The benchmark has two sets of threads
(one affined to each node) accessing two different chunks
of memory (chunk size 8G) which are initially allocated
on the first node. The benchmark touches each page in the
chunk iteratively for a fixed number of iterations (384
in this case given below). The benchmark score is the
amount of time it takes to complete the specified number
of accesses.

Here is the data for the benchmark run:

Time taken or overhead (us) for fault, task_work and sched_switch
handling

                                Default         IBS
Fault handling                  2875354862      2602455
Task work handling              139023          24008121
Sched switch handling                           37712
Total overhead                  2875493885      26648288

Default
-------
                        Total           Min     Max       Avg
do_numa_page            2875354862      0.08    392.13    22.11
task_numa_work          139023          0.14    5365.77   532.66
Total                   2875493885

IBS
---
                        Total           Min     Max       Avg
ibs_overflow_handler    2602455         0.14    103.91    1.29
task_ibs_access_work    24008121        0.17    485.09    37.65
hw_access_sched_in      37712           0.15    287.55    1.35
Total                   26648288

                                Default         IBS
Benchmark score(us)             160171762.0     40323293.0
numa_pages_migrated             2097220         511791
Overhead per page               1371            52
Pages migrated per sec          13094           12692
numa_hint_faults_local          2820311         140856
numa_hint_faults                38589520        652647
hint_faults_local/hint_faults   7%              22%

Here is the summary:

- In case of IBS, the benchmark completes 75% faster compared to
  the default case. The gain varies based on how many iterations of
  memory accesses we run as part of the benchmark. For 2048 iterations
  of accesses, I have seen a gain of around 50%.
- The overhead of NUMA balancing (as measured by the time taken in
  the fault handling, task_work time handling and sched_switch time
  handling) in the default case is seen to be pretty high compared to
  the IBS case.
- The number of hint-faults in the default case is significantly
  higher than the IBS case.
- The local hint-faults percentage is much better in the IBS
  case compared to the default case.
- As shown in the graphs (in other threads of this mail thread), in
  the default case, the page migrations start a bit slowly while IBS
  case shows steady migrations right from the start.
- I have also shown (via graphs in other threads of this mail thread)
  that in IBS case the benchmark is able to steadily increase
  the access iterations over time, while in the default case, the
  benchmark doesn't do forward progress for a long time after
  an initial increase.
- Early migrations due to relevant access sampling from IBS,
  is most probably the significant reason for the uplift that IBS
  case gets.
- It is consistently seen that the benchmark in the IBS case manages
  to complete the specified number of accesses even before the entire
  chunk of memory gets migrated. The early migrations are offsetting
  the cost of remote accesses too.
- In the IBS case, we re-program the IBS counters for the incoming
  task in the sched_switch path. It is seen that this overhead isn't
  that significant to slow down the benchmark.
- One of the differences between the default case and the IBS case
  is about when the faults-since-last-scan is updated/folded into the
  historical faults stats and subsequent scan period update. Since we
  don't have the notion of scanning in IBS, I have a threshold (number
  of access faults) to determine when to update the historical faults
  and the IBS sample period. I need to check if quicker migrations
  could result from this change.
- Finally, all this is for the above mentioned microbenchmark. The
  gains on other benchmarks are yet to be evaluated.

Regards,
Bharata.
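The derived rows in the tables above (total overhead, overhead per page, pages migrated per sec, the local hint-fault percentage and the 75% figure) can be recomputed from the raw counters. The following is only a sketch for cross-checking the arithmetic: the dictionary layout and the helper names derive() and time_reduction_pct() are made up for illustration, and "overhead per page" is assumed to mean total overhead in microseconds divided by numa_pages_migrated.

```python
# Raw figures copied from the tables in the mail above.
# The structure and helper names are illustrative assumptions.
RUNS = {
    "Default": {
        "overhead_us": {"do_numa_page": 2875354862, "task_numa_work": 139023},
        "score_us": 160171762.0,
        "pages_migrated": 2097220,
        "hint_faults": 38589520,
        "hint_faults_local": 2820311,
    },
    "IBS": {
        "overhead_us": {
            "ibs_overflow_handler": 2602455,
            "task_ibs_access_work": 24008121,
            "hw_access_sched_in": 37712,
        },
        "score_us": 40323293.0,
        "pages_migrated": 511791,
        "hint_faults": 652647,
        "hint_faults_local": 140856,
    },
}


def derive(run):
    """Recompute the derived rows of the summary table for one run."""
    total = sum(run["overhead_us"].values())
    return {
        "total_overhead_us": total,
        # Assumed definition: total overhead divided by pages migrated.
        "overhead_per_page_us": total / run["pages_migrated"],
        # The benchmark score is in microseconds, so convert to seconds.
        "pages_per_sec": run["pages_migrated"] / (run["score_us"] / 1e6),
        "local_fault_pct": 100.0 * run["hint_faults_local"] / run["hint_faults"],
    }


def time_reduction_pct(baseline_us, candidate_us):
    """Percentage by which the candidate run time is shorter than the baseline."""
    return 100.0 * (1.0 - candidate_us / baseline_us)


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, {k: round(v, 2) for k, v in derive(run).items()})
    print("time reduction %:",
          round(time_reduction_pct(RUNS["Default"]["score_us"],
                                   RUNS["IBS"]["score_us"]), 1))
```

Rounded to whole numbers, these helpers reproduce the values reported in the tables.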
Bharata B Rao <bharata@amd.com> writes:

> On 27-Feb-23 1:24 PM, Huang, Ying wrote:
>> Thank you very much for the detailed data.  Can you provide some analysis
>> for your data?
>
> The overhead numbers I shared earlier weren't correct as I
> realized that while obtaining those numbers from function_graph
> tracing, the trace buffer was silently getting overrun. I had to
> reduce the number of memory access iterations to ensure that I get
> the full trace buffer. I will be summarizing the findings
> based on these new numbers below.
>
> Just to recap - The microbenchmark is run on an AMD Genoa
> two node system. The benchmark has two sets of threads
> (one affined to each node) accessing two different chunks
> of memory (chunk size 8G) which are initially allocated
> on the first node. The benchmark touches each page in the
> chunk iteratively for a fixed number of iterations (384
> in this case given below). The benchmark score is the
> amount of time it takes to complete the specified number
> of accesses.
>
> Here is the data for the benchmark run:
>
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                                 Default         IBS
> Fault handling                  2875354862      2602455
> Task work handling              139023          24008121
> Sched switch handling                           37712
> Total overhead                  2875493885      26648288
>
> Default
> -------
>                         Total           Min     Max       Avg
> do_numa_page            2875354862      0.08    392.13    22.11
> task_numa_work          139023          0.14    5365.77   532.66
> Total                   2875493885
>
> IBS
> ---
>                         Total           Min     Max       Avg
> ibs_overflow_handler    2602455         0.14    103.91    1.29
> task_ibs_access_work    24008121        0.17    485.09    37.65
> hw_access_sched_in      37712           0.15    287.55    1.35
> Total                   26648288
>
>
>                                 Default         IBS
> Benchmark score(us)             160171762.0     40323293.0
> numa_pages_migrated             2097220         511791
> Overhead per page               1371            52
> Pages migrated per sec          13094           12692
> numa_hint_faults_local          2820311         140856
> numa_hint_faults                38589520        652647

For default, numa_hint_faults >> numa_pages_migrated.  It's hard to
understand.  I guess that there aren't many shared pages in the
benchmark?  And I guess that the free pages in the target node are enough
too?

> hint_faults_local/hint_faults   7%              22%
>
> Here is the summary:
>
> - In case of IBS, the benchmark completes 75% faster compared to
>   the default case. The gain varies based on how many iterations of
>   memory accesses we run as part of the benchmark. For 2048 iterations
>   of accesses, I have seen a gain of around 50%.
> - The overhead of NUMA balancing (as measured by the time taken in
>   the fault handling, task_work time handling and sched_switch time
>   handling) in the default case is seen to be pretty high compared to
>   the IBS case.
> - The number of hint-faults in the default case is significantly
>   higher than the IBS case.
> - The local hint-faults percentage is much better in the IBS
>   case compared to the default case.
> - As shown in the graphs (in other threads of this mail thread), in
>   the default case, the page migrations start a bit slowly while IBS
>   case shows steady migrations right from the start.
> - I have also shown (via graphs in other threads of this mail thread)
>   that in IBS case the benchmark is able to steadily increase
>   the access iterations over time, while in the default case, the
>   benchmark doesn't do forward progress for a long time after
>   an initial increase.

Hard to understand this too.  Pages are migrated to local, but
performance doesn't improve.

> - Early migrations due to relevant access sampling from IBS,
>   is most probably the significant reason for the uplift that IBS
>   case gets.

In the original kernel, the NUMA page table scanning will delay for a
while.  Please check the below comments in task_tick_numa().

	/*
	 * Using runtime rather than walltime has the dual advantage that
	 * we (mostly) drive the selection from busy threads and that the
	 * task needs to have done some actual work before we bother with
	 * NUMA placement.
	 */

I think this is generally reasonable, while it's not best for this
micro-benchmark.

Best Regards,
Huang, Ying

> - It is consistently seen that the benchmark in the IBS case manages
>   to complete the specified number of accesses even before the entire
>   chunk of memory gets migrated. The early migrations are offsetting
>   the cost of remote accesses too.
> - In the IBS case, we re-program the IBS counters for the incoming
>   task in the sched_switch path. It is seen that this overhead isn't
>   that significant to slow down the benchmark.
> - One of the differences between the default case and the IBS case
>   is about when the faults-since-last-scan is updated/folded into the
>   historical faults stats and subsequent scan period update. Since we
>   don't have the notion of scanning in IBS, I have a threshold (number
>   of access faults) to determine when to update the historical faults
>   and the IBS sample period. I need to check if quicker migrations
>   could result from this change.
> - Finally, all this is for the above mentioned microbenchmark. The
>   gains on other benchmarks are yet to be evaluated.
>
> Regards,
> Bharata.
On 02-Mar-23 1:40 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>>
>> Here is the data for the benchmark run:
>>
>> Time taken or overhead (us) for fault, task_work and sched_switch
>> handling
>>
>>                                 Default         IBS
>> Fault handling                  2875354862      2602455
>> Task work handling              139023          24008121
>> Sched switch handling                           37712
>> Total overhead                  2875493885      26648288
>>
>> Default
>> -------
>>                         Total           Min     Max       Avg
>> do_numa_page            2875354862      0.08    392.13    22.11
>> task_numa_work          139023          0.14    5365.77   532.66
>> Total                   2875493885
>>
>> IBS
>> ---
>>                         Total           Min     Max       Avg
>> ibs_overflow_handler    2602455         0.14    103.91    1.29
>> task_ibs_access_work    24008121        0.17    485.09    37.65
>> hw_access_sched_in      37712           0.15    287.55    1.35
>> Total                   26648288
>>
>>
>>                                 Default         IBS
>> Benchmark score(us)             160171762.0     40323293.0
>> numa_pages_migrated             2097220         511791
>> Overhead per page               1371            52
>> Pages migrated per sec          13094           12692
>> numa_hint_faults_local          2820311         140856
>> numa_hint_faults                38589520        652647
>
> For default, numa_hint_faults >> numa_pages_migrated.  It's hard to
> understand.

Most of the migration requests from the numa hint page fault path
are failing due to failure to isolate the pages.

This is the check in migrate_misplaced_page() from where it returns
without even trying to do the subsequent migrate_pages() call:

	isolated = numamigrate_isolate_page(pgdat, page);
	if (!isolated)
		goto out;

I will further investigate this.

> I guess that there aren't many shared pages in the
> benchmark?

I have a version of the benchmark which has a fraction of
shared memory between sets of threads in addition to the
per-set exclusive memory. Here too the same performance
difference is seen.

> And I guess that the free pages in the target node are enough
> too?

The benchmark is using 16G in total with 8G being accessed from
threads on either node. There is enough memory on the target
node to accept the incoming page migration requests.

>
>> hint_faults_local/hint_faults   7%              22%
>>
>> Here is the summary:
>>
>> - In case of IBS, the benchmark completes 75% faster compared to
>>   the default case. The gain varies based on how many iterations of
>>   memory accesses we run as part of the benchmark. For 2048 iterations
>>   of accesses, I have seen a gain of around 50%.
>> - The overhead of NUMA balancing (as measured by the time taken in
>>   the fault handling, task_work time handling and sched_switch time
>>   handling) in the default case is seen to be pretty high compared to
>>   the IBS case.
>> - The number of hint-faults in the default case is significantly
>>   higher than the IBS case.
>> - The local hint-faults percentage is much better in the IBS
>>   case compared to the default case.
>> - As shown in the graphs (in other threads of this mail thread), in
>>   the default case, the page migrations start a bit slowly while IBS
>>   case shows steady migrations right from the start.
>> - I have also shown (via graphs in other threads of this mail thread)
>>   that in IBS case the benchmark is able to steadily increase
>>   the access iterations over time, while in the default case, the
>>   benchmark doesn't do forward progress for a long time after
>>   an initial increase.
>
> Hard to understand this too.  Pages are migrated to local, but
> performance doesn't improve.

Probably because migrations start a bit late and too much time is
spent later in the run in hint faults and failed migration attempts
(due to failure to isolate the pages)?

>
>> - Early migrations due to relevant access sampling from IBS,
>>   is most probably the significant reason for the uplift that IBS
>>   case gets.
>
> In the original kernel, the NUMA page table scanning will delay for a
> while.  Please check the below comments in task_tick_numa().
>
> 	/*
> 	 * Using runtime rather than walltime has the dual advantage that
> 	 * we (mostly) drive the selection from busy threads and that the
> 	 * task needs to have done some actual work before we bother with
> 	 * NUMA placement.
> 	 */
>
> I think this is generally reasonable, while it's not best for this
> micro-benchmark.

This is in addition to the initial scan delay that we have via
sysctl_numa_balancing_scan_delay. I have an equivalent of this
initial delay where the IBS access sampling is not started for
the task until an initial delay.

Thanks for your observations.

Regards,
Bharata.
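The gap between hint faults and migrated pages discussed in this mail can be quantified from the reported counters. The sketch below rests on a simplifying assumption that is not stated in the thread: every non-local hint fault (numa_hint_faults minus numa_hint_faults_local) is treated as one migration candidate, and numa_pages_migrated is compared against that. The helper name migration_yield() is hypothetical, and the calculation does not distinguish isolation failures from other reasons for not migrating.

```python
def migration_yield(hint_faults, hint_faults_local, pages_migrated):
    """Return (remote_faults, migrated pages per remote hint fault).

    Assumption: each non-local hint fault counts as one migration candidate.
    """
    remote_faults = hint_faults - hint_faults_local
    return remote_faults, pages_migrated / remote_faults


if __name__ == "__main__":
    # Counter values are the ones reported in the benchmark tables.
    for name, args in (
        ("Default", (38589520, 2820311, 2097220)),
        ("IBS", (652647, 140856, 511791)),
    ):
        remote, ratio = migration_yield(*args)
        print(f"{name}: remote hint faults={remote}, "
              f"migrated per remote fault={ratio:.3f}")
```

Under that assumption, the default case migrates a page for roughly 6% of the non-local hint faults, while in the IBS case the two counts are equal. This is consistent with most default-case migration attempts not completing.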
Bharata B Rao <bharata@amd.com> writes:

> On 02-Mar-23 1:40 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>>
>>> Here is the data for the benchmark run:
>>>
>>> Time taken or overhead (us) for fault, task_work and sched_switch
>>> handling
>>>
>>>                                 Default         IBS
>>> Fault handling                  2875354862      2602455
>>> Task work handling              139023          24008121
>>> Sched switch handling                           37712
>>> Total overhead                  2875493885      26648288
>>>
>>> Default
>>> -------
>>>                         Total           Min     Max       Avg
>>> do_numa_page            2875354862      0.08    392.13    22.11
>>> task_numa_work          139023          0.14    5365.77   532.66
>>> Total                   2875493885
>>>
>>> IBS
>>> ---
>>>                         Total           Min     Max       Avg
>>> ibs_overflow_handler    2602455         0.14    103.91    1.29
>>> task_ibs_access_work    24008121        0.17    485.09    37.65
>>> hw_access_sched_in      37712           0.15    287.55    1.35
>>> Total                   26648288
>>>
>>>
>>>                                 Default         IBS
>>> Benchmark score(us)             160171762.0     40323293.0
>>> numa_pages_migrated             2097220         511791
>>> Overhead per page               1371            52
>>> Pages migrated per sec          13094           12692
>>> numa_hint_faults_local          2820311         140856
>>> numa_hint_faults                38589520        652647
>>
>> For default, numa_hint_faults >> numa_pages_migrated.  It's hard to
>> understand.
>
> Most of the migration requests from the numa hint page fault path
> are failing due to failure to isolate the pages.
>
> This is the check in migrate_misplaced_page() from where it returns
> without even trying to do the subsequent migrate_pages() call:
>
> 	isolated = numamigrate_isolate_page(pgdat, page);
> 	if (!isolated)
> 		goto out;
>
> I will further investigate this.
>
>> I guess that there aren't many shared pages in the
>> benchmark?
>
> I have a version of the benchmark which has a fraction of
> shared memory between sets of threads in addition to the
> per-set exclusive memory. Here too the same performance
> difference is seen.
>
>> And I guess that the free pages in the target node are enough
>> too?
>
> The benchmark is using 16G in total with 8G being accessed from
> threads on either node. There is enough memory on the target
> node to accept the incoming page migration requests.
>
>>
>>> hint_faults_local/hint_faults   7%              22%
>>>
>>> Here is the summary:
>>>
>>> - In case of IBS, the benchmark completes 75% faster compared to
>>>   the default case. The gain varies based on how many iterations of
>>>   memory accesses we run as part of the benchmark. For 2048 iterations
>>>   of accesses, I have seen a gain of around 50%.
>>> - The overhead of NUMA balancing (as measured by the time taken in
>>>   the fault handling, task_work time handling and sched_switch time
>>>   handling) in the default case is seen to be pretty high compared to
>>>   the IBS case.
>>> - The number of hint-faults in the default case is significantly
>>>   higher than the IBS case.
>>> - The local hint-faults percentage is much better in the IBS
>>>   case compared to the default case.
>>> - As shown in the graphs (in other threads of this mail thread), in
>>>   the default case, the page migrations start a bit slowly while IBS
>>>   case shows steady migrations right from the start.
>>> - I have also shown (via graphs in other threads of this mail thread)
>>>   that in IBS case the benchmark is able to steadily increase
>>>   the access iterations over time, while in the default case, the
>>>   benchmark doesn't do forward progress for a long time after
>>>   an initial increase.
>>
>> Hard to understand this too.  Pages are migrated to local, but
>> performance doesn't improve.
>
> Probably because migrations start a bit late and too much time is
> spent later in the run in hint faults and failed migration attempts
> (due to failure to isolate the pages)?
>
>>
>>> - Early migrations due to relevant access sampling from IBS,
>>>   is most probably the significant reason for the uplift that IBS
>>>   case gets.
>>
>> In the original kernel, the NUMA page table scanning will delay for a
>> while.  Please check the below comments in task_tick_numa().
>>
>> 	/*
>> 	 * Using runtime rather than walltime has the dual advantage that
>> 	 * we (mostly) drive the selection from busy threads and that the
>> 	 * task needs to have done some actual work before we bother with
>> 	 * NUMA placement.
>> 	 */
>>
>> I think this is generally reasonable, while it's not best for this
>> micro-benchmark.
>
> This is in addition to the initial scan delay that we have via
> sysctl_numa_balancing_scan_delay. I have an equivalent of this
> initial delay where the IBS access sampling is not started for
> the task until an initial delay.

What is the memory accessing pattern of the workload?  Uniform random or
something like a Gaussian distribution?

Anyway, it may take some time for the original method to scan enough
memory space to trigger enough hint page faults.  We can check
numa_pte_updates to check whether enough virtual space has been scanned.

Best Regards,
Huang, Ying
On 03-Mar-23 11:23 AM, Huang, Ying wrote:
>
> What is the memory accessing pattern of the workload?  Uniform random or
> something like a Gaussian distribution?

Multiple iterations of uniform access from beginning to end of the
memory region.

>
> Anyway, it may take some time for the original method to scan enough
> memory space to trigger enough hint page faults.  We can check
> numa_pte_updates to check whether enough virtual space has been scanned.

I see that numa_hint_faults is way higher (sometimes close to 5 times)
than numa_pte_updates. This doesn't make sense. Very rarely I do see
saner numbers and when that happens the benchmark score is also much better.

Looks like an issue with the default kernel itself. I will debug this
further and get back.

Regards,
Bharata.
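The comparison of numa_hint_faults against numa_pte_updates described in this mail can be scripted from /proc/vmstat snapshots taken before and after a run. This is a minimal sketch, not the tooling used in the thread: the counter names are the ones /proc/vmstat exposes with NUMA balancing enabled, while the helper names (parse_vmstat, counter_delta, faults_per_pte_update, read_vmstat) are hypothetical. It assumes, as the mail does, that a ratio well above 1 is suspicious.

```python
# Counters of interest, as named in /proc/vmstat on a NUMA-balancing kernel.
NUMA_COUNTERS = (
    "numa_pte_updates",
    "numa_hint_faults",
    "numa_hint_faults_local",
    "numa_pages_migrated",
)


def parse_vmstat(text):
    """Extract the NUMA balancing counters from /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in NUMA_COUNTERS:
            counters[fields[0]] = int(fields[1])
    return counters


def counter_delta(before, after):
    """Per-counter difference between two snapshots (after minus before)."""
    return {name: after[name] - before[name] for name in after if name in before}


def faults_per_pte_update(counters):
    """Hint faults per PTE update, or None if nothing was scanned."""
    updates = counters.get("numa_pte_updates", 0)
    if updates == 0:
        return None
    return counters.get("numa_hint_faults", 0) / updates


def read_vmstat(path="/proc/vmstat"):
    """Take a snapshot of the NUMA counters from the running system."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

A typical use would be to call read_vmstat() before and after the benchmark, pass both snapshots to counter_delta(), and inspect faults_per_pte_update() of the result.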
Bharata B Rao <bharata@amd.com> writes:

> On 03-Mar-23 11:23 AM, Huang, Ying wrote:
>>
>> What is the memory accessing pattern of the workload?  Uniform random or
>> something like a Gaussian distribution?
>
> Multiple iterations of uniform access from beginning to end of the
> memory region.

I guess these are sequential accesses rather than random accesses with a
uniform distribution.

>>
>> Anyway, it may take some time for the original method to scan enough
>> memory space to trigger enough hint page faults.  We can check
>> numa_pte_updates to check whether enough virtual space has been scanned.
>
> I see that numa_hint_faults is way higher (sometimes close to 5 times)
> than numa_pte_updates. This doesn't make sense. Very rarely I do see
> saner numbers and when that happens the benchmark score is also much better.
>
> Looks like an issue with the default kernel itself. I will debug this
> further and get back.

Yes.  It appears that something is wrong.

Best Regards,
Huang, Ying