Message ID | 20240707094956.94654-4-laoar.shao@gmail.com (mailing list archive)
---|---
State | New
Series | mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Yafang Shao <laoar.shao@gmail.com> writes:

> The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> quickly experimenting with specific workloads in a production environment,
> particularly when monitoring latency spikes caused by contention on the
> zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> is introduced as a more practical alternative.

In general, I'm neutral on the change.  I can understand that a kernel
configuration option isn't as flexible as a sysctl knob.  But a sysctl
knob is ABI too.

> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multiple
> smaller zones, as suggested by Matthew[0], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[1]. However, implementing these solutions is
> likely to necessitate a more extended development effort.

Per my understanding, the change will hurt rather than improve zone->lock
contention.  What it will reduce is page allocation/freeing latency.

> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -----------
>  mm/page_alloc.c                         | 22 ++++++++++++++------
>  5 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index e86c968a7a0e..eb9e5216eefe 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm:
>  - page_lock_unfairness
>  - panic_on_oom
>  - percpu_pagelist_high_fraction
> +- pcp_batch_scale_max
>  - stat_interval
>  - stat_refresh
>  - numa_stat
> @@ -864,6 +865,20 @@ mark based on the low watermark for the zone and the number of local
>  online CPUs.  If the user writes '0' to this sysctl, it will revert to
>  this default behavior.
>
> +pcp_batch_scale_max
> +===================
> +
> +In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> +batches.  The batch number is scaled automatically to improve page
> +allocation/free throughput.  But too large scale factor may hurt
> +latency.  This option sets the upper limit of scale factor to limit
> +the maximum latency.
> +
> +The range for this parameter spans from 0 to 6, with a default value of 5.
> +The value assigned to 'N' signifies that during each refilling or draining
> +process, a maximum of (batch << N) pages will be involved, where "batch"
> +represents the default batch size automatically computed by the kernel for
> +each zone.
>
>  stat_interval
>  =============
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 09db2f2e6488..fb797f1c0ef7 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -52,6 +52,7 @@ struct ctl_dir;
>  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
>  #define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
>  #define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
> +#define SYSCTL_SIX ((void *)&sysctl_vals[12])
>
>  extern const int sysctl_vals[];
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e0b917328cf9..430ac4f58eb7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,7 @@
>  #endif
>
>  /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 6 };
>  EXPORT_SYMBOL(sysctl_vals);
>
>  const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b4cb45255a54..41fe4c13b7ac 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
>  config CONTIG_ALLOC
>  	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
>
> -config PCP_BATCH_SCALE_MAX
> -	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
> -	default 5
> -	range 0 6
> -	help
> -	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> -	  batches.  The batch number is scaled automatically to improve page
> -	  allocation/free throughput.  But too large scale factor may hurt
> -	  latency.  This option sets the upper limit of scale factor to limit
> -	  the maximum latency.
> -
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2b76754a48e0..703eec22a997 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
>  int user_min_free_kbytes = -1;
>  static int watermark_boost_factor __read_mostly = 15000;
>  static int watermark_scale_factor = 10;
> +static int pcp_batch_scale_max = 5;
>
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -2310,7 +2311,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  	int count = READ_ONCE(pcp->count);
>
>  	while (count) {
> -		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
>  		count -= to_drain;
>
>  		spin_lock(&pcp->lock);
> @@ -2438,7 +2439,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
>
>  	/* Free as much as possible if batch freeing high-order pages. */
>  	if (unlikely(free_high))
> -		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		return min(pcp->count, batch << pcp_batch_scale_max);
>
>  	/* Check for PCP disabled or boot pageset */
>  	if (unlikely(high < batch))
> @@ -2470,7 +2471,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  		return 0;
>
>  	if (unlikely(free_high)) {
> -		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
> +		pcp->high = max(high - (batch << pcp_batch_scale_max),
>  				high_min);
>  		return 0;
>  	}
> @@ -2540,9 +2541,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>  		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>  	}
> -	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +	if (pcp->free_count < (batch << pcp_batch_scale_max))
>  		pcp->free_count = min(pcp->free_count + (1 << order),
> -				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +				      batch << pcp_batch_scale_max);
>  	high = nr_pcp_high(pcp, zone, batch, free_high);
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> @@ -2884,7 +2885,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  	 * subsequent allocation of order-0 pages without any freeing.
>  	 */
>  	if (batch <= max_nr_alloc &&
> -	    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
> +	    pcp->alloc_factor < pcp_batch_scale_max)
>  		pcp->alloc_factor++;
>  	batch = min(batch, max_nr_alloc);
>  }
> @@ -6251,6 +6252,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
>  		.proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
>  		.extra1 = SYSCTL_ZERO,
>  	},
> +	{
> +		.procname = "pcp_batch_scale_max",
> +		.data = &pcp_batch_scale_max,
> +		.maxlen = sizeof(pcp_batch_scale_max),
> +		.mode = 0644,
> +		.proc_handler = proc_dointvec_minmax,
> +		.extra1 = SYSCTL_ZERO,
> +		.extra2 = SYSCTL_SIX,
> +	},
>  	{
>  		.procname = "lowmem_reserve_ratio",
>  		.data = &sysctl_lowmem_reserve_ratio,

--
Best Regards,
Huang, Ying
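[Editor's note: for a sense of the magnitudes this knob bounds: the per-zone
"batch" value is computed by the kernel at boot and, on large zones with
4 KiB pages, commonly works out to 63 (an assumption about typical systems,
not a value stated in the patch). With batch = 63, the default scale factor
of 5 allows up to 63 << 5 = 2016 pages (about 7.9 MiB) to be moved per PCP
refill or drain under a single zone->lock hold, while 0 caps it at one
batch. A minimal tuning sketch, assuming the patch above is applied:

    # Read the current limit; this sysctl only exists with the patch applied.
    sysctl vm.pcp_batch_scale_max

    # Cap each PCP refill/drain at one batch (batch << 0), trading some
    # allocation/freeing throughput for a shorter worst-case zone->lock
    # hold time (e.g. 63 pages instead of 2016, assuming batch = 63).
    sysctl -w vm.pcp_batch_scale_max=0
]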
On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> In general, I'm neutral on the change.  I can understand that a kernel
> configuration option isn't as flexible as a sysctl knob.  But a sysctl
> knob is ABI too.
>
> Per my understanding, the change will hurt rather than improve zone->lock
> contention.  What it will reduce is page allocation/freeing latency.

I'm quite perplexed by your recent comment. You introduced a
configuration that has proven to be difficult to use, and you have
been resistant to suggestions for modifying it to a more user-friendly
and practical tuning approach. May I inquire about the rationale
behind introducing this configuration in the beginning?
Yafang Shao <laoar.shao@gmail.com> writes:

> I'm quite perplexed by your recent comment. You introduced a
> configuration that has proven to be difficult to use, and you have
> been resistant to suggestions for modifying it to a more user-friendly
> and practical tuning approach. May I inquire about the rationale
> behind introducing this configuration in the beginning?

Sorry, I don't understand your words.  Do you need me to explain what
"neutral" means?

--
Best Regards,
Huang, Ying
On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Sorry, I don't understand your words.  Do you need me to explain what
> "neutral" means?

No, thanks.
After consulting with ChatGPT, I received a clear and comprehensive
explanation of what "neutral" means, providing me with a better
understanding of the concept.

So, can you explain why you introduced it as a config in the beginning?
Yafang Shao <laoar.shao@gmail.com> writes:

> So, can you explain why you introduced it as a config in the beginning?

I think that I have explained it in the commit log of commit
52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
latency"), which introduces the config.

A sysctl knob is ABI, which needs to be maintained forever.  Can you
explain why you need it?  Why can't you use a fixed value after your
initial experiments?

Best Regards,
Huang, Ying
On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> I think that I have explained it in the commit log of commit
> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
> latency"), which introduces the config.

What specifically are your expectations for how users should utilize
this config in a real production workload?

> A sysctl knob is ABI, which needs to be maintained forever.  Can you
> explain why you need it?  Why can't you use a fixed value after your
> initial experiments?

Given the extensive scale of our production environment, with hundreds
of thousands of servers, it begs the question: how do you propose we
efficiently manage the various workloads that remain unaffected by the
sysctl change implemented on just a few thousand servers? Is it
feasible to expect us to recompile and release a new kernel for every
instance where the default value falls short? Surely, there must be
more practical and efficient approaches we can explore together to
ensure optimal performance across all workloads.

When making improvements or modifications, kindly ensure that they are
not solely confined to a test or lab environment. It's vital to also
consider the needs and requirements of our actual users, along with
the diverse workloads they encounter in their daily operations.

--
Regards
Yafang
Yafang Shao <laoar.shao@gmail.com> writes:

> Given the extensive scale of our production environment, with hundreds
> of thousands of servers, it begs the question: how do you propose we
> efficiently manage the various workloads that remain unaffected by the
> sysctl change implemented on just a few thousand servers? Is it
> feasible to expect us to recompile and release a new kernel for every
> instance where the default value falls short?

Have you found that your different systems require different
CONFIG_PCP_BATCH_SCALE_MAX values already?  If not, I think that it's
better for you to keep this patch in your downstream kernel for now.
When you find that it is a common requirement, we can evaluate whether
to make it a sysctl knob.

--
Best Regards,
Huang, Ying
On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Have you found that your different systems require different
> CONFIG_PCP_BATCH_SCALE_MAX values already?

For specific workloads that introduce latency, we set the value to 0.
For other workloads, we keep it unchanged until we determine that the
default value is also suboptimal. What is the issue with this
approach?

> If not, I think that it's better for you to keep this patch in your
> downstream kernel for now.  When you find that it is a common
> requirement, we can evaluate whether to make it a sysctl knob.
Yafang Shao <laoar.shao@gmail.com> writes:

> For specific workloads that introduce latency, we set the value to 0.
> For other workloads, we keep it unchanged until we determine that the
> default value is also suboptimal. What is the issue with this
> approach?

Firstly, this is a system-wide configuration, not a workload-specific
one.  So, other workloads run on the same system will be impacted too.
Will you run only one workload on one system?

Secondly, we need some evidence to introduce a new system ABI: for
example, evidence that different systems need different configurations,
otherwise some workloads will be hurt.  Can you provide some evidence
to support your change?  IMHO, it's not good enough to say "I don't
know why, I just don't want to change existing systems."  If so, it may
be better to wait until you have more evidence.

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Firstly, this is a system-wide configuration, not a workload-specific
> one.  So, other workloads run on the same system will be impacted too.
> Will you run only one workload on one system?

It seems we're living on different planets. You're happily working in
your lab environment, while I'm struggling with real-world production
issues.

For servers:

  Server 1 to 10,000: vm.pcp_batch_scale_max = 0
  Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
  Server 1,000,001 and beyond: happy with all values

Is this hard to understand?

In other words, for applications:

  Application 1 to 10,000: vm.pcp_batch_scale_max = 0
  Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
  Application 1,000,001 and beyond: happy with all values

> Secondly, we need some evidence to introduce a new system ABI: for
> example, evidence that different systems need different configurations,
> otherwise some workloads will be hurt.  Can you provide some evidence
> to support your change?

It seems the community encourages developers to experiment with their
improvements in lab environments using meticulously designed test
cases A, B, C, and as many others as they can imagine, ultimately
obtaining perfect data. However, it discourages developers from
directly addressing real-world workloads. Sigh.
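[Editor's note: the split sketched above needs no per-host kernel rebuilds
once the knob exists; a hypothetical rollout via a sysctl drop-in (the file
name is illustrative, not from the thread):

    # Shipped only to the clusters hosting the latency-sensitive
    # workloads; all other hosts keep the default of 5.
    printf 'vm.pcp_batch_scale_max = 0\n' > /etc/sysctl.d/90-pcp-latency.conf
    sysctl --system    # re-applies all sysctl.d drop-ins without a reboot
]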
Yafang Shao <laoar.shao@gmail.com> writes:

> For servers:
>
>   Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>   Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>   Server 1,000,001 and beyond: happy with all values

Good to know this.  Thanks!

> It seems the community encourages developers to experiment with their
> improvements in lab environments using meticulously designed test
> cases A, B, C, and as many others as they can imagine, ultimately
> obtaining perfect data. However, it discourages developers from
> directly addressing real-world workloads. Sigh.

Can you tell whether, and how, your workloads benefit or are hurt by
different batch numbers in your production environment?  If you cannot,
how do you decide which workload deploys on which system (with which
batch number configuration)?  If you can, can you provide that
information to support your patch?

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Can you tell whether, and how, your workloads benefit or are hurt by
> different batch numbers in your production environment?  If you can,
> can you provide that information to support your patch?

We leverage a meticulous selection of network metrics, particularly
focusing on TcpExt indicators, to keep a close eye on application
latency. This includes metrics such as TcpExt.TCPTimeouts,
TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.

In instances where a problematic container terminates, we've noticed a
sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
second, which serves as a clear indication that other applications are
experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
parameter to 0, we've been able to drastically reduce the maximum
frequency of these timeouts to less than one per second.

At present, we're selectively applying this adjustment to clusters[0]
that exclusively host the identified problematic applications, and
we're closely monitoring their performance to ensure stability. To
date, we've observed no network latency issues as a result of this
change. However, we remain cautious about extending this optimization
to other clusters, as the decision ultimately depends on a variety of
factors.

It's important to note that we're not eager to implement this change
across our entire fleet, as we recognize the potential for unforeseen
consequences. Instead, we're taking a cautious approach by initially
applying it to a limited number of servers. This allows us to assess
its impact and make informed decisions about whether or not to expand
its use in the future.

[0] 'Cluster' refers to a Kubernetes concept, where a single cluster
comprises a specific group of servers designed to work in unison.
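[Editor's note: most of the counters named above are published in the
TcpExt group of /proc/net/netstat (RetransSegs itself is in the Tcp group
of /proc/net/snmp) and can be sampled with nstat from iproute2. A minimal
sketch of this kind of monitoring; the one-second interval and counter
subset are illustrative, not the poster's actual tooling:

    # Print per-second deltas of the timeout/ordering counters used to
    # spot latency spikes while a problematic container exits.
    while sleep 1; do
            nstat TcpExtTCPTimeouts TcpExtDelayedACKLost TcpExtTCPOFOQueue
    done
]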
Yafang Shao <laoar.shao@gmail.com> writes:

[...]
> We rely on a carefully selected set of network metrics, particularly
> TcpExt indicators, to monitor application latency. [...]
>
> When a problematic container terminates, we've observed
> TcpExt.TCPTimeouts spike to over 40 occurrences per second, a clear
> indication that other applications are experiencing latency issues.
> Setting vm.pcp_batch_scale_max to 0 reduces the maximum frequency of
> these timeouts to less than one per second.

Thanks a lot for sharing this. I learned much from it!

> At present, we're selectively applying this adjustment to clusters[0]
> that exclusively host the identified problematic applications, and
> we're closely monitoring their performance to ensure stability. [...]

So, you haven't observed any performance regression yet, right?

If you haven't, I suggest you keep the patch in your downstream kernel
for a while. In the future, if you find that the performance of some
workloads hurts because of the new batch number, you can repost the
patch with the supporting data. If, in the end, the performance of more
and more workloads is good with the new batch number, you may consider
making 0 the default value :-)

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote:

[...]
> So, you haven't observed any performance regression yet, right?

Right.

> If you haven't, I suggest you keep the patch in your downstream kernel
> for a while. In the future, if you find that the performance of some
> workloads hurts because of the new batch number, you can repost the
> patch with the supporting data. If, in the end, the performance of
> more and more workloads is good with the new batch number, you may
> consider making 0 the default value :-)

That is not how the real world works. In the real world:

- No one knows what may happen in the future. Therefore, if possible,
  we should make systems flexible, unless there is a strong
  justification for using a hard-coded value.

- Minimize changes whenever possible. These systems have been working
  fine in the past, even if with lower performance. Why make changes
  just for the sake of improving performance? Does the key metric of
  your performance data truly matter for their workload?
Yafang Shao <laoar.shao@gmail.com> writes:

[...]
> That is not how the real world works. In the real world:
>
> - No one knows what may happen in the future. Therefore, if possible,
>   we should make systems flexible, unless there is a strong
>   justification for using a hard-coded value.
>
> - Minimize changes whenever possible. These systems have been working
>   fine in the past, even if with lower performance. Why make changes
>   just for the sake of improving performance? Does the key metric of
>   your performance data truly matter for their workload?

These are good policies for your organization and business, but they
are not necessarily the policies that the Linux kernel upstream should
take.

The community needs to consider long-term maintenance overhead, so it
adds new ABI (such as a sysctl knob) to the kernel only with the
necessary justification. In general, it prefers a good default value or
an automatic algorithm that works for everyone. The community tries to
avoid (or fix) regressions as much as possible, but this will not stop
the kernel from changing, even if the change is big.

IIUC, because of the different requirements, there are upstream and
downstream kernels.

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:

[...]
> These are good policies for your organization and business, but they
> are not necessarily the policies that the Linux kernel upstream should
> take.

You mean the upstream Linux kernel is designed only for the lab?

> The community needs to consider long-term maintenance overhead, so it
> adds new ABI (such as a sysctl knob) to the kernel only with the
> necessary justification. In general, it prefers a good default value
> or an automatic algorithm that works for everyone. The community tries
> to avoid (or fix) regressions as much as possible, but this will not
> stop the kernel from changing, even if the change is big.

Please explain to me why the kernel config is not ABI, but the sysctl
is.

> IIUC, because of the different requirements, there are upstream and
> downstream kernels.

Downstream developers backport features from the upstream kernel, and
if they find issues in the upstream kernel, they should contribute the
fixes back. That is how the Linux community works, right?
Yafang Shao <laoar.shao@gmail.com> writes:

[...]
> Please explain to me why the kernel config is not ABI, but the sysctl
> is.

The Linux kernel will not break ABI until the last users stop using it.
That usually means tens of years, if not forever. Kernel config options
aren't considered ABI; they are used by developers and distributions,
and they come and go from version to version.

> Downstream developers backport features from the upstream kernel, and
> if they find issues in the upstream kernel, they should contribute the
> fixes back. That is how the Linux community works, right?

Yes, if they are issues for the upstream kernel too.

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote:

[...]
Instead, we're taking a cautious approach by initially
> >> >> > applying it to a limited number of servers. This allows us to assess
> >> >> > its impact and make informed decisions about whether or not to expand
> >> >> > its use in the future.
> >> >>
> >> >> So, you haven't observed any performance hurt yet. Right?
> >> >
> >> > Right.
> >> >
> >> >> If you
> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
> >> >> while. In the future, if you find the performance of some workloads
> >> >> hurts because of the new batch number, you can repost the patch with the
> >> >> supporting data. If in the end, the performance of more and more
> >> >> workloads is good with the new batch number. You may consider to make 0
> >> >> the default value :-)
> >> >
> >> > That is not how the real world works.
> >> >
> >> > In the real world:
> >> >
> >> > - No one knows what may happen in the future.
> >> >   Therefore, if possible, we should make systems flexible, unless
> >> >   there is a strong justification for using a hard-coded value.
> >> >
> >> > - Minimize changes whenever possible.
> >> >   These systems have been working fine in the past, even if with lower
> >> >   performance. Why make changes just for the sake of improving
> >> >   performance? Does the key metric of your performance data truly matter
> >> >   for their workload?
> >>
> >> These are good policy in your organization and business. But, it's not
> >> necessary the policy that Linux kernel upstream should take.
> >
> > You mean the Upstream Linux kernel only designed for the lab ?
> >
> >>
> >> Community needs to consider long-term maintenance overhead, so it adds
> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
> >> In general, it prefer to use a good default value or an automatic
> >> algorithm that works for everyone. Community tries avoiding (or fixing)
> >> regressions as much as possible, but this will not stop kernel from
> >> changing, even if it's big.
> >
> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>
> Linux kernel will not break ABI until the last users stop using it.

However, you haven't given a clear reference for why the sysctl is ABI.

> This usually means tens years if not forever. Kernel config options
> aren't considered ABI, they are used by developers and distributions.
> They come and go from version to version.
>
> >>
> >> IIUC, because of the different requirements, there are upstream and
> >> downstream kernels.
> >
> > The downstream developer backport features from the upstream kernel,
> > and if they find issues in the upstream kernel, they should contribute
> > it back. That is how the Linux Community works, right ?
>
> Yes. If they are issues for upstream kernel too.
>
> --
> Best Regards,
> Huang, Ying
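The technical point the exchange above keeps returning to is that the scale factor bounds how many pages move per zone->lock hold. Assuming a typical per-zone batch of 63 pages (an illustrative value; the kernel computes the real one per zone), a factor of 5 permits up to 63 << 5 = 2016 pages to be freed under a single lock hold, while a factor of 0 caps it at 63. A minimal shell sketch of that arithmetic:

    batch=63    # assumed per-zone batch; the kernel derives the real value
    for scale in 0 1 2 3 4 5 6; do
        printf 'scale_max=%d: up to %d pages per zone->lock hold\n' \
            "$scale" $((batch << scale))
    done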
Yafang Shao <laoar.shao@gmail.com> writes: > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment, >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI >> >> >> >> >> >> >> >> >> too. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock >> >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency. >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. 
You introduced a >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning? >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is >> >> >> >> >> >> >> >> "neutral"? >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > No, thanks. >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better >> >> >> >> >> >> >> > understanding of the concept. >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ? >> >> >> >> >> >> >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long >> >> >> >> >> >> >> latency"). Which introduces the config. >> >> >> >> >> >> > >> >> >> >> >> >> > What specifically are your expectations for how users should utilize >> >> >> >> >> >> > this config in real production workload? >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you >> >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial >> >> >> >> >> >> >> experiments. >> >> >> >> >> >> > >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be >> >> >> >> >> >> > more practical and efficient approaches we can explore together to >> >> >> >> >> >> > ensure optimal performance across all workloads. >> >> >> >> >> >> > >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations. >> >> >> >> >> >> >> >> >> >> >> >> Have you found that your different systems requires different >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already? >> >> >> >> >> > >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the >> >> >> >> >> > default value is also suboptimal. What is the issue with this >> >> >> >> >> > approach? >> >> >> >> >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific. >> >> >> >> >> So, other workloads run on the same system will be impacted too. Will >> >> >> >> >> you run one workload only on one system? >> >> >> >> > >> >> >> >> > It seems we're living on different planets. 
You're happily working in >> >> >> >> > your lab environment, while I'm struggling with real-world production >> >> >> >> > issues. >> >> >> >> > >> >> >> >> > For servers: >> >> >> >> > >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0 >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 >> >> >> >> > Server 1,000,001 and beyond: Happy with all values >> >> >> >> > >> >> >> >> > Is this hard to understand? >> >> >> >> > >> >> >> >> > In other words: >> >> >> >> > >> >> >> >> > For applications: >> >> >> >> > >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0 >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 >> >> >> >> > Application 1,000,001 and beyond: Happy with all values >> >> >> >> >> >> >> >> Good to know this. Thanks! >> >> >> >> >> >> >> >> >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For >> >> >> >> >> example, we need to use different configuration on different systems >> >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences >> >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know >> >> >> >> >> why I just don't want to change existing systems. If so, it may be >> >> >> >> >> better to wait until you have more evidences. >> >> >> >> > >> >> >> >> > It seems the community encourages developers to experiment with their >> >> >> >> > improvements in lab environments using meticulously designed test >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately >> >> >> >> > obtaining perfect data. However, it discourages developers from >> >> >> >> > directly addressing real-world workloads. Sigh. >> >> >> >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different >> >> >> >> batch number and how in your production environment? If you cannot, how >> >> >> >> do you decide which workload deploys on which system (with different >> >> >> >> batch number configuration). If you can, can you provide such >> >> >> >> information to support your patch? >> >> >> > >> >> >> > We leverage a meticulous selection of network metrics, particularly >> >> >> > focusing on TcpExt indicators, to keep a close eye on application >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts, >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more. >> >> >> > >> >> >> > In instances where a problematic container terminates, we've noticed a >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per >> >> >> > second, which serves as a clear indication that other applications are >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max >> >> >> > parameter to 0, we've been able to drastically reduce the maximum >> >> >> > frequency of these timeouts to less than one per second. >> >> >> >> >> >> Thanks a lot for sharing this. I learned much from it! >> >> >> >> >> >> > At present, we're selectively applying this adjustment to clusters >> >> >> > that exclusively host the identified problematic applications, and >> >> >> > we're closely monitoring their performance to ensure stability. To >> >> >> > date, we've observed no network latency issues as a result of this >> >> >> > change. However, we remain cautious about extending this optimization >> >> >> > to other clusters, as the decision ultimately depends on a variety of >> >> >> > factors. 
>> >> >> > >> >> >> > It's important to note that we're not eager to implement this change >> >> >> > across our entire fleet, as we recognize the potential for unforeseen >> >> >> > consequences. Instead, we're taking a cautious approach by initially >> >> >> > applying it to a limited number of servers. This allows us to assess >> >> >> > its impact and make informed decisions about whether or not to expand >> >> >> > its use in the future. >> >> >> >> >> >> So, you haven't observed any performance hurt yet. Right? >> >> > >> >> > Right. >> >> > >> >> >> If you >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a >> >> >> while. In the future, if you find the performance of some workloads >> >> >> hurts because of the new batch number, you can repost the patch with the >> >> >> supporting data. If in the end, the performance of more and more >> >> >> workloads is good with the new batch number. You may consider to make 0 >> >> >> the default value :-) >> >> > >> >> > That is not how the real world works. >> >> > >> >> > In the real world: >> >> > >> >> > - No one knows what may happen in the future. >> >> > Therefore, if possible, we should make systems flexible, unless >> >> > there is a strong justification for using a hard-coded value. >> >> > >> >> > - Minimize changes whenever possible. >> >> > These systems have been working fine in the past, even if with lower >> >> > performance. Why make changes just for the sake of improving >> >> > performance? Does the key metric of your performance data truly matter >> >> > for their workload? >> >> >> >> These are good policy in your organization and business. But, it's not >> >> necessary the policy that Linux kernel upstream should take. >> > >> > You mean the Upstream Linux kernel only designed for the lab ? >> > >> >> >> >> Community needs to consider long-term maintenance overhead, so it adds >> >> new ABI (such as sysfs knob) to kernel with the necessary justification. >> >> In general, it prefer to use a good default value or an automatic >> >> algorithm that works for everyone. Community tries avoiding (or fixing) >> >> regressions as much as possible, but this will not stop kernel from >> >> changing, even if it's big. >> > >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI. >> >> Linux kernel will not break ABI until the last users stop using it. > > However, you haven't given a clear reference why the systl is an ABI. TBH, I don't find a formal document said it explicitly after some searching. Hi, Andrew, Matthew, Can you help me on this? Whether sysctl is considered Linux kernel ABI? Or something similar? >> This usually means tens years if not forever. Kernel config options >> aren't considered ABI, they are used by developers and distributions. >> They come and go from version to version. >> >> >> >> >> IIUC, because of the different requirements, there are upstream and >> >> downstream kernels. >> > >> > The downstream developer backport features from the upsteam kernel, >> > and if they find issues in the upstream kernel, they should contribute >> > it back. That is how the Linux Community works, right ? >> >> Yes. If they are issues for upstream kernel too. -- Best Regards, Huang, Ying
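The monitoring method described in the quoted text can be sketched as a small experiment; the 60-second window and the use of nstat(8) are assumptions, not details taken from the thread:

    nstat -n                      # update nstat's history without printing
    sleep 60                      # window expected to cover a container teardown
    nstat -z | grep TCPTimeouts   # TcpExt timeouts accumulated in the window

Comparing the counts with vm.pcp_batch_scale_max left at its default and set to 0 would reproduce the before/after comparison quoted above.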
On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> > >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> > >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> > >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> > >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> > >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> > >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for > >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment, > >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the > >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max > >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative. > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel > >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI > >> >> >> >> >> >> >> >> >> too. > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions > >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi > >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting > >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away > >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a > >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is > >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort. > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock > >> >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency. 
> >> >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a > >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have > >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly > >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale > >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning? > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is > >> >> >> >> >> >> >> >> "neutral"? > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > No, thanks. > >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive > >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better > >> >> >> >> >> >> >> > understanding of the concept. > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ? > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit > >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long > >> >> >> >> >> >> >> latency"). Which introduces the config. > >> >> >> >> >> >> > > >> >> >> >> >> >> > What specifically are your expectations for how users should utilize > >> >> >> >> >> >> > this config in real production workload? > >> >> >> >> >> >> > > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you > >> >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial > >> >> >> >> >> >> >> experiments. > >> >> >> >> >> >> > > >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds > >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we > >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the > >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it > >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every > >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be > >> >> >> >> >> >> > more practical and efficient approaches we can explore together to > >> >> >> >> >> >> > ensure optimal performance across all workloads. > >> >> >> >> >> >> > > >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are > >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also > >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with > >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations. > >> >> >> >> >> >> > >> >> >> >> >> >> Have you found that your different systems requires different > >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already? > >> >> >> >> >> > > >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. > >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the > >> >> >> >> >> > default value is also suboptimal. What is the issue with this > >> >> >> >> >> > approach? > >> >> >> >> >> > >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific. > >> >> >> >> >> So, other workloads run on the same system will be impacted too. 
Will > >> >> >> >> >> you run one workload only on one system? > >> >> >> >> > > >> >> >> >> > It seems we're living on different planets. You're happily working in > >> >> >> >> > your lab environment, while I'm struggling with real-world production > >> >> >> >> > issues. > >> >> >> >> > > >> >> >> >> > For servers: > >> >> >> >> > > >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> >> > Server 1,000,001 and beyond: Happy with all values > >> >> >> >> > > >> >> >> >> > Is this hard to understand? > >> >> >> >> > > >> >> >> >> > In other words: > >> >> >> >> > > >> >> >> >> > For applications: > >> >> >> >> > > >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> >> > Application 1,000,001 and beyond: Happy with all values > >> >> >> >> > >> >> >> >> Good to know this. Thanks! > >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For > >> >> >> >> >> example, we need to use different configuration on different systems > >> >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences > >> >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know > >> >> >> >> >> why I just don't want to change existing systems. If so, it may be > >> >> >> >> >> better to wait until you have more evidences. > >> >> >> >> > > >> >> >> >> > It seems the community encourages developers to experiment with their > >> >> >> >> > improvements in lab environments using meticulously designed test > >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately > >> >> >> >> > obtaining perfect data. However, it discourages developers from > >> >> >> >> > directly addressing real-world workloads. Sigh. > >> >> >> >> > >> >> >> >> You cannot know whether your workloads benefit or hurt for the different > >> >> >> >> batch number and how in your production environment? If you cannot, how > >> >> >> >> do you decide which workload deploys on which system (with different > >> >> >> >> batch number configuration). If you can, can you provide such > >> >> >> >> information to support your patch? > >> >> >> > > >> >> >> > We leverage a meticulous selection of network metrics, particularly > >> >> >> > focusing on TcpExt indicators, to keep a close eye on application > >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts, > >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, > >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more. > >> >> >> > > >> >> >> > In instances where a problematic container terminates, we've noticed a > >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per > >> >> >> > second, which serves as a clear indication that other applications are > >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max > >> >> >> > parameter to 0, we've been able to drastically reduce the maximum > >> >> >> > frequency of these timeouts to less than one per second. > >> >> >> > >> >> >> Thanks a lot for sharing this. I learned much from it! > >> >> >> > >> >> >> > At present, we're selectively applying this adjustment to clusters > >> >> >> > that exclusively host the identified problematic applications, and > >> >> >> > we're closely monitoring their performance to ensure stability. 
To > >> >> >> > date, we've observed no network latency issues as a result of this > >> >> >> > change. However, we remain cautious about extending this optimization > >> >> >> > to other clusters, as the decision ultimately depends on a variety of > >> >> >> > factors. > >> >> >> > > >> >> >> > It's important to note that we're not eager to implement this change > >> >> >> > across our entire fleet, as we recognize the potential for unforeseen > >> >> >> > consequences. Instead, we're taking a cautious approach by initially > >> >> >> > applying it to a limited number of servers. This allows us to assess > >> >> >> > its impact and make informed decisions about whether or not to expand > >> >> >> > its use in the future. > >> >> >> > >> >> >> So, you haven't observed any performance hurt yet. Right? > >> >> > > >> >> > Right. > >> >> > > >> >> >> If you > >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a > >> >> >> while. In the future, if you find the performance of some workloads > >> >> >> hurts because of the new batch number, you can repost the patch with the > >> >> >> supporting data. If in the end, the performance of more and more > >> >> >> workloads is good with the new batch number. You may consider to make 0 > >> >> >> the default value :-) > >> >> > > >> >> > That is not how the real world works. > >> >> > > >> >> > In the real world: > >> >> > > >> >> > - No one knows what may happen in the future. > >> >> > Therefore, if possible, we should make systems flexible, unless > >> >> > there is a strong justification for using a hard-coded value. > >> >> > > >> >> > - Minimize changes whenever possible. > >> >> > These systems have been working fine in the past, even if with lower > >> >> > performance. Why make changes just for the sake of improving > >> >> > performance? Does the key metric of your performance data truly matter > >> >> > for their workload? > >> >> > >> >> These are good policy in your organization and business. But, it's not > >> >> necessary the policy that Linux kernel upstream should take. > >> > > >> > You mean the Upstream Linux kernel only designed for the lab ? > >> > > >> >> > >> >> Community needs to consider long-term maintenance overhead, so it adds > >> >> new ABI (such as sysfs knob) to kernel with the necessary justification. > >> >> In general, it prefer to use a good default value or an automatic > >> >> algorithm that works for everyone. Community tries avoiding (or fixing) > >> >> regressions as much as possible, but this will not stop kernel from > >> >> changing, even if it's big. > >> > > >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI. > >> > >> Linux kernel will not break ABI until the last users stop using it. > > > > However, you haven't given a clear reference why the systl is an ABI. > > TBH, I don't find a formal document said it explicitly after some > searching. > > Hi, Andrew, Matthew, > > Can you help me on this? Whether sysctl is considered Linux kernel ABI? > Or something similar? In my experience, we consistently utilize an if-statement to configure sysctl settings in our production environments. if [ -f ${sysctl_file} ]; then echo ${new_value} > ${sysctl_file} fi Additionally, you can incorporate this into rc.local to ensure the configuration is applied upon system reboot. Even if you add it to the sysctl.conf without the if-statement, it won't break anything. 
The pcp-related sysctl parameter vm.percpu_pagelist_high_fraction was renamed, with a functional change, from its predecessor vm.percpu_pagelist_fraction in commit 74f44822097c ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite this significant change, there have been no reported issues or complaints, suggesting that the rename and functional update did not break existing userspace.
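Reformatted, the guarded approach described above looks like the sketch below; the sysctl path is the one this patch would create, so its presence depends on the running kernel:

    # Apply the knob only where the kernel provides it, so one script
    # can roll out across a fleet with mixed kernel versions.
    sysctl_file=/proc/sys/vm/pcp_batch_scale_max
    new_value=0
    if [ -f "${sysctl_file}" ]; then
            echo "${new_value}" > "${sysctl_file}"
    fi

Equivalently, sysctl(8) can be told to ignore keys the kernel does not know: sysctl -e -w vm.pcp_batch_scale_max=0.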
Yafang Shao <laoar.shao@gmail.com> writes: > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment, >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI >> >> >> >> >> >> >> >> >> >> too. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort. 
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock >> >> >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency. >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning? >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is >> >> >> >> >> >> >> >> >> "neutral"? >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > No, thanks. >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better >> >> >> >> >> >> >> >> > understanding of the concept. >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ? >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long >> >> >> >> >> >> >> >> latency"). Which introduces the config. >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize >> >> >> >> >> >> >> > this config in real production workload? >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you >> >> >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial >> >> >> >> >> >> >> >> experiments. >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to >> >> >> >> >> >> >> > ensure optimal performance across all workloads. >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations. >> >> >> >> >> >> >> >> >> >> >> >> >> >> Have you found that your different systems requires different >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already? >> >> >> >> >> >> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. 
>> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this >> >> >> >> >> >> > approach? >> >> >> >> >> >> >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific. >> >> >> >> >> >> So, other workloads run on the same system will be impacted too. Will >> >> >> >> >> >> you run one workload only on one system? >> >> >> >> >> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in >> >> >> >> >> > your lab environment, while I'm struggling with real-world production >> >> >> >> >> > issues. >> >> >> >> >> > >> >> >> >> >> > For servers: >> >> >> >> >> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0 >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values >> >> >> >> >> > >> >> >> >> >> > Is this hard to understand? >> >> >> >> >> > >> >> >> >> >> > In other words: >> >> >> >> >> > >> >> >> >> >> > For applications: >> >> >> >> >> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0 >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values >> >> >> >> >> >> >> >> >> >> Good to know this. Thanks! >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For >> >> >> >> >> >> example, we need to use different configuration on different systems >> >> >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences >> >> >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know >> >> >> >> >> >> why I just don't want to change existing systems. If so, it may be >> >> >> >> >> >> better to wait until you have more evidences. >> >> >> >> >> > >> >> >> >> >> > It seems the community encourages developers to experiment with their >> >> >> >> >> > improvements in lab environments using meticulously designed test >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately >> >> >> >> >> > obtaining perfect data. However, it discourages developers from >> >> >> >> >> > directly addressing real-world workloads. Sigh. >> >> >> >> >> >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different >> >> >> >> >> batch number and how in your production environment? If you cannot, how >> >> >> >> >> do you decide which workload deploys on which system (with different >> >> >> >> >> batch number configuration). If you can, can you provide such >> >> >> >> >> information to support your patch? >> >> >> >> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts, >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more. >> >> >> >> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per >> >> >> >> > second, which serves as a clear indication that other applications are >> >> >> >> > experiencing latency issues. 
By fine-tuning the vm.pcp_batch_scale_max >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum >> >> >> >> > frequency of these timeouts to less than one per second. >> >> >> >> >> >> >> >> Thanks a lot for sharing this. I learned much from it! >> >> >> >> >> >> >> >> > At present, we're selectively applying this adjustment to clusters >> >> >> >> > that exclusively host the identified problematic applications, and >> >> >> >> > we're closely monitoring their performance to ensure stability. To >> >> >> >> > date, we've observed no network latency issues as a result of this >> >> >> >> > change. However, we remain cautious about extending this optimization >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of >> >> >> >> > factors. >> >> >> >> > >> >> >> >> > It's important to note that we're not eager to implement this change >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially >> >> >> >> > applying it to a limited number of servers. This allows us to assess >> >> >> >> > its impact and make informed decisions about whether or not to expand >> >> >> >> > its use in the future. >> >> >> >> >> >> >> >> So, you haven't observed any performance hurt yet. Right? >> >> >> > >> >> >> > Right. >> >> >> > >> >> >> >> If you >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a >> >> >> >> while. In the future, if you find the performance of some workloads >> >> >> >> hurts because of the new batch number, you can repost the patch with the >> >> >> >> supporting data. If in the end, the performance of more and more >> >> >> >> workloads is good with the new batch number. You may consider to make 0 >> >> >> >> the default value :-) >> >> >> > >> >> >> > That is not how the real world works. >> >> >> > >> >> >> > In the real world: >> >> >> > >> >> >> > - No one knows what may happen in the future. >> >> >> > Therefore, if possible, we should make systems flexible, unless >> >> >> > there is a strong justification for using a hard-coded value. >> >> >> > >> >> >> > - Minimize changes whenever possible. >> >> >> > These systems have been working fine in the past, even if with lower >> >> >> > performance. Why make changes just for the sake of improving >> >> >> > performance? Does the key metric of your performance data truly matter >> >> >> > for their workload? >> >> >> >> >> >> These are good policy in your organization and business. But, it's not >> >> >> necessary the policy that Linux kernel upstream should take. >> >> > >> >> > You mean the Upstream Linux kernel only designed for the lab ? >> >> > >> >> >> >> >> >> Community needs to consider long-term maintenance overhead, so it adds >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification. >> >> >> In general, it prefer to use a good default value or an automatic >> >> >> algorithm that works for everyone. Community tries avoiding (or fixing) >> >> >> regressions as much as possible, but this will not stop kernel from >> >> >> changing, even if it's big. >> >> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI. >> >> >> >> Linux kernel will not break ABI until the last users stop using it. >> > >> > However, you haven't given a clear reference why the systl is an ABI. >> >> TBH, I don't find a formal document said it explicitly after some >> searching. 
>>
>> Hi, Andrew, Matthew,
>>
>> Can you help me on this? Whether sysctl is considered Linux kernel ABI?
>> Or something similar?
>
> In my experience, we consistently utilize an if-statement to configure
> sysctl settings in our production environments.
>
> if [ -f ${sysctl_file} ]; then
>         echo ${new_value} > ${sysctl_file}
> fi
>
> Additionally, you can incorporate this into rc.local to ensure the
> configuration is applied upon system reboot.
>
> Even if you add it to the sysctl.conf without the if-statement, it
> won't break anything.
>
> The pcp-related sysctl parameter vm.percpu_pagelist_high_fraction was
> renamed, with a functional change, from its predecessor
> vm.percpu_pagelist_fraction in commit 74f44822097c
> ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
> this significant change, there have been no reported issues or
> complaints, suggesting that the rename and functional update did not
> break existing userspace.

Thanks for your information. From that commit, sysctl isn't considered
kernel ABI.

Even if so, IMHO, we shouldn't introduce a user-tunable knob without a
real-world requirement beyond added flexibility.

--
Best Regards,
Huang, Ying
On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> > >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> > >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> > >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for > >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment, > >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the > >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max > >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative. > >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel > >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI > >> >> >> >> >> >> >> >> >> >> too. > >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions > >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi > >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting > >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away > >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a > >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. 
However, implementing these solutions is > >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort. > >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock > >> >> >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency. > >> >> >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a > >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have > >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly > >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale > >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning? > >> >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is > >> >> >> >> >> >> >> >> >> "neutral"? > >> >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> >> > No, thanks. > >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive > >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better > >> >> >> >> >> >> >> >> > understanding of the concept. > >> >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ? > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit > >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long > >> >> >> >> >> >> >> >> latency"). Which introduces the config. > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize > >> >> >> >> >> >> >> > this config in real production workload? > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you > >> >> >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial > >> >> >> >> >> >> >> >> experiments. > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds > >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we > >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the > >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it > >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every > >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be > >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to > >> >> >> >> >> >> >> > ensure optimal performance across all workloads. > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are > >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also > >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with > >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations. 
> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Have you found that your different systems requires different > >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already? > >> >> >> >> >> >> > > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the > >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this > >> >> >> >> >> >> > approach? > >> >> >> >> >> >> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific. > >> >> >> >> >> >> So, other workloads run on the same system will be impacted too. Will > >> >> >> >> >> >> you run one workload only on one system? > >> >> >> >> >> > > >> >> >> >> >> > It seems we're living on different planets. You're happily working in > >> >> >> >> >> > your lab environment, while I'm struggling with real-world production > >> >> >> >> >> > issues. > >> >> >> >> >> > > >> >> >> >> >> > For servers: > >> >> >> >> >> > > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values > >> >> >> >> >> > > >> >> >> >> >> > Is this hard to understand? > >> >> >> >> >> > > >> >> >> >> >> > In other words: > >> >> >> >> >> > > >> >> >> >> >> > For applications: > >> >> >> >> >> > > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values > >> >> >> >> >> > >> >> >> >> >> Good to know this. Thanks! > >> >> >> >> >> > >> >> >> >> >> >> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For > >> >> >> >> >> >> example, we need to use different configuration on different systems > >> >> >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences > >> >> >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know > >> >> >> >> >> >> why I just don't want to change existing systems. If so, it may be > >> >> >> >> >> >> better to wait until you have more evidences. > >> >> >> >> >> > > >> >> >> >> >> > It seems the community encourages developers to experiment with their > >> >> >> >> >> > improvements in lab environments using meticulously designed test > >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately > >> >> >> >> >> > obtaining perfect data. However, it discourages developers from > >> >> >> >> >> > directly addressing real-world workloads. Sigh. > >> >> >> >> >> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different > >> >> >> >> >> batch number and how in your production environment? If you cannot, how > >> >> >> >> >> do you decide which workload deploys on which system (with different > >> >> >> >> >> batch number configuration). If you can, can you provide such > >> >> >> >> >> information to support your patch? > >> >> >> >> > > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly > >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application > >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts, > >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, > >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more. 
> >> >> >> >> > > >> >> >> >> > In instances where a problematic container terminates, we've noticed a > >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per > >> >> >> >> > second, which serves as a clear indication that other applications are > >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max > >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum > >> >> >> >> > frequency of these timeouts to less than one per second. > >> >> >> >> > >> >> >> >> Thanks a lot for sharing this. I learned much from it! > >> >> >> >> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters > >> >> >> >> > that exclusively host the identified problematic applications, and > >> >> >> >> > we're closely monitoring their performance to ensure stability. To > >> >> >> >> > date, we've observed no network latency issues as a result of this > >> >> >> >> > change. However, we remain cautious about extending this optimization > >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of > >> >> >> >> > factors. > >> >> >> >> > > >> >> >> >> > It's important to note that we're not eager to implement this change > >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen > >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially > >> >> >> >> > applying it to a limited number of servers. This allows us to assess > >> >> >> >> > its impact and make informed decisions about whether or not to expand > >> >> >> >> > its use in the future. > >> >> >> >> > >> >> >> >> So, you haven't observed any performance hurt yet. Right? > >> >> >> > > >> >> >> > Right. > >> >> >> > > >> >> >> >> If you > >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a > >> >> >> >> while. In the future, if you find the performance of some workloads > >> >> >> >> hurts because of the new batch number, you can repost the patch with the > >> >> >> >> supporting data. If in the end, the performance of more and more > >> >> >> >> workloads is good with the new batch number. You may consider to make 0 > >> >> >> >> the default value :-) > >> >> >> > > >> >> >> > That is not how the real world works. > >> >> >> > > >> >> >> > In the real world: > >> >> >> > > >> >> >> > - No one knows what may happen in the future. > >> >> >> > Therefore, if possible, we should make systems flexible, unless > >> >> >> > there is a strong justification for using a hard-coded value. > >> >> >> > > >> >> >> > - Minimize changes whenever possible. > >> >> >> > These systems have been working fine in the past, even if with lower > >> >> >> > performance. Why make changes just for the sake of improving > >> >> >> > performance? Does the key metric of your performance data truly matter > >> >> >> > for their workload? > >> >> >> > >> >> >> These are good policy in your organization and business. But, it's not > >> >> >> necessary the policy that Linux kernel upstream should take. > >> >> > > >> >> > You mean the Upstream Linux kernel only designed for the lab ? > >> >> > > >> >> >> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds > >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification. > >> >> >> In general, it prefer to use a good default value or an automatic > >> >> >> algorithm that works for everyone. 
Community tries avoiding (or fixing) > >> >> >> regressions as much as possible, but this will not stop kernel from > >> >> >> changing, even if it's big. > >> >> > > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI. > >> >> > >> >> Linux kernel will not break ABI until the last users stop using it. > >> > > >> > However, you haven't given a clear reference why the systl is an ABI. > >> > >> TBH, I don't find a formal document said it explicitly after some > >> searching. > >> > >> Hi, Andrew, Matthew, > >> > >> Can you help me on this? Whether sysctl is considered Linux kernel ABI? > >> Or something similar? > > > > In my experience, we consistently utilize an if-statement to configure > > sysctl settings in our production environments. > > > > if [ -f ${sysctl_file} ]; then > > echo ${new_value} > ${sysctl_file} > > fi > > > > Additionally, you can incorporate this into rc.local to ensure the > > configuration is applied upon system reboot. > > > > Even if you add it to the sysctl.conf without the if-statement, it > > won't break anything. > > > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction, > > underwent a naming change along with a functional update from its > > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c > > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite > > this significant change, there have been no reported issues or > > complaints, suggesting that the renaming and functional update have > > not negatively impacted the system's functionality. > > Thanks for your information. From the commit, sysctl isn't considered > as the kernel ABI. > > Even if so, IMHO, we shouldn't introduce a user tunable knob without a > real world requirements except more flexibility. Indeed, I do not reside in the physical realm but within a virtualized universe. (Of course, that is your perspective.) -- Regards Yafang
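The monitoring approach described above can be reproduced with a short
userspace sampler. The sketch below is not part of the patch and only
assumes the standard layout of /proc/net/netstat (pairs of lines, one
naming the counters and the next holding their values); it reads the
TcpExt.TCPTimeouts counter that the thread uses as its latency signal.
Run it before and after the suspect workload to see the delta.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the current TcpExt.TCPTimeouts count, or -1 on failure. */
static long read_tcp_timeouts(void)
{
	char header[4096], values[4096];	/* assumes lines fit */
	char *hsave, *vsave;
	long result = -1;
	FILE *f = fopen("/proc/net/netstat", "r");

	if (!f)
		return -1;
	while (fgets(header, sizeof(header), f) &&
	       fgets(values, sizeof(values), f)) {
		if (strncmp(header, "TcpExt:", 7) != 0)
			continue;
		/* Walk the name line and the value line in lockstep. */
		char *name = strtok_r(header, " \n", &hsave);
		char *val  = strtok_r(values, " \n", &vsave);

		while (name && val) {
			if (strcmp(name, "TCPTimeouts") == 0) {
				result = atol(val);
				break;
			}
			name = strtok_r(NULL, " \n", &hsave);
			val  = strtok_r(NULL, " \n", &vsave);
		}
		break;
	}
	fclose(f);
	return result;
}

int main(void)
{
	long t = read_tcp_timeouts();

	if (t < 0) {
		fprintf(stderr, "TcpExt.TCPTimeouts not found\n");
		return 1;
	}
	printf("TcpExt.TCPTimeouts = %ld\n", t);
	return 0;
}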
On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > [...]
> >
> > Thanks for your information. From the commit, sysctl isn't considered
> > as the kernel ABI.
> >
> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a
> > real world requirements except more flexibility.
>
> Indeed, I do not reside in the physical realm but within a virtualized
> universe. (Of course, that is your perspective.)

One final note: you explained very well what "neutral" means. Thank
you for your comments.
Yafang Shao <laoar.shao@gmail.com> writes:

> [...]
>
>> Indeed, I do not reside in the physical realm but within a virtualized
>> universe. (Of course, that is your perspective.)
>
> One final note: you explained very well what "neutral" means. Thank
> you for your comments.

Originally, my opinion to the change is neutral. But, after more
thoughts, I changed my opinion to "we need more evidence to prove the
knob is necessary".

--
Best Regards,
Huang, Ying
On Mon, Jul 15, 2024 at 9:11 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > [...]
> >
> > One final note: you explained very well what "neutral" means. Thank
> > you for your comments.
>
> Originally, my opinion to the change is neutral. But, after more
> thoughts, I changed my opinion to "we need more evidence to prove the
> knob is necessary".
>
One obvious issue with your original patch is that its test coverage
did not include any AMD CPUs, even though AMD CPUs are widely used
nowadays. This indicates that the default value was chosen without
considering them. However, I'm a developer working in a
resource-limited company, so please don't ask me to verify it on more
AMD CPUs the way you did for your Intel CPUs.
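To make the trade-off behind this debate concrete: the knob bounds how
many pages a single PCP refill or drain may move while zone->lock is
held, namely (batch << N) pages. The arithmetic sketch below is for
illustration only; the batch value of 63 is an assumption (the kernel
computes the real per-zone batch automatically at boot), not a number
taken from the patch.

#include <stdio.h>

int main(void)
{
	int batch = 63;	/* assumed per-zone batch, for illustration */

	for (int n = 0; n <= 6; n++)
		printf("vm.pcp_batch_scale_max=%d -> up to %4d pages per zone->lock hold\n",
		       n, batch << n);
	return 0;
}

With this assumed batch, N=5 (the default) allows up to 2016 pages per
lock hold while N=0 caps it at 63, which is why lowering the value
shortens the worst-case hold time at some cost in batching throughput.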