Message ID | adeb1b7d2998bba69d1a57e38300f83e646ee849.1730206468.git.maciej.wieczor-retman@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | selftests/resctrl: SNC kernel support discovery | expand |
Hi Maciej, On 10/29/24 6:00 AM, Maciej Wieczor-Retman wrote: > Resctrl selftest prints a message on test failure that Sub-Numa > Clustering (SNC) could be enabled and points the user to check their BIOS > settings. No actual check is performed before printing that message so > it is not very accurate in pinpointing a problem. > > Figuring out if SNC is enabled is only one part of the problem, the > others being whether the detected SNC mode is reliable and whether the > kernel supports SNC in resctrl. Starting to sound like previous patch ... > > When there is SNC support for kernel's resctrl subsystem and SNC is > enabled then sub node files are created for each node in the resctrlfs. > The sub node files exist in each regular node's L3 monitoring directory. > The reliable path to check for existence of sub node files is > /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00. ... this is about previous patch ... > > To check if SNC detection is reliable one can check the > /sys/devices/system/cpu/offline file. If it's empty, it means all cores > are operational and the ratio should be calculated correctly. If it has > any contents, it means the detected SNC mode can't be trusted and should > be disabled. ... also about previous patch ... > > Add helpers for all operations mentioned above. Not done in this patch. > > Detect SNC mode once and let other tests inherit that information. Not done in this patch. > > Add messages to alert the user when SNC detection could return incorrect > results. Correct old messages to account for kernel support of SNC in > resctrl. Sounds like description of what earlier version of this patch did ... > > Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> > --- > Changelog v5: > - Move all resctrlfs.c code from this patch to 1/2. (Reinette) Please update the changelog to match the changes made to the patch. > - Remove kernel support check and error message from CAT since it can't > be happen. > - Remove snc checks in CAT since snc doesn't affect it here. > - Skip MBM, MBA and CMT tests if snc is unreliable. > > Changelog v4: > - Change messages at the end of tests and at the start of > run_single_test. (Reinette) > - Add messages at the end of CAT since it can also fail due to enabled > SNC + lack of kernel support. > - Remove snc_mode global variable. (Reinette) > - Fix wrong description of snc_kernel_support(). (Reinette) > - Move call to cpus_offline_empty() into snc_nodes_per_l3_cache() so the > whole detection flow is in one place as discussed. (Reinette) > > Changelog v3: > - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette) > - Add printing the discovered SNC mode. (Reinette) > - Change method of kernel support discovery from cache sizes to > existance of sub node files. > - Check if SNC detection is unreliable. > - Move SNC detection to only the first run_single_test() instead on > error at the end of test runs. > - Add global value to remind user at the end of relevant tests if SNC > detection was found to be unreliable. > - Redo the patch message after the changes. > > Changelog v2: > - Move snc_ways() checks from individual tests into > snc_kernel_support(). > - Write better comment for snc_kernel_support(). > > tools/testing/selftests/resctrl/cmt_test.c | 8 ++++++-- > tools/testing/selftests/resctrl/mba_test.c | 8 +++++++- > tools/testing/selftests/resctrl/mbm_test.c | 10 +++++++--- > tools/testing/selftests/resctrl/resctrl.h | 3 +++ > 4 files changed, 23 insertions(+), 6 deletions(-) > > diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c > index 0c045080d808..1470bd64d158 100644 > --- a/tools/testing/selftests/resctrl/cmt_test.c > +++ b/tools/testing/selftests/resctrl/cmt_test.c > @@ -133,6 +133,10 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param > ret = get_cache_size(uparams->cpu, "L3", &cache_total_size); > if (ret) > return ret; > + > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + This (inconsistent) duplication can be moved to run_single_test(). > ksft_print_msg("Cache size :%lu\n", cache_total_size); > > count_of_bits = count_bits(long_mask); > @@ -175,8 +179,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param > goto out; > > ret = check_results(¶m, span, n); > - if (ret && (get_vendor() == ARCH_INTEL)) > - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support()) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > out: > free(span_str); > diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c > index ab8496a4925b..8f4e198da047 100644 > --- a/tools/testing/selftests/resctrl/mba_test.c > +++ b/tools/testing/selftests/resctrl/mba_test.c > @@ -170,15 +170,21 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param > .setup = mba_setup, > .measure = mba_measure, > }; > - int ret; > + int ret, snc_support; > > remove(RESULT_FILE_NAME); > > + snc_support = snc_kernel_support(); > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + > ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); > if (ret) > return ret; > > ret = check_results(); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > return ret; > } > diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c > index 6b5a3b52d861..a68f70589b91 100644 > --- a/tools/testing/selftests/resctrl/mbm_test.c > +++ b/tools/testing/selftests/resctrl/mbm_test.c > @@ -138,17 +138,21 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param > .setup = mbm_setup, > .measure = mbm_measure, > }; > - int ret; > + int ret, snc_support; > > remove(RESULT_FILE_NAME); > > + snc_support = snc_kernel_support(); > + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) > + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); > + > ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); > if (ret) > return ret; > > ret = check_results(DEFAULT_SPAN); > - if (ret && (get_vendor() == ARCH_INTEL)) > - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); > + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) > + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); > > return ret; > } > diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h > index 851b37c9c38a..488bdca01e4f 100644 > --- a/tools/testing/selftests/resctrl/resctrl.h > +++ b/tools/testing/selftests/resctrl/resctrl.h > @@ -121,6 +121,8 @@ struct perf_event_read { > */ > extern volatile int *value_sink; > > +extern int snc_unreliable; > + > extern char llc_occup_path[1024]; > > int snc_nodes_per_l3_cache(void); > @@ -167,6 +169,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr); > int signal_handler_register(const struct resctrl_test *test); > void signal_handler_unregister(void); > unsigned int count_bits(unsigned long n); > +int snc_kernel_support(void); > > void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); > void perf_event_initialize_read_format(struct perf_event_read *pe_read); Reinette
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index 0c045080d808..1470bd64d158 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -133,6 +133,10 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param ret = get_cache_size(uparams->cpu, "L3", &cache_total_size); if (ret) return ret; + + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); + ksft_print_msg("Cache size :%lu\n", cache_total_size); count_of_bits = count_bits(long_mask); @@ -175,8 +179,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param goto out; ret = check_results(¶m, span, n); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); out: free(span_str); diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index ab8496a4925b..8f4e198da047 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -170,15 +170,21 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param .setup = mba_setup, .measure = mba_measure, }; - int ret; + int ret, snc_support; remove(RESULT_FILE_NAME); + snc_support = snc_kernel_support(); + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); + ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); if (ret) return ret; ret = check_results(); + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); return ret; } diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c index 6b5a3b52d861..a68f70589b91 100644 --- a/tools/testing/selftests/resctrl/mbm_test.c +++ b/tools/testing/selftests/resctrl/mbm_test.c @@ -138,17 +138,21 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param .setup = mbm_setup, .measure = mbm_measure, }; - int ret; + int ret, snc_support; remove(RESULT_FILE_NAME); + snc_support = snc_kernel_support(); + if ((get_vendor() == ARCH_INTEL) && snc_unreliable) + ksft_exit_skip("Sub-NUMA Clustering could not be detected properly. Skipping...\n"); + ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); if (ret) return ret; ret = check_results(DEFAULT_SPAN); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && !snc_support) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled on the system.\n"); return ret; } diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 851b37c9c38a..488bdca01e4f 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -121,6 +121,8 @@ struct perf_event_read { */ extern volatile int *value_sink; +extern int snc_unreliable; + extern char llc_occup_path[1024]; int snc_nodes_per_l3_cache(void); @@ -167,6 +169,7 @@ void ctrlc_handler(int signum, siginfo_t *info, void *ptr); int signal_handler_register(const struct resctrl_test *test); void signal_handler_unregister(void); unsigned int count_bits(unsigned long n); +int snc_kernel_support(void); void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); void perf_event_initialize_read_format(struct perf_event_read *pe_read);
Resctrl selftest prints a message on test failure that Sub-Numa Clustering (SNC) could be enabled and points the user to check their BIOS settings. No actual check is performed before printing that message so it is not very accurate in pinpointing a problem. Figuring out if SNC is enabled is only one part of the problem, the others being whether the detected SNC mode is reliable and whether the kernel supports SNC in resctrl. When there is SNC support for kernel's resctrl subsystem and SNC is enabled then sub node files are created for each node in the resctrlfs. The sub node files exist in each regular node's L3 monitoring directory. The reliable path to check for existence of sub node files is /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_00. To check if SNC detection is reliable one can check the /sys/devices/system/cpu/offline file. If it's empty, it means all cores are operational and the ratio should be calculated correctly. If it has any contents, it means the detected SNC mode can't be trusted and should be disabled. Add helpers for all operations mentioned above. Detect SNC mode once and let other tests inherit that information. Add messages to alert the user when SNC detection could return incorrect results. Correct old messages to account for kernel support of SNC in resctrl. Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> --- Changelog v5: - Move all resctrlfs.c code from this patch to 1/2. (Reinette) - Remove kernel support check and error message from CAT since it can't be happen. - Remove snc checks in CAT since snc doesn't affect it here. - Skip MBM, MBA and CMT tests if snc is unreliable. Changelog v4: - Change messages at the end of tests and at the start of run_single_test. (Reinette) - Add messages at the end of CAT since it can also fail due to enabled SNC + lack of kernel support. - Remove snc_mode global variable. (Reinette) - Fix wrong description of snc_kernel_support(). (Reinette) - Move call to cpus_offline_empty() into snc_nodes_per_l3_cache() so the whole detection flow is in one place as discussed. (Reinette) Changelog v3: - Change snc_ways() to snc_nodes_per_l3_cache(). (Reinette) - Add printing the discovered SNC mode. (Reinette) - Change method of kernel support discovery from cache sizes to existance of sub node files. - Check if SNC detection is unreliable. - Move SNC detection to only the first run_single_test() instead on error at the end of test runs. - Add global value to remind user at the end of relevant tests if SNC detection was found to be unreliable. - Redo the patch message after the changes. Changelog v2: - Move snc_ways() checks from individual tests into snc_kernel_support(). - Write better comment for snc_kernel_support(). tools/testing/selftests/resctrl/cmt_test.c | 8 ++++++-- tools/testing/selftests/resctrl/mba_test.c | 8 +++++++- tools/testing/selftests/resctrl/mbm_test.c | 10 +++++++--- tools/testing/selftests/resctrl/resctrl.h | 3 +++ 4 files changed, 23 insertions(+), 6 deletions(-)