Message ID: a0fe2be86f3e868a5f908ac4f2c76e71b4d08d4f.1724970211.git.reinette.chatre@intel.com
State: New
Series: selftests/resctrl: Support diverse platforms with MBM and MBA tests
On Thu, 29 Aug 2024, Reinette Chatre wrote: > The CMT, MBA, and MBM tests rely on the resctrl_val() wrapper to > start and run a benchmark while providing test specific flows > via callbacks to do test specific configuration and measurements. > > At a high level, the resctrl_val() flow is: > a) Start by fork()ing a child process that installs a signal > handler for SIGUSR1 that, on receipt of SIGUSR1, will > start running a benchmark. > b) Assign the child process created in (a) to the resctrl > control and monitoring group that dictates the memory and > cache allocations with which the process can run and will > contain all resctrl monitoring data of that process. > c) Once parent and child are considered "ready" (determined via > a message over a pipe) the parent signals the child (via > SIGUSR1) to start the benchmark, waits one second for the > benchmark to run, and then starts collecting monitoring data > for the tests, potentially also changing allocation > configuration depending on the various test callbacks. > > A problem with the above flow is the "black box" view of the > benchmark that is combined with an arbitrarily chosen > "wait one second" before measurements start. No matter what > the benchmark does, it is given one second to initialize before > measurements start. > > The default benchmark "fill_buf" consists of two parts, > first it prepares a buffer (allocate, initialize, then flush), then it > reads from the buffer (in unpredictable ways) until terminated. > Depending on the system and the size of the buffer, the first "prepare" > part may not be complete by the time the one second delay expires. Test > measurements may thus start before the work needing to be measured runs. > > Split the default benchmark into its "prepare" and "runtime" parts and > simplify the resctrl_val() wrapper while doing so. 
This same split > cannot be done for the user provided benchmark (without a user > interface change), so the current behavior is maintained for user > provided benchmark. > > Assign the test itself to the control and monitoring group and run the > "prepare" part of the benchmark in this context, ensuring it runs with > required cache and memory bandwidth allocations. With the benchmark > preparation complete it is only needed to fork() the "runtime" part > of the benchmark (or entire user provided benchmark). > > Keep the "wait one second" delay before measurements start. For the > default "fill_buf" benchmark this time now covers only the "runtime" > portion that needs to be measured. For the user provided benchmark this > delay maintains current behavior. > > Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> > --- > tools/testing/selftests/resctrl/fill_buf.c | 19 +- > tools/testing/selftests/resctrl/resctrl.h | 2 +- > tools/testing/selftests/resctrl/resctrl_val.c | 225 ++++++------------ > 3 files changed, 70 insertions(+), 176 deletions(-) > > diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c > index ae120f1735c0..12c71bb44cb6 100644 > --- a/tools/testing/selftests/resctrl/fill_buf.c > +++ b/tools/testing/selftests/resctrl/fill_buf.c > @@ -114,7 +114,7 @@ void fill_cache_read(unsigned char *buf, size_t buf_size, bool once) > *value_sink = ret; > } > > -static void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) > +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) > { > while (1) { > fill_one_span_write(buf, buf_size); > @@ -150,20 +150,3 @@ unsigned char *alloc_buffer(size_t buf_size, int memflush) > > return buf; > } > - > -int run_fill_buf(size_t buf_size, int memflush, int op, bool once) > -{ > - unsigned char *buf; > - > - buf = alloc_buffer(buf_size, memflush); > - if (!buf) > - return -1; > - > - if (op == 0) > - fill_cache_read(buf, buf_size, once); > - else > 
- fill_cache_write(buf, buf_size, once); > - free(buf); > - > - return 0; > -} > diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h > index 2dda56084588..0afbc4dd18e4 100644 > --- a/tools/testing/selftests/resctrl/resctrl.h > +++ b/tools/testing/selftests/resctrl/resctrl.h > @@ -142,7 +142,7 @@ int perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, > unsigned char *alloc_buffer(size_t buf_size, int memflush); > void mem_flush(unsigned char *buf, size_t buf_size); > void fill_cache_read(unsigned char *buf, size_t buf_size, bool once); > -int run_fill_buf(size_t buf_size, int memflush, int op, bool once); > +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once); > int initialize_mem_bw_imc(void); > int measure_mem_bw(const struct user_params *uparams, > struct resctrl_val_param *param, pid_t bm_pid, > diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c > index 70e8e31f5d1a..574b72604f95 100644 > --- a/tools/testing/selftests/resctrl/resctrl_val.c > +++ b/tools/testing/selftests/resctrl/resctrl_val.c > @@ -448,7 +448,7 @@ static int get_mem_bw_resctrl(FILE *fp, unsigned long *mbm_total) > return 0; > } > > -static pid_t bm_pid, ppid; > +static pid_t bm_pid; > > void ctrlc_handler(int signum, siginfo_t *info, void *ptr) > { > @@ -506,13 +506,6 @@ void signal_handler_unregister(void) > } > } > > -static void parent_exit(pid_t ppid) > -{ > - kill(ppid, SIGKILL); > - umount_resctrlfs(); > - exit(EXIT_FAILURE); > -} > - > /* > * print_results_bw: the memory bandwidth results are stored in a file > * @filename: file that stores the results > @@ -614,61 +607,6 @@ int measure_mem_bw(const struct user_params *uparams, > return ret; > } > > -/* > - * run_benchmark - Run a specified benchmark or fill_buf (default benchmark) > - * in specified signal. Direct benchmark stdio to /dev/null. 
> - * @signum: signal number > - * @info: signal info > - * @ucontext: user context in signal handling > - */ > -static void run_benchmark(int signum, siginfo_t *info, void *ucontext) > -{ > - int operation, ret, memflush; > - char **benchmark_cmd; > - size_t span; > - bool once; > - FILE *fp; > - > - benchmark_cmd = info->si_ptr; > - > - /* > - * Direct stdio of child to /dev/null, so that only parent writes to > - * stdio (console) > - */ > - fp = freopen("/dev/null", "w", stdout); > - if (!fp) { > - ksft_perror("Unable to direct benchmark status to /dev/null"); > - parent_exit(ppid); > - } > - > - if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { > - /* Execute default fill_buf benchmark */ > - span = strtoul(benchmark_cmd[1], NULL, 10); > - memflush = atoi(benchmark_cmd[2]); > - operation = atoi(benchmark_cmd[3]); > - if (!strcmp(benchmark_cmd[4], "true")) { > - once = true; > - } else if (!strcmp(benchmark_cmd[4], "false")) { > - once = false; > - } else { > - ksft_print_msg("Invalid once parameter\n"); > - parent_exit(ppid); > - } > - > - if (run_fill_buf(span, memflush, operation, once)) > - fprintf(stderr, "Error in running fill buffer\n"); > - } else { > - /* Execute specified benchmark */ > - ret = execvp(benchmark_cmd[0], benchmark_cmd); > - if (ret) > - ksft_perror("execvp"); > - } > - > - fclose(stdout); > - ksft_print_msg("Unable to run specified benchmark\n"); > - parent_exit(ppid); > -} > - > /* > * resctrl_val: execute benchmark and measure memory bandwidth on > * the benchmark > @@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, > const char * const *benchmark_cmd, > struct resctrl_val_param *param) > { > - struct sigaction sigact; > - int ret = 0, pipefd[2]; > - char pipe_message = 0; > - union sigval value; > - int domain_id; > + int domain_id, operation = 0, memflush = 1; > + size_t span = DEFAULT_SPAN; > + unsigned char *buf = NULL; > + cpu_set_t old_affinity; > + bool once = false; > + int ret = 0; > + pid_t ppid; > > if 
(strcmp(param->filename, "") == 0) > sprintf(param->filename, "stdio"); > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, > return ret; > } > > - /* > - * If benchmark wasn't successfully started by child, then child should > - * kill parent, so save parent's pid > - */ > ppid = getpid(); > > - if (pipe(pipefd)) { > - ksft_perror("Unable to create pipe"); > + /* Taskset test to specified CPU. */ > + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement. > + if (ret) > + return ret; > > - return -1; > + /* Write test to specified control & monitoring group in resctrl FS. */ > + ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp); Previously, this was done for bm_pid but now it's done for the parent. I'm not sure how inheritance goes with resctrl on fork(), will the forked PID get added to the list of PIDs or not? You probably know the answer :-). Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process. > + if (ret) > + goto reset_affinity; > + > + if (param->init) { > + ret = param->init(param, domain_id); > + if (ret) > + goto reset_affinity; > } > > /* > - * Fork to start benchmark, save child's pid so that it can be killed > - * when needed > + * If not running user provided benchmark, run the default > + * "fill_buf". First phase of "fill_buf" is to prepare the > + * buffer that the benchmark will operate on. No measurements > + * are needed during this phase and prepared memory will be > + * passed to next part of benchmark via copy-on-write. 
TBD > + * how this impacts "write" benchmark, but no test currently > + * uses this. > */ > - fflush(stdout); Please don't remove fflush() in front of fork() as it leads to duplicating messages.
Hi Ilpo, Thank you for taking a look. On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > On Thu, 29 Aug 2024, Reinette Chatre wrote: > ... >> @@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, >> const char * const *benchmark_cmd, >> struct resctrl_val_param *param) >> { >> - struct sigaction sigact; >> - int ret = 0, pipefd[2]; >> - char pipe_message = 0; >> - union sigval value; >> - int domain_id; >> + int domain_id, operation = 0, memflush = 1; >> + size_t span = DEFAULT_SPAN; >> + unsigned char *buf = NULL; >> + cpu_set_t old_affinity; >> + bool once = false; >> + int ret = 0; >> + pid_t ppid; >> >> if (strcmp(param->filename, "") == 0) >> sprintf(param->filename, "stdio"); >> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, >> return ret; >> } >> >> - /* >> - * If benchmark wasn't successfully started by child, then child should >> - * kill parent, so save parent's pid >> - */ >> ppid = getpid(); >> >> - if (pipe(pipefd)) { >> - ksft_perror("Unable to create pipe"); >> + /* Taskset test to specified CPU. */ >> + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); > > Previously only CPU affinity for bm_pid was set but now it's set before > fork(). Quickly checking the Internet, it seems that CPU affinity gets > inherited on fork() so now both processes will have the same affinity > which might make the other process to interfere with the measurement. Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts. This does tie in with the association with the resctrl group and I will elaborate more below ... > >> + if (ret) >> + return ret; >> >> - return -1; >> + /* Write test to specified control & monitoring group in resctrl FS. */ >> + ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp); > > Previously, this was done for bm_pid but now it's done for the parent. 
I'm > not sure how inheritance goes with resctrl on fork(), will the forked PID > get added to the list of PIDs or not? You probably know the answer :-). Yes. A process fork()ed will land in the same resctrl group as its parent. > Neither behavior, however, seems to result in the intended behavior as we > either get interfering processes (if inherited) or no desired resctrl > setup for the benchmark process. There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise. Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be. >> + if (ret) >> + goto reset_affinity; >> + >> + if (param->init) { >> + ret = param->init(param, domain_id); >> + if (ret) >> + goto reset_affinity; >> } >> >> /* >> - * Fork to start benchmark, save child's pid so that it can be killed >> - * when needed >> + * If not running user provided benchmark, run the default >> + * "fill_buf". First phase of "fill_buf" is to prepare the >> + * buffer that the benchmark will operate on. No measurements >> + * are needed during this phase and prepared memory will be >> + * passed to next part of benchmark via copy-on-write. TBD >> + * how this impacts "write" benchmark, but no test currently >> + * uses this. >> */ >> - fflush(stdout); > > Please don't remove fflush() in front of fork() as it leads to duplicating > messages. > Indeed. 
Shaopeng just fixed this for us. Thank you! Reinette
On Fri, 30 Aug 2024, Reinette Chatre wrote: > > Thank you for taking a look. > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > > On Thu, 29 Aug 2024, Reinette Chatre wrote: > > > > ... > > > > @@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, > > > const char * const *benchmark_cmd, > > > struct resctrl_val_param *param) > > > { > > > - struct sigaction sigact; > > > - int ret = 0, pipefd[2]; > > > - char pipe_message = 0; > > > - union sigval value; > > > - int domain_id; > > > + int domain_id, operation = 0, memflush = 1; > > > + size_t span = DEFAULT_SPAN; > > > + unsigned char *buf = NULL; > > > + cpu_set_t old_affinity; > > > + bool once = false; > > > + int ret = 0; > > > + pid_t ppid; > > > if (strcmp(param->filename, "") == 0) > > > sprintf(param->filename, "stdio"); > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, > > > return ret; > > > } > > > - /* > > > - * If benchmark wasn't successfully started by child, then child > > > should > > > - * kill parent, so save parent's pid > > > - */ > > > ppid = getpid(); > > > - if (pipe(pipefd)) { > > > - ksft_perror("Unable to create pipe"); > > > + /* Taskset test to specified CPU. */ > > > + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); > > > > Previously only CPU affinity for bm_pid was set but now it's set before > > fork(). Quickly checking the Internet, it seems that CPU affinity gets > > inherited on fork() so now both processes will have the same affinity > > which might make the other process to interfere with the measurement. > > Setting the affinity is intended to ensure that the buffer preparation > occurs in the same topology as where the runtime portion will run. > This preparation is done before the work to be measured starts. > > This does tie in with the association with the resctrl group and I > will elaborate more below ... 
Okay, that's useful to retain, but thinking this further, now we're also going to do a non-trivial amount of work in between the setup and the test by forking. I guess that doesn't matter for the memflush = true case but might be meaningful for the memflush = false case that seems to be there to allow keeping caches hot (I personally haven't thought about how to use a "caches hot" test but we do have that capability by the fact that the memflush parameter exists). > > > + if (ret) > > > + return ret; > > > - return -1; > > > + /* Write test to specified control & monitoring group in resctrl FS. > > > */ > > > + ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp); > > > > Previously, this was done for bm_pid but now it's done for the parent. I'm > > not sure how inheritance goes with resctrl on fork(), will the forked PID > > get added to the list of PIDs or not? You probably know the answer :-). > > Yes. A process fork()ed will land in the same resctrl group as its parent. > > > Neither behavior, however, seems to result in the intended behavior as we > > either get interfering processes (if inherited) or no desired resctrl > > setup for the benchmark process. > > There are two processes to consider in the resource group, the parent (that > sets up the buffer and does the measurements) and the child (that runs the > workload to be measured). Thanks to your commit da50de0a92f3 > ("selftests/resctrl: > Calculate resctrl FS derived mem bw over sleep(1) only") the parent > will be sleeping while the child runs its workload and there is no > other interference I am aware of. The only additional measurements > that I can see would be the work needed to actually start and stop the > measurements and from what I can tell this falls into the noise. > > Please do keep in mind that the performance counters used, iMC, cannot > actually > be bound to a single CPU since it is a per-socket PMU. The measurements have > thus never been as fine grained as the code pretends it to be.
I was wondering whether I should note that the amount of work is small. Maybe it's fine to leave that noise there and I'm just overly cautious :-). When I did networking research in a past life, I wanted to eliminate as many noise sources as possible, so I guess it comes from that background.
Hi Ilpo, On 9/4/24 4:57 AM, Ilpo Järvinen wrote: > On Fri, 30 Aug 2024, Reinette Chatre wrote: >> >> Thank you for taking a look. >> >> On 8/30/24 3:56 AM, Ilpo Järvinen wrote: >>> On Thu, 29 Aug 2024, Reinette Chatre wrote: >>> >> >> ... >> >>>> @@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, >>>> const char * const *benchmark_cmd, >>>> struct resctrl_val_param *param) >>>> { >>>> - struct sigaction sigact; >>>> - int ret = 0, pipefd[2]; >>>> - char pipe_message = 0; >>>> - union sigval value; >>>> - int domain_id; >>>> + int domain_id, operation = 0, memflush = 1; >>>> + size_t span = DEFAULT_SPAN; >>>> + unsigned char *buf = NULL; >>>> + cpu_set_t old_affinity; >>>> + bool once = false; >>>> + int ret = 0; >>>> + pid_t ppid; >>>> if (strcmp(param->filename, "") == 0) >>>> sprintf(param->filename, "stdio"); >>>> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, >>>> return ret; >>>> } >>>> - /* >>>> - * If benchmark wasn't successfully started by child, then child >>>> should >>>> - * kill parent, so save parent's pid >>>> - */ >>>> ppid = getpid(); >>>> - if (pipe(pipefd)) { >>>> - ksft_perror("Unable to create pipe"); >>>> + /* Taskset test to specified CPU. */ >>>> + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); >>> >>> Previously only CPU affinity for bm_pid was set but now it's set before >>> fork(). Quickly checking the Internet, it seems that CPU affinity gets >>> inherited on fork() so now both processes will have the same affinity >>> which might make the other process to interfere with the measurement. >> >> Setting the affinity is intended to ensure that the buffer preparation >> occurs in the same topology as where the runtime portion will run. >> This preparation is done before the work to be measured starts. >> >> This does tie in with the association with the resctrl group and I >> will elaborate more below ... 
> > Okay, that's useful to retain but thinking this further, now we're also > going do non-trivial amount of work in between the setup and the test by Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell. > forking. I guess that doesn't matter for memflush = true case but might be > meaningful for the memflush = false case that seems to be there to allow > keeping caches hot (I personally haven't thought how to use "caches hot" > test but we do have that capability by the fact that memflush paremeter > exists). I believe that memflush = true will always be needed/used by the tests relying on memory bandwidth measurement since that reduces cache hits during measurement phase and avoids the additional guessing on how long the workload should be run before reliable/consistent measurements can start. Thinking about the memflush = false case I now think that we should use that for the CMT test. The buffer is allocated and initialized while the task is configured with appropriate allocation limits so there should not be a reason to flush the buffer from the cache. In fact, flushing the cache introduces the requirement to guess the workload's "settle" time (time to allocate the buffer into the cache again) before its occupancy can be measured. As a quick test I set memflush = false on one system and it brought down the average diff between the cache portion size and the occupancy counts. I'll try it out on a few more systems to confirm. >>>> + if (ret) >>>> + return ret; >>>> - return -1; >>>> + /* Write test to specified control & monitoring group in resctrl FS. >>>> */ >>>> + ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp); >>> >>> Previously, this was done for bm_pid but now it's done for the parent. 
I'm >>> not sure how inheritance goes with resctrl on fork(), will the forked PID >>> get added to the list of PIDs or not? You probably know the answer :-). >> >> Yes. A process fork()ed will land in the same resctrl group as its parent. >> >>> Neither behavior, however, seems to result in the intended behavior as we >>> either get interfering processes (if inherited) or no desired resctrl >>> setup for the benchmark process. >> >> There are two processes to consider in the resource group, the parent (that >> sets up the buffer and does the measurements) and the child (that runs the >> workload to be measured). Thanks to your commit da50de0a92f3 >> ("selftests/resctrl: >> Calculate resctrl FS derived mem bw over sleep(1) only") the parent >> will be sleeping while the child runs its workload and there is no >> other interference I am aware of. The only additional measurements >> that I can see would be the work needed to actually start and stop the >> measurements and from what I can tell this falls into the noise. >> >> Please do keep in mind that the performance counters used, iMC, cannot >> actually >> be bound to a single CPU since it is a per-socket PMU. The measurements have >> thus never been as fine grained as the code pretends it to be. > > I was thinking if I should note the amount of work is small. Maybe it's > fine to leave that noise there and I'm just overly cautious :-), when I > used to do networking research in the past life, I wanted to eliminate as > much noise sources so I guess it comes from that background. The goal of these tests are to verify *resctrl*, these are not intended to be hardware validation tests. I think it would be better for resctrl if more time is spent on functional tests of resctrl than these performance tests. Reinette
On Wed, 4 Sep 2024, Reinette Chatre wrote: > On 9/4/24 4:57 AM, Ilpo Järvinen wrote: > > On Fri, 30 Aug 2024, Reinette Chatre wrote: > > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > > > > On Thu, 29 Aug 2024, Reinette Chatre wrote: > > > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test > > > > > *test, > > > > > return ret; > > > > > } > > > > > - /* > > > > > - * If benchmark wasn't successfully started by child, then > > > > > child > > > > > should > > > > > - * kill parent, so save parent's pid > > > > > - */ > > > > > ppid = getpid(); > > > > > - if (pipe(pipefd)) { > > > > > - ksft_perror("Unable to create pipe"); > > > > > + /* Taskset test to specified CPU. */ > > > > > + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); > > > > > > > > Previously only CPU affinity for bm_pid was set but now it's set before > > > > fork(). Quickly checking the Internet, it seems that CPU affinity gets > > > > inherited on fork() so now both processes will have the same affinity > > > > which might make the other process to interfere with the measurement. > > > > > > Setting the affinity is intended to ensure that the buffer preparation > > > occurs in the same topology as where the runtime portion will run. > > > This preparation is done before the work to be measured starts. > > > > > > This does tie in with the association with the resctrl group and I > > > will elaborate more below ... > > > > Okay, that's useful to retain but thinking this further, now we're also > > going do non-trivial amount of work in between the setup and the test by > > Could you please elaborate how the amount of work during setup can be an > issue? I have been focused on the measurements that are done afterwards > that do have clear boundaries from what I can tell. Well, you used it as a justification: "Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run." 
So I assumed you had some expectations about "preparations" done outside of those "clear boundaries", but now you seem to take the entirely opposite stance? fork() is quite a heavy operation as it has to copy various things, including the address space, which you just made to contain a huge mem blob. :-) BTW, perhaps we could use some lighter-weight fork variant in the resctrl selftests; the processes don't really need to be that separated to justify using a full fork() (and custom benchmarks will do execvp()). > > forking. I guess that doesn't matter for memflush = true case but might be > meaningful for the memflush = false case that seems to be there to allow > keeping caches hot (I personally haven't thought how to use "caches hot" > test but we do have that capability by the fact that memflush paremeter > exists). > > I believe that memflush = true will always be needed/used by the tests > relying on memory bandwidth measurement since that reduces cache hits during > measurement phase and avoids the additional guessing on how long the workload > should be run before reliable/consistent measurements can start. > > Thinking about the memflush = false case I now think that we should use that > for the CMT test. The buffer is allocated and initialized while the task > is configured with appropriate allocation limits so there should not be a > reason to flush the buffer from the cache. In fact, flushing the cache > introduces > the requirement to guess the workload's "settle" time (time to allocate the > buffer > into the cache again) before its occupancy can be measured. As a quick test I > set memflush = false on one system and it brought down the average diff > between > the cache portion size and the occupancy counts. I'll try it out on a few more > systems to confirm. Oh great! I've not really figured out the logic used in the old CMT test because there was the rewrite for it in the series I started to upstream some of these improvements from.
But I was unable to rebase successfully that rewrite either because somebody had used a non-publically available tree as a basis for it so I never did even have time to understand what even the rewritten test did thanks to the very complex diff. > > > > Neither behavior, however, seems to result in the intended behavior as > > > > we > > > > either get interfering processes (if inherited) or no desired resctrl > > > > setup for the benchmark process. > > > > > > There are two processes to consider in the resource group, the parent > > > (that > > > sets up the buffer and does the measurements) and the child (that runs the > > > workload to be measured). Thanks to your commit da50de0a92f3 > > > ("selftests/resctrl: > > > Calculate resctrl FS derived mem bw over sleep(1) only") the parent > > > will be sleeping while the child runs its workload and there is no > > > other interference I am aware of. The only additional measurements > > > that I can see would be the work needed to actually start and stop the > > > measurements and from what I can tell this falls into the noise. > > > > > > Please do keep in mind that the performance counters used, iMC, cannot > > > actually > > > be bound to a single CPU since it is a per-socket PMU. The measurements > > > have > > > thus never been as fine grained as the code pretends it to be. > > > > I was thinking if I should note the amount of work is small. Maybe it's > > fine to leave that noise there and I'm just overly cautious :-), when I > > used to do networking research in the past life, I wanted to eliminate as > > much noise sources so I guess it comes from that background. > > The goal of these tests are to verify *resctrl*, these are not intended to be > hardware validation tests. I think it would be better for resctrl if more time > is spent on functional tests of resctrl than these performance tests. This sounds so easy... 
(no offence) :-) If only there weren't the black boxes and we had good, fine-grained ways to instrument them, it would be so much easier to come up with non-statistical means to do functional tests.
Hi Ilpo, On 9/5/24 5:10 AM, Ilpo Järvinen wrote: > On Wed, 4 Sep 2024, Reinette Chatre wrote: >> On 9/4/24 4:57 AM, Ilpo Järvinen wrote: >>> On Fri, 30 Aug 2024, Reinette Chatre wrote: >>>> On 8/30/24 3:56 AM, Ilpo Järvinen wrote: >>>>> On Thu, 29 Aug 2024, Reinette Chatre wrote: > >>>>>> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test >>>>>> *test, >>>>>> return ret; >>>>>> } >>>>>> - /* >>>>>> - * If benchmark wasn't successfully started by child, then >>>>>> child >>>>>> should >>>>>> - * kill parent, so save parent's pid >>>>>> - */ >>>>>> ppid = getpid(); >>>>>> - if (pipe(pipefd)) { >>>>>> - ksft_perror("Unable to create pipe"); >>>>>> + /* Taskset test to specified CPU. */ >>>>>> + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); >>>>> >>>>> Previously only CPU affinity for bm_pid was set but now it's set before >>>>> fork(). Quickly checking the Internet, it seems that CPU affinity gets >>>>> inherited on fork() so now both processes will have the same affinity >>>>> which might make the other process to interfere with the measurement. >>>> >>>> Setting the affinity is intended to ensure that the buffer preparation >>>> occurs in the same topology as where the runtime portion will run. >>>> This preparation is done before the work to be measured starts. >>>> >>>> This does tie in with the association with the resctrl group and I >>>> will elaborate more below ... >>> >>> Okay, that's useful to retain but thinking this further, now we're also >>> going do non-trivial amount of work in between the setup and the test by >> >> Could you please elaborate how the amount of work during setup can be an >> issue? I have been focused on the measurements that are done afterwards >> that do have clear boundaries from what I can tell. > > Well, you used it as a justification: "Setting the affinity is intended > to ensure that the buffer preparation occurs in the same topology as where > the runtime portion will run." 
So I assumed you had some expectations about > "preparations" done outside of those "clear boundaries" but now you seem > to take entirely opposite stance? I do not follow you here. With the "clear boundaries" I meant the measurements and associated variables that have a clear start/init and stop/read while the other task sleeps. Yes, preparations are done outside of this since that should not be measured. You stated "now we're also going do non-trivial amount of work in between the setup and the test" ... could you clarify what the problem is with this? Before this work the "non-trivial amount of work" (for "fill_buf") was done as part of the test with (wrong) guesses about how long it takes. This work aims to improve this. > > fork() quite heavy operation as it has to copy various things including > the address space which you just made to contain a huge mem blob. :-) As highlighted in a comment found in the patch, the kernel uses copy-on-write and all tests only read from the buffer after fork(). > > BTW, perhaps we could use some lighter weighted fork variant in the > resctrl selftests, the processes don't really need to be that separated > to justify using full fork() (and custom benchmarks will do execvp()). Are you thinking about pthreads? It is not clear to me that this is necessary. It is also not clear to me what problem you are describing that needs this solution. When I understand that better it will be easier to discuss solutions. > >>> forking. I guess that doesn't matter for memflush = true case but might be >>> meaningful for the memflush = false case that seems to be there to allow >>> keeping caches hot (I personally haven't thought how to use "caches hot" >>> test but we do have that capability by the fact that memflush paremeter >>> exists). 
>> >> I believe that memflush = true will always be needed/used by the tests >> relying on memory bandwidth measurement since that reduces cache hits during >> measurement phase and avoids the additional guessing on how long the workload >> should be run before reliable/consistent measurements can start. >> >> Thinking about the memflush = false case I now think that we should use that >> for the CMT test. The buffer is allocated and initialized while the task >> is configured with appropriate allocation limits so there should not be a >> reason to flush the buffer from the cache. In fact, flushing the cache >> introduces >> the requirement to guess the workload's "settle" time (time to allocate the >> buffer >> into the cache again) before its occupancy can be measured. As a quick test I >> set memflush = false on one system and it brought down the average diff >> between >> the cache portion size and the occupancy counts. I'll try it out on a few more >> systems to confirm. > > Oh great! > > I've not really figured out the logic used in the old CMT test because > there was the rewrite for it in the series I started to upstream some of > these improvements from. But I was unable to rebase successfully that > rewrite either because somebody had used a non-publically available tree > as a basis for it so I never did even have time to understand what even > the rewritten test did thanks to the very complex diff. > >>>>> Neither behavior, however, seems to result in the intended behavior as >>>>> we >>>>> either get interfering processes (if inherited) or no desired resctrl >>>>> setup for the benchmark process. >>>> >>>> There are two processes to consider in the resource group, the parent >>>> (that >>>> sets up the buffer and does the measurements) and the child (that runs the >>>> workload to be measured). 
Thanks to your commit da50de0a92f3 >>>> ("selftests/resctrl: >>>> Calculate resctrl FS derived mem bw over sleep(1) only") the parent >>>> will be sleeping while the child runs its workload and there is no >>>> other interference I am aware of. The only additional measurements >>>> that I can see would be the work needed to actually start and stop the >>>> measurements and from what I can tell this falls into the noise. >>>> >>>> Please do keep in mind that the performance counters used, iMC, cannot >>>> actually >>>> be bound to a single CPU since it is a per-socket PMU. The measurements >>>> have >>>> thus never been as fine grained as the code pretends it to be. >>> >>> I was thinking if I should note the amount of work is small. Maybe it's >>> fine to leave that noise there and I'm just overly cautious :-), when I >>> used to do networking research in the past life, I wanted to eliminate as >>> much noise sources so I guess it comes from that background. >> >> The goal of these tests are to verify *resctrl*, these are not intended to be >> hardware validation tests. I think it would be better for resctrl if more time >> is spent on functional tests of resctrl than these performance tests. > > This sounds so easy... (no offence) :-) If only there wouldn't be the > black boxes and we'd have good and fine-grained ways to instrument it, > it would be so much easier to realize non-statistical means to do > functional tests. > To me functional tests of resctrl indeed sounds easier. Tests can interact with the resctrl interface to ensure it works as expected ... test the boundaries of user input to the various files, test task movement between groups, test moving of monitor groups, test assigning CPUs to resource groups, ensure the "mode" of a resource group can be changed and behaves as expected (for example, shared vs exclusive), ensure changes to one file are reflected in others, like changing schemata is reflected in "size" and "bit_usage", etc. etc. 
These are all tests of *resctrl* that support development and can be used to verify all is well as major changes (that we are seeing more and more of) are made to the kernel subsystem. None of this is "black box" and all of it is deterministic with obvious pass/fail. This can be made even more reliable by not just checking resctrl files but also seeing whether changes made via resctrl are reflected in MSRs as expected. Reinette
On Thu, 5 Sep 2024, Reinette Chatre wrote: > On 9/5/24 5:10 AM, Ilpo Järvinen wrote: > > On Wed, 4 Sep 2024, Reinette Chatre wrote: > > > On 9/4/24 4:57 AM, Ilpo Järvinen wrote: > > > > On Fri, 30 Aug 2024, Reinette Chatre wrote: > > > > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > > > > > > On Thu, 29 Aug 2024, Reinette Chatre wrote: > > > > > > > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test > > > > > > > *test, > > > > > > > return ret; > > > > > > > } > > > > > > > - /* > > > > > > > - * If benchmark wasn't successfully started by child, then > > > > > > > child > > > > > > > should > > > > > > > - * kill parent, so save parent's pid > > > > > > > - */ > > > > > > > ppid = getpid(); > > > > > > > - if (pipe(pipefd)) { > > > > > > > - ksft_perror("Unable to create pipe"); > > > > > > > + /* Taskset test to specified CPU. */ > > > > > > > + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); > > > > > > > > > > > > Previously only CPU affinity for bm_pid was set but now it's set > > > > > > before > > > > > > fork(). Quickly checking the Internet, it seems that CPU affinity > > > > > > gets > > > > > > inherited on fork() so now both processes will have the same > > > > > > affinity > > > > > > which might make the other process to interfere with the > > > > > > measurement. > > > > > > > > > > Setting the affinity is intended to ensure that the buffer preparation > > > > > occurs in the same topology as where the runtime portion will run. > > > > > This preparation is done before the work to be measured starts. > > > > > > > > > > This does tie in with the association with the resctrl group and I > > > > > will elaborate more below ... > > > > > > > > Okay, that's useful to retain but thinking this further, now we're also > > > > going do non-trivial amount of work in between the setup and the test by > > > > > > Could you please elaborate how the amount of work during setup can be an > > > issue? 
I have been focused on the measurements that are done afterwards > > > that do have clear boundaries from what I can tell. > > Well, you used it as a justification: "Setting the affinity is intended > > to ensure that the buffer preparation occurs in the same topology as where > > the runtime portion will run." So I assumed you had some expectations about > > "preparations" done outside of those "clear boundaries" but now you seem > > to take entirely opposite stance? > > I do not follow you here. With the "clear boundaries" I meant the > measurements and associated variables that have a clear start/init and > stop/read while the other task sleeps. Yes, preparations are done outside > of this since that should not be measured. Of course the preparations are not measured (at least not after this patch :-)). But that's not what I meant. You said the preparations are to be done using the same topology as the test but if it's outside of the measurement period, the topology during preparations only matters if you make some hidden assumption that some state carries from preparations to the actual test. If no state carry-over is assumed, the topology during preparations is not important. So which way is it? Is the topology during preparations important or not? You used the topology argument to justify why both parent and child are now in the same resctrl group unlike before when only the actual test was. > You stated "now we're also going > do non-trivial amount of work in between the setup and the test" ... > could you clarify what the problem is with this? Before this work > the "non-trivial amount of work" (for "fill_buf") was done as part of the > test with (wrong) guesses about how long it takes. This work aims to improve > this. I understand what you're trying to do with this change. However, I was trying to say that previously the preparations occurred right before the test, which is no longer the case because there's a fork() in between. 
Does that extra work impact the state carry-over from preparations to the test? > > fork() quite heavy operation as it has to copy various things including > > the address space which you just made to contain a huge mem blob. :-) > > As highlighted in a comment found in the patch, the kernel uses copy-on-write > and all tests only read from the buffer after fork(). Write test is possible using -b fill_buf ... as mentioned in comments to the other patch. > > BTW, perhaps we could use some lighter weighted fork variant in the > > resctrl selftests, the processes don't really need to be that separated > > to justify using full fork() (and custom benchmarks will do execvp()). > > Are you thinking about pthreads? It is not clear to me that this is > necessary. It is also not clear to me what problem you are describing that > needs this solution. When I understand that better it will be easier to > discuss solutions. I was trying to say I see advantage of retaining the address space which is not possible with fork(), so perhaps using clone() with CLONE_VM would be useful and maybe some other flags too. I think this would solve the write test case. > > > > I was thinking if I should note the amount of work is small. Maybe it's > > > > fine to leave that noise there and I'm just overly cautious :-), when I > > > > used to do networking research in the past life, I wanted to eliminate > > > > as > > > > much noise sources so I guess it comes from that background. > > > > > > The goal of these tests are to verify *resctrl*, these are not intended to > > > be > > > hardware validation tests. I think it would be better for resctrl if more > > > time > > > is spent on functional tests of resctrl than these performance tests. > > > > This sounds so easy... (no offence) :-) If only there wouldn't be the > > black boxes and we'd have good and fine-grained ways to instrument it, > > it would be so much easier to realize non-statistical means to do > > functional tests. 
> > > > To me functional tests of resctrl indeed sounds easier. Tests can interact > with the > resctrl interface to ensure it works as expected ... test the boundaries > of user input to the various files, test task movement between groups, test > moving of > monitor groups, test assigning CPUs to resource groups, ensure the "mode" of a > resource group can be changed and behaves as expected (for example, shared vs > exclusive), > ensure changes to one file are reflected in others, like changing schemata is > reflected > in "size" and "bit_usage", etc. etc. These are all tests of *resctrl* that > supports > development and can be used to verify all is well as major changes (that we > are seeing > more and more of) are made to the kernel subsystem. None of this is "black > box" and > is all deterministic with obvious pass/fail. This can be made even more > reliable with > not just checking of resctrl files but see if changes via resctrl is reflected > in MSRs > as expected. Okay, I get it now: you meant testing the kernel-userspace interfaces, using MSRs as a further validation step to also test the kernel-HW interface. I'll probably take a look at this when I have some spare time and see what I can come up with.
Hi Ilpo, On 9/6/24 3:00 AM, Ilpo Järvinen wrote: > On Thu, 5 Sep 2024, Reinette Chatre wrote: >> On 9/5/24 5:10 AM, Ilpo Järvinen wrote: >>> On Wed, 4 Sep 2024, Reinette Chatre wrote: >>>> On 9/4/24 4:57 AM, Ilpo Järvinen wrote: >>>>> On Fri, 30 Aug 2024, Reinette Chatre wrote: >>>>>> On 8/30/24 3:56 AM, Ilpo Järvinen wrote: >>>>>>> On Thu, 29 Aug 2024, Reinette Chatre wrote: >>> >>>>>>>> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test >>>>>>>> *test, >>>>>>>> return ret; >>>>>>>> } >>>>>>>> - /* >>>>>>>> - * If benchmark wasn't successfully started by child, then >>>>>>>> child >>>>>>>> should >>>>>>>> - * kill parent, so save parent's pid >>>>>>>> - */ >>>>>>>> ppid = getpid(); >>>>>>>> - if (pipe(pipefd)) { >>>>>>>> - ksft_perror("Unable to create pipe"); >>>>>>>> + /* Taskset test to specified CPU. */ >>>>>>>> + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); >>>>>>> >>>>>>> Previously only CPU affinity for bm_pid was set but now it's set >>>>>>> before >>>>>>> fork(). Quickly checking the Internet, it seems that CPU affinity >>>>>>> gets >>>>>>> inherited on fork() so now both processes will have the same >>>>>>> affinity >>>>>>> which might make the other process to interfere with the >>>>>>> measurement. >>>>>> >>>>>> Setting the affinity is intended to ensure that the buffer preparation >>>>>> occurs in the same topology as where the runtime portion will run. >>>>>> This preparation is done before the work to be measured starts. >>>>>> >>>>>> This does tie in with the association with the resctrl group and I >>>>>> will elaborate more below ... >>>>> >>>>> Okay, that's useful to retain but thinking this further, now we're also >>>>> going do non-trivial amount of work in between the setup and the test by >>>> >>>> Could you please elaborate how the amount of work during setup can be an >>>> issue? I have been focused on the measurements that are done afterwards >>>> that do have clear boundaries from what I can tell. 
>>> >>> Well, you used it as a justification: "Setting the affinity is intended >>> to ensure that the buffer preparation occurs in the same topology as where >>> the runtime portion will run." So I assumed you had some expectations about >>> "preparations" done outside of those "clear boundaries" but now you seem >>> to take entirely opposite stance? >> >> I do not follow you here. With the "clear boundaries" I meant the >> measurements and associated variables that have a clear start/init and >> stop/read while the other task sleeps. Yes, preparations are done outside >> of this since that should not be measured. > > Of course the preparations are not measured (at least not after this > patch :-)). > > But that's not what I meant. You said the preparations are to be done > using the same topology as the test but if it's outside of the measurement > period, the topology during preparations only matters if you make some > hidden assumption that some state carries from preparations to the actual > test. If no state carry-over is assumed, the topology during preparations > is not important. So which way it is? Is the topology during preparations > important or not? Topology during preparations is important. In the CMT test this is more relevant with the transition to using memflush = false. The preparation phase and measure phase uses the same cache alloc configuration and just as importantly, the same monitoring configuration. During preparation the cache portion that will be used during measurement will be filled with the contents of the buffer. During measurement it will be the same cache portion into which any new reads will be allocated and measured. In fact, the preparation phase will thus form part of the measurement. If preparation was done with different configuration, then I see a problem whether memflush = true as well as when memflush = false. 
In the case of memflush = true it will have the familiar issue of the test needing to "guess" the workload settle time. In the case of memflush = false the buffer will remain allocated into the cache portion used during preparation phase, when the workload runs it will read the data from a cache portion that does not belong to it and since it does not need to allocate into its own cache portion its LLC occupancy counts will not be accurate (the LLC occupancy associated with the buffer will be attributed to prepare portion). I am not familiar with the details of memory allocation but as I understand Linux does attempt to satisfy memory requests from the node to which the requesting CPU is assigned. For the MBM and MBA tests I thus believe it is important to allocate the memory from where it will be used. I have encountered systems where CPU0 and CPU1 are on different sockets and by default the workload is set to run on CPU1. If the preparation phase runs on CPU0 then it may be that memory could be allocated from a different NUMA node than where the workload will be running. By doing preparation within the same topology as what the workload will be running I believe that memory will be allocated appropriate to workload and thus result in more reliable measurements. > > You used the topology argument to justify why both parent and child are > now in the same resctrl group unlike before when only the actual test was. > >> You stated "now we're also going >> do non-trivial amount of work in between the setup and the test" ... >> could you clarify what the problem is with this? Before this work >> the "non-trivial amount of work" (for "fill_buf") was done as part of the >> test with (wrong) guesses about how long it takes. This work aims to improve >> this. > > I understand why you're trying to do with this change. > > However, I was trying to say that before preparations occurred right > before the test which is no longer the case because there's fork() in > between. 
If by "test" you mean the measurement phase then in the case of "fill_buf" preparations only now reliably occur before the measurement. Original behavior is maintained with user provided benchmark. > > Does that extra work impact the state carry-over from preparations to the > test? It is not clear to me what extra work or state you are referring to. > >>> fork() quite heavy operation as it has to copy various things including >>> the address space which you just made to contain a huge mem blob. :-) >> >> As highlighted in a comment found in the patch, the kernel uses copy-on-write >> and all tests only read from the buffer after fork(). > > Write test is possible using -b fill_buf ... as mentioned in comments to > the other patch. Yes, it is theoretically possible, but looking closer it is not supported. Note how measure_mem_bw() is always called with hardcoded "reads". Reading through the history of commits I do not believe modifying fill_buf parameters was ever intended to be a use case for the "-b" parameter. It seems intended to provide an alternate benchmark to fill_buf. I am not interested in adding support for the write test because I do not see how it helps us to improve resctrl subsystem health. Considering that nobody has ever complained about the write test being broken I think all that dead code should just be removed instead. I prefer the focus be on the health of the resctrl subsystem instead of additional hardware performance testing. I do not think it is energy well spent to further tune these performance tests unless it benefits resctrl subsystem health. These performance tests are difficult to maintain and I have not yet seen how they benefit the resctrl subsystem. >>> BTW, perhaps we could use some lighter weighted fork variant in the >>> resctrl selftests, the processes don't really need to be that separated >>> to justify using full fork() (and custom benchmarks will do execvp()). >> >> Are you thinking about pthreads? 
It is not clear to me that this is >> necessary. It is also not clear to me what problem you are describing that >> needs this solution. When I understand that better it will be easier to >> discuss solutions. > > I was trying to say I see advantage of retaining the address space which > is not possible with fork(), so perhaps using clone() with CLONE_VM would > be useful and maybe some other flags too. I think this would solve the > write test case. clone() brings with it the complexity of needing to manage the child's stack. This again aims for a performance improvement. What does this fix? What is the benefit to resctrl health? I would prefer that our energy instead be focused on improving resctrl subsystem health. Reinette
On Fri, 6 Sep 2024, Reinette Chatre wrote: > On 9/6/24 3:00 AM, Ilpo Järvinen wrote: > > On Thu, 5 Sep 2024, Reinette Chatre wrote: > > > On 9/5/24 5:10 AM, Ilpo Järvinen wrote: > > > > On Wed, 4 Sep 2024, Reinette Chatre wrote: > > > > > On 9/4/24 4:57 AM, Ilpo Järvinen wrote: > > > > > > On Fri, 30 Aug 2024, Reinette Chatre wrote: > > > > > > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > > > > > > > > On Thu, 29 Aug 2024, Reinette Chatre wrote: > > > > > > > > > > > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct > > > > > > > > > resctrl_test > > > > > > > > > *test, > > > > > > > > > return ret; > > > > > > > > > } > > > > > > > > > - /* > > > > > > > > > - * If benchmark wasn't successfully started by child, > > > > > > > > > then > > > > > > > > > child > > > > > > > > > should > > > > > > > > > - * kill parent, so save parent's pid > > > > > > > > > - */ > > > > > > > > > ppid = getpid(); > > > > > > > > > - if (pipe(pipefd)) { > > > > > > > > > - ksft_perror("Unable to create pipe"); > > > > > > > > > + /* Taskset test to specified CPU. */ > > > > > > > > > + ret = taskset_benchmark(ppid, uparams->cpu, > > > > > > > > > &old_affinity); > > > > > > > > > > > > > > > > Previously only CPU affinity for bm_pid was set but now it's set > > > > > > > > before > > > > > > > > fork(). Quickly checking the Internet, it seems that CPU > > > > > > > > affinity > > > > > > > > gets > > > > > > > > inherited on fork() so now both processes will have the same > > > > > > > > affinity > > > > > > > > which might make the other process to interfere with the > > > > > > > > measurement. > > > > > > > > > > > > > > Setting the affinity is intended to ensure that the buffer > > > > > > > preparation > > > > > > > occurs in the same topology as where the runtime portion will run. > > > > > > > This preparation is done before the work to be measured starts. 
> > > > > > > > > > > > > > This does tie in with the association with the resctrl group and I > > > > > > > will elaborate more below ... > > > > > > > > > > > > Okay, that's useful to retain but thinking this further, now we're > > > > > > also > > > > > > going do non-trivial amount of work in between the setup and the > > > > > > test by > > > > > > > > > > Could you please elaborate how the amount of work during setup can be > > > > > an > > > > > issue? I have been focused on the measurements that are done > > > > > afterwards > > > > > that do have clear boundaries from what I can tell. > > > > > > > > Well, you used it as a justification: "Setting the affinity is intended > > > > to ensure that the buffer preparation occurs in the same topology as > > > > where > > > > the runtime portion will run." So I assumed you had some expectations > > > > about > > > > "preparations" done outside of those "clear boundaries" but now you seem > > > > to take entirely opposite stance? > > > > > > I do not follow you here. With the "clear boundaries" I meant the > > > measurements and associated variables that have a clear start/init and > > > stop/read while the other task sleeps. Yes, preparations are done outside > > > of this since that should not be measured. > > > > Of course the preparations are not measured (at least not after this > > patch :-)). > > > > But that's not what I meant. You said the preparations are to be done > > using the same topology as the test but if it's outside of the measurement > > period, the topology during preparations only matters if you make some > > hidden assumption that some state carries from preparations to the actual > > test. If no state carry-over is assumed, the topology during preparations > > is not important. So which way it is? Is the topology during preparations > > important or not? > > Topology during preparations is important. > > In the CMT test this is more relevant with the transition to using > memflush = false. 
The preparation phase and measure phase uses the same > cache alloc configuration and just as importantly, the same monitoring > configuration. During preparation the cache portion that will be > used during measurement will be filled with the contents of the buffer. > During measurement it will be the same cache portion into which any new reads > will be allocated and measured. In fact, the preparation phase will thus form > part of the measurement. If preparation was done with different > configuration, then I see a problem whether memflush = true as well as when > memflush = false. In the case of memflush = true it will have the familiar > issue of the test needing to "guess" the workload settle time. In the case > of memflush = false the buffer will remain allocated into the cache portion > used during preparation phase, when the workload runs it will read the > data from a cache portion that does not belong to it and since it does > not need to allocate into its own cache portion its LLC occupancy counts will > not be accurate (the LLC occupancy associated with the buffer will be > attributed > to prepare portion). > > I am not familiar with the details of memory allocation but as I understand > Linux does attempt to satisfy memory requests from the node to which the > requesting CPU is assigned. For the MBM and MBA tests I thus believe it is > important to allocate the memory from where it will be used. I have > encountered > systems where CPU0 and CPU1 are on different sockets and by default the > workload > is set to run on CPU1. If the preparation phase runs on CPU0 then it may be > that memory could be allocated from a different NUMA node than where the > workload will > be running. By doing preparation within the same topology as what the > workload will be running I believe that memory will be allocated appropriate > to workload and thus result in more reliable measurements. 
> > > > > You used the topology argument to justify why both parent and child are > > now in the same resctrl group unlike before when only the actual test was. > > > > > You stated "now we're also going > > > do non-trivial amount of work in between the setup and the test" ... > > > could you clarify what the problem is with this? Before this work > > > the "non-trivial amount of work" (for "fill_buf") was done as part of the > > > test with (wrong) guesses about how long it takes. This work aims to > > > improve > > > this. > > > > I understand why you're trying to do with this change. > > > > However, I was trying to say that before preparations occurred right > > before the test which is no longer the case because there's fork() in > > between. > > If by "test" you mean the measurement phase then in the case of "fill_buf" > preparations only now reliably occur before the measurement. Original behavior > is maintained with user provided benchmark. > > > > > Does that extra work impact the state carry-over from preparations to the > > test? > > It is not clear to me what extra work or state you are referring to. > > > > > > > fork() quite heavy operation as it has to copy various things including > > > > the address space which you just made to contain a huge mem blob. :-) > > > > > > As highlighted in a comment found in the patch, the kernel uses > > > copy-on-write > > > and all tests only read from the buffer after fork(). > > > > Write test is possible using -b fill_buf ... as mentioned in comments to > > the other patch. > > Yes, it is theoretically possible, but looking closer it is not supported. > Note > how measure_mem_bw() is always called with hardcoded "reads". Reading through > the history of commits I do not believe modifying fill_buf parameters was > ever intended to be a use case for the "-b" parameter. It seems intended > to provide an alternate benchmark to fill_buf. 
> > I am not interested in adding support for the write test because I do not see > how it helps us to improve resctrl subsystem health. Considering that > nobody has ever complained about the write test being broken I think all > that dead code should just be removed instead. > > I prefer the focus be on the health of the resctrl subsystem instead of > additional > hardware performance testing. I do not think it is energy well spent to > further > tune these performance tests unless it benefits resctrl subsystem health. > These > performance tests are difficult to maintain and I have not yet seen how they > benefit the resctrl subsystem. > > > > > BTW, perhaps we could use some lighter weighted fork variant in the > > > > resctrl selftests, the processes don't really need to be that separated > > > > to justify using full fork() (and custom benchmarks will do execvp()). > > > > > > Are you thinking about pthreads? It is not clear to me that this is > > > necessary. It is also not clear to me what problem you are describing that > > > needs this solution. When I understand that better it will be easier to > > > discuss solutions. > > > > I was trying to say I see advantage of retaining the address space which > > is not possible with fork(), so perhaps using clone() with CLONE_VM would > > be useful and maybe some other flags too. I think this would solve the > > write test case. > > clone() brings with it the complexity of needing to manage the child's > stack. This again aims for a performance improvement. What does this fix? > What is the benefit to resctrl health? I would prefer that our energy > instead be focused on improving resctrl subsystem health. Fair enough.
diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c
index ae120f1735c0..12c71bb44cb6 100644
--- a/tools/testing/selftests/resctrl/fill_buf.c
+++ b/tools/testing/selftests/resctrl/fill_buf.c
@@ -114,7 +114,7 @@ void fill_cache_read(unsigned char *buf, size_t buf_size, bool once)
 	*value_sink = ret;
 }
 
-static void fill_cache_write(unsigned char *buf, size_t buf_size, bool once)
+void fill_cache_write(unsigned char *buf, size_t buf_size, bool once)
 {
 	while (1) {
 		fill_one_span_write(buf, buf_size);
@@ -150,20 +150,3 @@ unsigned char *alloc_buffer(size_t buf_size, int memflush)
 
 	return buf;
 }
-
-int run_fill_buf(size_t buf_size, int memflush, int op, bool once)
-{
-	unsigned char *buf;
-
-	buf = alloc_buffer(buf_size, memflush);
-	if (!buf)
-		return -1;
-
-	if (op == 0)
-		fill_cache_read(buf, buf_size, once);
-	else
-		fill_cache_write(buf, buf_size, once);
-	free(buf);
-
-	return 0;
-}
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 2dda56084588..0afbc4dd18e4 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -142,7 +142,7 @@ int perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu,
 unsigned char *alloc_buffer(size_t buf_size, int memflush);
 void mem_flush(unsigned char *buf, size_t buf_size);
 void fill_cache_read(unsigned char *buf, size_t buf_size, bool once);
-int run_fill_buf(size_t buf_size, int memflush, int op, bool once);
+void fill_cache_write(unsigned char *buf, size_t buf_size, bool once);
 int initialize_mem_bw_imc(void);
 int measure_mem_bw(const struct user_params *uparams,
 		   struct resctrl_val_param *param, pid_t bm_pid,
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 70e8e31f5d1a..574b72604f95 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -448,7 +448,7 @@ static int get_mem_bw_resctrl(FILE *fp, unsigned long *mbm_total)
 	return 0;
 }
 
-static pid_t bm_pid, ppid;
+static pid_t bm_pid;
 
 void ctrlc_handler(int signum, siginfo_t *info, void *ptr)
 {
@@ -506,13 +506,6 @@ void signal_handler_unregister(void)
 	}
 }
 
-static void parent_exit(pid_t ppid)
-{
-	kill(ppid, SIGKILL);
-	umount_resctrlfs();
-	exit(EXIT_FAILURE);
-}
-
 /*
  * print_results_bw:	the memory bandwidth results are stored in a file
  * @filename:		file that stores the results
@@ -614,61 +607,6 @@ int measure_mem_bw(const struct user_params *uparams,
 	return ret;
 }
 
-/*
- * run_benchmark - Run a specified benchmark or fill_buf (default benchmark)
- *		   in specified signal. Direct benchmark stdio to /dev/null.
- * @signum:	signal number
- * @info:	signal info
- * @ucontext:	user context in signal handling
- */
-static void run_benchmark(int signum, siginfo_t *info, void *ucontext)
-{
-	int operation, ret, memflush;
-	char **benchmark_cmd;
-	size_t span;
-	bool once;
-	FILE *fp;
-
-	benchmark_cmd = info->si_ptr;
-
-	/*
-	 * Direct stdio of child to /dev/null, so that only parent writes to
-	 * stdio (console)
-	 */
-	fp = freopen("/dev/null", "w", stdout);
-	if (!fp) {
-		ksft_perror("Unable to direct benchmark status to /dev/null");
-		parent_exit(ppid);
-	}
-
-	if (strcmp(benchmark_cmd[0], "fill_buf") == 0) {
-		/* Execute default fill_buf benchmark */
-		span = strtoul(benchmark_cmd[1], NULL, 10);
-		memflush = atoi(benchmark_cmd[2]);
-		operation = atoi(benchmark_cmd[3]);
-		if (!strcmp(benchmark_cmd[4], "true")) {
-			once = true;
-		} else if (!strcmp(benchmark_cmd[4], "false")) {
-			once = false;
-		} else {
-			ksft_print_msg("Invalid once parameter\n");
-			parent_exit(ppid);
-		}
-
-		if (run_fill_buf(span, memflush, operation, once))
-			fprintf(stderr, "Error in running fill buffer\n");
-	} else {
-		/* Execute specified benchmark */
-		ret = execvp(benchmark_cmd[0], benchmark_cmd);
-		if (ret)
-			ksft_perror("execvp");
-	}
-
-	fclose(stdout);
-	ksft_print_msg("Unable to run specified benchmark\n");
-	parent_exit(ppid);
-}
-
 /*
  * resctrl_val:	execute benchmark and measure memory bandwidth on
  *		the benchmark
@@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test,
 		const char * const *benchmark_cmd,
 		struct resctrl_val_param *param)
 {
-	struct sigaction sigact;
-	int ret = 0, pipefd[2];
-	char pipe_message = 0;
-	union sigval value;
-	int domain_id;
+	int domain_id, operation = 0, memflush = 1;
+	size_t span = DEFAULT_SPAN;
+	unsigned char *buf = NULL;
+	cpu_set_t old_affinity;
+	bool once = false;
+	int ret = 0;
+	pid_t ppid;
 
 	if (strcmp(param->filename, "") == 0)
 		sprintf(param->filename, "stdio");
@@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
 		return ret;
 	}
 
-	/*
-	 * If benchmark wasn't successfully started by child, then child should
-	 * kill parent, so save parent's pid
-	 */
 	ppid = getpid();
 
-	if (pipe(pipefd)) {
-		ksft_perror("Unable to create pipe");
+	/* Taskset test to specified CPU. */
+	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
+	if (ret)
+		return ret;
 
-		return -1;
+	/* Write test to specified control & monitoring group in resctrl FS. */
+	ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp);
+	if (ret)
+		goto reset_affinity;
+
+	if (param->init) {
+		ret = param->init(param, domain_id);
+		if (ret)
+			goto reset_affinity;
 	}
 
 	/*
-	 * Fork to start benchmark, save child's pid so that it can be killed
-	 * when needed
+	 * If not running user provided benchmark, run the default
+	 * "fill_buf". First phase of "fill_buf" is to prepare the
+	 * buffer that the benchmark will operate on. No measurements
+	 * are needed during this phase and prepared memory will be
+	 * passed to next part of benchmark via copy-on-write. TBD
+	 * how this impacts "write" benchmark, but no test currently
+	 * uses this.
*/ - fflush(stdout); + if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { + span = strtoul(benchmark_cmd[1], NULL, 10); + memflush = atoi(benchmark_cmd[2]); + operation = atoi(benchmark_cmd[3]); + if (!strcmp(benchmark_cmd[4], "true")) { + once = true; + } else if (!strcmp(benchmark_cmd[4], "false")) { + once = false; + } else { + ksft_print_msg("Invalid once parameter\n"); + ret = -EINVAL; + goto reset_affinity; + } + + buf = alloc_buffer(span, memflush); + if (!buf) { + ret = -ENOMEM; + goto reset_affinity; + } + } + bm_pid = fork(); if (bm_pid == -1) { + ret = -errno; ksft_perror("Unable to fork"); - - return -1; + goto free_buf; } + /* + * What needs to be measured runs in separate process until + * terminated. + */ if (bm_pid == 0) { - /* - * Mask all signals except SIGUSR1, parent uses SIGUSR1 to - * start benchmark - */ - sigfillset(&sigact.sa_mask); - sigdelset(&sigact.sa_mask, SIGUSR1); - - sigact.sa_sigaction = run_benchmark; - sigact.sa_flags = SA_SIGINFO; - - /* Register for "SIGUSR1" signal from parent */ - if (sigaction(SIGUSR1, &sigact, NULL)) { - ksft_perror("Can't register child for signal"); - parent_exit(ppid); + if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { + if (operation == 0) + fill_cache_read(buf, span, once); + else + fill_cache_write(buf, span, once); + } else { + execvp(benchmark_cmd[0], (char **)benchmark_cmd); } - - /* Tell parent that child is ready */ - close(pipefd[0]); - pipe_message = 1; - if (write(pipefd[1], &pipe_message, sizeof(pipe_message)) < - sizeof(pipe_message)) { - ksft_perror("Failed signaling parent process"); - close(pipefd[1]); - return -1; - } - close(pipefd[1]); - - /* Suspend child until delivery of "SIGUSR1" from parent */ - sigsuspend(&sigact.sa_mask); - - ksft_perror("Child is done"); - parent_exit(ppid); + exit(EXIT_SUCCESS); } ksft_print_msg("Benchmark PID: %d\n", (int)bm_pid); - /* - * The cast removes constness but nothing mutates benchmark_cmd within - * the context of this process. 
At the receiving process, it becomes - * argv, which is mutable, on exec() but that's after fork() so it - * doesn't matter for the process running the tests. - */ - value.sival_ptr = (void *)benchmark_cmd; - - /* Taskset benchmark to specified cpu */ - ret = taskset_benchmark(bm_pid, uparams->cpu, NULL); - if (ret) - goto out; - - /* Write benchmark to specified control&monitoring grp in resctrl FS */ - ret = write_bm_pid_to_resctrl(bm_pid, param->ctrlgrp, param->mongrp); - if (ret) - goto out; - - if (param->init) { - ret = param->init(param, domain_id); - if (ret) - goto out; - } - - /* Parent waits for child to be ready. */ - close(pipefd[1]); - while (pipe_message != 1) { - if (read(pipefd[0], &pipe_message, sizeof(pipe_message)) < - sizeof(pipe_message)) { - ksft_perror("Failed reading message from child process"); - close(pipefd[0]); - goto out; - } - } - close(pipefd[0]); - - /* Signal child to start benchmark */ - if (sigqueue(bm_pid, SIGUSR1, value) == -1) { - ksft_perror("sigqueue SIGUSR1 to child"); - ret = -1; - goto out; - } - - /* Give benchmark enough time to fully run */ + /* Give benchmark enough time to fully run. */ sleep(1); /* Test runs until the callback setup() tells the test to stop. */ @@ -821,8 +730,10 @@ int resctrl_val(const struct resctrl_test *test, break; } -out: kill(bm_pid, SIGKILL); - +free_buf: + free(buf); +reset_affinity: + taskset_restore(ppid, &old_affinity); return ret; }
The CMT, MBA, and MBM tests rely on the resctrl_val() wrapper to
start and run a benchmark while providing test-specific flows
via callbacks to do test-specific configuration and measurements.

At a high level, the resctrl_val() flow is:
a) Start by fork()ing a child process that installs a signal
   handler for SIGUSR1 that, on receipt of SIGUSR1, will
   start running a benchmark.
b) Assign the child process created in (a) to the resctrl
   control and monitoring group that dictates the memory and
   cache allocations with which the process can run and will
   contain all resctrl monitoring data of that process.
c) Once parent and child are considered "ready" (determined via
   a message over a pipe) the parent signals the child (via
   SIGUSR1) to start the benchmark, waits one second for the
   benchmark to run, and then starts collecting monitoring data
   for the tests, potentially also changing allocation
   configuration depending on the various test callbacks.

A problem with the above flow is the "black box" view of the
benchmark combined with an arbitrarily chosen "wait one second"
delay before measurements start. No matter what the benchmark does,
it is given one second to initialize before measurements start.

The default benchmark "fill_buf" consists of two parts: first it
prepares a buffer (allocate, initialize, then flush), then it reads
from the buffer (in unpredictable ways) until terminated. Depending
on the system and the size of the buffer, the first "prepare" part
may not be complete by the time the one second delay expires. Test
measurements may thus start before the work needing to be measured
runs.

Split the default benchmark into its "prepare" and "runtime" parts
and simplify the resctrl_val() wrapper while doing so. This same
split cannot be done for the user provided benchmark (without a user
interface change), so the current behavior is maintained for the
user provided benchmark.
Assign the test itself to the control and monitoring group and run the
"prepare" part of the benchmark in this context, ensuring it runs with
the required cache and memory bandwidth allocations. With the
benchmark preparation complete, it is only necessary to fork() the
"runtime" part of the benchmark (or the entire user provided
benchmark).

Keep the "wait one second" delay before measurements start. For the
default "fill_buf" benchmark this time now covers only the "runtime"
portion that needs to be measured. For the user provided benchmark
this delay maintains the current behavior.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 tools/testing/selftests/resctrl/fill_buf.c    |  19 +-
 tools/testing/selftests/resctrl/resctrl.h     |   2 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 225 ++++++------------
 3 files changed, 70 insertions(+), 176 deletions(-)