diff mbox series

[v2,2/2] KVM: selftests: Print summary stats of memory latency distribution

Message ID 20230316222752.1911001-3-coltonlewis@google.com (mailing list archive)
State New, archived
Headers show
Series Calculate memory access latency stats | expand

Commit Message

Colton Lewis March 16, 2023, 10:27 p.m. UTC
Print summary stats of the memory latency distribution in nanoseconds
for dirty_log_perf_test. For every iteration, this prints the minimum,
the maximum, and the 50th, 90th, and 99th percentiles.

Stats are calculated by sorting the samples taken from all vcpus and
picking the value at the index corresponding to each percentile.

Because keeping all samples is impractical due to the space required
in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64
vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 GB extra memory just
for samples), reservoir sampling [1] is used to keep only a small
number of samples per vcpu (settable via a command-line argument).
There is a tradeoff between stat accuracy and memory usage: more
samples means more accuracy but also more memory use.

Because other selftests also use memstress.c, those tests are
hardcoded to take 0 samples at this time.

[1] https://en.wikipedia.org/wiki/Reservoir_sampling#Simple:_Algorithm_R

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 .../selftests/kvm/access_tracking_perf_test.c |  3 +-
 .../selftests/kvm/demand_paging_test.c        |  2 +-
 .../selftests/kvm/dirty_log_perf_test.c       | 10 ++-
 .../testing/selftests/kvm/include/memstress.h | 10 ++-
 tools/testing/selftests/kvm/lib/memstress.c   | 68 ++++++++++++++++---
 .../kvm/memslot_modification_stress_test.c    |  2 +-
 6 files changed, 81 insertions(+), 14 deletions(-)

--
2.40.0.rc1.284.g88254d51c5-goog

Comments

Vipin Sharma March 17, 2023, 6:16 p.m. UTC | #1
On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com> wrote:
>
> Print summary stats of the memory latency distribution in nanoseconds
> for dirty_log_perf_test. For every iteration, this prints the minimum,
> the maximum, and the 50th, 90th, and 99th percentiles.
>

Can you also write how this is useful and why these 5 specific
percentiles are valuable to testers? Generally, a five-number summary
is the 0th, 25th, 50th, 75th, and 100th percentiles.
It might also be too much data to display with each iteration.

Nit: minimum, 50th, 90th, 99th, and maximum, since this is the order
you are printing them in.


> @@ -428,6 +432,7 @@ int main(int argc, char *argv[])
>                 .slots = 1,
>                 .random_seed = 1,
>                 .write_percent = 100,
> +               .samples_per_vcpu = 1000,

Why is 1000 the right default choice? Maybe the default should be 0,
and anyone who wants to use this feature can set it via the command
line option?

> @@ -438,7 +443,7 @@ int main(int argc, char *argv[])
>
>         guest_modes_append_default();
>
> -       while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
> +       while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:S:v:x:w:")) != -1) {

1. Please document the new command line option in the help text.
2. Instead of having both 's' and 'S', it may be better to use a
different lowercase letter, like 'l' for latency. Passing this option
would print the memory latency stats, with users providing the number
of samples they prefer per vCPU.

> @@ -480,6 +485,9 @@ int main(int argc, char *argv[])
>                 case 's':
>                         p.backing_src = parse_backing_src_type(optarg);
>                         break;
> +               case 'S':
> +                       p.samples_per_vcpu = atoi_positive("Number of samples/vcpu", optarg);
> +                       break;

This will force users to always see the latency stats even when they
do not want to see them. I think this patch should be modified so that
the feature can easily be disabled.
I might be wrong here, and it may actually be a useful metric to see
with each run. If so, the commit message should explain why it is good
to print for each run.

> +void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
> +{
> +       uint64_t n_samples = nr_vcpus * memstress_args.samples_per_vcpu;
> +       uint64_t *host_latency_samples = addr_gva2hva(vm, memstress_args.latency_samples);
> +
> +       qsort(host_latency_samples, n_samples, sizeof(uint64_t), &memstress_qcmp);
> +
> +       pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),

I am not much aware of how tsc is set up and used. Will all vCPUs have
the same tsc value? Can this change if vCPU gets scheduled to
different pCPU on the host?

> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples / 2]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 9 / 10]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 99 / 100]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples - 1]));
> +}

All of the dirty_log_perf_test stats print data in seconds followed by
nanoseconds. I recommend keeping this one the same.
Also, use 9 digits for nanoseconds instead of 6, as in the other
formats.
Sean Christopherson March 21, 2023, 9:21 p.m. UTC | #2
On Tue, Mar 21, 2023, Colton Lewis wrote:
> Vipin Sharma <vipinsh@google.com> writes:
> 
> > On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com>
> > wrote:
> > > +       pr_info("Latency distribution (ns) = min:%6.0lf,
> > > 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
> > > +               cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),
> 
> > I am not much aware of how tsc is set up and used. Will all vCPUs have
> > the same tsc value? Can this change if vCPU gets scheduled to
> > different pCPU on the host?

FWIW, if this test were run on older CPUs, there would be potential divergence
across pCPUs that would then bleed into vCPUs to some extent.  Older CPUs tied
the TSC frequency to the core frequency, e.g. would change frequency depending
on the power/turbo state, and the TSC would even stop counting altogether at
certain C-states.  KVM does its best to adjust the guest's perception of the TSC,
but it can't be hidden completely.

But for what this test is trying to do, IMO there's zero reason to worry about
that.
 
> All vCPUs *in one VM* should have the same frequency. The alternative is
> probably possible but so weird I can't imagine a reason for doing it.

Somewhat related to Vipin's question, "host_latency_samples" is a confusing name.
It's easy to miss that "host_latency_samples" are actually samples collected in
the guest, and thus to think that this code will depend on which pCPU it runs on.

I don't see any reason for such a verbose name, e.g. can't it just be "samples"?

Patch

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..89947aa4ff3a 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,8 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;

-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes,
+				 0, 1,
 				 params->backing_src, !overlap_memory_access);

 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e2..bc5c171eca9e 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,7 +135,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int i;

-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 0, 1,
 				 p->src_type, p->partition_vcpu_memory_access);

 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89..40058fe2b58f 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -133,6 +133,7 @@  struct test_params {
 	int slots;
 	uint32_t write_percent;
 	uint32_t random_seed;
+	uint64_t samples_per_vcpu;
 	bool random_access;
 };

@@ -224,6 +225,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;

 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 p->samples_per_vcpu,
 				 p->slots, p->backing_src,
 				 p->partition_vcpu_memory_access);

@@ -274,6 +276,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("Populate memory time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
+	memstress_print_percentiles(vm, nr_vcpus);

 	/* Enable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -304,6 +307,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 		vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
 		pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
 			iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
+		memstress_print_percentiles(vm, nr_vcpus);

 		clock_gettime(CLOCK_MONOTONIC, &start);
 		get_dirty_log(vm, bitmaps, p->slots);
@@ -428,6 +432,7 @@  int main(int argc, char *argv[])
 		.slots = 1,
 		.random_seed = 1,
 		.write_percent = 100,
+		.samples_per_vcpu = 1000,
 	};
 	int opt;

@@ -438,7 +443,7 @@  int main(int argc, char *argv[])

 	guest_modes_append_default();

-	while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
+	while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:S:v:x:w:")) != -1) {
 		switch (opt) {
 		case 'a':
 			p.random_access = true;
@@ -480,6 +485,9 @@  int main(int argc, char *argv[])
 		case 's':
 			p.backing_src = parse_backing_src_type(optarg);
 			break;
+		case 'S':
+			p.samples_per_vcpu = atoi_positive("Number of samples/vcpu", optarg);
+			break;
 		case 'v':
 			nr_vcpus = atoi_positive("Number of vCPUs", optarg);
 			TEST_ASSERT(nr_vcpus <= max_vcpus,
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7b..25a3b5e308e9 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -37,6 +37,12 @@  struct memstress_args {
 	uint64_t guest_page_size;
 	uint32_t random_seed;
 	uint32_t write_percent;
+	uint64_t samples_per_vcpu;
+	/* Store all samples in a flat array so they can be easily
+	 * sorted later.
+	 */
+	vm_vaddr_t latency_samples;
+

 	/* Run vCPUs in L2 instead of L1, if the architecture supports it. */
 	bool nested;
@@ -56,7 +62,8 @@  struct memstress_args {
 extern struct memstress_args memstress_args;

 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes,
+				   uint64_t samples_per_vcpu, int slots,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
@@ -72,4 +79,5 @@  void memstress_guest_code(uint32_t vcpu_id);
 uint64_t memstress_nested_pages(int nr_vcpus);
 void memstress_setup_nested(struct kvm_vm *vm, int nr_vcpus, struct kvm_vcpu *vcpus[]);

+void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus);
 #endif /* SELFTEST_KVM_MEMSTRESS_H */
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238..c275471ad6d8 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -48,14 +48,17 @@  void memstress_guest_code(uint32_t vcpu_idx)
 {
 	struct memstress_args *args = &memstress_args;
 	struct memstress_vcpu_args *vcpu_args = &args->vcpu_args[vcpu_idx];
-	struct guest_random_state rand_state;
 	uint64_t gva;
 	uint64_t pages;
 	uint64_t addr;
 	uint64_t page;
 	int i;
-
-	rand_state = new_guest_random_state(args->random_seed + vcpu_idx);
+	struct guest_random_state rand_state =
+		new_guest_random_state(args->random_seed + vcpu_idx);
+	uint64_t *vcpu_samples = (uint64_t *)args->latency_samples
+		+ args->samples_per_vcpu * vcpu_idx;
+	uint64_t cycles;
+	uint32_t maybe_sample;

 	gva = vcpu_args->gva;
 	pages = vcpu_args->pages;
@@ -73,15 +76,46 @@  void memstress_guest_code(uint32_t vcpu_idx)
 			addr = gva + (page * args->guest_page_size);

 			if (guest_random_u32(&rand_state) % 100 < args->write_percent)
-				*(uint64_t *)addr = 0x0123456789ABCDEF;
+				cycles = MEASURE_CYCLES(*(uint64_t *)addr = 0x0123456789ABCDEF);
 			else
-				READ_ONCE(*(uint64_t *)addr);
+				cycles = MEASURE_CYCLES(READ_ONCE(*(uint64_t *)addr));
+
+			if (i < args->samples_per_vcpu) {
+				vcpu_samples[i] = cycles;
+				continue;
+			}
+
+			maybe_sample = guest_random_u32(&rand_state) % (i + 1);
+
+			if (maybe_sample < args->samples_per_vcpu)
+				vcpu_samples[maybe_sample] = cycles;
 		}

 		GUEST_SYNC(1);
 	}
 }

+/* compare function for qsort */
+static int memstress_qcmp(const void *a, const void *b)
+{
+	return *(int *)a - *(int *)b;
+}
+
+void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
+{
+	uint64_t n_samples = nr_vcpus * memstress_args.samples_per_vcpu;
+	uint64_t *host_latency_samples = addr_gva2hva(vm, memstress_args.latency_samples);
+
+	qsort(host_latency_samples, n_samples, sizeof(uint64_t), &memstress_qcmp);
+
+	pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples / 2]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 9 / 10]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 99 / 100]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples - 1]));
+}
+
 void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 			   struct kvm_vcpu *vcpus[],
 			   uint64_t vcpu_memory_bytes,
@@ -119,21 +153,26 @@  void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }

 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes,
+				   uint64_t samples_per_vcpu, int slots,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
 	struct memstress_args *args = &memstress_args;
 	struct kvm_vm *vm;
-	uint64_t guest_num_pages, slot0_pages = 0;
+	uint64_t guest_num_pages, sample_pages, slot0_pages = 0;
 	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
 	uint64_t region_end_gfn;
+	uint64_t sample_size;
+	uint64_t sample_slot;
+	uint64_t sample_start;
 	int i;

 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));

 	/* By default vCPUs will write to memory. */
 	args->write_percent = 100;
+	args->samples_per_vcpu = samples_per_vcpu;

 	/*
 	 * Snapshot the non-huge page size.  This is used by the guest code to
@@ -159,12 +198,17 @@  struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	if (args->nested)
 		slot0_pages += memstress_nested_pages(nr_vcpus);

+	sample_size = samples_per_vcpu * nr_vcpus * sizeof(uint64_t);
+	sample_pages = vm_adjust_num_guest_pages(
+		mode, 1 + sample_size / args->guest_page_size);
+
 	/*
 	 * Pass guest_num_pages to populate the page tables for test memory.
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in meslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	vm = __vm_create_with_vcpus(mode, nr_vcpus,
+				    slot0_pages + guest_num_pages + sample_pages,
 				    memstress_guest_code, vcpus);

 	args->vm = vm;
@@ -190,7 +234,7 @@  struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 		    " nr_vcpus: %d wss: %" PRIx64 "]\n",
 		    guest_num_pages, region_end_gfn - 1, nr_vcpus, vcpu_memory_bytes);

-	args->gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size;
+	args->gpa = (region_end_gfn - guest_num_pages - sample_pages - 1) * args->guest_page_size;
 	args->gpa = align_down(args->gpa, backing_src_pagesz);
 #ifdef __s390x__
 	/* Align to 1M (segment size) */
@@ -213,6 +257,12 @@  struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	/* Do mapping for the demand paging memory slot */
 	virt_map(vm, guest_test_virt_mem, args->gpa, guest_num_pages);

+	memstress_args.latency_samples = guest_test_virt_mem + args->size;
+	sample_start = args->gpa + args->size;
+	sample_slot = MEMSTRESS_MEM_SLOT_INDEX + slots;
+	vm_userspace_mem_region_add(vm, backing_src, sample_start, sample_slot, sample_pages, 0);
+	virt_map(vm, memstress_args.latency_samples, sample_start, sample_pages);
+
 	memstress_setup_vcpus(vm, nr_vcpus, vcpus, vcpu_memory_bytes,
 			      partition_vcpu_memory_access);

diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0a14f7d16e0c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;

-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 0, 1,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);