[RFC/RFT,0/3] arm64: KVM: work around incoherency with uncached guest mappings

Message ID 20150220142905.GA10942@hawk.usersys.redhat.com (mailing list archive)
State New, archived

Commit Message

Andrew Jones Feb. 20, 2015, 2:29 p.m. UTC
On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote:
> 
> 
> On 19/02/2015 18:55, Andrew Jones wrote:
> >> > > (I don't have an exact number for how many times it went to EL1 because
> >> > >  access_mair() doesn't have a trace point.)
> >> > > (I got the 62873 number by testing a 3rd kernel build that only had patch
> >> > >  3/3 applied to the base, and counting kvm_toggle_cache events.)
> >> > > (The number 50 is the number of kvm_toggle_cache events *without* 3/3
> >> > >  applied.)
> >> > >
> >> > > I consider this bad news because, even considering it only goes to EL2,
> >> > > it goes a ton more than it used to. I realize patch 3/3 isn't the final
> >> > > plan for enabling traps though.
> 
> If a full guest boots, can you try timing a kernel compile?
>

Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4
times inside the guest for each host kernel; base and mair. I dropped
the time from the first run of each set, and captured the other 3.
Command line used below. Time is from the
  Elapsed (wall clock) time (h:mm:ss or m:ss):
output of /usr/bin/time - the host's wall clock.

  /usr/bin/time --verbose ssh $VM 'cd kernel && make -s clean && make -s -j8'

Results:
base: 3:06.11 3:07.00 3:10.93
mair: 3:08.47 3:06.75 3:04.76

So looks like the 3 orders of magnitude greater number of traps
(only to el2) doesn't impact kernel compiles.

Then I thought I'd be able to quickly measure the number of cycles
a trap to el2 takes with this kvm-unit-tests test

#include <libcflat.h>

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;

	/* enable and reset the PMU cycle counter (PMCR: C | E) */
	asm volatile(
	"	mrs %0, sctlr_el1\n"
	"	msr pmcr_el0, %1\n"
	: "=&r" (sctlr) : "r" (5));

	/* time a write to sctlr_el1, which traps to EL2 when TVM is set */
	asm volatile(
	"	mrs %0, pmccntr_el0\n"
	"	msr sctlr_el1, %2\n"
	"	mrs %1, pmccntr_el0\n"
	: "=&r" (start), "=&r" (end) : "r" (sctlr));

	printf("%llx\n", (unsigned long long)(end - start));
	return 0;
}

after applying this patch to kvm

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index bb91b6fc63861..5de39d740aa58 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -770,7 +770,7 @@

        mrs     x2, mdcr_el2
        and     x2, x2, #MDCR_EL2_HPMN_MASK
-       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
        orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)

        // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap

But I get zero for the cycle count. Not sure what I'm missing.

drew

Comments

Ard Biesheuvel Feb. 20, 2015, 2:37 p.m. UTC | #1
On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote:
>> [...]
>>
>> If a full guest boots, can you try timing a kernel compile?
>>
>
> Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4
> times inside the guest for each host kernel; base and mair. I dropped
> the time from the first run of each set, and captured the other 3.
> Command line used below. Time is from the
>   Elapsed (wall clock) time (h:mm:ss or m:ss):
> output of /usr/bin/time - the host's wall clock.
>
>   /usr/bin/time --verbose ssh $VM 'cd kernel && make -s clean && make -s -j8'
>
> Results:
> base: 3:06.11 3:07.00 3:10.93
> mair: 3:08.47 3:06.75 3:04.76
>
> So looks like the 3 orders of magnitude greater number of traps
> (only to el2) doesn't impact kernel compiles.
>

OK, good! That was what I was hoping for, obviously.

> Then I thought I'd be able to quickly measure the number of cycles
> a trap to el2 takes with this kvm-unit-tests test
>
> [...]
>
> But I get zero for the cycle count. Not sure what I'm missing.
>

No clue tbh. Does the counter work as expected in the host?
Andrew Jones Feb. 20, 2015, 3:36 p.m. UTC | #2
On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> > So looks like the 3 orders of magnitude greater number of traps
> > (only to el2) doesn't impact kernel compiles.
> >
> 
> OK, good! That was what I was hoping for, obviously.
> 
> > [...]
> > But I get zero for the cycle count. Not sure what I'm missing.
> >
> 
> No clue tbh. Does the counter work as expected in the host?
>

Guess not. I dropped the test into a module_init and inserted
it on the host. Always get zero for pmccntr_el0 reads. Or, if
I set it to something non-zero with a write, then I always get
that back - no increments. pmcr_el0 looks OK... I had forgotten
to set bit 31 of pmcntenset_el0, but doing that still doesn't
help. Anyway, I assume the problem is me. I'll keep looking to
see what I'm missing.

drew
Andrew Jones Feb. 24, 2015, 2:55 p.m. UTC | #3
On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> > > [...]
> > 
> > No clue tbh. Does the counter work as expected in the host?
> >
> 
> Guess not. I dropped the test into a module_init and inserted
> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> I set it to something non-zero with a write, then I always get
> that back - no increments. pmcr_el0 looks OK... I had forgotten
> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> help. Anyway, I assume the problem is me. I'll keep looking to
> see what I'm missing.
>

I returned to this and see that the problem was indeed me. I needed yet
another enable bit set (the filter register needed to be instructed to
count cycles while in el2). I've attached the code for the curious.
The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
running on a host without this patch series (after TVM traps have been
disabled), I get a pretty consistent 40.

I checked how many vm-sysreg traps we do during the kernel compile
benchmark. It's 124924. So it's a bit strange that we don't see the
benchmark taking 10 to 20 seconds longer on average. I should probably
double check my runs. In any case, while I like the approach of this
series, the overhead is looking non-negligible.

drew
#include <libcflat.h>

/* enable the PMU cycle counter, including counting while in EL2 */
static void prep_cc(void)
{
	asm volatile(
	"	msr pmovsclr_el0, %0\n"	/* clear cycle counter overflow flag */
	"	msr pmccfiltr_el0, %1\n"	/* NSH: count cycles in EL2 too */
	"	msr pmcntenset_el0, %2\n"	/* enable the cycle counter */
	"	msr pmcr_el0, %3\n"	/* LC | C | E: 64-bit, reset, enable */
	"	isb\n"
	:
	: "r" (1 << 31), "r" (1 << 27), "r" (1 << 31),
	  "r" (1 << 6 | 1 << 2 | 1 << 0));
}

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;
	int i, zeros = 0;

	asm volatile("mrs %0, sctlr_el1" : "=&r" (sctlr));
	prep_cc();

	for (i = 0; i < 100000; ++i) {
		/* the sctlr_el1 write is what traps to EL2 when TVM is set */
		asm volatile(
		"	mrs %0, pmccntr_el0\n"
		"	msr sctlr_el1, %2\n"
		"	mrs %1, pmccntr_el0\n"
		"	isb\n"
		: "=&r" (start), "=&r" (end) : "r" (sctlr));

		if ((i % 10) == 0)
			printf("\n");

		printf(" %d", end - start);

		if ((end - start) == 0) {
			++zeros;
			prep_cc();
		}
	}

	printf("\nnum zero counts = %d\n", zeros);
	return 0;
}
Ard Biesheuvel Feb. 24, 2015, 5:47 p.m. UTC | #4
On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
>> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
>> > [...]
>> >
>> > No clue tbh. Does the counter work as expected in the host?
>> >
>>
>> Guess not. I dropped the test into a module_init and inserted
>> it on the host. Always get zero for pmccntr_el0 reads. Or, if
>> I set it to something non-zero with a write, then I always get
>> that back - no increments. pmcr_el0 looks OK... I had forgotten
>> to set bit 31 of pmcntenset_el0, but doing that still doesn't
>> help. Anyway, I assume the problem is me. I'll keep looking to
>> see what I'm missing.
>>
>
> I returned to this and see that the problem was indeed me. I needed yet
> another enable bit set (the filter register needed to be instructed to
> count cycles while in el2). I've attached the code for the curious.
> The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> running on a host without this patch series (after TVM traps have been
> disabled), I get a pretty consistent 40.
>
> I checked how many vm-sysreg traps we do during the kernel compile
> benchmark. It's 124924. So it's a bit strange that we don't see the
> benchmark taking 10 to 20 seconds longer on average. I should probably
> double check my runs. In any case, while I like the approach of this
> series, the overhead is looking non-negligible.
>

Thanks a lot for producing these numbers. 125k x 7k == <1 billion
cycles == <1 second on a >1 GHz machine, I think?
Or am I missing something? How long does the actual compile take?
Andrew Jones Feb. 24, 2015, 7:12 p.m. UTC | #5
On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> > [...]
> > I returned to this and see that the problem was indeed me. I needed yet
> > another enable bit set (the filter register needed to be instructed to
> > count cycles while in el2). I've attached the code for the curious.
> > The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> > running on a host without this patch series (after TVM traps have been
> > disabled), I get a pretty consistent 40.
> >
> > I checked how many vm-sysreg traps we do during the kernel compile
> > benchmark. It's 124924. So it's a bit strange that we don't see the
> > benchmark taking 10 to 20 seconds longer on average. I should probably
> > double check my runs. In any case, while I like the approach of this
> > series, the overhead is looking non-negligible.
> >
> 
> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
> cycles == <1 second on a >1 GHz machine, I think?
> Or am I missing something? How long does the actual compile take?
>

Wait, my fault. I dropped a pretty big divisor in my calculation. Don't
ask... I'll just go home and study one of my daughter's math books now...

So, I even have a 2.4 GHz machine, which explains why the benchmark times
are the same with and without this series (those times are provided earlier
in this thread; they're roughly 3:10). I'm glad you straightened me out. I
was second-guessing my benchmark results and considering redoing them.
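
(Spelled out with the numbers from this thread: 124924 traps x ~7000
cycles/trap ~= 8.7e8 cycles, and 8.7e8 cycles / 2.4e9 cycles/sec ~= 0.36
seconds - well within the run-to-run noise of a ~3:10 kernel compile.)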

Anyway, this series, at least wrt overhead, is looking good again.

Thanks,
drew
Christoffer Dall March 2, 2015, 4:31 p.m. UTC | #6
On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> > [...]
> > I returned to this and see that the problem was indeed me. I needed yet
> > another enable bit set (the filter register needed to be instructed to
> > count cycles while in el2). I've attached the code for the curious.
> > The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> > running on a host without this patch series (after TVM traps have been
> > disabled), I get a pretty consistent 40.
> >
> > I checked how many vm-sysreg traps we do during the kernel compile
> > benchmark. It's 124924. So it's a bit strange that we don't see the
> > benchmark taking 10 to 20 seconds longer on average. I should probably
> > double check my runs. In any case, while I like the approach of this
> > series, the overhead is looking non-negligible.
> >
> 
> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
> cycles == <1 second on a >1 GHz machine, I think?
> Or am I missing something? How long does the actual compile take?
> 
I ran a sequence of benchmarks that I occasionally run (pbzip,
kernbench, and hackbench) and I also saw < 1% performance degradation,
so I think we can trust that somewhat.  (I can post the raw numbers when
I have ssh access to my Linux desktop - sending this from Somewhere Over
The Atlantic).

However, my concerns with these patches are on two points:

1. It's not a fix-all.  We still have the case where the guest expects
the behavior of device memory (for strong ordering for example) on a RAM
region, which we now break.  Similarly, this doesn't support the
non-coherent DMA to RAM region case.

2. While the code is probably as nice as this kind of stuff gets, it
is non-trivial and extremely difficult to debug.  The counter-point here
is that we may end up handling other stuff at EL2 for performance reasons
in the future.

Mainly because of point 1 above, I am leaning toward thinking userspace
should do the invalidation when it knows it needs to, either through KVM
via a memslot flag or through some other syscall mechanism.
Thanks,
-Christoffer
Paolo Bonzini March 2, 2015, 4:47 p.m. UTC | #7
On 02/03/2015 17:31, Christoffer Dall wrote:
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
> 
> Mainly because of point 1 above, I am leaning toward thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.

I'm okay with adding a KVM capability and ioctl that flushes the dcache
for a given gpa range.  However:

1) I'd like to have an implementation for QEMU and/or kvmtool before
accepting that ioctl.

2) I think the ioctl should work whatever the stage1 mapping is (e.g.
with and without Ard's patches, with and without Laszlo's OVMF patch, etc.).

Also, we may want to invalidate the cache for dirty pages before
returning the dirty bitmap, and probably should do that directly in
KVM_GET_DIRTY_LOG.
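
To make the shape of such an interface concrete, here is a purely
illustrative sketch (none of these names or numbers exist in today's
uapi):

struct kvm_flush_dcache_range {
	__u64 gpa;	/* guest physical start address */
	__u64 size;	/* length of the range in bytes */
};

#define KVM_FLUSH_DCACHE_RANGE _IOW(KVMIO, 0xb0, struct kvm_flush_dcache_range)

/* in userspace, e.g. before reading a framebuffer page the guest wrote */
struct kvm_flush_dcache_range range = { .gpa = fb_gpa, .size = fb_size };
ioctl(vm_fd, KVM_FLUSH_DCACHE_RANGE, &range);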

Paolo
Andrew Jones March 2, 2015, 4:48 p.m. UTC | #8
On Mon, Mar 02, 2015 at 08:31:46AM -0800, Christoffer Dall wrote:
> On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> > [...]
> I ran a sequence of benchmarks that I occasionally run (pbzip,
> kernbench, and hackbench) and I also saw < 1% performance degradation,
> so I think we can trust that somewhat.  (I can post the raw numbers when
> I have ssh access to my Linux desktop - sending this from Somewhere Over
> The Atlantic).
> 
> However, my concerns with these patches are on two points:
>
> 1. It's not a fix-all.  We still have the case where the guest expects
> the behavior of device memory (for strong ordering for example) on a RAM
> region, which we now break.  Similarly, this doesn't support the
> non-coherent DMA to RAM region case.
>
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
>
> Mainly because of point 1 above, I am leaning toward thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.

I've started down the memslot flag road by promoting KVM_MEMSLOT_INCOHERENT
to uapi/KVM_MEM_INCOHERENT, replacing the readonly memslot heuristic.
With a couple more changes it should work for all memory regions with
the 'incoherent' property. I'll make some changes to QEMU to test it all
out as well. Progress was slow last week due to too many higher priority
tasks, but I plan to return to it this week.
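
For illustration, userspace would then tag such a region roughly like
this (KVM_MEM_INCOHERENT being the proposed flag, not an existing uapi
value):

struct kvm_userspace_memory_region region = {
	.slot = slot,
	.flags = KVM_MEM_INCOHERENT,
	.guest_phys_addr = gpa,
	.memory_size = size,
	.userspace_addr = (__u64)hva,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);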

Thanks,
drew


> 
> Thanks,
> -Christoffer
Laszlo Ersek March 2, 2015, 4:55 p.m. UTC | #9
On 03/02/15 17:47, Paolo Bonzini wrote:
> 
> Also, we may want to invalidate the cache for dirty pages before
> returning the dirty bitmap, and probably should do that directly in
> KVM_GET_DIRTY_LOG.

"I agree."

If KVM_GET_DIRTY_LOG is supposed to be an atomic fetch-and-clear (from
userspace's point of view), then the cache invalidation should be an
atomic part of it too (from the same point of view).

(Sorry if I just said something incredibly stupid.)

Laszlo
Andrew Jones March 2, 2015, 5:05 p.m. UTC | #10
On Mon, Mar 02, 2015 at 05:55:44PM +0100, Laszlo Ersek wrote:
> On 03/02/15 17:47, Paolo Bonzini wrote:
> > 
> > Also, we may want to invalidate the cache for dirty pages before
> > returning the dirty bitmap, and probably should do that directly in
> > KVM_GET_DIRTY_LOG.
> 
> "I agree."
> 
> If KVM_GET_DIRTY_LOG is supposed to be an atomic fetch-and-clear (from
> userspace's point of view), then the cache invalidation should be an
> atomic part of it too (from the same point of view).
> 
> (Sorry if I just said something incredibly stupid.)
>

With the path I'm headed down, all cache maintenance operations will
be done before exiting to userspace (and after returning). I was
actually already letting a feature creep into this PoC by setting
KVM_MEM_LOG_DIRTY_PAGES when we see KVM_MEM_INCOHERENT has been set,
and the region isn't readonly. The dirty log would then be used by
KVM internally to know exactly which pages need to be invalidated
before the exit.
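
In other words, the KVM-internal flow would look roughly like this
(function and field names are illustrative, not actual code from the
PoC):

/* before each exit to userspace: clean+invalidate every page the
 * guest dirtied in an incoherent slot since the last exit */
static void flush_incoherent_memslot(struct kvm_memory_slot *slot)
{
	unsigned long i;

	for (i = 0; i < slot->npages; i++)
		if (test_and_clear_bit(i, slot->dirty_bitmap))
			flush_dcache_guest_page(slot->base_gfn + i);
}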

drew
Mario Smarduch March 3, 2015, 2:20 a.m. UTC | #11
Hi Christoffer,

I don't understand how the CPU can handle different cache attributes
used by QEMU and the guest; won't you run into the B2.9 checklist?
Wouldn't cache evictions or cleans wipe out guest updates to the same
cache line(s)?
- Mario


On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
>> [...]
> I ran a sequence of benchmarks that I occasionally run (pbzip,
> kernbench, and hackbench) and I also saw < 1% performance degradation,
> so I think we can trust that somewhat.  (I can post the raw numbers when
> I have ssh access to my Linux desktop - sending this from Somewhere Over
> The Atlantic).
> 
> However, my concerns with these patches are on two points:
>
> 1. It's not a fix-all.  We still have the case where the guest expects
> the behavior of device memory (for strong ordering for example) on a RAM
> region, which we now break.  Similarly, this doesn't support the
> non-coherent DMA to RAM region case.
>
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
>
> Mainly because of point 1 above, I am leaning toward thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.
> 
> Thanks,
> -Christoffer
Catalin Marinas March 4, 2015, 11:35 a.m. UTC | #12
(please try to avoid top-posting)

On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> > However, my concerns with these patches are on two points:
> >
> > 1. It's not a fix-all.  We still have the case where the guest expects
> > the behavior of device memory (for strong ordering for example) on a RAM
> > region, which we now break.  Similarly, this doesn't support the
> > non-coherent DMA to RAM region case.
> >
> > 2. While the code is probably as nice as this kind of stuff gets, it
> > is non-trivial and extremely difficult to debug.  The counter-point here
> > is that we may end up handling other stuff at EL2 for performance reasons
> > in the future.
> >
> > Mainly because of point 1 above, I am leaning toward thinking userspace
> > should do the invalidation when it knows it needs to, either through KVM
> > via a memslot flag or through some other syscall mechanism.

I expressed my concerns as well, I'm definitely against merging this
series.

> I don't understand how the CPU can handle different cache attributes
> used by QEMU and the guest; won't you run into the B2.9 checklist?
> Wouldn't cache evictions or cleans wipe out guest updates to the same
> cache line(s)?

"Clean+invalidate" is a safe operation even if the guest accesses the
memory in a cacheable way. But if the guest can update the cache lines,
Qemu should avoid cache maintenance from a performance perspective.

The guest is either told that the DMA is coherent (via DT properties) or
Qemu deals with (non-)coherency itself. The latter is fully in line with
the B2.9 chapter in the ARM ARM, more precisely point 5:

  If the mismatched attributes for a memory location all assign the same
  shareability attribute to the location, any loss of uniprocessor
  semantics or coherency within a shareability domain can be avoided by
  use of software cache management.

... it continues with what kind of cache maintenance is required,
together with:

  A clean and invalidate instruction can be used instead of a clean
  instruction, or instead of an invalidate instruction.
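
For reference, in userspace that software cache management amounts to a
loop like the following (a minimal sketch: it assumes EL0 access to DC
CIVAC has been enabled via SCTLR_EL1.UCI and hardcodes a 64-byte line
size; real code should read the line size from CTR_EL0):

static void dcache_clean_inval_range(void *start, unsigned long len)
{
	unsigned long line = 64;	/* assumed cache line size */
	unsigned long addr = (unsigned long)start & ~(line - 1);
	unsigned long end = (unsigned long)start + len;

	for (; addr < end; addr += line)
		asm volatile("dc civac, %0" : : "r" (addr) : "memory");
	asm volatile("dsb sy" : : : "memory");	/* complete the maintenance */
}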
Ard Biesheuvel March 4, 2015, 11:50 a.m. UTC | #13
On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
> (please try to avoid top-posting)
>
> [...]
>
> I expressed my concerns as well, I'm definitely against merging this
> series.
>

Don't worry, that was never the intention, at least not as-is :-)

I think we have established that the performance hit is not the
problem but the correctness is.

I do have a remaining question, though: my original [non-working]
approach was to replace uncached mappings with write-through
read-allocate write-allocate, which I expected would keep the caches
in sync with main memory, but apparently I am misunderstanding
something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
get it to work: it replaces WT/RA/WA with WB/RA/WA)

Is there no way to use write-through caching here?

>> I don't understand how can the CPU handle different cache attributes
>> used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't
>> cache evictions or cleans wipe out guest updates to same cache
>> line(s)?
>
> "Clean+invalidate" is a safe operation even if the guest accesses the
> memory in a cacheable way. But if the guest can update the cache lines,
> Qemu should avoid cache maintenance from a performance perspective.
>
> The guest is either told that the DMA is coherent (via DT properties) or
> Qemu deals with (non-)coherency itself. The latter is fully in line with
> the B2.9 chapter in the ARM ARM, more precisely point 5:
>
>   If the mismatched attributes for a memory location all assign the same
>   shareability attribute to the location, any loss of uniprocessor
>   semantics or coherency within a shareability domain can be avoided by
>   use of software cache management.
>
> ... it continues with what kind of cache maintenance is required,
> together with:
>
>   A clean and invalidate instruction can be used instead of a clean
>   instruction, or instead of an invalidate instruction.
>
> --
> Catalin
Catalin Marinas March 4, 2015, 12:29 p.m. UTC | #14
On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > [...]
> >
> > I expressed my concerns as well, I'm definitely against merging this
> > series.
> 
> Don't worry, that was never the intention, at least not as-is :-)

I wasn't worried, just wanted to make my position clearer ;).

> I think we have established that the performance hit is not the
> problem but the correctness is.

I haven't looked at the performance figures but has anyone assessed the
hit caused by doing cache maintenance in Qemu vs cacheable guest
accesses (and no maintenance)?

> I do have a remaining question, though: my original [non-working]
> approach was to replace uncached mappings with write-through
> read-allocate write-allocate,

Does it make sense to have write-through and write-allocate at the same
time? The write-allocate hint would probably be ignored as write-through
writes do not generate linefills.

> which I expected would keep the caches
> in sync with main memory, but apparently I am misunderstanding
> something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
> get it to work: it replaces WT/RA/WA with WB/RA/WA)
> 
> Is there no way to use write-through caching here?

Write-through is considered non-cacheable from a write perspective when
it does not hit in the cache. AFAIK, it should still be able to hit
existing cache lines and evict. The ARM ARM states that cache cleaning
to _PoU_ is not required for coherency when the writes are to
write-through memory but I have to dig further into the PoC because
that's what we care about here.

What platform did you test it on? I can't tell what the behaviour of
system caches is. I know they intercept explicit cache maintenance by VA
but not sure what happens to write-through writes when they hit in the
system cache (are they evicted to RAM or not?). If such write-through
writes are only evicted to the point-of-unification, they won't work
since non-cacheable accesses go all the way to PoC.

I need to do more reading through the ARM ARM, it should be hidden
somewhere ;).
Ard Biesheuvel March 4, 2015, 12:43 p.m. UTC | #15
On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
>> On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > [...]
>> >
>> > I expressed my concerns as well, I'm definitely against merging this
>> > series.
>>
>> Don't worry, that was never the intention, at least not as-is :-)
>
> I wasn't worried, just wanted to make my position clearer ;).
>
>> I think we have established that the performance hit is not the
>> problem but the correctness is.
>
> I haven't looked at the performance figures but has anyone assessed the
> hit caused by doing cache maintenance in Qemu vs cacheable guest
> accesses (and no maintenance)?
>

No, I don't think so. The performance hit I am referring to is the
performance hit caused by leaving the trapping of VM system register
writes enabled all the time, so that writes to MAIR_EL1 are always
caught. This is why patch #1 implements some of the sysreg write
handling in EL2.

>> I do have a remaining question, though: my original [non-working]
>> approach was to replace uncached mappings with write-through
>> read-allocate write-allocate,
>
> Does it make sense to have write-through and write-allocate at the same
> time? The write-allocate hint would probably be ignored as write-through
> writes do not generate linefills.
>

OK, that answers my question then. The availability of a
write-allocate setting on write-through attributes suggested to me
that writes would go to both the cache and main memory, so that the
write-back cached attribute the host is using for the same memory
would not result in it reading stale data.

>> which I expected would keep the caches
>> in sync with main memory, but apparently I am misunderstanding
>> something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
>> get it to work: it replaces WT/RA/WA with WB/RA/WA)
>>
>> Is there no way to use write-through caching here?
>
> Write-through is considered non-cacheable from a write perspective when
> it does not hit in the cache. AFAIK, it should still be able to hit
> existing cache lines and evict. The ARM ARM states that cache cleaning
> to _PoU_ is not required for coherency when the writes are to
> write-through memory but I have to dig further into the PoC because
> that's what we care about here.
>
> What platform did you test it on? I can't tell what the behaviour of
> system caches is. I know they intercept explicit cache maintenance by VA
> but not sure what happens to write-through writes when they hit in the
> system cache (are they evicted to RAM or not?). If such write-through
> writes are only evicted to the point-of-unification, they won't work
> since non-cacheable accesses go all the way to PoC.
>

This was tested on APM by Drew and Laszlo (thanks guys).

I have recently received a Seattle myself, but I haven't had time yet
to test these patches myself.

> I need to do more reading through the ARM ARM,

If you say so :-)

> it should be hidden
> somewhere ;).
>
Andrew Jones March 4, 2015, 2:12 p.m. UTC | #16
On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote:
> On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> >> [...]
> >> I think we have established that the performance hit is not the
> >> problem but the correctness is.
> >
> > I haven't looked at the performance figures but has anyone assessed the
> > hit caused by doing cache maintenance in Qemu vs cacheable guest
> > accesses (and no maintenance)?
> >

I'm working on a PoC of a QEMU/KVM cache maintenance approach now.
Hopefully I'll send it out this evening. Tomorrow at the latest.
Getting numbers of that approach vs. a guest's use of cached memory
for devices would take a decent amount of additional work, so won't
be part of that post. I'm actually not sure we should care about
the numbers for a guest using normal mem attributes for device
memory - other than out of curiosity. For correctness this issue
really needs to be solved 100% host-side. We can't rely on a
guest to do different/weird things, just because it's a guest.
Ideally guests don't even know that they're guests. (Even if we
describe the memory as cache-able to the guest, I don't think we
can rely on the guest believing us.)

drew
Catalin Marinas March 4, 2015, 2:29 p.m. UTC | #17
On Wed, Mar 04, 2015 at 03:12:12PM +0100, Andrew Jones wrote:
> On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote:
> > On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> > >> I think we have established that the performance hit is not the
> > >> problem but the correctness is.
> > >
> > > I haven't looked at the performance figures but has anyone assessed the
> > > hit caused by doing cache maintenance in Qemu vs cacheable guest
> > > accesses (and no maintenance)?
> 
> I'm working on a PoC of a QEMU/KVM cache maintenance approach now.
> Hopefully I'll send it out this evening. Tomorrow at the latest.
> Getting numbers of that approach vs. a guest's use of cached memory
> for devices would take a decent amount of additional work, so won't
> be part of that post.

OK.

> I'm actually not sure we should care about
> the numbers for a guest using normal mem attributes for device
> memory - other than out of curiosity. For correctness this issue
> really needs to be solved 100% host-side. We can't rely on a
> guest to do different/weird things, just because it's a guest.
> Ideally guests don't even know that they're guests. (Even if we
describe the memory as cacheable to the guest, I don't think we
> can rely on the guest believing us.)

I disagree it is 100% a host-side issue. It is a host-side issue _if_
the host tells the guest that the (virtual) device is non-coherent (or,
more precisely, it does not explicitly tell the guest that the device is
coherent). If the guest thinks the (virtual) device is non-coherent
because of information passed by the host, I fully agree that the host
needs to manage the cache coherency.

However, the host could also pass a "dma-coherent" property in the DT
given to the guest and avoid any form of cache maintenance. If the guest
does not honour such coherency property, it's a guest problem and it
needs fixing in the guest. This isn't any different from a real physical
device behaviour.

(there are counter-arguments for the latter as well, like emulating old
platforms that never had coherency, but from a performance/production
perspective I strongly recommend that guests are passed the
"dma-coherent" property for such virtual devices)
Peter Maydell March 4, 2015, 2:34 p.m. UTC | #18
On 4 March 2015 at 23:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> I disagree it is 100% a host-side issue. It is a host-side issue _if_
> the host tells the guest that the (virtual) device is non-coherent (or,
> more precisely, it does not explicitly tell the guest that the device is
> coherent). If the guest thinks the (virtual) device is non-coherent
> because of information passed by the host, I fully agree that the host
> needs to manage the cache coherency.
>
> However, the host could also pass a "dma-coherent" property in the DT
> given to the guest and avoid any form of cache maintenance. If the guest
> does not honour such coherency property, it's a guest problem and it
> needs fixing in the guest. This isn't any different from a real physical
> device behaviour.

Right, and we should do that for things like virtio, because we want
the performance. But we also have devices (like vga framebuffers)
which shouldn't be handled as cacheable, so we need to be able to
deal with both situations.

-- PMM
Paolo Bonzini March 4, 2015, 5:03 p.m. UTC | #19
On 04/03/2015 15:29, Catalin Marinas wrote:
> I disagree it is 100% a host-side issue. It is a host-side issue _if_
> the host tells the guest that the (virtual) device is non-coherent (or,
> more precisely, it does not explicitly tell the guest that the device is
> coherent). If the guest thinks the (virtual) device is non-coherent
> because of information passed by the host, I fully agree that the host
> needs to manage the cache coherency.
> 
> However, the host could also pass a "dma-coherent" property in the DT
> given to the guest and avoid any form of cache maintenance. If the guest
> does not honour such coherency property, it's a guest problem and it
> needs fixing in the guest. This isn't any different from a real physical
> device behaviour.

Can you add that property to the device tree for PCI devices too?

Paolo
Catalin Marinas March 4, 2015, 5:28 p.m. UTC | #20
On Wed, Mar 04, 2015 at 06:03:11PM +0100, Paolo Bonzini wrote:
> 
> 
> On 04/03/2015 15:29, Catalin Marinas wrote:
> > I disagree it is 100% a host-side issue. It is a host-side issue _if_
> > the host tells the guest that the (virtual) device is non-coherent (or,
> > more precisely, it does not explicitly tell the guest that the device is
> > coherent). If the guest thinks the (virtual) device is non-coherent
> > because of information passed by the host, I fully agree that the host
> > needs to manage the cache coherency.
> > 
> > However, the host could also pass a "dma-coherent" property in the DT
> > given to the guest and avoid any form of cache maintenance. If the guest
> > does not honour such coherency property, it's a guest problem and it
> > needs fixing in the guest. This isn't any different from a real physical
> > device behaviour.
> 
> Can you add that property to the device tree for PCI devices too?

Yes but not with mainline yet:

http://thread.gmane.org/gmane.linux.kernel.iommu/8935

We can add the property at the PCI host bridge level (with the drawback
that it covers all the PCI devices), like here:

Documentation/devicetree/bindings/pci/xgene-pci.txt
Paolo Bonzini March 5, 2015, 10:12 a.m. UTC | #21
On 04/03/2015 18:28, Catalin Marinas wrote:
>> > 
>> > Can you add that property to the device tree for PCI devices too?
> Yes but not with mainline yet:
> 
> http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> 
> We can add the property at the PCI host bridge level (with the drawback
> that it covers all the PCI devices), like here:

Even covering all PCI devices is not enough if we want to support device
assignment of PCI host devices.  I'd rather not spend effort on a
solution that we know will only half-work a few months down the road.

Paolo

> Documentation/devicetree/bindings/pci/xgene-pci.txt
Catalin Marinas March 5, 2015, 11:04 a.m. UTC | #22
On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
> On 04/03/2015 18:28, Catalin Marinas wrote:
> >> > Can you add that property to the device tree for PCI devices too?
> >
> > Yes but not with mainline yet:
> > 
> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> > 
> > We can add the property at the PCI host bridge level (with the drawback
> > that it covers all the PCI devices), like here:
> 
> Even covering all PCI devices is not enough if we want to support device
> assignment of PCI host devices. 

Can we not have another PCI bridge node in the DT for the host device
assignments?

> I'd rather not spend effort on a solution that we know will only
> half-work a few months down the road.

Which of the solutions are you referring to? On the Applied Micro
boards, for example, the PCIe is coherent. If you do device assignment,
the guest must be aware of the coherency property of the physical device
and behave accordingly; there isn't much the host can do about it.
Peter Maydell March 5, 2015, 11:52 a.m. UTC | #23
On 5 March 2015 at 20:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
>> On 04/03/2015 18:28, Catalin Marinas wrote:
>> >> > Can you add that property to the device tree for PCI devices too?
>> >
>> > Yes but not with mainline yet:
>> >
>> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
>> >
>> > We can add the property at the PCI host bridge level (with the drawback
>> > that it covers all the PCI devices), like here:
>>
>> Even covering all PCI devices is not enough if we want to support device
>> assignment of PCI host devices.
>
> Can we not have another PCI bridge node in the DT for the host device
> assignments?

I'd hate to have to do that. PCI should be entirely probeable
given that we tell the guest where the host bridge is, that's
one of its advantages.

-- PMM
Catalin Marinas March 5, 2015, 12:03 p.m. UTC | #24
On Thu, Mar 05, 2015 at 08:52:49PM +0900, Peter Maydell wrote:
> On 5 March 2015 at 20:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
> >> On 04/03/2015 18:28, Catalin Marinas wrote:
> >> >> > Can you add that property to the device tree for PCI devices too?
> >> >
> >> > Yes but not with mainline yet:
> >> >
> >> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> >> >
> >> > We can add the property at the PCI host bridge level (with the drawback
> >> > that it covers all the PCI devices), like here:
> >>
> >> Even covering all PCI devices is not enough if we want to support device
> >> assignment of PCI host devices.
> >
> > Can we not have another PCI bridge node in the DT for the host device
> > assignments?
> 
> I'd hate to have to do that. PCI should be entirely probeable
> given that we tell the guest where the host bridge is, that's
> one of its advantages.

I didn't say a DT node per device, the DT doesn't know what PCI devices
are available (otherwise it defeats the idea of probing). But we need to
tell the OS where the host bridge is via DT.

So the guest would be told about two host bridges: one for real devices
and another for virtual devices. These can have different coherency
properties.
Paolo Bonzini March 5, 2015, 12:26 p.m. UTC | #25
On 05/03/2015 13:03, Catalin Marinas wrote:
>> > 
>> > I'd hate to have to do that. PCI should be entirely probeable
>> > given that we tell the guest where the host bridge is, that's
>> > one of its advantages.
> I didn't say a DT node per device, the DT doesn't know what PCI devices
> are available (otherwise it defeats the idea of probing). But we need to
> tell the OS where the host bridge is via DT.
> 
> So the guest would be told about two host bridges: one for real devices
> and another for virtual devices. These can have different coherency
> properties.

Yeah, and it would suck that the user needs to know the difference
between the coherency properties of the host bridges.

It would especially suck if the user has a cluster with different
machines, some of them coherent and others non-coherent, and then has to
debug why the same configuration works on some machines and not on others.


To avoid replying in two different places, which of the solutions look
to me like something that half-works?  Pretty much all of them, because
in the end it is just a processor misfeature.  For example, Intel
virtualization extensions let the hypervisor override stage1 translation
_if necessary_.  AMD doesn't, but has some other quirky things that let
you achieve the same effect.

In particular, I am not even sure that this is about bus coherency,
because this problem does not happen when the device is doing bus master
DMA.  Working around coherency for bus master DMA would be easy.

The problem arises with MMIO areas that the guest can reasonably expect
to be uncacheable, but that are optimized by the host so that they end
up backed by cacheable RAM.  It's perfectly reasonable that the same
device needs cacheable mapping with one userspace, and works with
uncacheable mapping with another userspace that doesn't optimize the
MMIO area to RAM.

Currently the VGA framebuffer is the main case where this happens, and I
don't expect many more.  Because this is not bus master DMA, it's hard
to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
just reading from an array of chars.

In practice, the VGA framebuffer has an optimization that uses dirty
page tracking, so we could piggyback on the ioctls that return which
pages are dirty.  It turns out that piggybacking on those ioctls also
should fix the case of migrating a guest while the MMU is disabled.

But it's a hack, and it may not work for other devices.
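
Still, a sketch of what the piggybacking could look like on the QEMU
side (assumptions: AArch64 userspace cache maintenance enabled by the
kernel via SCTLR_EL1.UCI, 64-byte cache lines, and a dirty bitmap
already fetched with KVM_GET_DIRTY_LOG):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define CLINE		64UL	/* illustrative; read CTR_EL0 in real code */

/* Clean+invalidate a virtual address range by cache line. */
static void civac_range(void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(CLINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CLINE)
		asm volatile("dc civac, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" : : : "memory");
}

/* Make guest writes to dirty framebuffer pages visible before
 * QEMU reads them. */
static void sync_dirty_fb(uint8_t *fb, const unsigned long *bitmap,
			  size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		if (bitmap[i / (8 * sizeof(long))] &
		    (1UL << (i % (8 * sizeof(long)))))
			civac_range(fb + i * PAGE_SIZE, PAGE_SIZE);
}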

We could use _DSD to export the device tree property separately for each
device, but that wouldn't work for hotplugged devices.

Paolo
Catalin Marinas March 5, 2015, 2:58 p.m. UTC | #26
On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote:
> On 05/03/2015 13:03, Catalin Marinas wrote:
> >> > I'd hate to have to do that. PCI should be entirely probeable
> >> > given that we tell the guest where the host bridge is, that's
> >> > one of its advantages.
> > I didn't say a DT node per device, the DT doesn't know what PCI devices
> > are available (otherwise it defeats the idea of probing). But we need to
> > tell the OS where the host bridge is via DT.
> > 
> > So the guest would be told about two host bridges: one for real devices
> > and another for virtual devices. These can have different coherency
> > properties.
> 
> Yeah, and it would suck that the user needs to know the difference
> between the coherency properties of the host bridges.

The host needs to know about this, unless we assume full coherency on
all the platforms. Arguably, Qemu needs to know as well if it is the one
generating the DT for the guest (or at least passing some snippets from
the host DT).

> It would especially suck if the user has a cluster with different
> machines, some of them coherent and others non-coherent, and then has to
> debug why the same configuration works on some machines and not on others.

That's a problem indeed, especially with guest migration. But I don't
think we have any sane solution here for the bus master DMA.

> To avoid replying in two different places, which of the solutions look
> to me like something that half-works?  Pretty much all of them, because
> in the end it is just a processor misfeature.  For example, Intel
> virtualization extensions let the hypervisor override stage1 translation
> _if necessary_.  AMD doesn't, but has some other quirky things that let
> you achieve the same effect.

ARM can override them as well, but only by making them stricter.
Otherwise, on a weakly ordered architecture, it's not always safe (let's
say the guest thinks it accesses Strongly Ordered memory and avoids
barriers for flag updates, but the host "upgrades" it to Cacheable,
which breaks the memory order).

If we want the host to enforce guest memory mapping attributes via stage
2, we could do it the other way around: get the guests to always assume
full cache coherency, generating Normal Cacheable mappings, but use the
stage 2 attributes restriction in the host to make such mappings
non-cacheable when needed (it works this way on ARM but not in the other
direction to relax the attributes).
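
The asymmetry, as a conceptual sketch (PAGE_S2 and PAGE_S2_DEVICE
mirror the pgprot names arm/arm64 KVM uses for stage 2 attributes):

#include <linux/types.h>
#include <asm/pgtable.h>

/* Stage 2 can make a guest mapping less cacheable, never more,
 * so restricting is the only direction that works. */
static pgprot_t stage2_prot_for(bool coherent_ram)
{
	return coherent_ram ? PAGE_S2 : PAGE_S2_DEVICE;
}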

> In particular, I am not even sure that this is about bus coherency,
> because this problem does not happen when the device is doing bus master
> DMA.  Working around coherency for bus master DMA would be easy.

My previous emails on the "dma-coherent" property were only about bus
master DMA (which would cause the correct selection of the DMA API ops
in the guest).

But even for bus master DMA, the guest OS still needs to be aware of the
(virtual) device's DMA capabilities (cache coherent or not). You may be
able to work around it in the host (stage 2, explicit cache flushing or
SMMU attributes) if the guest assumes non-coherency, but it's not really
efficient (nor nice to implement).

> The problem arises with MMIO areas that the guest can reasonably expect
> to be uncacheable, but that are optimized by the host so that they end
> up backed by cacheable RAM.  It's perfectly reasonable that the same
> device needs cacheable mapping with one userspace, and works with
> uncacheable mapping with another userspace that doesn't optimize the
> MMIO area to RAM.

Unless the guest allocates the framebuffer itself (e.g.
dma_alloc_coherent), we can't control the cacheability via
"dma-coherent" properties as it refers to bus master DMA.

So for MMIO with the buffer allocated by the host (Qemu), the only
solution I see on ARM is for the host to ensure coherency, either via
explicit cache maintenance (new KVM API) or by changing the memory
attributes used by Qemu to access such virtual MMIO.

Basically Qemu is acting as a bus master when reading the framebuffer it
allocated but the guest considers it a slave access and we don't have a
way to tell the guest that such accesses should be cacheable, nor can we
upgrade them via architecture features.

> Currently the VGA framebuffer is the main case where this happens, and I
> don't expect many more.  Because this is not bus master DMA, it's hard
> to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
> just reading from an array of chars.

I now understand the problem better. I was under the impression that the
guest allocates the framebuffer itself and tells Qemu where it is (like
in amba-clcd.c for example).

> In practice, the VGA framebuffer has an optimization that uses dirty
> page tracking, so we could piggyback on the ioctls that return which
> pages are dirty.  It turns out that piggybacking on those ioctls also
> should fix the case of migrating a guest while the MMU is disabled.

Yes, Qemu would need to invalidate the cache before reading a dirty
framebuffer page.

As I said above, an API that allows non-cacheable mappings for the VGA
framebuffer in Qemu would also solve the problem. I'm not sure what KVM
provides here (or whether we can add such an API).

> We could use _DSD to export the device tree property separately for each
> device, but that wouldn't work for hotplugged devices.

This would only work for bus master DMA, so it doesn't solve the VGA
framebuffer issue.
Paolo Bonzini March 5, 2015, 5:43 p.m. UTC | #27
On 05/03/2015 15:58, Catalin Marinas wrote:
>> It would especially suck if the user has a cluster with different
>> machines, some of them coherent and others non-coherent, and then has to
>> debug why the same configuration works on some machines and not on others.
> 
> That's a problem indeed, especially with guest migration. But I don't
> think we have any sane solution here for the bus master DMA.

I do not oppose doing cache management in QEMU for bus master DMA
(though if the solution you outlined below works it would be great).

> ARM can override them as well but only making them stricter. Otherwise,
> on a weakly ordered architecture, it's not always safe (let's say the
> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> flag updates but the host "upgrades" it to Cacheable which breaks the
> memory order).

The same can happen on x86 though, even if it's rarer.  You still need a
barrier between stores and loads.

> If we want the host to enforce guest memory mapping attributes via stage
> 2, we could do it the other way around: get the guests to always assume
> full cache coherency, generating Normal Cacheable mappings, but use the
> stage 2 attributes restriction in the host to make such mappings
> non-cacheable when needed (it works this way on ARM but not in the other
> direction to relax the attributes).

That sounds like a plan for device assignment.  But it still would not
solve the problem of the MMIO framebuffer, right?

>> The problem arises with MMIO areas that the guest can reasonably expect
>> to be uncacheable, but that are optimized by the host so that they end
>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>> device needs cacheable mapping with one userspace, and works with
>> uncacheable mapping with another userspace that doesn't optimize the
>> MMIO area to RAM.
> 
> Unless the guest allocates the framebuffer itself (e.g.
> dma_alloc_coherent), we can't control the cacheability via
> "dma-coherent" properties as it refers to bus master DMA.

Okay, it's good to rule that out.  One less thing to think about. :)
Same for _DSD.

> So for MMIO with the buffer allocated by the host (Qemu), the only
> solution I see on ARM is for the host to ensure coherency, either via
> explicit cache maintenance (new KVM API) or by changing the memory
> attributes used by Qemu to access such virtual MMIO.
> 
> Basically Qemu is acting as a bus master when reading the framebuffer it
> allocated but the guest considers it a slave access and we don't have a
> way to tell the guest that such accesses should be cacheable, nor can we
> upgrade them via architecture features.

Yes, that's a way to put it.

>> In practice, the VGA framebuffer has an optimization that uses dirty
>> page tracking, so we could piggyback on the ioctls that return which
>> pages are dirty.  It turns out that piggybacking on those ioctls also
>> should fix the case of migrating a guest while the MMU is disabled.
> 
> Yes, Qemu would need to invalidate the cache before reading a dirty
> framebuffer page.
> 
> As I said above, an API that allows non-cacheable mappings for the VGA
> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> provides here (or whether we can add such API).

Nothing for now; other architectures simply do not have the issue.

As long as it's just VGA, we can quirk it.  There's just a couple
vendor/device IDs to catch, and the guest can then use a cacheable mapping.

For a more generic solution, the API would be madvise(MADV_DONTCACHE).
It would be easy for QEMU to use it, but I am not too optimistic about
convincing the mm folks about it.  We can try.
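
The QEMU side would be trivial; a hypothetical sketch (MADV_DONTCACHE
does not exist in mainline, so the flag value below is made up):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_DONTCACHE
#define MADV_DONTCACHE	100	/* made-up value for the sketch */
#endif

static int make_framebuffer_uncached(void *fb, size_t size)
{
	/* Ask the kernel to remap this range non-cacheable. */
	return madvise(fb, size, MADV_DONTCACHE);
}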

Paolo
Ard Biesheuvel March 5, 2015, 7:13 p.m. UTC | #28
On 5 March 2015 at 15:58, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote:
>> On 05/03/2015 13:03, Catalin Marinas wrote:
>> >> > I'd hate to have to do that. PCI should be entirely probeable
>> >> > given that we tell the guest where the host bridge is, that's
>> >> > one of its advantages.
>> > I didn't say a DT node per device, the DT doesn't know what PCI devices
>> > are available (otherwise it defeats the idea of probing). But we need to
>> > tell the OS where the host bridge is via DT.
>> >
>> > So the guest would be told about two host bridges: one for real devices
>> > and another for virtual devices. These can have different coherency
>> > properties.
>>
>> Yeah, and it would suck that the user needs to know the difference
>> between the coherency properties of the host bridges.
>
> The host needs to know about this, unless we assume full coherency on
> all the platforms. Arguably, Qemu needs to know as well if it is the one
> generating the DT for guest (or at least passing some snippets from the
> host DT).
>
>> It would especially suck if the user has a cluster with different
>> machines, some of them coherent and others non-coherent, and then has to
>> debug why the same configuration works on some machines and not on others.
>
> That's a problem indeed, especially with guest migration. But I don't
> think we have any sane solution here for the bus master DMA.
>
>> To avoid replying in two different places, which of the solutions look
>> to me like something that half-works?  Pretty much all of them, because
>> in the end it is just a processor misfeature.  For example, Intel
>> virtualization extensions let the hypervisor override stage1 translation
>> _if necessary_.  AMD doesn't, but has some other quirky things that let
>> you achieve the same effect.
>
> ARM can override them as well but only making them stricter. Otherwise,
> on a weakly ordered architecture, it's not always safe (let's say the
> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> flag updates but the host "upgrades" it to Cacheable which breaks the
> memory order).
>
> If we want the host to enforce guest memory mapping attributes via stage
> 2, we could do it the other way around: get the guests to always assume
> full cache coherency, generating Normal Cacheable mappings, but use the
> stage 2 attributes restriction in the host to make such mappings
> non-cacheable when needed (it works this way on ARM but not in the other
> direction to relax the attributes).
>

This was precisely the idea of the MAIR mangling patch: promote all
stage1 mappings to cacheable, and put the host in control by allowing
it to supersede them with device mappings in stage 2.
But it appears there are too many corner cases where this doesn't
quite work out.


>> In particular, I am not even sure that this is about bus coherency,
>> because this problem does not happen when the device is doing bus master
>> DMA.  Working around coherency for bus master DMA would be easy.
>
> My previous emails on the "dma-coherent" property were only about bus
> master DMA (which would cause the correct selection of the DMA API ops
> in the guest).
>
> But even for bus master DMA, guest OS still needs to be aware of the
> (virtual) device DMA capabilities (cache coherent or not). You may be
> able to work around it in the host (stage 2, explicit cache flushing or
> SMMU attributes) if the guests assumes non-coherency but it's not really
> efficient (nor nice to implement).
>
>> The problem arises with MMIO areas that the guest can reasonably expect
>> to be uncacheable, but that are optimized by the host so that they end
>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>> device needs cacheable mapping with one userspace, and works with
>> uncacheable mapping with another userspace that doesn't optimize the
>> MMIO area to RAM.
>
> Unless the guest allocates the framebuffer itself (e.g.
> dma_alloc_coherent), we can't control the cacheability via
> "dma-coherent" properties as it refers to bus master DMA.
>
> So for MMIO with the buffer allocated by the host (Qemu), the only
> solution I see on ARM is for the host to ensure coherency, either via
> explicit cache maintenance (new KVM API) or by changing the memory
> attributes used by Qemu to access such virtual MMIO.
>
> Basically Qemu is acting as a bus master when reading the framebuffer it
> allocated but the guest considers it a slave access and we don't have a
> way to tell the guest that such accesses should be cacheable, nor can we
> upgrade them via architecture features.
>
>> Currently the VGA framebuffer is the main case where this happens, and I
>> don't expect many more.  Because this is not bus master DMA, it's hard
>> to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
>> just reading from an array of chars.
>
> I now understand the problem better. I was under the impression that the
> guest allocates the framebuffer itself and tells Qemu where it is (like
> in amba-clcd.c for example).
>
>> In practice, the VGA framebuffer has an optimization that uses dirty
>> page tracking, so we could piggyback on the ioctls that return which
>> pages are dirty.  It turns out that piggybacking on those ioctls also
>> should fix the case of migrating a guest while the MMU is disabled.
>
> Yes, Qemu would need to invalidate the cache before reading a dirty
> framebuffer page.
>
> As I said above, an API that allows non-cacheable mappings for the VGA
> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> provides here (or whether we can add such API).
>
>> We could use _DSD to export the device tree property separately for each
>> device, but that wouldn't work for hotplugged devices.
>
> This would only work for bus master DMA, so it doesn't solve the VGA
> framebuffer issue.
>
> --
> Catalin
Mario Smarduch March 6, 2015, 8:33 p.m. UTC | #29
On 03/04/2015 03:35 AM, Catalin Marinas wrote:
> (please try to avoid top-posting)
> 
> On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
>> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
>>> However, my concern with these patches are on two points:
>>>
>>> 1. It's not a fix-all.  We still have the case where the guest expects
>>> the behavior of device memory (for strong ordering for example) on a RAM
>>> region, which we now break.  Similarly this doesn't support the
>>> non-coherent DMA to RAM region case.
>>>
>>> 2. While the code is probably as nice as this kind of stuff gets, it
>>> is non-trivial and extremely difficult to debug.  The counter-point here
>>> is that we may end up handling other stuff at EL2 for performance reasons
>>> in the future.
>>>
>>> Mainly because of point 1 above, I am leaning to thinking userspace
>>> should do the invalidation when it knows it needs to, either through KVM
>>> via a memslot flag or through some other syscall mechanism.
> 
> I expressed my concerns as well, I'm definitely against merging this
> series.
> 
>> I don't understand how the CPU can handle different cache attributes
>> used by QEMU and the guest. Won't you run into the B2.9 checklist?
>> Wouldn't cache evictions or cleans wipe out guest updates to the same
>> cache line(s)?
> 
> "Clean+invalidate" is a safe operation even if the guest accesses the
> memory in a cacheable way. But if the guest can update the cache lines,
> Qemu should avoid cache maintenance from a performance perspective.
> 
> The guest is either told that the DMA is coherent (via DT properties) or
> Qemu deals with (non-)coherency itself. The latter is fully in line with
> the B2.9 chapter in the ARM ARM, more precisely point 5:
> 
>   If the mismatched attributes for a memory location all assign the same
>   shareability attribute to the location, any loss of uniprocessor
>   semantics or coherency within a shareability domain can be avoided by
>   use of software cache management.
> 
> ... it continues with what kind of cache maintenance is required,
> together with:
> 
>   A clean and invalidate instruction can be used instead of a clean
>   instruction, or instead of an invalidate instruction.
> 
Hi Catalin,
  sorry for the top posting. I'm struggling with QEMU cache
maintenance for devices whose registers are not cache line aligned
and which may be multi-function. For lack of a better example I
thought of the sp804, which supports two timers with registers
covered by one cache line. Wouldn't QEMU cache maintenance on one
device have the potential to corrupt the second device? The two
timers could be used by two guest threads in parallel.

I get bullets 2 and 3; I'm still working on the first one, which
will take a while.

Thanks,
- Mario
Mario Smarduch March 6, 2015, 9:08 p.m. UTC | #30
On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
> 
> 
> On 05/03/2015 15:58, Catalin Marinas wrote:
>>> It would especially suck if the user has a cluster with different
>>> machines, some of them coherent and others non-coherent, and then has to
>>> debug why the same configuration works on some machines and not on others.
>>
>> That's a problem indeed, especially with guest migration. But I don't
>> think we have any sane solution here for the bus master DMA.
> 
> I do not oppose doing cache management in QEMU for bus master DMA
> (though if the solution you outlined below works it would be great).
> 
>> ARM can override them as well but only making them stricter. Otherwise,
>> on a weakly ordered architecture, it's not always safe (let's say the
>> guest thinks it accesses Strongly Ordered memory and avoids barriers for
>> flag updates but the host "upgrades" it to Cacheable which breaks the
>> memory order).
> 
> The same can happen on x86 though, even if it's rarer.  You still need a
> barrier between stores and loads.
> 
>> If we want the host to enforce guest memory mapping attributes via stage
>> 2, we could do it the other way around: get the guests to always assume
>> full cache coherency, generating Normal Cacheable mappings, but use the
>> stage 2 attributes restriction in the host to make such mappings
>> non-cacheable when needed (it works this way on ARM but not in the other
>> direction to relax the attributes).
> 
> That sounds like a plan for device assignment.  But it still would not
> solve the problem of the MMIO framebuffer, right?
> 
>>> The problem arises with MMIO areas that the guest can reasonably expect
>>> to be uncacheable, but that are optimized by the host so that they end
>>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>>> device needs cacheable mapping with one userspace, and works with
>>> uncacheable mapping with another userspace that doesn't optimize the
>>> MMIO area to RAM.
>>
>> Unless the guest allocates the framebuffer itself (e.g.
>> dma_alloc_coherent), we can't control the cacheability via
>> "dma-coherent" properties as it refers to bus master DMA.
> 
> Okay, it's good to rule that out.  One less thing to think about. :)
> Same for _DSD.
> 
>> So for MMIO with the buffer allocated by the host (Qemu), the only
>> solution I see on ARM is for the host to ensure coherency, either via
>> explicit cache maintenance (new KVM API) or by changing the memory
>> attributes used by Qemu to access such virtual MMIO.
>>
>> Basically Qemu is acting as a bus master when reading the framebuffer it
>> allocated but the guest considers it a slave access and we don't have a
>> way to tell the guest that such accesses should be cacheable, nor can we
>> upgrade them via architecture features.
> 
> Yes, that's a way to put it.
> 
>>> In practice, the VGA framebuffer has an optimization that uses dirty
>>> page tracking, so we could piggyback on the ioctls that return which
>>> pages are dirty.  It turns out that piggybacking on those ioctls also
>>> should fix the case of migrating a guest while the MMU is disabled.
>>
>> Yes, Qemu would need to invalidate the cache before reading a dirty
>> framebuffer page.
>>
>> As I said above, an API that allows non-cacheable mappings for the VGA
>> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
>> provides here (or whether we can add such API).
> 
> Nothing for now; other architectures simply do not have the issue.
> 
> As long as it's just VGA, we can quirk it.  There's just a couple
> vendor/device IDs to catch, and the guest can then use a cacheable mapping.
> 
> For a more generic solution, the API would be madvise(MADV_DONTCACHE).
> It would be easy for QEMU to use it, but I am not too optimistic about
> convincing the mm folks about it.  We can try.

Interested to see the outcome.

I was thinking of a very basic memory driver that can provide
an uncached memslot to QEMU: in the mmap() file operation, apply
pgprot_uncached to the allocated pages, lock them, flush the TLB,
and call remap_pfn_range().
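
Something along these lines (illustrative sketch, not a posted patch;
pgprot_noncached() stands in for the pgprot_uncached mentioned above,
and allocation, pinning and error handling are omitted):

#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>

static void *backing;	/* page-aligned buffer, allocated at open time */

static int uncached_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = virt_to_phys(backing) >> PAGE_SHIFT;

	/* Non-cacheable user mapping; VM_IO/VM_DONTDUMP keep the
	 * mm from touching or dumping the pages. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}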

Mario

> 
> Paolo
Andrew Jones March 9, 2015, 2:26 p.m. UTC | #31
On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote:
> On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
> > 
> > 
> > On 05/03/2015 15:58, Catalin Marinas wrote:
> >>> It would especially suck if the user has a cluster with different
> >>> machines, some of them coherent and others non-coherent, and then has to
> >>> debug why the same configuration works on some machines and not on others.
> >>
> >> That's a problem indeed, especially with guest migration. But I don't
> >> think we have any sane solution here for the bus master DMA.
> > 
> > I do not oppose doing cache management in QEMU for bus master DMA
> > (though if the solution you outlined below works it would be great).
> > 
> >> ARM can override them as well but only making them stricter. Otherwise,
> >> on a weakly ordered architecture, it's not always safe (let's say the
> >> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> >> flag updates but the host "upgrades" it to Cacheable which breaks the
> >> memory order).
> > 
> > The same can happen on x86 though, even if it's rarer.  You still need a
> > barrier between stores and loads.
> > 
> >> If we want the host to enforce guest memory mapping attributes via stage
> >> 2, we could do it the other way around: get the guests to always assume
> >> full cache coherency, generating Normal Cacheable mappings, but use the
> >> stage 2 attributes restriction in the host to make such mappings
> >> non-cacheable when needed (it works this way on ARM but not in the other
> >> direction to relax the attributes).
> > 
> > That sounds like a plan for device assignment.  But it still would not
> > solve the problem of the MMIO framebuffer, right?
> > 
> >>> The problem arises with MMIO areas that the guest can reasonably expect
> >>> to be uncacheable, but that are optimized by the host so that they end
> >>> up backed by cacheable RAM.  It's perfectly reasonable that the same
> >>> device needs cacheable mapping with one userspace, and works with
> >>> uncacheable mapping with another userspace that doesn't optimize the
> >>> MMIO area to RAM.
> >>
> >> Unless the guest allocates the framebuffer itself (e.g.
> >> dma_alloc_coherent), we can't control the cacheability via
> >> "dma-coherent" properties as it refers to bus master DMA.
> > 
> > Okay, it's good to rule that out.  One less thing to think about. :)
> > Same for _DSD.
> > 
> >> So for MMIO with the buffer allocated by the host (Qemu), the only
> >> solution I see on ARM is for the host to ensure coherency, either via
> >> explicit cache maintenance (new KVM API) or by changing the memory
> >> attributes used by Qemu to access such virtual MMIO.
> >>
> >> Basically Qemu is acting as a bus master when reading the framebuffer it
> >> allocated but the guest considers it a slave access and we don't have a
> >> way to tell the guest that such accesses should be cacheable, nor can we
> >> upgrade them via architecture features.
> > 
> > Yes, that's a way to put it.
> > 
> >>> In practice, the VGA framebuffer has an optimization that uses dirty
> >>> page tracking, so we could piggyback on the ioctls that return which
> >>> pages are dirty.  It turns out that piggybacking on those ioctls also
> >>> should fix the case of migrating a guest while the MMU is disabled.
> >>
> >> Yes, Qemu would need to invalidate the cache before reading a dirty
> >> framebuffer page.
> >>
> >> As I said above, an API that allows non-cacheable mappings for the VGA
> >> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> >> provides here (or whether we can add such API).
> > 
> > Nothing for now; other architectures simply do not have the issue.
> > 
> > As long as it's just VGA, we can quirk it.  There's just a couple
> > vendor/device IDs to catch, and the guest can then use a cacheable mapping.
> > 
> > For a more generic solution, the API would be madvise(MADV_DONTCACHE).
> > It would be easy for QEMU to use it, but I am not too optimistic about
> > convincing the mm folks about it.  We can try.

I forgot to list this one in my summary of approaches[*]. This is a
nice, clean approach that avoids getting cache maintenance into
everything. However, besides the difficulty of getting it past the mm
people, it reduces performance for any userspace-userspace use/sharing
of the memory; only userspace-guest sharing requires cache maintenance,
nothing else. Maybe that's not an important concern for the few
emulated devices that need it though.

> 
> Interested to see the outcome.
> 
> I was thinking of a very basic memory driver that can provide
>> an uncached memslot to QEMU: in the mmap() file operation,
>> apply pgprot_uncached to the allocated pages, lock them, flush
>> the TLB, and call remap_pfn_range().

I guess this is the same as the madvise approach, but with a driver.
KVM could take this approach itself when memslots are added/updated
with the INCOHERENT flag. Maybe worth some experimental patches to
find out?
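
A hypothetical userspace sketch of the memslot variant
(KVM_MEM_INCOHERENT is not a mainline flag, so the bit value below is
made up):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_INCOHERENT
#define KVM_MEM_INCOHERENT	(1U << 2)	/* made-up bit */
#endif

static int add_incoherent_slot(int vmfd, uint32_t slot, uint64_t gpa,
			       uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_INCOHERENT,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (uintptr_t)hva,
	};

	/* KVM would then map (or maintain) the slot uncached itself. */
	return ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
}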

I'm still thinking about experimenting with the ARM private syscalls
next though.

drew

[*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html
Mario Smarduch March 9, 2015, 3:33 p.m. UTC | #32
On 03/09/2015 07:26 AM, Andrew Jones wrote:
> On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote:
>> On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 05/03/2015 15:58, Catalin Marinas wrote:
>>>>> It would especially suck if the user has a cluster with different
>>>>> machines, some of them coherent and others non-coherent, and then has to
>>>>> debug why the same configuration works on some machines and not on others.
>>>>
>>>> That's a problem indeed, especially with guest migration. But I don't
>>>> think we have any sane solution here for the bus master DMA.
>>>
>>> I do not oppose doing cache management in QEMU for bus master DMA
>>> (though if the solution you outlined below works it would be great).
>>>
>>>> ARM can override them as well but only making them stricter. Otherwise,
>>>> on a weakly ordered architecture, it's not always safe (let's say the
>>>> guest thinks it accesses Strongly Ordered memory and avoids barriers for
>>>> flag updates but the host "upgrades" it to Cacheable which breaks the
>>>> memory order).
>>>
>>> The same can happen on x86 though, even if it's rarer.  You still need a
>>> barrier between stores and loads.
>>>
>>>> If we want the host to enforce guest memory mapping attributes via stage
>>>> 2, we could do it the other way around: get the guests to always assume
>>>> full cache coherency, generating Normal Cacheable mappings, but use the
>>>> stage 2 attributes restriction in the host to make such mappings
>>>> non-cacheable when needed (it works this way on ARM but not in the other
>>>> direction to relax the attributes).
>>>
>>> That sounds like a plan for device assignment.  But it still would not
>>> solve the problem of the MMIO framebuffer, right?
>>>
>>>>> The problem arises with MMIO areas that the guest can reasonably expect
>>>>> to be uncacheable, but that are optimized by the host so that they end
>>>>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>>>>> device needs cacheable mapping with one userspace, and works with
>>>>> uncacheable mapping with another userspace that doesn't optimize the
>>>>> MMIO area to RAM.
>>>>
>>>> Unless the guest allocates the framebuffer itself (e.g.
>>>> dma_alloc_coherent), we can't control the cacheability via
>>>> "dma-coherent" properties as it refers to bus master DMA.
>>>
>>> Okay, it's good to rule that out.  One less thing to think about. :)
>>> Same for _DSD.
>>>
>>>> So for MMIO with the buffer allocated by the host (Qemu), the only
>>>> solution I see on ARM is for the host to ensure coherency, either via
>>>> explicit cache maintenance (new KVM API) or by changing the memory
>>>> attributes used by Qemu to access such virtual MMIO.
>>>>
>>>> Basically Qemu is acting as a bus master when reading the framebuffer it
>>>> allocated but the guest considers it a slave access and we don't have a
>>>> way to tell the guest that such accesses should be cacheable, nor can we
>>>> upgrade them via architecture features.
>>>
>>> Yes, that's a way to put it.
>>>
>>>>> In practice, the VGA framebuffer has an optimization that uses dirty
>>>>> page tracking, so we could piggyback on the ioctls that return which
>>>>> pages are dirty.  It turns out that piggybacking on those ioctls also
>>>>> should fix the case of migrating a guest while the MMU is disabled.
>>>>
>>>> Yes, Qemu would need to invalidate the cache before reading a dirty
>>>> framebuffer page.
>>>>
>>>> As I said above, an API that allows non-cacheable mappings for the VGA
>>>> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
>>>> provides here (or whether we can add such API).
>>>
>>> Nothing for now; other architectures simply do not have the issue.
>>>
>>> As long as it's just VGA, we can quirk it.  There's just a couple
>>> vendor/device IDs to catch, and the guest can then use a cacheable mapping.
>>>
>>> For a more generic solution, the API would be madvise(MADV_DONTCACHE).
>>> It would be easy for QEMU to use it, but I am not too optimistic about
>>> convincing the mm folks about it.  We can try.
> 
> I forgot to list this one in my summary of approaches[*]. This is a
> nice, clean approach. Avoids getting cache maintenance into everything.
> However, besides the difficulty to get it past mm people, it reduces
> performance for any userspace-userspace uses/sharing of the memory.
> userspace-guest requires cache maintenance, but nothing else. Maybe
> that's not an important concern for the few emulated devices that need
> it though.
> 
>>
>> Interested to see the outcome.
>>
>> I was thinking of a very basic memory driver that can provide
>> an uncached memslot to QEMU: in the mmap() file operation,
>> apply pgprot_uncached to the allocated pages, lock them, flush
>> the TLB, and call remap_pfn_range().
> 
> I guess this is the same as the madvise approach, but with a driver.
> KVM could take this approach itself when memslots are added/updated
> with the INCOHERENT flag. Maybe worth some experimental patches to
> find out?

I would work on this, but I'm tied up for the next 3 weeks.
If anyone is interested I can provide base code; I used it for
memory passthrough, although testing may be time consuming.
I think the hurdle here is making sure the kernel doesn't remap
these pages for any reason, like page migration; locking the pages
should tell the kernel not to touch them. madvise() is the desired
solution, but I suspect it might take a while to get in.
> 
> I'm still thinking about experimenting with the ARM private syscalls
> next though.

Hope it succeeds.
> 
> drew
> 
> [*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html
>

Patch

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index bb91b6fc63861..5de39d740aa58 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -770,7 +770,7 @@ 
 
 	mrs	x2, mdcr_el2
 	and	x2, x2, #MDCR_EL2_HPMN_MASK
-	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+//	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
 	orr	x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
 
 	// Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
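
With the PMU traps disabled as in the hunk above, a guest can program
the cycle counter directly; note that PMCCNTR_EL0 does not count unless
it is enabled in both PMCR_EL0.E and PMCNTENSET_EL0.C. A minimal
guest-side sketch (assumes an ARMv8 PMU, running at EL1):

#include <stdint.h>

static inline uint64_t enable_and_read_cycles(void)
{
	uint64_t cycles;

	asm volatile(
	"	mov	x0, #1\n"		/* PMCR_EL0.E: enable PMU */
	"	msr	pmcr_el0, x0\n"
	"	mov	x0, #(1 << 31)\n"	/* PMCNTENSET_EL0.C */
	"	msr	pmcntenset_el0, x0\n"
	"	isb\n"
	"	mrs	%0, pmccntr_el0\n"
	: "=r" (cycles) : : "x0");

	return cycles;
}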