[RFC,v2,0/2] arm: enable MTE for QEMU + kvm

Message ID: 20220707161656.41664-1-cohuck@redhat.com

Message

Cornelia Huck July 7, 2022, 4:16 p.m. UTC
This series makes it possible to enable MTE for kvm guests, if the kernel
supports it. Again, tested on the simulator by patiently waiting for the
arm64/mte kselftests to finish successfully.

For tcg, turning on mte at the machine level (to get tag memory) stays a
requirement. If the new mte cpu feature is not explicitly specified, a tcg
vm will get mte depending on the presence of tag memory (just as today).

For kvm, mte stays off by default; this is because migration is not yet
supported (postcopy will need an extension of the kernel interface, possibly
an extension of the userfaultfd interface), and turning on mte will add a
migration blocker.
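
For reference, the blocker amounts to something like this sketch; only
migrate_add_blocker() is existing API, the rest is illustrative:

/* Sketch: register a migration blocker when MTE is enabled for a kvm guest. */
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "migration/blocker.h"

static Error *mte_migration_blocker;

static int arm_mte_add_migration_blocker(Error **errp)
{
    error_setg(&mte_migration_blocker,
               "kvm guests with MTE enabled cannot be migrated yet");
    if (migrate_add_blocker(mte_migration_blocker, errp) < 0) {
        error_free(mte_migration_blocker);
        mte_migration_blocker = NULL;
        return -1;
    }
    return 0;
}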

My biggest question going forward actually concerns migration; I gather
that we should not bother adding something unless postcopy is working as well?
If I'm not misunderstanding things, we need a way to fault in a page together
with the tag; doing that in one go is probably the only way that we can be
sure that this is race-free on the QEMU side. Comments welcome :)

Changes v1->v2: [Thanks to Eric for the feedback!]
- add documentation
- switch the mte prop to OnOffAuto; this improves the interaction with the
  existing mte machine prop (see the property sketch after this list)
- leave mte off for kvm by default
- improve tests; the poking in QDicts feels a bit ugly, but seems to work
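
The property sketch mentioned above: a minimal OnOffAuto QOM property,
where the prop_mte field and the helper names are made up (the actual
patch may differ):

#include "qemu/osdep.h"
#include "qapi/visitor.h"
#include "qapi/qapi-visit-common.h"
#include "cpu.h"

static void arm_cpu_get_mte(Object *obj, Visitor *v, const char *name,
                            void *opaque, Error **errp)
{
    OnOffAuto mte = ARM_CPU(obj)->prop_mte;

    visit_type_OnOffAuto(v, name, &mte, errp);
}

static void arm_cpu_set_mte(Object *obj, Visitor *v, const char *name,
                            void *opaque, Error **errp)
{
    visit_type_OnOffAuto(v, name, &ARM_CPU(obj)->prop_mte, errp);
}

static void arm_cpu_add_mte_prop(Object *obj)
{
    /* AUTO resolves later: tag memory present -> on for tcg, off for kvm */
    object_property_add(obj, "mte", "OnOffAuto",
                        arm_cpu_get_mte, arm_cpu_set_mte, NULL, NULL);
}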

Cornelia Huck (2):
  arm/kvm: add support for MTE
  qtests/arm: add some mte tests

 docs/system/arm/cpu-features.rst |  21 +++++
 target/arm/cpu.c                 |  18 ++---
 target/arm/cpu.h                 |   1 +
 target/arm/cpu64.c               | 132 +++++++++++++++++++++++++++++++
 target/arm/internals.h           |   1 +
 target/arm/kvm64.c               |   5 ++
 target/arm/kvm_arm.h             |  12 +++
 target/arm/monitor.c             |   1 +
 tests/qtest/arm-cpu-features.c   |  77 ++++++++++++++++++
 9 files changed, 256 insertions(+), 12 deletions(-)

Comments

Richard Henderson July 9, 2022, 2:59 a.m. UTC | #1
On 7/7/22 21:46, Cornelia Huck wrote:
> If I'm not misunderstanding things, we need a way to fault in a page together
> with the tag; doing that in one go is probably the only way that we can be
> sure that this is race-free on the QEMU side.

That's my understanding as well.


r~
Dr. David Alan Gilbert July 11, 2022, 1:24 p.m. UTC | #2
* Cornelia Huck (cohuck@redhat.com) wrote:
> This series makes it possible to enable MTE for kvm guests, if the kernel
> supports it. Again, tested on the simulator by patiently waiting for the
> arm64/mte kselftests to finish successfully.
> 
> For tcg, turning on mte at the machine level (to get tag memory) stays a
> requirement. If the new mte cpu feature is not explicitly specified, a tcg
> vm will get mte depending on the presence of tag memory (just as today).
> 
> For kvm, mte stays off by default; this is because migration is not yet
> supported (postcopy will need an extension of the kernel interface, possibly
> an extension of the userfaultfd interface), and turning on mte will add a
> migration blocker.

My assumption was that a normal migration would need something as well
to retrieve and place the MTE flags; albeit not atomically.

> My biggest question going forward actually concerns migration; I gather
> that we should not bother adding something unless postcopy is working as well?

I don't think that restriction is fair on you; just make sure
postcopy_ram_supported_by_host gains an arch call and fails cleanly;
that way if anyone tries to enable postcopy they'll find out with a
clean fail.
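
Something like the following; only postcopy_ram_supported_by_host()
exists today, the hook and the flag are invented:

/* Sketch: a per-arch veto that postcopy_ram_supported_by_host() would
 * call early on, so enabling postcopy fails cleanly. */
#include <stdbool.h>

static bool kvm_mte_enabled;    /* set when the mte cpu feature is on */

bool postcopy_arch_supported(void)
{
    /*
     * Until the kernel can place a page together with its tags
     * atomically (e.g. an extended UFFDIO_COPY), postcopy cannot work
     * for MTE-enabled kvm guests.
     */
    return !kvm_mte_enabled;
}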

> If I'm not misunderstanding things, we need a way to fault in a page together
> with the tag; doing that in one go is probably the only way that we can be
> sure that this is race-free on the QEMU side. Comments welcome :)

I think it will.
But, ignoring postcopy for a minute, with KVM how do different types of
backing memory work - e.g. if I back a region of guest memory with
/dev/shm/something or a hugepage equivalent, where does the MTE memory
come from, and how do you set it?

Dave

Peter Maydell July 11, 2022, 1:39 p.m. UTC | #3
On Mon, 11 Jul 2022 at 14:24, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> But, ignoring postcopy for a minute, with KVM how do different types of
> backing memory work - e.g. if I back a region of guest memory with
> /dev/shm/something or a hugepage equivalent, where does the MTE memory
> come from, and how do you set it?

Generally in an MTE system anything that's "plain old RAM" is expected
to support tags. (The architecture manual calls this "conventional
memory". This isn't quite the same as "anything that looks RAM-like",
e.g. the graphics card framebuffer doesn't have to support tags!)

One plausible implementation is that the firmware and memory controller
are in cahoots and arrange that the appropriate fraction of the DRAM is
reserved for holding tags (and inaccessible as normal RAM even by the OS);
but where the tags are stored is entirely impdef and an implementation
could choose to put the tags in their own entirely separate storage if
it liked. The only way to access the tag storage is via the instructions
for getting and setting tags.
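
For concreteness, a user-space sketch of those operations (LDG reads back
a granule's allocation tag, STG sets it; assumes an AArch64 build with
MTE available, e.g. -march=armv8.5-a+memtag):

#include <stdint.h>

/* Return ptr with its logical tag replaced by the allocation tag
 * currently stored for that 16-byte granule (LDG). */
static inline void *mte_load_tag(void *ptr)
{
    void *tagged = ptr;

    asm volatile("ldg %0, [%1]" : "+r"(tagged) : "r"(ptr) : "memory");
    return tagged;
}

/* Tag the 16-byte granule at tagged_ptr with the tag held in the
 * pointer's top byte (STG). */
static inline void mte_store_tag(void *tagged_ptr)
{
    asm volatile("stg %0, [%0]" : : "r"(tagged_ptr) : "memory");
}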

-- PMM
Dr. David Alan Gilbert July 11, 2022, 1:55 p.m. UTC | #4
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> But, ignoring postcopy for a minute, with KVM how do different types of
> backing memory work - e.g. if I back a region of guest memory with
> /dev/shm/something or a hugepage equivalent, where does the MTE memory
> come from, and how do you set it?

Another case that just came to mind is the data content optimisations:
we special-case all-zero pages, which I guess you still need to transmit
tags for, and the xbzrle page-difference code wouldn't notice
differences in tags.

Dave

Dr. David Alan Gilbert July 11, 2022, 2:26 p.m. UTC | #5
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Mon, 11 Jul 2022 at 14:24, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > But, ignoring postcopy for a minute, with KVM how do different types of
> > backing memory work - e.g. if I back a region of guest memory with
> > /dev/shm/something or a hugepage equivalent, where does the MTE memory
> > come from, and how do you set it?
> 
> Generally in an MTE system anything that's "plain old RAM" is expected
> to support tags. (The architecture manual calls this "conventional
> memory". This isn't quite the same as "anything that looks RAM-like",
> e.g. the graphics card framebuffer doesn't have to support tags!)

I guess things like non-volatile disks mapped as DAX are fun edge cases.

> One plausible implementation is that the firmware and memory controller
> are in cahoots and arrange that the appropriate fraction of the DRAM is
> reserved for holding tags (and inaccessible as normal RAM even by the OS);
> but where the tags are stored is entirely impdef and an implementation
> could choose to put the tags in their own entirely separate storage if
> it liked. The only way to access the tag storage is via the instructions
> for getting and setting tags.

Hmm OK; in postcopy, at the moment, QEMU uses a call that atomically
places a page of data in memory and then tells the vCPUs to continue.
I guess a variant that took an extra blob of MTE data would do.
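
Roughly this today; a tag-carrying variant would presumably extend the
mode flags (the MTE flag named in the comment does not exist):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int place_page(int uffd, void *host_addr, void *src, size_t pagesize)
{
    struct uffdio_copy copy = {
        .dst  = (uintptr_t)host_addr,
        .src  = (uintptr_t)src,
        .len  = pagesize,
        .mode = 0,    /* a future UFFDIO_COPY_MODE_MTE could go here */
    };

    /* Copies the page in atomically and wakes waiting vCPU threads. */
    return ioctl(uffd, UFFDIO_COPY, &copy);
}
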
Note that other VMMs built on kvm work in different ways; the other
common way is to write into the backing file (i.e. the /dev/shm
whatever atomically somehow) and then do the userfault call to tell the
vcpus to continue.  It looks like this is the way things will work in
the split hugepage mechanism Google are currently adding.

Dave

> -- PMM
>
Cornelia Huck July 11, 2022, 2:56 p.m. UTC | #6
On Mon, Jul 11 2022, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> On Mon, 11 Jul 2022 at 14:24, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > But, ignoring postcopy for a minute, with KVM how do different types of
>> > backing memory work - e.g. if I back a region of guest memory with
>> > /dev/shm/something or a hugepage equivalent, where does the MTE memory
>> > come from, and how do you set it?
>> 
>> Generally in an MTE system anything that's "plain old RAM" is expected
>> to support tags. (The architecture manual calls this "conventional
>> memory". This isn't quite the same as "anything that looks RAM-like",
>> e.g. the graphics card framebuffer doesn't have to support tags!)
>
> I guess things like non-volatile disks mapped as DAX are fun edge cases.
>
>> One plausible implementation is that the firmware and memory controller
>> are in cahoots and arrange that the appropriate fraction of the DRAM is
>> reserved for holding tags (and inaccessible as normal RAM even by the OS);
>> but where the tags are stored is entirely impdef and an implementation
>> could choose to put the tags in their own entirely separate storage if
>> it liked. The only way to access the tag storage is via the instructions
>> for getting and setting tags.
>
> Hmm OK; in postcopy, at the moment, QEMU uses a call that atomically
> places a page of data in memory and then tells the vCPUs to continue.
> I guess a variant that took an extra blob of MTE data would do.

Yes, the current idea is to extend UFFDIO_COPY with a flag so that we
get the tag data along with the page.

> Note that other VMMs built on kvm work in different ways; the other
> common way is to write into the backing file (i.e. the /dev/shm
> whatever atomically somehow) and then do the userfault call to tell the
> vcpus to continue.  It looks like this is the way things will work in
> the split hugepage mechanism Google are currently adding.

Hmm... I had the impression that other VMMs had not cared about this
particular use case yet; if they need a slightly different mechanism,
it would complicate things a bit.
Cornelia Huck July 11, 2022, 3:08 p.m. UTC | #7
On Mon, Jul 11 2022, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Cornelia Huck (cohuck@redhat.com) wrote:
>> For kvm, mte stays off by default; this is because migration is not yet
>> supported (postcopy will need an extension of the kernel interface, possibly
>> an extension of the userfaultfd interface), and turning on mte will add a
>> migration blocker.
>
> My assumption was that a normal migration would need something as well
> to retrieve and place the MTE flags; albeit not atomically.

There's KVM_ARM_MTE_COPY_TAGS, which should be sufficient to move tags
around for normal migration.
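
Something along these lines per page (struct and flag names are from the
kernel uapi; the wrapper itself is just a sketch):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* One tag byte per 16-byte granule: tag_buf needs length / 16 bytes.
 * Returns the number of bytes copied, or negative on error. */
static int fetch_mte_tags(int vm_fd, uint64_t guest_ipa, uint64_t length,
                          uint8_t *tag_buf)
{
    struct kvm_arm_copy_mte_tags copy_tags = {
        .guest_ipa = guest_ipa,
        .length    = length,
        .addr      = tag_buf,
        .flags     = KVM_ARM_TAGS_FROM_GUEST,
    };

    return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy_tags);
}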

>
>> My biggest question going forward actually concerns migration; I gather
>> that we should not bother adding something unless postcopy is working as well?
>
> I don't think that restriction is fair on you; just make sure
> postcopy_ram_supported_by_host gains an arch call and fails cleanly;
> that way if anyone tries to enable postcopy they'll find out with a
> clean fail.

Ok, if simply fencing off postcopy is fine, we can try to move forward
with what we have now. The original attempt at
https://lore.kernel.org/all/881871e8394fa18a656dfb105d42e6099335c721.1615972140.git.haibo.xu@linaro.org/
hooked itself directly into common code; maybe we should rather copy the
approach used for s390 storage keys (extra "device") instead?
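
I.e. roughly the following, with all names invented and the shape
borrowed from hw/s390x/s390-skeys.c:

#include "qemu/osdep.h"
#include "migration/register.h"

static void mte_tags_save(QEMUFile *f, void *opaque);
static int mte_tags_load(QEMUFile *f, void *opaque, int version_id);

static SaveVMHandlers savevm_mte_tags = {
    .save_state = mte_tags_save,  /* KVM_ARM_MTE_COPY_TAGS out of the guest */
    .load_state = mte_tags_load,  /* ...and back in on the destination */
};

static void mte_register_migration(void)
{
    register_savevm_live("arm-mte-tags", 0, 1, &savevm_mte_tags, NULL);
}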
Dr. David Alan Gilbert July 11, 2022, 3:28 p.m. UTC | #8
* Cornelia Huck (cohuck@redhat.com) wrote:
> Ok, if simply fencing off postcopy is fine, we can try to move forward
> with what we have now. The original attempt at
> https://lore.kernel.org/all/881871e8394fa18a656dfb105d42e6099335c721.1615972140.git.haibo.xu@linaro.org/
> hooked itself directly into common code; maybe we should rather copy the
> approach used for s390 storage keys (extra "device") instead?

I don't understand how a separate device would keep the page-changed
(dirty) flags coherent with the main RAM that the tags correspond to.

Dave
Dr. David Alan Gilbert July 11, 2022, 3:30 p.m. UTC | #9
* Cornelia Huck (cohuck@redhat.com) wrote:
> On Mon, Jul 11 2022, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > Note that other VMMs built on kvm work in different ways; the other
> > common way is to write into the backing file (i.e. the /dev/shm
> > whatever atomically somehow) and then do the userfault call to tell the
> > vcpus to continue.  It looks like this is the way things will work in
> > the split hugepage mechanism Google are currently adding.
> 
> Hmm... I had the impression that other VMMs had not cared about this
> particular use case yet; if they need a slightly different mechanism,
> it would complicate things a bit.

I think Google's internal VMM doesn't use UFFDIO_COPY - but I don't have
details to be sure of that.

Dave