Message ID: 20220223131231.403386-1-Jason@zx2c4.com
Series: VM fork detection for RNG
On Wed, Feb 23, 2022 at 2:12 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> second patch is the reason this is just an RFC: it's a cleanup of the
> ACPI driver from last year, and I don't really have much experience
> writing, testing, debugging, or maintaining these types of drivers.
> Ideally this thread would yield somebody saying, "I see the intent of
> this; I'm happy to take over ownership of this part." That way, I can
> focus on the RNG part, and whoever steps up for the paravirt ACPI part
> can focus on that.

I actually managed to test this in QEMU, and it seems to work quite
well. Steps:

$ qemu-system-x86_64 ... -device vmgenid,guid=auto -monitor stdio
(qemu) savevm blah
(qemu) quit
$ qemu-system-x86_64 ... -device vmgenid,guid=auto -monitor stdio
(qemu) loadvm blah

Doing this successfully triggers the function to reinitialize the RNG
with the new GUID. (It appears there's a bug in QEMU which prevents
the GUID from being reinitialized when running `loadvm` without
quitting first; I suppose this should be discussed with QEMU
upstream.)

So that's very positive. But I would appreciate hearing from some
ACPI/Virt/Amazon people about this.

Jason
On Wed, Feb 23, 2022 at 5:08 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Wed, Feb 23, 2022 at 2:12 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > second patch is the reason this is just an RFC: it's a cleanup of the
> > ACPI driver from last year, and I don't really have much experience
> > writing, testing, debugging, or maintaining these types of drivers.
> > Ideally this thread would yield somebody saying, "I see the intent of
> > this; I'm happy to take over ownership of this part." That way, I can
> > focus on the RNG part, and whoever steps up for the paravirt ACPI part
> > can focus on that.
>
> I actually managed to test this in QEMU, and it seems to work quite well. Steps:
>
> $ qemu-system-x86_64 ... -device vmgenid,guid=auto -monitor stdio
> (qemu) savevm blah
> (qemu) quit
> $ qemu-system-x86_64 ... -device vmgenid,guid=auto -monitor stdio
> (qemu) loadvm blah
>
> Doing this successfully triggers the function to reinitialize the RNG
> with the new GUID. (It appears there's a bug in QEMU which prevents
> the GUID from being reinitialized when running `loadvm` without
> quitting first; I suppose this should be discussed with QEMU
> upstream.)
>
> So that's very positive. But I would appreciate hearing from some
> ACPI/Virt/Amazon people about this.

Because something something picture thousand words something, here's a
gif to see this working as expected:

https://data.zx2c4.com/vmgenid-appears-to-work.gif

Jason
Hey Jason,

On 23.02.22 14:12, Jason A. Donenfeld wrote:
> This small series picks up work from Amazon that seems to have stalled
> out last year around this time: listening for the vmgenid ACPI
> notification, and using it to "do something." Last year, that something
> involved a complicated userspace mmap chardev, which seems fraught with
> difficulty. This year, I have something much simpler in mind: simply
> using those ACPI notifications to tell the RNG to reinitialize safely,
> so we don't repeat random numbers in cloned, forked, or rolled-back VM
> instances.
>
> This series consists of two patches. The first is a rather
> straightforward addition to random.c, which I feel fine about. The
> second patch is the reason this is just an RFC: it's a cleanup of the
> ACPI driver from last year, and I don't really have much experience
> writing, testing, debugging, or maintaining these types of drivers.
> Ideally this thread would yield somebody saying, "I see the intent of
> this; I'm happy to take over ownership of this part." That way, I can
> focus on the RNG part, and whoever steps up for the paravirt ACPI part
> can focus on that.
>
> As a final note, this series intentionally does _not_ focus on
> notification of these events to userspace or to other kernel consumers.
> Since these VM fork detection events first need to hit the RNG, we can
> later talk about what sorts of notifications or mmap'd counters the RNG
> should be making accessible to elsewhere. But that's a different sort of
> project and ties into a lot of more complicated concerns beyond this
> more basic patchset. So hopefully we can keep the discussion rather
> focused here to this ACPI business.

The main problem with VMGenID is that it is inherently racy. There will
always be a (short) amount of time where the ACPI notification is not
processed, but the VM could use its RNG to for example establish TLS
connections.
Hence, as the next step, we proposed a multi-stage quiesce/resume
mechanism where the system is aware that it is going into suspend - can
block network connections for example - and only returns to a fully
functional state after an unquiesce phase:

https://github.com/systemd/systemd/issues/20222

Looking at the issue again, it seems like we completely failed to follow
up with a PR to implement that functionality :(.

What exact use case do you have in mind for the RNG/VMGenID update? Can
you think of situations where the race is not an actual concern?

Alex

> Cc: dwmw@amazon.co.uk
> Cc: acatan@amazon.com
> Cc: graf@amazon.com
> Cc: colmmacc@amazon.com
> Cc: sblbir@amazon.com
> Cc: raduweis@amazon.com
> Cc: jannh@google.com
> Cc: gregkh@linuxfoundation.org
> Cc: tytso@mit.edu
>
> Jason A. Donenfeld (2):
>   random: add mechanism for VM forks to reinitialize crng
>   drivers/virt: add vmgenid driver for reinitializing RNG
>
>  drivers/char/random.c  |  58 ++++++++++++++++++
>  drivers/virt/Kconfig   |   8 +++
>  drivers/virt/Makefile  |   1 +
>  drivers/virt/vmgenid.c | 133 +++++++++++++++++++++++++++++++++++++++++
>  include/linux/random.h |   1 +
>  5 files changed, 201 insertions(+)
>  create mode 100644 drivers/virt/vmgenid.c
>
> --
> 2.35.1

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
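[Editor's note: the race described above - RNG output drawn after a
restore but before the ACPI notification is handled - can be illustrated
with a toy model. This is illustrative Python only, not kernel code; all
class and method names here are invented for the sketch, and BLAKE2s
merely stands in for whatever mixing the kernel actually does.]

```python
import hashlib

class Guest:
    """Toy model of the VMGenID race: the identifier changes at restore
    time, but the guest only reseeds once the asynchronous ACPI
    notification is actually processed."""

    def __init__(self, key: bytes):
        self.key = key                  # RNG state captured in the snapshot
        self.pending_vmgenid = None

    def restore(self, new_vmgenid: bytes):
        # Host updated the VMGenID; notification is queued, not yet handled.
        self.pending_vmgenid = new_vmgenid

    def handle_acpi_notify(self):
        if self.pending_vmgenid is not None:
            self.key = hashlib.blake2s(self.key + self.pending_vmgenid).digest()
            self.pending_vmgenid = None

    def draw(self) -> bytes:
        return hashlib.blake2s(b"output", key=self.key).digest()

snap_key = bytes(32)
a, b = Guest(snap_key), Guest(snap_key)    # two clones of one snapshot
a.restore(b"A" * 16)
b.restore(b"B" * 16)
early_a, early_b = a.draw(), b.draw()      # drawn inside the race window
a.handle_acpi_notify()
b.handle_acpi_notify()
late_a, late_b = a.draw(), b.draw()
```

In this model `early_a` and `early_b` are identical across the two
clones - exactly the duplicated-output hazard the quiesce/resume
proposal is meant to close - while `late_a` and `late_b` diverge once
the notification has been handled.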
On Thu, Feb 24, 2022 at 09:53:59AM +0100, Alexander Graf wrote:
> Hey Jason,
>
> On 23.02.22 14:12, Jason A. Donenfeld wrote:
> > This small series picks up work from Amazon that seems to have stalled
> > out last year around this time: listening for the vmgenid ACPI
> > notification, and using it to "do something." Last year, that something
> > involved a complicated userspace mmap chardev, which seems fraught with
> > difficulty. This year, I have something much simpler in mind: simply
> > using those ACPI notifications to tell the RNG to reinitialize safely,
> > so we don't repeat random numbers in cloned, forked, or rolled-back VM
> > instances.
> >
> > This series consists of two patches. The first is a rather
> > straightforward addition to random.c, which I feel fine about. The
> > second patch is the reason this is just an RFC: it's a cleanup of the
> > ACPI driver from last year, and I don't really have much experience
> > writing, testing, debugging, or maintaining these types of drivers.
> > Ideally this thread would yield somebody saying, "I see the intent of
> > this; I'm happy to take over ownership of this part." That way, I can
> > focus on the RNG part, and whoever steps up for the paravirt ACPI part
> > can focus on that.
> >
> > As a final note, this series intentionally does _not_ focus on
> > notification of these events to userspace or to other kernel consumers.
> > Since these VM fork detection events first need to hit the RNG, we can
> > later talk about what sorts of notifications or mmap'd counters the RNG
> > should be making accessible to elsewhere. But that's a different sort of
> > project and ties into a lot of more complicated concerns beyond this
> > more basic patchset. So hopefully we can keep the discussion rather
> > focused here to this ACPI business.
>
> The main problem with VMGenID is that it is inherently racy.
> There will
> always be a (short) amount of time where the ACPI notification is not
> processed, but the VM could use its RNG to for example establish TLS
> connections.
>
> Hence we as the next step proposed a multi-stage quiesce/resume mechanism
> where the system is aware that it is going into suspend - can block network
> connections for example - and only returns to a fully functional state after
> an unquiesce phase:
>
> https://github.com/systemd/systemd/issues/20222

The downside of course is precisely that the guest now needs to be aware
and involved every single time a snapshot is taken.

Currently with virt the act of taking a snapshot can often remain
invisible to the VM, with no functional effect on the guest OS or its
workload, and the host OS knows it can complete a snapshot in a specific
timeframe. That said, this transparency to the VM is precisely the cause
of the race condition described.

With guest involvement to quiesce the bulk of activity for a time
period, there is more likely to be a negative impact on the guest
workload. The guest admin likely needs to be more explicit about exactly
when in time it is reasonable to take a snapshot to mitigate the impact.

The host OS snapshot operations are also now dependent on co-operation
of a guest OS that has to be considered to be potentially malicious, or
at least crashed/non-responsive. The guest OS also needs a way to
receive the triggers for snapshot capture and restore, most likely via
an extension to something like the QEMU guest agent or an equivalent for
other hypervisors.

Despite the above, I'm not against the idea of co-operative involvement
of the guest OS in the acts of taking & restoring snapshots. I can't
see any other proposals so far that can reliably eliminate the races
in the general case, from the kernel right up to user applications.
So I think it is necessary to have guest cooperative snapshotting.

> What exact use case do you have in mind for the RNG/VMGenID update?
> Can you
> think of situations where the race is not an actual concern?

Let's assume we do take the approach described in that systemd bug and
have a co-operative snapshot process. If the hypervisor does the right
thing and guest owners install the right things, they'll have a
race-free solution that works well in normal operation. That's good.

Realistically though, it is never going to be universally and reliably
put into practice. So what is our attitude to cases where the preferred
solution isn't available and/or operative?

There are going to be users who continue to build their guest disk
images without the QEMU guest agent (or equivalent for whatever
hypervisor they run on) installed because they don't know any better. Or
where the guest agent is mis-configured, or fails to start, or some
other scenario prevents the quiesce working as desired. The host mgmt
could refuse to take a snapshot in these cases. More likely is that they
are just going to go ahead and do a snapshot anyway, because lack of a
guest agent is a very common scenario today and users want their
snapshots.

There are going to be virt management apps / hypervisors that don't
support talking to any guest agent across their snapshot operation in
the first place, so systemd gets no way to trigger the required quiesce
dance on snapshot, but they likely have VMGenID support implemented
already.

IOW, I could view VMGenID-triggered fork detection integrated with the
kernel RNG as providing a backup line of defence that is going to "just
work", albeit with the known race. It isn't as good as the guest
co-operative snapshot approach, because it only tries to solve the one
specific targeted problem of updating the kernel RNG.

Is it still better than doing nothing at all though, for the scenario
where guest co-operative snapshot is unavailable?

If it is better than nothing, is it then compelling enough to justify
the maintenance cost of the code added to the kernel?

With regards,
Daniel
Hi Laszlo,

Thanks for your reply.

On Thu, Feb 24, 2022 at 9:23 AM Laszlo Ersek <lersek@redhat.com> wrote:
> QEMU's related design is documented in
> <https://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/vmgenid.txt>.

I'll link to this document on the 2/2 patch next to the other ones.

> "they can also use the data provided in the 128-bit identifier as a high
> entropy random data source"
>
> So reinitializing an RNG from it is an express purpose.

It seems like this is indeed meant to be used for RNG purposes, but the
Windows 10 RNG document says: "Windows 10 on a Hyper-V VM will detect
when the VM state is reset, retrieve a unique (not random) value from
the hypervisor." I gather from that that it's not totally clear what the
"quality" of those 128 bits is. So this patchset mixes them into the
entropy pool, but does not credit it, which is consistent with how the
RNG deals with other data where the conclusion is, "probably pretty good
but maybe not," erring on the side of caution. Either way, it's
certainly being used -- and combined with what was there before -- to
reinitialize the RNG following a VM fork.

> More info in the libvirt docs (see "genid"):
>
> https://libvirt.org/formatdomain.html#general-metadata

Thanks, noted in the 2/2 patch too.

> QEMU's interpretation of the VMGENID specifically as a UUID (which I
> believe comes from me) has received (valid) criticism since:
>
> https://github.com/libguestfs/virt-v2v/blob/master/docs/vm-generation-id-across-hypervisors.txt
>
> (This document also investigates VMGENID on other hypervisors, which I
> think pertains to your other message.)

Thank you very much for this reference! You're absolutely right here. v3
will treat this as just an opaque 128-bit binary blob. There's no point,
anyway, in treating it as a UUID in the kernel, since it never should be
printed or exposed to anywhere except random.c (and my gifs, of course
:-P).

> > (It appears there's a bug in QEMU which prevents
> > the GUID from being reinitialized when running `loadvm` without
> > quitting first; I suppose this should be discussed with QEMU
> > upstream.)
>
> That's not (necessarily) a bug; see the end of the above-linked QEMU
> document:
>
> "There are no known use cases for changing the GUID once QEMU is
> running, and adding this capability would greatly increase the complexity."

I read that, and I think I might disagree? If you're QEMUing with the
monitor and are jumping back and forth and all around between saved
snapshots, probably those snapshots should have their RNG reinitialized
through this mechanism, right? It seems like doing that would be the
proper behavior for `guid=auto`, but not for `guid={some-fixed-thing}`.

> > So that's very positive. But I would appreciate hearing from some
> > ACPI/Virt/Amazon people about this.
>
> I've only made some random comments; I didn't see a question so I
> couldn't attempt to answer :)

"Am I on the right track," I guess, and your reply has been very
informative. Thanks for your feedback. I'll have a v3 sent out before
long.

Jason
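[Editor's note: the "mix it in but don't credit it" behavior described
above can be sketched in userspace. This is an illustrative Python
sketch only, not random.c's actual code: BLAKE2s stands in for the
kernel's pool mixing, and the function name is invented. The point is
that the new state depends on both the old state and the opaque 128-bit
blob, so a low-quality identifier cannot weaken the pool, while distinct
identifiers force cloned VMs to diverge.]

```python
import hashlib

def reseed_on_fork(pool_key: bytes, vmgenid: bytes) -> bytes:
    # Treat the VMGenID as an opaque 128-bit blob (never parsed or
    # printed as a UUID) and fold it into existing pool state without
    # crediting it as entropy.
    assert len(vmgenid) == 16
    return hashlib.blake2s(pool_key + vmgenid).digest()

snapshot_key = bytes(32)   # RNG state shared by all clones at fork time
clone_a = reseed_on_fork(snapshot_key, b"\x01" * 16)
clone_b = reseed_on_fork(snapshot_key, b"\x02" * 16)
```

With different identifiers, `clone_a` and `clone_b` differ immediately,
even though both started from the same snapshot state.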
Hi Alex,

Strangely your message never made it to me, and I had to pull this out
of Lore after seeing Daniel's reply to it. I wonder what's up.

On Thu, Feb 24, 2022 at 09:53:59AM +0100, Alexander Graf wrote:
> The main problem with VMGenID is that it is inherently racy. There will
> always be a (short) amount of time where the ACPI notification is not
> processed, but the VM could use its RNG to for example establish TLS
> connections.
>
> Hence we as the next step proposed a multi-stage quiesce/resume
> mechanism where the system is aware that it is going into suspend - can
> block network connections for example - and only returns to a fully
> functional state after an unquiesce phase:
>
> https://github.com/systemd/systemd/issues/20222
>
> Looking at the issue again, it seems like we completely failed to follow
> up with a PR to implement that functionality :(.
>
> What exact use case do you have in mind for the RNG/VMGenID update? Can
> you think of situations where the race is not an actual concern?

No, I think the race is something that remains a problem for the
situations I care about. There are simpler ways of fixing that -- just
expose a single incrementing integer so that it can be checked every
time the RNG does something, without being expensive, via the same
mechanism -- and then you don't need any complexity. But anyway, that
doesn't exist right now, so this series tries to implement something for
what does exist and is already supported by multiple hypervisors.

I'd suggest sending a proposal for an improved mechanism as part of a
different thread, pulling the various parties into that, and we can make
something good for the future. I'm happy to implement whatever the
virtual hardware exposes.

Jason
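[Editor's note: the cheaper alternative Jason mentions - a single
incrementing integer checked on every RNG operation - can be modeled in
a few lines. This is a toy model with hypothetical names, not an
existing hypervisor interface; the shared counter is simulated with a
plain Python list, and BLAKE2s stands in for a real keyed generator.]

```python
import hashlib
import itertools

class ForkAwareRng:
    """Compare a generation word on every draw; reseed only when it
    changed, so the common path is a single comparison rather than
    waiting on an asynchronous notification."""

    def __init__(self, read_generation, seed: bytes):
        self._read_generation = read_generation  # hypothetical shared counter
        self._seen = read_generation()
        self._key = seed
        self._blocks = itertools.count()
        self.reseeds = 0

    def draw(self) -> bytes:
        generation = self._read_generation()
        if generation != self._seen:  # VM was forked/restored since last draw
            self._key = hashlib.blake2s(
                self._key + generation.to_bytes(8, "little")).digest()
            self._seen = generation
            self.reseeds += 1
        block = next(self._blocks).to_bytes(8, "little")
        return hashlib.blake2s(block, key=self._key).digest()

generation = [0]                        # stands in for the shared mapped word
rng = ForkAwareRng(lambda: generation[0], seed=bytes(32))
before = rng.draw()
generation[0] += 1                      # host bumps the counter at restore
after = rng.draw()
```

Because the check happens at draw time rather than via a delivered
event, there is no window in which stale state can be used after the
counter has changed - which is exactly why this shape avoids the
VMGenID/ACPI race.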
On Thu, Feb 24, 2022 at 09:22:50AM +0100, Laszlo Ersek wrote:
> (+Daniel, +Rich)
>
> On 02/23/22 17:08, Jason A. Donenfeld wrote:
> > On Wed, Feb 23, 2022 at 2:12 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >> second patch is the reason this is just an RFC: it's a cleanup of the
> >> ACPI driver from last year, and I don't really have much experience
> >> writing, testing, debugging, or maintaining these types of drivers.
> >> Ideally this thread would yield somebody saying, "I see the intent of
> >> this; I'm happy to take over ownership of this part." That way, I can
> >> focus on the RNG part, and whoever steps up for the paravirt ACPI part
> >> can focus on that.
> >
> > (It appears there's a bug in QEMU which prevents
> > the GUID from being reinitialized when running `loadvm` without
> > quitting first; I suppose this should be discussed with QEMU
> > upstream.)
>
> That's not (necessarily) a bug; see the end of the above-linked QEMU
> document:
>
> "There are no known use cases for changing the GUID once QEMU is
> running, and adding this capability would greatly increase the complexity."

IIRC this part of the QEMU doc was making an implicit assumption about
the way QEMU is to be used by mgmt apps doing snapshots.

Instead of using the 'loadvm' command on the existing running QEMU
process, the doc seems to tacitly expect the management app will throw
away the existing QEMU process and spawn a brand new QEMU process to
load the snapshot into, thus getting the new GUID on the QEMU command
line.

There are some downsides to doing this as compared to running 'loadvm'
in the existing QEMU, most notably that the user's VNC/SPICE console
session gets interrupted. I guess the ease of implementation for QEMU
was more compelling though.

With regards,
Daniel
On Thu, Feb 24, 2022 at 11:56 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> IIRC this part of the QEMU doc was making an implicit assumption
> about the way QEMU is to be used by mgmt apps doing snapshots.
>
> Instead of using the 'loadvm' command on the existing running QEMU
> process, the doc seems to tacitly expect the management app will
> throw away the existing QEMU process and spawn a brand new QEMU
> process to load the snapshot into, thus getting the new GUID on
> the QEMU command line.

Right, exactly. The "there are no known use cases" bit I think just
forgot about one very common use case that perhaps just wasn't in use
by the original author. So I'm pretty sure this remains a QEMU bug.

Jason
On 24.02.22 11:43, Daniel P. Berrangé wrote:
> On Thu, Feb 24, 2022 at 09:53:59AM +0100, Alexander Graf wrote:
>> Hey Jason,
>>
>> On 23.02.22 14:12, Jason A. Donenfeld wrote:
>>> This small series picks up work from Amazon that seems to have stalled
>>> out last year around this time: listening for the vmgenid ACPI
>>> notification, and using it to "do something." Last year, that something
>>> involved a complicated userspace mmap chardev, which seems fraught with
>>> difficulty. This year, I have something much simpler in mind: simply
>>> using those ACPI notifications to tell the RNG to reinitialize safely,
>>> so we don't repeat random numbers in cloned, forked, or rolled-back VM
>>> instances.
>>>
>>> This series consists of two patches. The first is a rather
>>> straightforward addition to random.c, which I feel fine about. The
>>> second patch is the reason this is just an RFC: it's a cleanup of the
>>> ACPI driver from last year, and I don't really have much experience
>>> writing, testing, debugging, or maintaining these types of drivers.
>>> Ideally this thread would yield somebody saying, "I see the intent of
>>> this; I'm happy to take over ownership of this part." That way, I can
>>> focus on the RNG part, and whoever steps up for the paravirt ACPI part
>>> can focus on that.
>>>
>>> As a final note, this series intentionally does _not_ focus on
>>> notification of these events to userspace or to other kernel consumers.
>>> Since these VM fork detection events first need to hit the RNG, we can
>>> later talk about what sorts of notifications or mmap'd counters the RNG
>>> should be making accessible to elsewhere. But that's a different sort of
>>> project and ties into a lot of more complicated concerns beyond this
>>> more basic patchset. So hopefully we can keep the discussion rather
>>> focused here to this ACPI business.
>>
>> The main problem with VMGenID is that it is inherently racy.
>> There will
>> always be a (short) amount of time where the ACPI notification is not
>> processed, but the VM could use its RNG to for example establish TLS
>> connections.
>>
>> Hence we as the next step proposed a multi-stage quiesce/resume mechanism
>> where the system is aware that it is going into suspend - can block network
>> connections for example - and only returns to a fully functional state after
>> an unquiesce phase:
>>
>> https://github.com/systemd/systemd/issues/20222
>
> The downside of course is precisely that the guest now needs to be aware
> and involved every single time a snapshot is taken.
>
> Currently with virt the act of taking a snapshot can often remain invisible
> to the VM with no functional effect on the guest OS or its workload, and
> the host OS knows it can complete a snapshot in a specific timeframe. That
> said, this transparency to the VM is precisely the cause of the race
> condition described.
>
> With guest involvement to quiesce the bulk of activity for a time period,
> there is more likely to be a negative impact on the guest workload. The
> guest admin likely needs to be more explicit about exactly when in time
> it is reasonable to take a snapshot to mitigate the impact.
>
> The host OS snapshot operations are also now dependent on co-operation
> of a guest OS that has to be considered to be potentially malicious, or
> at least crashed/non-responsive. The guest OS also needs a way to receive
> the triggers for snapshot capture and restore, most likely via an extension
> to something like the QEMU guest agent or an equivalent for other
> hypervisors.

What you describe sounds almost exactly like pressing a power button on
modern systems. You don't just kill the power line; you press a button
and wait for the guest to acknowledge that it's ready.

Maybe the real answer to all of this is S3: suspend to RAM. You press
the suspend button, the guest can prepare for sleep (quiesce!)
and the next time you run, it can check whether VMGenID changed and act
accordingly.

> Despite the above, I'm not against the idea of co-operative involvement
> of the guest OS in the acts of taking & restoring snapshots. I can't
> see any other proposals so far that can reliably eliminate the races
> in the general case, from the kernel right up to user applications.
> So I think it is necessary to have guest cooperative snapshotting.
>
>> What exact use case do you have in mind for the RNG/VMGenID update? Can you
>> think of situations where the race is not an actual concern?
>
> Let's assume we do take the approach described in that systemd bug and
> have a co-operative snapshot process. If the hypervisor does the right
> thing and guest owners install the right things, they'll have a
> race-free solution that works well in normal operation. That's good.
>
> Realistically though, it is never going to be universally and reliably
> put into practice. So what is our attitude to cases where the preferred
> solution isn't available and/or operative?
>
> There are going to be users who continue to build their guest disk images
> without the QEMU guest agent (or equivalent for whatever hypervisor they
> run on) installed because they don't know any better. Or where the guest
> agent is mis-configured or fails to start or some other scenario that
> prevents the quiesce working as desired. The host mgmt could refuse to
> take a snapshot in these cases. More likely is that they are just
> going to go ahead and do a snapshot anyway, because lack of a guest agent
> is a very common scenario today and users want their snapshots.
>
> There are going to be virt management apps / hypervisors that don't
> support talking to any guest agent across their snapshot operation
> in the first place, so systemd gets no way to trigger the required
> quiesce dance on snapshot, but they likely have VMGenID support
> implemented already.
> IOW, I could view VMGenID-triggered fork detection integrated with
> the kernel RNG as providing a backup line of defence that is going
> to "just work", albeit with the known race. It isn't as good as the
> guest co-operative snapshot approach, because it only tries to solve
> the one specific targeted problem of updating the kernel RNG.
>
> Is it still better than doing nothing at all though, for the scenario
> where guest co-operative snapshot is unavailable?
>
> If it is better than nothing, is it then compelling enough to justify
> the maintenance cost of the code added to the kernel?

I'm tempted to say "If it also exposes the VMGenID via sysfs so that you
can actually check whether you were cloned, probably yes."

Alex
On Thu, Feb 24, 2022 at 11:57:34AM +0100, Jason A. Donenfeld wrote:
> On Thu, Feb 24, 2022 at 11:56 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > IIRC this part of the QEMU doc was making an implicit assumption
> > about the way QEMU is to be used by mgmt apps doing snapshots.
> >
> > Instead of using the 'loadvm' command on the existing running QEMU
> > process, the doc seems to tacitly expect the management app will
> > throwaway the existing QEMU process and spawn a brand new QEMU
> > process to load the snapshot into, thus getting the new GUID on
> > the QEMU command line.
>
> Right, exactly. The "there are no known use cases" bit I think just
> forgot about one very common use case that perhaps just wasn't in use
> by the original author. So I'm pretty sure this remains a QEMU bug.
>
> Jason

Quite possibly yes.