Message ID | 20190830084255.55113-2-steven.price@arm.com (mailing list archive)
---|---
State | New, archived
Series | arm64: Stolen time support
On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote: > Introduce a paravirtualization interface for KVM/arm64 based on the > "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A. > > This only adds the details about "Stolen Time" as the details of "Live > Physical Time" have not been fully agreed. > > User space can specify a reserved area of memory for the guest and > inform KVM to populate the memory with information on time that the host > kernel has stolen from the guest. > > A hypercall interface is provided for the guest to interrogate the > hypervisor's support for this interface and the location of the shared > memory structures. > > Signed-off-by: Steven Price <steven.price@arm.com> > --- > Documentation/virt/kvm/arm/pvtime.txt | 64 +++++++++++++++++++++++++ > Documentation/virt/kvm/devices/vcpu.txt | 14 ++++++ > 2 files changed, 78 insertions(+) > create mode 100644 Documentation/virt/kvm/arm/pvtime.txt > > diff --git a/Documentation/virt/kvm/arm/pvtime.txt b/Documentation/virt/kvm/arm/pvtime.txt > new file mode 100644 > index 000000000000..dda3f0f855b9 > --- /dev/null > +++ b/Documentation/virt/kvm/arm/pvtime.txt > @@ -0,0 +1,64 @@ > +Paravirtualized time support for arm64 > +====================================== > + > +Arm specification DEN0057/A defined a standard for paravirtualised time > +support for AArch64 guests: > + > +https://developer.arm.com/docs/den0057/a > + > +KVM/arm64 implements the stolen time part of this specification by providing > +some hypervisor service calls to support a paravirtualized guest obtaining a > +view of the amount of time stolen from its execution. > + > +Two new SMCCC compatible hypercalls are defined: > + > +PV_FEATURES 0xC5000020 > +PV_TIME_ST 0xC5000022 > + > +These are only available in the SMC64/HVC64 calling convention as > +paravirtualized time is not available to 32 bit Arm guests. 
The existence of > +the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES > +mechanism before calling it. > + > +PV_FEATURES > + Function ID: (uint32) : 0xC5000020 > + PV_func_id: (uint32) : Either PV_TIME_LPT or PV_TIME_ST PV_TIME_LPT doesn't exist > + Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant > + PV-time feature is supported by the hypervisor. > + > +PV_TIME_ST > + Function ID: (uint32) : 0xC5000022 > + Return value: (int64) : IPA of the stolen time data structure for this > + VCPU. On failure: > + NOT_SUPPORTED (-1) > + > +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory > +with inner and outer write back caching attributes, in the inner shareable > +domain. A total of 16 bytes from the IPA returned are guaranteed to be > +meaningfully filled by the hypervisor (see structure below). > + > +PV_TIME_ST returns the structure for the calling VCPU. > + > +Stolen Time > +----------- > + > +The structure pointed to by the PV_TIME_ST hypercall is as follows: > + > + Field | Byte Length | Byte Offset | Description > + ----------- | ----------- | ----------- | -------------------------- > + Revision | 4 | 0 | Must be 0 for version 0.1 > + Attributes | 4 | 4 | Must be 0 The above fields don't appear to be exposed to userspace in anyway. How will we handle migration from one KVM with one version of the structure to another? > + Stolen time | 8 | 8 | Stolen time in unsigned > + | | | nanoseconds indicating how > + | | | much time this VCPU thread > + | | | was involuntarily not > + | | | running on a physical CPU. > + > +The structure will be updated by the hypervisor prior to scheduling a VCPU. It > +will be present within a reserved region of the normal memory given to the > +guest. The guest should not attempt to write into this memory. There is a > +structure per VCPU of the guest. Should we provide a recommendation as to how that reserved memory is provided? 
One memslot divided into NR_VCPUS subregions? Should the reserved region be described to the guest kernel with DT/ACPI? Or should userspace ensure the region is not within any DT/ACPI described regions? > + > +For the user space interface see Documentation/virt/kvm/devices/vcpu.txt > +section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". > + > diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt > index 2b5dab16c4f2..896777f76f36 100644 > --- a/Documentation/virt/kvm/devices/vcpu.txt > +++ b/Documentation/virt/kvm/devices/vcpu.txt > @@ -60,3 +60,17 @@ time to use the number provided for a given timer, overwriting any previously > configured values on other VCPUs. Userspace should configure the interrupt > numbers on at least one VCPU after creating all VCPUs and before running any > VCPUs. > + > +3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL > +Architectures: ARM64 > + > +3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_SET_IPA > +Parameters: 64-bit base address > +Returns: -ENXIO: Stolen time not implemented > + -EEXIST: Base address already set for this VCPU > + -EINVAL: Base address not 64 byte aligned > + > +Specifies the base address of the stolen time structure for this VCPU. The > +base address must be 64 byte aligned and exist within a valid guest memory > +region. See Documentation/virt/kvm/arm/pvtime.txt for more information > +including the layout of the stolen time structure. > -- > 2.20.1 > Thanks, drew
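The 16-byte stolen time record described in the patch maps directly onto a C struct. As a rough sketch (the struct and field names here are invented; only the offsets, sizes, and meanings come from the table in pvtime.txt):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Per-VCPU stolen time record per the layout in pvtime.txt.
 * Natural alignment gives exactly the offsets the table specifies:
 * revision at 0, attributes at 4, stolen_time at 8, 16 bytes total.
 */
struct stolen_time_record {
	uint32_t revision;    /* must read 0 for version 0.1 */
	uint32_t attributes;  /* must read 0 */
	uint64_t stolen_time; /* ns this VCPU was involuntarily not running */
};
```

The guest is expected to read `stolen_time` with a single-copy-atomic 64-bit load, since the hypervisor updates the record before each VCPU run.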
On 30/08/2019 15:47, Andrew Jones wrote: > On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote: >> Introduce a paravirtualization interface for KVM/arm64 based on the >> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A. >> >> This only adds the details about "Stolen Time" as the details of "Live >> Physical Time" have not been fully agreed. >> >> User space can specify a reserved area of memory for the guest and >> inform KVM to populate the memory with information on time that the host >> kernel has stolen from the guest. >> >> A hypercall interface is provided for the guest to interrogate the >> hypervisor's support for this interface and the location of the shared >> memory structures. >> >> Signed-off-by: Steven Price <steven.price@arm.com> >> --- >> Documentation/virt/kvm/arm/pvtime.txt | 64 +++++++++++++++++++++++++ >> Documentation/virt/kvm/devices/vcpu.txt | 14 ++++++ >> 2 files changed, 78 insertions(+) >> create mode 100644 Documentation/virt/kvm/arm/pvtime.txt >> >> diff --git a/Documentation/virt/kvm/arm/pvtime.txt b/Documentation/virt/kvm/arm/pvtime.txt >> new file mode 100644 >> index 000000000000..dda3f0f855b9 >> --- /dev/null >> +++ b/Documentation/virt/kvm/arm/pvtime.txt >> @@ -0,0 +1,64 @@ >> +Paravirtualized time support for arm64 >> +====================================== >> + >> +Arm specification DEN0057/A defined a standard for paravirtualised time >> +support for AArch64 guests: >> + >> +https://developer.arm.com/docs/den0057/a >> + >> +KVM/arm64 implements the stolen time part of this specification by providing >> +some hypervisor service calls to support a paravirtualized guest obtaining a >> +view of the amount of time stolen from its execution. >> + >> +Two new SMCCC compatible hypercalls are defined: >> + >> +PV_FEATURES 0xC5000020 >> +PV_TIME_ST 0xC5000022 >> + >> +These are only available in the SMC64/HVC64 calling convention as >> +paravirtualized time is not available to 32 bit Arm guests. 
The existence of >> +the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES >> +mechanism before calling it. >> + >> +PV_FEATURES >> + Function ID: (uint32) : 0xC5000020 >> + PV_func_id: (uint32) : Either PV_TIME_LPT or PV_TIME_ST > > PV_TIME_LPT doesn't exist Thanks, will remove. >> + Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant >> + PV-time feature is supported by the hypervisor. >> + >> +PV_TIME_ST >> + Function ID: (uint32) : 0xC5000022 >> + Return value: (int64) : IPA of the stolen time data structure for this >> + VCPU. On failure: >> + NOT_SUPPORTED (-1) >> + >> +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory >> +with inner and outer write back caching attributes, in the inner shareable >> +domain. A total of 16 bytes from the IPA returned are guaranteed to be >> +meaningfully filled by the hypervisor (see structure below). >> + >> +PV_TIME_ST returns the structure for the calling VCPU. >> + >> +Stolen Time >> +----------- >> + >> +The structure pointed to by the PV_TIME_ST hypercall is as follows: >> + >> + Field | Byte Length | Byte Offset | Description >> + ----------- | ----------- | ----------- | -------------------------- >> + Revision | 4 | 0 | Must be 0 for version 0.1 >> + Attributes | 4 | 4 | Must be 0 > > The above fields don't appear to be exposed to userspace in anyway. How > will we handle migration from one KVM with one version of the structure > to another? Interesting question. User space does have access to them now it is providing the memory, but it's not exactly an easy method. In particular user space has no (simple) way of probing the kernel's supported version. I guess one solution would be to add an extra attribute on the VCPU which would provide the revision information. The current kernel would then reject any revision other than 0, but this could then be extended to support other revision numbers in the future. 
Although there's some logic in saying we could add the extra attribute when(/if) there is a new version. Future kernels would then be expected to use the current version unless user space explicitly set the new attribute. Do you feel this is something that needs to be addressed now, or can it be deferred until another version is proposed? >> + Stolen time | 8 | 8 | Stolen time in unsigned >> + | | | nanoseconds indicating how >> + | | | much time this VCPU thread >> + | | | was involuntarily not >> + | | | running on a physical CPU. >> + >> +The structure will be updated by the hypervisor prior to scheduling a VCPU. It >> +will be present within a reserved region of the normal memory given to the >> +guest. The guest should not attempt to write into this memory. There is a >> +structure per VCPU of the guest. > > Should we provide a recommendation as to how that reserved memory is > provided? One memslot divided into NR_VCPUS subregions? Should the > reserved region be described to the guest kernel with DT/ACPI? Or > should userspace ensure the region is not within any DT/ACPI described > regions? I'm open to providing a recommendation, but I'm not entirely sure I know enough here to provide one. There is an obvious efficiency argument for minimizing memslots with the current code. But if someone has a reason for using multiple memslots then that's probably a good argument for implementing a memslot-caching kvm_put_user() rather than to be dis-recommended. My assumption (and testing) has been with a single memslot divided into NR_VCPUS (or more accurately the number of VCPUs in the VM) subregions. For testing DT I've tested both methods: an explicit reserved region or just ensuring it's not in any DT described region. Both seem reasonable, but it might be easier to integrate into existing migration mechanisms if it's simply a reserved region (then the memory block of the guest is just as it always was). 
For ACPI the situation should be similar, but my testing has been with DT. Thanks, Steve >> + >> +For the user space interface see Documentation/virt/kvm/devices/vcpu.txt >> +section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". >> + >> diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt >> index 2b5dab16c4f2..896777f76f36 100644 >> --- a/Documentation/virt/kvm/devices/vcpu.txt >> +++ b/Documentation/virt/kvm/devices/vcpu.txt >> @@ -60,3 +60,17 @@ time to use the number provided for a given timer, overwriting any previously >> configured values on other VCPUs. Userspace should configure the interrupt >> numbers on at least one VCPU after creating all VCPUs and before running any >> VCPUs. >> + >> +3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL >> +Architectures: ARM64 >> + >> +3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_SET_IPA >> +Parameters: 64-bit base address >> +Returns: -ENXIO: Stolen time not implemented >> + -EEXIST: Base address already set for this VCPU >> + -EINVAL: Base address not 64 byte aligned >> + >> +Specifies the base address of the stolen time structure for this VCPU. The >> +base address must be 64 byte aligned and exist within a valid guest memory >> +region. See Documentation/virt/kvm/arm/pvtime.txt for more information >> +including the layout of the stolen time structure. >> -- >> 2.20.1 >> > > Thanks, > drew > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel >
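The layout Steven describes testing — a single reserved region divided into one subregion per VCPU — can be sketched as below. The helper names and the 64-byte stride are illustrative; the stride follows from the 64-byte alignment that KVM_ARM_VCPU_PVTIME_SET_IPA requires for each base address, even though the record itself is only 16 bytes:

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper: carve one reserved guest-memory region
 * into per-VCPU stolen-time records. A 64-byte stride keeps every base
 * 64-byte aligned, as KVM_ARM_VCPU_PVTIME_SET_IPA requires.
 */
#define PVTIME_STRIDE 64u

static inline uint64_t pvtime_ipa_for_vcpu(uint64_t region_base,
					   unsigned int vcpu_index)
{
	return region_base + (uint64_t)vcpu_index * PVTIME_STRIDE;
}

/* KVM rejects a base that is not 64-byte aligned (-EINVAL). */
static inline int pvtime_base_ok(uint64_t ipa)
{
	return (ipa & (PVTIME_STRIDE - 1)) == 0;
}
```

Userspace would pass each computed IPA to the corresponding VCPU via the KVM_ARM_VCPU_PVTIME_CTRL device attribute after placing the region in a memslot.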
On Fri, Aug 30, 2019 at 04:25:08PM +0100, Steven Price wrote: > On 30/08/2019 15:47, Andrew Jones wrote: > > On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote: > >> Introduce a paravirtualization interface for KVM/arm64 based on the > >> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A. > >> > >> This only adds the details about "Stolen Time" as the details of "Live > >> Physical Time" have not been fully agreed. > >> > >> User space can specify a reserved area of memory for the guest and > >> inform KVM to populate the memory with information on time that the host > >> kernel has stolen from the guest. > >> > >> A hypercall interface is provided for the guest to interrogate the > >> hypervisor's support for this interface and the location of the shared > >> memory structures. > >> > >> Signed-off-by: Steven Price <steven.price@arm.com> > >> --- > >> Documentation/virt/kvm/arm/pvtime.txt | 64 +++++++++++++++++++++++++ > >> Documentation/virt/kvm/devices/vcpu.txt | 14 ++++++ > >> 2 files changed, 78 insertions(+) > >> create mode 100644 Documentation/virt/kvm/arm/pvtime.txt > >> > >> diff --git a/Documentation/virt/kvm/arm/pvtime.txt b/Documentation/virt/kvm/arm/pvtime.txt > >> new file mode 100644 > >> index 000000000000..dda3f0f855b9 > >> --- /dev/null > >> +++ b/Documentation/virt/kvm/arm/pvtime.txt > >> @@ -0,0 +1,64 @@ > >> +Paravirtualized time support for arm64 > >> +====================================== > >> + > >> +Arm specification DEN0057/A defined a standard for paravirtualised time > >> +support for AArch64 guests: > >> + > >> +https://developer.arm.com/docs/den0057/a > >> + > >> +KVM/arm64 implements the stolen time part of this specification by providing > >> +some hypervisor service calls to support a paravirtualized guest obtaining a > >> +view of the amount of time stolen from its execution. 
> >> + > >> +Two new SMCCC compatible hypercalls are defined: > >> + > >> +PV_FEATURES 0xC5000020 > >> +PV_TIME_ST 0xC5000022 > >> + > >> +These are only available in the SMC64/HVC64 calling convention as > >> +paravirtualized time is not available to 32 bit Arm guests. The existence of > >> +the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES > >> +mechanism before calling it. > >> + > >> +PV_FEATURES > >> + Function ID: (uint32) : 0xC5000020 > >> + PV_func_id: (uint32) : Either PV_TIME_LPT or PV_TIME_ST > > > > PV_TIME_LPT doesn't exist > > Thanks, will remove. > > >> + Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant > >> + PV-time feature is supported by the hypervisor. > >> + > >> +PV_TIME_ST > >> + Function ID: (uint32) : 0xC5000022 > >> + Return value: (int64) : IPA of the stolen time data structure for this > >> + VCPU. On failure: > >> + NOT_SUPPORTED (-1) > >> + > >> +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory > >> +with inner and outer write back caching attributes, in the inner shareable > >> +domain. A total of 16 bytes from the IPA returned are guaranteed to be > >> +meaningfully filled by the hypervisor (see structure below). > >> + > >> +PV_TIME_ST returns the structure for the calling VCPU. > >> + > >> +Stolen Time > >> +----------- > >> + > >> +The structure pointed to by the PV_TIME_ST hypercall is as follows: > >> + > >> + Field | Byte Length | Byte Offset | Description > >> + ----------- | ----------- | ----------- | -------------------------- > >> + Revision | 4 | 0 | Must be 0 for version 0.1 > >> + Attributes | 4 | 4 | Must be 0 > > > > The above fields don't appear to be exposed to userspace in anyway. How > > will we handle migration from one KVM with one version of the structure > > to another? > > Interesting question. User space does have access to them now it is > providing the memory, but it's not exactly an easy method. 
In particular > user space has no (simple) way of probing the kernel's supported version. > > I guess one solution would be to add an extra attribute on the VCPU > which would provide the revision information. The current kernel would > then reject any revision other than 0, but this could then be extended > to support other revision numbers in the future. > > Although there's some logic in saying we could add the extra attribute > when(/if) there is a new version. Future kernels would then be expected > to use the current version unless user space explicitly set the new > attribute. > > Do you feel this is something that needs to be addressed now, or can it > be deferred until another version is proposed? Assuming we'll want userspace to have the option of choosing version=0, and that we're fine with version=0 being the implicit choice, when nothing is selected, then I guess it can be left as is for now. If, OTOH, we just want migration to fail when attempting to migrate to another host with an incompatible stolen-time structure (i.e. version=0 is not selectable on hosts that implement later versions), then we should expose the version in some way now. Perhaps a VCPU's "PV config" should be described in a set of pseudo registers? > > >> + Stolen time | 8 | 8 | Stolen time in unsigned > >> + | | | nanoseconds indicating how > >> + | | | much time this VCPU thread > >> + | | | was involuntarily not > >> + | | | running on a physical CPU. > >> + > >> +The structure will be updated by the hypervisor prior to scheduling a VCPU. It > >> +will be present within a reserved region of the normal memory given to the > >> +guest. The guest should not attempt to write into this memory. There is a > >> +structure per VCPU of the guest. > > > > Should we provide a recommendation as to how that reserved memory is > > provided? One memslot divided into NR_VCPUS subregions? Should the > > reserved region be described to the guest kernel with DT/ACPI? 
Or > > should userspace ensure the region is not within any DT/ACPI described > > regions? > > I'm open to providing a recommendation, but I'm not entirely sure I know > enough here to provide one. > > There is an obvious efficiency argument for minimizing memslots with the > current code. But if someone has a reason for using multiple memslots > then that's probably a good argument for implementing a memslot-caching > kvm_put_user() rather than to be dis-recommended. Actually even if a single memslot is used for all the PV structures for all VCPUs, but it's separate from the slot(s) used for main memory, then we'll likely see performance issues with memslot searches (even though it's a binary search). This is because memslots already have caching. The last used slot is stored in the memslots' lru_slot member (the "lru" name is confusing, but it means "last used" somehow). This means we could get thrashing on that slot cache if we're searching for the PV structure memslot on each vcpu load after searching for the main memory slot on each page fault. > > My assumption (and testing) has been with a single memslot divided into > NR_VCPUS (or more accurately the number of VCPUs in the VM) subregions. > > For testing DT I've tested both methods: an explicit reserved region or > just ensuring it's not in any DT described region. Both seem reasonable, > but it might be easier to integrate into existing migration mechanisms > if it's simply a reserved region (then the memory block of the guest is > just as it always was). > > For ACPI the situation should be similar, but my testing has been with DT. I also can't think of any reason why we'd have to describe it in DT/ACPI, but I get this feeling that if we don't, then we'll hit some issue that will make us wish we had... Thanks, drew
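The thrashing drew describes can be seen in a toy model of the memslot search: a sorted slot array with a single cached "last used" index, as in the kernel's lru_slot. Alternating lookups between two different slots (main memory on page faults, a separate PV-structure slot on vcpu load) defeat the one-entry cache and force a binary search each time. All names here are invented for illustration:

```c
#include <stdint.h>

struct toy_slot {
	uint64_t base;
	uint64_t len;
};

struct toy_slots {
	const struct toy_slot *slots; /* sorted by base, non-overlapping */
	int nr;
	int lru;      /* index of the last slot that matched */
	int searches; /* binary searches actually performed */
};

static int toy_find_slot(struct toy_slots *s, uint64_t addr)
{
	const struct toy_slot *c = &s->slots[s->lru];
	int lo = 0, hi = s->nr - 1;

	/* Fast path: the single-entry "last used" cache. */
	if (addr >= c->base && addr < c->base + c->len)
		return s->lru;

	/* Slow path: binary search, then update the cache. */
	s->searches++;
	while (lo <= hi) {
		int mid = (lo + hi) / 2;

		c = &s->slots[mid];
		if (addr < c->base)
			hi = mid - 1;
		else if (addr >= c->base + c->len)
			lo = mid + 1;
		else
			return s->lru = mid;
	}
	return -1;
}
```

With a single slot holding both main memory and the PV structures, the vcpu-load lookup would usually hit the cached index instead.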
On 02/09/2019 13:52, Andrew Jones wrote: > On Fri, Aug 30, 2019 at 04:25:08PM +0100, Steven Price wrote: >> On 30/08/2019 15:47, Andrew Jones wrote: >>> On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote: [...] >>>> + Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant >>>> + PV-time feature is supported by the hypervisor. >>>> + >>>> +PV_TIME_ST >>>> + Function ID: (uint32) : 0xC5000022 >>>> + Return value: (int64) : IPA of the stolen time data structure for this >>>> + VCPU. On failure: >>>> + NOT_SUPPORTED (-1) >>>> + >>>> +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory >>>> +with inner and outer write back caching attributes, in the inner shareable >>>> +domain. A total of 16 bytes from the IPA returned are guaranteed to be >>>> +meaningfully filled by the hypervisor (see structure below). >>>> + >>>> +PV_TIME_ST returns the structure for the calling VCPU. >>>> + >>>> +Stolen Time >>>> +----------- >>>> + >>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows: >>>> + >>>> + Field | Byte Length | Byte Offset | Description >>>> + ----------- | ----------- | ----------- | -------------------------- >>>> + Revision | 4 | 0 | Must be 0 for version 0.1 >>>> + Attributes | 4 | 4 | Must be 0 >>> >>> The above fields don't appear to be exposed to userspace in anyway. How >>> will we handle migration from one KVM with one version of the structure >>> to another? >> >> Interesting question. User space does have access to them now it is >> providing the memory, but it's not exactly an easy method. In particular >> user space has no (simple) way of probing the kernel's supported version. >> >> I guess one solution would be to add an extra attribute on the VCPU >> which would provide the revision information. The current kernel would >> then reject any revision other than 0, but this could then be extended >> to support other revision numbers in the future. 
>> >> Although there's some logic in saying we could add the extra attribute >> when(/if) there is a new version. Future kernels would then be expected >> to use the current version unless user space explicitly set the new >> attribute. >> >> Do you feel this is something that needs to be addressed now, or can it >> be deferred until another version is proposed? > > Assuming we'll want userspace to have the option of choosing version=0, > and that we're fine with version=0 being the implicit choice, when nothing > is selected, then I guess it can be left as is for now. If, OTOH, we just > want migration to fail when attempting to migrate to another host with > an incompatible stolen-time structure (i.e. version=0 is not selectable > on hosts that implement later versions), then we should expose the version > in some way now. Perhaps a VCPU's "PV config" should be described in a > set of pseudo registers? I wouldn't have thought making migration fail if/when the host upgrades to a new version would be particularly helpful - we'd want to provide backwards compatibility. In particular for the suspend/resume case (I want to be able to save my VM to disk, upgrade the host kernel and then resume the VM). The only potential issue I see is the implicit "version=0 if not specified". That seems solvable by rejecting setting the stolen time base address if no version has been specified and the host kernel doesn't support version=0. >> >>>> + Stolen time | 8 | 8 | Stolen time in unsigned >>>> + | | | nanoseconds indicating how >>>> + | | | much time this VCPU thread >>>> + | | | was involuntarily not >>>> + | | | running on a physical CPU. >>>> + >>>> +The structure will be updated by the hypervisor prior to scheduling a VCPU. It >>>> +will be present within a reserved region of the normal memory given to the >>>> +guest. The guest should not attempt to write into this memory. There is a >>>> +structure per VCPU of the guest. 
>>> >>> Should we provide a recommendation as to how that reserved memory is >>> provided? One memslot divided into NR_VCPUS subregions? Should the >>> reserved region be described to the guest kernel with DT/ACPI? Or >>> should userspace ensure the region is not within any DT/ACPI described >>> regions? >> >> I'm open to providing a recommendation, but I'm not entirely sure I know >> enough here to provide one. >> >> There is an obvious efficiency argument for minimizing memslots with the >> current code. But if someone has a reason for using multiple memslots >> then that's probably a good argument for implementing a memslot-caching >> kvm_put_user() rather than to be dis-recommended. > > Actually even if a single memslot is used for all the PV structures for > all VCPUs, but it's separate from the slot(s) used for main memory, then > we'll likely see performance issues with memslot searches (even though > it's a binary search). This is because memslots already have caching. The > last used slot is stored in the memslots' lru_slot member (the "lru" name > is confusing, but it means "last used" somehow). This means we could get > thrashing on that slot cache if we're searching for the PV structure > memslot on each vcpu load after searching for the main memory slot on each > page fault. True - a dedicated memslot for stolen time wouldn't be great if a VM is needing to fault pages (which would obviously be in a different memslot). I don't have a good idea of the overhead of missing in the lru_slot cache. The main reason I stopped using a dedicated cache was because I discovered that my initial implementation using kvm_write_guest_offset_cached() (which wasn't single-copy atomic safe) was actually failing to use the cache because the buffer crossed a page boundary (see __kvm_gfn_to_hva_cache_init()). So switching away from the "_cached" variant was actually avoiding the extra walks of the memslots. I can look at reintroducing the caching for kvm_put_guest(). 
>> >> My assumption (and testing) has been with a single memslot divided into >> NR_VCPUS (or more accurately the number of VCPUs in the VM) subregions. >> >> For testing DT I've tested both methods: an explicit reserved region or >> just ensuring it's not in any DT described region. Both seem reasonable, >> but it might be easier to integrate into existing migration mechanisms >> if it's simply a reserved region (then the memory block of the guest is >> just as it always was). >> >> For ACPI the situation should be similar, but my testing has been with DT. > > I also can't think of any reason why we'd have to describe it in DT/ACPI, > but I get this feeling that if we don't, then we'll hit some issue that > will make us wish we had... Without knowing why we need it it's hard to justify what should go in the bindings. But the idea of having the hypercalls is that the description is returned via hypercalls rather than explicitly in DT/ACPI. In theory we wouldn't need the hypercalls if it was fully described in DT/ACPI. Steve
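The restriction Steven hit in __kvm_gfn_to_hva_cache_init is that a gfn_to_hva cache only covers a single guest page, so a buffer straddling a page boundary cannot take the cached path. A simplified model of that check (4K pages assumed; the real function also handles the multi-page fallback):

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* 4K pages assumed for illustration */

/*
 * Returns nonzero if [gpa, gpa + len) straddles a page boundary,
 * in which case the single-page hva cache cannot be used.
 */
static inline int crosses_page(uint64_t gpa, uint64_t len)
{
	uint64_t start_gfn = gpa >> TOY_PAGE_SHIFT;
	uint64_t end_gfn = (gpa + len - 1) >> TOY_PAGE_SHIFT;

	return start_gfn != end_gfn;
}
```

Note that with the 64-byte alignment KVM_ARM_VCPU_PVTIME_SET_IPA enforces, a 16-byte stolen-time record can never straddle a page, so a reintroduced cache for kvm_put_guest() would not hit this case for the layout in this series.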
On Wed, Sep 04, 2019 at 02:55:15PM +0100, Steven Price wrote: > On 02/09/2019 13:52, Andrew Jones wrote: > > On Fri, Aug 30, 2019 at 04:25:08PM +0100, Steven Price wrote: > >> On 30/08/2019 15:47, Andrew Jones wrote: > >>> On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote: > [...] > >>>> + Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant > >>>> + PV-time feature is supported by the hypervisor. > >>>> + > >>>> +PV_TIME_ST > >>>> + Function ID: (uint32) : 0xC5000022 > >>>> + Return value: (int64) : IPA of the stolen time data structure for this > >>>> + VCPU. On failure: > >>>> + NOT_SUPPORTED (-1) > >>>> + > >>>> +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory > >>>> +with inner and outer write back caching attributes, in the inner shareable > >>>> +domain. A total of 16 bytes from the IPA returned are guaranteed to be > >>>> +meaningfully filled by the hypervisor (see structure below). > >>>> + > >>>> +PV_TIME_ST returns the structure for the calling VCPU. > >>>> + > >>>> +Stolen Time > >>>> +----------- > >>>> + > >>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows: > >>>> + > >>>> + Field | Byte Length | Byte Offset | Description > >>>> + ----------- | ----------- | ----------- | -------------------------- > >>>> + Revision | 4 | 0 | Must be 0 for version 0.1 > >>>> + Attributes | 4 | 4 | Must be 0 > >>> > >>> The above fields don't appear to be exposed to userspace in anyway. How > >>> will we handle migration from one KVM with one version of the structure > >>> to another? > >> > >> Interesting question. User space does have access to them now it is > >> providing the memory, but it's not exactly an easy method. In particular > >> user space has no (simple) way of probing the kernel's supported version. > >> > >> I guess one solution would be to add an extra attribute on the VCPU > >> which would provide the revision information. 
The current kernel would > >> then reject any revision other than 0, but this could then be extended > >> to support other revision numbers in the future. > >> > >> Although there's some logic in saying we could add the extra attribute > >> when(/if) there is a new version. Future kernels would then be expected > >> to use the current version unless user space explicitly set the new > >> attribute. > >> > >> Do you feel this is something that needs to be addressed now, or can it > >> be deferred until another version is proposed? > > > > Assuming we'll want userspace to have the option of choosing version=0, > > and that we're fine with version=0 being the implicit choice, when nothing > > is selected, then I guess it can be left as is for now. If, OTOH, we just > > want migration to fail when attempting to migrate to another host with > > an incompatible stolen-time structure (i.e. version=0 is not selectable > > on hosts that implement later versions), then we should expose the version > > in some way now. Perhaps a VCPU's "PV config" should be described in a > > set of pseudo registers? > > I wouldn't have thought making migration fail if/when the host upgrades > to a new version would be particularly helpful - we'd want to provide > backwards compatibility. In particular for the suspend/resume case (I > want to be able to save my VM to disk, upgrade the host kernel and then > resume the VM). > > The only potential issue I see is the implicit "version=0 if not > specified". That seems solvable by rejecting setting the stolen time > base address if no version has been specified and the host kernel > doesn't support version=0. I think that's the same failure I was trying avoid by failing the migration instead. Maybe it's equivalent to fail at this vcpu-ioctl time though? 
> > >> > >>>> + Stolen time | 8 | 8 | Stolen time in unsigned > >>>> + | | | nanoseconds indicating how > >>>> + | | | much time this VCPU thread > >>>> + | | | was involuntarily not > >>>> + | | | running on a physical CPU. > >>>> + > >>>> +The structure will be updated by the hypervisor prior to scheduling a VCPU. It > >>>> +will be present within a reserved region of the normal memory given to the > >>>> +guest. The guest should not attempt to write into this memory. There is a > >>>> +structure per VCPU of the guest. > >>> > >>> Should we provide a recommendation as to how that reserved memory is > >>> provided? One memslot divided into NR_VCPUS subregions? Should the > >>> reserved region be described to the guest kernel with DT/ACPI? Or > >>> should userspace ensure the region is not within any DT/ACPI described > >>> regions? > >> > >> I'm open to providing a recommendation, but I'm not entirely sure I know > >> enough here to provide one. > >> > >> There is an obvious efficiency argument for minimizing memslots with the > >> current code. But if someone has a reason for using multiple memslots > >> then that's probably a good argument for implementing a memslot-caching > >> kvm_put_user() rather than to be dis-recommended. > > > > Actually even if a single memslot is used for all the PV structures for > > all VCPUs, but it's separate from the slot(s) used for main memory, then > > we'll likely see performance issues with memslot searches (even though > > it's a binary search). This is because memslots already have caching. The > > last used slot is stored in the memslots' lru_slot member (the "lru" name > > is confusing, but it means "last used" somehow). This means we could get > > thrashing on that slot cache if we're searching for the PV structure > > memslot on each vcpu load after searching for the main memory slot on each > > page fault. 
>
> True - a dedicated memslot for stolen time wouldn't be great if a VM is
> needing to fault pages (which would obviously be in a different
> memslot). I don't have a good idea of the overhead of missing in the
> lru_slot cache. The main reason I stopped using a dedicated cache was
> because I discovered that my initial implementation using
> kvm_write_guest_offset_cached() (which wasn't single-copy atomic safe)
> was actually failing to use the cache because the buffer crossed a page
> boundary (see __kvm_gfn_to_hva_cache_init()). So switching away from the
> "_cached" variant was actually avoiding the extra walks of the memslots.
>
> I can look at reintroducing the caching for kvm_put_guest().
>
> >>
> >> My assumption (and testing) has been with a single memslot divided into
> >> NR_VCPUS (or more accurately the number of VCPUs in the VM) subregions.
> >>
> >> For testing DT I've tested both methods: an explicit reserved region or
> >> just ensuring it's not in any DT described region. Both seem reasonable,
> >> but it might be easier to integrate into existing migration mechanisms
> >> if it's simply a reserved region (then the memory block of the guest is
> >> just as it always was).
> >>
> >> For ACPI the situation should be similar, but my testing has been with DT.
> >
> > I also can't think of any reason why we'd have to describe it in DT/ACPI,
> > but I get this feeling that if we don't, then we'll hit some issue that
> > will make us wish we had...
>
> Without knowing why we need it it's hard to justify what should go in
> the bindings. But the idea of having the hypercalls is that the
> description is returned via hypercalls rather than explicitly in
> DT/ACPI. In theory we wouldn't need the hypercalls if it was fully
> described in DT/ACPI.
>

Fair enough

Thanks,
drew
On 04/09/2019 15:22, Andrew Jones wrote:
> On Wed, Sep 04, 2019 at 02:55:15PM +0100, Steven Price wrote:
>> On 02/09/2019 13:52, Andrew Jones wrote:
>>> On Fri, Aug 30, 2019 at 04:25:08PM +0100, Steven Price wrote:
>>>> On 30/08/2019 15:47, Andrew Jones wrote:
>>>>> On Fri, Aug 30, 2019 at 09:42:46AM +0100, Steven Price wrote:
>> [...]
>>>>>> +  Return value: (int32) : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>>>>>> +                          PV-time feature is supported by the hypervisor.
>>>>>> +
>>>>>> +PV_TIME_ST
>>>>>> +  Function ID:  (uint32) : 0xC5000022
>>>>>> +  Return value: (int64)  : IPA of the stolen time data structure for this
>>>>>> +                           VCPU. On failure:
>>>>>> +                           NOT_SUPPORTED (-1)
>>>>>> +
>>>>>> +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory
>>>>>> +with inner and outer write back caching attributes, in the inner shareable
>>>>>> +domain. A total of 16 bytes from the IPA returned are guaranteed to be
>>>>>> +meaningfully filled by the hypervisor (see structure below).
>>>>>> +
>>>>>> +PV_TIME_ST returns the structure for the calling VCPU.
>>>>>> +
>>>>>> +Stolen Time
>>>>>> +-----------
>>>>>> +
>>>>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>>>>>> +
>>>>>> +  Field       | Byte Length | Byte Offset | Description
>>>>>> +  ----------- | ----------- | ----------- | --------------------------
>>>>>> +  Revision    | 4           | 0           | Must be 0 for version 0.1
>>>>>> +  Attributes  | 4           | 4           | Must be 0
>>>>>
>>>>> The above fields don't appear to be exposed to userspace in any way. How
>>>>> will we handle migration from one KVM with one version of the structure
>>>>> to another?
>>>>
>>>> Interesting question. User space does have access to them now it is
>>>> providing the memory, but it's not exactly an easy method. In particular
>>>> user space has no (simple) way of probing the kernel's supported version.
>>>>
>>>> I guess one solution would be to add an extra attribute on the VCPU
>>>> which would provide the revision information.
>>>> The current kernel would
>>>> then reject any revision other than 0, but this could then be extended
>>>> to support other revision numbers in the future.
>>>>
>>>> Although there's some logic in saying we could add the extra attribute
>>>> when(/if) there is a new version. Future kernels would then be expected
>>>> to use the current version unless user space explicitly set the new
>>>> attribute.
>>>>
>>>> Do you feel this is something that needs to be addressed now, or can it
>>>> be deferred until another version is proposed?
>>>
>>> Assuming we'll want userspace to have the option of choosing version=0,
>>> and that we're fine with version=0 being the implicit choice, when nothing
>>> is selected, then I guess it can be left as is for now. If, OTOH, we just
>>> want migration to fail when attempting to migrate to another host with
>>> an incompatible stolen-time structure (i.e. version=0 is not selectable
>>> on hosts that implement later versions), then we should expose the version
>>> in some way now. Perhaps a VCPU's "PV config" should be described in a
>>> set of pseudo registers?
>>
>> I wouldn't have thought making migration fail if/when the host upgrades
>> to a new version would be particularly helpful - we'd want to provide
>> backwards compatibility. In particular for the suspend/resume case (I
>> want to be able to save my VM to disk, upgrade the host kernel and then
>> resume the VM).
>>
>> The only potential issue I see is the implicit "version=0 if not
>> specified". That seems solvable by rejecting setting the stolen time
>> base address if no version has been specified and the host kernel
>> doesn't support version=0.
>
> I think that's the same failure I was trying to avoid by failing the
> migration instead. Maybe it's equivalent to fail at this vcpu-ioctl
> time though?

Yes this is effectively the same failure. But since we require the
vcpu-ioctl to enable stolen time this gives an appropriate place to
fail.
Indeed this is the failure if migrating from a host with these patches
to one running an existing kernel with no stolen time support.

Steve
diff --git a/Documentation/virt/kvm/arm/pvtime.txt b/Documentation/virt/kvm/arm/pvtime.txt
new file mode 100644
index 000000000000..dda3f0f855b9
--- /dev/null
+++ b/Documentation/virt/kvm/arm/pvtime.txt
@@ -0,0 +1,64 @@
+Paravirtualized time support for arm64
+======================================
+
+Arm specification DEN0057/A defined a standard for paravirtualised time
+support for AArch64 guests:
+
+https://developer.arm.com/docs/den0057/a
+
+KVM/arm64 implements the stolen time part of this specification by providing
+some hypervisor service calls to support a paravirtualized guest obtaining a
+view of the amount of time stolen from its execution.
+
+Two new SMCCC compatible hypercalls are defined:
+
+PV_FEATURES 0xC5000020
+PV_TIME_ST  0xC5000022
+
+These are only available in the SMC64/HVC64 calling convention as
+paravirtualized time is not available to 32 bit Arm guests. The existence of
+the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES
+mechanism before calling it.
+
+PV_FEATURES
+  Function ID:  (uint32) : 0xC5000020
+  PV_func_id:   (uint32) : Either PV_TIME_LPT or PV_TIME_ST
+  Return value: (int32)  : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
+                           PV-time feature is supported by the hypervisor.
+
+PV_TIME_ST
+  Function ID:  (uint32) : 0xC5000022
+  Return value: (int64)  : IPA of the stolen time data structure for this
+                           VCPU. On failure:
+                           NOT_SUPPORTED (-1)
+
+The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory
+with inner and outer write back caching attributes, in the inner shareable
+domain. A total of 16 bytes from the IPA returned are guaranteed to be
+meaningfully filled by the hypervisor (see structure below).
+
+PV_TIME_ST returns the structure for the calling VCPU.
+
+Stolen Time
+-----------
+
+The structure pointed to by the PV_TIME_ST hypercall is as follows:
+
+  Field       | Byte Length | Byte Offset | Description
+  ----------- | ----------- | ----------- | --------------------------
+  Revision    | 4           | 0           | Must be 0 for version 0.1
+  Attributes  | 4           | 4           | Must be 0
+  Stolen time | 8           | 8           | Stolen time in unsigned
+              |             |             | nanoseconds indicating how
+              |             |             | much time this VCPU thread
+              |             |             | was involuntarily not
+              |             |             | running on a physical CPU.
+
+The structure will be updated by the hypervisor prior to scheduling a VCPU. It
+will be present within a reserved region of the normal memory given to the
+guest. The guest should not attempt to write into this memory. There is a
+structure per VCPU of the guest.
+
+For the user space interface see Documentation/virt/kvm/devices/vcpu.txt
+section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL".
+
diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt
index 2b5dab16c4f2..896777f76f36 100644
--- a/Documentation/virt/kvm/devices/vcpu.txt
+++ b/Documentation/virt/kvm/devices/vcpu.txt
@@ -60,3 +60,17 @@ time to use the number provided for a given timer, overwriting any previously
 configured values on other VCPUs.  Userspace should configure the interrupt
 numbers on at least one VCPU after creating all VCPUs and before running any
 VCPUs.
+
+3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
+Architectures: ARM64
+
+3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_SET_IPA
+Parameters: 64-bit base address
+Returns: -ENXIO:  Stolen time not implemented
+         -EEXIST: Base address already set for this VCPU
+         -EINVAL: Base address not 64 byte aligned
+
+Specifies the base address of the stolen time structure for this VCPU. The
+base address must be 64 byte aligned and exist within a valid guest memory
+region. See Documentation/virt/kvm/arm/pvtime.txt for more information
+including the layout of the stolen time structure.
Introduce a paravirtualization interface for KVM/arm64 based on the
"Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.

This only adds the details about "Stolen Time" as the details of "Live
Physical Time" have not been fully agreed.

User space can specify a reserved area of memory for the guest and
inform KVM to populate the memory with information on time that the host
kernel has stolen from the guest.

A hypercall interface is provided for the guest to interrogate the
hypervisor's support for this interface and the location of the shared
memory structures.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virt/kvm/arm/pvtime.txt   | 64 +++++++++++++++++++++++++
 Documentation/virt/kvm/devices/vcpu.txt | 14 ++++++
 2 files changed, 78 insertions(+)
 create mode 100644 Documentation/virt/kvm/arm/pvtime.txt