| Message ID | 20250319-hybrid-v1-1-4d1ada10e705@daynix.com (mailing list archive) |
|---|---|
| State | New |
| Series | [RFC] KVM: arm64: PMU: Use multiple host PMUs |
Hi Akihiko,

On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote:
> Problem
> -------
>
> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows:
> > The observant among you will notice that the supported_cpus
> > mask does not get updated for the default PMU even though it
> > is quite possible the selected instance supports only a
> > subset of cores in the system. This is intentional, and
> > upholds the preexisting behavior on heterogeneous systems
> > where vCPUs can be scheduled on any core but the guest
> > counters could stop working.
>
> Despite the reference manual says counters may not continuously
> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0
> and crashes with a division-by-zero error and it also crashes when the
> PMU is not present.
>
> To avoid such a problem, the userspace should pin the vCPU threads to
> pCPUs supported by one host PMU when initializing the vCPUs or specify
> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the
> initialization. However, QEMU/libvirt can pin vCPU threads only after the
> vCPUs are initialized. It also limits the pCPUs the guest can use even
> for VMMs that support proper pinning.
>
> Solution
> --------
>
> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt
> should support pinning better, but neither of them are going to happen
> anytime soon.
>
> To allow running Windows on QEMU/libvirt or with heterogeneous cores,
> combine all host PMUs necessary to cover the cores vCPUs can run and
> keep PMCCNTR_EL0 working.

I'm extremely uneasy about making this a generalized solution. PMUs are
deeply tied to the microarchitecture of a particular implementation, and
that isn't something we can abstract away from the guest in KVM.

For example, you could have an event ID that counts on only a subset of
cores, or better yet an event that counts something completely different
depending on where a vCPU lands.

I do appreciate the issue that you're trying to solve.

The good news though is that the fixed PMU cycle counter is the only
thing guaranteed to be present in any PMUv3 implementation. Since
that's the only counter Windows actually needs, perhaps we could
special-case this in KVM.

I have the following (completely untested) patch, do you want to give it
a try? There's still going to be observable differences between PMUs
(e.g. CPU frequency) but at least it should get things booting.

Thanks,
Oliver

---
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index a1bc10d7116a..913a7bab50b5 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -724,14 +724,21 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
 		return;
 
 	memset(&attr, 0, sizeof(struct perf_event_attr));
-	attr.type = arm_pmu->pmu.type;
+
+	if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
+		attr.type = PERF_TYPE_HARDWARE;
+		attr.config = PERF_COUNT_HW_CPU_CYCLES;
+	} else {
+		attr.type = arm_pmu->pmu.type;
+		attr.config = eventsel;
+	}
+
 	attr.size = sizeof(attr);
 	attr.pinned = 1;
 	attr.disabled = !kvm_pmu_counter_is_enabled(pmc);
 	attr.exclude_user = !kvm_pmc_counts_at_el0(pmc);
 	attr.exclude_hv = 1; /* Don't count EL2 events */
 	attr.exclude_host = 1; /* Don't count host events */
-	attr.config = eventsel;
 
 	/*
 	 * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the
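To make the distinction the patch above relies on concrete, the following standalone userspace sketch (not kernel code, minimal error handling) opens a task-bound cycle counter through the generic PERF_TYPE_HARDWARE type, letting the perf core resolve the event instead of naming a specific PMU's dynamic type. Whether such a task-bound event keeps counting when the task migrates across heterogeneous cores is exactly what the next reply questions.

/*
 * Minimal sketch: open a task-bound cycle counter via the generic
 * PERF_TYPE_HARDWARE type and read it back. Illustration only.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;		/* generic type ... */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;	/* ... resolved by the perf core */

	/* pid = 0, cpu = -1: follow this task on whatever CPU it runs on */
	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	for (volatile unsigned long i = 0; i < (1UL << 22); i++)
		;	/* burn some cycles */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cycles: %llu\n", (unsigned long long)count);

	close(fd);
	return 0;
}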
On 2025/03/19 16:34, Oliver Upton wrote: > Hi Akihiko, > > On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: >> Problem >> ------- >> >> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: >>> The observant among you will notice that the supported_cpus >>> mask does not get updated for the default PMU even though it >>> is quite possible the selected instance supports only a >>> subset of cores in the system. This is intentional, and >>> upholds the preexisting behavior on heterogeneous systems >>> where vCPUs can be scheduled on any core but the guest >>> counters could stop working. >> >> Despite the reference manual says counters may not continuously >> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 >> and crashes with a division-by-zero error and it also crashes when the >> PMU is not present. >> >> To avoid such a problem, the userspace should pin the vCPU threads to >> pCPUs supported by one host PMU when initializing the vCPUs or specify >> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the >> initialization. However, QEMU/libvirt can pin vCPU threads only after the >> vCPUs are initialized. It also limits the pCPUs the guest can use even >> for VMMs that support proper pinning. >> >> Solution >> -------- >> >> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt >> should support pinning better, but neither of them are going to happen >> anytime soon. >> >> To allow running Windows on QEMU/libvirt or with heterogeneous cores, >> combine all host PMUs necessary to cover the cores vCPUs can run and >> keep PMCCNTR_EL0 working. > > I'm extremely uneasy about making this a generalized solution. PMUs are > deeply tied to the microarchitecture of a particular implementation, and > that isn't something we can abstract away from the guest in KVM. > > For example, you could have an event ID that counts on only a subset of > cores, or better yet an event that counts something completely different > depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. > > The good news though is that the fixed PMU cycle counter is the only > thing guaranteed to be present in any PMUv3 implementation. Since > that's the only counter Windows actually needs, perhaps we could > special-case this in KVM. > > I have the following (completely untested) patch, do you want to give it > a try? There's still going to be observable differences between PMUs > (e.g. CPU frequency) but at least it should get things booting. I don't think it will work, unfortunately. perf_init_event() binds a perf event to a particular PMU so the event will stop working when the thread migrates away. It should also be the reason why the perf program creates an event for each PMU. tools/perf/Documentation/intel-hybrid.txt has more descriptions. Allowing to enable more than one counter and/or an event type other than the cycle counter is not the goal. Enabling another event type may result in a garbage value, but I don't think it's worse than the current situation where the count stays zero; please tell me if I miss something. There is still room for improvement. Returning a garbage value may not be worse than returning zero, but counters and event types not supported by some cores shouldn't be advertised as available in the first place. More concretely: - The vCPU should be limited to run only on cores covered by PMUs when KVM_ARM_VCPU_PMU_V3 is set. 
- PMCR_EL0.N advertised to the guest should be the minimum of ones of host PMUs. - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the result of the AND operations of ones of host PMUs. Special-casing the cycle counter may make sense if we are going to fix the advertised values of PMCR_EL0.N, PMCEID0_EL0, and PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these registers. We can also prevent enabling a counter that returns zero or a garbage value. Do you think it's worth fixing these registers? If so, I'll do that by special-casing the cycle counter. Regards, Akihiko Odaki > > Thanks, > Oliver > > --- > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c > index a1bc10d7116a..913a7bab50b5 100644 > --- a/arch/arm64/kvm/pmu-emul.c > +++ b/arch/arm64/kvm/pmu-emul.c > @@ -724,14 +724,21 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) > return; > > memset(&attr, 0, sizeof(struct perf_event_attr)); > - attr.type = arm_pmu->pmu.type; > + > + if (pmc->idx == ARMV8_PMU_CYCLE_IDX) { > + attr.type = PERF_TYPE_HARDWARE; > + attr.config = PERF_COUNT_HW_CPU_CYCLES; > + } else { > + attr.type = arm_pmu->pmu.type; > + attr.config = eventsel; > + } > + > attr.size = sizeof(attr); > attr.pinned = 1; > attr.disabled = !kvm_pmu_counter_is_enabled(pmc); > attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); > attr.exclude_hv = 1; /* Don't count EL2 events */ > attr.exclude_host = 1; /* Don't count host events */ > - attr.config = eventsel; > > /* > * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the
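As an illustration of the combination rule sketched in the bullets above, the following standalone C program (hypothetical struct, not KVM code, made-up register values) derives what would be advertised to the guest: the minimum PMCR_EL0.N and the bitwise AND of the PMCEID0_EL0/PMCEID1_EL0 bitmaps across the host PMUs a vCPU may run under.

/* Illustration of "minimum N, AND of PMCEIDx" across host PMUs. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct host_pmu_desc {		/* hypothetical, for illustration */
	unsigned int nr_counters;	/* PMCR_EL0.N of this PMU */
	uint64_t pmceid0;		/* PMCEID0_EL0 */
	uint64_t pmceid1;		/* PMCEID1_EL0 */
};

/* Assumes n >= 1. */
static void combine_pmus(const struct host_pmu_desc *pmus, size_t n,
			 struct host_pmu_desc *out)
{
	*out = pmus[0];

	for (size_t i = 1; i < n; i++) {
		if (pmus[i].nr_counters < out->nr_counters)
			out->nr_counters = pmus[i].nr_counters;
		out->pmceid0 &= pmus[i].pmceid0;	/* keep only common events */
		out->pmceid1 &= pmus[i].pmceid1;
	}
}

int main(void)
{
	/* Made-up example values for two heterogeneous host PMUs. */
	const struct host_pmu_desc pmus[] = {
		{ .nr_counters = 6, .pmceid0 = 0x7fff0f3f, .pmceid1 = 0x00f2 },
		{ .nr_counters = 4, .pmceid0 = 0x3fff0f3f, .pmceid1 = 0x0032 },
	};
	struct host_pmu_desc combined;

	combine_pmus(pmus, 2, &combined);
	printf("N=%u PMCEID0=%#llx PMCEID1=%#llx\n", combined.nr_counters,
	       (unsigned long long)combined.pmceid0,
	       (unsigned long long)combined.pmceid1);
	return 0;
}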
On Wed, 19 Mar 2025 08:37:29 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2025/03/19 16:34, Oliver Upton wrote: > > Hi Akihiko, > > > > On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: > >> Problem > >> ------- > >> > >> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: > >>> The observant among you will notice that the supported_cpus > >>> mask does not get updated for the default PMU even though it > >>> is quite possible the selected instance supports only a > >>> subset of cores in the system. This is intentional, and > >>> upholds the preexisting behavior on heterogeneous systems > >>> where vCPUs can be scheduled on any core but the guest > >>> counters could stop working. > >> > >> Despite the reference manual says counters may not continuously > >> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 > >> and crashes with a division-by-zero error and it also crashes when the > >> PMU is not present. > >> > >> To avoid such a problem, the userspace should pin the vCPU threads to > >> pCPUs supported by one host PMU when initializing the vCPUs or specify > >> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the > >> initialization. However, QEMU/libvirt can pin vCPU threads only after the > >> vCPUs are initialized. It also limits the pCPUs the guest can use even > >> for VMMs that support proper pinning. > >> > >> Solution > >> -------- > >> > >> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt > >> should support pinning better, but neither of them are going to happen > >> anytime soon. > >> > >> To allow running Windows on QEMU/libvirt or with heterogeneous cores, > >> combine all host PMUs necessary to cover the cores vCPUs can run and > >> keep PMCCNTR_EL0 working. > > > I'm extremely uneasy about making this a generalized solution. PMUs are > > deeply tied to the microarchitecture of a particular implementation, and > > that isn't something we can abstract away from the guest in KVM. > > > > For example, you could have an event ID that counts on only a subset of > > cores, or better yet an event that counts something completely different > > depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. > > > > The good news though is that the fixed PMU cycle counter is the only > > thing guaranteed to be present in any PMUv3 implementation. Since > > that's the only counter Windows actually needs, perhaps we could > > special-case this in KVM. > > > > I have the following (completely untested) patch, do you want to give it > > a try? There's still going to be observable differences between PMUs > > (e.g. CPU frequency) but at least it should get things booting. > > I don't think it will work, unfortunately. perf_init_event() binds a > perf event to a particular PMU so the event will stop working when the > thread migrates away. > > It should also be the reason why the perf program creates an event for > each PMU. tools/perf/Documentation/intel-hybrid.txt has more > descriptions. But perf on non-Intel behaves pretty differently. ARM PMUs behaves pretty differently, because there is no guarantee of homogeneous events. > > Allowing to enable more than one counter and/or an event type other > than the cycle counter is not the goal. Enabling another event type > may result in a garbage value, but I don't think it's worse than the > current situation where the count stays zero; please tell me if I miss > something. 
> > There is still room for improvement. Returning a garbage value may not > be worse than returning zero, but counters and event types not > supported by some cores shouldn't be advertised as available in the > first place. More concretely: > > - The vCPU should be limited to run only on cores covered by PMUs when > KVM_ARM_VCPU_PMU_V3 is set. That's userspace's job. Bind to the desired PMU, and run. KVM will actively prevent you from running on the wrong CPU. > - PMCR_EL0.N advertised to the guest should be the minimum of ones of > host PMUs. How do you find out? CPUs can be hot-plugged on long after a VM has started, bringing in a new PMU, with a different number of counters. > - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the > result of the AND operations of ones of host PMUs. Same problem. > > Special-casing the cycle counter may make sense if we are going to fix > the advertised values of PMCR_EL0.N, PMCEID0_EL0, and > PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these > registers. We can also prevent enabling a counter that returns zero or > a garbage value. > > Do you think it's worth fixing these registers? If so, I'll do that by > special-casing the cycle counter. I think this is really going in the wrong direction. The whole design of the PMU emulation is that we expose a single, architecturally correct PMU implementation. This is clearly documented. Furthermore, userspace is being given all the relevant information to place vcpus on the correct physical CPUs. Why should we add this sort of hack in the kernel, creating a new userspace ABI that we will have to support forever, when usespace can do the correct thing right now? Worse case, this is just a 'taskset' away, and everything will work. Frankly, I'm not prepared to add more hacks to KVM for the sake of the combination of broken userspace and broken guest. M.
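For reference, the userspace-side flow Marc describes ("bind to the desired PMU, and run") looks roughly like the sketch below on arm64: read the chosen PMU's perf type from sysfs, hand it to KVM via the KVM_ARM_VCPU_PMU_V3_SET_PMU vcpu attribute, and pin the vCPU thread to a CPU served by that PMU. The vcpu_fd, PMU name, and CPU number are placeholders; error handling is minimal.

/* Sketch: select one host PMU for a vCPU and pin the vCPU thread. */
#define _GNU_SOURCE
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sched.h>
#include <stdio.h>

static int set_vcpu_pmu(int vcpu_fd, const char *pmu_name)
{
	char path[256];
	int pmu_type;
	FILE *f;

	/* The identifier KVM expects is the perf "type" of the host PMU. */
	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/type", pmu_name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &pmu_type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
		.addr  = (__u64)(unsigned long)&pmu_type,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

static int pin_vcpu_thread(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0: the calling (vCPU) thread */
	return sched_setaffinity(0, sizeof(set), &set);
}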
On 2025/03/19 18:47, Marc Zyngier wrote: > On Wed, 19 Mar 2025 08:37:29 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2025/03/19 16:34, Oliver Upton wrote: >>> Hi Akihiko, >>> >>> On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote: >>>> Problem >>>> ------- >>>> >>>> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows: >>>>> The observant among you will notice that the supported_cpus >>>>> mask does not get updated for the default PMU even though it >>>>> is quite possible the selected instance supports only a >>>>> subset of cores in the system. This is intentional, and >>>>> upholds the preexisting behavior on heterogeneous systems >>>>> where vCPUs can be scheduled on any core but the guest >>>>> counters could stop working. >>>> >>>> Despite the reference manual says counters may not continuously >>>> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0 >>>> and crashes with a division-by-zero error and it also crashes when the >>>> PMU is not present. >>>> >>>> To avoid such a problem, the userspace should pin the vCPU threads to >>>> pCPUs supported by one host PMU when initializing the vCPUs or specify >>>> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the >>>> initialization. However, QEMU/libvirt can pin vCPU threads only after the >>>> vCPUs are initialized. It also limits the pCPUs the guest can use even >>>> for VMMs that support proper pinning. >>>> >>>> Solution >>>> -------- >>>> >>>> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt >>>> should support pinning better, but neither of them are going to happen >>>> anytime soon. >>>> >>>> To allow running Windows on QEMU/libvirt or with heterogeneous cores, >>>> combine all host PMUs necessary to cover the cores vCPUs can run and >>>> keep PMCCNTR_EL0 working. >>>> I'm extremely uneasy about making this a generalized solution. PMUs are >>> deeply tied to the microarchitecture of a particular implementation, and >>> that isn't something we can abstract away from the guest in KVM. >>> >>> For example, you could have an event ID that counts on only a subset of >>> cores, or better yet an event that counts something completely different >>> depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve. >>> >>> The good news though is that the fixed PMU cycle counter is the only >>> thing guaranteed to be present in any PMUv3 implementation. Since >>> that's the only counter Windows actually needs, perhaps we could >>> special-case this in KVM. >>> >>> I have the following (completely untested) patch, do you want to give it >>> a try? There's still going to be observable differences between PMUs >>> (e.g. CPU frequency) but at least it should get things booting. >> >> I don't think it will work, unfortunately. perf_init_event() binds a >> perf event to a particular PMU so the event will stop working when the >> thread migrates away. >> >> It should also be the reason why the perf program creates an event for >> each PMU. tools/perf/Documentation/intel-hybrid.txt has more >> descriptions. > > But perf on non-Intel behaves pretty differently. ARM PMUs behaves > pretty differently, because there is no guarantee of homogeneous > events. It works in the same manner in this particular aspect (i.e., "perf stat -e cycles -a" creates events for all PMUs). > >> >> Allowing to enable more than one counter and/or an event type other >> than the cycle counter is not the goal. 
Enabling another event type >> may result in a garbage value, but I don't think it's worse than the >> current situation where the count stays zero; please tell me if I miss >> something. >> >> There is still room for improvement. Returning a garbage value may not >> be worse than returning zero, but counters and event types not >> supported by some cores shouldn't be advertised as available in the >> first place. More concretely: >> >> - The vCPU should be limited to run only on cores covered by PMUs when >> KVM_ARM_VCPU_PMU_V3 is set. > > That's userspace's job. Bind to the desired PMU, and run. KVM will > actively prevent you from running on the wrong CPU. > >> - PMCR_EL0.N advertised to the guest should be the minimum of ones of >> host PMUs. > > How do you find out? CPUs can be hot-plugged on long after a VM has > started, bringing in a new PMU, with a different number of counters. > >> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the >> result of the AND operations of ones of host PMUs. > > Same problem. I guess special-casing the cycle counter is the only option if the kernel is going to deal with this. > >> >> Special-casing the cycle counter may make sense if we are going to fix >> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and >> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these >> registers. We can also prevent enabling a counter that returns zero or >> a garbage value. >> >> Do you think it's worth fixing these registers? If so, I'll do that by >> special-casing the cycle counter. > > I think this is really going in the wrong direction. > > The whole design of the PMU emulation is that we expose a single, > architecturally correct PMU implementation. This is clearly > documented. > > Furthermore, userspace is being given all the relevant information to > place vcpus on the correct physical CPUs. Why should we add this sort > of hack in the kernel, creating a new userspace ABI that we will have > to support forever, when usespace can do the correct thing right now? > > Worse case, this is just a 'taskset' away, and everything will work. It's surprisingly difficult to do that with libvirt; of course it is a userspace problem though. > > Frankly, I'm not prepared to add more hacks to KVM for the sake of the > combination of broken userspace and broken guest. The only counter argument I have in this regard is that some change is also needed to expose all CPUs to Windows guest even when the userspace does its best. It may result in odd scheduling, but still gives the best throughput. Regards, Akihiko Odaki > > M. >
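The perf behaviour referred to above ("perf stat -e cycles -a" creating one event per PMU) can be approximated from userspace as below. The "armv8" name prefix and the raw event number 0x11 (the ARMv8 PMUv3 CPU_CYCLES event) are assumptions about a typical arm64 machine; perf performs this enumeration internally.

/* Sketch: one "cycles" event per CPU PMU, minimal error handling. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	DIR *dir = opendir("/sys/bus/event_source/devices");
	struct dirent *d;

	if (!dir)
		return 1;

	while ((d = readdir(dir))) {
		char path[512];
		FILE *f;
		int type, fd;

		if (strncmp(d->d_name, "armv8", 5))	/* assumed CPU PMU prefix */
			continue;

		snprintf(path, sizeof(path),
			 "/sys/bus/event_source/devices/%s/type", d->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%d", &type) != 1) {
			fclose(f);
			continue;
		}
		fclose(f);

		struct perf_event_attr attr = {
			.size = sizeof(attr),
			.type = type,	/* bind to this PMU's dynamic type */
			.config = 0x11,	/* CPU_CYCLES on ARMv8 PMUv3 */
		};

		/* Task-bound event per PMU: each counts only while the task
		 * runs on a CPU served by that PMU. */
		fd = perf_event_open(&attr, 0, -1, -1, 0);
		printf("%s: fd=%d\n", d->d_name, fd);
	}

	closedir(dir);
	return 0;
}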
On Wed, 19 Mar 2025 10:26:57 +0000, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > >> It should also be the reason why the perf program creates an event for > >> each PMU. tools/perf/Documentation/intel-hybrid.txt has more > >> descriptions. > > > > But perf on non-Intel behaves pretty differently. ARM PMUs behaves > > pretty differently, because there is no guarantee of homogeneous > > events. > > It works in the same manner in this particular aspect (i.e., "perf > stat -e cycles -a" creates events for all PMUs). But it then becomes a system-wide counter, and that's not what KVM needs to do. > >> Allowing to enable more than one counter and/or an event type other > >> than the cycle counter is not the goal. Enabling another event type > >> may result in a garbage value, but I don't think it's worse than the > >> current situation where the count stays zero; please tell me if I miss > >> something. > >> > >> There is still room for improvement. Returning a garbage value may not > >> be worse than returning zero, but counters and event types not > >> supported by some cores shouldn't be advertised as available in the > >> first place. More concretely: > >> > >> - The vCPU should be limited to run only on cores covered by PMUs when > >> KVM_ARM_VCPU_PMU_V3 is set. > > > > That's userspace's job. Bind to the desired PMU, and run. KVM will > > actively prevent you from running on the wrong CPU. > > > >> - PMCR_EL0.N advertised to the guest should be the minimum of ones of > >> host PMUs. > > > > How do you find out? CPUs can be hot-plugged on long after a VM has > > started, bringing in a new PMU, with a different number of counters. > > > >> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the > >> result of the AND operations of ones of host PMUs. > > > > Same problem. > > I guess special-casing the cycle counter is the only option if the > kernel is going to deal with this. Indeed. I think Oliver's idea is the least bad of them all, but man, this is really ugly. > >> Special-casing the cycle counter may make sense if we are going to fix > >> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and > >> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these > >> registers. We can also prevent enabling a counter that returns zero or > >> a garbage value. > >> > >> Do you think it's worth fixing these registers? If so, I'll do that by > >> special-casing the cycle counter. > > > > I think this is really going in the wrong direction. > > > > The whole design of the PMU emulation is that we expose a single, > > architecturally correct PMU implementation. This is clearly > > documented. > > > > Furthermore, userspace is being given all the relevant information to > > place vcpus on the correct physical CPUs. Why should we add this sort > > of hack in the kernel, creating a new userspace ABI that we will have > > to support forever, when usespace can do the correct thing right now? > > > > Worse case, this is just a 'taskset' away, and everything will work. > > It's surprisingly difficult to do that with libvirt; of course it is a > userspace problem though. Sorry, I must admit I'm completely ignorant of libvirt. I tried it years ago, and concluded that 95% of what I needed was adequately done with a shell script... > > Frankly, I'm not prepared to add more hacks to KVM for the sake of the > > combination of broken userspace and broken guest. 
> > The only counter argument I have in this regard is that some change is > also needed to expose all CPUs to Windows guest even when the > userspace does its best. It may result in odd scheduling, but still > gives the best throughput. But that'd be a new ABI, which again would require buy-in from userspace. Maybe there is scope for an all CPUs, cycle-counter only PMUv3 exposed to the guest, but that cannot be set automatically, as we would otherwise regress existing setups. At this stage, and given that you need to change userspace, I'm not sure what the best course of action is. Thanks, M.
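The distinction Marc draws maps onto the pid/cpu arguments of perf_event_open(): "perf stat -a" opens system-wide, per-CPU counters, while KVM needs counters that follow a single task (the vCPU thread). A minimal sketch of the two call shapes, for illustration only:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* pid == -1, cpu >= 0: counts every task that runs on the given CPU
 * (what "perf stat -a" builds, one per CPU and per PMU). */
static long open_system_wide(struct perf_event_attr *attr, int cpu)
{
	return sys_perf_event_open(attr, -1, cpu, -1, 0);
}

/* pid == 0, cpu == -1: counts only the calling task, wherever it runs
 * (what KVM's PMU emulation needs for a vCPU thread). */
static long open_task_bound(struct perf_event_attr *attr)
{
	return sys_perf_event_open(attr, 0, -1, -1, 0);
}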
On 2025/03/19 20:07, Marc Zyngier wrote: > On Wed, 19 Mar 2025 10:26:57 +0000, > Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >>>> It should also be the reason why the perf program creates an event for >>>> each PMU. tools/perf/Documentation/intel-hybrid.txt has more >>>> descriptions. >>> >>> But perf on non-Intel behaves pretty differently. ARM PMUs behaves >>> pretty differently, because there is no guarantee of homogeneous >>> events. >> >> It works in the same manner in this particular aspect (i.e., "perf >> stat -e cycles -a" creates events for all PMUs). > > But it then becomes a system-wide counter, and that's not what KVM > needs to do. There is also an example of program profiling: "perf stat -e cycles \-- taskset -c 16 ./triad_loop" This also creates events for all PMUs. > >>>> Allowing to enable more than one counter and/or an event type other >>>> than the cycle counter is not the goal. Enabling another event type >>>> may result in a garbage value, but I don't think it's worse than the >>>> current situation where the count stays zero; please tell me if I miss >>>> something. >>>> >>>> There is still room for improvement. Returning a garbage value may not >>>> be worse than returning zero, but counters and event types not >>>> supported by some cores shouldn't be advertised as available in the >>>> first place. More concretely: >>>> >>>> - The vCPU should be limited to run only on cores covered by PMUs when >>>> KVM_ARM_VCPU_PMU_V3 is set. >>> >>> That's userspace's job. Bind to the desired PMU, and run. KVM will >>> actively prevent you from running on the wrong CPU. >>> >>>> - PMCR_EL0.N advertised to the guest should be the minimum of ones of >>>> host PMUs. >>> >>> How do you find out? CPUs can be hot-plugged on long after a VM has >>> started, bringing in a new PMU, with a different number of counters. >>> >>>> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the >>>> result of the AND operations of ones of host PMUs. >>> >>> Same problem. >> >> I guess special-casing the cycle counter is the only option if the >> kernel is going to deal with this. > > Indeed. I think Oliver's idea is the least bad of them all, but man, > this is really ugly. > >>>> Special-casing the cycle counter may make sense if we are going to fix >>>> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and >>>> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these >>>> registers. We can also prevent enabling a counter that returns zero or >>>> a garbage value. >>>> >>>> Do you think it's worth fixing these registers? If so, I'll do that by >>>> special-casing the cycle counter. >>> >>> I think this is really going in the wrong direction. >>> >>> The whole design of the PMU emulation is that we expose a single, >>> architecturally correct PMU implementation. This is clearly >>> documented. >>> >>> Furthermore, userspace is being given all the relevant information to >>> place vcpus on the correct physical CPUs. Why should we add this sort >>> of hack in the kernel, creating a new userspace ABI that we will have >>> to support forever, when usespace can do the correct thing right now? >>> >>> Worse case, this is just a 'taskset' away, and everything will work. >> >> It's surprisingly difficult to do that with libvirt; of course it is a >> userspace problem though. > > Sorry, I must admit I'm completely ignorant of libvirt. I tried it > years ago, and concluded that 95% of what I needed was adequately done > with a shell script... 
> >>> Frankly, I'm not prepared to add more hacks to KVM for the sake of the >>> combination of broken userspace and broken guest. >> >> The only counter argument I have in this regard is that some change is >> also needed to expose all CPUs to Windows guest even when the >> userspace does its best. It may result in odd scheduling, but still >> gives the best throughput. > > But that'd be a new ABI, which again would require buy-in from > userspace. Maybe there is scope for an all CPUs, cycle-counter only > PMUv3 exposed to the guest, but that cannot be set automatically, as > we would otherwise regress existing setups. > > At this stage, and given that you need to change userspace, I'm not > sure what the best course of action is. Having an explicit flag for the userspace is fine for QEMU, which I care. It can flip the flag if and only if threads are not pinned to one PMU and the machine is a new setup. I also wonder what regression you think setting it automatically causes. Regards, Akihiko Odaki > > Thanks, > > M. >
On Wed, 19 Mar 2025 11:26:18 +0000,
Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/03/19 20:07, Marc Zyngier wrote:
> > On Wed, 19 Mar 2025 10:26:57 +0000,
> >>
> > But that'd be a new ABI, which again would require buy-in from
> > userspace. Maybe there is scope for an all CPUs, cycle-counter only
> > PMUv3 exposed to the guest, but that cannot be set automatically, as
> > we would otherwise regress existing setups.
> >
> > At this stage, and given that you need to change userspace, I'm not
> > sure what the best course of action is.
>
> Having an explicit flag for the userspace is fine for QEMU, which I
> care. It can flip the flag if and only if threads are not pinned to
> one PMU and the machine is a new setup.
>
> I also wonder what regression you think setting it automatically causes.

The current behaviour is that if you don't specify anything other than
creating a PMUv3 (without KVM_ARM_VCPU_PMU_V3_SET_PMU), you get *some*
PMU, and userspace is responsible for running the vcpu on CPUs that
will implement that PMU. When it does, all the counters, all the
events are valid. If it doesn't, nothing counts, but the
counters/events are still valid.

If you now add this flag automatically, the guest doesn't see the full
PMU anymore. Only the cycle counter. That's the regression.

Thanks,

M.
On 2025/03/19 20:41, Marc Zyngier wrote:
> On Wed, 19 Mar 2025 11:26:18 +0000,
> Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/03/19 20:07, Marc Zyngier wrote:
>>> On Wed, 19 Mar 2025 10:26:57 +0000,
>>>>
>>> But that'd be a new ABI, which again would require buy-in from
>>> userspace. Maybe there is scope for an all CPUs, cycle-counter only
>>> PMUv3 exposed to the guest, but that cannot be set automatically, as
>>> we would otherwise regress existing setups.
>>>
>>> At this stage, and given that you need to change userspace, I'm not
>>> sure what the best course of action is.
>>
>> Having an explicit flag for the userspace is fine for QEMU, which I
>> care. It can flip the flag if and only if threads are not pinned to
>> one PMU and the machine is a new setup.
>>
>> I also wonder what regression you think setting it automatically causes.
>
> The current behaviour is that if you don't specify anything other than
> creating a PMUv3 (without KVM_ARM_VCPU_PMU_V3_SET_PMU), you get *some*
> PMU, and userspace is responsible for running the vcpu on CPUs that
> will implement that PMU. When it does, all the counters, all the
> events are valid. If it doesn't, nothing counts, but the
> counters/events are still valid.
>
> If you now add this flag automatically, the guest doesn't see the full
> PMU anymore. Only the cycle counter. That's the regression.

What about setting the flag automatically when a user fails to pin vCPUs
to CPUs that are covered by one PMU? There would be no change from the
current behaviour if a user correctly pins the vCPUs. Otherwise, they
will see a correct feature set advertised to the guest and the cycle
counter working.

Regards,
Akihiko Odaki

>
> Thanks,
>
> M.
>
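The condition being proposed here is essentially the test the patch below already applies when creating events; the kernel-style sketch only makes it explicit (illustration, not a tested change).

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/perf/arm_pmu.h>

/*
 * If the vCPU thread can run only on CPUs served by the chosen PMU,
 * keep today's behaviour; otherwise fall back to the degraded
 * (multi-PMU, or cycle-counter-only) handling. This mirrors the
 * cpumask_subset() check in kvm_pmu_create_pmc() in the patch below.
 */
static bool vcpu_thread_fits_one_pmu(struct arm_pmu *arm_pmu)
{
	return cpumask_subset(current->cpus_ptr, &arm_pmu->supported_cpus);
}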
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index caa1357fa367..06292a4ccdc8 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -449,8 +449,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) /* Set up the timer */ kvm_timer_vcpu_init(vcpu); - kvm_pmu_vcpu_init(vcpu); - kvm_arm_pvtime_vcpu_init(&vcpu->arch); vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu; diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index aae5713d8993..631752f5327f 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -22,16 +22,16 @@ DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available); static LIST_HEAD(arm_pmus); static DEFINE_MUTEX(arm_pmus_lock); -static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc); -static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc); -static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc); +static void kvm_pmu_create_pmc(struct kvm_vcpu *vcpu, u8 idx); +static void kvm_pmu_release_pmc(struct kvm_vcpu *vcpu, u8 idx); +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u8 idx); -static struct kvm_vcpu *kvm_pmc_to_vcpu(const struct kvm_pmc *pmc) +static struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc * const *pmc) { - return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[pmc->idx]); + return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[(*pmc)->idx]); } -static struct kvm_pmc *kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx) +static struct kvm_pmc **kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx) { return &vcpu->arch.pmu.pmc[cnt_idx]; } @@ -80,30 +80,27 @@ u64 kvm_pmu_evtyper_mask(struct kvm *kvm) * kvm_pmc_is_64bit - determine if counter is 64bit * @pmc: counter context */ -static bool kvm_pmc_is_64bit(struct kvm_pmc *pmc) +static bool kvm_pmc_is_64bit(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - - return (pmc->idx == ARMV8_PMU_CYCLE_IDX || + return (idx == ARMV8_PMU_CYCLE_IDX || kvm_has_feat(vcpu->kvm, ID_AA64DFR0_EL1, PMUVer, V3P5)); } -static bool kvm_pmc_has_64bit_overflow(struct kvm_pmc *pmc) +static bool kvm_pmc_has_64bit_overflow(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 val = kvm_vcpu_read_pmcr(vcpu); - if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) + if (kvm_pmu_counter_is_hyp(vcpu, idx)) return __vcpu_sys_reg(vcpu, MDCR_EL2) & MDCR_EL2_HLP; - return (pmc->idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || - (pmc->idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); + return (idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || + (idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); } -static bool kvm_pmu_counter_can_chain(struct kvm_pmc *pmc) +static bool kvm_pmu_counter_can_chain(struct kvm_vcpu *vcpu, u8 idx) { - return (!(pmc->idx & 1) && (pmc->idx + 1) < ARMV8_PMU_CYCLE_IDX && - !kvm_pmc_has_64bit_overflow(pmc)); + return (!(idx & 1) && (idx + 1) < ARMV8_PMU_CYCLE_IDX && + !kvm_pmc_has_64bit_overflow(vcpu, idx)); } static u32 counter_index_to_reg(u64 idx) @@ -116,28 +113,30 @@ static u32 counter_index_to_evtreg(u64 idx) return (idx == ARMV8_PMU_CYCLE_IDX) ? 
PMCCFILTR_EL0 : PMEVTYPER0_EL0 + idx; } -static u64 kvm_pmc_read_evtreg(const struct kvm_pmc *pmc) +static u64 kvm_pmc_read_evtreg(const struct kvm_vcpu *vcpu, u8 idx) { - return __vcpu_sys_reg(kvm_pmc_to_vcpu(pmc), counter_index_to_evtreg(pmc->idx)); + return __vcpu_sys_reg(vcpu, counter_index_to_evtreg(idx)); } -static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) +static u64 kvm_pmu_get_pmc_value(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); u64 counter, reg, enabled, running; + unsigned int i; - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); counter = __vcpu_sys_reg(vcpu, reg); /* * The real counter value is equal to the value of counter register plus * the value perf event counts. */ - if (pmc->perf_event) - counter += perf_event_read_value(pmc->perf_event, &enabled, - &running); + if (pmc) + for (i = 0; i < pmc->nr_perf_events; i++) + counter += perf_event_read_value(pmc->perf_events[i], + &enabled, &running); - if (!kvm_pmc_is_64bit(pmc)) + if (!kvm_pmc_is_64bit(vcpu, idx)) counter = lower_32_bits(counter); return counter; @@ -150,20 +149,18 @@ static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) */ u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx) { - return kvm_pmu_get_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx)); + return kvm_pmu_get_pmc_value(vcpu, select_idx); } -static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) +static void kvm_pmu_set_pmc_value(struct kvm_vcpu *vcpu, u8 idx, u64 val, bool force) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 reg; - kvm_pmu_release_perf_event(pmc); + kvm_pmu_release_pmc(vcpu, idx); - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); - if (vcpu_mode_is_32bit(vcpu) && pmc->idx != ARMV8_PMU_CYCLE_IDX && - !force) { + if (vcpu_mode_is_32bit(vcpu) && idx != ARMV8_PMU_CYCLE_IDX && !force) { /* * Even with PMUv3p5, AArch32 cannot write to the top * 32bit of the counters. 
The only possible course of @@ -176,8 +173,8 @@ static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) __vcpu_sys_reg(vcpu, reg) = val; - /* Recreate the perf event to reflect the updated sample_period */ - kvm_pmu_create_perf_event(pmc); + /* Recreate the pmc to reflect the updated sample_period */ + kvm_pmu_create_pmc(vcpu, idx); } /** @@ -188,7 +185,7 @@ static void kvm_pmu_set_pmc_value(struct kvm_pmc *pmc, u64 val, bool force) */ void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) { - kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, select_idx), val, false); + kvm_pmu_set_pmc_value(vcpu, select_idx, val, false); } /** @@ -199,21 +196,28 @@ void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) */ void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val) { - kvm_pmu_release_perf_event(kvm_vcpu_idx_to_pmc(vcpu, select_idx)); + kvm_pmu_release_pmc(vcpu, select_idx); __vcpu_sys_reg(vcpu, counter_index_to_reg(select_idx)) = val; kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu); } /** - * kvm_pmu_release_perf_event - remove the perf event + * kvm_pmu_release_pmc - remove the pmc * @pmc: The PMU counter pointer */ -static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) +static void kvm_pmu_release_pmc(struct kvm_vcpu *vcpu, u8 idx) { - if (pmc->perf_event) { - perf_event_disable(pmc->perf_event); - perf_event_release_kernel(pmc->perf_event); - pmc->perf_event = NULL; + struct kvm_pmc **pmc = kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (*pmc) { + for (i = 0; i < (*pmc)->nr_perf_events; i++) { + perf_event_disable((*pmc)->perf_events[i]); + perf_event_release_kernel((*pmc)->perf_events[i]); + } + + kfree(*pmc); + *pmc = NULL; } } @@ -223,35 +227,17 @@ static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) * * If this counter has been configured to monitor some event, release it here. 
*/ -static void kvm_pmu_stop_counter(struct kvm_pmc *pmc) +static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 reg, val; - if (!pmc->perf_event) - return; - - val = kvm_pmu_get_pmc_value(pmc); + val = kvm_pmu_get_pmc_value(vcpu, idx); - reg = counter_index_to_reg(pmc->idx); + reg = counter_index_to_reg(idx); __vcpu_sys_reg(vcpu, reg) = val; - kvm_pmu_release_perf_event(pmc); -} - -/** - * kvm_pmu_vcpu_init - assign pmu counter idx for cpu - * @vcpu: The vcpu pointer - * - */ -void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) -{ - int i; - struct kvm_pmu *pmu = &vcpu->arch.pmu; - - for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) - pmu->pmc[i].idx = i; + kvm_pmu_release_pmc(vcpu, idx); } /** @@ -264,7 +250,7 @@ void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) int i; for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) - kvm_pmu_release_perf_event(kvm_vcpu_idx_to_pmc(vcpu, i)); + kvm_pmu_release_pmc(vcpu, i); irq_work_sync(&vcpu->arch.pmu.overflow_work); } @@ -321,22 +307,31 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) return GENMASK(val - 1, 0) | BIT(ARMV8_PMU_CYCLE_IDX); } -static void kvm_pmc_enable_perf_event(struct kvm_pmc *pmc) +static void kvm_pmc_enable_perf_event(struct kvm_vcpu *vcpu, u8 idx) { - if (!pmc->perf_event) { - kvm_pmu_create_perf_event(pmc); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (!pmc) { + kvm_pmu_create_pmc(vcpu, idx); return; } - perf_event_enable(pmc->perf_event); - if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE) - kvm_debug("fail to enable perf event\n"); + for (i = 0; i < pmc->nr_perf_events; i++) { + perf_event_enable(pmc->perf_events[i]); + if (pmc->perf_events[i]->state != PERF_EVENT_STATE_ACTIVE) + kvm_debug("fail to enable perf event\n"); + } } -static void kvm_pmc_disable_perf_event(struct kvm_pmc *pmc) +static void kvm_pmc_disable_perf_event(struct kvm_vcpu *vcpu, u8 idx) { - if (pmc->perf_event) - perf_event_disable(pmc->perf_event); + struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx); + unsigned int i; + + if (pmc) + for (i = 0; i < pmc->nr_perf_events; i++) + perf_event_disable(pmc->perf_events[i]); } void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) @@ -347,15 +342,13 @@ void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) return; for (i = 0; i < KVM_ARMV8_PMU_MAX_COUNTERS; i++) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); - if (!(val & BIT(i))) continue; - if (kvm_pmu_counter_is_enabled(pmc)) - kvm_pmc_enable_perf_event(pmc); + if (kvm_pmu_counter_is_enabled(vcpu, i)) + kvm_pmc_enable_perf_event(vcpu, i); else - kvm_pmc_disable_perf_event(pmc); + kvm_pmc_disable_perf_event(vcpu, i); } kvm_vcpu_pmu_restore_guest(vcpu); @@ -486,7 +479,6 @@ static void kvm_pmu_counter_increment(struct kvm_vcpu *vcpu, mask &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); for_each_set_bit(i, &mask, ARMV8_PMU_CYCLE_IDX) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); u64 type, reg; /* Filter on event type */ @@ -497,29 +489,30 @@ static void kvm_pmu_counter_increment(struct kvm_vcpu *vcpu, /* Increment this counter */ reg = __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) + 1; - if (!kvm_pmc_is_64bit(pmc)) + if (!kvm_pmc_is_64bit(vcpu, i)) reg = lower_32_bits(reg); __vcpu_sys_reg(vcpu, counter_index_to_reg(i)) = reg; /* No overflow? move on */ - if (kvm_pmc_has_64bit_overflow(pmc) ? reg : lower_32_bits(reg)) + if (kvm_pmc_has_64bit_overflow(vcpu, i) ? 
reg : lower_32_bits(reg)) continue; /* Mark overflow */ __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i); - if (kvm_pmu_counter_can_chain(pmc)) + if (kvm_pmu_counter_can_chain(vcpu, i)) kvm_pmu_counter_increment(vcpu, BIT(i + 1), ARMV8_PMUV3_PERFCTR_CHAIN); } } /* Compute the sample period for a given counter value */ -static u64 compute_period(struct kvm_pmc *pmc, u64 counter) +static u64 compute_period(struct kvm_pmc **pmc, u64 counter) { + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 val; - if (kvm_pmc_is_64bit(pmc) && kvm_pmc_has_64bit_overflow(pmc)) + if (kvm_pmc_is_64bit(vcpu, (*pmc)->idx) && kvm_pmc_has_64bit_overflow(vcpu, (*pmc)->idx)) val = (-counter) & GENMASK(63, 0); else val = (-counter) & GENMASK(31, 0); @@ -534,10 +527,10 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event, struct perf_sample_data *data, struct pt_regs *regs) { - struct kvm_pmc *pmc = perf_event->overflow_handler_context; + struct kvm_pmc **pmc = perf_event->overflow_handler_context; struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu); struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - int idx = pmc->idx; + int idx = (*pmc)->idx; u64 period; cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE); @@ -554,7 +547,7 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event, __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(idx); - if (kvm_pmu_counter_can_chain(pmc)) + if (kvm_pmu_counter_can_chain(vcpu, idx)) kvm_pmu_counter_increment(vcpu, BIT(idx + 1), ARMV8_PMUV3_PERFCTR_CHAIN); @@ -613,95 +606,74 @@ void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val) ~BIT(ARMV8_PMU_CYCLE_IDX); for_each_set_bit(i, &mask, 32) - kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, i), 0, true); + kvm_pmu_set_pmc_value(vcpu, i, 0, true); } } -static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc) +static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); unsigned int mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); - if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx))) + if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(idx))) return false; - if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) + if (kvm_pmu_counter_is_hyp(vcpu, idx)) return mdcr & MDCR_EL2_HPME; return kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E; } -static bool kvm_pmc_counts_at_el0(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el0(struct kvm_vcpu *vcpu, u8 idx) { - u64 evtreg = kvm_pmc_read_evtreg(pmc); + u64 evtreg = kvm_pmc_read_evtreg(vcpu, idx); bool nsu = evtreg & ARMV8_PMU_EXCLUDE_NS_EL0; bool u = evtreg & ARMV8_PMU_EXCLUDE_EL0; return u == nsu; } -static bool kvm_pmc_counts_at_el1(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el1(struct kvm_vcpu *vcpu, u8 idx) { - u64 evtreg = kvm_pmc_read_evtreg(pmc); + u64 evtreg = kvm_pmc_read_evtreg(vcpu, idx); bool nsk = evtreg & ARMV8_PMU_EXCLUDE_NS_EL1; bool p = evtreg & ARMV8_PMU_EXCLUDE_EL1; return p == nsk; } -static bool kvm_pmc_counts_at_el2(struct kvm_pmc *pmc) +static bool kvm_pmc_counts_at_el2(struct kvm_vcpu *vcpu, u8 idx) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); u64 mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); - if (!kvm_pmu_counter_is_hyp(vcpu, pmc->idx) && (mdcr & MDCR_EL2_HPMD)) + if (!kvm_pmu_counter_is_hyp(vcpu, idx) && (mdcr & MDCR_EL2_HPMD)) return false; - return kvm_pmc_read_evtreg(pmc) & ARMV8_PMU_INCLUDE_EL2; + return kvm_pmc_read_evtreg(vcpu, idx) & ARMV8_PMU_INCLUDE_EL2; } -/** - * kvm_pmu_create_perf_event - create a perf event for a counter - * @pmc: Counter context - */ -static void 
kvm_pmu_create_perf_event(struct kvm_pmc *pmc) +static struct kvm_pmc *kvm_pmu_alloc_pmc(u8 idx, unsigned int capacity) { - struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); - struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; - struct perf_event *event; - struct perf_event_attr attr; - u64 eventsel, evtreg; + struct kvm_pmc *pmc = kzalloc(struct_size(pmc, perf_events, capacity), GFP_KERNEL_ACCOUNT); - evtreg = kvm_pmc_read_evtreg(pmc); - - kvm_pmu_stop_counter(pmc); - if (pmc->idx == ARMV8_PMU_CYCLE_IDX) - eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; - else - eventsel = evtreg & kvm_pmu_event_mask(vcpu->kvm); + if (pmc) + pmc->idx = idx; - /* - * Neither SW increment nor chained events need to be backed - * by a perf event. - */ - if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR || - eventsel == ARMV8_PMUV3_PERFCTR_CHAIN) - return; + return pmc; +} - /* - * If we have a filter in place and that the event isn't allowed, do - * not install a perf event either. - */ - if (vcpu->kvm->arch.pmu_filter && - !test_bit(eventsel, vcpu->kvm->arch.pmu_filter)) - return; +static void kvm_pmu_create_perf_event(struct kvm_pmc **pmc, + struct arm_pmu *arm_pmu, u64 eventsel) +{ + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); + struct perf_event *event; + struct perf_event_attr attr; memset(&attr, 0, sizeof(struct perf_event_attr)); attr.type = arm_pmu->pmu.type; attr.size = sizeof(attr); attr.pinned = 1; - attr.disabled = !kvm_pmu_counter_is_enabled(pmc); - attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); + attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, (*pmc)->idx); + attr.exclude_user = !kvm_pmc_counts_at_el0(vcpu, (*pmc)->idx); attr.exclude_hv = 1; /* Don't count EL2 events */ attr.exclude_host = 1; /* Don't count host events */ attr.config = eventsel; @@ -711,19 +683,19 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) * guest's EL2 filter. */ if (unlikely(is_hyp_ctxt(vcpu))) - attr.exclude_kernel = !kvm_pmc_counts_at_el2(pmc); + attr.exclude_kernel = !kvm_pmc_counts_at_el2(vcpu, (*pmc)->idx); else - attr.exclude_kernel = !kvm_pmc_counts_at_el1(pmc); + attr.exclude_kernel = !kvm_pmc_counts_at_el1(vcpu, (*pmc)->idx); /* * If counting with a 64bit counter, advertise it to the perf * code, carefully dealing with the initial sample period * which also depends on the overflow. 
*/ - if (kvm_pmc_is_64bit(pmc)) + if (kvm_pmc_is_64bit(vcpu, (*pmc)->idx)) attr.config1 |= PERF_ATTR_CFG1_COUNTER_64BIT; - attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc)); + attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx)); event = perf_event_create_kernel_counter(&attr, -1, current, kvm_pmu_perf_overflow, pmc); @@ -731,10 +703,93 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) if (IS_ERR(event)) { pr_err_once("kvm: pmu event creation failed %ld\n", PTR_ERR(event)); + } + + (*pmc)->perf_events[(*pmc)->nr_perf_events] = event; + (*pmc)->nr_perf_events++; +} + +/** + * kvm_pmu_create_pmc - create a pmc + * @pmc: Counter context + */ +static void kvm_pmu_create_pmc(struct kvm_vcpu *vcpu, u8 idx) +{ + struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; + struct arm_pmu_entry *entry; + struct kvm_pmc **pmc = kvm_vcpu_idx_to_pmc(vcpu, idx); + u64 eventsel, evtreg; + cpumask_t cpus; + size_t n; + + + evtreg = kvm_pmc_read_evtreg(vcpu, idx); + + kvm_pmu_stop_counter(vcpu, idx); + if (idx == ARMV8_PMU_CYCLE_IDX) + eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; + else + eventsel = evtreg & kvm_pmu_event_mask(vcpu->kvm); + + /* + * Neither SW increment nor chained events need to be backed + * by a perf event. + */ + if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR || + eventsel == ARMV8_PMUV3_PERFCTR_CHAIN) return; + + /* + * If we have a filter in place and that the event isn't allowed, do + * not install a perf event either. + */ + if (vcpu->kvm->arch.pmu_filter && + !test_bit(eventsel, vcpu->kvm->arch.pmu_filter)) + return; + + if (cpumask_subset(current->cpus_ptr, &arm_pmu->supported_cpus)) { + *pmc = kvm_pmu_alloc_pmc(idx, 1); + if (!*pmc) + goto err; + + kvm_pmu_create_perf_event(pmc, arm_pmu, eventsel); + } else { + guard(mutex)(&arm_pmus_lock); + + n = 0; + cpumask_clear(&cpus); + list_for_each_entry(entry, &arm_pmus, entry) { + arm_pmu = entry->arm_pmu; + + if (!cpumask_intersects(current->cpus_ptr, &arm_pmu->supported_cpus) || + cpumask_intersects(&cpus, &arm_pmu->supported_cpus)) + continue; + + cpumask_or(&cpus, &cpus, &arm_pmu->supported_cpus); + n++; + } + + *pmc = kvm_pmu_alloc_pmc(idx, n); + if (!*pmc) + goto err; + + cpumask_clear(&cpus); + list_for_each_entry(entry, &arm_pmus, entry) { + arm_pmu = entry->arm_pmu; + + if (!cpumask_intersects(current->cpus_ptr, &arm_pmu->supported_cpus) || + cpumask_intersects(&cpus, &arm_pmu->supported_cpus)) + continue; + + cpumask_or(&cpus, &cpus, &arm_pmu->supported_cpus); + kvm_pmu_create_perf_event(pmc, arm_pmu, eventsel); + } } - pmc->perf_event = event; + return; + +err: + pr_err_once("kvm: pmc allocation failed\n"); } /** @@ -750,13 +805,12 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc) void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, u64 select_idx) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, select_idx); u64 reg; - reg = counter_index_to_evtreg(pmc->idx); + reg = counter_index_to_evtreg(select_idx); __vcpu_sys_reg(vcpu, reg) = data & kvm_pmu_evtyper_mask(vcpu->kvm); - kvm_pmu_create_perf_event(pmc); + kvm_pmu_create_pmc(vcpu, select_idx); } void kvm_host_pmu_init(struct arm_pmu *pmu) @@ -994,7 +1048,8 @@ static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu) * subset of cores in the system. This is intentional, and * upholds the preexisting behavior on heterogeneous systems * where vCPUs can be scheduled on any core but the guest - * counters could stop working. 
+ * may see features not supported on some cores being advertised + * via PMCR_EL0, PMCEID0_EL0, and PMCEID1_EL0. */ int kvm_arm_set_default_pmu(struct kvm *kvm) { @@ -1208,17 +1263,15 @@ void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) mask = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); for_each_set_bit(i, &mask, 32) { - struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); - /* * We only need to reconfigure events where the filter is * different at EL1 vs. EL2, as we're multiplexing the true EL1 * event filter bit for nested. */ - if (kvm_pmc_counts_at_el1(pmc) == kvm_pmc_counts_at_el2(pmc)) + if (kvm_pmc_counts_at_el1(vcpu, i) == kvm_pmc_counts_at_el2(vcpu, i)) continue; - kvm_pmu_create_perf_event(pmc); + kvm_pmu_create_pmc(vcpu, i); reprogrammed = true; } diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h index d6ad13925978..79ab5ff203eb 100644 --- a/include/kvm/arm_pmu.h +++ b/include/kvm/arm_pmu.h @@ -15,7 +15,8 @@ #if IS_ENABLED(CONFIG_HW_PERF_EVENTS) && IS_ENABLED(CONFIG_KVM) struct kvm_pmc { u8 idx; /* index into the pmu->pmc array */ - struct perf_event *perf_event; + unsigned int nr_perf_events; + struct perf_event *perf_events[] __counted_by(nr_perf_events); }; struct kvm_pmu_events { @@ -26,7 +27,7 @@ struct kvm_pmu_events { struct kvm_pmu { struct irq_work overflow_work; struct kvm_pmu_events events; - struct kvm_pmc pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; + struct kvm_pmc *pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; int irq_num; bool created; bool irq_level; @@ -52,7 +53,6 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu); u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1); -void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu); void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu); void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val); void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu); @@ -124,7 +124,6 @@ static inline u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) { return 0; } -static inline void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) {} static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {} static inline void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) {} static inline void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) {}