Message ID | 20240126085444.324918-1-xiong.y.zhang@linux.intel.com (mailing list archive)
---|---
Series | KVM: x86/pmu: Introduce passthrough vPMU
<bikeshed> I think we should call this a mediated PMU, not a passthrough PMU. KVM still emulates the control plane (controls and event selectors), while the data is fully passed through (counters). </bikeshed> On Fri, Jan 26, 2024, Xiong Zhang wrote: > 1. host system wide / QEMU events handling during VM running > At VM-entry, all the host perf events which use host x86 PMU will be > stopped. These events with attr.exclude_guest = 1 will be stopped here > and re-started after vm-exit. These events without attr.exclude_guest=1 > will be in error state, and they cannot recovery into active state even > if the guest stops running. This impacts host perf a lot and request > host system wide perf events have attr.exclude_guest=1. > > This requests QEMU Process's perf event with attr.exclude_guest=1 also. > > During VM running, perf event creation for system wide and QEMU > process without attr.exclude_guest=1 fail with -EBUSY. > > 2. NMI watchdog > the perf event for NMI watchdog is a system wide cpu pinned event, it > will be stopped also during vm running, but it doesn't have > attr.exclude_guest=1, we add it in this RFC. But this still means NMI > watchdog loses function during VM running. > > Two candidates exist for replacing perf event of NMI watchdog: > a. Buddy hardlock detector[3] may be not reliable to replace perf event. > b. HPET-based hardlock detector [4] isn't in the upstream kernel. I think the simplest solution is to allow mediated PMU usage if and only if the NMI watchdog is disabled. Then whether or not the host replaces the NMI watchdog with something else becomes an orthogonal discussion, i.e. not KVM's problem to solve. > 3. Dedicated kvm_pmi_vector > In emulated vPMU, host PMI handler notify KVM to inject a virtual > PMI into guest when physical PMI belongs to guest counter. If the > same mechanism is used in passthrough vPMU and PMI skid exists > which cause physical PMI belonging to guest happens after VM-exit, > then the host PMI handler couldn't identify this PMI belongs to > host or guest. > So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest > has this vector only. The PMI belonging to host still has an NMI > vector. > > Without considering PMI skid especially for AMD, the host NMI vector > could be used for guest PMI also, this method is simpler and doesn't I don't see how multiplexing NMIs between guest and host is simpler. At best, the complexity is a wash, just in different locations, and I highly doubt it's a wash. AFAIK, there is no way to precisely know that an NMI came in via the LVTPC. E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue. SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX. > need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we > didn't meet the skid PMI issue on modern Intel processors. > > 4. per-VM passthrough mode configuration > Current RFC uses a KVM module enable_passthrough_pmu RO parameter, > it decides vPMU is passthrough mode or emulated mode at kvm module > load time. > Do we need the capability of per-VM passthrough mode configuration? > So an admin can launch some non-passthrough VM and profile these > non-passthrough VMs in host, but admin still cannot profile all > the VMs once passthrough VM existence. This means passthrough vPMU > and emulated vPMU mix on one platform, it has challenges to implement. 
> As the commit message in commit 0011, the main challenge is > passthrough vPMU and emulated vPMU have different vPMU features, this > ends up with two different values for kvm_cap.supported_perf_cap, which > is initialized at module load time. To support it, more refactor is > needed. I have no objection to an all-or-nothing setup. I'd honestly love to rip out the existing vPMU support entirely, but that's probably not realistic, at least not in the near future. > Remain Works > === > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch. Before this gets out of its "RFC" phase, I would at least like line of sight to a more optimized switch. I 100% agree that starting with a conservative implementation is the way to go, and the kernel absolutely needs to be able to profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the guest PMU loaded for the entirety of KVM_RUN isn't a viable option. But I also don't want to get into a situation where we can't figure out a clean, robust way to do the optimized context switch without needing (another) massive rewrite.
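As background for the attr.exclude_guest requirement described in the cover letter above: exclude_guest is a single bit in struct perf_event_attr, and a host-side, system-wide counter opened with that bit set is merely stopped across guest execution rather than forced into the error state. The sketch below is a minimal userspace illustration; the event type, CPU choice, and lack of error handling are purely illustrative.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a system-wide cycles counter on one CPU that never counts in guest
 * mode, i.e. is compatible with a guest exclusively owning the hardware PMU.
 * Requires sufficient privileges (perf_event_paranoid / CAP_PERFMON). */
static int open_host_only_cycles(int cpu)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_guest = 1;        /* do not count while a guest runs */

        /* pid == -1 with cpu >= 0 requests a system-wide event on that CPU. */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}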
On Fri, Jan 26, 2024, Xiong Zhang wrote: > Dapeng Mi (4): > x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU > KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU > KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS > KVM: x86/pmu: Clear PERF_METRICS MSR for guest > > Kan Liang (2): > perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH > perf: Support guest enter/exit interfaces > > Mingwei Zhang (22): > perf: core/x86: Forbid PMI handler when guest own PMU > perf: core/x86: Plumb passthrough PMU capability from x86_pmu to > x86_pmu_cap > KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and > propage to KVM instance > KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs > KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled > KVM: x86/pmu: Allow RDPMC pass through > KVM: x86/pmu: Create a function prototype to disable MSR interception > KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR > interception > KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with > perf capabilities > KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU > KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU > context > KVM: x86/pmu: Introduce function prototype for Intel CPU to > save/restore PMU context > KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid > information leakage > KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU > capability > KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU > KVM: x86/pmu: Make check_pmu_event_filter() an exported function > KVM: x86/pmu: Allow writing to event selector for GP counters if event > is allowed > KVM: x86/pmu: Allow writing to fixed counter selector if counter is > exposed > KVM: x86/pmu: Introduce PMU helper to increment counter > KVM: x86/pmu: Implement emulated counter increment for passthrough PMU > KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from > non-passthrough vPMU > KVM: nVMX: Add nested virtualization support for passthrough PMU > > Xiong Zhang (13): > perf: Set exclude_guest onto nmi_watchdog > perf: core/x86: Add support to register a new vector for PMI handling > KVM: x86/pmu: Register PMI handler for passthrough PMU > perf: x86: Add function to switch PMI handler > perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW > KVM: x86/pmu: Add get virtual LVTPC_MASK bit function > KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL > KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary > KVM: x86/pmu: Switch PMI handler at KVM context switch boundary > KVM: x86/pmu: Call perf_guest_enter() at PMU context switch > KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter > KVM: x86/pmu: Intercept EVENT_SELECT MSR > KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR All done with this pass. Looks quite good, nothing on the KVM side scares me. Nice! I haven't spent much time thinking about whether or not the overall implementation correct/optimal, i.e. I mostly just reviewed the mechanics. I'll make sure to spend a bit more time on that for the next RFC. Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches that will change quite a bit.
Hi Sean, On Thu, Apr 11, 2024 at 4:26 PM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Jan 26, 2024, Xiong Zhang wrote: > > Dapeng Mi (4): > > x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU > > KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU > > KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS > > KVM: x86/pmu: Clear PERF_METRICS MSR for guest > > > > Kan Liang (2): > > perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH > > perf: Support guest enter/exit interfaces > > > > Mingwei Zhang (22): > > perf: core/x86: Forbid PMI handler when guest own PMU > > perf: core/x86: Plumb passthrough PMU capability from x86_pmu to > > x86_pmu_cap > > KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and > > propage to KVM instance > > KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs > > KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled > > KVM: x86/pmu: Allow RDPMC pass through > > KVM: x86/pmu: Create a function prototype to disable MSR interception > > KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR > > interception > > KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with > > perf capabilities > > KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU > > KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU > > context > > KVM: x86/pmu: Introduce function prototype for Intel CPU to > > save/restore PMU context > > KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid > > information leakage > > KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU > > capability > > KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU > > KVM: x86/pmu: Make check_pmu_event_filter() an exported function > > KVM: x86/pmu: Allow writing to event selector for GP counters if event > > is allowed > > KVM: x86/pmu: Allow writing to fixed counter selector if counter is > > exposed > > KVM: x86/pmu: Introduce PMU helper to increment counter > > KVM: x86/pmu: Implement emulated counter increment for passthrough PMU > > KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from > > non-passthrough vPMU > > KVM: nVMX: Add nested virtualization support for passthrough PMU > > > > Xiong Zhang (13): > > perf: Set exclude_guest onto nmi_watchdog > > perf: core/x86: Add support to register a new vector for PMI handling > > KVM: x86/pmu: Register PMI handler for passthrough PMU > > perf: x86: Add function to switch PMI handler > > perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW > > KVM: x86/pmu: Add get virtual LVTPC_MASK bit function > > KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL > > KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary > > KVM: x86/pmu: Switch PMI handler at KVM context switch boundary > > KVM: x86/pmu: Call perf_guest_enter() at PMU context switch > > KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter > > KVM: x86/pmu: Intercept EVENT_SELECT MSR > > KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR > > All done with this pass. Looks quite good, nothing on the KVM side scares me. Nice! yay! Thank you Sean for the review! > > I haven't spent much time thinking about whether or not the overall implementation > correct/optimal, i.e. I mostly just reviewed the mechanics. I'll make sure to > spend a bit more time on that for the next RFC. Yes, I am expecting the debate/discussion in PUCK after v2 is sent out. There should be room for optimization as well. 
> > Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches > that will change quite a bit. Will do the rebase, and all of the feedback will be incorporated into the v2 updates. In v2, we will incorporate passthrough vPMU with AMD support. We will do our best to deliver it in high quality. Thanks. -Mingwei
On 4/12/2024 1:03 AM, Sean Christopherson wrote: > <bikeshed> > > I think we should call this a mediated PMU, not a passthrough PMU. KVM still > emulates the control plane (controls and event selectors), while the data is > fully passed through (counters). > > </bikeshed> > > On Fri, Jan 26, 2024, Xiong Zhang wrote: > >> 1. host system wide / QEMU events handling during VM running >> At VM-entry, all the host perf events which use host x86 PMU will be >> stopped. These events with attr.exclude_guest = 1 will be stopped here >> and re-started after vm-exit. These events without attr.exclude_guest=1 >> will be in error state, and they cannot recovery into active state even >> if the guest stops running. This impacts host perf a lot and request >> host system wide perf events have attr.exclude_guest=1. >> >> This requests QEMU Process's perf event with attr.exclude_guest=1 also. >> >> During VM running, perf event creation for system wide and QEMU >> process without attr.exclude_guest=1 fail with -EBUSY. >> >> 2. NMI watchdog >> the perf event for NMI watchdog is a system wide cpu pinned event, it >> will be stopped also during vm running, but it doesn't have >> attr.exclude_guest=1, we add it in this RFC. But this still means NMI >> watchdog loses function during VM running. >> >> Two candidates exist for replacing perf event of NMI watchdog: >> a. Buddy hardlock detector[3] may be not reliable to replace perf event. >> b. HPET-based hardlock detector [4] isn't in the upstream kernel. > > I think the simplest solution is to allow mediated PMU usage if and only if > the NMI watchdog is disabled. Then whether or not the host replaces the NMI > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's > problem to solve. Make sense. KVM should not affect host high priority work. NMI watchdog is a client of perf and is a system wide perf event, perf can't distinguish a system wide perf event is NMI watchdog or others, so how about we extend this suggestion to all the system wide perf events ? mediated PMU is only allowed when all system wide perf events are disabled or non-exist at vm creation. but NMI watchdog is usually enabled, this will limit mediated PMU usage. > >> 3. Dedicated kvm_pmi_vector >> In emulated vPMU, host PMI handler notify KVM to inject a virtual >> PMI into guest when physical PMI belongs to guest counter. If the >> same mechanism is used in passthrough vPMU and PMI skid exists >> which cause physical PMI belonging to guest happens after VM-exit, >> then the host PMI handler couldn't identify this PMI belongs to >> host or guest. >> So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest >> has this vector only. The PMI belonging to host still has an NMI >> vector. >> >> Without considering PMI skid especially for AMD, the host NMI vector >> could be used for guest PMI also, this method is simpler and doesn't > > I don't see how multiplexing NMIs between guest and host is simpler. At best, > the complexity is a wash, just in different locations, and I highly doubt it's > a wash. AFAIK, there is no way to precisely know that an NMI came in via the > LVTPC. when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing NMI between guest and host, we could extend guest PT's PMI framework to mediated PMU. so I think this is simpler. > > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue. > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX. 
> >> need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we >> didn't meet the skid PMI issue on modern Intel processors. >> >> 4. per-VM passthrough mode configuration >> Current RFC uses a KVM module enable_passthrough_pmu RO parameter, >> it decides vPMU is passthrough mode or emulated mode at kvm module >> load time. >> Do we need the capability of per-VM passthrough mode configuration? >> So an admin can launch some non-passthrough VM and profile these >> non-passthrough VMs in host, but admin still cannot profile all >> the VMs once passthrough VM existence. This means passthrough vPMU >> and emulated vPMU mix on one platform, it has challenges to implement. >> As the commit message in commit 0011, the main challenge is >> passthrough vPMU and emulated vPMU have different vPMU features, this >> ends up with two different values for kvm_cap.supported_perf_cap, which >> is initialized at module load time. To support it, more refactor is >> needed. > > I have no objection to an all-or-nothing setup. I'd honestly love to rip out the > existing vPMU support entirely, but that's probably not be realistic, at least not > in the near future. > >> Remain Works >> === >> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch. > > Before this gets out of its "RFC" phase, I would at least like line of sight to > a more optimized switch. I 100% agree that starting with a conservative > implementation is the way to go, and the kernel absolutely needs to be able to > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the > guest PMU loaded for the entirety of KVM_RUN isn't a viable option. > > But I also don't want to get into a situation where can't figure out a clean, > robust way to do the optimized context switch without needing (another) massive > rewrite. > Current PMU context switch happens at each vm-entry/exit, this impacts guest performance even if guest doesn't use PMU, as our first optimization, we will switch the PMU context only when guest really use PMU. thanks
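To make the planned optimization concrete, here is a minimal, self-contained model of the lazy-switch idea sketched above. This is not KVM code and the names are purely illustrative: the point is simply that the guest PMU context is only saved/restored around VM-entry/exit once the guest has actually touched a PMU MSR.

#include <stdbool.h>

/* Illustrative model only, not KVM code: defer the vPMU context switch
 * until the guest demonstrably uses the PMU. */
struct vpmu_model {
        bool guest_uses_pmu;   /* set on the first guest write to a PMU MSR */
        bool ctx_loaded;       /* guest PMU state currently lives in hardware */
};

static void on_guest_pmu_msr_write(struct vpmu_model *v)
{
        v->guest_uses_pmu = true;
}

static void on_vm_entry(struct vpmu_model *v)
{
        /* Guests that never touch the PMU pay no switch cost at all. */
        if (v->guest_uses_pmu && !v->ctx_loaded) {
                /* restore guest counters and selectors here */
                v->ctx_loaded = true;
        }
}

static void on_vm_exit(struct vpmu_model *v)
{
        if (v->ctx_loaded) {
                /* save guest counters, then reload the host PMU here */
                v->ctx_loaded = false;
        }
}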
On Fri, Apr 12, 2024, Xiong Y Zhang wrote: > >> 2. NMI watchdog > >> the perf event for NMI watchdog is a system wide cpu pinned event, it > >> will be stopped also during vm running, but it doesn't have > >> attr.exclude_guest=1, we add it in this RFC. But this still means NMI > >> watchdog loses function during VM running. > >> > >> Two candidates exist for replacing perf event of NMI watchdog: > >> a. Buddy hardlock detector[3] may be not reliable to replace perf event. > >> b. HPET-based hardlock detector [4] isn't in the upstream kernel. > > > > I think the simplest solution is to allow mediated PMU usage if and only if > > the NMI watchdog is disabled. Then whether or not the host replaces the NMI > > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's > > problem to solve. > Make sense. KVM should not affect host high priority work. > NMI watchdog is a client of perf and is a system wide perf event, perf can't > distinguish a system wide perf event is NMI watchdog or others, so how about > we extend this suggestion to all the system wide perf events ? mediated PMU > is only allowed when all system wide perf events are disabled or non-exist at > vm creation. What other kernel-driven system wide perf events are there? > but NMI watchdog is usually enabled, this will limit mediated PMU usage. I don't think it is at all unreasonable to require users that want optimal PMU virtualization to adjust their environment. And we can and should document the tradeoffs and alternatives, e.g. so that users that want better PMU results don't need to re-discover all the "gotchas" on their own. This would even be one of the rare times where I would be ok with a dmesg log. E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide perf events, pr_warn() to explain the conflict and direct the user at documentation explaining how to make their system compatible with mediate PMU usage. > >> 3. Dedicated kvm_pmi_vector > >> In emulated vPMU, host PMI handler notify KVM to inject a virtual > >> PMI into guest when physical PMI belongs to guest counter. If the > >> same mechanism is used in passthrough vPMU and PMI skid exists > >> which cause physical PMI belonging to guest happens after VM-exit, > >> then the host PMI handler couldn't identify this PMI belongs to > >> host or guest. > >> So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest > >> has this vector only. The PMI belonging to host still has an NMI > >> vector. > >> > >> Without considering PMI skid especially for AMD, the host NMI vector > >> could be used for guest PMI also, this method is simpler and doesn't > > > > I don't see how multiplexing NMIs between guest and host is simpler. At best, > > the complexity is a wash, just in different locations, and I highly doubt it's > > a wash. AFAIK, there is no way to precisely know that an NMI came in via the > > LVTPC. > when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing > NMI between guest and host, we could extend guest PT's PMI framework to > mediated PMU. so I think this is simpler. Heh, what do you mean by "this"? Using a dedicated IRQ vector, or extending the PT framework of multiplexing NMI?
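A rough sketch of the pr_warn() behavior suggested above follows; it is not existing KVM code. The module parameter name mirrors the enable_mediated_pmu spelling used in the mail, and perf_has_host_wide_events() is a hypothetical helper standing in for whatever query perf would need to expose.

#include <linux/module.h>
#include <linux/printk.h>

static bool enable_mediated_pmu;
module_param(enable_mediated_pmu, bool, 0444);

/* Hypothetical helper: perf does not expose such a query today. */
bool perf_has_host_wide_events(void);

static void kvm_warn_on_mediated_pmu_conflict(void)
{
        /* Mediated PMU and !exclude_guest system-wide events conflict; tell
         * the admin how to resolve it instead of failing silently. */
        if (enable_mediated_pmu && perf_has_host_wide_events())
                pr_warn("kvm: mediated PMU enabled, but system-wide perf events "
                        "(e.g. the NMI watchdog) are active; see the KVM "
                        "documentation for how to disable them\n");
}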
On 4/13/2024 2:32 AM, Sean Christopherson wrote: > On Fri, Apr 12, 2024, Xiong Y Zhang wrote: >>>> 2. NMI watchdog >>>> the perf event for NMI watchdog is a system wide cpu pinned event, it >>>> will be stopped also during vm running, but it doesn't have >>>> attr.exclude_guest=1, we add it in this RFC. But this still means NMI >>>> watchdog loses function during VM running. >>>> >>>> Two candidates exist for replacing perf event of NMI watchdog: >>>> a. Buddy hardlock detector[3] may be not reliable to replace perf event. >>>> b. HPET-based hardlock detector [4] isn't in the upstream kernel. >>> >>> I think the simplest solution is to allow mediated PMU usage if and only if >>> the NMI watchdog is disabled. Then whether or not the host replaces the NMI >>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's >>> problem to solve. >> Make sense. KVM should not affect host high priority work. >> NMI watchdog is a client of perf and is a system wide perf event, perf can't >> distinguish a system wide perf event is NMI watchdog or others, so how about >> we extend this suggestion to all the system wide perf events ? mediated PMU >> is only allowed when all system wide perf events are disabled or non-exist at >> vm creation. > > What other kernel-driven system wide perf events are there? does "kernel-driven" mean perf events created through perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ? User can create system wide perf event through "perf record -e {} -a" also, I call it as user-driven system wide perf events. Perf subsystem doesn't distinguish "kernel-driven" and "user-driven" system wide perf events. > >> but NMI watchdog is usually enabled, this will limit mediated PMU usage. > > I don't think it is at all unreasonable to require users that want optimal PMU > virtualization to adjust their environment. And we can and should document the > tradeoffs and alternatives, e.g. so that users that want better PMU results don't > need to re-discover all the "gotchas" on their own. > > This would even be one of the rare times where I would be ok with a dmesg log. > E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide > perf events, pr_warn() to explain the conflict and direct the user at documentation > explaining how to make their system compatible with mediate PMU usage.> >>>> 3. Dedicated kvm_pmi_vector >>>> In emulated vPMU, host PMI handler notify KVM to inject a virtual >>>> PMI into guest when physical PMI belongs to guest counter. If the >>>> same mechanism is used in passthrough vPMU and PMI skid exists >>>> which cause physical PMI belonging to guest happens after VM-exit, >>>> then the host PMI handler couldn't identify this PMI belongs to >>>> host or guest. >>>> So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest >>>> has this vector only. The PMI belonging to host still has an NMI >>>> vector. >>>> >>>> Without considering PMI skid especially for AMD, the host NMI vector >>>> could be used for guest PMI also, this method is simpler and doesn't >>> >>> I don't see how multiplexing NMIs between guest and host is simpler. At best, >>> the complexity is a wash, just in different locations, and I highly doubt it's >>> a wash. AFAIK, there is no way to precisely know that an NMI came in via the >>> LVTPC. >> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing >> NMI between guest and host, we could extend guest PT's PMI framework to >> mediated PMU. 
so I think this is simpler. > > Heh, what do you mean by "this"? Using a dedicated IRQ vector, or extending the > PT framework of multiplexing NMI? Here, "this" means "extending the PT framework of multiplexing NMI". thanks >
On Mon, Apr 15, 2024, Xiong Y Zhang wrote: > On 4/13/2024 2:32 AM, Sean Christopherson wrote: > > On Fri, Apr 12, 2024, Xiong Y Zhang wrote: > >>>> 2. NMI watchdog > >>>> the perf event for NMI watchdog is a system wide cpu pinned event, it > >>>> will be stopped also during vm running, but it doesn't have > >>>> attr.exclude_guest=1, we add it in this RFC. But this still means NMI > >>>> watchdog loses function during VM running. > >>>> > >>>> Two candidates exist for replacing perf event of NMI watchdog: > >>>> a. Buddy hardlock detector[3] may be not reliable to replace perf event. > >>>> b. HPET-based hardlock detector [4] isn't in the upstream kernel. > >>> > >>> I think the simplest solution is to allow mediated PMU usage if and only if > >>> the NMI watchdog is disabled. Then whether or not the host replaces the NMI > >>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's > >>> problem to solve. > >> Make sense. KVM should not affect host high priority work. > >> NMI watchdog is a client of perf and is a system wide perf event, perf can't > >> distinguish a system wide perf event is NMI watchdog or others, so how about > >> we extend this suggestion to all the system wide perf events ? mediated PMU > >> is only allowed when all system wide perf events are disabled or non-exist at > >> vm creation. > > > > What other kernel-driven system wide perf events are there? > does "kernel-driven" mean perf events created through > perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ? By kernel-driven I meant events that aren't tied to a single userspace process or action. E.g. KVM creates events, but those events are effectively user-driven because they will go away if the associated VM terminates. > User can create system wide perf event through "perf record -e {} -a" also, I > call it as user-driven system wide perf events. Perf subsystem doesn't > distinguish "kernel-driven" and "user-driven" system wide perf events. Right, but us humans can build a list, even if it's only for documentation, e.g. to provide help for someone to run KVM guests with mediated PMUs, but can't because there are active !exclude_guest events. > >> but NMI watchdog is usually enabled, this will limit mediated PMU usage. > > > > I don't think it is at all unreasonable to require users that want optimal PMU > > virtualization to adjust their environment. And we can and should document the > > tradeoffs and alternatives, e.g. so that users that want better PMU results don't > > need to re-discover all the "gotchas" on their own. > > > > This would even be one of the rare times where I would be ok with a dmesg log. > > E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide > > perf events, pr_warn() to explain the conflict and direct the user at documentation > > explaining how to make their system compatible with mediate PMU usage.> > >>>> 3. Dedicated kvm_pmi_vector > >>>> In emulated vPMU, host PMI handler notify KVM to inject a virtual > >>>> PMI into guest when physical PMI belongs to guest counter. If the > >>>> same mechanism is used in passthrough vPMU and PMI skid exists > >>>> which cause physical PMI belonging to guest happens after VM-exit, > >>>> then the host PMI handler couldn't identify this PMI belongs to > >>>> host or guest. > >>>> So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest > >>>> has this vector only. The PMI belonging to host still has an NMI > >>>> vector. 
> >>>> > >>>> Without considering PMI skid especially for AMD, the host NMI vector > >>>> could be used for guest PMI also, this method is simpler and doesn't > >>> > >>> I don't see how multiplexing NMIs between guest and host is simpler. At best, > >>> the complexity is a wash, just in different locations, and I highly doubt it's > >>> a wash. AFAIK, there is no way to precisely know that an NMI came in via the > >>> LVTPC. > >> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing > >> NMI between guest and host, we could extend guest PT's PMI framework to > >> mediated PMU. so I think this is simpler. > > > > Heh, what do you mean by "this"? Using a dedicated IRQ vector, or extending the > > PT framework of multiplexing NMI? > here "this" means "extending the PT framework of multiplexing NMI". The PT framework's multiplexing is just as crude as regular PMIs though. Perf basically just asks KVM: is this yours? And KVM simply checks that the callback occurred while KVM_HANDLING_NMI is set. E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"), nothing would prevent perf from miscontruing a host PMI as a guest PMI, because KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active while KVM_HANDLING_NMI is set. And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI. KVM's emulated PMI can (and should) eliminate false positives for host PMIs by precisely checking exclude_guest, but that doesn't help with false negatives for guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come from the LVTPC. Is a naive implementation simpler? Maybe. But IMO, multiplexing NMI and getting all the edge cases right is more complex than using a dedicated vector for guest PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that an interrupt is for a guest PMI.
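To illustrate why that handshake is crude, here is a self-contained model of the check Sean describes; the names are illustrative rather than the real KVM symbols. Perf attributes an unclaimed NMI to the guest iff it arrives while KVM is inside its guest-NMI-handling window, so a host NMI landing inside that window is misattributed, and a guest PMI that skids past it is treated as host.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative model, not kernel code. */
enum intr_ctx { HANDLING_NONE, HANDLING_NMI };

struct vcpu_model {
        /* Set while KVM services an NMI that arrived in guest mode,
         * cleared as soon as that window ends. */
        enum intr_ctx handling_intr_from_guest;
};

/* What perf's NMI handler effectively asks for every unclaimed NMI:
 * "did this fire while KVM was handling a guest NMI?"  Any NMI in that
 * window -- PMI or not -- answers yes; a skidded guest PMI answers no. */
static bool nmi_attributed_to_guest(const struct vcpu_model *vcpu)
{
        return vcpu != NULL && vcpu->handling_intr_from_guest == HANDLING_NMI;
}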
On 4/15/2024 11:05 PM, Sean Christopherson wrote: > On Mon, Apr 15, 2024, Xiong Y Zhang wrote: >> On 4/13/2024 2:32 AM, Sean Christopherson wrote: >>> On Fri, Apr 12, 2024, Xiong Y Zhang wrote: >>>>>> 2. NMI watchdog >>>>>> the perf event for NMI watchdog is a system wide cpu pinned event, it >>>>>> will be stopped also during vm running, but it doesn't have >>>>>> attr.exclude_guest=1, we add it in this RFC. But this still means NMI >>>>>> watchdog loses function during VM running. >>>>>> >>>>>> Two candidates exist for replacing perf event of NMI watchdog: >>>>>> a. Buddy hardlock detector[3] may be not reliable to replace perf event. >>>>>> b. HPET-based hardlock detector [4] isn't in the upstream kernel. >>>>> >>>>> I think the simplest solution is to allow mediated PMU usage if and only if >>>>> the NMI watchdog is disabled. Then whether or not the host replaces the NMI >>>>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's >>>>> problem to solve. >>>> Make sense. KVM should not affect host high priority work. >>>> NMI watchdog is a client of perf and is a system wide perf event, perf can't >>>> distinguish a system wide perf event is NMI watchdog or others, so how about >>>> we extend this suggestion to all the system wide perf events ? mediated PMU >>>> is only allowed when all system wide perf events are disabled or non-exist at >>>> vm creation. >>> >>> What other kernel-driven system wide perf events are there? >> does "kernel-driven" mean perf events created through >> perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ? > > By kernel-driven I meant events that aren't tied to a single userspace process > or action. > > E.g. KVM creates events, but those events are effectively user-driven because > they will go away if the associated VM terminates. > >> User can create system wide perf event through "perf record -e {} -a" also, I >> call it as user-driven system wide perf events. Perf subsystem doesn't >> distinguish "kernel-driven" and "user-driven" system wide perf events. > > Right, but us humans can build a list, even if it's only for documentation, e.g. > to provide help for someone to run KVM guests with mediated PMUs, but can't > because there are active !exclude_guest events. > >>>> but NMI watchdog is usually enabled, this will limit mediated PMU usage. >>> >>> I don't think it is at all unreasonable to require users that want optimal PMU >>> virtualization to adjust their environment. And we can and should document the >>> tradeoffs and alternatives, e.g. so that users that want better PMU results don't >>> need to re-discover all the "gotchas" on their own. >>> >>> This would even be one of the rare times where I would be ok with a dmesg log. >>> E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide >>> perf events, pr_warn() to explain the conflict and direct the user at documentation >>> explaining how to make their system compatible with mediate PMU usage.> >>>>>> 3. Dedicated kvm_pmi_vector >>>>>> In emulated vPMU, host PMI handler notify KVM to inject a virtual >>>>>> PMI into guest when physical PMI belongs to guest counter. If the >>>>>> same mechanism is used in passthrough vPMU and PMI skid exists >>>>>> which cause physical PMI belonging to guest happens after VM-exit, >>>>>> then the host PMI handler couldn't identify this PMI belongs to >>>>>> host or guest. >>>>>> So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest >>>>>> has this vector only. 
The PMI belonging to host still has an NMI >>>>>> vector. >>>>>> >>>>>> Without considering PMI skid especially for AMD, the host NMI vector >>>>>> could be used for guest PMI also, this method is simpler and doesn't >>>>> >>>>> I don't see how multiplexing NMIs between guest and host is simpler. At best, >>>>> the complexity is a wash, just in different locations, and I highly doubt it's >>>>> a wash. AFAIK, there is no way to precisely know that an NMI came in via the >>>>> LVTPC. >>>> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing >>>> NMI between guest and host, we could extend guest PT's PMI framework to >>>> mediated PMU. so I think this is simpler. >>> >>> Heh, what do you mean by "this"? Using a dedicated IRQ vector, or extending the >>> PT framework of multiplexing NMI? >> here "this" means "extending the PT framework of multiplexing NMI". > > The PT framework's multiplexing is just as crude as regular PMIs though. Perf > basically just asks KVM: is this yours? And KVM simply checks that the callback > occurred while KVM_HANDLING_NMI is set. > > E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"), > nothing would prevent perf from miscontruing a host PMI as a guest PMI, because > KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active > while KVM_HANDLING_NMI is set. > > And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the > NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI. > KVM's emulated PMI can (and should) eliminate false positives for host PMIs by > precisely checking exclude_guest, but that doesn't help with false negatives for > guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come > from the LVTPC> > Is a naive implementation simpler? Maybe. But IMO, multiplexing NMI and getting > all the edge cases right is more complex than using a dedicated vector for guest > PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that > an interrupt is for a guest PMI. >Totally agree the complex to fix multiplexing NMI corner case. Thanks for explanation.
On Thu, Apr 11, 2024, Sean Christopherson wrote: > <bikeshed> > > I think we should call this a mediated PMU, not a passthrough PMU. KVM still > emulates the control plane (controls and event selectors), while the data is > fully passed through (counters). > > </bikeshed> Sean, I feel "mediated PMU" seems to be a little bit off the ..., no? In KVM, almost all of features are mediated. In our specific case, the legacy PMU is mediated by KVM and perf subsystem on the host. In new design, it is mediated by KVM only. We intercept the control plan in current design, but the only thing we do is the event filtering. No fancy code change to emulate the control registers. So, it is still a passthrough logic. In some (rare) business cases, I think maybe we could fully passthrough the control plan as well. For instance, sole-tenant machine, or full-machine VM + full offload. In case if there is a cpu errata, KVM can force vmexit and dynamically intercept the selectors on all vcpus with filters checked. It is not supported in current RFC, but maybe doable in later versions. With the above, I wonder if we can still use passthrough PMU for simplicity? But no strong opinion if you really want to keep this name. I would have to take some time to convince myself. Thanks. -Mingwei > > On Fri, Jan 26, 2024, Xiong Zhang wrote: > > > 1. host system wide / QEMU events handling during VM running > > At VM-entry, all the host perf events which use host x86 PMU will be > > stopped. These events with attr.exclude_guest = 1 will be stopped here > > and re-started after vm-exit. These events without attr.exclude_guest=1 > > will be in error state, and they cannot recovery into active state even > > if the guest stops running. This impacts host perf a lot and request > > host system wide perf events have attr.exclude_guest=1. > > > > This requests QEMU Process's perf event with attr.exclude_guest=1 also. > > > > During VM running, perf event creation for system wide and QEMU > > process without attr.exclude_guest=1 fail with -EBUSY. > > > > 2. NMI watchdog > > the perf event for NMI watchdog is a system wide cpu pinned event, it > > will be stopped also during vm running, but it doesn't have > > attr.exclude_guest=1, we add it in this RFC. But this still means NMI > > watchdog loses function during VM running. > > > > Two candidates exist for replacing perf event of NMI watchdog: > > a. Buddy hardlock detector[3] may be not reliable to replace perf event. > > b. HPET-based hardlock detector [4] isn't in the upstream kernel. > > I think the simplest solution is to allow mediated PMU usage if and only if > the NMI watchdog is disabled. Then whether or not the host replaces the NMI > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's > problem to solve. > > > 3. Dedicated kvm_pmi_vector > > In emulated vPMU, host PMI handler notify KVM to inject a virtual > > PMI into guest when physical PMI belongs to guest counter. If the > > same mechanism is used in passthrough vPMU and PMI skid exists > > which cause physical PMI belonging to guest happens after VM-exit, > > then the host PMI handler couldn't identify this PMI belongs to > > host or guest. > > So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest > > has this vector only. The PMI belonging to host still has an NMI > > vector. 
> > > > Without considering PMI skid especially for AMD, the host NMI vector > > could be used for guest PMI also, this method is simpler and doesn't > > I don't see how multiplexing NMIs between guest and host is simpler. At best, > the complexity is a wash, just in different locations, and I highly doubt it's > a wash. AFAIK, there is no way to precisely know that an NMI came in via the > LVTPC. > > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue. > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX. > > > need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we > > didn't meet the skid PMI issue on modern Intel processors. > > > > 4. per-VM passthrough mode configuration > > Current RFC uses a KVM module enable_passthrough_pmu RO parameter, > > it decides vPMU is passthrough mode or emulated mode at kvm module > > load time. > > Do we need the capability of per-VM passthrough mode configuration? > > So an admin can launch some non-passthrough VM and profile these > > non-passthrough VMs in host, but admin still cannot profile all > > the VMs once passthrough VM existence. This means passthrough vPMU > > and emulated vPMU mix on one platform, it has challenges to implement. > > As the commit message in commit 0011, the main challenge is > > passthrough vPMU and emulated vPMU have different vPMU features, this > > ends up with two different values for kvm_cap.supported_perf_cap, which > > is initialized at module load time. To support it, more refactor is > > needed. > > I have no objection to an all-or-nothing setup. I'd honestly love to rip out the > existing vPMU support entirely, but that's probably not be realistic, at least not > in the near future. > > > Remain Works > > === > > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch. > > Before this gets out of its "RFC" phase, I would at least like line of sight to > a more optimized switch. I 100% agree that starting with a conservative > implementation is the way to go, and the kernel absolutely needs to be able to > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the > guest PMU loaded for the entirety of KVM_RUN isn't a viable option. > > But I also don't want to get into a situation where can't figure out a clean, > robust way to do the optimized context switch without needing (another) massive > rewrite.
On Thu, Apr 18, 2024, Mingwei Zhang wrote: > On Thu, Apr 11, 2024, Sean Christopherson wrote: > > <bikeshed> > > > > I think we should call this a mediated PMU, not a passthrough PMU. KVM still > > emulates the control plane (controls and event selectors), while the data is > > fully passed through (counters). > > > > </bikeshed> > Sean, > > I feel "mediated PMU" seems to be a little bit off the ..., no? In > KVM, almost all of features are mediated. In our specific case, the > legacy PMU is mediated by KVM and perf subsystem on the host. In new > design, it is mediated by KVM only. > > We intercept the control plan in current design, but the only thing > we do is the event filtering. No fancy code change to emulate the control > registers. So, it is still a passthrough logic. > > In some (rare) business cases, I think maybe we could fully passthrough > the control plan as well. For instance, sole-tenant machine, or > full-machine VM + full offload. In case if there is a cpu errata, KVM > can force vmexit and dynamically intercept the selectors on all vcpus > with filters checked. It is not supported in current RFC, but maybe > doable in later versions. > > With the above, I wonder if we can still use passthrough PMU for > simplicity? But no strong opinion if you really want to keep this name. > I would have to take some time to convince myself. > One propoal. Maybe "direct vPMU"? I think there would be many words that focus on the "passthrough" side but not on the "interception/mediation" side? > Thanks. > -Mingwei > > > > On Fri, Jan 26, 2024, Xiong Zhang wrote: > > > > > 1. host system wide / QEMU events handling during VM running > > > At VM-entry, all the host perf events which use host x86 PMU will be > > > stopped. These events with attr.exclude_guest = 1 will be stopped here > > > and re-started after vm-exit. These events without attr.exclude_guest=1 > > > will be in error state, and they cannot recovery into active state even > > > if the guest stops running. This impacts host perf a lot and request > > > host system wide perf events have attr.exclude_guest=1. > > > > > > This requests QEMU Process's perf event with attr.exclude_guest=1 also. > > > > > > During VM running, perf event creation for system wide and QEMU > > > process without attr.exclude_guest=1 fail with -EBUSY. > > > > > > 2. NMI watchdog > > > the perf event for NMI watchdog is a system wide cpu pinned event, it > > > will be stopped also during vm running, but it doesn't have > > > attr.exclude_guest=1, we add it in this RFC. But this still means NMI > > > watchdog loses function during VM running. > > > > > > Two candidates exist for replacing perf event of NMI watchdog: > > > a. Buddy hardlock detector[3] may be not reliable to replace perf event. > > > b. HPET-based hardlock detector [4] isn't in the upstream kernel. > > > > I think the simplest solution is to allow mediated PMU usage if and only if > > the NMI watchdog is disabled. Then whether or not the host replaces the NMI > > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's > > problem to solve. > > > > > 3. Dedicated kvm_pmi_vector > > > In emulated vPMU, host PMI handler notify KVM to inject a virtual > > > PMI into guest when physical PMI belongs to guest counter. If the > > > same mechanism is used in passthrough vPMU and PMI skid exists > > > which cause physical PMI belonging to guest happens after VM-exit, > > > then the host PMI handler couldn't identify this PMI belongs to > > > host or guest. 
> > > So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest > > > has this vector only. The PMI belonging to host still has an NMI > > > vector. > > > > > > Without considering PMI skid especially for AMD, the host NMI vector > > > could be used for guest PMI also, this method is simpler and doesn't > > > > I don't see how multiplexing NMIs between guest and host is simpler. At best, > > the complexity is a wash, just in different locations, and I highly doubt it's > > a wash. AFAIK, there is no way to precisely know that an NMI came in via the > > LVTPC. > > > > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue. > > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX. > > > > > need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we > > > didn't meet the skid PMI issue on modern Intel processors. > > > > > > 4. per-VM passthrough mode configuration > > > Current RFC uses a KVM module enable_passthrough_pmu RO parameter, > > > it decides vPMU is passthrough mode or emulated mode at kvm module > > > load time. > > > Do we need the capability of per-VM passthrough mode configuration? > > > So an admin can launch some non-passthrough VM and profile these > > > non-passthrough VMs in host, but admin still cannot profile all > > > the VMs once passthrough VM existence. This means passthrough vPMU > > > and emulated vPMU mix on one platform, it has challenges to implement. > > > As the commit message in commit 0011, the main challenge is > > > passthrough vPMU and emulated vPMU have different vPMU features, this > > > ends up with two different values for kvm_cap.supported_perf_cap, which > > > is initialized at module load time. To support it, more refactor is > > > needed. > > > > I have no objection to an all-or-nothing setup. I'd honestly love to rip out the > > existing vPMU support entirely, but that's probably not be realistic, at least not > > in the near future. > > > > > Remain Works > > > === > > > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch. > > > > Before this gets out of its "RFC" phase, I would at least like line of sight to > > a more optimized switch. I 100% agree that starting with a conservative > > implementation is the way to go, and the kernel absolutely needs to be able to > > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the > > guest PMU loaded for the entirety of KVM_RUN isn't a viable option. > > > > But I also don't want to get into a situation where can't figure out a clean, > > robust way to do the optimized context switch without needing (another) massive > > rewrite.
On Thu, Apr 18, 2024, Mingwei Zhang wrote: > On Thu, Apr 11, 2024, Sean Christopherson wrote: > > <bikeshed> > > > > I think we should call this a mediated PMU, not a passthrough PMU. KVM still > > emulates the control plane (controls and event selectors), while the data is > > fully passed through (counters). > > > > </bikeshed> > Sean, > > I feel "mediated PMU" seems to be a little bit off the ..., no? In > KVM, almost all of features are mediated. In our specific case, the > legacy PMU is mediated by KVM and perf subsystem on the host. In new > design, it is mediated by KVM only. Currently, at a feature level, I mentally bin things into two rough categories in KVM: 1. Virtualized - Guest state is loaded into hardware, or hardware supports running with both host and guest state (e.g. TSC scaling), and the guest has full read/write access to its state while running. 2. Emulated - Guest state is never loaded into hardware, and instead the feature/state is emulated in software. There is no "Passthrough" because that's (mostly) covered by my Virtualized definition. And because I also think of passthrough as being about *assets*, not about the features themselves. They are far from perfect definitions, e.g. individual assets can be passed through, virtualized by hardware, or emulated in software. But for the most part, I think classifying features as virtualized vs. emulated works well, as it helps reason about the expected behavior and performance of a feature. E.g. for some virtualized features, certain assets may need to be explicitly passed through, e.g. access to x2APIC MSRs for APICv. But APICv itself still falls into the virtualized category, e.g. the "real" APIC state isn't passed through to the guest. If KVM didn't already have a PMU implementation to deal with, this wouldn't be an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the virtualized category. But we need to distinguish between the two PMU models, and using "enable_virtualized_pmu" would be comically confusing for users. :-) And because this is user visible, I would like to come up with a name that (some) KVM users will already be familiar with, i.e. will have some chance of intuitively understand without having to go read docs. Which is why I proposed "mediated"; what we are proposing for the PMU is similar to the "mediated device" concepts in VFIO. And I also think "mediated" is a good fit in general, e.g. this becomes my third classification: 3. Mediated - Guest is context switched at VM-Enter/VM-Exit, i.e. is loaded into hardware, but the guest does NOT have full read/write access to the feature. But my main motiviation for using "mediated" really is that I hope that it will help KVM users grok the basic gist of the design without having to read and understand KVM documentation, because there is already existing terminology in the broader KVM space. > We intercept the control plan in current design, but the only thing > we do is the event filtering. No fancy code change to emulate the control > registers. So, it is still a passthrough logic. It's not though. Passthrough very specifically means the guest has unfettered access to some asset, and/or KVM does no filtering/adjustments whatseover. "Direct" is similar, e.g. KVM's uses "direct" in MMU context to refer to addresses that don't require KVM to intervene and translate. E.g. entire MMUs can be direct, but individual shadow pages can also be direct (no corresponding guest PTE to translate). 
For this flavor of PMU, it's not full passthrough or direct. Some assets are passed through, e.g. PMCs, but others are not. > In some (rare) business cases, I think maybe we could fully passthrough > the control plan as well. For instance, sole-tenant machine, or > full-machine VM + full offload. In case if there is a cpu errata, KVM > can force vmexit and dynamically intercept the selectors on all vcpus > with filters checked. It is not supported in current RFC, but maybe > doable in later versions. Heh, that's an argument for using something other than "passthrough", because if we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu, or in the spirit of KVM shortlogs, really_passthrough_pmu :-) Though I think even then I would vote for "enable_dedicated_pmu", or something along those lines, purely to avoid overloading "passthrough", i.e. to try to use passthrough strictly when talking about assets, not features. And because unless we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the PMU as KVM would be emulating PMIs.
> Currently, at a feature level, I mentally bin things into two rough categories > in KVM: > > 1. Virtualized - Guest state is loaded into hardware, or hardware supports > running with both host and guest state (e.g. TSC scaling), and > the guest has full read/write access to its state while running. > > 2. Emulated - Guest state is never loaded into hardware, and instead the > feature/state is emulated in software. > > There is no "Passthrough" because that's (mostly) covered by my Virtualized > definition. And because I also think of passthrough as being about *assets*, > not about the features themselves. Sure. In fact, "virtualized" works for me as well. My mind is aligned with this. > > They are far from perfect definitions, e.g. individual assets can be passed through, > virtualized by hardware, or emulated in software. But for the most part, I think > classifying features as virtualized vs. emulated works well, as it helps reason > about the expected behavior and performance of a feature. > > E.g. for some virtualized features, certain assets may need to be explicitly passed > through, e.g. access to x2APIC MSRs for APICv. But APICv itself still falls > into the virtualized category, e.g. the "real" APIC state isn't passed through > to the guest. > > If KVM didn't already have a PMU implementation to deal with, this wouldn't be > an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the > virtualized category. But we need to distinguish between the two PMU models, > and using "enable_virtualized_pmu" would be comically confusing for users. :-) > > And because this is user visible, I would like to come up with a name that (some) > KVM users will already be familiar with, i.e. will have some chance of intuitively > understand without having to go read docs. > > Which is why I proposed "mediated"; what we are proposing for the PMU is similar > to the "mediated device" concepts in VFIO. And I also think "mediated" is a good > fit in general, e.g. this becomes my third classification: > > 3. Mediated - Guest is context switched at VM-Enter/VM-Exit, i.e. is loaded > into hardware, but the guest does NOT have full read/write access > to the feature. > > But my main motiviation for using "mediated" really is that I hope that it will > help KVM users grok the basic gist of the design without having to read and > understand KVM documentation, because there is already existing terminology in > the broader KVM space. Understand this part. Mediated is the fact that KVM sits in between, but I feel we can find a better name :) > > > We intercept the control plan in current design, but the only thing > > we do is the event filtering. No fancy code change to emulate the control > > registers. So, it is still a passthrough logic. > > It's not though. Passthrough very specifically means the guest has unfettered > access to some asset, and/or KVM does no filtering/adjustments whatseover. > > "Direct" is similar, e.g. KVM's uses "direct" in MMU context to refer to addresses > that don't require KVM to intervene and translate. E.g. entire MMUs can be direct, > but individual shadow pages can also be direct (no corresponding guest PTE to > translate). Oh, isn't "direct" a perfect word for this? Look, our new design does not require KVM to translate the encodings into events and into encoding again (in "perf subsystem") before entering HW. It is really "direct" in this sense, no? Neither does KVM do any translation of the event encodings across micro-architectures. 
So, it is really _direct_ from this perspective as well. On the other hand, "direct" suggests straightforward passthrough, which is not always the case here, since KVM retains the power of control. > > For this flavor of PMU, it's not full passthrough or direct. Some assets are > passed through, e.g. PMCs, but others are not. > > > In some (rare) business cases, I think maybe we could fully passthrough > > the control plan as well. For instance, sole-tenant machine, or > > full-machine VM + full offload. In case if there is a cpu errata, KVM > > can force vmexit and dynamically intercept the selectors on all vcpus > > with filters checked. It is not supported in current RFC, but maybe > > doable in later versions. > > Heh, that's an argument for using something other than "passthrough", because if > we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu, > or in the spirit of KVM shortlogs, really_passthrough_pmu :-) Full passthrough is possible, and names like "really_passthrough" could all still live under the "direct PMU" umbrella. > > Though I think even then I would vote for "enable_dedicated_pmu", or something > along those lines, purely to avoid overloading "passthrough", i.e. to try to use > passhtrough strictly when talking about assets, not features. And because unless > we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the > PMU as KVM would be emulating PMIs. I agree to avoid "passthrough". Dedicated is also a fine word. It indicates the PMU is dedicated to serving the KVM guests. But the scope might be a little narrow. This is just my opinion. Maybe it is because my mind has been stuck with "direct" :) Thanks. -Mingwei