From patchwork Mon Jul 24 10:41:54 2023
X-Patchwork-Submitter: "Zhang, Xiong Y"
X-Patchwork-Id: 13324370
From: Xiong Zhang
To: kvm@vger.kernel.org
Cc: seanjc@google.com, like.xu.linux@gmail.com, weijiang.yang@intel.com,
 zhiyuan.lv@intel.com, zhenyu.z.wang@intel.com, kan.liang@intel.com,
 Xiong Zhang
Subject: [PATCH] Documentation: KVM: Add vPMU implementation and gap document
Date: Mon, 24 Jul 2023 18:41:54 +0800
Message-Id: <20230724104154.259573-1-xiong.y.zhang@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Add a vPMU implementation and gap document explaining the vArch PMU and
vLBR implementation in KVM, and especially the current gaps in
supporting coexisting host and guest perf events.

Signed-off-by: Xiong Zhang
---
 Documentation/virt/kvm/x86/index.rst |   1 +
 Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/pmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@ KVM for x86 systems
    mmu
    msr
    nested-vmx
+   pmu
    running-nested-guests
    timekeeping

diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..e95e8c88e0e0
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,249 @@
.. SPDX-License-Identifier: GPL-2.0

==========================
PMU virtualization for X86
==========================

:Author: Xiong Zhang
:Copyright: (c) 2023, Intel. All rights reserved.

.. Contents

1. Overview
2. Perf Scheduler
3. Arch PMU virtualization
4. LBR virtualization

1.
Overview
===========

KVM has supported PMU virtualization on x86 for many years, providing
an MSR-based Arch PMU interface to the guest. The major features
include Arch PMU v2, LBR and PEBS. Users can profile performance in the
guest in the same way as on the host.

KVM is a normal perf subsystem user, just like any other. When the
guest accesses vPMU MSRs, KVM traps the access and creates a perf event
for it. This perf event takes part in perf scheduling to request PMU
resources, and the guest then uses the resources it obtains.

This document describes the x86 PMU virtualization architecture design
and its open issues. It is organized as follows: the next section
describes the Linux perf scheduler in more detail, since it plays a key
role in the vPMU implementation and allocates PMU resources for guest
use. Then Arch PMU virtualization and LBR virtualization are
introduced; each feature has sections covering an implementation
overview and the expectations and gaps when host and guest perf events
coexist.

2. Perf Scheduler
=================

The perf scheduler is responsible for choosing which events are active
at any given moment and for binding counters to perf events. Since a
processor has a limited number of PMU counters and related resources,
only a limited number of perf events can be active at once; an inactive
perf event may become active at the next opportunity. The perf
scheduler defines rules to control this.

The following cases usually cause a perf event reschedule:

1) A context switch from one task to a different task.
2) An event is manually enabled.
3) A call to perf_event_open() with the disabled field of the
perf_event_attr argument set to 0.

When a perf event reschedule is needed on a specific CPU, perf sends an
IPI to the target CPU. The IPI handler activates events ordered by
event type, iterating over all eligible events.

When a perf event is scheduled out, the counter mapped to the event is
disabled, and the counter's configuration and count value are saved.
When a perf event is scheduled in, the perf driver assigns a counter to
the event, and the counter's configuration and count value are restored
from the last save.

Perf defines four types of events, listed from highest to lowest
priority:

a. Per-cpu pinned: the event should be measured on the specified
   logical core whenever it is enabled.
b. Per-process pinned: the event should be measured whenever it is
   enabled and the process is running on any logical core.
c. Per-cpu flexible: the event should be measured on the specified
   logical core whenever it is enabled.
d. Per-process flexible: the event should be measured whenever it is
   enabled and the process is running on any logical core.

If an event cannot be scheduled because no resource is available for
it, the outcome depends on its type: a pinned event goes into the error
state and is excluded from perf scheduling, and the only way to recover
it is to re-enable it; a flexible event goes into the inactive state
and can be multiplexed with other events if needed.

3. Arch PMU virtualization
==========================

3.1. Overview
-------------

Once KVM/QEMU exposes the vCPU's Arch PMU capability to the guest, the
guest PMU driver accesses the Arch PMU MSRs (including fixed and
general-purpose counters) just as the host does. All guest accesses to
Arch PMU MSRs are intercepted.

When a guest virtual counter is enabled through a guest MSR write, the
KVM trap handler creates a kvm perf event through the perf subsystem.
The kvm perf event's attributes are derived from the guest virtual
counter's MSR settings.

When the guest later changes the virtual counter's settings, the KVM
trap handler releases the old kvm perf event and creates a new one with
the new settings.

When the guest reads the virtual counter's value, the KVM trap handler
reads the kvm perf event's counter value and accumulates it into the
previously saved counter value.
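The read path described above (previously saved value plus the live kvm
perf event count) can be sketched as a small model. This is illustrative
Python, not KVM's actual code, and all names are hypothetical:

```python
class VirtualCounter:
    """Toy model of a guest virtual counter backed by kvm perf events.

    Each reconfiguration releases the backing perf event, so the final
    count of the old event must be folded into a saved value to keep
    the guest-visible counter accumulating monotonically.
    """

    def __init__(self):
        self.prev_count = 0   # accumulated counts from released events
        self.event_count = 0  # count of the current backing perf event

    def on_config_change(self):
        # The guest rewrote the counter's config MSR: fold the old kvm
        # perf event's final count into the saved value, then the new
        # event starts counting from zero.
        self.prev_count += self.event_count
        self.event_count = 0

    def on_guest_read(self):
        # Guest counter read: live perf event count plus everything
        # accumulated by earlier, already-released events.
        return self.prev_count + self.event_count
```

For example, if the first backing event counted 100 before a
reconfiguration and the new event has counted 50, the guest reads 150.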
When the guest no longer accesses the virtual counter's MSRs within a
scheduling time slice and the virtual counter is disabled, KVM releases
the kvm perf event.

::

    ----------------------------
    |          Guest           |
    |      perf subsystem      |
    ----------------------------
         |            ^
    vMSR |            | vPMI
         v            |
    ----------------------------
    |    vPMU   KVM   vCPU     |
    ----------------------------
         |            ^
    Call |            | Callbacks
         v            |
    ---------------------------
    |    Host Linux Kernel    |
    |     perf subsystem      |
    ---------------------------
         |            ^
     MSR |            | PMI
         v            |
        --------------------
        |    PMU   CPU     |
        --------------------

Each guest virtual counter has a corresponding kvm perf event, which
joins the host perf scheduler and complies with its scheduling rules.
When the kvm perf event is scheduled by the host perf scheduler and is
active, the guest virtual counter supplies correct values. However, if
another host perf event comes in and takes over the kvm perf event's
resources, the kvm perf event becomes inactive, and the virtual counter
then supplies wrong, meaningless values.

3.2. Host and Guest perf event contention
-----------------------------------------

The kvm perf event is a per-process pinned event, which is the
second-highest priority type. When the kvm perf event is active, it can
be preempted by a host per-cpu pinned perf event, and it can preempt
host flexible perf events. Such preemption can be temporarily
prohibited by disabling host IRQs.

According to the perf scheduler rules, the following results are
expected when host and guest perf events coexist:

1) If host per-cpu pinned events occupy all the HW resources, the kvm
perf event cannot become active since no resource is available, and the
virtual counter value is always zero when the guest reads it.
2)
If host per-cpu pinned events release HW resources while the kvm perf
event is inactive, the kvm perf event can claim the resources and
become active. The guest then gets correct values from the virtual
counter while the kvm perf event is active, but the guest's total count
is incorrect, since counts were lost while the kvm perf event was
inactive.
3) If the kvm perf event is active and a host per-cpu pinned perf event
then becomes active and reclaims the kvm perf event's resources, the
kvm perf event becomes inactive. The virtual counter keeps the
previously saved value unchanged when the guest reads it, so the
guest's total count is incorrect.
4) If host flexible perf events occupy all the HW resources, the kvm
perf event can still become active by preempting a host flexible perf
event's resources, and the guest gets correct values from the virtual
counter.
5) If the kvm perf event is active and other host flexible perf events
then request to become active, the kvm perf event still owns its
resources and stays active, so the guest gets correct values from the
virtual counter.

3.3. vPMU Arch Gaps
-------------------

The coexistence of host and guest perf events has the following gaps:

1) When the guest accesses PMU MSRs for the first time, KVM traps the
access and creates a kvm perf event, but this event may be inactive
because of contention with host perf events. The guest does not notice
this, and when it reads the virtual counter the returned value is zero.
2) When the kvm perf event is active, a host per-cpu pinned perf event
can reclaim the kvm perf event's resources at any time once resource
contention happens. The guest does not notice this either, and its
subsequent counter accesses return wrong data.

Because of these gaps, the mailing list discussed the topic under the
title "Reconsider the current approach of vPMU".
https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/

The major suggestion in that discussion is that the host pass some
counters through to the guest, but this suggestion is not feasible for
the following reasons:

a. The processor has several counters, but the counters are not equal:
   some events must be bound to a specific counter.
b. If a special counter is passed through to the guest, the host can no
   longer support the events bound to it and loses some capability.
c. If a normal counter is passed through to the guest, the guest can
   support only general events, so the guest has limited capability.

So both host and guest lose capability in pass-through mode.

4. LBR Virtualization
=====================

4.1. Overview
-------------

Once KVM/QEMU exposes the vCPU's LBR capability to the guest, the guest
LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR and the
record MSRs) just as the host does. The guest's first access to an
LBR-related MSR is always intercepted. The KVM trap handler creates a
vLBR perf event which enables the callstack mode and has none of the
hardware counters assigned. Host perf enables and schedules this event
as usual.

When the vLBR event is scheduled by the host perf scheduler and is
active, the host LBR MSRs are owned by the guest and passed through to
it, and the guest accesses them without VM-Exits. However, if another
host LBR event comes in and takes over the LBR facility, the vLBR event
becomes inactive, and the guest's subsequent accesses to the LBR MSRs
are trapped and meaningless.

Like the kvm perf event, the vLBR event is released when the guest does
not access LBR-related MSRs within a scheduling time slice and has
cleared the LBR enable bit; the pass-through state of the LBR MSRs is
then canceled.

4.2. Host and Guest LBR contention
----------------------------------

The vLBR event is a per-process pinned event, which is the
second-highest priority type.
The vLBR event contends for the LBR facility together with other host
LBR events. According to the perf scheduler rules, when the vLBR event
is active, it can be preempted by a host per-cpu pinned LBR event, and
it can preempt host flexible LBR events. Such preemption can be
temporarily prohibited by disabling host IRQs, since the perf scheduler
uses an IPI to change the LBR owner.

The following results are expected when host and guest LBR events
coexist:

1) If a host per-cpu pinned LBR event is active when the VM starts, the
guest vLBR event cannot preempt the LBR facility, so the guest cannot
use the LBR.
2) If host flexible LBR events are active when the VM starts, the guest
vLBR event can preempt the LBR facility, so the guest can use the LBR.
3) If a host per-cpu pinned LBR event becomes enabled while the guest
vLBR event is active, the guest vLBR event loses the LBR facility and
the guest cannot use the LBR anymore.
4) If a host flexible LBR event becomes enabled while the guest vLBR
event is active, the guest vLBR event keeps the LBR facility and the
guest can still use the LBR.
5) If a host per-cpu pinned LBR event becomes inactive while the guest
vLBR event is inactive, the guest vLBR event can become active and own
the LBR facility, so the guest can use the LBR.

4.3. vLBR Arch Gaps
-------------------

As with the vPMU Arch gaps, the vLBR event can be preempted by a host
per-cpu pinned event at any time, or may be inactive from the moment it
is created, but the guest cannot notice this, so the guest reads
meaningless values while the vLBR event is inactive.
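The contention outcomes listed in sections 3.2 and 4.2 follow directly
from the event-type priority order of section 2. They can be captured
in a toy model; this is illustrative Python, not the perf scheduler,
and the names are hypothetical (ties between events of the same type
are ignored):

```python
# Perf event types in scheduling priority order (high to low), as
# listed in section 2 of this document.
PRIORITY = {
    "cpu_pinned": 0,        # per-cpu pinned
    "process_pinned": 1,    # per-process pinned: kvm perf / vLBR events
    "cpu_flexible": 2,      # per-cpu flexible
    "process_flexible": 3,  # per-process flexible
}


def guest_event_stays_active(incoming: str) -> bool:
    """Whether an active kvm perf event or vLBR event (per-process
    pinned) keeps its hardware resource when a host event of type
    `incoming` contends for it.

    Toy model of the expected outcomes: only a strictly
    higher-priority host event can preempt the guest's event.
    """
    return PRIORITY[incoming] > PRIORITY["process_pinned"]
```

Under this model, an incoming host per-cpu pinned event preempts the
guest's event (cases 3 above and in 3.2), while incoming host flexible
events do not (cases 4 and 5).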