diff mbox series

[kvm-unit-tests,v2,4/5] x86: pmu: Support validation for Intel PMU fixed counter 3

Message ID 20231031092921.2885109-5-dapeng1.mi@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Fix PMU test failures on Sapphire Rapids | expand

Commit Message

Mi, Dapeng Oct. 31, 2023, 9:29 a.m. UTC
Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
(fixed counter 3) to counter/sample topdown.slots event, but current
code still doesn't cover this new fixed counter.

So this patch adds code to validate this new fixed counter can count
slots event correctly.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Jim Mattson Oct. 31, 2023, 6:47 p.m. UTC | #1
On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
> (fixed counter 3) to counter/sample topdown.slots event, but current
> code still doesn't cover this new fixed counter.
>
> So this patch adds code to validate this new fixed counter can count
> slots event correctly.

I'm not convinced that this actually validates anything.

Suppose, for example, that KVM used fixed counter 1 when the guest
asked for fixed counter 3. Wouldn't this test still pass?

> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  x86/pmu.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 6bd8f6d53f55..404dc7b62ac2 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -47,6 +47,7 @@ struct pmu_event {
>         {"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
>         {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 500*N},
>         {"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 300*N},
> +       {"fixed 4", MSR_CORE_PERF_FIXED_CTR0 + 3, 1*N, 5000*N},
>  };
>
>  char *buf;
> --
> 2.34.1
>
Mi, Dapeng Nov. 1, 2023, 2:33 a.m. UTC | #2
On 11/1/2023 2:47 AM, Jim Mattson wrote:
> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
>> (fixed counter 3) to counter/sample topdown.slots event, but current
>> code still doesn't cover this new fixed counter.
>>
>> So this patch adds code to validate this new fixed counter can count
>> slots event correctly.
> I'm not convinced that this actually validates anything.
>
> Suppose, for example, that KVM used fixed counter 1 when the guest
> asked for fixed counter 3. Wouldn't this test still pass?


Per my understanding, as long as the KVM returns a valid count in the 
reasonable count range, we can think KVM works correctly. We don't need 
to entangle on how KVM really uses the HW, it could be impossible and 
unnecessary.

Yeah, currently the predefined valid count range may be some kind of 
loose since I want to cover as much as hardwares and avoid to cause 
regression. Especially after introducing the random jump and clflush 
instructions, the cycles and slots become much more hard to predict. 
Maybe we can have a comparable restricted count range in the initial 
change, and we can loosen the restriction then if we encounter a failure 
on some specific hardware. do you think it's better? Thanks.


>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>   x86/pmu.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 6bd8f6d53f55..404dc7b62ac2 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -47,6 +47,7 @@ struct pmu_event {
>>          {"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
>>          {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 500*N},
>>          {"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 300*N},
>> +       {"fixed 4", MSR_CORE_PERF_FIXED_CTR0 + 3, 1*N, 5000*N},
>>   };
>>
>>   char *buf;
>> --
>> 2.34.1
>>
Jim Mattson Nov. 1, 2023, 2:47 a.m. UTC | #3
On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 11/1/2023 2:47 AM, Jim Mattson wrote:
> > On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
> >> (fixed counter 3) to counter/sample topdown.slots event, but current
> >> code still doesn't cover this new fixed counter.
> >>
> >> So this patch adds code to validate this new fixed counter can count
> >> slots event correctly.
> > I'm not convinced that this actually validates anything.
> >
> > Suppose, for example, that KVM used fixed counter 1 when the guest
> > asked for fixed counter 3. Wouldn't this test still pass?
>
>
> Per my understanding, as long as the KVM returns a valid count in the
> reasonable count range, we can think KVM works correctly. We don't need
> to entangle on how KVM really uses the HW, it could be impossible and
> unnecessary.

Now, I see how the Pentium FDIV bug escaped notice. Hey, the numbers
are in a reasonable range. What's everyone upset about?

> Yeah, currently the predefined valid count range may be some kind of
> loose since I want to cover as much as hardwares and avoid to cause
> regression. Especially after introducing the random jump and clflush
> instructions, the cycles and slots become much more hard to predict.
> Maybe we can have a comparable restricted count range in the initial
> change, and we can loosen the restriction then if we encounter a failure
> on some specific hardware. do you think it's better? Thanks.

I think the test is essentially useless, and should probably just be
deleted, so that it doesn't give a false sense of confidence.
Mi, Dapeng Nov. 1, 2023, 3:15 a.m. UTC | #4
On 11/1/2023 10:47 AM, Jim Mattson wrote:
> On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 11/1/2023 2:47 AM, Jim Mattson wrote:
>>> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
>>>> (fixed counter 3) to counter/sample topdown.slots event, but current
>>>> code still doesn't cover this new fixed counter.
>>>>
>>>> So this patch adds code to validate this new fixed counter can count
>>>> slots event correctly.
>>> I'm not convinced that this actually validates anything.
>>>
>>> Suppose, for example, that KVM used fixed counter 1 when the guest
>>> asked for fixed counter 3. Wouldn't this test still pass?
>>
>> Per my understanding, as long as the KVM returns a valid count in the
>> reasonable count range, we can think KVM works correctly. We don't need
>> to entangle on how KVM really uses the HW, it could be impossible and
>> unnecessary.
> Now, I see how the Pentium FDIV bug escaped notice. Hey, the numbers
> are in a reasonable range. What's everyone upset about?
>
>> Yeah, currently the predefined valid count range may be some kind of
>> loose since I want to cover as much as hardwares and avoid to cause
>> regression. Especially after introducing the random jump and clflush
>> instructions, the cycles and slots become much more hard to predict.
>> Maybe we can have a comparable restricted count range in the initial
>> change, and we can loosen the restriction then if we encounter a failure
>> on some specific hardware. do you think it's better? Thanks.
> I think the test is essentially useless, and should probably just be
> deleted, so that it doesn't give a false sense of confidence.

IMO, I can't say the tests are totally useless. Yes,  passing the tests 
doesn't mean the KVM vPMU must work correctly, but we can say there is 
something probably wrong if it fails to pass these tests. Considering 
the hardware differences, it's impossible to set an exact value for 
these events in advance and it seems there is no better method to verify 
the PMC count as well. I still prefer to keep these tests until we have 
a better method to verify the accuracy of the PMC count.
Jim Mattson Nov. 1, 2023, 3:24 a.m. UTC | #5
On Tue, Oct 31, 2023 at 8:16 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 11/1/2023 10:47 AM, Jim Mattson wrote:
> > On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 11/1/2023 2:47 AM, Jim Mattson wrote:
> >>> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
> >>>> (fixed counter 3) to counter/sample topdown.slots event, but current
> >>>> code still doesn't cover this new fixed counter.
> >>>>
> >>>> So this patch adds code to validate this new fixed counter can count
> >>>> slots event correctly.
> >>> I'm not convinced that this actually validates anything.
> >>>
> >>> Suppose, for example, that KVM used fixed counter 1 when the guest
> >>> asked for fixed counter 3. Wouldn't this test still pass?
> >>
> >> Per my understanding, as long as the KVM returns a valid count in the
> >> reasonable count range, we can think KVM works correctly. We don't need
> >> to entangle on how KVM really uses the HW, it could be impossible and
> >> unnecessary.
> > Now, I see how the Pentium FDIV bug escaped notice. Hey, the numbers
> > are in a reasonable range. What's everyone upset about?
> >
> >> Yeah, currently the predefined valid count range may be some kind of
> >> loose since I want to cover as much as hardwares and avoid to cause
> >> regression. Especially after introducing the random jump and clflush
> >> instructions, the cycles and slots become much more hard to predict.
> >> Maybe we can have a comparable restricted count range in the initial
> >> change, and we can loosen the restriction then if we encounter a failure
> >> on some specific hardware. do you think it's better? Thanks.
> > I think the test is essentially useless, and should probably just be
> > deleted, so that it doesn't give a false sense of confidence.
>
> IMO, I can't say the tests are totally useless. Yes,  passing the tests
> doesn't mean the KVM vPMU must work correctly, but we can say there is
> something probably wrong if it fails to pass these tests. Considering
> the hardware differences, it's impossible to set an exact value for
> these events in advance and it seems there is no better method to verify
> the PMC count as well. I still prefer to keep these tests until we have
> a better method to verify the accuracy of the PMC count.

If it's impossible to set an exact value for these events in advance,
how does Intel validate the hardware PMU?
Mi, Dapeng Nov. 1, 2023, 3:57 a.m. UTC | #6
On 11/1/2023 11:24 AM, Jim Mattson wrote:
> On Tue, Oct 31, 2023 at 8:16 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 11/1/2023 10:47 AM, Jim Mattson wrote:
>>> On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 11/1/2023 2:47 AM, Jim Mattson wrote:
>>>>> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
>>>>>> (fixed counter 3) to counter/sample topdown.slots event, but current
>>>>>> code still doesn't cover this new fixed counter.
>>>>>>
>>>>>> So this patch adds code to validate this new fixed counter can count
>>>>>> slots event correctly.
>>>>> I'm not convinced that this actually validates anything.
>>>>>
>>>>> Suppose, for example, that KVM used fixed counter 1 when the guest
>>>>> asked for fixed counter 3. Wouldn't this test still pass?
>>>> Per my understanding, as long as the KVM returns a valid count in the
>>>> reasonable count range, we can think KVM works correctly. We don't need
>>>> to entangle on how KVM really uses the HW, it could be impossible and
>>>> unnecessary.
>>> Now, I see how the Pentium FDIV bug escaped notice. Hey, the numbers
>>> are in a reasonable range. What's everyone upset about?
>>>
>>>> Yeah, currently the predefined valid count range may be some kind of
>>>> loose since I want to cover as much as hardwares and avoid to cause
>>>> regression. Especially after introducing the random jump and clflush
>>>> instructions, the cycles and slots become much more hard to predict.
>>>> Maybe we can have a comparable restricted count range in the initial
>>>> change, and we can loosen the restriction then if we encounter a failure
>>>> on some specific hardware. do you think it's better? Thanks.
>>> I think the test is essentially useless, and should probably just be
>>> deleted, so that it doesn't give a false sense of confidence.
>> IMO, I can't say the tests are totally useless. Yes,  passing the tests
>> doesn't mean the KVM vPMU must work correctly, but we can say there is
>> something probably wrong if it fails to pass these tests. Considering
>> the hardware differences, it's impossible to set an exact value for
>> these events in advance and it seems there is no better method to verify
>> the PMC count as well. I still prefer to keep these tests until we have
>> a better method to verify the accuracy of the PMC count.
> If it's impossible to set an exact value for these events in advance,
> how does Intel validate the hardware PMU?


I have no much idea how HW team validates the PMU functionality. But per 
my gotten information, they could have some very tiny benchmarks with a 
fixed pattern and run them on a certain scenario, so they can expect an 
very accurate count value. But this is different with our case, a real 
program is executed on a real system (probably shared with other 
programs), the events count is impacted by too much hardware/software 
factors, such as cache contention, it's hard to predict a single 
accurate count in advance.

Anyway, it's only my guess about the ways of hardware validation, still 
add Kan to get more information.

Hi Kan,

Do you have more information about how HW team to validate the PMC count 
accuracy? Thanks.
Liang, Kan Nov. 1, 2023, 1:51 p.m. UTC | #7
On 2023-10-31 11:57 p.m., Mi, Dapeng wrote:
> 
> On 11/1/2023 11:24 AM, Jim Mattson wrote:
>> On Tue, Oct 31, 2023 at 8:16 PM Mi, Dapeng
>> <dapeng1.mi@linux.intel.com> wrote:
>>>
>>> On 11/1/2023 10:47 AM, Jim Mattson wrote:
>>>> On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng
>>>> <dapeng1.mi@linux.intel.com> wrote:
>>>>> On 11/1/2023 2:47 AM, Jim Mattson wrote:
>>>>>> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi
>>>>>> <dapeng1.mi@linux.intel.com> wrote:
>>>>>>> Intel CPUs, like Sapphire Rapids, introduces a new fixed counter
>>>>>>> (fixed counter 3) to counter/sample topdown.slots event, but current
>>>>>>> code still doesn't cover this new fixed counter.
>>>>>>>
>>>>>>> So this patch adds code to validate this new fixed counter can count
>>>>>>> slots event correctly.
>>>>>> I'm not convinced that this actually validates anything.
>>>>>>
>>>>>> Suppose, for example, that KVM used fixed counter 1 when the guest
>>>>>> asked for fixed counter 3. Wouldn't this test still pass?
>>>>> Per my understanding, as long as the KVM returns a valid count in the
>>>>> reasonable count range, we can think KVM works correctly. We don't
>>>>> need
>>>>> to entangle on how KVM really uses the HW, it could be impossible and
>>>>> unnecessary.
>>>> Now, I see how the Pentium FDIV bug escaped notice. Hey, the numbers
>>>> are in a reasonable range. What's everyone upset about?
>>>>
>>>>> Yeah, currently the predefined valid count range may be some kind of
>>>>> loose since I want to cover as much as hardwares and avoid to cause
>>>>> regression. Especially after introducing the random jump and clflush
>>>>> instructions, the cycles and slots become much more hard to predict.
>>>>> Maybe we can have a comparable restricted count range in the initial
>>>>> change, and we can loosen the restriction then if we encounter a
>>>>> failure
>>>>> on some specific hardware. do you think it's better? Thanks.
>>>> I think the test is essentially useless, and should probably just be
>>>> deleted, so that it doesn't give a false sense of confidence.
>>> IMO, I can't say the tests are totally useless. Yes,  passing the tests
>>> doesn't mean the KVM vPMU must work correctly, but we can say there is
>>> something probably wrong if it fails to pass these tests. Considering
>>> the hardware differences, it's impossible to set an exact value for
>>> these events in advance and it seems there is no better method to verify
>>> the PMC count as well. I still prefer to keep these tests until we have
>>> a better method to verify the accuracy of the PMC count.
>> If it's impossible to set an exact value for these events in advance,
>> how does Intel validate the hardware PMU?
> 
> 
> I have no much idea how HW team validates the PMU functionality. But per
> my gotten information, they could have some very tiny benchmarks with a
> fixed pattern and run them on a certain scenario, so they can expect an
> very accurate count value. But this is different with our case, a real
> program is executed on a real system (probably shared with other
> programs), the events count is impacted by too much hardware/software
> factors, such as cache contention, it's hard to predict a single
> accurate count in advance.
>

Yes, there are many factors could impact the value of the
microbenchmarks. I don't think there is a universal benchmark for all
generations and all configurations.

Thanks,
Kan

> Anyway, it's only my guess about the ways of hardware validation, still
> add Kan to get more information.
> 
> Hi Kan,
> 
> Do you have more information about how HW team to validate the PMC count
> accuracy? Thanks.
> 
>
diff mbox series

Patch

diff --git a/x86/pmu.c b/x86/pmu.c
index 6bd8f6d53f55..404dc7b62ac2 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -47,6 +47,7 @@  struct pmu_event {
 	{"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
 	{"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 500*N},
 	{"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 300*N},
+	{"fixed 4", MSR_CORE_PERF_FIXED_CTR0 + 3, 1*N, 5000*N},
 };
 
 char *buf;