Message ID | 159897580089.30750.12581669374705391794.stgit@naples-babu.amd.com (mailing list archive)
---|---
Series | Remove EPYC mode apicid decode and use generic decode
On Tue, Sep 01, 2020 at 10:57:20AM -0500, Babu Moger wrote:
> To support some of the more complex topologies, we introduced the EPYC mode
> apicid decode. But the EPYC mode decode is running into problems, and it
> could become quite a maintenance burden in the future. So it was decided to
> remove that code and use the generic decode, which works for the majority of
> topologies. Most SPECed configurations will work just fine; with some
> non-SPECed user inputs, it will create a sub-optimal configuration.
>
> Here is the discussion thread:
> https://lore.kernel.org/qemu-devel/c0bcc1a6-1d84-a6e7-e468-d5b437c1b254@amd.com/
> https://lore.kernel.org/qemu-devel/20200826143849.59f6970b@redhat.com/
>
> This series removes all the EPYC mode specific apicid changes and uses the
> generic apicid decode.

PC, x86 changes:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

Feel free to merge.

> ---
> v7:
>   Eduardo has already queued patches 1-8 from v6; sending the rest of the
>   patches. Fixed CPUID 8000_001D based on Igor's comment, plus a few text
>   changes.
>
> v6: https://lore.kernel.org/qemu-devel/159889924378.21294.16494070903874534542.stgit@naples-babu.amd.com/
>   Found out that a numa configuration is not mandatory for all the EPYC
>   model topologies; we can use the generic decode, which works pretty well.
>   Also noticed that the cpuid does not change when the number of numa nodes
>   changes (NPS - Nodes Per Socket). Took care of a couple of comments from
>   Igor and Eduardo. Thank you Igor, Daniel, David, Eduardo for your feedback.
>
> v5: https://lore.kernel.org/qemu-devel/159804762216.39954.15502128500494116468.stgit@naples-babu.amd.com/
>   Revert the EPYC specific decode.
>   Simplify CPUID_8000_001E.
>
> v4: https://lore.kernel.org/qemu-devel/159744083536.39197.13827776633866601278.stgit@naples-babu.amd.com/
>   Not much of a change, just a few text changes.
>   Error out of the configuration instead of warning if dies are not
>   configured in EPYC. A few other text changes to clarify the removal of
>   node_id, nr_nodes and nodes_per_pkg.
>
> v3: https://lore.kernel.org/qemu-devel/159681772267.9679.1334429994189974662.stgit@naples-babu.amd.com/#r
>   Added a new check to pass the dies for the EPYC numa configuration.
>   Added the "Simplify CPUID_8000_001E" patch with some changes suggested by
>   Igor. Dropped the patch to build the topology from CpuInstanceProperties.
>   TODO: Not sure if we still need the Autonuma changes Igor mentioned;
>   needs more clarity on that.
>
> v2: https://lore.kernel.org/qemu-devel/159362436285.36204.986406297373871949.stgit@naples-babu.amd.com/
>   Used the numa information from CpuInstanceProperties for building the
>   apic_id, as suggested by Igor. Also did some minor code rearrangement to
>   take care of the changes. Dropped the patch "Simplify CPUID_8000_001E"
>   from v1; will send it later.
>
> v1: https://lore.kernel.org/qemu-devel/159164739269.20543.3074052993891532749.stgit@naples-babu.amd.com
>
> Babu Moger (2):
>   i386: Simplify CPUID_8000_001d for AMD
>   i386: Simplify CPUID_8000_001E for AMD
>
>  target/i386/cpu.c | 226 +++++++++++++-------------------------------
>  1 file changed, 61 insertions(+), 165 deletions(-)
>
> --
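For reference, the difference between the removed and the kept scheme comes
down to the APIC ID bit layout. The following is only a rough, illustrative
comparison (the helper names approximate QEMU's include/hw/i386/topology.h and
the reverted EPYC variants from this series; it is not the exact source):

    /*
     * Generic decode (kept): the APIC ID is a plain concatenation of
     * topology IDs, with field widths derived from -smp.
     *
     *   apicid = (pkg_id  << pkg_offset)  |
     *            (die_id  << die_offset)  |
     *            (core_id << core_offset) |
     *            smt_id;
     *
     * EPYC mode decode (removed): an extra NUMA node field was spliced
     * into the ID, so the same -smp/-numa configuration produced
     * different APIC IDs when an EPYC CPU model was selected.
     *
     *   apicid = (pkg_id  << pkg_offset_epyc)  |
     *            (node_id << node_offset_epyc) |
     *            (die_id  << die_offset)       |
     *            (core_id << core_offset)      |
     *            smt_id;
     */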
On Tue, Sep 01, 2020 at 10:57:20AM -0500, Babu Moger wrote:
> To support some of the more complex topologies, we introduced the EPYC mode
> apicid decode. [...]
> This series removes all the EPYC mode specific apicid changes and uses the
> generic apicid decode.

Queued, thanks!
Hi all, partially resurrecting an old thread.

I've seen that something special was done for EPYC in the past in terms of
apicid assignment based on topology, which was then apparently reverted, but
I wonder whether something more general would be useful to all.

First of all, the QEMU apicid assignments do not seem to match what happens
on real hardware. Functionally things are OK, but when trying to investigate
issues, specifically in the guest kernel KVM PV code (arch/x86/kernel/kvm.c),
in some cases the actual apicid values in relation to the topology do matter,
and there is currently no way (that I know of) to supply our own apicid
assignment more closely matching what happens on hardware.

This has been an issue when debugging guest images in the cloud, where being
able to reproduce issues locally would be very beneficial, as opposed to
using cloud images as the feedback loop; unfortunately QEMU currently cannot
create the right apicid values to associate with the cpus.

Do I understand the issue correctly? Comments, ideas? How receptive would the
project be to changes aimed at providing a custom assignment of apicids to
cpus, regardless of Intel or AMD?

Thanks,

Claudio

On 9/1/20 17:57, Babu Moger wrote:
> To support some of the more complex topologies, we introduced the EPYC mode
> apicid decode. [...]
On Wed, 5 Jul 2023 10:12:40 +0200
Claudio Fontana <cfontana@suse.de> wrote:

> Hi all, partially resurrecting an old thread.
>
> I've seen that something special was done for EPYC in the past in terms of
> apicid assignment based on topology, which was then apparently reverted,
> but I wonder whether something more general would be useful to all.
>
> First of all, the QEMU apicid assignments do not seem to match what happens
> on real hardware.

QEMU typically does generate valid APIC IDs; however, it doesn't do a good
job when using an odd number of cores and/or in NUMA-enabled cases. (That is
what Babu had attempted to fix, but eventually that was dropped for the
reasons described in the quoted cover letter.)

> Functionally things are OK, but when trying to investigate issues,
> specifically in the guest kernel KVM PV code (arch/x86/kernel/kvm.c), in
> some cases the actual apicid values in relation to the topology do matter,

Care to point out the specific places you are referring to?

KVM is not the only place where it might matter; it affects topo/numa code on
the guest side as well.

> and there is currently no way (that I know of) to supply our own apicid
> assignment more closely matching what happens on hardware.
>
> This has been an issue when debugging guest images in the cloud, where
> being able to reproduce issues locally would be very beneficial;
> unfortunately QEMU currently cannot create the right apicid values to
> associate with the cpus.

Indeed, the EPYC APIC encoding mess increases the support case load
downstream, but as long as one has access to similar host hardware, one
should be able to reproduce the issue locally. However, I would expect such a
support case to end with advice to change the topology or use another CPU
model.

(What we lack is documentation of what works and what doesn't; perhaps
writing guidelines would be sufficient to steer users toward the usable EPYC
configurations.)

> Do I understand the issue correctly? Comments, ideas? How receptive would
> the project be to changes aimed at providing a custom assignment of apicids
> to cpus, regardless of Intel or AMD?

It's not that simple to just set a custom APIC ID in a register and be done
with it; you'll likely break things (from the top of my head: some CPUID
leaves might depend on it, ACPI tables, NUMA mapping, KVM's vcpu_id).

The current topo code aims to work on information based on '-smp'/'-numa',
all throughout the QEMU codebase. If, however, we let the user set an APIC ID
(one which is somehow correct), we would need to take reverse steps to decode
it (in a vendor-specific way) and incorporate the resulting topology into the
other code that uses topology info. That makes it quite messy, not to mention
it's x86 (AMD) specific and doesn't fit well with generalizing topo handling.
So I don't really like this route.

(x86 cpus have an apic_id property, so theoretically you can set it and, with
some minimal hacking, launch a guest, but then expect the guest to be unhappy
when the APIC ID goes out of sync with everything else. I would do that only
for the sake of an experiment and wouldn't try to upstream it.)

What I wouldn't mind is taking a second stab at what Babu had tried to do,
provided it manages to encode the APIC ID for EPYC correctly and doesn't
complicate the code much (still using -smp/-numa as the root source of the
topology configuration).

> Thanks,
>
> Claudio
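To make the "-smp as the root source" point concrete, here is a minimal
standalone sketch of the generic encoding (modeled on
x86_apicid_from_topo_ids() in QEMU's include/hw/i386/topology.h; the helper
names and the demo topology are illustrative, not the exact upstream code).
It shows how an odd core count alone already yields non-contiguous APIC IDs:

    #include <stdint.h>
    #include <stdio.h>

    /* Smallest number of bits needed to hold 'count' distinct values. */
    static unsigned width(unsigned count)
    {
        unsigned w = 0;
        count = count > 1 ? count - 1 : 0;
        while (count) {
            w++;
            count >>= 1;
        }
        return w;
    }

    /* Concatenate topology IDs into an APIC ID; every field width is
     * fully determined by the -smp counts. */
    static uint32_t apicid_from_topo(unsigned nr_dies, unsigned nr_cores,
                                     unsigned nr_threads, unsigned pkg_id,
                                     unsigned die_id, unsigned core_id,
                                     unsigned smt_id)
    {
        unsigned core_offset = width(nr_threads);
        unsigned die_offset  = core_offset + width(nr_cores);
        unsigned pkg_offset  = die_offset + width(nr_dies);

        return (pkg_id << pkg_offset) | (die_id << die_offset) |
               (core_id << core_offset) | smt_id;
    }

    int main(void)
    {
        /* Like -smp 12,sockets=2,cores=3,threads=2: three cores need a
         * 2-bit core field, so package 1 starts at APIC ID 8, not 6. */
        for (unsigned pkg = 0; pkg < 2; pkg++)
            for (unsigned core = 0; core < 3; core++)
                for (unsigned smt = 0; smt < 2; smt++)
                    printf("pkg %u core %u thread %u -> APIC ID %u\n",
                           pkg, core, smt,
                           apicid_from_topo(1, 3, 2, pkg, 0, core, smt));
        return 0;
    }

This prints IDs 0-5 for the first package and 8-13 for the second; guests
that assume densely packed APIC IDs can already be surprised by that gap.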
Hello Igor,

thanks for getting back to me on this.

On 7/14/23 11:51, Igor Mammedov wrote:
> QEMU typically does generate valid APIC IDs; however, it doesn't do a good
> job when using an odd number of cores and/or in NUMA-enabled cases.

Right, this is what I meant: the QEMU assignment is generally a valid choice,
it just seems to differ from what (some) hardware/firmware does.

> Care to point out the specific places you are referring to?

What we wanted to do was to reproduce an issue that only happened when
booting our distro in the cloud, but did not appear when booting locally
(neither on bare metal nor under QEMU/KVM).

In the end, after a lot of slow-turnaround research, the issue we encountered
was the already-fixed:

commit c15e0ae42c8e5a61e9aca8aac920517cf7b3e94e
Author: Li RongQing <lirongqing@baidu.com>
Date:   Wed Mar 9 16:35:44 2022 +0800

    KVM: x86: fix sending PV IPI

    If apic_id is less than min, and (max - apic_id) is greater than
    KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
    the new apic_id does not fit the bitmask. In this case __send_ipi_mask
    should send the IPI.

    This is mostly theoretical, but it can happen if the apic_ids on three
    iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.

    Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest")
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

But this took a very long time to investigate, because the KVM PV code only
misbehaves with the old, unpatched algorithm during boot if it encounters a
specific sequence of APIC IDs; contrary to the "mostly theoretical" comment
in the commit, the issue can become very practical depending on the APIC ID
assignments as seen by the guest KVM PV code.

> KVM is not the only place where it might matter; it affects topo/numa code
> on the guest side as well.
>
> Indeed, the EPYC APIC encoding mess increases the support case load
> downstream, but as long as one has access to similar host hardware, one
> should be able to reproduce the issue locally.

Unfortunately this does not seem to always be the case, the case in point
being the kvm pv code, and I suspect other buggy guest code whose behaviour
depends on APIC IDs and APIC ID sequences must exist in more areas of the
kernel.

In order to properly reproduce these kinds of issues locally, being able to
assign the desired APIC IDs to cpus in VMs would come in very handy.

> (What we lack is documentation of what works and what doesn't; perhaps
> writing guidelines would be sufficient to steer users toward the usable
> EPYC configurations.)

In our case we encountered issues on Intel too.

> It's not that simple to just set a custom APIC ID in a register and be done
> with it,

Right, I am under no illusion that this is going to be easy.

> Current topo code aims to work on information based on '-smp'/'-numa',
> all throughout the QEMU codebase.

Just a thought: that "-smp, -numa" information could be optionally enriched
with additional info on the apicid assignment.

> If, however, we let the user set an APIC ID (one which is somehow correct),
> we would need to take reverse steps to decode it (in a vendor-specific way)
> and incorporate the resulting topology into the other code that uses
> topology info. That makes it quite messy, not to mention it's x86 (AMD)
> specific and doesn't fit well with generalizing topo handling. So I don't
> really like this route.

I don't think I am suggesting something like what is described in the
preceding paragraph; instead, I would think that with the user providing the
full apicid assignment map (in addition to / as part of the -smp, -numa
options), all the other pieces would be derived from that: I suppose the ACPI
tables, the cpuid leaves, everything the guest can see, plus the internal
QEMU conversion functions between apicids and cpu index in topology.h.

> (x86 cpus have an apic_id property, so theoretically you can set it and,
> with some minimal hacking, launch a guest, but then expect the guest to be
> unhappy when the APIC ID goes out of sync with everything else. I would do
> that only for the sake of an experiment and wouldn't try to upstream it.)

Right, it would all need to be consistent.

> What I wouldn't mind is taking a second stab at what Babu had tried to do,
> provided it manages to encode the APIC ID for EPYC correctly and doesn't
> complicate the code much (still using -smp/-numa as the root source of the
> topology configuration).

For the specific use case I am thinking of (debugging with a guest-visible
topology that resembles a cloud one), I don't think the EPYC-specific work
would be sufficient; it would need to be complemented with Intel work in any
case. But I suppose a more general solution where the user provides all the
mappings would be the best and easiest one for this debugging scenario.

Thanks for your thoughts,

Claudio
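The pre-fix logic discussed above is easy to reproduce in a standalone sketch
(heavily simplified from __send_ipi_mask() in arch/x86/kernel/kvm.c: the real
code builds a 128-bit cluster bitmap and issues KVM_HC_SEND_IPI hypercalls,
which are reduced here to min/max bookkeeping and printfs):

    #include <stdio.h>
    #include <stdbool.h>

    #define KVM_IPI_CLUSTER_SIZE 128   /* 2 * BITS_PER_LONG on x86_64 */

    /* Walk a sequence of APIC IDs with the cluster-tracking logic of
     * __send_ipi_mask(); 'fixed' selects the post-commit condition. */
    static void send_ipi_seq(const int *apic_ids, int n, bool fixed)
    {
        int min = 0, max = 0;
        bool started = false;

        for (int i = 0; i < n; i++) {
            int apic_id = apic_ids[i];

            if (!started) {
                min = max = apic_id;
                started = true;
            } else if (apic_id < min &&
                       max - apic_id < KVM_IPI_CLUSTER_SIZE) {
                min = apic_id;          /* the bitmap would shift up */
            } else if (fixed ? (apic_id > min &&
                                apic_id < min + KVM_IPI_CLUSTER_SIZE)
                             : (apic_id < min + KVM_IPI_CLUSTER_SIZE)) {
                max = apic_id > max ? apic_id : max;
            } else {
                printf("  flush cluster [%d, %d], start a new one\n",
                       min, max);
                min = max = apic_id;
            }
            /* The real code does __set_bit(apic_id - min, bitmap) here. */
            if (apic_id - min < 0 || apic_id - min >= KVM_IPI_CLUSTER_SIZE)
                printf("  BUG: bit %d is outside the cluster bitmap\n",
                       apic_id - min);
        }
        printf("  final cluster [%d, %d]\n", min, max);
    }

    int main(void)
    {
        /* The sequence from the commit message. */
        int seq[] = { 1, KVM_IPI_CLUSTER_SIZE, 0 };

        printf("pre-fix condition:\n");
        send_ipi_seq(seq, 3, false);
        printf("post-fix condition:\n");
        send_ipi_seq(seq, 3, true);
        return 0;
    }

With the pre-fix condition, the sequence 1, 128, 0 lands on a bit position of
-1 instead of flushing the cluster first, which is exactly the situation the
commit message calls "mostly theoretical" and which a suitable APIC ID
assignment makes very practical.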
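As for the apicid-map idea above, purely as a hypothetical illustration: QEMU
has no such option, and the -apicid-map flag and its values below are
invented here only to make the proposal concrete, while the surrounding
-smp/-numa syntax is real:

    # Hypothetical syntax, for illustration only -- no such option exists.
    # The user would pin each cpu index to an APIC ID; ACPI tables, CPUID
    # leaves and the topology.h conversions would all derive from it.
    qemu-system-x86_64 \
        -smp 8,sockets=2,cores=2,threads=2 \
        -numa node,nodeid=0 -numa node,nodeid=1 \
        -numa cpu,node-id=0,socket-id=0 \
        -numa cpu,node-id=1,socket-id=1 \
        -apicid-map 0=0,1=1,2=2,3=3,4=16,5=17,6=18,7=19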
On 7/17/23 10:32, Claudio Fontana wrote:
> [...]

As a PS: I had a lot of typos where I wrote ACPI ID instead of APIC ID; I
hope it does not cause too much confusion.

Ciao,

C
On Mon, 17 Jul 2023 10:32:33 +0200
Claudio Fontana <cfontana@suse.de> wrote:
> [...]
> In our case we encountered issues on Intel too.

On Intel I've seen issues only in cases where the specified topology wasn't
really supported by the given CPU model (Intel CPUs are far from perfect, and
the kernel has a few quirks when it comes to handling topology, the same as
it has quirks for AMD cpus). At the QEMU level we haven't bothered
implementing topology quirks (modulo the EPYC attempt, which we later decided
wasn't worth the complexity of supporting).

> I don't think I am suggesting something like what is described in the
> preceding paragraph; instead, I would think that with the user providing
> the full apicid assignment map (in addition to / as part of the -smp, -numa
> options), all the other pieces would be derived from that.

As I read it, you described the same approach as me, just in different words.

The APIC ID is a function of -smp/-numa (and sometimes -cpu, which we ignore
at the moment). Pushing the APIC ID up to a user-visible interface duplicates
that information; that would be my main objection to the approach. Also, my
educated guess tells me that it would further complicate the already
not-easy-to-deal-with topo code.
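The duplication argument can be illustrated with the reverse mapping (a rough
sketch modeled loosely on x86_topo_ids_from_apicid() in QEMU's
include/hw/i386/topology.h, with simplified names; width() is as in the
encode sketch earlier): every field is recovered using widths that -smp
already fixes, so a user-supplied APIC ID adds no independent information; it
can only agree with -smp/-numa or contradict it.

    #include <stdint.h>

    struct topo_ids { unsigned pkg_id, die_id, core_id, smt_id; };

    /* Smallest number of bits needed to hold 'count' distinct values. */
    static unsigned width(unsigned count)
    {
        unsigned w = 0;
        count = count > 1 ? count - 1 : 0;
        while (count) {
            w++;
            count >>= 1;
        }
        return w;
    }

    /* Decode an APIC ID back into topology IDs; note that every shift
     * and mask below is derived from the -smp counts alone. */
    static struct topo_ids topo_from_apicid(uint32_t apicid,
                                            unsigned nr_dies,
                                            unsigned nr_cores,
                                            unsigned nr_threads)
    {
        unsigned smt_w  = width(nr_threads);
        unsigned core_w = width(nr_cores);
        unsigned die_w  = width(nr_dies);

        struct topo_ids ids = {
            .smt_id  = apicid & ((1u << smt_w) - 1),
            .core_id = (apicid >> smt_w) & ((1u << core_w) - 1),
            .die_id  = (apicid >> (smt_w + core_w)) & ((1u << die_w) - 1),
            .pkg_id  = apicid >> (smt_w + core_w + die_w),
        };
        return ids;
    }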
If you really wish QEMU to emulate hypervisor FOO, create a dedicated board
for it that behaves as FOO and that could generate the same topology based on
-smp/-numa. (That approach might be acceptable, assuming it is self-contained
and doesn't complicate common code a lot.)

> For the specific use case I am thinking of (debugging with a guest-visible
> topology that resembles a cloud one), I don't think the EPYC-specific work
> would be sufficient; it would need to be complemented with Intel work in
> any case. But I suppose a more general solution where the user provides all
> the mappings would be the best and easiest one for this debugging scenario.

All of that for the sake of debugging some third-party hypervisor issues,
which in my opinion doesn't benefit QEMU* at all. So do we really need this
in the QEMU code base?

I think of this use case as a one-off thing, where I can quickly hack QEMU by
hard-coding the desired specifics to get where I want, but not something that
I would try to upstream, as it's not useful for the project in general.

* Under QEMU here I mean all the folks who have to read/touch/maintain this
code (not to mention the poor users who decided to use this feature).
On 7/17/23 12:37, Igor Mammedov wrote:
> [...]
> The APIC ID is a function of -smp/-numa (and sometimes -cpu, which we
> ignore at the moment). Pushing the APIC ID up to a user-visible interface
> duplicates that information; that would be my main objection to the
> approach.

The user would have to provide at least that specific "function", i.e. how
those -smp and -numa settings map to apicids.
> If you really wish QEMU to emulate hypervisor FOO, create a dedicated board
> for it that behaves as FOO and that could generate the same topology based
> on -smp/-numa. (That approach might be acceptable, assuming it is
> self-contained and doesn't complicate common code a lot.)

Hmm, interesting.

> I think of this use case as a one-off thing, where I can quickly hack QEMU
> by hard-coding the desired specifics to get where I want, but not something
> that I would try to upstream, as it's not useful for the project in
> general.

I don't think debugging cloud issues will be a one-off; I expect this problem
to appear more and more.

In any case, thanks,

Claudio