[v3,0/3] x86/edac/amd64: Add support for noncpu nodes

Message ID 20210823185437.94417-1-nchatrad@amd.com (mailing list archive)

Message

Naveen Krishna Chatradhi Aug. 23, 2021, 6:54 p.m. UTC
From: Muralidhara M K <muralimk@amd.com>

On newer heterogeneous systems the data fabrics of the CPUs and GPUs
are connected directly via custom links.

This patchset does not depend on the series by Yazen Ghannam,
"AMD MCA Address Translation Updates"
[https://patchwork.kernel.org/project/linux-edac/list/?series=505989]

This patchset does the following:
1. Adds support for northbridges on Aldebaran
        * x86/amd_nb: Add support for northbridges on Aldebaran
2. Modifies the amd64_edac module to
   a. Handle the UMCs on the noncpu nodes,
        * EDAC/mce_amd: Extract node id from MCA_IPID
   b. Enumerate PCI IDs and HBM memory
        * EDAC/amd64: Enumerate memory on noncpu nodes

Muralidhara M K (1):
  x86/amd_nb: Add support for northbridges on Aldebaran

Naveen Krishna Chatradhi (2):
  EDAC/mce_amd: Extract node id from MCA_IPID
  EDAC/amd64: Enumerate memory on noncpu nodes

 arch/x86/include/asm/amd_nb.h |  10 ++
 arch/x86/kernel/amd_nb.c      |  63 +++++++++-
 drivers/edac/amd64_edac.c     | 219 ++++++++++++++++++++++++++++++----
 drivers/edac/amd64_edac.h     |  28 +++++
 drivers/edac/mce_amd.c        |  19 ++-
 include/linux/pci_ids.h       |   1 +
 6 files changed, 308 insertions(+), 32 deletions(-)

Comments

Naveen Krishna Chatradhi Oct. 14, 2021, 6:53 p.m. UTC | #1
On newer heterogeneous systems with AMD CPUs the data fabrics of GPUs
can be connected directly via custom links.

This patchset does the following:
1. amd_nb.c:
   a. Add support for northbridges on Aldebaran GPU nodes
   b. Export AMD node map details to be used by the edac and mce modules

2. mce_amd module:
   a. Identify the node ID where the error occurred and map it to the
      Linux-enumerated node ID.

3. amd64_edac module:
   a. Add new family op routines
   b. Enumerate UMCs and HBMs on the GPU nodes

This patchset is rebased on top of
"
commit 07416cadfdfa38283b840e700427ae3782c76f6b
Author: Yazen Ghannam <yazen.ghannam@amd.com>
Date:   Tue Oct 5 15:44:19 2021 +0000

    EDAC/amd64: Handle three rank interleaving mode
"

Muralidhara M K (2):
  x86/amd_nb: Add support for northbridges on Aldebaran
  EDAC/amd64: Extend family ops functions

Naveen Krishna Chatradhi (2):
  EDAC/mce_amd: Extract node id from MCA_IPID
  EDAC/amd64: Enumerate memory on Aldebaran GPU nodes

 arch/x86/include/asm/amd_nb.h |   9 +
 arch/x86/kernel/amd_nb.c      | 131 +++++++--
 drivers/edac/amd64_edac.c     | 517 +++++++++++++++++++++++++---------
 drivers/edac/amd64_edac.h     |  33 +++
 drivers/edac/mce_amd.c        |  24 +-
 include/linux/pci_ids.h       |   1 +
 6 files changed, 564 insertions(+), 151 deletions(-)
Borislav Petkov Oct. 14, 2021, 7:53 p.m. UTC | #2
On Fri, Oct 15, 2021 at 12:23:56AM +0530, Naveen Krishna Chatradhi wrote:
> [...]

So which v4 should I be looking at - this one or

https://lore.kernel.org/r/20211014185058.9587-1-nchatrad@amd.com

?

Btw, you don't have to do --in-reply-to and keep all patchsets in a
single thread - just send the new revision as a separate thread.

Thx.
Naveen Krishna Chatradhi Oct. 15, 2021, 12:18 p.m. UTC | #3
Hi Boris,

On 10/15/2021 1:23 AM, Borislav Petkov wrote:
> On Fri, Oct 15, 2021 at 12:23:56AM +0530, Naveen Krishna Chatradhi wrote:
>> [...]
> So which v4 should I be looking at - this one or
I noticed the v4 tag was missing on the 3rd and 4th patches in the series.
I tried to abort, but git send-email went through.
>
> https://lore.kernel.org/r/20211014185058.9587-1-nchatrad@amd.com
Could you please review the latest one (above link), or should I push
them as v5 to avoid the confusion?
>
> ?
>
> Btw, you don't have to do --in-reply-to and keep all patchsets in a
> single thread - just send the new revision as a separate thread.
Sure, will do that. Thank you.
>
> Thx.
>
> --
> Regards/Gruss,
>      Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Oct. 15, 2021, 7:25 p.m. UTC | #4
On Fri, Oct 15, 2021 at 05:48:32PM +0530, Chatradhi, Naveen Krishna wrote:
> Could you please review the latest one (above link)

Ok.
 
> or should i push them as v5, to avoid the confusion.

Nah, not necessary.

The goal is to always avoid spamming maintainers with patchsets if not
absolutely necessary. :-)

Thx.