Message ID | 20210507190140.18854-1-Yazen.Ghannam@amd.com (mailing list archive) |
---|---|
Headers | show |
Series | AMD MCA Address Translation Updates | expand |
On 5/7/21 12:01 PM, Yazen Ghannam wrote: > From: Yazen Ghannam <yazen.ghannam@amd.com> > ... > > Rome: > No interleaving > Nodes-per-Socket 0 (NPS0) > Nodes-per-Socket 1 (NPS1) > Nodes-per-Socket 2 (NPS2) > Nodes-per-Socket 4 (NPS4) > NPS2 w/o hashing > NPS4 w/o hashing > > Thanks, > Yazen Hi Yazen, It appears that you need to provide a glossary of acronyms, e.g.: DF = (Data Fabric ?) CS = MCE = NPS = Nodes per Socket etc. thanks.
On Fri, May 07, 2021 at 01:32:28PM -0700, Randy Dunlap wrote: > On 5/7/21 12:01 PM, Yazen Ghannam wrote: > > From: Yazen Ghannam <yazen.ghannam@amd.com> > > > ... > > > > Rome: > > No interleaving > > Nodes-per-Socket 0 (NPS0) > > Nodes-per-Socket 1 (NPS1) > > Nodes-per-Socket 2 (NPS2) > > Nodes-per-Socket 4 (NPS4) > > NPS2 w/o hashing > > NPS4 w/o hashing > > > > Thanks, > > Yazen > > Hi Yazen, > > It appears that you need to provide a glossary of acronyms, e.g.: > > DF = (Data Fabric ?) > CS = > MCE = > NPS = Nodes per Socket > etc. > > Hi Randy, That's a good idea. Should it be in the cover letter, patches, or somewhere else? Thanks, Yazen
On 5/11/21 8:42 AM, Yazen Ghannam wrote: > On Fri, May 07, 2021 at 01:32:28PM -0700, Randy Dunlap wrote: >> On 5/7/21 12:01 PM, Yazen Ghannam wrote: >>> From: Yazen Ghannam <yazen.ghannam@amd.com> >>> >> ... >>> >>> Rome: >>> No interleaving >>> Nodes-per-Socket 0 (NPS0) >>> Nodes-per-Socket 1 (NPS1) >>> Nodes-per-Socket 2 (NPS2) >>> Nodes-per-Socket 4 (NPS4) >>> NPS2 w/o hashing >>> NPS4 w/o hashing >>> >>> Thanks, >>> Yazen >> >> Hi Yazen, >> >> It appears that you need to provide a glossary of acronyms, e.g.: >> >> DF = (Data Fabric ?) >> CS = >> MCE = >> NPS = Nodes per Socket >> etc. >> >> > > Hi Randy, > > That's a good idea. Should it be in the cover letter, patches, or > somewhere else? Cover letter seems good to me. Thanks.
On Tue, May 11, 2021 at 09:13:56AM -0700, Randy Dunlap wrote:
> Cover letter seems good to me.
Cover letter gets lost and if we didn't have lore.kernel.org, it
would be a total waste of effort. I think if it is only a small,
acronym-mapping thing, the top of drivers/edac/mce_amd.c should be a
much better place.
Thx.
On Fri, May 07, 2021 at 03:01:15PM -0400, Yazen Ghannam wrote: > Patches 1-24 do the refactor without adding new system support. The goal > is to break down the translation algorithm into smaller chunks. There > are some simple wrapper functions defined. These will be filled in when > supporting newer systems. The intention is that new system support can > be added without any major refactor. I tried to make a patch for each > logical change. There's a bit of churn so as to not break the build with > each change. I think many of these patches can be squashed together, if > desired. The top level function was split first, then the next level of > functions, etc. in a somewhat breadth-first approach. No, that's great what you did and keeping each logical change in a single patch is a lot easier on everybody involved. Now, looking at this - and I know we've talked about this before - but: umc_normaddr_to_sysaddr() is used only in amd64_edac.c. amd_df_indirect_read() is used only by this function, so how about moving both to amd64_edac, where they're needed and then doing the refactoring ontop? You can simply reuse your current patches - just change the file they patch from arch/x86/kernel/cpu/mce/amd.c to drivers/edac/amd64_edac.c I went through te umc_... function and AFAICT, it doesn't need any core MCE facilities so it should be just fine in EDAC land. Or?
On Mon, May 17, 2021 at 02:57:04PM +0200, Borislav Petkov wrote: > On Fri, May 07, 2021 at 03:01:15PM -0400, Yazen Ghannam wrote: > > Patches 1-24 do the refactor without adding new system support. The goal > > is to break down the translation algorithm into smaller chunks. There > > are some simple wrapper functions defined. These will be filled in when > > supporting newer systems. The intention is that new system support can > > be added without any major refactor. I tried to make a patch for each > > logical change. There's a bit of churn so as to not break the build with > > each change. I think many of these patches can be squashed together, if > > desired. The top level function was split first, then the next level of > > functions, etc. in a somewhat breadth-first approach. > > No, that's great what you did and keeping each logical change in a > single patch is a lot easier on everybody involved. > > Now, looking at this - and I know we've talked about this before - but: > > umc_normaddr_to_sysaddr() is used only in amd64_edac.c. > amd_df_indirect_read() is used only by this function, so how about > moving both to amd64_edac, where they're needed and then doing the > refactoring ontop? > > You can simply reuse your current patches - just change the file they > patch from > > arch/x86/kernel/cpu/mce/amd.c > > to > > drivers/edac/amd64_edac.c > > I went through te umc_... function and AFAICT, it doesn't need any core > MCE facilities so it should be just fine in EDAC land. > > Or? > I think this is a good idea. The only hang up is that we should be using the output of this function, i.e. the systeme physical address, when handling memory errors in the MCE notifier blocks. But I have an idea where we can handle this. I can send that as a follow up series, if that's okay. One other issue is what if a user doesn't want to use amd64_edac_mod? This is more of a user preference and/or configuration issue. Maybe the module loads, but an uninterested user can tell EDAC to not log errors, etc.? Or should the translation code live in its own module? So for version 2, I have 1) Add a glossary of terms, and 2) Move everything to EDAC. Any other comments? Thanks, Yazen
On Tue, May 18, 2021 at 11:52:07PM -0400, Yazen Ghannam wrote: > I think this is a good idea. The only hang up is that we should be using > the output of this function, i.e. the systeme physical address, when > handling memory errors in the MCE notifier blocks. But I have an idea > where we can handle this. I can send that as a follow up series, if > that's okay. Yeah, so frankly I'm not happy with all this clumsy plugging we do with notifiers and then amd_register_ecc_decoder() which is called in mce_amd.c to be used in amd64_edac.c which finally logs it. What I'd like to see is mce_amd.c still decoding all kinds of errors and additionally amd64_edac continues processing only the DRAM errors. > One other issue is what if a user doesn't want to use amd64_edac_mod? Then she/he doesn't get DRAM errors mapped to a DIMM - simple. > This is more of a user preference and/or configuration issue. Maybe the > module loads, but an uninterested user can tell EDAC to not log errors, > etc.? Or should the translation code live in its own module? No need. Translation is part of EDAC so if you don't load it, you don't get the functionality. > So for version 2, I have 1) Add a glossary of terms, and 2) Move > everything to EDAC. Any other comments? None at the moment - I'll do a deeper review with v2. Thx.
From: Yazen Ghannam <yazen.ghannam@amd.com> This patchset refactors the AMD MCA Address Translation code and adds support for newer systems. This patchset was written from scratch compared to previous patchsets. The reference code was recently refactored in preparation for updates for future systems. These patches try to follow the reference code as closely as possible. I also tried to address comments from previous patchset reviews. Patches 1-24 do the refactor without adding new system support. The goal is to break down the translation algorithm into smaller chunks. There are some simple wrapper functions defined. These will be filled in when supporting newer systems. The intention is that new system support can be added without any major refactor. I tried to make a patch for each logical change. There's a bit of churn so as to not break the build with each change. I think many of these patches can be squashed together, if desired. The top level function was split first, then the next level of functions, etc. in a somewhat breadth-first approach. Patch 25 adds support for systems with Data Fabric version 3 (Rome and later). Each patch was build tested individually. The entire set was functionally tested with the following modes. Naples: No interleaving Channel interleaving Die interleaving Socket interleaving Rome: No interleaving Nodes-per-Socket 0 (NPS0) Nodes-per-Socket 1 (NPS1) Nodes-per-Socket 2 (NPS2) Nodes-per-Socket 4 (NPS4) NPS2 w/o hashing NPS4 w/o hashing Thanks, Yazen Link: https://lkml.kernel.org/r/20200903200144.310991-1-Yazen.Ghannam@amd.com Yazen Ghannam (25): x86/MCE/AMD: Don't use naked values for DF registers x86/MCE/AMD: Add context struct x86/MCE/AMD: Define functions for DramOffset x86/MCE/AMD: Define function to read DRAM address map registers x86/MCE/AMD: Define function to find interleaving mode x86/MCE/AMD: Define function to denormalize address x86/MCE/AMD: Define function to add DRAM base and hole x86/MCE/AMD: Define function to dehash address x86/MCE/AMD: Define function to check DRAM limit address x86/MCE/AMD: Remove goto statements x86/MCE/AMD: Simplify function parameters x86/MCE/AMD: Define function to get Interleave Address Bit x86/MCE/AMD: Skip denormalization if no interleaving x86/MCE/AMD: Define function to get number of interleaved channels x86/MCE/AMD: Define function to get number of interleaved dies x86/MCE/AMD: Define function to get number of interleaved sockets x86/MCE/AMD: Remove unnecessary assert x86/MCE/AMD: Define function to make space for CS ID x86/MCE/AMD: Define function to calculate CS ID x86/MCE/AMD: Define function to insert CS ID into address x86/MCE/AMD: Define function to get CS Fabric ID x86/MCE/AMD: Define function to find shift and mask values x86/MCE/AMD: Update CS ID calculation to match reference code x86/MCE/AMD: Match hash function to reference code x86/MCE/AMD: Add support for address translation on DF3 systems arch/x86/include/asm/amd_nb.h | 7 +- arch/x86/include/asm/mce.h | 5 +- arch/x86/kernel/amd_nb.c | 6 +- arch/x86/kernel/cpu/mce/amd.c | 681 +++++++++++++++++++++++++++------- drivers/edac/amd64_edac.c | 4 +- 5 files changed, 555 insertions(+), 148 deletions(-)