diff mbox series

EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM is absent

Message ID 20230516033133.340936-1-kai.heng.feng@canonical.com (mailing list archive)
State New, archived
Headers show
Series EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM is absent | expand

Commit Message

Kai-Heng Feng May 16, 2023, 3:31 a.m. UTC
The following splat can be found on many systems equipped with EDAC:
[   13.875276] UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
[   13.875279] shift exponent -66 is negative
[   13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+ #1
[   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, BIOS U61 Ver. 01.01.15 04/19/2023
[   13.875283] Call Trace:
[   13.875285]  <TASK>
[   13.875287]  dump_stack_lvl+0x48/0x70
[   13.875295]  dump_stack+0x10/0x20
[   13.875297]  __ubsan_handle_shift_out_of_bounds+0x156/0x310
[   13.875302]  ? __kmem_cache_alloc_node+0x196/0x300
[   13.875307]  skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac]
[   13.875312]  i10nm_get_dimm_config+0x240/0x360 [i10nm_edac]
[   13.875316]  ? kasprintf+0x4e/0x80
[   13.875321]  skx_register_mci+0x12b/0x1d0 [i10nm_edac]
[   13.875324]  ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
[   13.875329]  i10nm_init+0x89f/0x1d10 [i10nm_edac]
[   13.875333]  ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
[   13.875337]  do_one_initcall+0x46/0x240
[   13.875342]  ? kmalloc_trace+0x2a/0xb0
[   13.875346]  do_init_module+0x6a/0x280
[   13.875350]  load_module+0x2419/0x2500
[   13.875353]  ? security_kernel_post_read_file+0x5c/0x80
[   13.875358]  __do_sys_finit_module+0xcc/0x150
[   13.875360]  ? __do_sys_finit_module+0xcc/0x150
[   13.875363]  __x64_sys_finit_module+0x18/0x30
[   13.875365]  do_syscall_64+0x59/0x90
[   13.875368]  ? syscall_exit_to_user_mode+0x2a/0x50
[   13.875371]  ? do_syscall_64+0x69/0x90
[   13.875372]  ? do_syscall_64+0x69/0x90
[   13.875373]  ? do_syscall_64+0x69/0x90
[   13.875374]  ? do_syscall_64+0x69/0x90
[   13.875375]  ? syscall_exit_to_user_mode+0x2a/0x50
[   13.875376]  ? do_syscall_64+0x69/0x90
[   13.875377]  ? do_syscall_64+0x69/0x90
[   13.875378]  ? do_syscall_64+0x69/0x90
[   13.875379]  ? sysvec_call_function+0x4e/0xb0
[   13.875381]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

When a DIMM slot is empty, the read value of mtr can be 0xffffffff,
therefore the wrong "ranks" value creates shift-out-of-bounds error. The
same issue can be found on NVDIMM too.

So only consider DIMM/NVDIMM is present when the value of mtr/mcddrtcfg
is not ~0.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/edac/sb_edac.c    | 2 +-
 drivers/edac/skx_common.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Zhuo, Qiuxu May 16, 2023, 12:53 p.m. UTC | #1
> From: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ...
> Subject: [PATCH] EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM
> is absent
> 
> The following splat can be found on many systems equipped with EDAC:
> [   13.875276] UBSAN: shift-out-of-bounds in
> drivers/edac/skx_common.c:369:16
> [   13.875279] shift exponent -66 is negative
> [   13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+
> #1
> [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> BIOS U61 Ver. 01.01.15 04/19/2023
> [   13.875283] Call Trace:
> [   13.875285]  <TASK>
> [   13.875287]  dump_stack_lvl+0x48/0x70
> [   13.875295]  dump_stack+0x10/0x20
> [   13.875297]  __ubsan_handle_shift_out_of_bounds+0x156/0x310
> [   13.875302]  ? __kmem_cache_alloc_node+0x196/0x300
> [   13.875307]  skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac]
> [   13.875312]  i10nm_get_dimm_config+0x240/0x360 [i10nm_edac]
> [   13.875316]  ? kasprintf+0x4e/0x80
> [   13.875321]  skx_register_mci+0x12b/0x1d0 [i10nm_edac]
> [   13.875324]  ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
> [   13.875329]  i10nm_init+0x89f/0x1d10 [i10nm_edac]
> [   13.875333]  ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
> [   13.875337]  do_one_initcall+0x46/0x240
> [   13.875342]  ? kmalloc_trace+0x2a/0xb0
> [   13.875346]  do_init_module+0x6a/0x280
> [   13.875350]  load_module+0x2419/0x2500
> [   13.875353]  ? security_kernel_post_read_file+0x5c/0x80
> [   13.875358]  __do_sys_finit_module+0xcc/0x150
> [   13.875360]  ? __do_sys_finit_module+0xcc/0x150
> [   13.875363]  __x64_sys_finit_module+0x18/0x30
> [   13.875365]  do_syscall_64+0x59/0x90
> [   13.875368]  ? syscall_exit_to_user_mode+0x2a/0x50
> [   13.875371]  ? do_syscall_64+0x69/0x90
> [   13.875372]  ? do_syscall_64+0x69/0x90
> [   13.875373]  ? do_syscall_64+0x69/0x90
> [   13.875374]  ? do_syscall_64+0x69/0x90
> [   13.875375]  ? syscall_exit_to_user_mode+0x2a/0x50
> [   13.875376]  ? do_syscall_64+0x69/0x90
> [   13.875377]  ? do_syscall_64+0x69/0x90
> [   13.875378]  ? do_syscall_64+0x69/0x90
> [   13.875379]  ? sysvec_call_function+0x4e/0xb0
> [   13.875381]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore

Looked like a buggy BIOS/hw that didn't set the mtr register.

1. Did you print the mtr register whose value was 0xffffffff?
2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
3. What was the CPU? Please take the output of "lscpu". 
4. Did you verify your patch that the issue was fixed on your systems?

Thanks!
-Qiuxu

> the wrong "ranks" value creates shift-out-of-bounds error. The same issue
> can be found on NVDIMM too.
> 
> So only consider DIMM/NVDIMM is present when the value of
> mtr/mcddrtcfg is not ~0.
> ...
Luck, Tony May 16, 2023, 5:12 p.m. UTC | #2
>> [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> > BIOS U61 Ver. 01.01.15 04/19/2023


>> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore

> Looked like a buggy BIOS/hw that didn't set the mtr register.
>
> 1. Did you print the mtr register whose value was 0xffffffff?
> 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
> 3. What was the CPU? Please take the output of "lscpu". 
> 4. Did you verify your patch that the issue was fixed on your systems?

I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is
the standard PCI response for reading a non-existent register. But that doesn't
quite make sense with having a "dimm present" bit in the MTR register. If
the register only exists if the DIMM is present, then there is no need for
a "dimm present" bit.

Some "lspci" output may also be useful.

-Tony
Kai-Heng Feng May 17, 2023, 7:47 a.m. UTC | #3
On Tue, May 16, 2023 at 8:53 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> > From: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ...
> > Subject: [PATCH] EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM
> > is absent
> >
> > The following splat can be found on many systems equipped with EDAC:
> > [   13.875276] UBSAN: shift-out-of-bounds in
> > drivers/edac/skx_common.c:369:16
> > [   13.875279] shift exponent -66 is negative
> > [   13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+
> > #1
> > [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> > BIOS U61 Ver. 01.01.15 04/19/2023
> > [   13.875283] Call Trace:
> > [   13.875285]  <TASK>
> > [   13.875287]  dump_stack_lvl+0x48/0x70
> > [   13.875295]  dump_stack+0x10/0x20
> > [   13.875297]  __ubsan_handle_shift_out_of_bounds+0x156/0x310
> > [   13.875302]  ? __kmem_cache_alloc_node+0x196/0x300
> > [   13.875307]  skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac]
> > [   13.875312]  i10nm_get_dimm_config+0x240/0x360 [i10nm_edac]
> > [   13.875316]  ? kasprintf+0x4e/0x80
> > [   13.875321]  skx_register_mci+0x12b/0x1d0 [i10nm_edac]
> > [   13.875324]  ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
> > [   13.875329]  i10nm_init+0x89f/0x1d10 [i10nm_edac]
> > [   13.875333]  ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
> > [   13.875337]  do_one_initcall+0x46/0x240
> > [   13.875342]  ? kmalloc_trace+0x2a/0xb0
> > [   13.875346]  do_init_module+0x6a/0x280
> > [   13.875350]  load_module+0x2419/0x2500
> > [   13.875353]  ? security_kernel_post_read_file+0x5c/0x80
> > [   13.875358]  __do_sys_finit_module+0xcc/0x150
> > [   13.875360]  ? __do_sys_finit_module+0xcc/0x150
> > [   13.875363]  __x64_sys_finit_module+0x18/0x30
> > [   13.875365]  do_syscall_64+0x59/0x90
> > [   13.875368]  ? syscall_exit_to_user_mode+0x2a/0x50
> > [   13.875371]  ? do_syscall_64+0x69/0x90
> > [   13.875372]  ? do_syscall_64+0x69/0x90
> > [   13.875373]  ? do_syscall_64+0x69/0x90
> > [   13.875374]  ? do_syscall_64+0x69/0x90
> > [   13.875375]  ? syscall_exit_to_user_mode+0x2a/0x50
> > [   13.875376]  ? do_syscall_64+0x69/0x90
> > [   13.875377]  ? do_syscall_64+0x69/0x90
> > [   13.875378]  ? do_syscall_64+0x69/0x90
> > [   13.875379]  ? sysvec_call_function+0x4e/0xb0
> > [   13.875381]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >
> > When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore
>
> Looked like a buggy BIOS/hw that didn't set the mtr register.

If that's the case, I suspect the bug comes from Intel BIOS RC,
because the issue happens on different vendors' hardware.

>
> 1. Did you print the mtr register whose value was 0xffffffff?

Yes, 0xffffffff is the value. mcddrtcfg is also 0xffffffff.

> 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
> 3. What was the CPU? Please take the output of "lscpu".

Both attached in Bugzlla [1].

> 4. Did you verify your patch that the issue was fixed on your systems?

I did, that's why I sent the patch to mailing list.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217453

Kai-Heng

>
> Thanks!
> -Qiuxu
>
> > the wrong "ranks" value creates shift-out-of-bounds error. The same issue
> > can be found on NVDIMM too.
> >
> > So only consider DIMM/NVDIMM is present when the value of
> > mtr/mcddrtcfg is not ~0.
> > ...
Kai-Heng Feng May 17, 2023, 7:49 a.m. UTC | #4
On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> >> [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> > > BIOS U61 Ver. 01.01.15 04/19/2023
>
>
> >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore
>
> > Looked like a buggy BIOS/hw that didn't set the mtr register.
> >
> > 1. Did you print the mtr register whose value was 0xffffffff?
> > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
> > 3. What was the CPU? Please take the output of "lscpu".
> > 4. Did you verify your patch that the issue was fixed on your systems?
>
> I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is
> the standard PCI response for reading a non-existent register. But that doesn't
> quite make sense with having a "dimm present" bit in the MTR register. If
> the register only exists if the DIMM is present, then there is no need for
> a "dimm present" bit.

I wonder if the "non-existent register" read is intended?

>
> Some "lspci" output may also be useful.

lspci can be found in [1]:

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217453

Kai-Heng

>
> -Tony
Kai-Heng Feng June 14, 2023, 7:58 a.m. UTC | #5
On Wed, May 17, 2023 at 3:49 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > >> [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> > > > BIOS U61 Ver. 01.01.15 04/19/2023
> >
> >
> > >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore
> >
> > > Looked like a buggy BIOS/hw that didn't set the mtr register.
> > >
> > > 1. Did you print the mtr register whose value was 0xffffffff?
> > > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
> > > 3. What was the CPU? Please take the output of "lscpu".
> > > 4. Did you verify your patch that the issue was fixed on your systems?
> >
> > I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is
> > the standard PCI response for reading a non-existent register. But that doesn't
> > quite make sense with having a "dimm present" bit in the MTR register. If
> > the register only exists if the DIMM is present, then there is no need for
> > a "dimm present" bit.
>
> I wonder if the "non-existent register" read is intended?
>
> >
> > Some "lspci" output may also be useful.
>
> lspci can be found in [1]:
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453

A gentle ping...

>
> Kai-Heng
>
> >
> > -Tony
Kai-Heng Feng July 4, 2023, 4:27 a.m. UTC | #6
On Wed, Jun 14, 2023 at 3:58 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> On Wed, May 17, 2023 at 3:49 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> > On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote:
> > >
> > > >> [   13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962,
> > > > > BIOS U61 Ver. 01.01.15 04/19/2023
> > >
> > >
> > > >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore
> > >
> > > > Looked like a buggy BIOS/hw that didn't set the mtr register.
> > > >
> > > > 1. Did you print the mtr register whose value was 0xffffffff?
> > > > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled?
> > > > 3. What was the CPU? Please take the output of "lscpu".
> > > > 4. Did you verify your patch that the issue was fixed on your systems?
> > >
> > > I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is
> > > the standard PCI response for reading a non-existent register. But that doesn't
> > > quite make sense with having a "dimm present" bit in the MTR register. If
> > > the register only exists if the DIMM is present, then there is no need for
> > > a "dimm present" bit.
> >
> > I wonder if the "non-existent register" read is intended?
> >
> > >
> > > Some "lspci" output may also be useful.
> >
> > lspci can be found in [1]:
> >
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453
>
> A gentle ping...

Another gentle ping...

>
> >
> > Kai-Heng
> >
> > >
> > > -Tony
Zhuo, Qiuxu July 6, 2023, 2:19 p.m. UTC | #7
> From: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ...
> > > > Some "lspci" output may also be useful.
> > >
> > > lspci can be found in [1]:
> > >
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453
> >
> > A gentle ping...
> 
> Another gentle ping...

The same issue was also reported recently by Koba Ko:
    https://lore.kernel.org/linux-edac/CAJB-X+XtfBm0a4btt6NT9rvdrxETNLNMVQ3G=u513Nh8RKwjWw@mail.gmail.com/T/#t

The root cause of this issue was that the absent memory controllers still 
appeared as PCIe devices that fooled the i10nm_edac driver.

Could you please verify the new fix patch below on your machine
and feedback on the testing results? Thanks!
    https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u

- Qiuxu
diff mbox series

Patch

diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 0c779a0326b6..bc5155e84514 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -211,7 +211,7 @@  static const int mtr_regs[] = {
 static const int knl_mtr_reg = 0xb60;
 
 #define RANK_DISABLE(mtr)		GET_BITFIELD(mtr, 16, 19)
-#define IS_DIMM_PRESENT(mtr)		GET_BITFIELD(mtr, 14, 14)
+#define IS_DIMM_PRESENT(mtr)		((mtr != ~0) && GET_BITFIELD(mtr, 14, 14))
 #define RANK_CNT_BITS(mtr)		GET_BITFIELD(mtr, 12, 13)
 #define RANK_WIDTH_BITS(mtr)		GET_BITFIELD(mtr, 2, 4)
 #define COL_WIDTH_BITS(mtr)		GET_BITFIELD(mtr, 0, 1)
diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h
index b6d3607dffe2..2f975ffeaac9 100644
--- a/drivers/edac/skx_common.h
+++ b/drivers/edac/skx_common.h
@@ -50,8 +50,8 @@ 
 #define NUM_CHANNELS	MAX(SKX_NUM_CHANNELS, I10NM_NUM_CHANNELS)
 #define NUM_DIMMS	MAX(SKX_NUM_DIMMS, I10NM_NUM_DIMMS)
 
-#define IS_DIMM_PRESENT(r)		GET_BITFIELD(r, 15, 15)
-#define IS_NVDIMM_PRESENT(r, i)		GET_BITFIELD(r, i, i)
+#define IS_DIMM_PRESENT(r)		((r != ~0) && GET_BITFIELD(r, 15, 15))
+#define IS_NVDIMM_PRESENT(r, i)		((r != ~0) && GET_BITFIELD(r, i, i))
 
 #define MCI_MISC_ECC_MODE(m)	(((m) >> 59) & 15)
 #define MCI_MISC_ECC_DDRT	8	/* read from DDRT */