mbox series

[0/5] Fix DPC hotplug race and enhance error handling

Message ID 20200925023423.42675-1-haifeng.zhao@intel.com (mailing list archive)
Headers show
Series Fix DPC hotplug race and enhance error handling | expand

Message

Zhao, Haifeng Sept. 25, 2020, 2:34 a.m. UTC
This simple patch set fixes some serious issues found while running DPC
error injection and NVMe SSD hotplug brute-force tests: race conditions
between the DPC handler and the pciehp and AER interrupt handlers caused
system hangs, and a system with the DPC feature could not recover to its
normal working state as expected (NVMe instance lost, mount operations
hanging, racing PCIe accesses causing uncorrectable errors to be reported
alternately, etc.).

With this patch set applied, stable 5.9-rc6 passes the PCIe Gen4 NVMe SSD
brute-force hotplug test: tens of iterations, with arbitrary intervals
between the hot-remove and plug-in operations, with no errors and the
system working normally.

With this patch set applied, a system with the DPC feature recovers from
NON-FATAL and FATAL error injection tests and works as expected.

The system works smoothly when errors happen while a hotplug operation is
in progress; no uncorrectable errors were observed.

Brute-force DPC error injection script:

for i in {0..100}
do
        # 64:02.0 is the root port with DPC capability; trigger DPC
        setpci -s 64:02.0 0x196.w=000a
        # 65:00.0 is the NVMe SSD below that port; write its Command register
        setpci -s 65:00.0 0x04.w=0544
        mount /dev/nvme0n1p1 /root/nvme
        sleep 1
done

For other details, see the description of each commit.

This patch set applies cleanly on top of stable 5.9-rc6.

Please help to review and test.

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
    DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid null pointer
    reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h                | 31 +++----------------------------
 drivers/pci/pcie/err.c           | 18 ++++++++++++++++--
 include/linux/pci.h              | 31 +++++++++++++++++++++++++++++++
 4 files changed, 53 insertions(+), 31 deletions(-)

Comments

Andy Shevchenko Sept. 25, 2020, 12:24 p.m. UTC | #1
On Thu, Sep 24, 2020 at 10:34:19PM -0400, Ethan Zhao wrote:
> Once the root port's DPC capability is enabled and DPC is triggered, the
> DPC status bits are set by hardware, which then sends DPC/DLLSC/PDC
> interrupts to the OS DPC and pciehp drivers. It takes the port and the
> software DPC interrupt handler 10 ms to 50 ms (test data on an ICX
> platform & stable 5.9-rc6) to complete the DPC containment procedure,
> until the DPC status is cleared at the end of the DPC interrupt handler.
> 
> This function checks whether the root port is still in DPC handling and
> waits until hardware and software have completed the procedure.

>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/resource_ext.h>

> +#include <linux/delay.h>

Keep it sorted?

>  #include <uapi/linux/pci.h>

...

> +#ifdef CONFIG_PCIE_DPC
> +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
> +{
> +	u16 cap = pdev->dpc_cap, status;
> +	u16 loop = 0;
> +
> +	if (!cap) {

> +		pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");

But why? Is this feature mandatory to have? Then the same question about
ifdeffery, otherwise it's pretty normal to not have a feature, right?

> +		return false;
> +	}
> +	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> +	pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);

> +	while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
> +		msleep(10);
> +		loop++;
> +		pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> +	}

Can we have rather something like readx_poll_timeout() for PCI and use them here?

> +	if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
> +		pci_dbg(pdev, "Out of DPC status %x, time cost %d ms\n", status, loop*10);
> +		return true;
> +	}
> +	pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
> +	return false;
> +}
> +#else
> +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
> +{
> +	return true;
> +}
> +#endif
>  #endif /* LINUX_PCI_H */
> -- 
> 2.18.4
>
Andy Shevchenko Sept. 25, 2020, 12:32 p.m. UTC | #2
On Thu, Sep 24, 2020 at 10:34:20PM -0400, Ethan Zhao wrote:
> When the root port has DPC capability enabled and DPC is triggered by
> errors, DPC, DLLSC and PDC interrupts are sent to the DPC and pciehp
> drivers at the same time. That causes the following results:
> 
> 1. Link and device are recovered by hardware DPC and the software DPC
>    driver; the device isn't removed, but pciehp might treat it as if the
>    device had been hot-removed.
> 
> 2. A race condition occurs between pciehp_unconfigure_device(), called by
>    pciehp_ist() in the pciehp driver, and pcie_do_recovery(), called by
>    dpc_handler() in the DPC driver. Unluckily, there is no lock protecting
>    pci_stop_and_remove_bus_device() against pci_walk_bus(): they hold
>    different locks, pci_stop_and_remove_bus_device() holds
>    pci_rescan_remove_lock while pci_walk_bus() holds pci_bus_sem.
> 
> This race condition is not purely code analysis; it can be triggered by
> the following command sequence:
> 
>   # setpci -s 64:02.0 0x196.w=000a   // 64:02.0 is rootport has DPC capability
>   # setpci -s 65:00.0 0x04.w=0544    // 65:00.0 is NVMe SSD populated in above port
>   # mount /dev/nvme0n1p1 nvme
> 
> One shot causes a system panic with a NULL pointer dereference.
> (tested on stable 5.8 & an ICX platform)

What is ICX (yes, I know, but you are writing this for wider audience)?

>    Buffer I/O error on dev nvme0n1p1, logical block 468843328, async page read
>    BUG: kernel NULL pointer dereference, address: 0000000000000050
>    #PF: supervisor read access in kernel mode
>    #PF: error_code(0x0000) - not-present page

Please, try to remove non-relevant information from the crashes.

>    Oops: 0000 [#1] SMP NOPTI
>    CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
>    Hardware name: Intel Corporation Wilxxxx.200x0x0xxx 0x/0x/20x0
>    RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
>    Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
>    ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
>    41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
>    RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
>    RAX: 0000000000000000 RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
>    RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
>    RBP: ff8e06cf8762fdd0 R08: 00000000000000bf R09: 0000000000000000
>    R10: 000000eb8ebeab53 R11: ffffffff93453258 R12: 0000000000000002
>    R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
>    FS:  0000000000000000(0000) GS:ff4e3eab1fd00000(0000) knlGS:0000000000000000
>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    CR2: 0000000000000050 CR3: 0000000f8f80a004 CR4: 0000000000761ee0
>    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    PKRU: 55555554
>    Call Trace:
>    ? report_normal_detected+0x20/0x20
>    report_frozen_detected+0x16/0x20
>    pci_walk_bus+0x75/0x90
>    ? dpc_irq+0x90/0x90
>    pcie_do_recovery+0x157/0x201
>    ? irq_finalize_oneshot.part.47+0xe0/0xe0
>    dpc_handler+0x29/0x40
>    irq_thread_fn+0x24/0x60
>    irq_thread+0xea/0x170
>    ? irq_forced_thread_fn+0x80/0x80
>    ? irq_thread_check_affinity+0xf0/0xf0
>    kthread+0x124/0x140
>    ? kthread_park+0x90/0x90
>    ret_from_fork+0x1f/0x30
>    Modules linked in: nft_fib_inet.........
>    CR2: 0000000000000050
> 
> With this patch, the handling flow of DPC containment and hotplug is
> partly ordered and serialized: hardware DPC does the controller reset and
> other recovery actions first, then the DPC driver handles the callbacks
> to device drivers and clears the DPC status, and at the end the pciehp
> driver handles the DLLSC and PDC events.

> This patch closes the race condition partly.

Not sure if Bjorn requires getting rid of "this patch" from commit messages...
In any case it's written in the documentation.
Andy Shevchenko Sept. 25, 2020, 12:35 p.m. UTC | #3
On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> During DPC error injection tests we found there is a race condition
> between the pciehp and DPC drivers; a null pointer reference caused a
> panic as follows

null -> NULL

> 
>  # setpci -s 64:02.0 0x196.w=000a
>   // 64:02.0 is rootport has DPC capability
>  # setpci -s 65:00.0 0x04.w=0544
>   // 65:00.0 is NVMe SSD populated in above port
>  # mount /dev/nvme0n1p1 nvme
> 
>  (tested on stable 5.8 & ICX platform)
> 
>  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
>  async page read
>  BUG: kernel NULL pointer dereference, address: 0000000000000050
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page

Same comment about Oops.
Zhao, Haifeng Sept. 27, 2020, 1:50 a.m. UTC | #4
Andy, 

    About the ICX code name, I will align it with public documentation in
the next version. About the 'non-relevant information' in the crash, I'm
not sure what's not relevant.

Thanks,
Ethan

-----Original Message-----
From: Andy Shevchenko <andriy.shevchenko@linux.intel.com> 
Sent: Friday, September 25, 2020 8:33 PM
To: Zhao, Haifeng <haifeng.zhao@intel.com>
Cc: bhelgaas@google.com; oohall@gmail.com; ruscur@russell.cc; lukas@wunner.de; stuart.w.hayes@gmail.com; mr.nuke.me@gmail.com; mika.westerberg@linux.intel.com; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Jia, Pei P <pei.p.jia@intel.com>
Subject: Re: [PATCH 2/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

On Thu, Sep 24, 2020 at 10:34:20PM -0400, Ethan Zhao wrote:

[snip]

What is ICX (yes, I know, but you are writing this for wider audience)?

[snip]

Please, try to remove non-relevant information from the crashes.

[snip]


--
With Best Regards,
Andy Shevchenko
Zhao, Haifeng Sept. 27, 2020, 1:53 a.m. UTC | #5
Andy,
   About the header file: yes, I will keep the includes sorted.
   The function is already defined under #ifdef CONFIG_PCIE_DPC.
   As to readx_poll_timeout(): if there were a generic one for PCI, I would
   like to use it. It seems there isn't one yet?

Thanks,
Ethan


-----Original Message-----
From: Andy Shevchenko <andriy.shevchenko@linux.intel.com> 
Sent: Friday, September 25, 2020 8:25 PM
To: Zhao, Haifeng <haifeng.zhao@intel.com>
Cc: bhelgaas@google.com; oohall@gmail.com; ruscur@russell.cc; lukas@wunner.de; stuart.w.hayes@gmail.com; mr.nuke.me@gmail.com; mika.westerberg@linux.intel.com; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Jia, Pei P <pei.p.jia@intel.com>
Subject: Re: [PATCH 1/5] PCI: define a function to check and wait till port finish DPC handling

On Thu, Sep 24, 2020 at 10:34:19PM -0400, Ethan Zhao wrote:

[snip]

Keep it sorted?

[snip]

But why? Is this feature mandatory to have? Then the same question about
ifdeffery, otherwise it's pretty normal to not have a feature, right?

[snip]

Can we have rather something like readx_poll_timeout() for PCI and use them here?

[snip]
Ethan Zhao Sept. 29, 2020, 2:35 a.m. UTC | #6
Preferred style noted; there will be a clearer comment in v6.

Thanks,
Ethan

On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> > During DPC error injection test we found there is race condition between
> > pciehp and DPC driver, null pointer reference caused panic as following
>
> null -> NULL
>
> >
> > [snip]
>
> Same comment about Oops.
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
Andy Shevchenko Sept. 29, 2020, 8:51 a.m. UTC | #7
On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> Preferred style, there will be cleared comment in v6.

Avoid top posting.

> On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> >
> > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> > > During DPC error injection test we found there is race condition between
> > > pciehp and DPC driver, null pointer reference caused panic as following
> >
> > null -> NULL
> >
> > >
> > > [snip]
> >
> > Same comment about Oops.

In another thread there was good advice to move the full Oops (if you think
it's very useful to have) after the cutter '---' line, so it will be in the
email archives but not in Git history.
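The '---' cutter convention Andy refers to works because `git am` keeps
only the text above the first '---' line as the commit message and ignores
everything between that line and the diff. A sketch of a patch mail laid
out this way (subject and Oops excerpt illustrative, diffstat taken from
the series):

```
PCI/ERR: get device before call device driver to avoid null pointer reference

Trimmed, relevant part of the Oops:

  BUG: kernel NULL pointer dereference, address: 0000000000000050
  RIP: 0010:report_error_detected.cold.4+0x7d/0xe6

Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
---
The full Oops, test instructions and other notes can go here: they stay
in the mail archives but are dropped by 'git am' and never enter Git
history.

 drivers/pci/pcie/err.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)
```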
Ethan Zhao Sept. 29, 2020, 9:38 a.m. UTC | #8
Andy,

On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> > Preferred style, there will be cleared comment in v6.
>
> Avoid top postings.
>
> > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> > <andriy.shevchenko@linux.intel.com> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> > > > During DPC error injection test we found there is race condition between
> > > > pciehp and DPC driver, null pointer reference caused panic as following
> > >
> > > null -> NULL
> > >
> > > >
> > > > [snip]
> > >
> > > Same comment about Oops.
>
> In another thread it was a good advice to move the full Oops (if you think it's
> very useful to have) after the cutter '---' line, so it will be in email
> archives but Git history.

So git history wouldn't give any of the Oops context, and readers would
have to access LKML; if offline, then it's lost.

Thanks,
Ethan
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
Andy Shevchenko Sept. 29, 2020, 10:48 a.m. UTC | #9
On Tue, Sep 29, 2020 at 05:38:00PM +0800, Ethan Zhao wrote:
> On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> > > Preferred style, there will be cleared comment in v6.
> >
> > Avoid top postings.
> >
> > > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> > > <andriy.shevchenko@linux.intel.com> wrote:
> > > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:

...

> > > > >  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
> > > > >  async page read
> > > > >  BUG: kernel NULL pointer dereference, address: 0000000000000050
> > > > >  #PF: supervisor read access in kernel mode
> > > > >  #PF: error_code(0x0000) - not-present page
> > > >
> > > > Same comment about Oops.
> >
> > In another thread it was a good advice to move the full Oops (if you think it's
> > very useful to have) after the cutter '---' line, so it will be in email
> > archives but Git history.
> 
> So git history wouldn't give any of the Oops context, and he/she has
> to access LKML,
> if offline, then ...lost.

Tell me, do you really think the line:
   #PF: error_code(0x0000) - not-present page
makes any sense in the commit message?

I do not think so. And so on, for almost 60% of the Oops.
Really, to help one person you will make millions suffer. It's not okay.
Ethan Zhao Sept. 30, 2020, 2:07 a.m. UTC | #10
On Tue, Sep 29, 2020 at 6:48 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Tue, Sep 29, 2020 at 05:38:00PM +0800, Ethan Zhao wrote:
> > On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko
> > <andriy.shevchenko@linux.intel.com> wrote:
> > > On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> > > > Preferred style, there will be cleared comment in v6.
> > >
> > > Avoid top postings.
> > >
> > > > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> > > > <andriy.shevchenko@linux.intel.com> wrote:
> > > > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
>
> ...
>
> > > > > >  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
> > > > > >  async page read
> > > > > >  BUG: kernel NULL pointer dereference, address: 0000000000000050
> > > > > >  #PF: supervisor read access in kernel mode
> > > > > >  #PF: error_code(0x0000) - not-present page
> > > > >
> > > > > Same comment about Oops.
> > >
> > > In another thread it was a good advice to move the full Oops (if you think it's
> > > very useful to have) after the cutter '---' line, so it will be in email
> > > archives but Git history.
> >
> > So git history wouldn't give any of the Oops context, and he/she has
> > to access LKML,
> > if offline, then ...lost.
>
> Tell me, do you really think the line:
>    #PF: error_code(0x0000) - not-present page
> makes any sense in the commit message?
>
> I do not think so. And so on, for almost 60% of the Oops.
> Really, to help one person you will make millions suffering. It's not okay.
>
If you and millions feel you suffer so much, why not try to simplify the
Oops code so it doesn't output so much noise to the console and log. :)

It might not be so important to old birds like you, but it is easy to cut
off the road for newcomers.

Anyway, this is not the focus of this patchset. Please help to take a look
and try the code.


Thanks,
Ethan

> --
> With Best Regards,
> Andy Shevchenko
>
>