diff mbox series

PCI: pciehp: Differentiate between surprise and safe removal

Message ID dac488604766f3dcb948e702210ecc381c4f907b.1533015755.git.lukas@wunner.de (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series PCI: pciehp: Differentiate between surprise and safe removal | expand

Commit Message

Lukas Wunner July 31, 2018, 5:50 a.m. UTC
When removing PCI devices below a hotplug bridge, pciehp marks them as
disconnected if the card is no longer present in the slot or it quiesces
them if the card is still present (by disabling INTx interrupts, bus
mastering and SERR# reporting).

To detect whether the card is still present, pciehp checks the Presence
Detect State bit in the Slot Status register.  The problem with this
approach is that even if the card is present, the link to it may be
down, and it that case it would be better to mark the devices as
disconnected instead of trying to quiesce them.  Moreover, if the card
in the slot was quickly replaced by another one, the Presence Detect
State bit would be set, yet trying to quiesce the new card's devices
would be wrong and the correct thing to do is to mark the previous
card's devices as disconnected.

Instead of looking at the Presence Detect State bit, it is better to
differentiate whether the card was surprise removed versus safely
removed (via sysfs or an Attention Button press).  On surprise removal,
the devices should be marked as disconnected, whereas on safe removal it
is correct to quiesce the devices.

The knowledge whether a surprise removal or a safe removal is at hand
does exist further up in the call stack:  A surprise removal is
initiated by pciehp_handle_presence_or_link_change(), a safe removal by
pciehp_handle_disable_request().

Pass that information down to pciehp_unconfigure_device() and use it in
lieu of the Presence Detect State bit.  While there, add kernel-doc to
pciehp_unconfigure_device() and pciehp_configure_device().

Tested-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp.h      |  2 +-
 drivers/pci/hotplug/pciehp_ctrl.c | 22 +++++++++++++---------
 drivers/pci/hotplug/pciehp_pci.c  | 23 ++++++++++++++++++++---
 3 files changed, 34 insertions(+), 13 deletions(-)

Comments

Mika Westerberg Aug. 1, 2018, 4:43 p.m. UTC | #1
On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
> When removing PCI devices below a hotplug bridge, pciehp marks them as
> disconnected if the card is no longer present in the slot or it quiesces
> them if the card is still present (by disabling INTx interrupts, bus
> mastering and SERR# reporting).
> 
> To detect whether the card is still present, pciehp checks the Presence
> Detect State bit in the Slot Status register.  The problem with this
> approach is that even if the card is present, the link to it may be
> down, and it that case it would be better to mark the devices as
> disconnected instead of trying to quiesce them.  Moreover, if the card
> in the slot was quickly replaced by another one, the Presence Detect
> State bit would be set, yet trying to quiesce the new card's devices
> would be wrong and the correct thing to do is to mark the previous
> card's devices as disconnected.
> 
> Instead of looking at the Presence Detect State bit, it is better to
> differentiate whether the card was surprise removed versus safely
> removed (via sysfs or an Attention Button press).  On surprise removal,
> the devices should be marked as disconnected, whereas on safe removal it
> is correct to quiesce the devices.
> 
> The knowledge whether a surprise removal or a safe removal is at hand
> does exist further up in the call stack:  A surprise removal is
> initiated by pciehp_handle_presence_or_link_change(), a safe removal by
> pciehp_handle_disable_request().
> 
> Pass that information down to pciehp_unconfigure_device() and use it in
> lieu of the Presence Detect State bit.  While there, add kernel-doc to
> pciehp_unconfigure_device() and pciehp_configure_device().
> 
> Tested-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/hotplug/pciehp.h      |  2 +-
>  drivers/pci/hotplug/pciehp_ctrl.c | 22 +++++++++++++---------
>  drivers/pci/hotplug/pciehp_pci.c  | 23 ++++++++++++++++++++---
>  3 files changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 811cf83f956d..39c9c8815a35 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -181,7 +181,7 @@ void pciehp_handle_button_press(struct slot *slot);
>  void pciehp_handle_disable_request(struct slot *slot);
>  void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events);
>  int pciehp_configure_device(struct slot *p_slot);
> -void pciehp_unconfigure_device(struct slot *p_slot);
> +void pciehp_unconfigure_device(struct slot *p_slot, bool presence);
>  void pciehp_queue_pushbutton_work(struct work_struct *work);
>  struct controller *pcie_init(struct pcie_device *dev);
>  int pcie_init_notification(struct controller *ctrl);
> diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
> index 6855933ab372..7932e70e9f29 100644
> --- a/drivers/pci/hotplug/pciehp_ctrl.c
> +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> @@ -26,6 +26,9 @@
>     hotplug controller logic
>   */
>  
> +#define SAFE_REMOVAL	 true
> +#define SURPRISE_REMOVAL false
> +
>  static void set_slot_off(struct controller *ctrl, struct slot *pslot)
>  {
>  	/* turn off slot, turn on Amber LED, turn off Green LED if supported*/
> @@ -101,12 +104,13 @@ static int board_added(struct slot *p_slot)
>  /**
>   * remove_board - Turns off slot and LEDs
>   * @p_slot: slot where board is being removed
> + * @safe_removal: whether the board is safely removed (versus surprise removed)
>   */
> -static void remove_board(struct slot *p_slot)
> +static void remove_board(struct slot *p_slot, bool safe_removal)
>  {
>  	struct controller *ctrl = p_slot->ctrl;
>  
> -	pciehp_unconfigure_device(p_slot);
> +	pciehp_unconfigure_device(p_slot, safe_removal);

Below we turn off power to the slot if it has power controller. Even if
we disable slot from sysfs, I think it ends up being inaccessible after
power is turned off. I wonder if we should mark the devices disconnected
in that case as well?

>  
>  	if (POWER_CTRL(ctrl)) {
>  		pciehp_power_off_slot(p_slot);
Lukas Wunner Aug. 1, 2018, 5:15 p.m. UTC | #2
On Wed, Aug 01, 2018 at 07:43:58PM +0300, Mika Westerberg wrote:
> On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
> > -static void remove_board(struct slot *p_slot)
> > +static void remove_board(struct slot *p_slot, bool safe_removal)
> >  {
> >  	struct controller *ctrl = p_slot->ctrl;
> >  
> > -	pciehp_unconfigure_device(p_slot);
> > +	pciehp_unconfigure_device(p_slot, safe_removal);
> 
> Below we turn off power to the slot if it has power controller. Even if
> we disable slot from sysfs, I think it ends up being inaccessible after
> power is turned off. I wonder if we should mark the devices disconnected
> in that case as well?
> 
> >  
> >  	if (POWER_CTRL(ctrl)) {
> >  		pciehp_power_off_slot(p_slot);

No, when pciehp_unconfigure_device() returns, the PCI devices below
the hotplug bridge are unbound and removed from the system.  They're
gone, so the bit set in their pci_dev struct would no longer be
accessible anyway.  Unless of course something is holding a ref on
the pci_dev, but that would seem to be a bug.  (Accessing a device
that's already removed from the system, that is.)

Calling pci_dev_set_disconnected() only gives the PCI core and the
driver bound to the device an indication that's it's inaccessible,
so any code paths during unbound and PCI device teardown can skip
accesses.  (Because pci_dev_is_disconnected() is currently scoped
to the PCI core, the disconnected status can only be queried from
drivers that live in the PCI core, such as portdrv and all the
port services drivers.)  After the pci_dev is removed from the
hierarchy, accessing it seems at least questionable.

Does this make things clearer?  Shout if it not. :-)

Thanks,

Lukas
Alex G. Aug. 1, 2018, 7:09 p.m. UTC | #3
On 08/01/2018 12:15 PM, Lukas Wunner wrote:
> On Wed, Aug 01, 2018 at 07:43:58PM +0300, Mika Westerberg wrote:
>> On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
>>> -static void remove_board(struct slot *p_slot)
>>> +static void remove_board(struct slot *p_slot, bool safe_removal)
>>>   {
>>>   	struct controller *ctrl = p_slot->ctrl;
>>>   
>>> -	pciehp_unconfigure_device(p_slot);
>>> +	pciehp_unconfigure_device(p_slot, safe_removal);
>>
>> Below we turn off power to the slot if it has power controller. Even if
>> we disable slot from sysfs, I think it ends up being inaccessible after
>> power is turned off. I wonder if we should mark the devices disconnected
>> in that case as well?
>>
>>>   
>>>   	if (POWER_CTRL(ctrl)) {
>>>   		pciehp_power_off_slot(p_slot);
> 
> No, when pciehp_unconfigure_device() returns, the PCI devices below
> the hotplug bridge are unbound and removed from the system.  They're
> gone, so the bit set in their pci_dev struct would no longer be
> accessible anyway.  Unless of course something is holding a ref on
> the pci_dev, but that would seem to be a bug. 

And a very common bug at that. Choose your favorite open-source GPU (or 
NVME) driver, and try unloading it -- "cannot unload/in use". I stopped 
being bothered by it, and take it as the status quo.

> (Accessing a device that's already removed from the system, that is.)

Why? lspci/userspace can still send config reads/writes. Of course, 
userspace reads/writes are not protected by PCI_DEV_DISCONNECTED.

> Calling pci_dev_set_disconnected() only gives the PCI core and the
> driver bound to the device an indication that's it's inaccessible,
> so any code paths during unbound and PCI device teardown can skip
> accesses.  (Because pci_dev_is_disconnected() is currently scoped
> to the PCI core, the disconnected status can only be queried from
> drivers that live in the PCI core, such as portdrv and all the
> port services drivers.)  After the pci_dev is removed from the
> hierarchy, accessing it seems at least questionable.

I fully agree.

> Does this make things clearer?  Shout if it not. :-)
> 
> Thanks,
> 
> Lukas
>
Mika Westerberg Aug. 2, 2018, 7:20 a.m. UTC | #4
On Wed, Aug 01, 2018 at 07:15:12PM +0200, Lukas Wunner wrote:
> On Wed, Aug 01, 2018 at 07:43:58PM +0300, Mika Westerberg wrote:
> > On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
> > > -static void remove_board(struct slot *p_slot)
> > > +static void remove_board(struct slot *p_slot, bool safe_removal)
> > >  {
> > >  	struct controller *ctrl = p_slot->ctrl;
> > >  
> > > -	pciehp_unconfigure_device(p_slot);
> > > +	pciehp_unconfigure_device(p_slot, safe_removal);
> > 
> > Below we turn off power to the slot if it has power controller. Even if
> > we disable slot from sysfs, I think it ends up being inaccessible after
> > power is turned off. I wonder if we should mark the devices disconnected
> > in that case as well?
> > 
> > >  
> > >  	if (POWER_CTRL(ctrl)) {
> > >  		pciehp_power_off_slot(p_slot);
> 
> No, when pciehp_unconfigure_device() returns, the PCI devices below
> the hotplug bridge are unbound and removed from the system.  They're
> gone, so the bit set in their pci_dev struct would no longer be
> accessible anyway.  Unless of course something is holding a ref on
> the pci_dev, but that would seem to be a bug.  (Accessing a device
> that's already removed from the system, that is.)
> 
> Calling pci_dev_set_disconnected() only gives the PCI core and the
> driver bound to the device an indication that's it's inaccessible,
> so any code paths during unbound and PCI device teardown can skip
> accesses.  (Because pci_dev_is_disconnected() is currently scoped
> to the PCI core, the disconnected status can only be queried from
> drivers that live in the PCI core, such as portdrv and all the
> port services drivers.)  After the pci_dev is removed from the
> hierarchy, accessing it seems at least questionable.
> 
> Does this make things clearer?  Shout if it not. :-)

Yes it does. Thank you :)
gokul cg Aug. 2, 2018, 7:29 a.m. UTC | #5
Hi ,

>After the pci_dev is removed from the
>> hierarchy, accessing it seems at least questionable.

What about AER driver .  I was discussing the same in another mail chain
with subject  "Possible race condition in the kernel between PCI driver and
AER handling"

Regards
Gokul,

--------------------FYI---------------

I am suspecting a possible race condition in the kernel between PCI driver
and AER handling.

Because of the same kernel panic happens from worker thread which handles
bottom half of aer irq.


I am seeing this issue when I suddenly power off PCI card which
supports/enabled PCIE AER error reporting.

While powering off PCI device, AER driver will get AER IRQ for the device,
from AER IRQ handler, it will cache AER error code and schedule worker
thread to handle error.

The PCIe device will get removed from PCI tree before worker thread
completes its task and kernel panic is  happening when worker thread tries
to access PCI device's config space.


Issue:

crash>

crash> bt

PID: 2727   TASK: ffff880272adc530  CPU: 0   COMMAND: "kworker/0:2"

#0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18

#1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05

#2 [ffff88027469fbf0] oops_end at ffffffff8176d960

#3 [ffff88027469fc18] die at ffffffff810060db

#4 [ffff88027469fc48] do_general_protection at ffffffff8176d452

#5 [ffff88027469fc70] general_protection at ffffffff8176cdf2

    [exception RIP: pci_bus_read_config_dword+100]

    RIP: ffffffff813405f4  RSP: ffff88027469fd20  RFLAGS: 00010046

    RAX: 435f494350006963  RBX: ffff880274892000  RCX: 0000000000000004

    RDX: 0000000000000100  RSI: 0000000000000060  RDI: ffff880274892000

    RBP: ffff88027469fd48   R8: ffff88027469fd2c   R9: 00000000000012c0

    R10: 0000000000000006  R11: 00000000000012bf  R12: ffff88027469fd5c

    R13: 0000000000000246  R14: 0000000000000000  R15: ffff8802741a4000

    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000

#6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b

#7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225

#8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d

#9 [ffff88027469fdd0] aer_isr at ffffffff81357a38

#10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0

#11 [ffff88027469fe70] worker_thread at ffffffff8105e251

#12 [ffff88027469fed0] kthread at ffffffff81064260

#13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38


crash>


I have tested it on kernel 3.10 . But from source i could see that this
case is still relevant for latest Linux source .

--------------------END--------------

On Thu, Aug 2, 2018 at 12:50 PM, Mika Westerberg <
mika.westerberg@linux.intel.com> wrote:

> On Wed, Aug 01, 2018 at 07:15:12PM +0200, Lukas Wunner wrote:
> > On Wed, Aug 01, 2018 at 07:43:58PM +0300, Mika Westerberg wrote:
> > > On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
> > > > -static void remove_board(struct slot *p_slot)
> > > > +static void remove_board(struct slot *p_slot, bool safe_removal)
> > > >  {
> > > >   struct controller *ctrl = p_slot->ctrl;
> > > >
> > > > - pciehp_unconfigure_device(p_slot);
> > > > + pciehp_unconfigure_device(p_slot, safe_removal);
> > >
> > > Below we turn off power to the slot if it has power controller. Even if
> > > we disable slot from sysfs, I think it ends up being inaccessible after
> > > power is turned off. I wonder if we should mark the devices
> disconnected
> > > in that case as well?
> > >
> > > >
> > > >   if (POWER_CTRL(ctrl)) {
> > > >           pciehp_power_off_slot(p_slot);
> >
> > No, when pciehp_unconfigure_device() returns, the PCI devices below
> > the hotplug bridge are unbound and removed from the system.  They're
> > gone, so the bit set in their pci_dev struct would no longer be
> > accessible anyway.  Unless of course something is holding a ref on
> > the pci_dev, but that would seem to be a bug.  (Accessing a device
> > that's already removed from the system, that is.)
> >
> > Calling pci_dev_set_disconnected() only gives the PCI core and the
> > driver bound to the device an indication that's it's inaccessible,
> > so any code paths during unbound and PCI device teardown can skip
> > accesses.  (Because pci_dev_is_disconnected() is currently scoped
> > to the PCI core, the disconnected status can only be queried from
> > drivers that live in the PCI core, such as portdrv and all the
> > port services drivers.)  After the pci_dev is removed from the
> > hierarchy, accessing it seems at least questionable.
> >
> > Does this make things clearer?  Shout if it not. :-)
>
> Yes it does. Thank you :)
>
<div dir="ltr">Hi ,<div><br></div><div>&gt;<span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">After the pci_dev is removed from the</span><br style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">&gt;&gt; hierarchy, accessing it seems at least questionable.</span></div><div><br></div><div>What about AER driver .  I was discussing the same in another mail chain with subject  &quot;<span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Possible race condition in the kernel between PCI driver and AER handling&quot;</span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Regards</span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Gokul,</span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">--------------------FYI---------------</span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><span style="font-size:12px;font-family:Helvetica"><span>I </span>am suspecting a possible race condition in the kernel between PCI driver and AER handling.</span><br></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">Because of the same kernel panic happens from worker thread which handles bottom half of aer irq.</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px"><br></span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">I am seeing this issue when I suddenly power off PCI card which supports/enabled PCIE AER error reporting.</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">While powering off PCI device, AER driver will get AER IRQ for the device, from AER IRQ handler, it will cache AER error code and schedule worker thread to handle error.</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">The PCIe device will get removed from PCI tree before worker thread completes its task and kernel panic is  happening when worker thread tries to access PCI device&#39;s config space.</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><br></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:small;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">Issue:</span></font></p><br class="gmail-m_6437089603770748812gmail-Apple-interchange-newline" style="font-size:12.8px;text-decoration-style:initial;text-decoration-color:initial"><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">crash&gt;</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">crash&gt; bt</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">PID: 2727   TASK: ffff880272adc530  CPU: 0   COMMAND: &quot;kworker/0:2&quot;</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#2 [ffff88027469fbf0] oops_end at ffffffff8176d960</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#3 [ffff88027469fc18] die at ffffffff810060db</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#4 [ffff88027469fc48] do_general_protection at ffffffff8176d452</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#5 [ffff88027469fc70] general_protection at ffffffff8176cdf2</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    [exception RIP: pci_bus_read_config_dword+100]</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    RIP: ffffffff813405f4  RSP: ffff88027469fd20  RFLAGS: 00010046</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    RAX: 435f494350006963  RBX: ffff880274892000  RCX: 0000000000000004</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    RDX: 0000000000000100  RSI: 0000000000000060  RDI: ffff880274892000</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    RBP: ffff88027469fd48   R8: ffff88027469fd2c   R9: 00000000000012c0</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    R10: 0000000000000006  R11: 00000000000012bf  R12: ffff88027469fd5c</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    R13: 0000000000000246  R14: 0000000000000000  R15: ffff8802741a4000</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#9 [ffff88027469fdd0] aer_isr at ffffffff81357a38</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#11 [ffff88027469fe70] worker_thread at ffffffff8105e251</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#12 [ffff88027469fed0] kthread at ffffffff81064260</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">#13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px"><br></span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">crash&gt;</span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px"><br></span></font></p><p class="gmail-m_6437089603770748812gmail-p1" style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:12.8px;line-height:normal;text-decoration-style:initial;text-decoration-color:initial;margin:0px"><font face="Helvetica"><span style="font-size:12px">I have tested it on kernel 3.10 . But from source i could see that this case is still relevant for latest Linux source .</span></font></p></span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">--------------------END-------------- </span></div><div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Aug 2, 2018 at 12:50 PM, Mika Westerberg <span dir="ltr">&lt;<a href="mailto:mika.westerberg@linux.intel.com" target="_blank">mika.westerberg@linux.intel.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Wed, Aug 01, 2018 at 07:15:12PM +0200, Lukas Wunner wrote:<br>
&gt; On Wed, Aug 01, 2018 at 07:43:58PM +0300, Mika Westerberg wrote:<br>
&gt; &gt; On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:<br>
&gt; &gt; &gt; -static void remove_board(struct slot *p_slot)<br>
&gt; &gt; &gt; +static void remove_board(struct slot *p_slot, bool safe_removal)<br>
&gt; &gt; &gt;  {<br>
&gt; &gt; &gt;   struct controller *ctrl = p_slot-&gt;ctrl;<br>
&gt; &gt; &gt;  <br>
&gt; &gt; &gt; - pciehp_unconfigure_device(p_<wbr>slot);<br>
&gt; &gt; &gt; + pciehp_unconfigure_device(p_<wbr>slot, safe_removal);<br>
&gt; &gt; <br>
&gt; &gt; Below we turn off power to the slot if it has power controller. Even if<br>
&gt; &gt; we disable slot from sysfs, I think it ends up being inaccessible after<br>
&gt; &gt; power is turned off. I wonder if we should mark the devices disconnected<br>
&gt; &gt; in that case as well?<br>
&gt; &gt; <br>
&gt; &gt; &gt;  <br>
&gt; &gt; &gt;   if (POWER_CTRL(ctrl)) {<br>
&gt; &gt; &gt;           pciehp_power_off_slot(p_slot);<br>
&gt; <br>
&gt; No, when pciehp_unconfigure_device() returns, the PCI devices below<br>
&gt; the hotplug bridge are unbound and removed from the system.  They&#39;re<br>
&gt; gone, so the bit set in their pci_dev struct would no longer be<br>
&gt; accessible anyway.  Unless of course something is holding a ref on<br>
&gt; the pci_dev, but that would seem to be a bug.  (Accessing a device<br>
&gt; that&#39;s already removed from the system, that is.)<br>
&gt; <br>
&gt; Calling pci_dev_set_disconnected() only gives the PCI core and the<br>
&gt; driver bound to the device an indication that&#39;s it&#39;s inaccessible,<br>
&gt; so any code paths during unbound and PCI device teardown can skip<br>
&gt; accesses.  (Because pci_dev_is_disconnected() is currently scoped<br>
&gt; to the PCI core, the disconnected status can only be queried from<br>
&gt; drivers that live in the PCI core, such as portdrv and all the<br>
&gt; port services drivers.)  After the pci_dev is removed from the<br>
&gt; hierarchy, accessing it seems at least questionable.<br>
&gt; <br>
&gt; Does this make things clearer?  Shout if it not. :-)<br>
<br>
Yes it does. Thank you :)<br>
</blockquote></div><br></div></div></div>
Lukas Wunner Aug. 2, 2018, 8:46 a.m. UTC | #6
On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> I am suspecting a possible race condition in the kernel between PCI driver
> and AER handling.
> 
> Because of the same kernel panic happens from worker thread which handles
> bottom half of aer irq.
> 
> I am seeing this issue when I suddenly power off PCI card which
> supports/enabled PCIE AER error reporting.
> 
> While powering off PCI device, AER driver will get AER IRQ for the device,
> from AER IRQ handler, it will cache AER error code and schedule worker
> thread to handle error.
> 
> The PCIe device will get removed from PCI tree before worker thread
> completes its task and kernel panic is  happening when worker thread tries
> to access PCI device's config space.
> 
> #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
>     [exception RIP: pci_bus_read_config_dword+100]
> #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
> #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
> #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
> #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
> #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
> #11 [ffff88027469fe70] worker_thread at ffffffff8105e251
> #12 [ffff88027469fed0] kthread at ffffffff81064260
> #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
> 
> I have tested it on kernel 3.10 . But from source i could see that this
> case is still relevant for latest Linux source .

I'm not really familiar with the AER driver, but the problem is
actually easy to spot:

find_source_device() walks the hierarchy and saves a pointer to
pci_dev's in an array.  That array is later traversed and the
pci_dev's are accessed.

The solution is to acquire a ref on each device in add_error_device():

-	e_info->dev[e_info->error_dev_num] = dev;
+	e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);

Then release the ref aer_process_err_devices() by calling pci_dev_put().

I believe there's an ongoing refactoring of the AER driver and the
issue may be addressed in the course of it, but as a quick fix for
an ancient v3.10 kernel, the above should do the trick.

HTH,

Lukas
gokul cg Aug. 2, 2018, 12:28 p.m. UTC | #7
Hi Lukas,

>The solution is to acquire a ref on each device in add_error_device():
I will give try and update if it works .

regards,
Gokul
On Thu, Aug 2, 2018 at 2:16 PM, Lukas Wunner <lukas@wunner.de> wrote:

> On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> > I am suspecting a possible race condition in the kernel between PCI
> driver
> > and AER handling.
> >
> > Because of the same kernel panic happens from worker thread which handles
> > bottom half of aer irq.
> >
> > I am seeing this issue when I suddenly power off PCI card which
> > supports/enabled PCIE AER error reporting.
> >
> > While powering off PCI device, AER driver will get AER IRQ for the
> device,
> > from AER IRQ handler, it will cache AER error code and schedule worker
> > thread to handle error.
> >
> > The PCIe device will get removed from PCI tree before worker thread
> > completes its task and kernel panic is  happening when worker thread
> tries
> > to access PCI device's config space.
> >
> > #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
> >     [exception RIP: pci_bus_read_config_dword+100]
> > #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
> > #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
> > #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
> > #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
> > #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
> > #11 [ffff88027469fe70] worker_thread at ffffffff8105e251
> > #12 [ffff88027469fed0] kthread at ffffffff81064260
> > #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
> >
> > I have tested it on kernel 3.10 . But from source i could see that this
> > case is still relevant for latest Linux source .
>
> I'm not really familiar with the AER driver, but the problem is
> actually easy to spot:
>
> find_source_device() walks the hierarchy and saves a pointer to
> pci_dev's in an array.  That array is later traversed and the
> pci_dev's are accessed.
>
> The solution is to acquire a ref on each device in add_error_device():
>
> -       e_info->dev[e_info->error_dev_num] = dev;
> +       e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
>
> Then release the ref aer_process_err_devices() by calling pci_dev_put().
>
> I believe there's an ongoing refactoring of the AER driver and the
> issue may be addressed in the course of it, but as a quick fix for
> an ancient v3.10 kernel, the above should do the trick.
>
> HTH,
>
> Lukas
>
<div dir="ltr">Hi <span style="font-size:small;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Lukas,</span><div><br></div><div><div>&gt;<span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">The solution is to acquire a ref on each device in add_error_device():</span></div><div>I will give try and update if it works .<br><div class="gmail_extra"><br></div><div class="gmail_extra">regards,</div><div class="gmail_extra">Gokul </div><div class="gmail_extra"><div class="gmail_quote">On Thu, Aug 2, 2018 at 2:16 PM, Lukas Wunner <span dir="ltr">&lt;<a href="mailto:lukas@wunner.de" target="_blank">lukas@wunner.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:<br>
&gt; I am suspecting a possible race condition in the kernel between PCI driver<br>
&gt; and AER handling.<br>
&gt; <br>
&gt; Because of the same kernel panic happens from worker thread which handles<br>
&gt; bottom half of aer irq.<br>
&gt; <br>
&gt; I am seeing this issue when I suddenly power off PCI card which<br>
&gt; supports/enabled PCIE AER error reporting.<br>
&gt; <br>
&gt; While powering off PCI device, AER driver will get AER IRQ for the device,<br>
&gt; from AER IRQ handler, it will cache AER error code and schedule worker<br>
&gt; thread to handle error.<br>
&gt; <br>
&gt; The PCIe device will get removed from PCI tree before worker thread<br>
&gt; completes its task and kernel panic is  happening when worker thread tries<br>
&gt; to access PCI device&#39;s config space.<br>
&gt; <br>
</span><span class="">&gt; #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2<br>
&gt;     [exception RIP: pci_bus_read_config_dword+100]<br>
</span><span class="">&gt; #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b<br>
&gt; #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225<br>
&gt; #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d<br>
&gt; #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38<br>
&gt; #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0<br>
&gt; #11 [ffff88027469fe70] worker_thread at ffffffff8105e251<br>
&gt; #12 [ffff88027469fed0] kthread at ffffffff81064260<br>
&gt; #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38<br>
&gt; <br>
</span><span class="">&gt; I have tested it on kernel 3.10 . But from source i could see that this<br>
&gt; case is still relevant for latest Linux source .<br>
<br>
</span>I&#39;m not really familiar with the AER driver, but the problem is<br>
actually easy to spot:<br>
<br>
find_source_device() walks the hierarchy and saves a pointer to<br>
pci_dev&#39;s in an array.  That array is later traversed and the<br>
pci_dev&#39;s are accessed.<br>
<br>
The solution is to acquire a ref on each device in add_error_device():<br>
<br>
-       e_info-&gt;dev[e_info-&gt;error_dev_<wbr>num] = dev;<br>
+       e_info-&gt;dev[e_info-&gt;error_dev_<wbr>num] = pci_dev_get(dev);<br>
<br>
Then release the ref aer_process_err_devices() by calling pci_dev_put().<br>
<br>
I believe there&#39;s an ongoing refactoring of the AER driver and the<br>
issue may be addressed in the course of it, but as a quick fix for<br>
an ancient v3.10 kernel, the above should do the trick.<br>
<br>
HTH,<br>
<br>
Lukas<br>
</blockquote></div><br></div></div></div></div>
Lukas Wunner Aug. 2, 2018, 3:07 p.m. UTC | #8
[cc += Thomas Tai]

On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner wrote:
> On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> > I am suspecting a possible race condition in the kernel between PCI driver
> > and AER handling.
> 
> The solution is to acquire a ref on each device in add_error_device().
> Then release the ref aer_process_err_devices() by calling pci_dev_put().

So in case it wasn't clear, the below is what I had in mind.
Completely untested though.  Does this work for you?

For v3.10 compatibility, cherry-pick 89ee9f768003 (or alternatively
cherry-pick 8496e85c20e7 and replace pci_dev_is_disconnected(dev)
with !pci_device_is_present(dev)).

-- >8 --
Subject: [PATCH] PCI/AER: Fix use-after-free on surprise removal

The work item to consume errors, aer_isr(), walks the hierarchy using
pci_walk_bus() and stores a pointer to PCI devices which reported an
error in an array.  As long as pci_walk_bus() runs, those pointers are
valid because pci_bus_sem is held.  But once pci_walk_bus() finishes,
nothing prevents the pointers from becoming invalid, e.g. through
unplugging of the PCI devices.  The unprotected pointers are then
dereferenced in aer_process_err_devices(), which may oops:

  #5  general_protection at ffffffff8176cdf2
      [exception RIP: pci_bus_read_config_dword+100]
  #6  pci_find_next_ext_capability at ffffffff81345d7b
  #7  pci_find_ext_capability at ffffffff81347225
  #8  get_device_error_info at ffffffff81356c4d
  #9  aer_isr at ffffffff81357a38

Fix by holding a ref on the devices until they have been processed.
Skip processing of unplugged devices.

Reported-by: gokul cg <gokuljnpr@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pcie/aer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e8838..937592e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
 {
 	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-		e_info->dev[e_info->error_dev_num] = dev;
+		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
 		e_info->error_dev_num++;
 		return 0;
 	}
@@ -898,6 +898,9 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	if (!pos)
 		return 0;
 
+	if (pci_dev_is_disconnected(dev))
+		return 0;
+
 	if (info->severity == AER_CORRECTABLE) {
 		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 			&info->status);
@@ -948,6 +951,7 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (get_device_error_info(e_info->dev[i], e_info))
 			handle_error_source(e_info->dev[i], e_info);
+		pci_dev_put(e_info->dev[i]);
 	}
 }
Thomas Tai Aug. 2, 2018, 5:09 p.m. UTC | #9
On 08/02/2018 11:07 AM, Lukas Wunner wrote:
> [cc += Thomas Tai]

Hi Lukas,
Thank you very much for cc me.
> 
> On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner wrote:
>> On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
>>> I am suspecting a possible race condition in the kernel between PCI driver
>>> and AER handling.
>>
>> The solution is to acquire a ref on each device in add_error_device().
>> Then release the ref aer_process_err_devices() by calling pci_dev_put().
> 
> So in case it wasn't clear, the below is what I had in mind.
> Completely untested though.  Does this work for you?
> 
> For v3.10 compatibility, cherry-pick 89ee9f768003 (or alternatively
> cherry-pick 8496e85c20e7 and replace pci_dev_is_disconnected(dev)
> with !pci_device_is_present(dev)).
> 
> -- >8 --
> Subject: [PATCH] PCI/AER: Fix use-after-free on surprise removal
> 
> The work item to consume errors, aer_isr(), walks the hierarchy using
> pci_walk_bus() and stores a pointer to PCI devices which reported an
> error in an array.  As long as pci_walk_bus() runs, those pointers are
> valid because pci_bus_sem is held.  But once pci_walk_bus() finishes,
> nothing prevents the pointers from becoming invalid, e.g. through
> unplugging of the PCI devices.  The unprotected pointers are then
> dereferenced in aer_process_err_devices(), which may oops:

I like your idea to increment the refcount during pci_walk_bus(), that 
should fix the use-after-free issue. We just need Gokul to confirm if it 
fixes his issue or not.

Thanks,
Thomas

> 
>    #5  general_protection at ffffffff8176cdf2
>        [exception RIP: pci_bus_read_config_dword+100]
>    #6  pci_find_next_ext_capability at ffffffff81345d7b
>    #7  pci_find_ext_capability at ffffffff81347225
>    #8  get_device_error_info at ffffffff81356c4d
>    #9  aer_isr at ffffffff81357a38
> 
> Fix by holding a ref on the devices until they have been processed.
> Skip processing of unplugged devices.
> 
> Reported-by: gokul cg <gokuljnpr@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> ---
>   drivers/pci/pcie/aer.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e8838..937592e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
>   static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>   {
>   	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> -		e_info->dev[e_info->error_dev_num] = dev;
> +		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
>   		e_info->error_dev_num++;
>   		return 0;
>   	}
> @@ -898,6 +898,9 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>   	if (!pos)
>   		return 0;
>   
> +	if (pci_dev_is_disconnected(dev))
> +		return 0;
> +
>   	if (info->severity == AER_CORRECTABLE) {
>   		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
>   			&info->status);
> @@ -948,6 +951,7 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>   	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>   		if (get_device_error_info(e_info->dev[i], e_info))
>   			handle_error_source(e_info->dev[i], e_info);
> +		pci_dev_put(e_info->dev[i]);
>   	}
>   }
>   
>
Thomas Tai Aug. 7, 2018, 2:26 p.m. UTC | #10
On 08/06/2018 02:33 PM, gokul cg wrote:
> Hi,
> 
> I have tried with following patch and I am still getting same kernel panic.

Sorry to hear that. What kernel message did you see before the crash. We 
assume the hotplug is corrupting the pci_dev, but is it really? Have you 
tried without insmod the hotplug provided that you can still power off 
the device via your I2C.

> 
> -------------X++++++++++++++++++++X---------------------
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 0f4554e..05592aa 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -26,6 +26,7 @@
>   #include <linux/slab.h>
>   #include <linux/kfifo.h>
>   #include "aerdrv.h"
> +#include "../../pci.h"
> 
>   static bool forceload;
>   static bool nosourceid;
> @@ -82,7 +82,7 @@ EXPORT_SYMBOL_GPL(pci_cleanup_aer_uncorrect_error_status);
>   static int add_error_device(struct aer_err_info *e_info, struct 
> pci_dev *dev)
>   {
> if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> -e_info->dev[e_info->error_dev_num] = dev;
> +e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> e_info->error_dev_num++;
> return 0;
> }
> @@ -659,6 +659,9 @@ static int get_device_error_info(struct pci_dev 
> *dev, struct aer_err_info *info)
> if (!pos)
> return 1;
> 
> +        if (pci_dev_is_disconnected(dev))
> +                return 0;
> +
> if (info->severity == AER_CORRECTABLE) {
> pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
> &info->status);
> @@ -710,6 +713,8 @@ static inline void aer_process_err_devices(struct 
> pcie_device *p_device,
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> if (get_device_error_info(e_info->dev[i], e_info))
> handle_error_source(p_device, e_info->dev[i], e_info);
> +
> +                pci_dev_put(e_info->dev[i]);
> }
>   }
> -------------X++++++++++++++++++++X---------------------
> 
> 
> Note: I have configured CONFIG_HOTPLUG_PCI_PCIE and CONFIG_HOTPLUG_PCI 
> as modules and  loading in start up using script.
> 
> root@/proc/:~# cat config | grep -i HOT
> CONFIG_TICK_ONESHOT=y
> CONFIG_HOTPLUG=y
> # CONFIG_MEMORY_HOTPLUG is not set
> CONFIG_HOTPLUG_CPU=y
> # CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
> # CONFIG_DEBUG_HOTPLUG_CPU0 is not set
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_ACPI_HOTPLUG_CPU=y
> CONFIG_HOTPLUG_PCI_PCIE=m
> CONFIG_HOTPLUG_PCI=m
> # CONFIG_HOTPLUG_PCI_CPCI is not set
> # CONFIG_HOTPLUG_PCI_SHPC is not set
> CONFIG_DM_SNAPSHOT=y
> # CONFIG_USB_STORAGE_JUMPSHOT is not set
> # CONFIG_TRACER_SNAPSHOT is not set
> root@/proc/:~#
> 
> Panic back trace :
> crash> bt
> PID: 24     TASK: ffff880274ac0000  CPU: 0   COMMAND: "kworker/0:1"
>   #0 [ffff880274abbac8] machine_kexec at ffffffff8102cf18
>   #1 [ffff880274abbb28] crash_kexec at ffffffff810a6b05
>   #2 [ffff880274abbbf0] oops_end at ffffffff8176d8a0
>   #3 [ffff880274abbc18] die at ffffffff810060db
>   #4 [ffff880274abbc48] do_general_protection at ffffffff8176d392
>   #5 [ffff880274abbc70] general_protection at ffffffff8176cd32
>      [exception RIP: pci_bus_read_config_dword+100]
>      RIP: ffffffff813405f4  RSP: ffff880274abbd20  RFLAGS: 00010046
>      RAX: 435f494350006963  RBX: ffff880274891800  RCX: 0000000000000004
>      RDX: 0000000000000ffc  RSI: 0000000000000060  RDI: ffff880274891800
>      RBP: ffff880274abbd48   R8: ffff880274abbd2c   R9: 00000000000002b8
>      R10: ffff880274340000  R11: 0000000000000246  R12: ffff880274abbd5c
>      R13: 0000000000000246  R14: 0000000000000000  R15: ffff880274920000
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   #6 [ffff880274abbd50] pci_find_next_ext_capability at ffffffff81345db6
>   #7 [ffff880274abbd90] pci_find_ext_capability at ffffffff81347225
>   #8 [ffff880274abbda0] get_device_error_info at ffffffff81356c4d
>   #9 [ffff880274abbdd0] aer_isr at ffffffff81357ab0
> #10 [ffff880274abbe28] process_one_work at ffffffff8105d4c0
> #11 [ffff880274abbe70] worker_thread at ffffffff8105e251
> #12 [ffff880274abbed0] kthread at ffffffff81064260
> #13 [ffff880274abbf50] ret_from_fork at ffffffff81773978
> crash>
> 
> 
> Regards,
> Gokul
> 
> On Thu, Aug 2, 2018 at 10:39 PM, Thomas Tai <thomas.tai@oracle.com 
> <mailto:thomas.tai@oracle.com>> wrote:
> 
> 
>     On 08/02/2018 11:07 AM, Lukas Wunner wrote:
> 
>         [cc += Thomas Tai]
> 
> 
>     Hi Lukas,
>     Thank you very much for cc me.
> 
> 
>         On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner wrote:
> 
>             On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> 
>                 I am suspecting a possible race condition in the kernel
>                 between PCI driver
>                 and AER handling.
> 
> 
>             The solution is to acquire a ref on each device in
>             add_error_device().
>             Then release the ref aer_process_err_devices() by calling
>             pci_dev_put().
> 
> 
>         So in case it wasn't clear, the below is what I had in mind.
>         Completely untested though.  Does this work for you?
> 
>         For v3.10 compatibility, cherry-pick 89ee9f768003 (or alternatively
>         cherry-pick 8496e85c20e7 and replace pci_dev_is_disconnected(dev)
>         with !pci_device_is_present(dev)).
> 
>         -- >8 --
>         Subject: [PATCH] PCI/AER: Fix use-after-free on surprise removal
> 
>         The work item to consume errors, aer_isr(), walks the hierarchy
>         using
>         pci_walk_bus() and stores a pointer to PCI devices which reported an
>         error in an array.  As long as pci_walk_bus() runs, those
>         pointers are
>         valid because pci_bus_sem is held.  But once pci_walk_bus()
>         finishes,
>         nothing prevents the pointers from becoming invalid, e.g. through
>         unplugging of the PCI devices.  The unprotected pointers are then
>         dereferenced in aer_process_err_devices(), which may oops:
> 
> 
>     I like your idea to increment the refcount during pci_walk_bus(),
>     that should fix the use-after-free issue. We just need Gokul to
>     confirm if it fixes his issue or not.
> 
>     Thanks,
>     Thomas
> 
> 
> 
>             #5  general_protection at ffffffff8176cdf2
>                 [exception RIP: pci_bus_read_config_dword+100]
>             #6  pci_find_next_ext_capability at ffffffff81345d7b
>             #7  pci_find_ext_capability at ffffffff81347225
>             #8  get_device_error_info at ffffffff81356c4d
>             #9  aer_isr at ffffffff81357a38
> 
>         Fix by holding a ref on the devices until they have been processed.
>         Skip processing of unplugged devices.
> 
>         Reported-by: gokul cg <gokuljnpr@gmail.com
>         <mailto:gokuljnpr@gmail.com>>
>         Signed-off-by: Lukas Wunner <lukas@wunner.de
>         <mailto:lukas@wunner.de>>
>         ---
>            drivers/pci/pcie/aer.c | 6 +++++-
>            1 file changed, 5 insertions(+), 1 deletion(-)
> 
>         diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>         index a2e8838..937592e 100644
>         --- a/drivers/pci/pcie/aer.c
>         +++ b/drivers/pci/pcie/aer.c
>         @@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev *dev, int
>         aer_severity,
>            static int add_error_device(struct aer_err_info *e_info,
>         struct pci_dev *dev)
>            {
>                  if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>         -               e_info->dev[e_info->error_dev_num] = dev;
>         +               e_info->dev[e_info->error_dev_num] =
>         pci_dev_get(dev);
>                          e_info->error_dev_num++;
>                          return 0;
>                  }
>         @@ -898,6 +898,9 @@ static int get_device_error_info(struct
>         pci_dev *dev, struct aer_err_info *info)
>                  if (!pos)
>                          return 0;
>            +     if (pci_dev_is_disconnected(dev))
>         +               return 0;
>         +
>                  if (info->severity == AER_CORRECTABLE) {
>                          pci_read_config_dword(dev, pos +
>         PCI_ERR_COR_STATUS,
>                                  &info->status);
>         @@ -948,6 +951,7 @@ static inline void
>         aer_process_err_devices(struct aer_err_info *e_info)
>                  for (i = 0; i < e_info->error_dev_num &&
>         e_info->dev[i]; i++) {
>                          if (get_device_error_info(e_info->dev[i], e_info))
>                                  handle_error_source(e_info->dev[i],
>         e_info);
>         +               pci_dev_put(e_info->dev[i]);
>                  }
>            }
> 
>
Thomas Tai Aug. 7, 2018, 3:30 p.m. UTC | #11
Hi Gokul,
Something pop up in my mind and want to share with you. I assume that 
your device is not a root port device or a switch device. I assume when 
you power off the device, a FATAL error is sent to the root port thus 
trigger the aer_isr.

Since it is a fatal error and your device is not a switch device, the 
code should not reach out your device because fatal error means that the 
link to your device is not reliable. So the pci_find_ext_capability() 
looks strange to me. When compare the code with the master branch. v3.10 
is missing following patch. Would you think you can give it a try?

commit 66b808099146166c44157600a166c8372172cd76
Author: Keith Busch <keith.busch@intel.com>
Date:   Tue Sep 27 16:23:34 2016 -0400

     PCI/AER: Cache capability position

     Save the position of the error reporting capability so it doesn't 
need to
     be rediscovered during error handling.

     Signed-off-by: Keith Busch <keith.busch@intel.com>
     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
     CC: Lukas Wunner <lukas@wunner.de>

- Thomas


On 08/06/2018 02:33 PM, gokul cg wrote:
> Hi,
> 
> I have tried with following patch and I am still getting same kernel panic.
> 
> -------------X++++++++++++++++++++X---------------------
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 0f4554e..05592aa 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -26,6 +26,7 @@
>   #include <linux/slab.h>
>   #include <linux/kfifo.h>
>   #include "aerdrv.h"
> +#include "../../pci.h"
> 
>   static bool forceload;
>   static bool nosourceid;
> @@ -82,7 +82,7 @@ EXPORT_SYMBOL_GPL(pci_cleanup_aer_uncorrect_error_status);
>   static int add_error_device(struct aer_err_info *e_info, struct 
> pci_dev *dev)
>   {
> if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> -e_info->dev[e_info->error_dev_num] = dev;
> +e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> e_info->error_dev_num++;
> return 0;
> }
> @@ -659,6 +659,9 @@ static int get_device_error_info(struct pci_dev 
> *dev, struct aer_err_info *info)
> if (!pos)
> return 1;
> 
> +        if (pci_dev_is_disconnected(dev))
> +                return 0;
> +
> if (info->severity == AER_CORRECTABLE) {
> pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
> &info->status);
> @@ -710,6 +713,8 @@ static inline void aer_process_err_devices(struct 
> pcie_device *p_device,
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> if (get_device_error_info(e_info->dev[i], e_info))
> handle_error_source(p_device, e_info->dev[i], e_info);
> +
> +                pci_dev_put(e_info->dev[i]);
> }
>   }
> -------------X++++++++++++++++++++X---------------------
> 
> 
> Note: I have configured CONFIG_HOTPLUG_PCI_PCIE and CONFIG_HOTPLUG_PCI 
> as modules and  loading in start up using script.
> 
> root@/proc/:~# cat config | grep -i HOT
> CONFIG_TICK_ONESHOT=y
> CONFIG_HOTPLUG=y
> # CONFIG_MEMORY_HOTPLUG is not set
> CONFIG_HOTPLUG_CPU=y
> # CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
> # CONFIG_DEBUG_HOTPLUG_CPU0 is not set
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_ACPI_HOTPLUG_CPU=y
> CONFIG_HOTPLUG_PCI_PCIE=m
> CONFIG_HOTPLUG_PCI=m
> # CONFIG_HOTPLUG_PCI_CPCI is not set
> # CONFIG_HOTPLUG_PCI_SHPC is not set
> CONFIG_DM_SNAPSHOT=y
> # CONFIG_USB_STORAGE_JUMPSHOT is not set
> # CONFIG_TRACER_SNAPSHOT is not set
> root@/proc/:~#
> 
> Panic back trace :
> crash> bt
> PID: 24     TASK: ffff880274ac0000  CPU: 0   COMMAND: "kworker/0:1"
>   #0 [ffff880274abbac8] machine_kexec at ffffffff8102cf18
>   #1 [ffff880274abbb28] crash_kexec at ffffffff810a6b05
>   #2 [ffff880274abbbf0] oops_end at ffffffff8176d8a0
>   #3 [ffff880274abbc18] die at ffffffff810060db
>   #4 [ffff880274abbc48] do_general_protection at ffffffff8176d392
>   #5 [ffff880274abbc70] general_protection at ffffffff8176cd32
>      [exception RIP: pci_bus_read_config_dword+100]
>      RIP: ffffffff813405f4  RSP: ffff880274abbd20  RFLAGS: 00010046
>      RAX: 435f494350006963  RBX: ffff880274891800  RCX: 0000000000000004
>      RDX: 0000000000000ffc  RSI: 0000000000000060  RDI: ffff880274891800
>      RBP: ffff880274abbd48   R8: ffff880274abbd2c   R9: 00000000000002b8
>      R10: ffff880274340000  R11: 0000000000000246  R12: ffff880274abbd5c
>      R13: 0000000000000246  R14: 0000000000000000  R15: ffff880274920000
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   #6 [ffff880274abbd50] pci_find_next_ext_capability at ffffffff81345db6
>   #7 [ffff880274abbd90] pci_find_ext_capability at ffffffff81347225
>   #8 [ffff880274abbda0] get_device_error_info at ffffffff81356c4d
>   #9 [ffff880274abbdd0] aer_isr at ffffffff81357ab0
> #10 [ffff880274abbe28] process_one_work at ffffffff8105d4c0
> #11 [ffff880274abbe70] worker_thread at ffffffff8105e251
> #12 [ffff880274abbed0] kthread at ffffffff81064260
> #13 [ffff880274abbf50] ret_from_fork at ffffffff81773978
> crash>
> 
> 
> Regards,
> Gokul
> 
> On Thu, Aug 2, 2018 at 10:39 PM, Thomas Tai <thomas.tai@oracle.com 
> <mailto:thomas.tai@oracle.com>> wrote:
> 
> 
>     On 08/02/2018 11:07 AM, Lukas Wunner wrote:
> 
>         [cc += Thomas Tai]
> 
> 
>     Hi Lukas,
>     Thank you very much for cc me.
> 
> 
>         On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner wrote:
> 
>             On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> 
>                 I am suspecting a possible race condition in the kernel
>                 between PCI driver
>                 and AER handling.
> 
> 
>             The solution is to acquire a ref on each device in
>             add_error_device().
>             Then release the ref aer_process_err_devices() by calling
>             pci_dev_put().
> 
> 
>         So in case it wasn't clear, the below is what I had in mind.
>         Completely untested though.  Does this work for you?
> 
>         For v3.10 compatibility, cherry-pick 89ee9f768003 (or alternatively
>         cherry-pick 8496e85c20e7 and replace pci_dev_is_disconnected(dev)
>         with !pci_device_is_present(dev)).
> 
>         -- >8 --
>         Subject: [PATCH] PCI/AER: Fix use-after-free on surprise removal
> 
>         The work item to consume errors, aer_isr(), walks the hierarchy
>         using
>         pci_walk_bus() and stores a pointer to PCI devices which reported an
>         error in an array.  As long as pci_walk_bus() runs, those
>         pointers are
>         valid because pci_bus_sem is held.  But once pci_walk_bus()
>         finishes,
>         nothing prevents the pointers from becoming invalid, e.g. through
>         unplugging of the PCI devices.  The unprotected pointers are then
>         dereferenced in aer_process_err_devices(), which may oops:
> 
> 
>     I like your idea to increment the refcount during pci_walk_bus(),
>     that should fix the use-after-free issue. We just need Gokul to
>     confirm if it fixes his issue or not.
> 
>     Thanks,
>     Thomas
> 
> 
> 
>             #5  general_protection at ffffffff8176cdf2
>                 [exception RIP: pci_bus_read_config_dword+100]
>             #6  pci_find_next_ext_capability at ffffffff81345d7b
>             #7  pci_find_ext_capability at ffffffff81347225
>             #8  get_device_error_info at ffffffff81356c4d
>             #9  aer_isr at ffffffff81357a38
> 
>         Fix by holding a ref on the devices until they have been processed.
>         Skip processing of unplugged devices.
> 
>         Reported-by: gokul cg <gokuljnpr@gmail.com
>         <mailto:gokuljnpr@gmail.com>>
>         Signed-off-by: Lukas Wunner <lukas@wunner.de
>         <mailto:lukas@wunner.de>>
>         ---
>            drivers/pci/pcie/aer.c | 6 +++++-
>            1 file changed, 5 insertions(+), 1 deletion(-)
> 
>         diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>         index a2e8838..937592e 100644
>         --- a/drivers/pci/pcie/aer.c
>         +++ b/drivers/pci/pcie/aer.c
>         @@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev *dev, int
>         aer_severity,
>            static int add_error_device(struct aer_err_info *e_info,
>         struct pci_dev *dev)
>            {
>                  if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>         -               e_info->dev[e_info->error_dev_num] = dev;
>         +               e_info->dev[e_info->error_dev_num] =
>         pci_dev_get(dev);
>                          e_info->error_dev_num++;
>                          return 0;
>                  }
>         @@ -898,6 +898,9 @@ static int get_device_error_info(struct
>         pci_dev *dev, struct aer_err_info *info)
>                  if (!pos)
>                          return 0;
>            +     if (pci_dev_is_disconnected(dev))
>         +               return 0;
>         +
>                  if (info->severity == AER_CORRECTABLE) {
>                          pci_read_config_dword(dev, pos +
>         PCI_ERR_COR_STATUS,
>                                  &info->status);
>         @@ -948,6 +951,7 @@ static inline void
>         aer_process_err_devices(struct aer_err_info *e_info)
>                  for (i = 0; i < e_info->error_dev_num &&
>         e_info->dev[i]; i++) {
>                          if (get_device_error_info(e_info->dev[i], e_info))
>                                  handle_error_source(e_info->dev[i],
>         e_info);
>         +               pci_dev_put(e_info->dev[i]);
>                  }
>            }
> 
>
From 66b808099146166c44157600a166c8372172cd76 Mon Sep 17 00:00:00 2001
From: Keith Busch <keith.busch@intel.com>
Date: Tue, 27 Sep 2016 16:23:34 -0400
Subject: [PATCH] PCI/AER: Cache capability position

Save the position of the error reporting capability so it doesn't need to
be rediscovered during error handling.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pcie/aer/aerdrv.c      | 10 +++++-----
 drivers/pci/pcie/aer/aerdrv_core.c | 18 ++++++++++++------
 drivers/pci/probe.c                |  3 ++-
 include/linux/pci.h                |  5 +++++
 4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 08ce257..e99efaa 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -134,7 +134,7 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
 	pcie_capability_clear_word(pdev, PCI_EXP_RTCTL,
 				   SYSTEM_ERROR_INTR_ON_MESG_MASK);
 
-	aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	aer_pos = pdev->aer_cap;
 	/* Clear error status */
 	pci_read_config_dword(pdev, aer_pos + PCI_ERR_ROOT_STATUS, &reg32);
 	pci_write_config_dword(pdev, aer_pos + PCI_ERR_ROOT_STATUS, reg32);
@@ -173,7 +173,7 @@ static void aer_disable_rootport(struct aer_rpc *rpc)
 	 */
 	set_downstream_devices_error_reporting(pdev, false);
 
-	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	pos = pdev->aer_cap;
 	/* Disable Root's interrupt in response to error messages */
 	pci_read_config_dword(pdev, pos + PCI_ERR_ROOT_COMMAND, &reg32);
 	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
@@ -200,7 +200,7 @@ irqreturn_t aer_irq(int irq, void *context)
 	unsigned long flags;
 	int pos;
 
-	pos = pci_find_ext_capability(pdev->port, PCI_EXT_CAP_ID_ERR);
+	pos = pdev->port->aer_cap;
 	/*
 	 * Must lock access to Root Error Status Reg, Root Error ID Reg,
 	 * and Root error producer/consumer index
@@ -338,7 +338,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	u32 reg32;
 	int pos;
 
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 
 	/* Disable Root's interrupt in response to error messages */
 	pci_read_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, &reg32);
@@ -391,7 +391,7 @@ static void aer_error_resume(struct pci_dev *dev)
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
 
 	/* Clean AER Root Error Status */
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
 	if (dev->error_state == pci_channel_io_normal)
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 9fd18a0..b1303b3 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -35,7 +35,7 @@ int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 	if (pcie_aer_get_firmware_first(dev))
 		return -EIO;
 
-	if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR))
+	if (!dev->aer_cap)
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
@@ -57,7 +57,7 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	int pos;
 	u32 status;
 
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 	if (!pos)
 		return -EIO;
 
@@ -78,7 +78,7 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 	if (!pos)
 		return -EIO;
 
@@ -97,6 +97,12 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	return 0;
 }
 
+int pci_aer_init(struct pci_dev *dev)
+{
+	dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	return pci_cleanup_aer_error_status_regs(dev);
+}
+
 /**
  * add_error_device - list device to be handled
  * @e_info: pointer to error info
@@ -154,7 +160,7 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 	if (!(reg16 & PCI_EXP_AER_FLAGS))
 		return false;
 
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 	if (!pos)
 		return false;
 
@@ -551,7 +557,7 @@ static void handle_error_source(struct pcie_device *aerdev,
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+		pos = dev->aer_cap;
 		if (pos)
 			pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 					info->status);
@@ -643,7 +649,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	info->status = 0;
 	info->tlp_header_valid = 0;
 
-	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+	pos = dev->aer_cap;
 
 	/* The device might not support AER */
 	if (!pos)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 93f280d..1575724 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1666,7 +1666,8 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	/* Enable ACS P2P upstream forwarding */
 	pci_enable_acs(dev);
 
-	pci_cleanup_aer_error_status_regs(dev);
+	/* Advanced Error Reporting */
+	pci_aer_init(dev);
 }
 
 /*
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 57bc838..ab6b027 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -269,6 +269,9 @@ struct pci_dev {
 	unsigned int	class;		/* 3 bytes: (base,sub,prog-if) */
 	u8		revision;	/* PCI revision, low byte of class word */
 	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
+#ifdef CONFIG_PCIEAER
+	u16		aer_cap;	/* AER capability offset */
+#endif
 	u8		pcie_cap;	/* PCIe capability offset */
 	u8		msi_cap;	/* MSI capability offset */
 	u8		msix_cap;	/* MSI-X capability offset */
@@ -1369,9 +1372,11 @@ static inline int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
 #ifdef CONFIG_PCIEAER
 void pci_no_aer(void);
 bool pci_aer_available(void);
+int pci_aer_init(struct pci_dev *dev);
 #else
 static inline void pci_no_aer(void) { }
 static inline bool pci_aer_available(void) { return false; }
+static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
 #endif
 
 #ifdef CONFIG_PCIE_ECRC
Thomas Tai Aug. 8, 2018, 8:49 p.m. UTC | #12
On 08/08/2018 07:21 AM, gokul cg wrote:
> Thanks Thomas,
> 
> With patch you suggested , panic has gone away from 
> 'pci_find_next_ext_capability' as we not using inside aer_isr , but now 
> it hits at pci_bus_read_config_dword.

Hmm, that's too bad. You probably are right, the dev->bus->ops->read may 
be corrupted. I am wondering can you print out the dev->bus->ops->read 
in normal working condition and compare it with the surprise power off.

By the way, I am using following aer-inject tools:
https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Others thoughts is that I did pci_stop_and_remove_bus_device() and then 
call pci_read_config_dword() I still don't get the protection fault. Do 
your device driver do something special when it detects the power off? 
Or, may be the BIOS/UEFI did something to prevent the configure read?

Regards,
Thomas

> 
> -------------------xxxxxxx bt og xxxxxxxx-----------------"
> PID: 24     TASK: ffff880274ac0000  CPU: 0   COMMAND: "kworker/0:1"
>   #0 [ffff880274abbb18] machine_kexec at ffffffff8102cf18
>   #1 [ffff880274abbb78] crash_kexec at ffffffff810a6b05
>   #2 [ffff880274abbc40] oops_end at ffffffff8176d960
>   #3 [ffff880274abbc68] die at ffffffff810060db
>   #4 [ffff880274abbc98] do_general_protection at ffffffff8176d452
>   #5 [ffff880274abbcc0] general_protection at ffffffff8176cdf2
>      [exception RIP: pci_bus_read_config_dword+100]
>      RIP: ffffffff813405f4  RSP: ffff880274abbd70  RFLAGS: 00010046
>      RAX: 455a494c41495449  RBX: ffff880274891800  RCX: 0000000000000004
>      RDX: 0000000000000110  RSI: 0000000000000060  RDI: ffff880274891800
>      RBP: ffff880274abbd98   R8: ffff880274abbd7c   R9: 00000000000011b5
>      R10: 0000000000000000  R11: 00000000000011b4  R12: ffff8802741a0210
>      R13: 0000000000000246  R14: ffff880272afc008  R15: ffff880272af8800
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   #6 [ffff880274abbda0] get_device_error_info at ffffffff81356d74
>   #7 [ffff880274abbdd0] aer_isr at ffffffff81357b41
>   #8 [ffff880274abbe28] process_one_work at ffffffff8105d4c0
>   #9 [ffff880274abbe70] worker_thread at ffffffff8105e251
> #10 [ffff880274abbed0] kthread at ffffffff81064260
> #11 [ffff880274abbf50] ret_from_fork at ffffffff81773a38"
> 
> -------------------xxxxxxx bt og end xxxxxxxx-----------------
> 
> 
> Regards,
> Gokul
> 
> 
> On Tue, Aug 7, 2018 at 9:00 PM, Thomas Tai <thomas.tai@oracle.com 
> <mailto:thomas.tai@oracle.com>> wrote:
> 
>     Hi Gokul,
>     Something pop up in my mind and want to share with you. I assume
>     that your device is not a root port device or a switch device. I
>     assume when you power off the device, a FATAL error is sent to the
>     root port thus trigger the aer_isr.
> 
>     Since it is a fatal error and your device is not a switch device,
>     the code should not reach out your device because fatal error means
>     that the link to your device is not reliable. So the
>     pci_find_ext_capability() looks strange to me. When compare the code
>     with the master branch. v3.10 is missing following patch. Would you
>     think you can give it a try?
> 
>     commit 66b808099146166c44157600a166c8372172cd76
>     Author: Keith Busch <keith.busch@intel.com
>     <mailto:keith.busch@intel.com>>
>     Date:   Tue Sep 27 16:23:34 2016 -0400
> 
>          PCI/AER: Cache capability position
> 
>          Save the position of the error reporting capability so it
>     doesn't need to
>          be rediscovered during error handling.
> 
>          Signed-off-by: Keith Busch <keith.busch@intel.com
>     <mailto:keith.busch@intel.com>>
>          Signed-off-by: Bjorn Helgaas <bhelgaas@google.com
>     <mailto:bhelgaas@google.com>>
>          CC: Lukas Wunner <lukas@wunner.de <mailto:lukas@wunner.de>>
> 
>     - Thomas
> 
> 
>     On 08/06/2018 02:33 PM, gokul cg wrote:
> 
>         Hi,
> 
>         I have tried with following patch and I am still getting same
>         kernel panic.
> 
>         -------------X++++++++++++++++++++X---------------------
> 
>         diff --git a/drivers/pci/pcie/aer/aerdrv_core.c
>         b/drivers/pci/pcie/aer/aerdrv_core.c
>         index 0f4554e..05592aa 100644
>         --- a/drivers/pci/pcie/aer/aerdrv_core.c
>         +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>         @@ -26,6 +26,7 @@
>            #include <linux/slab.h>
>            #include <linux/kfifo.h>
>            #include "aerdrv.h"
>         +#include "../../pci.h"
> 
>            static bool forceload;
>            static bool nosourceid;
>         @@ -82,7 +82,7 @@
>         EXPORT_SYMBOL_GPL(pci_cleanup_aer_uncorrect_error_status);
>            static int add_error_device(struct aer_err_info *e_info,
>         struct pci_dev *dev)
>            {
>         if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>         -e_info->dev[e_info->error_dev_num] = dev;
>         +e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> 
>         e_info->error_dev_num++;
>         return 0;
>         }
>         @@ -659,6 +659,9 @@ static int get_device_error_info(struct
>         pci_dev *dev, struct aer_err_info *info)
>         if (!pos)
>         return 1;
> 
>         +        if (pci_dev_is_disconnected(dev))
>         +                return 0;
>         +
>         if (info->severity == AER_CORRECTABLE) {
>         pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
>         &info->status);
>         @@ -710,6 +713,8 @@ static inline void
>         aer_process_err_devices(struct pcie_device *p_device,
>         for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>         if (get_device_error_info(e_info->dev[i], e_info))
>         handle_error_source(p_device, e_info->dev[i], e_info);
>         +
>         +                pci_dev_put(e_info->dev[i]);
>         }
>            }
>         -------------X++++++++++++++++++++X---------------------
> 
> 
>         Note: I have configured CONFIG_HOTPLUG_PCI_PCIE and
>         CONFIG_HOTPLUG_PCI as modules and  loading in start up using script.
> 
>         root@/proc/:~# cat config | grep -i HOT
>         CONFIG_TICK_ONESHOT=y
>         CONFIG_HOTPLUG=y
>         # CONFIG_MEMORY_HOTPLUG is not set
>         CONFIG_HOTPLUG_CPU=y
>         # CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
>         # CONFIG_DEBUG_HOTPLUG_CPU0 is not set
>         CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
>         CONFIG_ACPI_HOTPLUG_CPU=y
>         CONFIG_HOTPLUG_PCI_PCIE=m
>         CONFIG_HOTPLUG_PCI=m
>         # CONFIG_HOTPLUG_PCI_CPCI is not set
>         # CONFIG_HOTPLUG_PCI_SHPC is not set
>         CONFIG_DM_SNAPSHOT=y
>         # CONFIG_USB_STORAGE_JUMPSHOT is not set
>         # CONFIG_TRACER_SNAPSHOT is not set
>         root@/proc/:~#
> 
>         Panic back trace :
>         crash> bt
>         PID: 24     TASK: ffff880274ac0000  CPU: 0   COMMAND: "kworker/0:1"
>            #0 [ffff880274abbac8] machine_kexec at ffffffff8102cf18
>            #1 [ffff880274abbb28] crash_kexec at ffffffff810a6b05
>            #2 [ffff880274abbbf0] oops_end at ffffffff8176d8a0
>            #3 [ffff880274abbc18] die at ffffffff810060db
>            #4 [ffff880274abbc48] do_general_protection at ffffffff8176d392
>            #5 [ffff880274abbc70] general_protection at ffffffff8176cd32
>               [exception RIP: pci_bus_read_config_dword+100]
>               RIP: ffffffff813405f4  RSP: ffff880274abbd20  RFLAGS: 00010046
>               RAX: 435f494350006963  RBX: ffff880274891800  RCX:
>         0000000000000004
>               RDX: 0000000000000ffc  RSI: 0000000000000060  RDI:
>         ffff880274891800
>               RBP: ffff880274abbd48   R8: ffff880274abbd2c   R9:
>         00000000000002b8
>               R10: ffff880274340000  R11: 0000000000000246  R12:
>         ffff880274abbd5c
>               R13: 0000000000000246  R14: 0000000000000000  R15:
>         ffff880274920000
>               ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>            #6 [ffff880274abbd50] pci_find_next_ext_capability at
>         ffffffff81345db6
>            #7 [ffff880274abbd90] pci_find_ext_capability at ffffffff81347225
>            #8 [ffff880274abbda0] get_device_error_info at ffffffff81356c4d
>            #9 [ffff880274abbdd0] aer_isr at ffffffff81357ab0
>         #10 [ffff880274abbe28] process_one_work at ffffffff8105d4c0
>         #11 [ffff880274abbe70] worker_thread at ffffffff8105e251
>         #12 [ffff880274abbed0] kthread at ffffffff81064260
>         #13 [ffff880274abbf50] ret_from_fork at ffffffff81773978
>         crash>
> 
> 
>         Regards,
>         Gokul
> 
>         On Thu, Aug 2, 2018 at 10:39 PM, Thomas Tai
>         <thomas.tai@oracle.com <mailto:thomas.tai@oracle.com>
>         <mailto:thomas.tai@oracle.com <mailto:thomas.tai@oracle.com>>>
>         wrote:
> 
> 
>              On 08/02/2018 11:07 AM, Lukas Wunner wrote:
> 
>                  [cc += Thomas Tai]
> 
> 
>              Hi Lukas,
>              Thank you very much for cc me.
> 
> 
>                  On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner
>         wrote:
> 
>                      On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg
>         wrote:
> 
>                          I am suspecting a possible race condition in
>         the kernel
>                          between PCI driver
>                          and AER handling.
> 
> 
>                      The solution is to acquire a ref on each device in
>                      add_error_device().
>                      Then release the ref aer_process_err_devices() by
>         calling
>                      pci_dev_put().
> 
> 
>                  So in case it wasn't clear, the below is what I had in
>         mind.
>                  Completely untested though.  Does this work for you?
> 
>                  For v3.10 compatibility, cherry-pick 89ee9f768003 (or
>         alternatively
>                  cherry-pick 8496e85c20e7 and replace
>         pci_dev_is_disconnected(dev)
>                  with !pci_device_is_present(dev)).
> 
>                  -- >8 --
>                  Subject: [PATCH] PCI/AER: Fix use-after-free on
>         surprise removal
> 
>                  The work item to consume errors, aer_isr(), walks the
>         hierarchy
>                  using
>                  pci_walk_bus() and stores a pointer to PCI devices
>         which reported an
>                  error in an array.  As long as pci_walk_bus() runs, those
>                  pointers are
>                  valid because pci_bus_sem is held.  But once pci_walk_bus()
>                  finishes,
>                  nothing prevents the pointers from becoming invalid,
>         e.g. through
>                  unplugging of the PCI devices.  The unprotected
>         pointers are then
>                  dereferenced in aer_process_err_devices(), which may oops:
> 
> 
>              I like your idea to increment the refcount during
>         pci_walk_bus(),
>              that should fix the use-after-free issue. We just need Gokul to
>              confirm if it fixes his issue or not.
> 
>              Thanks,
>              Thomas
> 
> 
> 
>                      #5  general_protection at ffffffff8176cdf2
>                          [exception RIP: pci_bus_read_config_dword+100]
>                      #6  pci_find_next_ext_capability at ffffffff81345d7b
>                      #7  pci_find_ext_capability at ffffffff81347225
>                      #8  get_device_error_info at ffffffff81356c4d
>                      #9  aer_isr at ffffffff81357a38
> 
>                  Fix by holding a ref on the devices until they have
>         been processed.
>                  Skip processing of unplugged devices.
> 
>                  Reported-by: gokul cg <gokuljnpr@gmail.com
>         <mailto:gokuljnpr@gmail.com>
>                  <mailto:gokuljnpr@gmail.com <mailto:gokuljnpr@gmail.com>>>
>                  Signed-off-by: Lukas Wunner <lukas@wunner.de
>         <mailto:lukas@wunner.de>
>                  <mailto:lukas@wunner.de <mailto:lukas@wunner.de>>>
> 
>                  ---
>                     drivers/pci/pcie/aer.c | 6 +++++-
>                     1 file changed, 5 insertions(+), 1 deletion(-)
> 
>                  diff --git a/drivers/pci/pcie/aer.c
>         b/drivers/pci/pcie/aer.c
>                  index a2e8838..937592e 100644
>                  --- a/drivers/pci/pcie/aer.c
>                  +++ b/drivers/pci/pcie/aer.c
>                  @@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev
>         *dev, int
>                  aer_severity,
>                     static int add_error_device(struct aer_err_info *e_info,
>                  struct pci_dev *dev)
>                     {
>                           if (e_info->error_dev_num <
>         AER_MAX_MULTI_ERR_DEVICES) {
>                  -               e_info->dev[e_info->error_dev_num] = dev;
>                  +               e_info->dev[e_info->error_dev_num] =
>                  pci_dev_get(dev);
>                                   e_info->error_dev_num++;
>                                   return 0;
>                           }
>                  @@ -898,6 +898,9 @@ static int get_device_error_info(struct
>                  pci_dev *dev, struct aer_err_info *info)
>                           if (!pos)
>                                   return 0;
>                     +     if (pci_dev_is_disconnected(dev))
>                  +               return 0;
>                  +
>                           if (info->severity == AER_CORRECTABLE) {
>                                   pci_read_config_dword(dev, pos +
>                  PCI_ERR_COR_STATUS,
>                                           &info->status);
>                  @@ -948,6 +951,7 @@ static inline void
>                  aer_process_err_devices(struct aer_err_info *e_info)
>                           for (i = 0; i < e_info->error_dev_num &&
>                  e_info->dev[i]; i++) {
>                                   if
>         (get_device_error_info(e_info->dev[i], e_info))
>                                          
>         handle_error_source(e_info->dev[i],
>                  e_info);
>                  +               pci_dev_put(e_info->dev[i]);
>                           }
>                     }
> 
> 
>
Bjorn Helgaas Sept. 4, 2018, 5:53 p.m. UTC | #13
On Tue, Jul 31, 2018 at 07:50:37AM +0200, Lukas Wunner wrote:
> When removing PCI devices below a hotplug bridge, pciehp marks them as
> disconnected if the card is no longer present in the slot or it quiesces
> them if the card is still present (by disabling INTx interrupts, bus
> mastering and SERR# reporting).
> 
> To detect whether the card is still present, pciehp checks the Presence
> Detect State bit in the Slot Status register.  The problem with this
> approach is that even if the card is present, the link to it may be
> down, and it that case it would be better to mark the devices as
> disconnected instead of trying to quiesce them.  Moreover, if the card
> in the slot was quickly replaced by another one, the Presence Detect
> State bit would be set, yet trying to quiesce the new card's devices
> would be wrong and the correct thing to do is to mark the previous
> card's devices as disconnected.
> 
> Instead of looking at the Presence Detect State bit, it is better to
> differentiate whether the card was surprise removed versus safely
> removed (via sysfs or an Attention Button press).  On surprise removal,
> the devices should be marked as disconnected, whereas on safe removal it
> is correct to quiesce the devices.
> 
> The knowledge whether a surprise removal or a safe removal is at hand
> does exist further up in the call stack:  A surprise removal is
> initiated by pciehp_handle_presence_or_link_change(), a safe removal by
> pciehp_handle_disable_request().
> 
> Pass that information down to pciehp_unconfigure_device() and use it in
> lieu of the Presence Detect State bit.  While there, add kernel-doc to
> pciehp_unconfigure_device() and pciehp_configure_device().
> 
> Tested-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Keith Busch <keith.busch@intel.com>

Applied to pci/hotplug for v4.20, thanks!

> ---
>  drivers/pci/hotplug/pciehp.h      |  2 +-
>  drivers/pci/hotplug/pciehp_ctrl.c | 22 +++++++++++++---------
>  drivers/pci/hotplug/pciehp_pci.c  | 23 ++++++++++++++++++++---
>  3 files changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 811cf83f956d..39c9c8815a35 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -181,7 +181,7 @@ void pciehp_handle_button_press(struct slot *slot);
>  void pciehp_handle_disable_request(struct slot *slot);
>  void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events);
>  int pciehp_configure_device(struct slot *p_slot);
> -void pciehp_unconfigure_device(struct slot *p_slot);
> +void pciehp_unconfigure_device(struct slot *p_slot, bool presence);
>  void pciehp_queue_pushbutton_work(struct work_struct *work);
>  struct controller *pcie_init(struct pcie_device *dev);
>  int pcie_init_notification(struct controller *ctrl);
> diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
> index 6855933ab372..7932e70e9f29 100644
> --- a/drivers/pci/hotplug/pciehp_ctrl.c
> +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> @@ -26,6 +26,9 @@
>     hotplug controller logic
>   */
>  
> +#define SAFE_REMOVAL	 true
> +#define SURPRISE_REMOVAL false
> +
>  static void set_slot_off(struct controller *ctrl, struct slot *pslot)
>  {
>  	/* turn off slot, turn on Amber LED, turn off Green LED if supported*/
> @@ -101,12 +104,13 @@ static int board_added(struct slot *p_slot)
>  /**
>   * remove_board - Turns off slot and LEDs
>   * @p_slot: slot where board is being removed
> + * @safe_removal: whether the board is safely removed (versus surprise removed)
>   */
> -static void remove_board(struct slot *p_slot)
> +static void remove_board(struct slot *p_slot, bool safe_removal)
>  {
>  	struct controller *ctrl = p_slot->ctrl;
>  
> -	pciehp_unconfigure_device(p_slot);
> +	pciehp_unconfigure_device(p_slot, safe_removal);
>  
>  	if (POWER_CTRL(ctrl)) {
>  		pciehp_power_off_slot(p_slot);
> @@ -124,7 +128,7 @@ static void remove_board(struct slot *p_slot)
>  }
>  
>  static int pciehp_enable_slot(struct slot *slot);
> -static int pciehp_disable_slot(struct slot *slot);
> +static int pciehp_disable_slot(struct slot *slot, bool safe_removal);
>  
>  void pciehp_request(struct controller *ctrl, int action)
>  {
> @@ -215,7 +219,7 @@ void pciehp_handle_disable_request(struct slot *slot)
>  	slot->state = POWEROFF_STATE;
>  	mutex_unlock(&slot->lock);
>  
> -	ctrl->request_result = pciehp_disable_slot(slot);
> +	ctrl->request_result = pciehp_disable_slot(slot, SAFE_REMOVAL);
>  }
>  
>  void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events)
> @@ -241,7 +245,7 @@ void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events)
>  		if (events & PCI_EXP_SLTSTA_PDC)
>  			ctrl_info(ctrl, "Slot(%s): Card not present\n",
>  				  slot_name(slot));
> -		pciehp_disable_slot(slot);
> +		pciehp_disable_slot(slot, SURPRISE_REMOVAL);
>  		break;
>  	default:
>  		mutex_unlock(&slot->lock);
> @@ -324,7 +328,7 @@ static int pciehp_enable_slot(struct slot *slot)
>  	return ret;
>  }
>  
> -static int __pciehp_disable_slot(struct slot *p_slot)
> +static int __pciehp_disable_slot(struct slot *p_slot, bool safe_removal)
>  {
>  	u8 getstatus = 0;
>  	struct controller *ctrl = p_slot->ctrl;
> @@ -338,17 +342,17 @@ static int __pciehp_disable_slot(struct slot *p_slot)
>  		}
>  	}
>  
> -	remove_board(p_slot);
> +	remove_board(p_slot, safe_removal);
>  	return 0;
>  }
>  
> -static int pciehp_disable_slot(struct slot *slot)
> +static int pciehp_disable_slot(struct slot *slot, bool safe_removal)
>  {
>  	struct controller *ctrl = slot->ctrl;
>  	int ret;
>  
>  	pm_runtime_get_sync(&ctrl->pcie->port->dev);
> -	ret = __pciehp_disable_slot(slot);
> +	ret = __pciehp_disable_slot(slot, safe_removal);
>  	pm_runtime_put(&ctrl->pcie->port->dev);
>  
>  	mutex_lock(&slot->lock);
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index 5c58c22e0c08..18f83e554c73 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -20,6 +20,14 @@
>  #include "../pci.h"
>  #include "pciehp.h"
>  
> +/**
> + * pciehp_configure_device() - enumerate PCI devices below a hotplug bridge
> + * @p_slot: PCIe hotplug slot
> + *
> + * Enumerate PCI devices below a hotplug bridge and add them to the system.
> + * Return 0 on success, %-EEXIST if the devices are already enumerated or
> + * %-ENODEV if enumeration failed.
> + */
>  int pciehp_configure_device(struct slot *p_slot)
>  {
>  	struct pci_dev *dev;
> @@ -62,9 +70,19 @@ int pciehp_configure_device(struct slot *p_slot)
>  	return ret;
>  }
>  
> -void pciehp_unconfigure_device(struct slot *p_slot)
> +/**
> + * pciehp_unconfigure_device() - remove PCI devices below a hotplug bridge
> + * @p_slot: PCIe hotplug slot
> + * @presence: whether the card is still present in the slot;
> + *	true for safe removal via sysfs or an Attention Button press,
> + *	false for surprise removal
> + *
> + * Unbind PCI devices below a hotplug bridge from their drivers and remove
> + * them from the system.  Safely removed devices are quiesced.  Surprise
> + * removed devices are marked as such to prevent further accesses.
> + */
> +void pciehp_unconfigure_device(struct slot *p_slot, bool presence)
>  {
> -	u8 presence = 0;
>  	struct pci_dev *dev, *temp;
>  	struct pci_bus *parent = p_slot->ctrl->pcie->port->subordinate;
>  	u16 command;
> @@ -72,7 +90,6 @@ void pciehp_unconfigure_device(struct slot *p_slot)
>  
>  	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
>  		 __func__, pci_domain_nr(parent), parent->number);
> -	pciehp_get_adapter_status(p_slot, &presence);
>  
>  	pci_lock_rescan_remove();
>  
> -- 
> 2.18.0
>
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 811cf83f956d..39c9c8815a35 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -181,7 +181,7 @@  void pciehp_handle_button_press(struct slot *slot);
 void pciehp_handle_disable_request(struct slot *slot);
 void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events);
 int pciehp_configure_device(struct slot *p_slot);
-void pciehp_unconfigure_device(struct slot *p_slot);
+void pciehp_unconfigure_device(struct slot *p_slot, bool presence);
 void pciehp_queue_pushbutton_work(struct work_struct *work);
 struct controller *pcie_init(struct pcie_device *dev);
 int pcie_init_notification(struct controller *ctrl);
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index 6855933ab372..7932e70e9f29 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -26,6 +26,9 @@ 
    hotplug controller logic
  */
 
+#define SAFE_REMOVAL	 true
+#define SURPRISE_REMOVAL false
+
 static void set_slot_off(struct controller *ctrl, struct slot *pslot)
 {
 	/* turn off slot, turn on Amber LED, turn off Green LED if supported*/
@@ -101,12 +104,13 @@  static int board_added(struct slot *p_slot)
 /**
  * remove_board - Turns off slot and LEDs
  * @p_slot: slot where board is being removed
+ * @safe_removal: whether the board is safely removed (versus surprise removed)
  */
-static void remove_board(struct slot *p_slot)
+static void remove_board(struct slot *p_slot, bool safe_removal)
 {
 	struct controller *ctrl = p_slot->ctrl;
 
-	pciehp_unconfigure_device(p_slot);
+	pciehp_unconfigure_device(p_slot, safe_removal);
 
 	if (POWER_CTRL(ctrl)) {
 		pciehp_power_off_slot(p_slot);
@@ -124,7 +128,7 @@  static void remove_board(struct slot *p_slot)
 }
 
 static int pciehp_enable_slot(struct slot *slot);
-static int pciehp_disable_slot(struct slot *slot);
+static int pciehp_disable_slot(struct slot *slot, bool safe_removal);
 
 void pciehp_request(struct controller *ctrl, int action)
 {
@@ -215,7 +219,7 @@  void pciehp_handle_disable_request(struct slot *slot)
 	slot->state = POWEROFF_STATE;
 	mutex_unlock(&slot->lock);
 
-	ctrl->request_result = pciehp_disable_slot(slot);
+	ctrl->request_result = pciehp_disable_slot(slot, SAFE_REMOVAL);
 }
 
 void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events)
@@ -241,7 +245,7 @@  void pciehp_handle_presence_or_link_change(struct slot *slot, u32 events)
 		if (events & PCI_EXP_SLTSTA_PDC)
 			ctrl_info(ctrl, "Slot(%s): Card not present\n",
 				  slot_name(slot));
-		pciehp_disable_slot(slot);
+		pciehp_disable_slot(slot, SURPRISE_REMOVAL);
 		break;
 	default:
 		mutex_unlock(&slot->lock);
@@ -324,7 +328,7 @@  static int pciehp_enable_slot(struct slot *slot)
 	return ret;
 }
 
-static int __pciehp_disable_slot(struct slot *p_slot)
+static int __pciehp_disable_slot(struct slot *p_slot, bool safe_removal)
 {
 	u8 getstatus = 0;
 	struct controller *ctrl = p_slot->ctrl;
@@ -338,17 +342,17 @@  static int __pciehp_disable_slot(struct slot *p_slot)
 		}
 	}
 
-	remove_board(p_slot);
+	remove_board(p_slot, safe_removal);
 	return 0;
 }
 
-static int pciehp_disable_slot(struct slot *slot)
+static int pciehp_disable_slot(struct slot *slot, bool safe_removal)
 {
 	struct controller *ctrl = slot->ctrl;
 	int ret;
 
 	pm_runtime_get_sync(&ctrl->pcie->port->dev);
-	ret = __pciehp_disable_slot(slot);
+	ret = __pciehp_disable_slot(slot, safe_removal);
 	pm_runtime_put(&ctrl->pcie->port->dev);
 
 	mutex_lock(&slot->lock);
diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index 5c58c22e0c08..18f83e554c73 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -20,6 +20,14 @@ 
 #include "../pci.h"
 #include "pciehp.h"
 
+/**
+ * pciehp_configure_device() - enumerate PCI devices below a hotplug bridge
+ * @p_slot: PCIe hotplug slot
+ *
+ * Enumerate PCI devices below a hotplug bridge and add them to the system.
+ * Return 0 on success, %-EEXIST if the devices are already enumerated or
+ * %-ENODEV if enumeration failed.
+ */
 int pciehp_configure_device(struct slot *p_slot)
 {
 	struct pci_dev *dev;
@@ -62,9 +70,19 @@  int pciehp_configure_device(struct slot *p_slot)
 	return ret;
 }
 
-void pciehp_unconfigure_device(struct slot *p_slot)
+/**
+ * pciehp_unconfigure_device() - remove PCI devices below a hotplug bridge
+ * @p_slot: PCIe hotplug slot
+ * @presence: whether the card is still present in the slot;
+ *	true for safe removal via sysfs or an Attention Button press,
+ *	false for surprise removal
+ *
+ * Unbind PCI devices below a hotplug bridge from their drivers and remove
+ * them from the system.  Safely removed devices are quiesced.  Surprise
+ * removed devices are marked as such to prevent further accesses.
+ */
+void pciehp_unconfigure_device(struct slot *p_slot, bool presence)
 {
-	u8 presence = 0;
 	struct pci_dev *dev, *temp;
 	struct pci_bus *parent = p_slot->ctrl->pcie->port->subordinate;
 	u16 command;
@@ -72,7 +90,6 @@  void pciehp_unconfigure_device(struct slot *p_slot)
 
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
-	pciehp_get_adapter_status(p_slot, &presence);
 
 	pci_lock_rescan_remove();