diff mbox

[v2,04/10] PCI: Destroy pci dev only once

Message ID 33026900.FbCAopDPFz@vostro.rjw.lan (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Rafael J. Wysocki Dec. 2, 2013, 2:49 p.m. UTC
On Monday, December 02, 2013 02:29:46 AM Rafael J. Wysocki wrote:
> On Sunday, December 01, 2013 02:24:33 AM Rafael J. Wysocki wrote:
> > On Saturday, November 30, 2013 02:27:15 PM Yinghai Lu wrote:
> > > On Sat, Nov 30, 2013 at 1:37 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > On Saturday, November 30, 2013 01:31:33 AM Rafael J. Wysocki wrote:
> > > >> On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
> > > >> > On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> > > >> > > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> > > >> > > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >> > > > >
> > > >> > > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > > >> > > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > > >> > > > > finds that it is OK to do the device_del() and put_device() which causes
> > > >> > > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > > >> > > > > on the already freed device object and crashes the kernel.
> > > >> > > > >
> > > >> > > > thread2 should still hold one extra reference.
> > > >> > > > that is in
> > > >> > > >   device_schedule_callback
> > > >> > > >      ==> sysfs_schedule_callback
> > > >> > > >          ==> kobject_get(kobj)
> > > >> > > >
> > > >> > > > pci_destroy_dev for thread2 is called at this point.
> > > >> > > >
> > > >> > > > and that reference will be released from
> > > >> > > >         sysfs_schedule_callback
> > > >> > > >         ==> kobject_put()...
> > > >> > >
> > > >> > > Well, that would be the case if thread 2 was started by device_schedule_callback(),
> > > >> > > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> > > >> > > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> > > >> > > is racy anyway, because it walks bus->devices without locking.  Which is my
> > > >> > > fault too, because I overlooked that.  Shame, shame.]
> > > >> > >
> > > 
> > > can you add extra reference to that path?
> > 
> > hotplug_event_work()
> > 	hotplug_event()
> > 		acpiphp_check_bridge()
> > 			trim_stale_devices()
> > 				pci_stop_and_remove_bus_device()
> > 
> > Yes, it should hold a reference to dev, but adding it there doesn't really help,
> > because there are list walks over &bus->devices in acpiphp_check_bridge() and
> > trim_stale_devices() that are racy with respect to pci_stop_and_remove_bus_device()
> > run from device_schedule_callback().
> > 
> > > >> > > Perhaps we can do something like the (untested) patch below (in addition to the
> > > >> > > $subject patch).  Do you see any immediate problems with it?
> > > >> >
> > > >> > Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
> > > >> > So much for being clever.
> > > >> >
> > > >> > Moreover, it looks like those two routines above are racy too for the same
> > > >> > reason?
> > > >>
> > > >> The (still untested) patch below is what I have come up with for now.  The
> > > >> is_gone flag is now only operated under pci_remove_rescan_mutex, so it need
> > > >> not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
> > > >> (the "locked" one) should hold a ref to the device being removed to avoid
> > > >> use-after-free (the callers need to be audited for that).
> > > 
> > > if you can use device_schedule_...,
> > 
> > No, I can't.  I need to hold acpi_scan_lock taken in hotplug_event_work()
> > throughout all bus trimming/scanning and I need to protect list walks over
> > &bus->devices too.
> > 
> > > should have hold reference may be
> > > atomic would be better than lock/unlock everywhere?
> > 
> > The locking is necessary not only for the device removal itself, but also for
> > the safety of the &bus->devices list walks.
> > 
> > Besides, remove_callback() in remove.c already holds pci_remove_rescan_mutex
> > around pci_stop_and_remove_bus_device() and I don't see how it would be safe
> > to run pci_stop_and_remove_bus_device() without holding that mutex from
> > anywhere else.
> > 
> > For one example, pci_stop_and_remove_bus_device() that is not run under
> > pci_remove_rescan_mutex can race with the stuff called under that mutex
> > in dev_bus_rescan_store() (and elsewhere in pci-sysfs.c).
> > 
> > So either pci_remove_rescan_mutex is useless and should be dropped, or
> > it is there for a purpose, in which case it needs to be used around
> > pci_stop_and_remove_bus_device() everywhere.  There's no other possibility
> > and to my eyes that mutex is necessary.
> 
> So below is a new version of the patch (which has been tested on my Thunderbolt
> rig without visibly breaking anything) with the description of all the problems
> it attempts to address.  If any of the scenarios described in the changelog are
> not possible for some reason, please tell me why that is the case.  I couldn't
> find such reasons myself.

And I forgot about the ACPIPHP slot "power" attribute that also may trigger
race conditions with device removal.  Updated patch follows.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal

The following are concurrency problems related to the PCI device
removal code in pci-sysfs.c and in ACPIPHP present in the current
mainline kernel:

Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
  the same top device from remove_callback() in pci-sysfs.c and
  from trim_stale_devices() in acpiphp_glue.c.

  In this scenario the second code path is executed without
  pci_remove_rescan_mutex locked, so the &bus->devices list
  walks in either trim_stale_devices() itself or in
  acpiphp_check_bridge() can suffer from list corruption while the
  first code path is executing pci_destroy_dev() for one of the
  devices on those lists.

  Moreover, if any of the device objects in question is freed
  after pci_destroy_dev() executed by the first code path, the
  second code path may suffer a use-after-free problem while
  trying to access that device object.

  Conversely, the second code path may execute pci_destroy_dev()
  for one of the devices in question such that one of the
  &bus->devices list walks in pci_stop_bus_device()
  or pci_remove_bus_device() executed by the first code path will
  suffer from a list corruption.

  Moreover, use-after-free is also possible if one of the device
  objects in question is freed as a result of calling
  pci_destroy_dev() by the second code path and then the first
  code path tries to access it (the first code path only holds
  an extra reference to the device it has been run for, but not
  for its child devices).

Scenario 2: ACPI hotplug event occurs for a device under a bridge
  being removed by pci_stop_and_remove_bus_device() run from
  remove_callback() in pci-sysfs.c.

  In that case it doesn't make sense to handle the hotplug event,
  because the device in question will be removed anyway along with
  its parent bridge and that will cause the context objects needed
  for hotplug handling to be freed as well.

  Moreover, if the event is handled regardless, it may cause one
  or more devices already removed by pci_stop_and_remove_bus_device()
  to be added again by the code handling the event, which will
  conflict with the bridge removal.

Scenario 3: pci_stop_and_remove_bus_device() is run from
  trim_stale_devices() (as a result of an ACPI hotplug event) in
  parallel with dev_bus_rescan_store() or bus_rescan_store(),
  or dev_rescan_store().

  In that scenario the second code path may attempt to operate
  on device objects being removed by the first code path which
  may lead to many interesting types of breakage.

Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
  host bridge removal event) in  parallel with bus_rescan_store(),
  dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
  for any devices under the host bridge in question.

  In that case the same symptoms as in Scenarios 1 and 3 may occur
  depending on which code path wins the races involved.

Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
  for a device and its parent bridge via remove_callback().

  In that case both code paths attempt to acquire
  pci_remove_rescan_mutex.  If the child device removal acquires
  it first, there will be no problems.  However, if the parent
  bridge removal acquires it first, it will eventually execute
  pci_destroy_dev() for the child device, but that device will
  not be freed yet due to the reference held by the concurrent
  child removal.  Consequently, both pci_stop_bus_device() and
  pci_remove_bus_device() will be executed for that device
  unnecessarily and pci_destroy_dev() will see a corrupted list
  head in that object.  Moreover, an excess put_device() will
  be executed for that device in that case which may lead to a
  use-after-free in the final kobject_put() done by
  sysfs_schedule_callback_work().

Scenario 6: ACPIPHP slot enabling/disabling triggered by the
  slot's "power" attribute in parallel with device removal run
  from remove_callback().

  This scenario may lead to race conditions analogous to the
  ones described in Scenario 1.  It also may lead to situations
  in which an already removed device under a bridge scheduled
  for removal will be added which is analogous to Scenario 2.

All of these scenarios are addressed by the patch below as follows.

(1) To prevent the races in Scenarios 1 and 3 from happening hold
    pci_remove_rescan_mutex around hotplug_event() in
    hotplug_event_work(() (acpiphp_glue.c).

(2) To prevent the races in Scenario 2 from happening, add an ACPIPHP
    bridge flag is_going_away indicating that hotplug events should
    be ignored for children below that bridge.  That flag is set
    by cleanup_bridge() that for non-root bridges should be run
    under pci_remove_rescan_mutex (for root bridges it is only
    run under acpi_scan_lock anyway).

(3) To prevent the races in Scenario 4 from happening, hold
    pci_remove_rescan_mutex around pci_stop_root_bus() and
    pci_remove_root_bus() in acpi_pci_root_remove().

(4) To prevent the races in Scenario 5 from happening, add an new
    is_gone flag to struct pci_dev that will be set by pci_destroy_dev()
    and checked by pci_stop_and_remove_bus_device().  That only
    covers cases in which pci_stop_and_remove_bus_device() is
    run under pci_remove_rescan_mutex, but the other existing
    cases need to be fixed to use that mutex anyway for other
    reasons (analogous to Scenarios 1 and 3 above, for example).

(5) To prevent the races in Scenario 6 from happening, add
    the PCI remove/rescan locking to acpiphp_enable_slot() and
    acpiphp_disable_and_eject_slot() and make these functions
    check the slot's parent bridge status.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/pci_root.c            |    4 ++
 drivers/pci/hotplug/acpiphp.h      |    1 
 drivers/pci/hotplug/acpiphp_glue.c |   74 ++++++++++++++++++++++++++++++-------
 drivers/pci/pci-sysfs.c            |   11 +++++
 drivers/pci/remove.c               |    7 ++-
 include/linux/pci.h                |    3 +
 6 files changed, 84 insertions(+), 16 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bjorn Helgaas Dec. 5, 2013, 10:40 p.m. UTC | #1
On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> ...
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal
>
> The following are concurrency problems related to the PCI device
> removal code in pci-sysfs.c and in ACPIPHP present in the current
> mainline kernel:

You've found a bunch of issues.  I don't think there's anything to
gain by fixing them all in a single patch, and I think it would be
useful to split them out to help us think about them and find other
places that have similar problems.

> Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
>   the same top device from remove_callback() in pci-sysfs.c and
>   from trim_stale_devices() in acpiphp_glue.c.
>
>   In this scenario the second code path is executed without
>   pci_remove_rescan_mutex locked, so the &bus->devices list
>   walks in either trim_stale_devices() itself or in
>   acpiphp_check_bridge() can suffer from list corruption while the
>   first code path is executing pci_destroy_dev() for one of the
>   devices on those lists.

Protecting &bus->devices is a generic problem, isn't it?  There are
about a zillion uses of it.  Many are in the pcibios_fixup_bus() path.
 I think we can get rid of most of those by integrating the work into
the pci_scan_device() path instead of doing it as a post-discovery
fixup, but there will be several other cases left.  If using
pci_remove_rescan_mutex to protect &bus->devices is the right generic
answer, we should document that and audit every place that uses the
list.

>   Moreover, if any of the device objects in question is freed
>   after pci_destroy_dev() executed by the first code path, the
>   second code path may suffer a use-after-free problem while
>   trying to access that device object.
>
>   Conversely, the second code path may execute pci_destroy_dev()
>   for one of the devices in question such that one of the
>   &bus->devices list walks in pci_stop_bus_device()
>   or pci_remove_bus_device() executed by the first code path will
>   suffer from a list corruption.
>
>   Moreover, use-after-free is also possible if one of the device
>   objects in question is freed as a result of calling
>   pci_destroy_dev() by the second code path and then the first
>   code path tries to access it (the first code path only holds
>   an extra reference to the device it has been run for, but not
>   for its child devices).

The use-after-free problems *sound* like a reference counting issue.
Yinghai's patch [1] should fix some of this; how much is left after
that?

[1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org

> Scenario 2: ACPI hotplug event occurs for a device under a bridge
>   being removed by pci_stop_and_remove_bus_device() run from
>   remove_callback() in pci-sysfs.c.
>
>   In that case it doesn't make sense to handle the hotplug event,
>   because the device in question will be removed anyway along with
>   its parent bridge and that will cause the context objects needed
>   for hotplug handling to be freed as well.
>
>   Moreover, if the event is handled regardless, it may cause one
>   or more devices already removed by pci_stop_and_remove_bus_device()
>   to be added again by the code handling the event, which will
>   conflict with the bridge removal.

We definitely need to serialize hotplug events from ACPI and sysfs
(and other sources, like other hotplug drivers).  Would that be
enough?  Adding the is_going_away flag is ACPI-specific and seems like
sort of a point workaround.

> Scenario 3: pci_stop_and_remove_bus_device() is run from
>   trim_stale_devices() (as a result of an ACPI hotplug event) in
>   parallel with dev_bus_rescan_store() or bus_rescan_store(),
>   or dev_rescan_store().
>
>   In that scenario the second code path may attempt to operate
>   on device objects being removed by the first code path which
>   may lead to many interesting types of breakage.
>
> Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
>   host bridge removal event) in  parallel with bus_rescan_store(),
>   dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
>   for any devices under the host bridge in question.
>
>   In that case the same symptoms as in Scenarios 1 and 3 may occur
>   depending on which code path wins the races involved.

Scenarios 3 and 4 sound like more cases of hotplug operations needing
to be serialized, right?  If we serialized them sufficiently, would
there still be a problem?  Using pci_remove_rescan_mutex would
serialize *all* PCI hotplug operations, which is more than strictly
necessary, but maybe there's no reason to do anything finer-grained.

> Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>   for a device and its parent bridge via remove_callback().
>
>   In that case both code paths attempt to acquire
>   pci_remove_rescan_mutex.  If the child device removal acquires
>   it first, there will be no problems.  However, if the parent
>   bridge removal acquires it first, it will eventually execute
>   pci_destroy_dev() for the child device, but that device will
>   not be freed yet due to the reference held by the concurrent
>   child removal.  Consequently, both pci_stop_bus_device() and
>   pci_remove_bus_device() will be executed for that device
>   unnecessarily and pci_destroy_dev() will see a corrupted list
>   head in that object.  Moreover, an excess put_device() will
>   be executed for that device in that case which may lead to a
>   use-after-free in the final kobject_put() done by
>   sysfs_schedule_callback_work().

The corrupted list head should be fixed by Yinghai's patch [1].

Where is the extra put_device()?  I see the
kobject_get()/kobject_put() pair in sysfs_schedule_callback() and
sysfs_schedule_callback_work().  Oh, I see -- the remove_store() ->
remove_callback() path acquires no references, but it calls
pci_stop_and_remove_bus_device(), which ultimately does the
put_device() in pci_destroy_dev().

So if both the parent and the child removal manage to get to
remove_callback() and the parent acquires pci_remove_rescan_mutex
first, the child removal will do the extra put_device().

There are only six callers of device_schedule_callback(), and I think
five of them are susceptible to this same problem: they are sysfs
store methods, and they use device_schedule_callback() with a callback
that does a put_device() on the device:

  drivers/pci/pci-sysfs.c: remove_store()
  drivers/scsi/scsi_sysfs.c: sdev_store_delete()
  arch/s390/pci/pci_sysfs.c: store_recover()
  drivers/s390/block/dcssblk.c: dcssblk_shared_store()
  drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store()

I don't know what the right fix is, but adding "is_gone" to struct
pci_dev only addresses one of the five places, of course.

Bjorn

> Scenario 6: ACPIPHP slot enabling/disabling triggered by the
>   slot's "power" attribute in parallel with device removal run
>   from remove_callback().
>
>   This scenario may lead to race conditions analogous to the
>   ones described in Scenario 1.  It also may lead to situations
>   in which an already removed device under a bridge scheduled
>   for removal will be added which is analogous to Scenario 2.
>
> All of these scenarios are addressed by the patch below as follows.
>
> (1) To prevent the races in Scenarios 1 and 3 from happening hold
>     pci_remove_rescan_mutex around hotplug_event() in
>     hotplug_event_work(() (acpiphp_glue.c).
>
> (2) To prevent the races in Scenario 2 from happening, add an ACPIPHP
>     bridge flag is_going_away indicating that hotplug events should
>     be ignored for children below that bridge.  That flag is set
>     by cleanup_bridge() that for non-root bridges should be run
>     under pci_remove_rescan_mutex (for root bridges it is only
>     run under acpi_scan_lock anyway).
>
> (3) To prevent the races in Scenario 4 from happening, hold
>     pci_remove_rescan_mutex around pci_stop_root_bus() and
>     pci_remove_root_bus() in acpi_pci_root_remove().
>
> (4) To prevent the races in Scenario 5 from happening, add an new
>     is_gone flag to struct pci_dev that will be set by pci_destroy_dev()
>     and checked by pci_stop_and_remove_bus_device().  That only
>     covers cases in which pci_stop_and_remove_bus_device() is
>     run under pci_remove_rescan_mutex, but the other existing
>     cases need to be fixed to use that mutex anyway for other
>     reasons (analogous to Scenarios 1 and 3 above, for example).
>
> (5) To prevent the races in Scenario 6 from happening, add
>     the PCI remove/rescan locking to acpiphp_enable_slot() and
>     acpiphp_disable_and_eject_slot() and make these functions
>     check the slot's parent bridge status.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/acpi/pci_root.c            |    4 ++
>  drivers/pci/hotplug/acpiphp.h      |    1
>  drivers/pci/hotplug/acpiphp_glue.c |   74 ++++++++++++++++++++++++++++++-------
>  drivers/pci/pci-sysfs.c            |   11 +++++
>  drivers/pci/remove.c               |    7 ++-
>  include/linux/pci.h                |    3 +
>  6 files changed, 84 insertions(+), 16 deletions(-)
>
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>         unsigned int    multifunction:1;/* Part of multi-function device */
>         /* keep track of device state */
>         unsigned int    is_added:1;
> +       unsigned int    is_gone:1;
>         unsigned int    is_busmaster:1; /* device is busmaster */
>         unsigned int    no_msi:1;       /* device may not use msi */
>         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> @@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_
>  int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
>  unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
>  unsigned int pci_rescan_bus(struct pci_bus *bus);
> +void lock_pci_remove_rescan(void);
> +void unlock_pci_remove_rescan(void);
>
>  /* Vital product data routines */
>  ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
> Index: linux-pm/drivers/pci/pci-sysfs.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-sysfs.c
> +++ linux-pm/drivers/pci/pci-sysfs.c
> @@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
>  static DEVICE_ATTR_RW(msi_bus);
>
>  static DEFINE_MUTEX(pci_remove_rescan_mutex);
> +
> +void lock_pci_remove_rescan(void)
> +{
> +       mutex_lock(&pci_remove_rescan_mutex);
> +}
> +
> +void unlock_pci_remove_rescan(void)
> +{
> +       mutex_unlock(&pci_remove_rescan_mutex);
> +}
> +
>  static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
>                                 size_t count)
>  {
> Index: linux-pm/drivers/pci/remove.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/remove.c
> +++ linux-pm/drivers/pci/remove.c
> @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> +       dev->is_gone = 1;
>         device_del(&dev->dev);
>
>         down_write(&pci_bus_sem);
> @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>   */
>  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>  {
> -       pci_stop_bus_device(dev);
> -       pci_remove_bus_device(dev);
> +       if (!dev->is_gone) {
> +               pci_stop_bus_device(dev);
> +               pci_remove_bus_device(dev);
> +       }
>  }
>  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>
> Index: linux-pm/drivers/pci/hotplug/acpiphp.h
> ===================================================================
> --- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
> +++ linux-pm/drivers/pci/hotplug/acpiphp.h
> @@ -71,6 +71,7 @@ struct acpiphp_bridge {
>         struct acpiphp_context *context;
>
>         int nr_slots;
> +       bool is_going_away;
>
>         /* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */
>         struct pci_bus *pci_bus;
> Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
> +++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> @@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph
>         mutex_lock(&bridge_mutex);
>         list_del(&bridge->list);
>         mutex_unlock(&bridge_mutex);
> +
> +       /*
> +        * For non-root bridges this flag is protected by the PCI remove/rescan
> +        * locking.  For root bridges it is only operated under acpi_scan_lock
> +        * anyway.
> +        */
> +       bridge->is_going_away = true;
>  }
>
>  /**
> @@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc
>   *
>   * Iterate over all slots under this bridge and make sure that if a
>   * card is present they are enabled, and if not they are disabled.
> + *
> + * For non-root bridges call under the PCI remove/rescan mutex.
>   */
>  static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
>  {
>         struct acpiphp_slot *slot;
>
> +       /* Bail out if the bridge is going away. */
> +       if (bridge->is_going_away)
> +               return;
> +
>         list_for_each_entry(slot, &bridge->slots, node) {
>                 struct pci_bus *bus = slot->bus;
>                 struct pci_dev *dev, *tmp;
> @@ -807,6 +820,8 @@ void acpiphp_check_host_bridge(acpi_hand
>         }
>  }
>
> +static int disable_and_eject_slot(struct acpiphp_slot *slot);
> +
>  static void hotplug_event(acpi_handle handle, u32 type, void *data)
>  {
>         struct acpiphp_context *context = data;
> @@ -866,7 +881,7 @@ static void hotplug_event(acpi_handle ha
>         case ACPI_NOTIFY_EJECT_REQUEST:
>                 /* request device eject */
>                 pr_debug("%s: Device eject notify on %s\n", __func__, objname);
> -               acpiphp_disable_and_eject_slot(func->slot);
> +               disable_and_eject_slot(func->slot);
>                 break;
>         }
>
> @@ -878,14 +893,19 @@ static void hotplug_event_work(void *dat
>  {
>         struct acpiphp_context *context = data;
>         acpi_handle handle = context->handle;
> +       struct acpiphp_bridge *bridge = context->func.parent;
>
>         acpi_scan_lock_acquire();
> +       lock_pci_remove_rescan();
>
> -       hotplug_event(handle, type, context);
> +       /* Bail out if the parent bridge is going away. */
> +       if (!bridge->is_going_away)
> +               hotplug_event(handle, type, context);
>
> +       unlock_pci_remove_rescan();
>         acpi_scan_lock_release();
>         acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
> -       put_bridge(context->func.parent);
> +       put_bridge(bridge);
>  }
>
>  /**
> @@ -1050,20 +1070,27 @@ void acpiphp_remove_slots(struct pci_bus
>   */
>  int acpiphp_enable_slot(struct acpiphp_slot *slot)
>  {
> -       mutex_lock(&slot->crit_sect);
> -       /* configure all functions */
> -       if (!(slot->flags & SLOT_ENABLED))
> -               enable_slot(slot);
> +       struct acpiphp_func *func;
> +       int ret = -ENODEV;
>
> -       mutex_unlock(&slot->crit_sect);
> -       return 0;
> +       lock_pci_remove_rescan();
> +
> +       func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
> +       if (!func->parent->is_going_away) {
> +               mutex_lock(&slot->crit_sect);
> +               /* configure all functions */
> +               if (!(slot->flags & SLOT_ENABLED))
> +                       enable_slot(slot);
> +
> +               mutex_unlock(&slot->crit_sect);
> +               ret = 0;
> +       }
> +
> +       unlock_pci_remove_rescan();
> +       return ret;
>  }
>
> -/**
> - * acpiphp_disable_and_eject_slot - power off and eject slot
> - * @slot: ACPI PHP slot
> - */
> -int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> +static int disable_and_eject_slot(struct acpiphp_slot *slot)
>  {
>         struct acpiphp_func *func;
>         int retval = 0;
> @@ -1087,6 +1114,25 @@ int acpiphp_disable_and_eject_slot(struc
>         return retval;
>  }
>
> +/**
> + * acpiphp_disable_and_eject_slot - power off and eject slot.
> + * @slot: ACPIPHP slot.
> + */
> +int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> +{
> +       struct acpiphp_func *func;
> +       int ret = -ENODEV;
> +
> +       lock_pci_remove_rescan();
> +
> +       func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
> +       if (!func->parent->is_going_away)
> +               ret = disable_and_eject_slot(slot);
> +
> +       unlock_pci_remove_rescan();
> +       return ret;
> +}
> +
>
>  /*
>   * slot enabled:  1
> Index: linux-pm/drivers/acpi/pci_root.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/pci_root.c
> +++ linux-pm/drivers/acpi/pci_root.c
> @@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct
>  {
>         struct acpi_pci_root *root = acpi_driver_data(device);
>
> +       lock_pci_remove_rescan();
> +
>         pci_stop_root_bus(root->bus);
>
>         device_set_run_wake(root->bus->bridge, false);
> @@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct
>
>         pci_remove_root_bus(root->bus);
>
> +       unlock_pci_remove_rescan();
> +
>         kfree(root);
>  }
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Dec. 6, 2013, 1:21 a.m. UTC | #2
On Thursday, December 05, 2013 03:40:39 PM Bjorn Helgaas wrote:
> On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > ...
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal
> >
> > The following are concurrency problems related to the PCI device
> > removal code in pci-sysfs.c and in ACPIPHP present in the current
> > mainline kernel:
> 
> You've found a bunch of issues.  I don't think there's anything to
> gain by fixing them all in a single patch, and I think it would be
> useful to split them out to help us think about them and find other
> places that have similar problems.

The problem is that they are all related.  Whatever we decide to do with
one of them will likely affect how we deal with the others.

> > Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
> >   the same top device from remove_callback() in pci-sysfs.c and
> >   from trim_stale_devices() in acpiphp_glue.c.
> >
> >   In this scenario the second code path is executed without
> >   pci_remove_rescan_mutex locked, so the &bus->devices list
> >   walks in either trim_stale_devices() itself or in
> >   acpiphp_check_bridge() can suffer from list corruption while the
> >   first code path is executing pci_destroy_dev() for one of the
> >   devices on those lists.
> 
> Protecting &bus->devices is a generic problem, isn't it?

Yes, it is.

In fact, there are two sides of it.  One is with read-only users (that is,
walking the list without modifying it) which should be done under read-locked
pci_bus_sem and that's quite obvious.  It looks like the majority of users of
that list are readers and they fall under this case.  That case is addressed
by write-locking pci_bus_sem around the dev->bus_list deletion in pci_destroy_dev()
(it will wait for all readers to complete their walks as long as they use
pci_bus_sem correcty, but we may as well replace that with SRCU).

The second one is where someone is walking the list and *deleting* entires from
it in the process, which has to be synchronized with other writers and the
write-locking of pci_bus_sem in pci_destroy_dev() is not sufficient for that,
because pci_bus_sem can't be read-locked around a bus->devices walk deleting
entries from that list (exactly because pci_bus_sem is write-locked during
device deletion).  So this is a special case and to fix races between different
code paths that walk bus->devices *and* delete entries from there we need a
separate lock to be held around the entire list walk.  Incidentally,
pci_remove_rescan_mutex is already used in that context by the code in
pci-sysfs.c, so it can be used for this purpose by all of the other code
paths doing this thing, like for example acpiphp_check_bridge() and
trim_stale_devices() (there's more).

> There are about a zillion uses of it.  Many are in the pcibios_fixup_bus() path.
>  I think we can get rid of most of those by integrating the work into
> the pci_scan_device() path instead of doing it as a post-discovery
> fixup, but there will be several other cases left.  If using
> pci_remove_rescan_mutex to protect &bus->devices is the right generic
> answer, we should document that and audit every place that uses the
> list.

No, we don't have to do that as explained above.  We only need it around
code that walks the list in order to (possibly) delete entries from it
and there are not too many of such places, so they can be audited quite easily.
[Basically, wherever pci_stop_and_remove_bus_device() is called during a list
walk over bus->devices.]

And *if* we use pci_remove_rescan_mutex for that, it will *also* address some
other synchronization problems.

> >   Moreover, if any of the device objects in question is freed
> >   after pci_destroy_dev() executed by the first code path, the
> >   second code path may suffer a use-after-free problem while
> >   trying to access that device object.
> >
> >   Conversely, the second code path may execute pci_destroy_dev()
> >   for one of the devices in question such that one of the
> >   &bus->devices list walks in pci_stop_bus_device()
> >   or pci_remove_bus_device() executed by the first code path will
> >   suffer from a list corruption.
> >
> >   Moreover, use-after-free is also possible if one of the device
> >   objects in question is freed as a result of calling
> >   pci_destroy_dev() by the second code path and then the first
> >   code path tries to access it (the first code path only holds
> >   an extra reference to the device it has been run for, but not
> >   for its child devices).
> 
> The use-after-free problems *sound* like a reference counting issue.
> Yinghai's patch [1] should fix some of this; how much is left after
> that?
> 
> [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org

Hmm, no.  Do you mean https://patchwork.kernel.org/patch/3261171/ ?

This patch doesn't fix the problem either, however, because in only fixes
the one case described in Scenario 5 below.

Suppose that (1) trim_stale_devices() calls pci_stop_amd_remove_bus_devices(X)
and (2) remove_callback() calls pci_stop_and_remove_bus_devices(X) (i.e. for
the same device) at the same time.

Suppose that X is a leaf device for simplicity and say that code path (2)
is executed first.  We run pci_stop_dev(X) and pci_destroy_dev(X), which
does put_device(&X->dev) and then sysfs_schedule_callback_work() executes
kobject_put() for the X's kobject.  The device pointed to by X is gone at this
point.

Now, code path (1) runs and and crashes with a great bang trying to access
X->subordinate in pci_stop_bus_device().  This happens regardless of whether or
not the Yinghai's patch has been applied.  QED

Of course, you can argue that this is because code path (1) should have done
a pci_dev_get(X) before running pci_stop_amd_remove_bus_devices(X) which
would have avoided the crash.  I can agree with that, but *if* code path (1)
had held pci_remove_rescan_mutex around the whole trim_stale_devices(), then
it wouldn't have had to do that pci_dev_get(X), because the entire race
above wouldn't have been possible (code path (2) already holds that mutex
around pci_stop_and_remove_bus_devices(X)).

OK, you can ask, but why the heck does code path (1) need to hold
pci_remove_rescan_mutex around trim_stale_devices()?  Well, because it needs
to synchronize the list walks in acpiphp_check_bridge() and trim_stale_devices()
with the list_del() in pci_destroy_dev() executed by code path (2).

That's why I'm proposing to acquire pci_remove_rescan_mutex in
hotplug_event_work() and hold it around the whole hotplug_event().

Of course, analogous races are possible for acpiphp_enable_slot() and
acpiphp_disable_and_eject_slot() and they may be addressed analogously -
by holding pci_remove_rescan_mutex around possible modifications of the PCI
device hierarchy.

> > Scenario 2: ACPI hotplug event occurs for a device under a bridge
> >   being removed by pci_stop_and_remove_bus_device() run from
> >   remove_callback() in pci-sysfs.c.
> >
> >   In that case it doesn't make sense to handle the hotplug event,
> >   because the device in question will be removed anyway along with
> >   its parent bridge and that will cause the context objects needed
> >   for hotplug handling to be freed as well.
> >
> >   Moreover, if the event is handled regardless, it may cause one
> >   or more devices already removed by pci_stop_and_remove_bus_device()
> >   to be added again by the code handling the event, which will
> >   conflict with the bridge removal.
> 
> We definitely need to serialize hotplug events from ACPI and sysfs
> (and other sources, like other hotplug drivers).  Would that be
> enough?  Adding the is_going_away flag is ACPI-specific and seems like
> sort of a point workaround.

Well, we basically need to prevent hotplug_event() from being run in that
case, and since that function is ACPI-specific and the data structures
hotplug_event_work() operates on are ACPI-specific, I'm using an ACPI-specific
flag to achieve that goal.  If you can suggest any approach that would not
be ACPI-specific to address this particular case, I'll be happy to use it. :-)

> > Scenario 3: pci_stop_and_remove_bus_device() is run from
> >   trim_stale_devices() (as a result of an ACPI hotplug event) in
> >   parallel with dev_bus_rescan_store() or bus_rescan_store(),
> >   or dev_rescan_store().
> >
> >   In that scenario the second code path may attempt to operate
> >   on device objects being removed by the first code path which
> >   may lead to many interesting types of breakage.
> >
> > Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
> >   host bridge removal event) in  parallel with bus_rescan_store(),
> >   dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
> >   for any devices under the host bridge in question.
> >
> >   In that case the same symptoms as in Scenarios 1 and 3 may occur
> >   depending on which code path wins the races involved.
> 
> Scenarios 3 and 4 sound like more cases of hotplug operations needing
> to be serialized, right?  If we serialized them sufficiently, would
> there still be a problem?

That depends on what exactly you mean by "sufficiently".  That is, what's
the goal?  Different approaches may be sufficient to achieve specific goals,
but not necessarily more than one of them.

> Using pci_remove_rescan_mutex would
> serialize *all* PCI hotplug operations, which is more than strictly
> necessary, but maybe there's no reason to do anything finer-grained.

My answer to this is yes, using pci_remove_rescan_mutex *will* serialize all
PCI hotplug operations, which will be a very good first step.  Having done that,
we can see how much we can relax things and how far we *want* to go with that.

> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >   for a device and its parent bridge via remove_callback().
> >
> >   In that case both code paths attempt to acquire
> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >   it first, there will be no problems.  However, if the parent
> >   bridge removal acquires it first, it will eventually execute
> >   pci_destroy_dev() for the child device, but that device will
> >   not be freed yet due to the reference held by the concurrent
> >   child removal.  Consequently, both pci_stop_bus_device() and
> >   pci_remove_bus_device() will be executed for that device
> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >   head in that object.  Moreover, an excess put_device() will
> >   be executed for that device in that case which may lead to a
> >   use-after-free in the final kobject_put() done by
> >   sysfs_schedule_callback_work().
> 
> The corrupted list head should be fixed by Yinghai's patch [1].

I think you really mean https://patchwork.kernel.org/patch/3261171/ :-)

Generally, patches in that series would mitigate that problem somewhat.

I actually stole the Yinghai's idea with the new PCI device flag (sorry,
Yinghai), but I think it's better to check that flag to start with in
pci_stop_and_remove_bus_device(), because then we avoid calling
pci_stop_bus_device() unnecessarily as well (surely, the device was stopped
before it has gone, wasn't it?).

> Where is the extra put_device()?  I see the
> kobject_get()/kobject_put() pair in sysfs_schedule_callback() and
> sysfs_schedule_callback_work().  Oh, I see -- the remove_store() ->
> remove_callback() path acquires no references, but it calls
> pci_stop_and_remove_bus_device(), which ultimately does the
> put_device() in pci_destroy_dev().
> 
> So if both the parent and the child removal manage to get to
> remove_callback() and the parent acquires pci_remove_rescan_mutex
> first, the child removal will do the extra put_device().
> 
> There are only six callers of device_schedule_callback(), and I think
> five of them are susceptible to this same problem: they are sysfs
> store methods, and they use device_schedule_callback() with a callback
> that does a put_device() on the device:
> 
>   drivers/pci/pci-sysfs.c: remove_store()
>   drivers/scsi/scsi_sysfs.c: sdev_store_delete()
>   arch/s390/pci/pci_sysfs.c: store_recover()
>   drivers/s390/block/dcssblk.c: dcssblk_shared_store()
>   drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store()
> 
> I don't know what the right fix is, but adding "is_gone" to struct
> pci_dev only addresses one of the five places, of course.

That's correct.

I'm not sure if there *is* a generic solution to this problem, because
sysfs_schedule_callback_work() doesn't know what the callback function is
going to do.  In principle it can look at ss->kobj->parent and skip the
execution of ss->func if that is NULL (which means that the kobject has
been deleted and it is likely the last thing holding a reference to that
kobject), but that still will be racy without extra synchronization in the
users of device_schedule_callback().

Which probably means that device_schedule_callback() is ill concieved in the
first place, because it can't really do the work it is designed for in general.
So perhaps it's better to use something simpler and PCI-specific for PCI and
analogously in the other cases you mentioned.

OK

To be a bit more constructive, as the next step I'd try to use
pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
above) without making the other changes made by my patch.  Does that sound
reasonable?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 6, 2013, 6:29 a.m. UTC | #3
On Thu, Dec 5, 2013 at 5:21 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>
>> The use-after-free problems *sound* like a reference counting issue.
>> Yinghai's patch [1] should fix some of this; how much is left after
>> that?
>>
>> [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org
>
> Hmm, no.  Do you mean https://patchwork.kernel.org/patch/3261171/ ?

should be

https://patchwork.kernel.org/patch/3261001/
[v3,03/12] PCI: Move resources and bus_list releasing to pci_release_dev

move down list_del(&pci_dev->bus_list).

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 6, 2013, 6:52 a.m. UTC | #4
On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>   for a device and its parent bridge via remove_callback().
>
>   In that case both code paths attempt to acquire
>   pci_remove_rescan_mutex.  If the child device removal acquires
>   it first, there will be no problems.  However, if the parent
>   bridge removal acquires it first, it will eventually execute
>   pci_destroy_dev() for the child device, but that device will
>   not be freed yet due to the reference held by the concurrent
>   child removal.  Consequently, both pci_stop_bus_device() and
>   pci_remove_bus_device() will be executed for that device
>   unnecessarily and pci_destroy_dev() will see a corrupted list
>   head in that object.  Moreover, an excess put_device() will
>   be executed for that device in that case which may lead to a
>   use-after-free in the final kobject_put() done by
>   sysfs_schedule_callback_work().
>
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>         unsigned int    multifunction:1;/* Part of multi-function device */
>         /* keep track of device state */
>         unsigned int    is_added:1;
> +       unsigned int    is_gone:1;
>         unsigned int    is_busmaster:1; /* device is busmaster */
>         unsigned int    no_msi:1;       /* device may not use msi */
>         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> Index: linux-pm/drivers/pci/remove.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/remove.c
> +++ linux-pm/drivers/pci/remove.c
> @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> +       dev->is_gone = 1;
>         device_del(&dev->dev);
>
>         down_write(&pci_bus_sem);
> @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>   */
>  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>  {
> -       pci_stop_bus_device(dev);
> -       pci_remove_bus_device(dev);
> +       if (!dev->is_gone) {
> +               pci_stop_bus_device(dev);
> +               pci_remove_bus_device(dev);
> +       }
>  }
>  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>

Yes, above change should address sys double remove problem.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Dec. 7, 2013, 1:27 a.m. UTC | #5
On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >   for a device and its parent bridge via remove_callback().
> >
> >   In that case both code paths attempt to acquire
> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >   it first, there will be no problems.  However, if the parent
> >   bridge removal acquires it first, it will eventually execute
> >   pci_destroy_dev() for the child device, but that device will
> >   not be freed yet due to the reference held by the concurrent
> >   child removal.  Consequently, both pci_stop_bus_device() and
> >   pci_remove_bus_device() will be executed for that device
> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >   head in that object.  Moreover, an excess put_device() will
> >   be executed for that device in that case which may lead to a
> >   use-after-free in the final kobject_put() done by
> >   sysfs_schedule_callback_work().
> >
> > Index: linux-pm/include/linux/pci.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/pci.h
> > +++ linux-pm/include/linux/pci.h
> > @@ -321,6 +321,7 @@ struct pci_dev {
> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >         /* keep track of device state */
> >         unsigned int    is_added:1;
> > +       unsigned int    is_gone:1;
> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >         unsigned int    no_msi:1;       /* device may not use msi */
> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> > Index: linux-pm/drivers/pci/remove.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/remove.c
> > +++ linux-pm/drivers/pci/remove.c
> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >
> >  static void pci_destroy_dev(struct pci_dev *dev)
> >  {
> > +       dev->is_gone = 1;
> >         device_del(&dev->dev);
> >
> >         down_write(&pci_bus_sem);
> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >   */
> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >  {
> > -       pci_stop_bus_device(dev);
> > -       pci_remove_bus_device(dev);
> > +       if (!dev->is_gone) {
> > +               pci_stop_bus_device(dev);
> > +               pci_remove_bus_device(dev);
> > +       }
> >  }
> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >
> 
> Yes, above change should address sys double remove problem.

I've just realized that we don't need a new flag for that, though.

It looks like we only need to check dev->dev.kobj.parent and return if that is
NULL, because that means pci_destroy_dev() has run for that device already
(I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
it should do that?).

Of course, that still is going to be racy if we don't hold
pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
path using it (or use another similar synchronization mechanism).

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 8, 2013, 3:31 a.m. UTC | #6
[+ GregKH]

On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >
>> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >   for a device and its parent bridge via remove_callback().
>> >
>> >   In that case both code paths attempt to acquire
>> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >   it first, there will be no problems.  However, if the parent
>> >   bridge removal acquires it first, it will eventually execute
>> >   pci_destroy_dev() for the child device, but that device will
>> >   not be freed yet due to the reference held by the concurrent
>> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >   pci_remove_bus_device() will be executed for that device
>> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >   head in that object.  Moreover, an excess put_device() will
>> >   be executed for that device in that case which may lead to a
>> >   use-after-free in the final kobject_put() done by
>> >   sysfs_schedule_callback_work().
>> >
>> > Index: linux-pm/include/linux/pci.h
>> > ===================================================================
>> > --- linux-pm.orig/include/linux/pci.h
>> > +++ linux-pm/include/linux/pci.h
>> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >         /* keep track of device state */
>> >         unsigned int    is_added:1;
>> > +       unsigned int    is_gone:1;
>> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> > Index: linux-pm/drivers/pci/remove.c
>> > ===================================================================
>> > --- linux-pm.orig/drivers/pci/remove.c
>> > +++ linux-pm/drivers/pci/remove.c
>> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >
>> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >  {
>> > +       dev->is_gone = 1;
>> >         device_del(&dev->dev);
>> >
>> >         down_write(&pci_bus_sem);
>> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >   */
>> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >  {
>> > -       pci_stop_bus_device(dev);
>> > -       pci_remove_bus_device(dev);
>> > +       if (!dev->is_gone) {
>> > +               pci_stop_bus_device(dev);
>> > +               pci_remove_bus_device(dev);
>> > +       }
>> >  }
>> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >
>>
>> Yes, above change should address sys double remove problem.
>
> I've just realized that we don't need a new flag for that, though.
>
> It looks like we only need to check dev->dev.kobj.parent and return if that is
> NULL, because that means pci_destroy_dev() has run for that device already
> (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> it should do that?).
>
> Of course, that still is going to be racy if we don't hold
> pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> path using it (or use another similar synchronization mechanism).

Wonder if we can have safe way to check if device_del() is called already.

And those access_after_free should be addressed by driver core instead
of pci code?

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH Dec. 8, 2013, 3:50 a.m. UTC | #7
On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
> [+ GregKH]
> 
> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> >
> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >> >   for a device and its parent bridge via remove_callback().
> >> >
> >> >   In that case both code paths attempt to acquire
> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >> >   it first, there will be no problems.  However, if the parent
> >> >   bridge removal acquires it first, it will eventually execute
> >> >   pci_destroy_dev() for the child device, but that device will
> >> >   not be freed yet due to the reference held by the concurrent
> >> >   child removal.  Consequently, both pci_stop_bus_device() and
> >> >   pci_remove_bus_device() will be executed for that device
> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >> >   head in that object.  Moreover, an excess put_device() will
> >> >   be executed for that device in that case which may lead to a
> >> >   use-after-free in the final kobject_put() done by
> >> >   sysfs_schedule_callback_work().
> >> >
> >> > Index: linux-pm/include/linux/pci.h
> >> > ===================================================================
> >> > --- linux-pm.orig/include/linux/pci.h
> >> > +++ linux-pm/include/linux/pci.h
> >> > @@ -321,6 +321,7 @@ struct pci_dev {
> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >> >         /* keep track of device state */
> >> >         unsigned int    is_added:1;
> >> > +       unsigned int    is_gone:1;
> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >> >         unsigned int    no_msi:1;       /* device may not use msi */
> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> >> > Index: linux-pm/drivers/pci/remove.c
> >> > ===================================================================
> >> > --- linux-pm.orig/drivers/pci/remove.c
> >> > +++ linux-pm/drivers/pci/remove.c
> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >> >
> >> >  static void pci_destroy_dev(struct pci_dev *dev)
> >> >  {
> >> > +       dev->is_gone = 1;
> >> >         device_del(&dev->dev);
> >> >
> >> >         down_write(&pci_bus_sem);
> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >> >   */
> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >> >  {
> >> > -       pci_stop_bus_device(dev);
> >> > -       pci_remove_bus_device(dev);
> >> > +       if (!dev->is_gone) {
> >> > +               pci_stop_bus_device(dev);
> >> > +               pci_remove_bus_device(dev);
> >> > +       }
> >> >  }
> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >> >
> >>
> >> Yes, above change should address sys double remove problem.
> >
> > I've just realized that we don't need a new flag for that, though.
> >
> > It looks like we only need to check dev->dev.kobj.parent and return if that is
> > NULL, because that means pci_destroy_dev() has run for that device already
> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> > it should do that?).
> >
> > Of course, that still is going to be racy if we don't hold
> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> > path using it (or use another similar synchronization mechanism).
> 
> Wonder if we can have safe way to check if device_del() is called already.

Nope.

> And those access_after_free should be addressed by driver core instead
> of pci code?

Nope, it's up to the bus to handle this.  It shouldn't be hard, you
shouldn't actually care about this, if you do, something is wrong.

How is this PCI code so hard to get right?  Look at USB for devices that
disappear from anywhere at anytime as an example for how to handle
this.  PCI should be doing the same thing, no need for this "is_gone"
stuff.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ethan Zhao Dec. 9, 2013, 3:24 p.m. UTC | #8
On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
>> [+ GregKH]
>>
>> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> >
>> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >> >   for a device and its parent bridge via remove_callback().
>> >> >
>> >> >   In that case both code paths attempt to acquire
>> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >> >   it first, there will be no problems.  However, if the parent
>> >> >   bridge removal acquires it first, it will eventually execute
>> >> >   pci_destroy_dev() for the child device, but that device will
>> >> >   not be freed yet due to the reference held by the concurrent
>> >> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >> >   pci_remove_bus_device() will be executed for that device
>> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >> >   head in that object.  Moreover, an excess put_device() will
>> >> >   be executed for that device in that case which may lead to a
>> >> >   use-after-free in the final kobject_put() done by
>> >> >   sysfs_schedule_callback_work().
>> >> >
>> >> > Index: linux-pm/include/linux/pci.h
>> >> > ===================================================================
>> >> > --- linux-pm.orig/include/linux/pci.h
>> >> > +++ linux-pm/include/linux/pci.h
>> >> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >> >         /* keep track of device state */
>> >> >         unsigned int    is_added:1;
>> >> > +       unsigned int    is_gone:1;
>> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> >> > Index: linux-pm/drivers/pci/remove.c
>> >> > ===================================================================
>> >> > --- linux-pm.orig/drivers/pci/remove.c
>> >> > +++ linux-pm/drivers/pci/remove.c
>> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >> >
>> >> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >> >  {
>> >> > +       dev->is_gone = 1;
>> >> >         device_del(&dev->dev);
>> >> >
>> >> >         down_write(&pci_bus_sem);
>> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >> >   */
>> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >> >  {
>> >> > -       pci_stop_bus_device(dev);
>> >> > -       pci_remove_bus_device(dev);
>> >> > +       if (!dev->is_gone) {
>> >> > +               pci_stop_bus_device(dev);
>> >> > +               pci_remove_bus_device(dev);
>> >> > +       }
>> >> >  }
>> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >> >
>> >>
>> >> Yes, above change should address sys double remove problem.
>> >
>> > I've just realized that we don't need a new flag for that, though.
>> >
>> > It looks like we only need to check dev->dev.kobj.parent and return if that is
>> > NULL, because that means pci_destroy_dev() has run for that device already
>> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
>> > it should do that?).
>> >
>> > Of course, that still is going to be racy if we don't hold
>> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
>> > path using it (or use another similar synchronization mechanism).
>>
>> Wonder if we can have safe way to check if device_del() is called already.
>
> Nope.
>
>> And those access_after_free should be addressed by driver core instead
>> of pci code?
>
> Nope, it's up to the bus to handle this.  It shouldn't be hard, you
> shouldn't actually care about this, if you do, something is wrong.
>
> How is this PCI code so hard to get right?  Look at USB for devices that
> disappear from anywhere at anytime as an example for how to handle
> this.  PCI should be doing the same thing, no need for this "is_gone"
> stuff.
Greg,

  Don't agree USB is a good example to follow, do you never hit panic
when you pull out USB device from anywhere at anytime without unmount
or stop it via command ? that is not truth.  the truth is none regards
it as enterprise level interface to attach devices.
  Is there a feature for an USB disk to tell the host you want to pull
out it and should sync all the data in cache and unmount the files
system then power it off ?
  What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ?

Thanks,
Ethan
>
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH Dec. 9, 2013, 7:08 p.m. UTC | #9
On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote:
> On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
> >> [+ GregKH]
> >>
> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> >> >
> >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >> >> >   for a device and its parent bridge via remove_callback().
> >> >> >
> >> >> >   In that case both code paths attempt to acquire
> >> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >> >> >   it first, there will be no problems.  However, if the parent
> >> >> >   bridge removal acquires it first, it will eventually execute
> >> >> >   pci_destroy_dev() for the child device, but that device will
> >> >> >   not be freed yet due to the reference held by the concurrent
> >> >> >   child removal.  Consequently, both pci_stop_bus_device() and
> >> >> >   pci_remove_bus_device() will be executed for that device
> >> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >> >> >   head in that object.  Moreover, an excess put_device() will
> >> >> >   be executed for that device in that case which may lead to a
> >> >> >   use-after-free in the final kobject_put() done by
> >> >> >   sysfs_schedule_callback_work().
> >> >> >
> >> >> > Index: linux-pm/include/linux/pci.h
> >> >> > ===================================================================
> >> >> > --- linux-pm.orig/include/linux/pci.h
> >> >> > +++ linux-pm/include/linux/pci.h
> >> >> > @@ -321,6 +321,7 @@ struct pci_dev {
> >> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >> >> >         /* keep track of device state */
> >> >> >         unsigned int    is_added:1;
> >> >> > +       unsigned int    is_gone:1;
> >> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >> >> >         unsigned int    no_msi:1;       /* device may not use msi */
> >> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> >> >> > Index: linux-pm/drivers/pci/remove.c
> >> >> > ===================================================================
> >> >> > --- linux-pm.orig/drivers/pci/remove.c
> >> >> > +++ linux-pm/drivers/pci/remove.c
> >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >> >> >
> >> >> >  static void pci_destroy_dev(struct pci_dev *dev)
> >> >> >  {
> >> >> > +       dev->is_gone = 1;
> >> >> >         device_del(&dev->dev);
> >> >> >
> >> >> >         down_write(&pci_bus_sem);
> >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >> >> >   */
> >> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >> >> >  {
> >> >> > -       pci_stop_bus_device(dev);
> >> >> > -       pci_remove_bus_device(dev);
> >> >> > +       if (!dev->is_gone) {
> >> >> > +               pci_stop_bus_device(dev);
> >> >> > +               pci_remove_bus_device(dev);
> >> >> > +       }
> >> >> >  }
> >> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >> >> >
> >> >>
> >> >> Yes, above change should address sys double remove problem.
> >> >
> >> > I've just realized that we don't need a new flag for that, though.
> >> >
> >> > It looks like we only need to check dev->dev.kobj.parent and return if that is
> >> > NULL, because that means pci_destroy_dev() has run for that device already
> >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> >> > it should do that?).
> >> >
> >> > Of course, that still is going to be racy if we don't hold
> >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> >> > path using it (or use another similar synchronization mechanism).
> >>
> >> Wonder if we can have safe way to check if device_del() is called already.
> >
> > Nope.
> >
> >> And those access_after_free should be addressed by driver core instead
> >> of pci code?
> >
> > Nope, it's up to the bus to handle this.  It shouldn't be hard, you
> > shouldn't actually care about this, if you do, something is wrong.
> >
> > How is this PCI code so hard to get right?  Look at USB for devices that
> > disappear from anywhere at anytime as an example for how to handle
> > this.  PCI should be doing the same thing, no need for this "is_gone"
> > stuff.
> Greg,
> 
>   Don't agree USB is a good example to follow, do you never hit panic
> when you pull out USB device from anywhere at anytime without unmount
> or stop it via command ?

You shouldn't.  If you do, it's a bug, let us know and we will fix it.

> that is not truth.  the truth is none regards it as enterprise level
> interface to attach devices.

Huh?

>   Is there a feature for an USB disk to tell the host you want to pull
> out it and should sync all the data in cache and unmount the files
> system then power it off ?

Nope, neither is there one for when I yank out my PCI storage device
without telling the OS about it either.  Everything better "just work",
with the exception of any lost data that might be in flight.

>   What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ?

I don't understand what that means at all.

We have USB network ethernet devices, I have a USB 3.0 one here that
works really well.  Infiniband is merely a transport, with some "verbs"
on top of it, that has nothing to do with PCI other than you can have a
IB PCI controller in the system.  And I have a USB graphics adapter here
that works just fine as well (people chain lots of them on one system.)

So how does this apply to PCI at all?  It's the same thing, you have to
be able to handle a PCI device going away at any point in time, with or
without telling the OS ahead of time that you are going to remove it.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ethan Zhao Dec. 10, 2013, 7:43 a.m. UTC | #10
On Tue, Dec 10, 2013 at 3:08 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote:
>> On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
>> >> [+ GregKH]
>> >>
>> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> >> >
>> >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >> >> >   for a device and its parent bridge via remove_callback().
>> >> >> >
>> >> >> >   In that case both code paths attempt to acquire
>> >> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >> >> >   it first, there will be no problems.  However, if the parent
>> >> >> >   bridge removal acquires it first, it will eventually execute
>> >> >> >   pci_destroy_dev() for the child device, but that device will
>> >> >> >   not be freed yet due to the reference held by the concurrent
>> >> >> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >> >> >   pci_remove_bus_device() will be executed for that device
>> >> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >> >> >   head in that object.  Moreover, an excess put_device() will
>> >> >> >   be executed for that device in that case which may lead to a
>> >> >> >   use-after-free in the final kobject_put() done by
>> >> >> >   sysfs_schedule_callback_work().
>> >> >> >
>> >> >> > Index: linux-pm/include/linux/pci.h
>> >> >> > ===================================================================
>> >> >> > --- linux-pm.orig/include/linux/pci.h
>> >> >> > +++ linux-pm/include/linux/pci.h
>> >> >> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >> >> >         /* keep track of device state */
>> >> >> >         unsigned int    is_added:1;
>> >> >> > +       unsigned int    is_gone:1;
>> >> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >> >> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> >> >> > Index: linux-pm/drivers/pci/remove.c
>> >> >> > ===================================================================
>> >> >> > --- linux-pm.orig/drivers/pci/remove.c
>> >> >> > +++ linux-pm/drivers/pci/remove.c
>> >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >> >> >
>> >> >> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >> >> >  {
>> >> >> > +       dev->is_gone = 1;
>> >> >> >         device_del(&dev->dev);
>> >> >> >
>> >> >> >         down_write(&pci_bus_sem);
>> >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >> >> >   */
>> >> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >> >> >  {
>> >> >> > -       pci_stop_bus_device(dev);
>> >> >> > -       pci_remove_bus_device(dev);
>> >> >> > +       if (!dev->is_gone) {
>> >> >> > +               pci_stop_bus_device(dev);
>> >> >> > +               pci_remove_bus_device(dev);
>> >> >> > +       }
>> >> >> >  }
>> >> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >> >> >
>> >> >>
>> >> >> Yes, above change should address sys double remove problem.
>> >> >
>> >> > I've just realized that we don't need a new flag for that, though.
>> >> >
>> >> > It looks like we only need to check dev->dev.kobj.parent and return if that is
>> >> > NULL, because that means pci_destroy_dev() has run for that device already
>> >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
>> >> > it should do that?).
>> >> >
>> >> > Of course, that still is going to be racy if we don't hold
>> >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
>> >> > path using it (or use another similar synchronization mechanism).
>> >>
>> >> Wonder if we can have safe way to check if device_del() is called already.
>> >
>> > Nope.
>> >
>> >> And those access_after_free should be addressed by driver core instead
>> >> of pci code?
>> >
>> > Nope, it's up to the bus to handle this.  It shouldn't be hard, you
>> > shouldn't actually care about this, if you do, something is wrong.
>> >
>> > How is this PCI code so hard to get right?  Look at USB for devices that
>> > disappear from anywhere at anytime as an example for how to handle
>> > this.  PCI should be doing the same thing, no need for this "is_gone"
>> > stuff.
>> Greg,
>>
>>   Don't agree USB is a good example to follow, do you never hit panic
>> when you pull out USB device from anywhere at anytime without unmount
>> or stop it via command ?
>
> You shouldn't.  If you do, it's a bug, let us know and we will fix it.

Of coz, next time hit, bore you with a calltrace.

>
>> that is not truth.  the truth is none regards it as enterprise level
>> interface to attach devices.
>
> Huh?

USB 3.0 still not fast enough for enterprise level.
>
>>   Is there a feature for an USB disk to tell the host you want to pull
>> out it and should sync all the data in cache and unmount the files
>> system then power it off ?
>
> Nope, neither is there one for when I yank out my PCI storage device
> without telling the OS about it either.  Everything better "just work",
> with the exception of any lost data that might be in flight.

To a desktop, you do have option to issue 'sync, umount' and pull out
the device and it 'just work',
To a server, someone wouldn't stand for any data lost in flight.  USB
need additional feature added
for you to tell udev sync data etc without a console in hand.

>
>>   What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ?
>
> I don't understand what that means at all.

Don't be sleepy, man, you know USB is not powerful enough today, just
as you said, someday,
all the outdated thing will go away, just like those ISA, VESA, we
don't care the low level data link layer anymore. but today, PCIe is
still a little more complex/out than USB to handle.

Thanks,
Ethan
>
> We have USB network ethernet devices, I have a USB 3.0 one here that
> works really well.  Infiniband is merely a transport, with some "verbs"
> on top of it, that has nothing to do with PCI other than you can have a
> IB PCI controller in the system.  And I have a USB graphics adapter here
> that works just fine as well (people chain lots of them on one system.)
>
> So how does this apply to PCI at all?  It's the same thing, you have to
> be able to handle a PCI device going away at any point in time, with or
> without telling the OS ahead of time that you are going to remove it.
>
> greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 10, 2014, 2:20 p.m. UTC | #11
[Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for
 platform/x86 and Konrad for Xen]

On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote:

[...]

> 
> OK
> 
> To be a bit more constructive, as the next step I'd try to use
> pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
> above) without making the other changes made by my patch.  Does that sound
> reasonable?

Well, no answer here, so as a followup, a series implementing that idea
follows.

I *hope* I found all of the places that need to be synchronized vs the bus
rescan and device removal that can be triggered via sysfs, but I might overlook
something.  Also in some cases I wasn't quite sure how much stuff to put under
the lock, because said stuff is not exactly straightforward.

Enjoy!

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Jan. 15, 2014, 6:02 p.m. UTC | #12
On Fri, Jan 10, 2014 at 03:20:44PM +0100, Rafael J. Wysocki wrote:
> [Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for
>  platform/x86 and Konrad for Xen]
> 
> On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote:
> 
> [...]
> 
> > 
> > OK
> > 
> > To be a bit more constructive, as the next step I'd try to use
> > pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
> > above) without making the other changes made by my patch.  Does that sound
> > reasonable?
> 
> Well, no answer here, so as a followup, a series implementing that idea
> follows.
> 
> I *hope* I found all of the places that need to be synchronized vs the bus
> rescan and device removal that can be triggered via sysfs, but I might overlook
> something.  Also in some cases I wasn't quite sure how much stuff to put under
> the lock, because said stuff is not exactly straightforward.

I applied this series to my pci/locking branch for v3.14.  It should appear
in -next tomorrow.

Note that this touches some areas that are not strictly PCI, so speak
up if I'm treading on your toes:

 arch/powerpc/kernel/eeh_driver.c       |   19 ++++++++++++--
 drivers/acpi/pci_root.c                |    6 ++++
 drivers/message/fusion/mptbase.c       |    2 -
 drivers/pci/hotplug/acpiphp.h          |    5 +++
 drivers/pci/hotplug/acpiphp_core.c     |    2 -
 drivers/pci/hotplug/acpiphp_glue.c     |   43 +++++++++++++++++++++++++++++----
 drivers/pci/hotplug/cpci_hotplug_pci.c |   14 +++++++++-
 drivers/pci/hotplug/cpqphp_pci.c       |    8 +++++-
 drivers/pci/hotplug/ibmphp_core.c      |   13 ++++++++-
 drivers/pci/hotplug/pciehp_pci.c       |   17 +++++++++----
 drivers/pci/hotplug/rpadlpar_core.c    |   19 ++++++++++----
 drivers/pci/hotplug/rpaphp_core.c      |    4 +++
 drivers/pci/hotplug/s390_pci_hpc.c     |    4 ++-
 drivers/pci/hotplug/sgi_hotplug.c      |    5 +++
 drivers/pci/hotplug/shpchp_pci.c       |   18 ++++++++++---
 drivers/pci/pci-sysfs.c                |   19 +++++---------
 drivers/pci/probe.c                    |   18 +++++++++++++
 drivers/pci/remove.c                   |   11 ++++++++
 drivers/pci/xen-pcifront.c             |    8 ++++++
 drivers/pcmcia/cardbus.c               |    7 +++++
 drivers/platform/x86/asus-wmi.c        |    2 +
 drivers/platform/x86/eeepc-laptop.c    |    2 +
 drivers/scsi/mpt2sas/mpt2sas_base.c    |    2 -
 drivers/scsi/mpt3sas/mpt3sas_base.c    |    2 -
 include/linux/pci.h                    |    3 ++

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -321,6 +321,7 @@  struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_gone:1;
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	block_cfg_access:1;	/* config space access is blocked */
@@ -1022,6 +1023,8 @@  void set_pcie_hotplug_bridge(struct pci_
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
+void lock_pci_remove_rescan(void);
+void unlock_pci_remove_rescan(void);
 
 /* Vital product data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
Index: linux-pm/drivers/pci/pci-sysfs.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-sysfs.c
+++ linux-pm/drivers/pci/pci-sysfs.c
@@ -298,6 +298,17 @@  msi_bus_store(struct device *dev, struct
 static DEVICE_ATTR_RW(msi_bus);
 
 static DEFINE_MUTEX(pci_remove_rescan_mutex);
+
+void lock_pci_remove_rescan(void)
+{
+	mutex_lock(&pci_remove_rescan_mutex);
+}
+
+void unlock_pci_remove_rescan(void)
+{
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -34,6 +34,7 @@  static void pci_stop_dev(struct pci_dev
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
+	dev->is_gone = 1;
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
@@ -109,8 +110,10 @@  static void pci_remove_bus_device(struct
  */
 void pci_stop_and_remove_bus_device(struct pci_dev *dev)
 {
-	pci_stop_bus_device(dev);
-	pci_remove_bus_device(dev);
+	if (!dev->is_gone) {
+		pci_stop_bus_device(dev);
+		pci_remove_bus_device(dev);
+	}
 }
 EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
Index: linux-pm/drivers/pci/hotplug/acpiphp.h
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
+++ linux-pm/drivers/pci/hotplug/acpiphp.h
@@ -71,6 +71,7 @@  struct acpiphp_bridge {
 	struct acpiphp_context *context;
 
 	int nr_slots;
+	bool is_going_away;
 
 	/* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */
 	struct pci_bus *pci_bus;
Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -439,6 +439,13 @@  static void cleanup_bridge(struct acpiph
 	mutex_lock(&bridge_mutex);
 	list_del(&bridge->list);
 	mutex_unlock(&bridge_mutex);
+
+	/*
+	 * For non-root bridges this flag is protected by the PCI remove/rescan
+	 * locking.  For root bridges it is only operated under acpi_scan_lock
+	 * anyway.
+	 */
+	bridge->is_going_away = true;
 }
 
 /**
@@ -733,11 +740,17 @@  static void trim_stale_devices(struct pc
  *
  * Iterate over all slots under this bridge and make sure that if a
  * card is present they are enabled, and if not they are disabled.
+ *
+ * For non-root bridges call under the PCI remove/rescan mutex.
  */
 static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
 {
 	struct acpiphp_slot *slot;
 
+	/* Bail out if the bridge is going away. */
+	if (bridge->is_going_away)
+		return;
+
 	list_for_each_entry(slot, &bridge->slots, node) {
 		struct pci_bus *bus = slot->bus;
 		struct pci_dev *dev, *tmp;
@@ -807,6 +820,8 @@  void acpiphp_check_host_bridge(acpi_hand
 	}
 }
 
+static int disable_and_eject_slot(struct acpiphp_slot *slot);
+
 static void hotplug_event(acpi_handle handle, u32 type, void *data)
 {
 	struct acpiphp_context *context = data;
@@ -866,7 +881,7 @@  static void hotplug_event(acpi_handle ha
 	case ACPI_NOTIFY_EJECT_REQUEST:
 		/* request device eject */
 		pr_debug("%s: Device eject notify on %s\n", __func__, objname);
-		acpiphp_disable_and_eject_slot(func->slot);
+		disable_and_eject_slot(func->slot);
 		break;
 	}
 
@@ -878,14 +893,19 @@  static void hotplug_event_work(void *dat
 {
 	struct acpiphp_context *context = data;
 	acpi_handle handle = context->handle;
+	struct acpiphp_bridge *bridge = context->func.parent;
 
 	acpi_scan_lock_acquire();
+	lock_pci_remove_rescan();
 
-	hotplug_event(handle, type, context);
+	/* Bail out if the parent bridge is going away. */
+	if (!bridge->is_going_away)
+		hotplug_event(handle, type, context);
 
+	unlock_pci_remove_rescan();
 	acpi_scan_lock_release();
 	acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
-	put_bridge(context->func.parent);
+	put_bridge(bridge);
 }
 
 /**
@@ -1050,20 +1070,27 @@  void acpiphp_remove_slots(struct pci_bus
  */
 int acpiphp_enable_slot(struct acpiphp_slot *slot)
 {
-	mutex_lock(&slot->crit_sect);
-	/* configure all functions */
-	if (!(slot->flags & SLOT_ENABLED))
-		enable_slot(slot);
+	struct acpiphp_func *func;
+	int ret = -ENODEV;
 
-	mutex_unlock(&slot->crit_sect);
-	return 0;
+	lock_pci_remove_rescan();
+
+	func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
+	if (!func->parent->is_going_away) {
+		mutex_lock(&slot->crit_sect);
+		/* configure all functions */
+		if (!(slot->flags & SLOT_ENABLED))
+			enable_slot(slot);
+
+		mutex_unlock(&slot->crit_sect);
+		ret = 0;
+	}
+
+	unlock_pci_remove_rescan();
+	return ret;
 }
 
-/**
- * acpiphp_disable_and_eject_slot - power off and eject slot
- * @slot: ACPI PHP slot
- */
-int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+static int disable_and_eject_slot(struct acpiphp_slot *slot)
 {
 	struct acpiphp_func *func;
 	int retval = 0;
@@ -1087,6 +1114,25 @@  int acpiphp_disable_and_eject_slot(struc
 	return retval;
 }
 
+/**
+ * acpiphp_disable_and_eject_slot - power off and eject slot.
+ * @slot: ACPIPHP slot.
+ */
+int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+{
+	struct acpiphp_func *func;
+	int ret = -ENODEV;
+
+	lock_pci_remove_rescan();
+
+	func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
+	if (!func->parent->is_going_away)
+		ret = disable_and_eject_slot(slot);
+
+	unlock_pci_remove_rescan();
+	return ret;
+}
+
 
 /*
  * slot enabled:  1
Index: linux-pm/drivers/acpi/pci_root.c
===================================================================
--- linux-pm.orig/drivers/acpi/pci_root.c
+++ linux-pm/drivers/acpi/pci_root.c
@@ -616,6 +616,8 @@  static void acpi_pci_root_remove(struct
 {
 	struct acpi_pci_root *root = acpi_driver_data(device);
 
+	lock_pci_remove_rescan();
+
 	pci_stop_root_bus(root->bus);
 
 	device_set_run_wake(root->bus->bridge, false);
@@ -623,6 +625,8 @@  static void acpi_pci_root_remove(struct
 
 	pci_remove_root_bus(root->bus);
 
+	unlock_pci_remove_rescan();
+
 	kfree(root);
 }