Message ID | 33026900.FbCAopDPFz@vostro.rjw.lan (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > ... > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal > > The following are concurrency problems related to the PCI device > removal code in pci-sysfs.c and in ACPIPHP present in the current > mainline kernel: You've found a bunch of issues. I don't think there's anything to gain by fixing them all in a single patch, and I think it would be useful to split them out to help us think about them and find other places that have similar problems. > Scenario 1: pci_stop_and_remove_bus_device() run concurrently for > the same top device from remove_callback() in pci-sysfs.c and > from trim_stale_devices() in acpiphp_glue.c. > > In this scenario the second code path is executed without > pci_remove_rescan_mutex locked, so the &bus->devices list > walks in either trim_stale_devices() itself or in > acpiphp_check_bridge() can suffer from list corruption while the > first code path is executing pci_destroy_dev() for one of the > devices on those lists. Protecting &bus->devices is a generic problem, isn't it? There are about a zillion uses of it. Many are in the pcibios_fixup_bus() path. I think we can get rid of most of those by integrating the work into the pci_scan_device() path instead of doing it as a post-discovery fixup, but there will be several other cases left. If using pci_remove_rescan_mutex to protect &bus->devices is the right generic answer, we should document that and audit every place that uses the list. > Moreover, if any of the device objects in question is freed > after pci_destroy_dev() executed by the first code path, the > second code path may suffer a use-after-free problem while > trying to access that device object. > > Conversely, the second code path may execute pci_destroy_dev() > for one of the devices in question such that one of the > &bus->devices list walks in pci_stop_bus_device() > or pci_remove_bus_device() executed by the first code path will > suffer from a list corruption. > > Moreover, use-after-free is also possible if one of the device > objects in question is freed as a result of calling > pci_destroy_dev() by the second code path and then the first > code path tries to access it (the first code path only holds > an extra reference to the device it has been run for, but not > for its child devices). The use-after-free problems *sound* like a reference counting issue. Yinghai's patch [1] should fix some of this; how much is left after that? [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org > Scenario 2: ACPI hotplug event occurs for a device under a bridge > being removed by pci_stop_and_remove_bus_device() run from > remove_callback() in pci-sysfs.c. > > In that case it doesn't make sense to handle the hotplug event, > because the device in question will be removed anyway along with > its parent bridge and that will cause the context objects needed > for hotplug handling to be freed as well. > > Moreover, if the event is handled regardless, it may cause one > or more devices already removed by pci_stop_and_remove_bus_device() > to be added again by the code handling the event, which will > conflict with the bridge removal. We definitely need to serialize hotplug events from ACPI and sysfs (and other sources, like other hotplug drivers). Would that be enough? Adding the is_going_away flag is ACPI-specific and seems like sort of a point workaround. > Scenario 3: pci_stop_and_remove_bus_device() is run from > trim_stale_devices() (as a result of an ACPI hotplug event) in > parallel with dev_bus_rescan_store() or bus_rescan_store(), > or dev_rescan_store(). > > In that scenario the second code path may attempt to operate > on device objects being removed by the first code path which > may lead to many interesting types of breakage. > > Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI > host bridge removal event) in parallel with bus_rescan_store(), > dev_bus_rescan_store(), dev_rescan_store(), or remove_callback() > for any devices under the host bridge in question. > > In that case the same symptoms as in Scenarios 1 and 3 may occur > depending on which code path wins the races involved. Scenarios 3 and 4 sound like more cases of hotplug operations needing to be serialized, right? If we serialized them sufficiently, would there still be a problem? Using pci_remove_rescan_mutex would serialize *all* PCI hotplug operations, which is more than strictly necessary, but maybe there's no reason to do anything finer-grained. > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > for a device and its parent bridge via remove_callback(). > > In that case both code paths attempt to acquire > pci_remove_rescan_mutex. If the child device removal acquires > it first, there will be no problems. However, if the parent > bridge removal acquires it first, it will eventually execute > pci_destroy_dev() for the child device, but that device will > not be freed yet due to the reference held by the concurrent > child removal. Consequently, both pci_stop_bus_device() and > pci_remove_bus_device() will be executed for that device > unnecessarily and pci_destroy_dev() will see a corrupted list > head in that object. Moreover, an excess put_device() will > be executed for that device in that case which may lead to a > use-after-free in the final kobject_put() done by > sysfs_schedule_callback_work(). The corrupted list head should be fixed by Yinghai's patch [1]. Where is the extra put_device()? I see the kobject_get()/kobject_put() pair in sysfs_schedule_callback() and sysfs_schedule_callback_work(). Oh, I see -- the remove_store() -> remove_callback() path acquires no references, but it calls pci_stop_and_remove_bus_device(), which ultimately does the put_device() in pci_destroy_dev(). So if both the parent and the child removal manage to get to remove_callback() and the parent acquires pci_remove_rescan_mutex first, the child removal will do the extra put_device(). There are only six callers of device_schedule_callback(), and I think five of them are susceptible to this same problem: they are sysfs store methods, and they use device_schedule_callback() with a callback that does a put_device() on the device: drivers/pci/pci-sysfs.c: remove_store() drivers/scsi/scsi_sysfs.c: sdev_store_delete() arch/s390/pci/pci_sysfs.c: store_recover() drivers/s390/block/dcssblk.c: dcssblk_shared_store() drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store() I don't know what the right fix is, but adding "is_gone" to struct pci_dev only addresses one of the five places, of course. Bjorn > Scenario 6: ACPIPHP slot enabling/disabling triggered by the > slot's "power" attribute in parallel with device removal run > from remove_callback(). > > This scenario may lead to race conditions analogous to the > ones described in Scenario 1. It also may lead to situations > in which an already removed device under a bridge scheduled > for removal will be added which is analogous to Scenario 2. > > All of these scenarios are addressed by the patch below as follows. > > (1) To prevent the races in Scenarios 1 and 3 from happening hold > pci_remove_rescan_mutex around hotplug_event() in > hotplug_event_work(() (acpiphp_glue.c). > > (2) To prevent the races in Scenario 2 from happening, add an ACPIPHP > bridge flag is_going_away indicating that hotplug events should > be ignored for children below that bridge. That flag is set > by cleanup_bridge() that for non-root bridges should be run > under pci_remove_rescan_mutex (for root bridges it is only > run under acpi_scan_lock anyway). > > (3) To prevent the races in Scenario 4 from happening, hold > pci_remove_rescan_mutex around pci_stop_root_bus() and > pci_remove_root_bus() in acpi_pci_root_remove(). > > (4) To prevent the races in Scenario 5 from happening, add an new > is_gone flag to struct pci_dev that will be set by pci_destroy_dev() > and checked by pci_stop_and_remove_bus_device(). That only > covers cases in which pci_stop_and_remove_bus_device() is > run under pci_remove_rescan_mutex, but the other existing > cases need to be fixed to use that mutex anyway for other > reasons (analogous to Scenarios 1 and 3 above, for example). > > (5) To prevent the races in Scenario 6 from happening, add > the PCI remove/rescan locking to acpiphp_enable_slot() and > acpiphp_disable_and_eject_slot() and make these functions > check the slot's parent bridge status. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/acpi/pci_root.c | 4 ++ > drivers/pci/hotplug/acpiphp.h | 1 > drivers/pci/hotplug/acpiphp_glue.c | 74 ++++++++++++++++++++++++++++++------- > drivers/pci/pci-sysfs.c | 11 +++++ > drivers/pci/remove.c | 7 ++- > include/linux/pci.h | 3 + > 6 files changed, 84 insertions(+), 16 deletions(-) > > Index: linux-pm/include/linux/pci.h > =================================================================== > --- linux-pm.orig/include/linux/pci.h > +++ linux-pm/include/linux/pci.h > @@ -321,6 +321,7 @@ struct pci_dev { > unsigned int multifunction:1;/* Part of multi-function device */ > /* keep track of device state */ > unsigned int is_added:1; > + unsigned int is_gone:1; > unsigned int is_busmaster:1; /* device is busmaster */ > unsigned int no_msi:1; /* device may not use msi */ > unsigned int block_cfg_access:1; /* config space access is blocked */ > @@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_ > int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap); > unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge); > unsigned int pci_rescan_bus(struct pci_bus *bus); > +void lock_pci_remove_rescan(void); > +void unlock_pci_remove_rescan(void); > > /* Vital product data routines */ > ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf); > Index: linux-pm/drivers/pci/pci-sysfs.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-sysfs.c > +++ linux-pm/drivers/pci/pci-sysfs.c > @@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct > static DEVICE_ATTR_RW(msi_bus); > > static DEFINE_MUTEX(pci_remove_rescan_mutex); > + > +void lock_pci_remove_rescan(void) > +{ > + mutex_lock(&pci_remove_rescan_mutex); > +} > + > +void unlock_pci_remove_rescan(void) > +{ > + mutex_unlock(&pci_remove_rescan_mutex); > +} > + > static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf, > size_t count) > { > Index: linux-pm/drivers/pci/remove.c > =================================================================== > --- linux-pm.orig/drivers/pci/remove.c > +++ linux-pm/drivers/pci/remove.c > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev > > static void pci_destroy_dev(struct pci_dev *dev) > { > + dev->is_gone = 1; > device_del(&dev->dev); > > down_write(&pci_bus_sem); > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct > */ > void pci_stop_and_remove_bus_device(struct pci_dev *dev) > { > - pci_stop_bus_device(dev); > - pci_remove_bus_device(dev); > + if (!dev->is_gone) { > + pci_stop_bus_device(dev); > + pci_remove_bus_device(dev); > + } > } > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); > > Index: linux-pm/drivers/pci/hotplug/acpiphp.h > =================================================================== > --- linux-pm.orig/drivers/pci/hotplug/acpiphp.h > +++ linux-pm/drivers/pci/hotplug/acpiphp.h > @@ -71,6 +71,7 @@ struct acpiphp_bridge { > struct acpiphp_context *context; > > int nr_slots; > + bool is_going_away; > > /* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */ > struct pci_bus *pci_bus; > Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c > =================================================================== > --- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c > +++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c > @@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph > mutex_lock(&bridge_mutex); > list_del(&bridge->list); > mutex_unlock(&bridge_mutex); > + > + /* > + * For non-root bridges this flag is protected by the PCI remove/rescan > + * locking. For root bridges it is only operated under acpi_scan_lock > + * anyway. > + */ > + bridge->is_going_away = true; > } > > /** > @@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc > * > * Iterate over all slots under this bridge and make sure that if a > * card is present they are enabled, and if not they are disabled. > + * > + * For non-root bridges call under the PCI remove/rescan mutex. > */ > static void acpiphp_check_bridge(struct acpiphp_bridge *bridge) > { > struct acpiphp_slot *slot; > > + /* Bail out if the bridge is going away. */ > + if (bridge->is_going_away) > + return; > + > list_for_each_entry(slot, &bridge->slots, node) { > struct pci_bus *bus = slot->bus; > struct pci_dev *dev, *tmp; > @@ -807,6 +820,8 @@ void acpiphp_check_host_bridge(acpi_hand > } > } > > +static int disable_and_eject_slot(struct acpiphp_slot *slot); > + > static void hotplug_event(acpi_handle handle, u32 type, void *data) > { > struct acpiphp_context *context = data; > @@ -866,7 +881,7 @@ static void hotplug_event(acpi_handle ha > case ACPI_NOTIFY_EJECT_REQUEST: > /* request device eject */ > pr_debug("%s: Device eject notify on %s\n", __func__, objname); > - acpiphp_disable_and_eject_slot(func->slot); > + disable_and_eject_slot(func->slot); > break; > } > > @@ -878,14 +893,19 @@ static void hotplug_event_work(void *dat > { > struct acpiphp_context *context = data; > acpi_handle handle = context->handle; > + struct acpiphp_bridge *bridge = context->func.parent; > > acpi_scan_lock_acquire(); > + lock_pci_remove_rescan(); > > - hotplug_event(handle, type, context); > + /* Bail out if the parent bridge is going away. */ > + if (!bridge->is_going_away) > + hotplug_event(handle, type, context); > > + unlock_pci_remove_rescan(); > acpi_scan_lock_release(); > acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL); > - put_bridge(context->func.parent); > + put_bridge(bridge); > } > > /** > @@ -1050,20 +1070,27 @@ void acpiphp_remove_slots(struct pci_bus > */ > int acpiphp_enable_slot(struct acpiphp_slot *slot) > { > - mutex_lock(&slot->crit_sect); > - /* configure all functions */ > - if (!(slot->flags & SLOT_ENABLED)) > - enable_slot(slot); > + struct acpiphp_func *func; > + int ret = -ENODEV; > > - mutex_unlock(&slot->crit_sect); > - return 0; > + lock_pci_remove_rescan(); > + > + func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling); > + if (!func->parent->is_going_away) { > + mutex_lock(&slot->crit_sect); > + /* configure all functions */ > + if (!(slot->flags & SLOT_ENABLED)) > + enable_slot(slot); > + > + mutex_unlock(&slot->crit_sect); > + ret = 0; > + } > + > + unlock_pci_remove_rescan(); > + return ret; > } > > -/** > - * acpiphp_disable_and_eject_slot - power off and eject slot > - * @slot: ACPI PHP slot > - */ > -int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot) > +static int disable_and_eject_slot(struct acpiphp_slot *slot) > { > struct acpiphp_func *func; > int retval = 0; > @@ -1087,6 +1114,25 @@ int acpiphp_disable_and_eject_slot(struc > return retval; > } > > +/** > + * acpiphp_disable_and_eject_slot - power off and eject slot. > + * @slot: ACPIPHP slot. > + */ > +int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot) > +{ > + struct acpiphp_func *func; > + int ret = -ENODEV; > + > + lock_pci_remove_rescan(); > + > + func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling); > + if (!func->parent->is_going_away) > + ret = disable_and_eject_slot(slot); > + > + unlock_pci_remove_rescan(); > + return ret; > +} > + > > /* > * slot enabled: 1 > Index: linux-pm/drivers/acpi/pci_root.c > =================================================================== > --- linux-pm.orig/drivers/acpi/pci_root.c > +++ linux-pm/drivers/acpi/pci_root.c > @@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct > { > struct acpi_pci_root *root = acpi_driver_data(device); > > + lock_pci_remove_rescan(); > + > pci_stop_root_bus(root->bus); > > device_set_run_wake(root->bus->bridge, false); > @@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct > > pci_remove_root_bus(root->bus); > > + unlock_pci_remove_rescan(); > + > kfree(root); > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, December 05, 2013 03:40:39 PM Bjorn Helgaas wrote: > On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > ... > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal > > > > The following are concurrency problems related to the PCI device > > removal code in pci-sysfs.c and in ACPIPHP present in the current > > mainline kernel: > > You've found a bunch of issues. I don't think there's anything to > gain by fixing them all in a single patch, and I think it would be > useful to split them out to help us think about them and find other > places that have similar problems. The problem is that they are all related. Whatever we decide to do with one of them will likely affect how we deal with the others. > > Scenario 1: pci_stop_and_remove_bus_device() run concurrently for > > the same top device from remove_callback() in pci-sysfs.c and > > from trim_stale_devices() in acpiphp_glue.c. > > > > In this scenario the second code path is executed without > > pci_remove_rescan_mutex locked, so the &bus->devices list > > walks in either trim_stale_devices() itself or in > > acpiphp_check_bridge() can suffer from list corruption while the > > first code path is executing pci_destroy_dev() for one of the > > devices on those lists. > > Protecting &bus->devices is a generic problem, isn't it? Yes, it is. In fact, there are two sides of it. One is with read-only users (that is, walking the list without modifying it) which should be done under read-locked pci_bus_sem and that's quite obvious. It looks like the majority of users of that list are readers and they fall under this case. That case is addressed by write-locking pci_bus_sem around the dev->bus_list deletion in pci_destroy_dev() (it will wait for all readers to complete their walks as long as they use pci_bus_sem correcty, but we may as well replace that with SRCU). The second one is where someone is walking the list and *deleting* entires from it in the process, which has to be synchronized with other writers and the write-locking of pci_bus_sem in pci_destroy_dev() is not sufficient for that, because pci_bus_sem can't be read-locked around a bus->devices walk deleting entries from that list (exactly because pci_bus_sem is write-locked during device deletion). So this is a special case and to fix races between different code paths that walk bus->devices *and* delete entries from there we need a separate lock to be held around the entire list walk. Incidentally, pci_remove_rescan_mutex is already used in that context by the code in pci-sysfs.c, so it can be used for this purpose by all of the other code paths doing this thing, like for example acpiphp_check_bridge() and trim_stale_devices() (there's more). > There are about a zillion uses of it. Many are in the pcibios_fixup_bus() path. > I think we can get rid of most of those by integrating the work into > the pci_scan_device() path instead of doing it as a post-discovery > fixup, but there will be several other cases left. If using > pci_remove_rescan_mutex to protect &bus->devices is the right generic > answer, we should document that and audit every place that uses the > list. No, we don't have to do that as explained above. We only need it around code that walks the list in order to (possibly) delete entries from it and there are not too many of such places, so they can be audited quite easily. [Basically, wherever pci_stop_and_remove_bus_device() is called during a list walk over bus->devices.] And *if* we use pci_remove_rescan_mutex for that, it will *also* address some other synchronization problems. > > Moreover, if any of the device objects in question is freed > > after pci_destroy_dev() executed by the first code path, the > > second code path may suffer a use-after-free problem while > > trying to access that device object. > > > > Conversely, the second code path may execute pci_destroy_dev() > > for one of the devices in question such that one of the > > &bus->devices list walks in pci_stop_bus_device() > > or pci_remove_bus_device() executed by the first code path will > > suffer from a list corruption. > > > > Moreover, use-after-free is also possible if one of the device > > objects in question is freed as a result of calling > > pci_destroy_dev() by the second code path and then the first > > code path tries to access it (the first code path only holds > > an extra reference to the device it has been run for, but not > > for its child devices). > > The use-after-free problems *sound* like a reference counting issue. > Yinghai's patch [1] should fix some of this; how much is left after > that? > > [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org Hmm, no. Do you mean https://patchwork.kernel.org/patch/3261171/ ? This patch doesn't fix the problem either, however, because in only fixes the one case described in Scenario 5 below. Suppose that (1) trim_stale_devices() calls pci_stop_amd_remove_bus_devices(X) and (2) remove_callback() calls pci_stop_and_remove_bus_devices(X) (i.e. for the same device) at the same time. Suppose that X is a leaf device for simplicity and say that code path (2) is executed first. We run pci_stop_dev(X) and pci_destroy_dev(X), which does put_device(&X->dev) and then sysfs_schedule_callback_work() executes kobject_put() for the X's kobject. The device pointed to by X is gone at this point. Now, code path (1) runs and and crashes with a great bang trying to access X->subordinate in pci_stop_bus_device(). This happens regardless of whether or not the Yinghai's patch has been applied. QED Of course, you can argue that this is because code path (1) should have done a pci_dev_get(X) before running pci_stop_amd_remove_bus_devices(X) which would have avoided the crash. I can agree with that, but *if* code path (1) had held pci_remove_rescan_mutex around the whole trim_stale_devices(), then it wouldn't have had to do that pci_dev_get(X), because the entire race above wouldn't have been possible (code path (2) already holds that mutex around pci_stop_and_remove_bus_devices(X)). OK, you can ask, but why the heck does code path (1) need to hold pci_remove_rescan_mutex around trim_stale_devices()? Well, because it needs to synchronize the list walks in acpiphp_check_bridge() and trim_stale_devices() with the list_del() in pci_destroy_dev() executed by code path (2). That's why I'm proposing to acquire pci_remove_rescan_mutex in hotplug_event_work() and hold it around the whole hotplug_event(). Of course, analogous races are possible for acpiphp_enable_slot() and acpiphp_disable_and_eject_slot() and they may be addressed analogously - by holding pci_remove_rescan_mutex around possible modifications of the PCI device hierarchy. > > Scenario 2: ACPI hotplug event occurs for a device under a bridge > > being removed by pci_stop_and_remove_bus_device() run from > > remove_callback() in pci-sysfs.c. > > > > In that case it doesn't make sense to handle the hotplug event, > > because the device in question will be removed anyway along with > > its parent bridge and that will cause the context objects needed > > for hotplug handling to be freed as well. > > > > Moreover, if the event is handled regardless, it may cause one > > or more devices already removed by pci_stop_and_remove_bus_device() > > to be added again by the code handling the event, which will > > conflict with the bridge removal. > > We definitely need to serialize hotplug events from ACPI and sysfs > (and other sources, like other hotplug drivers). Would that be > enough? Adding the is_going_away flag is ACPI-specific and seems like > sort of a point workaround. Well, we basically need to prevent hotplug_event() from being run in that case, and since that function is ACPI-specific and the data structures hotplug_event_work() operates on are ACPI-specific, I'm using an ACPI-specific flag to achieve that goal. If you can suggest any approach that would not be ACPI-specific to address this particular case, I'll be happy to use it. :-) > > Scenario 3: pci_stop_and_remove_bus_device() is run from > > trim_stale_devices() (as a result of an ACPI hotplug event) in > > parallel with dev_bus_rescan_store() or bus_rescan_store(), > > or dev_rescan_store(). > > > > In that scenario the second code path may attempt to operate > > on device objects being removed by the first code path which > > may lead to many interesting types of breakage. > > > > Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI > > host bridge removal event) in parallel with bus_rescan_store(), > > dev_bus_rescan_store(), dev_rescan_store(), or remove_callback() > > for any devices under the host bridge in question. > > > > In that case the same symptoms as in Scenarios 1 and 3 may occur > > depending on which code path wins the races involved. > > Scenarios 3 and 4 sound like more cases of hotplug operations needing > to be serialized, right? If we serialized them sufficiently, would > there still be a problem? That depends on what exactly you mean by "sufficiently". That is, what's the goal? Different approaches may be sufficient to achieve specific goals, but not necessarily more than one of them. > Using pci_remove_rescan_mutex would > serialize *all* PCI hotplug operations, which is more than strictly > necessary, but maybe there's no reason to do anything finer-grained. My answer to this is yes, using pci_remove_rescan_mutex *will* serialize all PCI hotplug operations, which will be a very good first step. Having done that, we can see how much we can relax things and how far we *want* to go with that. > > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > > for a device and its parent bridge via remove_callback(). > > > > In that case both code paths attempt to acquire > > pci_remove_rescan_mutex. If the child device removal acquires > > it first, there will be no problems. However, if the parent > > bridge removal acquires it first, it will eventually execute > > pci_destroy_dev() for the child device, but that device will > > not be freed yet due to the reference held by the concurrent > > child removal. Consequently, both pci_stop_bus_device() and > > pci_remove_bus_device() will be executed for that device > > unnecessarily and pci_destroy_dev() will see a corrupted list > > head in that object. Moreover, an excess put_device() will > > be executed for that device in that case which may lead to a > > use-after-free in the final kobject_put() done by > > sysfs_schedule_callback_work(). > > The corrupted list head should be fixed by Yinghai's patch [1]. I think you really mean https://patchwork.kernel.org/patch/3261171/ :-) Generally, patches in that series would mitigate that problem somewhat. I actually stole the Yinghai's idea with the new PCI device flag (sorry, Yinghai), but I think it's better to check that flag to start with in pci_stop_and_remove_bus_device(), because then we avoid calling pci_stop_bus_device() unnecessarily as well (surely, the device was stopped before it has gone, wasn't it?). > Where is the extra put_device()? I see the > kobject_get()/kobject_put() pair in sysfs_schedule_callback() and > sysfs_schedule_callback_work(). Oh, I see -- the remove_store() -> > remove_callback() path acquires no references, but it calls > pci_stop_and_remove_bus_device(), which ultimately does the > put_device() in pci_destroy_dev(). > > So if both the parent and the child removal manage to get to > remove_callback() and the parent acquires pci_remove_rescan_mutex > first, the child removal will do the extra put_device(). > > There are only six callers of device_schedule_callback(), and I think > five of them are susceptible to this same problem: they are sysfs > store methods, and they use device_schedule_callback() with a callback > that does a put_device() on the device: > > drivers/pci/pci-sysfs.c: remove_store() > drivers/scsi/scsi_sysfs.c: sdev_store_delete() > arch/s390/pci/pci_sysfs.c: store_recover() > drivers/s390/block/dcssblk.c: dcssblk_shared_store() > drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store() > > I don't know what the right fix is, but adding "is_gone" to struct > pci_dev only addresses one of the five places, of course. That's correct. I'm not sure if there *is* a generic solution to this problem, because sysfs_schedule_callback_work() doesn't know what the callback function is going to do. In principle it can look at ss->kobj->parent and skip the execution of ss->func if that is NULL (which means that the kobject has been deleted and it is likely the last thing holding a reference to that kobject), but that still will be racy without extra synchronization in the users of device_schedule_callback(). Which probably means that device_schedule_callback() is ill concieved in the first place, because it can't really do the work it is designed for in general. So perhaps it's better to use something simpler and PCI-specific for PCI and analogously in the other cases you mentioned. OK To be a bit more constructive, as the next step I'd try to use pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said above) without making the other changes made by my patch. Does that sound reasonable? Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 5, 2013 at 5:21 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> >> The use-after-free problems *sound* like a reference counting issue. >> Yinghai's patch [1] should fix some of this; how much is left after >> that? >> >> [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org > > Hmm, no. Do you mean https://patchwork.kernel.org/patch/3261171/ ? should be https://patchwork.kernel.org/patch/3261001/ [v3,03/12] PCI: Move resources and bus_list releasing to pci_release_dev move down list_del(&pci_dev->bus_list). Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > for a device and its parent bridge via remove_callback(). > > In that case both code paths attempt to acquire > pci_remove_rescan_mutex. If the child device removal acquires > it first, there will be no problems. However, if the parent > bridge removal acquires it first, it will eventually execute > pci_destroy_dev() for the child device, but that device will > not be freed yet due to the reference held by the concurrent > child removal. Consequently, both pci_stop_bus_device() and > pci_remove_bus_device() will be executed for that device > unnecessarily and pci_destroy_dev() will see a corrupted list > head in that object. Moreover, an excess put_device() will > be executed for that device in that case which may lead to a > use-after-free in the final kobject_put() done by > sysfs_schedule_callback_work(). > > Index: linux-pm/include/linux/pci.h > =================================================================== > --- linux-pm.orig/include/linux/pci.h > +++ linux-pm/include/linux/pci.h > @@ -321,6 +321,7 @@ struct pci_dev { > unsigned int multifunction:1;/* Part of multi-function device */ > /* keep track of device state */ > unsigned int is_added:1; > + unsigned int is_gone:1; > unsigned int is_busmaster:1; /* device is busmaster */ > unsigned int no_msi:1; /* device may not use msi */ > unsigned int block_cfg_access:1; /* config space access is blocked */ > Index: linux-pm/drivers/pci/remove.c > =================================================================== > --- linux-pm.orig/drivers/pci/remove.c > +++ linux-pm/drivers/pci/remove.c > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev > > static void pci_destroy_dev(struct pci_dev *dev) > { > + dev->is_gone = 1; > device_del(&dev->dev); > > down_write(&pci_bus_sem); > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct > */ > void pci_stop_and_remove_bus_device(struct pci_dev *dev) > { > - pci_stop_bus_device(dev); > - pci_remove_bus_device(dev); > + if (!dev->is_gone) { > + pci_stop_bus_device(dev); > + pci_remove_bus_device(dev); > + } > } > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); > Yes, above change should address sys double remove problem. Thanks Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: > On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > > for a device and its parent bridge via remove_callback(). > > > > In that case both code paths attempt to acquire > > pci_remove_rescan_mutex. If the child device removal acquires > > it first, there will be no problems. However, if the parent > > bridge removal acquires it first, it will eventually execute > > pci_destroy_dev() for the child device, but that device will > > not be freed yet due to the reference held by the concurrent > > child removal. Consequently, both pci_stop_bus_device() and > > pci_remove_bus_device() will be executed for that device > > unnecessarily and pci_destroy_dev() will see a corrupted list > > head in that object. Moreover, an excess put_device() will > > be executed for that device in that case which may lead to a > > use-after-free in the final kobject_put() done by > > sysfs_schedule_callback_work(). > > > > Index: linux-pm/include/linux/pci.h > > =================================================================== > > --- linux-pm.orig/include/linux/pci.h > > +++ linux-pm/include/linux/pci.h > > @@ -321,6 +321,7 @@ struct pci_dev { > > unsigned int multifunction:1;/* Part of multi-function device */ > > /* keep track of device state */ > > unsigned int is_added:1; > > + unsigned int is_gone:1; > > unsigned int is_busmaster:1; /* device is busmaster */ > > unsigned int no_msi:1; /* device may not use msi */ > > unsigned int block_cfg_access:1; /* config space access is blocked */ > > Index: linux-pm/drivers/pci/remove.c > > =================================================================== > > --- linux-pm.orig/drivers/pci/remove.c > > +++ linux-pm/drivers/pci/remove.c > > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev > > > > static void pci_destroy_dev(struct pci_dev *dev) > > { > > + dev->is_gone = 1; > > device_del(&dev->dev); > > > > down_write(&pci_bus_sem); > > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct > > */ > > void pci_stop_and_remove_bus_device(struct pci_dev *dev) > > { > > - pci_stop_bus_device(dev); > > - pci_remove_bus_device(dev); > > + if (!dev->is_gone) { > > + pci_stop_bus_device(dev); > > + pci_remove_bus_device(dev); > > + } > > } > > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); > > > > Yes, above change should address sys double remove problem. I've just realized that we don't need a new flag for that, though. It looks like we only need to check dev->dev.kobj.parent and return if that is NULL, because that means pci_destroy_dev() has run for that device already (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like it should do that?). Of course, that still is going to be racy if we don't hold pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code path using it (or use another similar synchronization mechanism). Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[+ GregKH] On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> > >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently >> > for a device and its parent bridge via remove_callback(). >> > >> > In that case both code paths attempt to acquire >> > pci_remove_rescan_mutex. If the child device removal acquires >> > it first, there will be no problems. However, if the parent >> > bridge removal acquires it first, it will eventually execute >> > pci_destroy_dev() for the child device, but that device will >> > not be freed yet due to the reference held by the concurrent >> > child removal. Consequently, both pci_stop_bus_device() and >> > pci_remove_bus_device() will be executed for that device >> > unnecessarily and pci_destroy_dev() will see a corrupted list >> > head in that object. Moreover, an excess put_device() will >> > be executed for that device in that case which may lead to a >> > use-after-free in the final kobject_put() done by >> > sysfs_schedule_callback_work(). >> > >> > Index: linux-pm/include/linux/pci.h >> > =================================================================== >> > --- linux-pm.orig/include/linux/pci.h >> > +++ linux-pm/include/linux/pci.h >> > @@ -321,6 +321,7 @@ struct pci_dev { >> > unsigned int multifunction:1;/* Part of multi-function device */ >> > /* keep track of device state */ >> > unsigned int is_added:1; >> > + unsigned int is_gone:1; >> > unsigned int is_busmaster:1; /* device is busmaster */ >> > unsigned int no_msi:1; /* device may not use msi */ >> > unsigned int block_cfg_access:1; /* config space access is blocked */ >> > Index: linux-pm/drivers/pci/remove.c >> > =================================================================== >> > --- linux-pm.orig/drivers/pci/remove.c >> > +++ linux-pm/drivers/pci/remove.c >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev >> > >> > static void pci_destroy_dev(struct pci_dev *dev) >> > { >> > + dev->is_gone = 1; >> > device_del(&dev->dev); >> > >> > down_write(&pci_bus_sem); >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct >> > */ >> > void pci_stop_and_remove_bus_device(struct pci_dev *dev) >> > { >> > - pci_stop_bus_device(dev); >> > - pci_remove_bus_device(dev); >> > + if (!dev->is_gone) { >> > + pci_stop_bus_device(dev); >> > + pci_remove_bus_device(dev); >> > + } >> > } >> > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); >> > >> >> Yes, above change should address sys double remove problem. > > I've just realized that we don't need a new flag for that, though. > > It looks like we only need to check dev->dev.kobj.parent and return if that is > NULL, because that means pci_destroy_dev() has run for that device already > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like > it should do that?). > > Of course, that still is going to be racy if we don't hold > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code > path using it (or use another similar synchronization mechanism). Wonder if we can have safe way to check if device_del() is called already. And those access_after_free should be addressed by driver core instead of pci code? Thanks Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote: > [+ GregKH] > > On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: > >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > >> > > >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > >> > for a device and its parent bridge via remove_callback(). > >> > > >> > In that case both code paths attempt to acquire > >> > pci_remove_rescan_mutex. If the child device removal acquires > >> > it first, there will be no problems. However, if the parent > >> > bridge removal acquires it first, it will eventually execute > >> > pci_destroy_dev() for the child device, but that device will > >> > not be freed yet due to the reference held by the concurrent > >> > child removal. Consequently, both pci_stop_bus_device() and > >> > pci_remove_bus_device() will be executed for that device > >> > unnecessarily and pci_destroy_dev() will see a corrupted list > >> > head in that object. Moreover, an excess put_device() will > >> > be executed for that device in that case which may lead to a > >> > use-after-free in the final kobject_put() done by > >> > sysfs_schedule_callback_work(). > >> > > >> > Index: linux-pm/include/linux/pci.h > >> > =================================================================== > >> > --- linux-pm.orig/include/linux/pci.h > >> > +++ linux-pm/include/linux/pci.h > >> > @@ -321,6 +321,7 @@ struct pci_dev { > >> > unsigned int multifunction:1;/* Part of multi-function device */ > >> > /* keep track of device state */ > >> > unsigned int is_added:1; > >> > + unsigned int is_gone:1; > >> > unsigned int is_busmaster:1; /* device is busmaster */ > >> > unsigned int no_msi:1; /* device may not use msi */ > >> > unsigned int block_cfg_access:1; /* config space access is blocked */ > >> > Index: linux-pm/drivers/pci/remove.c > >> > =================================================================== > >> > --- linux-pm.orig/drivers/pci/remove.c > >> > +++ linux-pm/drivers/pci/remove.c > >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev > >> > > >> > static void pci_destroy_dev(struct pci_dev *dev) > >> > { > >> > + dev->is_gone = 1; > >> > device_del(&dev->dev); > >> > > >> > down_write(&pci_bus_sem); > >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct > >> > */ > >> > void pci_stop_and_remove_bus_device(struct pci_dev *dev) > >> > { > >> > - pci_stop_bus_device(dev); > >> > - pci_remove_bus_device(dev); > >> > + if (!dev->is_gone) { > >> > + pci_stop_bus_device(dev); > >> > + pci_remove_bus_device(dev); > >> > + } > >> > } > >> > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); > >> > > >> > >> Yes, above change should address sys double remove problem. > > > > I've just realized that we don't need a new flag for that, though. > > > > It looks like we only need to check dev->dev.kobj.parent and return if that is > > NULL, because that means pci_destroy_dev() has run for that device already > > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like > > it should do that?). > > > > Of course, that still is going to be racy if we don't hold > > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code > > path using it (or use another similar synchronization mechanism). > > Wonder if we can have safe way to check if device_del() is called already. Nope. > And those access_after_free should be addressed by driver core instead > of pci code? Nope, it's up to the bus to handle this. It shouldn't be hard, you shouldn't actually care about this, if you do, something is wrong. How is this PCI code so hard to get right? Look at USB for devices that disappear from anywhere at anytime as an example for how to handle this. PCI should be doing the same thing, no need for this "is_gone" stuff. greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote: >> [+ GregKH] >> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> >> > >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently >> >> > for a device and its parent bridge via remove_callback(). >> >> > >> >> > In that case both code paths attempt to acquire >> >> > pci_remove_rescan_mutex. If the child device removal acquires >> >> > it first, there will be no problems. However, if the parent >> >> > bridge removal acquires it first, it will eventually execute >> >> > pci_destroy_dev() for the child device, but that device will >> >> > not be freed yet due to the reference held by the concurrent >> >> > child removal. Consequently, both pci_stop_bus_device() and >> >> > pci_remove_bus_device() will be executed for that device >> >> > unnecessarily and pci_destroy_dev() will see a corrupted list >> >> > head in that object. Moreover, an excess put_device() will >> >> > be executed for that device in that case which may lead to a >> >> > use-after-free in the final kobject_put() done by >> >> > sysfs_schedule_callback_work(). >> >> > >> >> > Index: linux-pm/include/linux/pci.h >> >> > =================================================================== >> >> > --- linux-pm.orig/include/linux/pci.h >> >> > +++ linux-pm/include/linux/pci.h >> >> > @@ -321,6 +321,7 @@ struct pci_dev { >> >> > unsigned int multifunction:1;/* Part of multi-function device */ >> >> > /* keep track of device state */ >> >> > unsigned int is_added:1; >> >> > + unsigned int is_gone:1; >> >> > unsigned int is_busmaster:1; /* device is busmaster */ >> >> > unsigned int no_msi:1; /* device may not use msi */ >> >> > unsigned int block_cfg_access:1; /* config space access is blocked */ >> >> > Index: linux-pm/drivers/pci/remove.c >> >> > =================================================================== >> >> > --- linux-pm.orig/drivers/pci/remove.c >> >> > +++ linux-pm/drivers/pci/remove.c >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev >> >> > >> >> > static void pci_destroy_dev(struct pci_dev *dev) >> >> > { >> >> > + dev->is_gone = 1; >> >> > device_del(&dev->dev); >> >> > >> >> > down_write(&pci_bus_sem); >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct >> >> > */ >> >> > void pci_stop_and_remove_bus_device(struct pci_dev *dev) >> >> > { >> >> > - pci_stop_bus_device(dev); >> >> > - pci_remove_bus_device(dev); >> >> > + if (!dev->is_gone) { >> >> > + pci_stop_bus_device(dev); >> >> > + pci_remove_bus_device(dev); >> >> > + } >> >> > } >> >> > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); >> >> > >> >> >> >> Yes, above change should address sys double remove problem. >> > >> > I've just realized that we don't need a new flag for that, though. >> > >> > It looks like we only need to check dev->dev.kobj.parent and return if that is >> > NULL, because that means pci_destroy_dev() has run for that device already >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like >> > it should do that?). >> > >> > Of course, that still is going to be racy if we don't hold >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code >> > path using it (or use another similar synchronization mechanism). >> >> Wonder if we can have safe way to check if device_del() is called already. > > Nope. > >> And those access_after_free should be addressed by driver core instead >> of pci code? > > Nope, it's up to the bus to handle this. It shouldn't be hard, you > shouldn't actually care about this, if you do, something is wrong. > > How is this PCI code so hard to get right? Look at USB for devices that > disappear from anywhere at anytime as an example for how to handle > this. PCI should be doing the same thing, no need for this "is_gone" > stuff. Greg, Don't agree USB is a good example to follow, do you never hit panic when you pull out USB device from anywhere at anytime without unmount or stop it via command ? that is not truth. the truth is none regards it as enterprise level interface to attach devices. Is there a feature for an USB disk to tell the host you want to pull out it and should sync all the data in cache and unmount the files system then power it off ? What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ? Thanks, Ethan > > greg k-h > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote: > On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote: > >> [+ GregKH] > >> > >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: > >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > >> >> > > >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently > >> >> > for a device and its parent bridge via remove_callback(). > >> >> > > >> >> > In that case both code paths attempt to acquire > >> >> > pci_remove_rescan_mutex. If the child device removal acquires > >> >> > it first, there will be no problems. However, if the parent > >> >> > bridge removal acquires it first, it will eventually execute > >> >> > pci_destroy_dev() for the child device, but that device will > >> >> > not be freed yet due to the reference held by the concurrent > >> >> > child removal. Consequently, both pci_stop_bus_device() and > >> >> > pci_remove_bus_device() will be executed for that device > >> >> > unnecessarily and pci_destroy_dev() will see a corrupted list > >> >> > head in that object. Moreover, an excess put_device() will > >> >> > be executed for that device in that case which may lead to a > >> >> > use-after-free in the final kobject_put() done by > >> >> > sysfs_schedule_callback_work(). > >> >> > > >> >> > Index: linux-pm/include/linux/pci.h > >> >> > =================================================================== > >> >> > --- linux-pm.orig/include/linux/pci.h > >> >> > +++ linux-pm/include/linux/pci.h > >> >> > @@ -321,6 +321,7 @@ struct pci_dev { > >> >> > unsigned int multifunction:1;/* Part of multi-function device */ > >> >> > /* keep track of device state */ > >> >> > unsigned int is_added:1; > >> >> > + unsigned int is_gone:1; > >> >> > unsigned int is_busmaster:1; /* device is busmaster */ > >> >> > unsigned int no_msi:1; /* device may not use msi */ > >> >> > unsigned int block_cfg_access:1; /* config space access is blocked */ > >> >> > Index: linux-pm/drivers/pci/remove.c > >> >> > =================================================================== > >> >> > --- linux-pm.orig/drivers/pci/remove.c > >> >> > +++ linux-pm/drivers/pci/remove.c > >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev > >> >> > > >> >> > static void pci_destroy_dev(struct pci_dev *dev) > >> >> > { > >> >> > + dev->is_gone = 1; > >> >> > device_del(&dev->dev); > >> >> > > >> >> > down_write(&pci_bus_sem); > >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct > >> >> > */ > >> >> > void pci_stop_and_remove_bus_device(struct pci_dev *dev) > >> >> > { > >> >> > - pci_stop_bus_device(dev); > >> >> > - pci_remove_bus_device(dev); > >> >> > + if (!dev->is_gone) { > >> >> > + pci_stop_bus_device(dev); > >> >> > + pci_remove_bus_device(dev); > >> >> > + } > >> >> > } > >> >> > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); > >> >> > > >> >> > >> >> Yes, above change should address sys double remove problem. > >> > > >> > I've just realized that we don't need a new flag for that, though. > >> > > >> > It looks like we only need to check dev->dev.kobj.parent and return if that is > >> > NULL, because that means pci_destroy_dev() has run for that device already > >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like > >> > it should do that?). > >> > > >> > Of course, that still is going to be racy if we don't hold > >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code > >> > path using it (or use another similar synchronization mechanism). > >> > >> Wonder if we can have safe way to check if device_del() is called already. > > > > Nope. > > > >> And those access_after_free should be addressed by driver core instead > >> of pci code? > > > > Nope, it's up to the bus to handle this. It shouldn't be hard, you > > shouldn't actually care about this, if you do, something is wrong. > > > > How is this PCI code so hard to get right? Look at USB for devices that > > disappear from anywhere at anytime as an example for how to handle > > this. PCI should be doing the same thing, no need for this "is_gone" > > stuff. > Greg, > > Don't agree USB is a good example to follow, do you never hit panic > when you pull out USB device from anywhere at anytime without unmount > or stop it via command ? You shouldn't. If you do, it's a bug, let us know and we will fix it. > that is not truth. the truth is none regards it as enterprise level > interface to attach devices. Huh? > Is there a feature for an USB disk to tell the host you want to pull > out it and should sync all the data in cache and unmount the files > system then power it off ? Nope, neither is there one for when I yank out my PCI storage device without telling the OS about it either. Everything better "just work", with the exception of any lost data that might be in flight. > What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ? I don't understand what that means at all. We have USB network ethernet devices, I have a USB 3.0 one here that works really well. Infiniband is merely a transport, with some "verbs" on top of it, that has nothing to do with PCI other than you can have a IB PCI controller in the system. And I have a USB graphics adapter here that works just fine as well (people chain lots of them on one system.) So how does this apply to PCI at all? It's the same thing, you have to be able to handle a PCI device going away at any point in time, with or without telling the OS ahead of time that you are going to remove it. greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 10, 2013 at 3:08 AM, Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote: >> On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman >> <gregkh@linuxfoundation.org> wrote: >> > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote: >> >> [+ GregKH] >> >> >> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote: >> >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> >> >> > >> >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently >> >> >> > for a device and its parent bridge via remove_callback(). >> >> >> > >> >> >> > In that case both code paths attempt to acquire >> >> >> > pci_remove_rescan_mutex. If the child device removal acquires >> >> >> > it first, there will be no problems. However, if the parent >> >> >> > bridge removal acquires it first, it will eventually execute >> >> >> > pci_destroy_dev() for the child device, but that device will >> >> >> > not be freed yet due to the reference held by the concurrent >> >> >> > child removal. Consequently, both pci_stop_bus_device() and >> >> >> > pci_remove_bus_device() will be executed for that device >> >> >> > unnecessarily and pci_destroy_dev() will see a corrupted list >> >> >> > head in that object. Moreover, an excess put_device() will >> >> >> > be executed for that device in that case which may lead to a >> >> >> > use-after-free in the final kobject_put() done by >> >> >> > sysfs_schedule_callback_work(). >> >> >> > >> >> >> > Index: linux-pm/include/linux/pci.h >> >> >> > =================================================================== >> >> >> > --- linux-pm.orig/include/linux/pci.h >> >> >> > +++ linux-pm/include/linux/pci.h >> >> >> > @@ -321,6 +321,7 @@ struct pci_dev { >> >> >> > unsigned int multifunction:1;/* Part of multi-function device */ >> >> >> > /* keep track of device state */ >> >> >> > unsigned int is_added:1; >> >> >> > + unsigned int is_gone:1; >> >> >> > unsigned int is_busmaster:1; /* device is busmaster */ >> >> >> > unsigned int no_msi:1; /* device may not use msi */ >> >> >> > unsigned int block_cfg_access:1; /* config space access is blocked */ >> >> >> > Index: linux-pm/drivers/pci/remove.c >> >> >> > =================================================================== >> >> >> > --- linux-pm.orig/drivers/pci/remove.c >> >> >> > +++ linux-pm/drivers/pci/remove.c >> >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev >> >> >> > >> >> >> > static void pci_destroy_dev(struct pci_dev *dev) >> >> >> > { >> >> >> > + dev->is_gone = 1; >> >> >> > device_del(&dev->dev); >> >> >> > >> >> >> > down_write(&pci_bus_sem); >> >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct >> >> >> > */ >> >> >> > void pci_stop_and_remove_bus_device(struct pci_dev *dev) >> >> >> > { >> >> >> > - pci_stop_bus_device(dev); >> >> >> > - pci_remove_bus_device(dev); >> >> >> > + if (!dev->is_gone) { >> >> >> > + pci_stop_bus_device(dev); >> >> >> > + pci_remove_bus_device(dev); >> >> >> > + } >> >> >> > } >> >> >> > EXPORT_SYMBOL(pci_stop_and_remove_bus_device); >> >> >> > >> >> >> >> >> >> Yes, above change should address sys double remove problem. >> >> > >> >> > I've just realized that we don't need a new flag for that, though. >> >> > >> >> > It looks like we only need to check dev->dev.kobj.parent and return if that is >> >> > NULL, because that means pci_destroy_dev() has run for that device already >> >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like >> >> > it should do that?). >> >> > >> >> > Of course, that still is going to be racy if we don't hold >> >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code >> >> > path using it (or use another similar synchronization mechanism). >> >> >> >> Wonder if we can have safe way to check if device_del() is called already. >> > >> > Nope. >> > >> >> And those access_after_free should be addressed by driver core instead >> >> of pci code? >> > >> > Nope, it's up to the bus to handle this. It shouldn't be hard, you >> > shouldn't actually care about this, if you do, something is wrong. >> > >> > How is this PCI code so hard to get right? Look at USB for devices that >> > disappear from anywhere at anytime as an example for how to handle >> > this. PCI should be doing the same thing, no need for this "is_gone" >> > stuff. >> Greg, >> >> Don't agree USB is a good example to follow, do you never hit panic >> when you pull out USB device from anywhere at anytime without unmount >> or stop it via command ? > > You shouldn't. If you do, it's a bug, let us know and we will fix it. Of coz, next time hit, bore you with a calltrace. > >> that is not truth. the truth is none regards it as enterprise level >> interface to attach devices. > > Huh? USB 3.0 still not fast enough for enterprise level. > >> Is there a feature for an USB disk to tell the host you want to pull >> out it and should sync all the data in cache and unmount the files >> system then power it off ? > > Nope, neither is there one for when I yank out my PCI storage device > without telling the OS about it either. Everything better "just work", > with the exception of any lost data that might be in flight. To a desktop, you do have option to issue 'sync, umount' and pull out the device and it 'just work', To a server, someone wouldn't stand for any data lost in flight. USB need additional feature added for you to tell udev sync data etc without a console in hand. > >> What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ? > > I don't understand what that means at all. Don't be sleepy, man, you know USB is not powerful enough today, just as you said, someday, all the outdated thing will go away, just like those ISA, VESA, we don't care the low level data link layer anymore. but today, PCIe is still a little more complex/out than USB to handle. Thanks, Ethan > > We have USB network ethernet devices, I have a USB 3.0 one here that > works really well. Infiniband is merely a transport, with some "verbs" > on top of it, that has nothing to do with PCI other than you can have a > IB PCI controller in the system. And I have a USB graphics adapter here > that works just fine as well (people chain lots of them on one system.) > > So how does this apply to PCI at all? It's the same thing, you have to > be able to handle a PCI device going away at any point in time, with or > without telling the OS ahead of time that you are going to remove it. > > greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for platform/x86 and Konrad for Xen] On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote: [...] > > OK > > To be a bit more constructive, as the next step I'd try to use > pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said > above) without making the other changes made by my patch. Does that sound > reasonable? Well, no answer here, so as a followup, a series implementing that idea follows. I *hope* I found all of the places that need to be synchronized vs the bus rescan and device removal that can be triggered via sysfs, but I might overlook something. Also in some cases I wasn't quite sure how much stuff to put under the lock, because said stuff is not exactly straightforward. Enjoy! Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 10, 2014 at 03:20:44PM +0100, Rafael J. Wysocki wrote: > [Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for > platform/x86 and Konrad for Xen] > > On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote: > > [...] > > > > > OK > > > > To be a bit more constructive, as the next step I'd try to use > > pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said > > above) without making the other changes made by my patch. Does that sound > > reasonable? > > Well, no answer here, so as a followup, a series implementing that idea > follows. > > I *hope* I found all of the places that need to be synchronized vs the bus > rescan and device removal that can be triggered via sysfs, but I might overlook > something. Also in some cases I wasn't quite sure how much stuff to put under > the lock, because said stuff is not exactly straightforward. I applied this series to my pci/locking branch for v3.14. It should appear in -next tomorrow. Note that this touches some areas that are not strictly PCI, so speak up if I'm treading on your toes: arch/powerpc/kernel/eeh_driver.c | 19 ++++++++++++-- drivers/acpi/pci_root.c | 6 ++++ drivers/message/fusion/mptbase.c | 2 - drivers/pci/hotplug/acpiphp.h | 5 +++ drivers/pci/hotplug/acpiphp_core.c | 2 - drivers/pci/hotplug/acpiphp_glue.c | 43 +++++++++++++++++++++++++++++---- drivers/pci/hotplug/cpci_hotplug_pci.c | 14 +++++++++- drivers/pci/hotplug/cpqphp_pci.c | 8 +++++- drivers/pci/hotplug/ibmphp_core.c | 13 ++++++++- drivers/pci/hotplug/pciehp_pci.c | 17 +++++++++---- drivers/pci/hotplug/rpadlpar_core.c | 19 ++++++++++---- drivers/pci/hotplug/rpaphp_core.c | 4 +++ drivers/pci/hotplug/s390_pci_hpc.c | 4 ++- drivers/pci/hotplug/sgi_hotplug.c | 5 +++ drivers/pci/hotplug/shpchp_pci.c | 18 ++++++++++--- drivers/pci/pci-sysfs.c | 19 +++++--------- drivers/pci/probe.c | 18 +++++++++++++ drivers/pci/remove.c | 11 ++++++++ drivers/pci/xen-pcifront.c | 8 ++++++ drivers/pcmcia/cardbus.c | 7 +++++ drivers/platform/x86/asus-wmi.c | 2 + drivers/platform/x86/eeepc-laptop.c | 2 + drivers/scsi/mpt2sas/mpt2sas_base.c | 2 - drivers/scsi/mpt3sas/mpt3sas_base.c | 2 - include/linux/pci.h | 3 ++ -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-pm/include/linux/pci.h =================================================================== --- linux-pm.orig/include/linux/pci.h +++ linux-pm/include/linux/pci.h @@ -321,6 +321,7 @@ struct pci_dev { unsigned int multifunction:1;/* Part of multi-function device */ /* keep track of device state */ unsigned int is_added:1; + unsigned int is_gone:1; unsigned int is_busmaster:1; /* device is busmaster */ unsigned int no_msi:1; /* device may not use msi */ unsigned int block_cfg_access:1; /* config space access is blocked */ @@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_ int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap); unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge); unsigned int pci_rescan_bus(struct pci_bus *bus); +void lock_pci_remove_rescan(void); +void unlock_pci_remove_rescan(void); /* Vital product data routines */ ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf); Index: linux-pm/drivers/pci/pci-sysfs.c =================================================================== --- linux-pm.orig/drivers/pci/pci-sysfs.c +++ linux-pm/drivers/pci/pci-sysfs.c @@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct static DEVICE_ATTR_RW(msi_bus); static DEFINE_MUTEX(pci_remove_rescan_mutex); + +void lock_pci_remove_rescan(void) +{ + mutex_lock(&pci_remove_rescan_mutex); +} + +void unlock_pci_remove_rescan(void) +{ + mutex_unlock(&pci_remove_rescan_mutex); +} + static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf, size_t count) { Index: linux-pm/drivers/pci/remove.c =================================================================== --- linux-pm.orig/drivers/pci/remove.c +++ linux-pm/drivers/pci/remove.c @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev static void pci_destroy_dev(struct pci_dev *dev) { + dev->is_gone = 1; device_del(&dev->dev); down_write(&pci_bus_sem); @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct */ void pci_stop_and_remove_bus_device(struct pci_dev *dev) { - pci_stop_bus_device(dev); - pci_remove_bus_device(dev); + if (!dev->is_gone) { + pci_stop_bus_device(dev); + pci_remove_bus_device(dev); + } } EXPORT_SYMBOL(pci_stop_and_remove_bus_device); Index: linux-pm/drivers/pci/hotplug/acpiphp.h =================================================================== --- linux-pm.orig/drivers/pci/hotplug/acpiphp.h +++ linux-pm/drivers/pci/hotplug/acpiphp.h @@ -71,6 +71,7 @@ struct acpiphp_bridge { struct acpiphp_context *context; int nr_slots; + bool is_going_away; /* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */ struct pci_bus *pci_bus; Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c =================================================================== --- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c +++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c @@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph mutex_lock(&bridge_mutex); list_del(&bridge->list); mutex_unlock(&bridge_mutex); + + /* + * For non-root bridges this flag is protected by the PCI remove/rescan + * locking. For root bridges it is only operated under acpi_scan_lock + * anyway. + */ + bridge->is_going_away = true; } /** @@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc * * Iterate over all slots under this bridge and make sure that if a * card is present they are enabled, and if not they are disabled. + * + * For non-root bridges call under the PCI remove/rescan mutex. */ static void acpiphp_check_bridge(struct acpiphp_bridge *bridge) { struct acpiphp_slot *slot; + /* Bail out if the bridge is going away. */ + if (bridge->is_going_away) + return; + list_for_each_entry(slot, &bridge->slots, node) { struct pci_bus *bus = slot->bus; struct pci_dev *dev, *tmp; @@ -807,6 +820,8 @@ void acpiphp_check_host_bridge(acpi_hand } } +static int disable_and_eject_slot(struct acpiphp_slot *slot); + static void hotplug_event(acpi_handle handle, u32 type, void *data) { struct acpiphp_context *context = data; @@ -866,7 +881,7 @@ static void hotplug_event(acpi_handle ha case ACPI_NOTIFY_EJECT_REQUEST: /* request device eject */ pr_debug("%s: Device eject notify on %s\n", __func__, objname); - acpiphp_disable_and_eject_slot(func->slot); + disable_and_eject_slot(func->slot); break; } @@ -878,14 +893,19 @@ static void hotplug_event_work(void *dat { struct acpiphp_context *context = data; acpi_handle handle = context->handle; + struct acpiphp_bridge *bridge = context->func.parent; acpi_scan_lock_acquire(); + lock_pci_remove_rescan(); - hotplug_event(handle, type, context); + /* Bail out if the parent bridge is going away. */ + if (!bridge->is_going_away) + hotplug_event(handle, type, context); + unlock_pci_remove_rescan(); acpi_scan_lock_release(); acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL); - put_bridge(context->func.parent); + put_bridge(bridge); } /** @@ -1050,20 +1070,27 @@ void acpiphp_remove_slots(struct pci_bus */ int acpiphp_enable_slot(struct acpiphp_slot *slot) { - mutex_lock(&slot->crit_sect); - /* configure all functions */ - if (!(slot->flags & SLOT_ENABLED)) - enable_slot(slot); + struct acpiphp_func *func; + int ret = -ENODEV; - mutex_unlock(&slot->crit_sect); - return 0; + lock_pci_remove_rescan(); + + func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling); + if (!func->parent->is_going_away) { + mutex_lock(&slot->crit_sect); + /* configure all functions */ + if (!(slot->flags & SLOT_ENABLED)) + enable_slot(slot); + + mutex_unlock(&slot->crit_sect); + ret = 0; + } + + unlock_pci_remove_rescan(); + return ret; } -/** - * acpiphp_disable_and_eject_slot - power off and eject slot - * @slot: ACPI PHP slot - */ -int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot) +static int disable_and_eject_slot(struct acpiphp_slot *slot) { struct acpiphp_func *func; int retval = 0; @@ -1087,6 +1114,25 @@ int acpiphp_disable_and_eject_slot(struc return retval; } +/** + * acpiphp_disable_and_eject_slot - power off and eject slot. + * @slot: ACPIPHP slot. + */ +int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot) +{ + struct acpiphp_func *func; + int ret = -ENODEV; + + lock_pci_remove_rescan(); + + func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling); + if (!func->parent->is_going_away) + ret = disable_and_eject_slot(slot); + + unlock_pci_remove_rescan(); + return ret; +} + /* * slot enabled: 1 Index: linux-pm/drivers/acpi/pci_root.c =================================================================== --- linux-pm.orig/drivers/acpi/pci_root.c +++ linux-pm/drivers/acpi/pci_root.c @@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct { struct acpi_pci_root *root = acpi_driver_data(device); + lock_pci_remove_rescan(); + pci_stop_root_bus(root->bus); device_set_run_wake(root->bus->bridge, false); @@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct pci_remove_root_bus(root->bus); + unlock_pci_remove_rescan(); + kfree(root); }