Message ID | 1431253604-9214-2-git-send-email-haggaie@mellanox.com (mailing list archive) |
---|---|
State | Changes Requested |
On Sun, May 10, 2015 at 01:26:32PM +0300, Haggai Eran wrote:
> Currently the RDMA subsystem's device list and client list are protected by
> a single mutex. This prevents adding user-facing APIs that iterate these
> lists, since using them may cause a deadlock. The patch attempts to solve
> this problem by adding an SRCU to protect the lists. Readers now don't need
> the mutex, and are safe just by using srcu_read_lock/unlock.
>
> The ib_register_device, ib_register_client, and ib_unregister_client
> functions are modified to only lock the device_mutex during their
> respective list modification, and use the SRCU for iteration on the other
> list. In ib_unregister_device, the client list iteration remains in the
> mutex critical section as it is done in reverse order.
>
> This patch attempts to solve a similar need [1] that was seen in the RoCE
> v2 patch series.
>
> [1] http://www.spinics.net/lists/linux-rdma/msg24733.html

So at first blush this looked reasonable, but no, it is racy:

ib_register_client:
    mutex_lock(&device_mutex);
    list_add_tail_rcu(&client->list, &client_list);
    mutex_unlock(&device_mutex);

    id = srcu_read_lock(&device_srcu);
    list_for_each_entry_rcu(device, &device_list, core_list)
        client->add(device);

ib_register_device:
    mutex_lock(&device_mutex);
    list_add_tail(&device->core_list, &device_list);
    mutex_unlock(&device_mutex);

    id = srcu_read_lock(&device_srcu);
    list_for_each_entry_rcu(client, &client_list, list)
        client->add(device);

So, if two threads call the two registers then the new client will
have its add called twice on the same device.

There are similar problems with removal.

I'm not sure RCU is the right way to approach this. The driver core
has the same basic task to perform, maybe review its locking
arrangement between the device list and driver list.

[Actually we probably should be using the driver core here, with IB
clients as device drivers, but that is way beyond the scope of this..]

I had a bunch of other comments, but they are not relevant,
considering the above.

Jason
On 11/05/2015 21:18, Jason Gunthorpe wrote:
> So at first blush this looked reasonable, but no, it is racy:
>
> ib_register_client:
>     mutex_lock(&device_mutex);
>     list_add_tail_rcu(&client->list, &client_list);
>     mutex_unlock(&device_mutex);
>
>     id = srcu_read_lock(&device_srcu);
>     list_for_each_entry_rcu(device, &device_list, core_list)
>         client->add(device);
>
> ib_register_device:
>     mutex_lock(&device_mutex);
>     list_add_tail(&device->core_list, &device_list);
>     mutex_unlock(&device_mutex);
>
>     id = srcu_read_lock(&device_srcu);
>     list_for_each_entry_rcu(client, &client_list, list)
>         client->add(device);
>
> So, if two threads call the two registers then the new client will
> have its add called twice on the same device.
>
> There are similar problems with removal.

Right. Sorry I missed that.

Our first draft just kept the addition and removal of elements to the
device or client list under the mutex, thinking that only the new code
in this patchset that does traversal of the lists would use the SRCU
read lock. We then changed it thinking that it would be better to make
some use of this SRCU in this patch as well.

> I'm not sure RCU is the right way to approach this. The driver core
> has the same basic task to perform, maybe review its locking
> arrangement between the device list and driver list.
>
> [Actually we probably should be using the driver core here, with IB
> clients as device drivers, but that is way beyond the scope of this..]

So, I'm not very familiar with that code, but it seems that the main
difference is that in the core a single driver can be attached to a
device. The device_add function calls bus_add_device to add it to the
bus's list, and later calls bus_probe_device to find a driver, without
any lock between these calls. And bus_add_driver adds a new driver to
the bus's driver list and then calls driver_attach without any lock
between them. The only thing I see that makes sure a driver won't be
attached twice to a device is the dev->driver field, which is updated
under the device lock during the probing process.

I guess a similar thing we can do is to rely on the context we
associate with a pair of a client and a device. If such a context
exists, we don't need to call client->add again. What do you think?

Haggai
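[Editor's note: as a rough sketch of the idea Haggai floats here -- not code from this patch -- a registration path could skip clients that already have a context for the device. The helper name have_client_context() is invented; the fields it touches (ib_client_data, client_data_list, client_data_lock) are the ones already used in device.c. Note that the check and the subsequent add would still have to be made atomic with respect to each other (e.g. under the registration mutex), otherwise the same race simply reappears.]

```c
/* Hypothetical helper: does @device already have a context for @client? */
static bool have_client_context(struct ib_device *device,
				struct ib_client *client)
{
	struct ib_client_data *context;
	unsigned long flags;
	bool found = false;

	spin_lock_irqsave(&device->client_data_lock, flags);
	list_for_each_entry(context, &device->client_data_list, list)
		if (context->client == client) {
			found = true;
			break;
		}
	spin_unlock_irqrestore(&device->client_data_lock, flags);

	return found;
}

/* At each registration site, the add would then become idempotent: */
	if (client->add && !have_client_context(device, client) &&
	    !add_client_context(device, client))
		client->add(device);
```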
On Tue, May 12, 2015 at 09:07:51AM +0300, Haggai Eran wrote:
> > I'm not sure RCU is the right way to approach this. The driver core
> > has the same basic task to perform, maybe review its locking
> > arrangement between the device list and driver list.
> >
> > [Actually we probably should be using the driver core here, with IB
> > clients as device drivers, but that is way beyond the scope of this..]
>
> So, I'm not very familiar with that code, but it seems that the main
> difference is that in the core a single driver can be attached to a
> device.

Roughly, a bus (IB would be a bus) has devices attached to it, and
devices have drivers attached to them. Bus:Device is 1:N,
Device:Driver is 1:1.

There are a couple of reasons to be interested in re-using the driver
core; perhaps the best is that it has all the infrastructure to let us
auto-load client modules...

> I guess a similar thing we can do is to rely on the context we
> associate with a pair of a client and a device. If such a context
> exists, we don't need to call client->add again. What do you think?

I didn't look closely, isn't this enough?

device_register:
    mutex_lock(client_mutex);
    down_write(devices_rwsem);
    list_add(device_list,dev,..);
    up_write(devices_rwsem);

    /* Caller must prevent device_register/unregister concurrency on the
       same dev */

    foreach(client_list)
        .. client->add(dev,...) ..

    mutex_unlock(client_mutex)

client_register:
    mutex_lock(client_mutex)
    list_add(client_list,new_client..)
    down_read(devices_rwsem);
    foreach(device_list)
        .. client->add(dev,new_client,..) ..
    up_read(devices_rwsem);
    mutex_unlock(client_mutex)

[Note, I didn't check this carefully, just intuitively seems like a
sane starting point]

Jason
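[Editor's note: rendered with real kernel primitives, the sketch above would look roughly like the following. This carries the same caveat as the sketch itself, is stripped down to just the list/locking flow (the real ib_register_device() takes a port_callback argument and does sysfs setup that is omitted here), and uses the hypothetical client_mutex/devices_rwsem names, which are not symbols in device.c. Haggai points out a remaining problem with this arrangement in the next message.]

```c
/* Names and shapes follow the sketch above, not the real device.c API. */
static DEFINE_MUTEX(client_mutex);   /* serializes registration and add()/remove() callbacks */
static DECLARE_RWSEM(devices_rwsem); /* protects device_list */

static int device_register(struct ib_device *device)
{
	struct ib_client *client;

	mutex_lock(&client_mutex);

	down_write(&devices_rwsem);
	list_add_tail(&device->core_list, &device_list);
	up_write(&devices_rwsem);

	/* Caller must prevent register/unregister concurrency on this dev */
	list_for_each_entry(client, &client_list, list)
		if (client->add)
			client->add(device);

	mutex_unlock(&client_mutex);
	return 0;
}

static int client_register(struct ib_client *client)
{
	struct ib_device *device;

	mutex_lock(&client_mutex);
	list_add_tail(&client->list, &client_list);

	down_read(&devices_rwsem);
	list_for_each_entry(device, &device_list, core_list)
		if (client->add)
			client->add(device);
	up_read(&devices_rwsem);

	mutex_unlock(&client_mutex);
	return 0;
}
```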
On 12/05/2015 21:23, Jason Gunthorpe wrote:
> On Tue, May 12, 2015 at 09:07:51AM +0300, Haggai Eran wrote:
>>> I'm not sure RCU is the right way to approach this. The driver core
>>> has the same basic task to perform, maybe review its locking
>>> arrangement between the device list and driver list.
>>>
>>> [Actually we probably should be using the driver core here, with IB
>>> clients as device drivers, but that is way beyond the scope of this..]
>>
>> So, I'm not very familiar with that code, but it seems that the main
>> difference is that in the core a single driver can be attached to a
>> device.
>
> Roughly, a bus (IB would be a bus) has devices attached to it, and
> devices have drivers attached to them. Bus:Device is 1:N,
> Device:Driver is 1:1.
>
> There are a couple of reasons to be interested in re-using the driver
> core; perhaps the best is that it has all the infrastructure to let us
> auto-load client modules...

So you want an ib_core bus and each client is a "device" or a driver on
that bus? But this doesn't get you the relation between clients and ib
devices.

>> I guess a similar thing we can do is to rely on the context we
>> associate with a pair of a client and a device. If such a context
>> exists, we don't need to call client->add again. What do you think?
>
> I didn't look closely, isn't this enough?
>
> device_register:
>     mutex_lock(client_mutex);
>     down_write(devices_rwsem);
>     list_add(device_list,dev,..);
>     up_write(devices_rwsem);
>
>     /* Caller must prevent device_register/unregister concurrency on the
>        same dev */
>
>     foreach(client_list)
>         .. client->add(dev,...) ..
>
>     mutex_unlock(client_mutex)
>
> client_register:
>     mutex_lock(client_mutex)
>     list_add(client_list,new_client..)
>     down_read(devices_rwsem);
>     foreach(device_list)
>         .. client->add(dev,new_client,..) ..
>     up_read(devices_rwsem);
>     mutex_unlock(client_mutex)
>
> [Note, I didn't check this carefully, just intuitively seems like a
> sane starting point]

That could perhaps work for the RoCEv2 patch-set, as their deadlock
relates to iterating the device list. This patch set, however, does an
iteration on the client list (patch 3). Because a client remove could
block on this iteration, you can still get a deadlock.

I think I prefer keeping the single device_mutex and the SRCU, but
keeping the device_mutex locked as it is today, covering all of the
registration and unregistration code. Only the new code that reads the
client list or the device list without modification to the other list
will use the SRCU read lock.

Haggai
On Wed, May 13, 2015 at 11:10:15AM +0300, Haggai Eran wrote:
> >> I guess a similar thing we can do is to rely on the context we
> >> associate with a pair of a client and a device. If such a context
> >> exists, we don't need to call client->add again. What do you think?
> >
> > I didn't look closely, isn't this enough?
> >
> > device_register:
> >     mutex_lock(client_mutex);
> >     down_write(devices_rwsem);
> >     list_add(device_list,dev,..);
> >     up_write(devices_rwsem);
> >
> >     /* Caller must prevent device_register/unregister concurrency on the
> >        same dev */
> >
> >     foreach(client_list)
> >         .. client->add(dev,...) ..
> >
> >     mutex_unlock(client_mutex)
> >
> > client_register:
> >     mutex_lock(client_mutex)
> >     list_add(client_list,new_client..)
> >     down_read(devices_rwsem);
> >     foreach(device_list)
> >         .. client->add(dev,new_client,..) ..
> >     up_read(devices_rwsem);
> >     mutex_unlock(client_mutex)
> >
> > [Note, I didn't check this carefully, just intuitively seems like a
> > sane starting point]
>
> That could perhaps work for the RoCEv2 patch-set, as their deadlock
> relates to iterating the device list. This patch set, however, does an
> iteration on the client list (patch 3). Because a client remove could
> block on this iteration, you can still get a deadlock.

Really? Gross.

Still, I think you got it right in your analysis, but we don't need RCU:

device_register:
    mutex_lock(modify_mutex);
    down_write(lists_rwsem);
    list_add(device_list,dev,..);
    up_write(lists_rwsem);

    // implied by modify_mutex: down_read(lists_rwsem)
    foreach(client_list)
        .. client->add(dev,...) ..
    mutex_unlock(modify_mutex)

client_register:
    mutex_lock(modify_mutex);
    // implied by modify_mutex: down_read(lists_rwsem)
    foreach(device_list)
        .. client->add(dev,new_client...) ..

    down_write(lists_rwsem);
    list_add(client_list,new_client..);
    up_write(lists_rwsem);

    mutex_unlock(modify_mutex)

client_unregister:
    mutex_lock(modify_mutex);
    down_write(lists_rwsem);
    list_cut(client_list,..rm_client..);
    up_write(lists_rwsem);

    // implied by modify_mutex: down_read(lists_rwsem)
    foreach(device_list)
        .. client->remove(dev,rm_client...) ..

    mutex_unlock(modify_mutex)

etc. Notice the ordering.

> I think I prefer keeping the single device_mutex and the SRCU, but
> keeping the device_mutex locked as it is today, covering all of the
> registration and unregistration code. Only the new code that reads the
> client list or the device list without modification to the other list
> will use the SRCU read lock.

In this case, I don't see a justification to use RCU; we'd need a high
read load before optimizing the rwsem into RCU would make sense. I'm
not sure you should ever use RCU until you've designed the locking
using traditional means.

Jason
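[Editor's note: the practical payoff of this scheme is on the read side -- code that only needs to walk the lists takes lists_rwsem for read and never touches modify_mutex. A minimal sketch of such a reader follows; lists_rwsem is the hypothetical lock name from the scheme above and for_each_ib_device() is an invented helper, not an existing API.]

```c
/* lists_rwsem is the hypothetical lock from the sketch above. */
static DECLARE_RWSEM(lists_rwsem);

/*
 * Invented helper showing the read side: walk the device list without
 * taking modify_mutex, so the walk is not serialized against long-running
 * client->add()/remove() callbacks, only against the brief list updates.
 */
static void for_each_ib_device(void (*fn)(struct ib_device *device, void *arg),
			       void *arg)
{
	struct ib_device *device;

	down_read(&lists_rwsem);
	list_for_each_entry(device, &device_list, core_list)
		fn(device, arg);
	up_read(&lists_rwsem);
}
```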
On 13/05/2015 18:57, Jason Gunthorpe wrote:
> On Wed, May 13, 2015 at 11:10:15AM +0300, Haggai Eran wrote:
>
>>>> I guess a similar thing we can do is to rely on the context we
>>>> associate with a pair of a client and a device. If such a context
>>>> exists, we don't need to call client->add again. What do you think?
>>>
>>> I didn't look closely, isn't this enough?
>>>
>>> device_register:
>>>     mutex_lock(client_mutex);
>>>     down_write(devices_rwsem);
>>>     list_add(device_list,dev,..);
>>>     up_write(devices_rwsem);
>>>
>>>     /* Caller must prevent device_register/unregister concurrency on the
>>>        same dev */
>>>
>>>     foreach(client_list)
>>>         .. client->add(dev,...) ..
>>>
>>>     mutex_unlock(client_mutex)
>>>
>>> client_register:
>>>     mutex_lock(client_mutex)
>>>     list_add(client_list,new_client..)
>>>     down_read(devices_rwsem);
>>>     foreach(device_list)
>>>         .. client->add(dev,new_client,..) ..
>>>     up_read(devices_rwsem);
>>>     mutex_unlock(client_mutex)
>>>
>>> [Note, I didn't check this carefully, just intuitively seems like a
>>> sane starting point]
>>
>> That could perhaps work for the RoCEv2 patch-set, as their deadlock
>> relates to iterating the device list. This patch set, however, does an
>> iteration on the client list (patch 3). Because a client remove could
>> block on this iteration, you can still get a deadlock.
>
> Really? Gross.
>
> Still, I think you got it right in your analysis, but we don't need RCU:
>
> device_register:
>     mutex_lock(modify_mutex);
>     down_write(lists_rwsem);
>     list_add(device_list,dev,..);
>     up_write(lists_rwsem);
>
>     // implied by modify_mutex: down_read(lists_rwsem)
>     foreach(client_list)
>         .. client->add(dev,...) ..
>     mutex_unlock(modify_mutex)
>
> client_register:
>     mutex_lock(modify_mutex);
>     // implied by modify_mutex: down_read(lists_rwsem)
>     foreach(device_list)
>         .. client->add(dev,new_client...) ..
>
>     down_write(lists_rwsem);
>     list_add(client_list,new_client..);
>     up_write(lists_rwsem);
>
>     mutex_unlock(modify_mutex)
>
> client_unregister:
>     mutex_lock(modify_mutex);
>     down_write(lists_rwsem);
>     list_cut(client_list,..rm_client..);
>     up_write(lists_rwsem);
>
>     // implied by modify_mutex: down_read(lists_rwsem)
>     foreach(device_list)
>         .. client->remove(dev,rm_client...) ..
>
>     mutex_unlock(modify_mutex)
>
> etc. Notice the ordering.

Looks good. I'll use it in the next version of the patch.
```diff
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b360350a0b20..7d90b2ca2eba 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -58,12 +58,11 @@ EXPORT_SYMBOL_GPL(ib_wq);
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
+/* device_srcu protects access to both device_list and client_list. */
+static struct srcu_struct device_srcu;
+
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list. In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex protects writer access to both device_list and client_list.
  */
 static DEFINE_MUTEX(device_mutex);
 
@@ -276,6 +275,7 @@ int ib_register_device(struct ib_device *device,
 						 u8, struct kobject *))
 {
 	int ret;
+	int id;
 
 	mutex_lock(&device_mutex);
 
@@ -315,13 +315,19 @@ int ib_register_device(struct ib_device *device,
 	device->reg_state = IB_DEV_REGISTERED;
 
+	mutex_unlock(&device_mutex);
+
+	id = srcu_read_lock(&device_srcu);
 	{
 		struct ib_client *client;
 
-		list_for_each_entry(client, &client_list, list)
+		list_for_each_entry_rcu(client, &client_list, list)
 			if (client->add && !add_client_context(device, client))
 				client->add(device);
 	}
+	srcu_read_unlock(&device_srcu, id);
+
+	return 0;
 
 out:
 	mutex_unlock(&device_mutex);
@@ -338,6 +344,7 @@ EXPORT_SYMBOL(ib_register_device);
 void ib_unregister_device(struct ib_device *device)
 {
 	struct ib_client *client;
+	LIST_HEAD(contexts);
 	struct ib_client_data *context, *tmp;
 	unsigned long flags;
 
@@ -347,21 +354,26 @@ void ib_unregister_device(struct ib_device *device)
 		if (client->remove)
 			client->remove(device);
 
-	list_del(&device->core_list);
+	list_del_rcu(&device->core_list);
+
+	mutex_unlock(&device_mutex);
+
+	synchronize_srcu(&device_srcu);
 
 	kfree(device->gid_tbl_len);
 	kfree(device->pkey_tbl_len);
 
-	mutex_unlock(&device_mutex);
-
 	ib_device_unregister_sysfs(device);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
-	list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
-		kfree(context);
+	list_cut_position(&contexts, &device->client_data_list,
+			  device->client_data_list.prev);
 	spin_unlock_irqrestore(&device->client_data_lock, flags);
 
 	device->reg_state = IB_DEV_UNREGISTERED;
+
+	list_for_each_entry_safe(context, tmp, &contexts, list)
+		kfree(context);
 }
 EXPORT_SYMBOL(ib_unregister_device);
 
@@ -381,15 +393,19 @@ EXPORT_SYMBOL(ib_unregister_device);
 int ib_register_client(struct ib_client *client)
 {
 	struct ib_device *device;
+	int id;
 
 	mutex_lock(&device_mutex);
+	list_add_tail_rcu(&client->list, &client_list);
+	mutex_unlock(&device_mutex);
 
-	list_add_tail(&client->list, &client_list);
-	list_for_each_entry(device, &device_list, core_list)
+	id = srcu_read_lock(&device_srcu);
+
+	list_for_each_entry_rcu(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
 
-	mutex_unlock(&device_mutex);
+	srcu_read_unlock(&device_srcu, id);
 
 	return 0;
 }
@@ -407,11 +423,13 @@ void ib_unregister_client(struct ib_client *client)
 {
 	struct ib_client_data *context, *tmp;
 	struct ib_device *device;
+	LIST_HEAD(contexts);
 	unsigned long flags;
+	int id;
 
-	mutex_lock(&device_mutex);
+	id = srcu_read_lock(&device_srcu);
 
-	list_for_each_entry(device, &device_list, core_list) {
+	list_for_each_entry_rcu(device, &device_list, core_list) {
 		if (client->remove)
 			client->remove(device);
 
@@ -419,13 +437,21 @@ void ib_unregister_client(struct ib_client *client)
 		list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
 			if (context->client == client) {
 				list_del(&context->list);
-				kfree(context);
+				list_add(&context->list, &contexts);
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
-	list_del(&client->list);
+	srcu_read_unlock(&device_srcu, id);
+
+	mutex_lock(&device_mutex);
+	list_del_rcu(&client->list);
 
 	mutex_unlock(&device_mutex);
+
+	synchronize_srcu(&device_srcu);
+
+	list_for_each_entry_safe(context, tmp, &contexts, list)
+		kfree(context);
 }
 EXPORT_SYMBOL(ib_unregister_client);
 
@@ -738,9 +764,15 @@ static int __init ib_core_init(void)
 {
 	int ret;
 
+	ret = init_srcu_struct(&device_srcu);
+	if (ret) {
+		pr_warn("Couldn't initialize SRCU\n");
+		return ret;
+	}
+
 	ib_wq = alloc_workqueue("infiniband", 0, 0);
 	if (!ib_wq)
-		return -ENOMEM;
+		goto err_srcu;
 
 	ret = ib_sysfs_setup();
 	if (ret) {
@@ -770,6 +802,9 @@ err_sysfs:
 
 err:
 	destroy_workqueue(ib_wq);
+err_srcu:
+	cleanup_srcu_struct(&device_srcu);
+
 	return ret;
 }
 
@@ -780,6 +815,8 @@ static void __exit ib_core_cleanup(void)
 	ib_sysfs_cleanup();
 	/* Make sure that any pending umem accounting work is done. */
 	destroy_workqueue(ib_wq);
+	srcu_barrier(&device_srcu);
+	cleanup_srcu_struct(&device_srcu);
 }
 
 module_init(ib_core_init);
```
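[Editor's note: for context, the read side this patch is meant to enable would look roughly like the following. This is a sketch, not part of the patch; ib_for_each_device() and its callback signature are invented for illustration, while device_srcu, device_list, and core_list come from the diff above.]

```c
/*
 * Sketch of a reader under this patch: walking the device list needs only
 * the SRCU read lock, not device_mutex.
 */
static void ib_for_each_device(void (*fn)(struct ib_device *device, void *arg),
			       void *arg)
{
	struct ib_device *device;
	int id;

	id = srcu_read_lock(&device_srcu);
	list_for_each_entry_rcu(device, &device_list, core_list)
		fn(device, arg);
	srcu_read_unlock(&device_srcu, id);
}
```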
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding an SRCU to protect the lists. Readers now don't need
the mutex, and are safe just by using srcu_read_lock/unlock.

The ib_register_device, ib_register_client, and ib_unregister_client
functions are modified to only lock the device_mutex during their
respective list modification, and use the SRCU for iteration on the other
list. In ib_unregister_device, the client list iteration remains in the
mutex critical section as it is done in reverse order.

This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.

[1] http://www.spinics.net/lists/linux-rdma/msg24733.html

Cc: Matan Barak <matanb@mellanox.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/infiniband/core/device.c | 75 ++++++++++++++++++++++++++++++----------
 1 file changed, 56 insertions(+), 19 deletions(-)