Message ID | 20161031131958.mrrez5t7sow75p6v@linutronix.de (mailing list archive) |
---|---|
State | New, archived |
> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
> because it maps to preempt_disable() in -RT, which I can't have at this
> point. So I took a look at the code.
> The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
> raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
> lockdep complained. The whole seqcount thing was introduced in commit
> c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
> being recovered").
> Trond Myklebust confirms that there is only one writer thread at
> nfs4_reclaim_open_state().
> 
> The list_for_each_entry() in nfs4_reclaim_open_state:
> It seems that this lock protects the ->so_states list among other
> atomic_t & flags members. So at the beginning of the loop we increment
> ->count, ensuring that the entry is not removed while we use it. So we
> drop the ->so_lock lock during the loop, it seems. And after the
> nfs4_reclaim_locks() invocation we call nfs4_put_open_state() and grab
> the ->so_lock again. So if we were the last user of this struct and we
> remove it, then the following list_next_entry() invocation is a
> use-after-free. Even if we use list_for_each_entry_safe() there is no
> guarantee that the following member is still valid, because it might
> have been removed by someone invoking nfs4_put_open_state() on it,
> right?
> So there is this.
> 
> However, to address my initial problem, I have here a patch :) It uses
> an rw_semaphore, which ensures that there is only one writer at a time
> or multiple readers. So it should be basically what is happening now,
> plus a tiny tiny tiny lock, plus lockdep coverage. I tried to trigger
> this myself but I don't manage to get into this code path at all, so I
> might be doing something wrong.
> 
> Could you please check whether this patch works for you, and whether my
> list_for_each_entry() observation is correct or not?
> 
> v2…v3: replace the seqlock with a RW semaphore.
> 

NACK. That will deadlock. The reason why we use a seqlock there is
precisely because we cannot allow ordinary RPC calls to lock out the
recovery thread. If the server reboots, then ordinary RPC calls will
fail until the recovery thread has had a chance to re-establish the
state.
On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
> > On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The list_for_each_entry() in nfs4_reclaim_open_state:
> > It seems that this lock protects the ->so_states list among other
> > atomic_t & flags members. So at the beginning of the loop we increment
> > ->count, ensuring that the entry is not removed while we use it. So we
> > drop the ->so_lock lock during the loop, it seems. And after the
> > nfs4_reclaim_locks() invocation we call nfs4_put_open_state() and grab
> > the ->so_lock again. So if we were the last user of this struct and we
> > remove it, then the following list_next_entry() invocation is a
> > use-after-free. Even if we use list_for_each_entry_safe() there is no
> > guarantee that the following member is still valid, because it might
> > have been removed by someone invoking nfs4_put_open_state() on it,
> > right?
> > So there is this.
> > 
> > However, to address my initial problem, I have here a patch :) It uses
> > an rw_semaphore, which ensures that there is only one writer at a time
> > or multiple readers. So it should be basically what is happening now,
> > plus a tiny tiny tiny lock, plus lockdep coverage. I tried to trigger
> > this myself but I don't manage to get into this code path at all, so I
> > might be doing something wrong.
> > 
> > Could you please check whether this patch works for you, and whether my
> > list_for_each_entry() observation is correct or not?
> > 
> > v2…v3: replace the seqlock with a RW semaphore.
> > 
> 
> NACK. That will deadlock. The reason why we use a seqlock there is
> precisely because we cannot allow ordinary RPC calls to lock out the
> recovery thread.

Hmmm. So this is getting invoked if I reboot the server? Is a restart of
nfs-kernel-server the same thing?
Is the list_for_each_entry() observation I made correct?

> If the server reboots, then ordinary RPC calls will fail until the
> recovery thread has had a chance to re-establish the state.

This means that the ordinary RPC call won't return with an error, but
will instead wait, with the lock held, until the recovery thread has
done its thing?

Sebastian
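Concretely, the pattern this question is about is a lock-dropping list
walk of roughly the following shape. This is a generic sketch of the
concern, not the actual NFS code; do_recovery() is a placeholder for
the real recovery call:

	spin_lock(&sp->so_lock);
	list_for_each_entry(state, &sp->so_states, open_states) {
		atomic_inc(&state->count);	/* pin the current entry */
		spin_unlock(&sp->so_lock);	/* recovery may sleep */

		do_recovery(state);		/* list may change meanwhile */

		nfs4_put_open_state(state);	/* may free 'state' (last ref) */
		spin_lock(&sp->so_lock);
		/*
		 * list_for_each_entry() now advances via state->open_states.next:
		 * if 'state' was freed above, or its successor was unlinked
		 * while the lock was dropped, this is a use-after-free.
		 */
	}
	spin_unlock(&sp->so_lock);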
> On Oct 31, 2016, at 11:56, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
>>> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>>> The list_for_each_entry() in nfs4_reclaim_open_state:
>>> It seems that this lock protects the ->so_states list among other
>>> atomic_t & flags members. So at the beginning of the loop we increment
>>> ->count, ensuring that the entry is not removed while we use it. So we
>>> drop the ->so_lock lock during the loop, it seems. And after the
>>> nfs4_reclaim_locks() invocation we call nfs4_put_open_state() and grab
>>> the ->so_lock again. So if we were the last user of this struct and we
>>> remove it, then the following list_next_entry() invocation is a
>>> use-after-free. Even if we use list_for_each_entry_safe() there is no
>>> guarantee that the following member is still valid, because it might
>>> have been removed by someone invoking nfs4_put_open_state() on it,
>>> right?
>>> So there is this.
>>> 
>>> However, to address my initial problem, I have here a patch :) It uses
>>> an rw_semaphore, which ensures that there is only one writer at a time
>>> or multiple readers. So it should be basically what is happening now,
>>> plus a tiny tiny tiny lock, plus lockdep coverage. I tried to trigger
>>> this myself but I don't manage to get into this code path at all, so I
>>> might be doing something wrong.
>>> 
>>> Could you please check whether this patch works for you, and whether my
>>> list_for_each_entry() observation is correct or not?
>>> 
>>> v2…v3: replace the seqlock with a RW semaphore.
>>> 
>> 
>> NACK. That will deadlock. The reason why we use a seqlock there is
>> precisely because we cannot allow ordinary RPC calls to lock out the
>> recovery thread.
> Hmmm. So this is getting invoked if I reboot the server? Is a restart of
> nfs-kernel-server the same thing?
> Is the list_for_each_entry() observation I made correct?

Yes, and yes. We can’t rely on the list pointers remaining correct, so we
restart the list scan, and we use the ops->state_flag_bit to signal
whether or not state has been recovered for the entry being scanned.

> 
>> If the server reboots, then ordinary RPC calls will fail until the
>> recovery thread has had a chance to re-establish the state.
> 
> This means that the ordinary RPC call won't return with an error, but
> will instead wait, with the lock held, until the recovery thread has
> done its thing?

It uses the seqcount_t to figure out if a recovery occurred while an OPEN
or CLOSE was being processed. If so, it schedules a new recovery of the
stateids in question.
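The retry logic Trond refers to is visible in the hunks the patch below
removes; abridged from _nfs4_open_and_get_state() in mainline, with the
error handling omitted:

	unsigned int seq;

	seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);

	ret = _nfs4_proc_open(opendata);
	/* ... */
	if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
		/* a recovery ran while the OPEN was in flight: recover again */
		nfs4_schedule_stateid_recovery(server, state);

The OPEN itself never blocks on the recovery thread; it only checks
afterwards whether a recovery happened underneath it.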
On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
> 
> Yes, and yes. We can’t rely on the list pointers remaining correct, so we
> restart the list scan, and we use the ops->state_flag_bit to signal
> whether or not state has been recovered for the entry being scanned.

but that bit is tested at the top of the loop body, and to get there the
iterator has already dereferenced the list's ->next pointer, which might
be invalid.

Sebastian
> On Nov 2, 2016, at 13:11, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>> 
>> Yes, and yes. We can’t rely on the list pointers remaining correct, so we
>> restart the list scan, and we use the ops->state_flag_bit to signal
>> whether or not state has been recovered for the entry being scanned.
> 
> but that bit is tested at the top of the loop body, and to get there the
> iterator has already dereferenced the list's ->next pointer, which might
> be invalid.
> 

No. We ensure we restart the list scan if we release the spinlock. It’s safe…
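For reference, the restart idiom Trond describes, abridged from
nfs4_reclaim_open_state() in mainline (recovery details and error paths
omitted):

	spin_lock(&sp->so_lock);
restart:
	list_for_each_entry(state, &sp->so_states, open_states) {
		if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
			continue;	/* already recovered, skip */
		atomic_inc(&state->count);
		spin_unlock(&sp->so_lock);

		/* ... recover open and lock state; may sleep ... */

		nfs4_put_open_state(state);
		spin_lock(&sp->so_lock);
		goto restart;	/* never advance the iterator after dropping the lock */
	}
	spin_unlock(&sp->so_lock);

Because the scan restarts from the list head every time the lock has been
dropped, the iterator never dereferences a pointer that could have gone
stale, and the state_flag_bit test keeps the restarted scan from
recovering the same entry twice.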
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..55d5531aedbb 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -130,7 +130,6 @@ static int nfs_delegation_claim_opens(struct inode *inode,
 	struct nfs_open_context *ctx;
 	struct nfs4_state_owner *sp;
 	struct nfs4_state *state;
-	unsigned int seq;
 	int err;
 
 again:
@@ -150,12 +149,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
 		sp = state->owner;
 		/* Block nfs4_proc_unlck */
 		mutex_lock(&sp->so_delegreturn_mutex);
-		seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+		down_read(&sp->so_reclaim_rw);
 		err = nfs4_open_delegation_recall(ctx, state, stateid, type);
 		if (!err)
 			err = nfs_delegation_claim_locks(ctx, state, stateid);
-		if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
-			err = -EAGAIN;
+		up_read(&sp->so_reclaim_rw);
 		mutex_unlock(&sp->so_delegreturn_mutex);
 		put_nfs_open_context(ctx);
 		if (err != 0)
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..03d37826a183 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
 	unsigned long	     so_flags;
 	struct list_head     so_states;
 	struct nfs_seqid_counter so_seqid;
-	seqcount_t	     so_reclaim_seqcount;
+	struct rw_semaphore  so_reclaim_rw;
 	struct mutex	     so_delegreturn_mutex;
 };
 
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..ba6589d1fd36 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2682,10 +2682,9 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 	struct nfs_server *server = sp->so_server;
 	struct dentry *dentry;
 	struct nfs4_state *state;
-	unsigned int seq;
 	int ret;
 
-	seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+	down_read(&sp->so_reclaim_rw);
 
 	ret = _nfs4_proc_open(opendata);
 	if (ret != 0)
@@ -2721,12 +2720,10 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 		goto out;
 
 	ctx->state = state;
-	if (d_inode(dentry) == state->inode) {
+	if (d_inode(dentry) == state->inode)
 		nfs_inode_attach_open_context(ctx);
-		if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
-			nfs4_schedule_stateid_recovery(server, state);
-	}
 out:
+	up_read(&sp->so_reclaim_rw);
 	return ret;
 }
 
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..61c431fa2fe3 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
 	nfs4_init_seqid_counter(&sp->so_seqid);
 	atomic_set(&sp->so_count, 1);
 	INIT_LIST_HEAD(&sp->so_lru);
-	seqcount_init(&sp->so_reclaim_seqcount);
+	init_rwsem(&sp->so_reclaim_rw);
 	mutex_init(&sp->so_delegreturn_mutex);
 	return sp;
 }
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
 	 * recovering after a network partition or a reboot from a
 	 * server that doesn't support a grace period.
 	 */
+	down_write(&sp->so_reclaim_rw);
 	spin_lock(&sp->so_lock);
-	raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
 restart:
 	list_for_each_entry(state, &sp->so_states, open_states) {
 		if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
 		spin_lock(&sp->so_lock);
 		goto restart;
 	}
-	raw_write_seqcount_end(&sp->so_reclaim_seqcount);
 	spin_unlock(&sp->so_lock);
+	up_write(&sp->so_reclaim_rw);
 	return 0;
 out_err:
 	nfs4_put_open_state(state);
-	spin_lock(&sp->so_lock);
-	raw_write_seqcount_end(&sp->so_reclaim_seqcount);
-	spin_unlock(&sp->so_lock);
+	up_write(&sp->so_reclaim_rw);
 	return status;
 }
The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT, which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
Trond Myklebust confirms that there is only one writer thread at
nfs4_reclaim_open_state().

The list_for_each_entry() in nfs4_reclaim_open_state:
It seems that this lock protects the ->so_states list among other
atomic_t & flags members. So at the beginning of the loop we increment
->count, ensuring that the entry is not removed while we use it. So we
drop the ->so_lock lock during the loop, it seems. And after the
nfs4_reclaim_locks() invocation we call nfs4_put_open_state() and grab
the ->so_lock again. So if we were the last user of this struct and we
remove it, then the following list_next_entry() invocation is a
use-after-free. Even if we use list_for_each_entry_safe() there is no
guarantee that the following member is still valid, because it might
have been removed by someone invoking nfs4_put_open_state() on it,
right?
So there is this.

However, to address my initial problem, I have here a patch :) It uses
an rw_semaphore, which ensures that there is only one writer at a time
or multiple readers. So it should be basically what is happening now,
plus a tiny tiny tiny lock, plus lockdep coverage. I tried to trigger
this myself but I don't manage to get into this code path at all, so I
might be doing something wrong.

Could you please check whether this patch works for you, and whether my
list_for_each_entry() observation is correct or not?

v2…v3: replace the seqlock with a RW semaphore.

v1…v2: write_seqlock() disables preemption, and some functions called
in the write section need preemption enabled (thread_run(),
non-GFP_ATOMIC memory allocation). But we also don't want preemption
enabled there, because a preempted writer would leave the readers
spinning. This is a duct-tape mutex. Maybe the seqlock should go.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/nfs/delegation.c |  6 ++----
 fs/nfs/nfs4_fs.h    |  2 +-
 fs/nfs/nfs4proc.c   |  9 +++------
 fs/nfs/nfs4state.c  | 10 ++++------
 4 files changed, 10 insertions(+), 17 deletions(-)
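The v1…v2 note is about the usual seqlock reader side: readers loop
until they see an even, unchanged sequence count, so a writer that
sleeps or is preempted inside the write section leaves every reader
spinning. A minimal sketch of that reader loop, where sc stands in for
a seqcount_t such as so_reclaim_seqcount:

	unsigned int seq;

	do {
		seq = read_seqcount_begin(&sc);	/* spins while a write is in progress */
		/* ... read the protected state ... */
	} while (read_seqcount_retry(&sc, seq));	/* retry if a writer intervened */

This is why the write side normally runs with preemption disabled, and
why it cannot call anything that sleeps, which is exactly the corner
that -RT gets stuck in.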