Message ID | 20150701184408.GF17109@ZenIV.linux.org.uk (mailing list archive)
---|---
State | New, archived
On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> Mismatched reply could also be a possibility, but only if we end up with
> sending more than one request with the same tag without waiting for response
> for the first one.

... and I think I see what's going on.  Tags are 16bit.  Suppose the
server stalls for some reason *and* we keep piling the requests up.
New tags keep being grabbed by this:

	tag = P9_NOTAG;
	if (type != P9_TVERSION) {
		tag = p9_idpool_get(c->tagpool);
		if (tag < 0)
			return ERR_PTR(-ENOMEM);
	}

tag is int here.  Then we pass tag to

	req = p9_tag_alloc(c, tag, req_size);

and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
is u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
will happily return values greater than 65535 - it's using idr and it's
used (with different pools) for 16bit tags and 32bit FIDs.

Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
concerned, we'd just sent another request with tag 3.  And on the client
there are two threads waiting for responses on the same p9_req_t.  Both
happen to be TWRITE.  Response to the first request arrives and we happen
to let the second thread go at it first.  Voila - the first request had
been for page-sized write() and got successfully handled.  The _second_ one
had been short and is very surprised to see confirmation of 4Kb worth of
data having been written.

It should be easy to confirm - in p9_client_prepare_req() add

	if (WARN_ON_ONCE(tag != (u16)tag)) {
		p9_idpool_put(tag, c->tagpool);
		return ERR_PTR(-ENOMEM);
	}

right after

	tag = p9_idpool_get(c->tagpool);
	if (tag < 0)
		return ERR_PTR(-ENOMEM);

and see if it triggers.  I'm not sure if failing with ENOMEM is the
right response (another variant is to sleep there until the pile
gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
not for the real work, but it will do for confirming that this is what
we are hitting.
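For illustration, the collision comes purely from the implicit int-to-u16
truncation at the p9_tag_alloc() call; a standalone (non-kernel) sketch of
just that arithmetic:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		int tag = 65539;		/* what p9_idpool_get() can hand back */
		uint16_t wire_tag = tag;	/* what p9_tag_alloc() and req->tc->tag see */

		/* prints "65539 becomes tag 3 on the wire" - same slot as tag 3 */
		printf("%d becomes tag %u on the wire\n", tag, wire_tag);
		return 0;
	}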
[9p and sunrpc folks added to Cc]

On Thu, Jul 02, 2015 at 04:20:42AM +0100, Al Viro wrote:
> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
>
> ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
>
> 	tag = P9_NOTAG;
> 	if (type != P9_TVERSION) {
> 		tag = p9_idpool_get(c->tagpool);
> 		if (tag < 0)
> 			return ERR_PTR(-ENOMEM);
> 	}
>
> tag is int here.  Then we pass tag to
>
> 	req = p9_tag_alloc(c, tag, req_size);
>
> and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> is u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
>
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3.  And on the client
> there are two threads waiting for responses on the same p9_req_t.  Both
> happen to be TWRITE.  Response to the first request arrives and we happen
> to let the second thread go at it first.  Voila - the first request had
> been for page-sized write() and got successfully handled.  The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
>
> It should be easy to confirm - in p9_client_prepare_req() add
>
> 	if (WARN_ON_ONCE(tag != (u16)tag)) {
> 		p9_idpool_put(tag, c->tagpool);
> 		return ERR_PTR(-ENOMEM);
> 	}
>
> right after
>
> 	tag = p9_idpool_get(c->tagpool);
> 	if (tag < 0)
> 		return ERR_PTR(-ENOMEM);
>
> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.

FWIW, we probably would be better off with throttling rather than ENOMEM
in such situations.  I'm not familiar with sunrpc enough to be sure how to
do that the right way (note that the RPC equivalent of 9P tags is 32bit, so
the throttling there is based on memory shortage rather than running out of
XID space), but the interesting issues should be similar - potential
deadlocks in near-OOM situations.  Suggestions?
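A rough sketch of what the throttling variant might look like (untested;
the c->tag_wq wait queue and the p9_tagpool_has_room() helper are invented
for illustration - the point is just to sleep killably instead of returning
ENOMEM once the 16bit tag space is exhausted):

	/* in p9_client_prepare_req(), replacing the plain p9_idpool_get() call;
	 * some completion path (e.g. p9_free_req()) would have to do the
	 * matching wake_up() on the hypothetical c->tag_wq */
	for (;;) {
		tag = p9_idpool_get(c->tagpool);
		if (tag < 0)
			return ERR_PTR(-ENOMEM);
		if (tag == (u16)tag)
			break;			/* got a usable 16bit tag */
		/* 2^16 requests already pending - back off and wait */
		p9_idpool_put(tag, c->tagpool);
		if (wait_event_killable(c->tag_wq, p9_tagpool_has_room(c->tagpool)))
			return ERR_PTR(-ERESTARTSYS);
	}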
[repeating, since my previous email didn't reach the mailing lists]

2015-07-02 7:10 GMT+03:00 Al Viro <viro@zeniv.linux.org.uk>:
>> It should be easy to confirm - in p9_client_prepare_req() add
>> 	if (WARN_ON_ONCE(tag != (u16)tag)) {
>> 		p9_idpool_put(tag, c->tagpool);
>> 		return ERR_PTR(-ENOMEM);
>> 	}
>> right after
>> 	tag = p9_idpool_get(c->tagpool);
>> 	if (tag < 0)
>> 		return ERR_PTR(-ENOMEM);
>>
>> and see if it triggers.  I'm not sure if failing with ENOMEM is the
>> right response (another variant is to sleep there until the pile
>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>> not for the real work, but it will do for confirming that this is what
>> we are hitting.
>

Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
On Thu, Jul 02, 2015 at 10:19:07AM +0300, Andrey Ryabinin wrote:
> On 07/02/2015 07:10 AM, Al Viro wrote:
> >>
> >> It should be easy to confirm - in p9_client_prepare_req() add
> >> 	if (WARN_ON_ONCE(tag != (u16)tag)) {
> >> 		p9_idpool_put(tag, c->tagpool);
> >> 		return ERR_PTR(-ENOMEM);
> >> 	}
> >> right after
> >> 	tag = p9_idpool_get(c->tagpool);
> >> 	if (tag < 0)
> >> 		return ERR_PTR(-ENOMEM);
> >>
> >> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> >
>
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.

While the one in p9_client_write() (on rsize < count) did?
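The p9_client_write() check being referred to is presumably something along
these lines (a guess at the debug instrumentation, not a quote of it),
placed right after the RWRITE count has been read from the reply with
p9pdu_readf(..., "d", &count):

	/* debug check: server claims to have written more than this chunk
	 * asked for */
	if (WARN_ON_ONCE(rsize < count)) {
		pr_err("bogus RWRITE: asked to write %d, server claims %d\n",
		       rsize, count);
		count = rsize;	/* keep the bogus value from advancing the iterator */
	}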
On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:
> >> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> >
>
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.

Summary for those who'd missed the beginning of the thread: what we are
seeing is p9_client_write() issuing TWRITE and getting RWRITE in reply
(tags match, packets look plausible) with count in RWRITE way more than
that in TWRITE.

IOW, we are telling the server to write e.g. 93 bytes and are getting told
that yes, the write had been successful - all 4096 bytes of it.

qemu virtio-9p for server; from my reading of qemu side of things, it can't
be sending reply with count greater than that in request.

The bug I suspected to be the cause of that is in tag allocation in
net/9p/client.c - we could end up wrapping around 2^16 with enough pending
requests and that would have triggered that kind of mess.  However, Andrey
doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
BTW, was that on the run where debugging printk in p9_client_write() *did*
trigger?
On 07/02/2015 10:59 AM, Al Viro wrote:
> On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:
>
>>>> and see if it triggers.  I'm not sure if failing with ENOMEM is the
>>>> right response (another variant is to sleep there until the pile
>>>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>>>> not for the real work, but it will do for confirming that this is what
>>>> we are hitting.
>>>
>>
>> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
>
> Summary for those who'd missed the beginning of the thread: what we are
> seeing is p9_client_write() issuing TWRITE and getting RWRITE in reply
> (tags match, packets look plausible) with count in RWRITE way more than
> that in TWRITE.
>
> IOW, we are telling the server to write e.g. 93 bytes and are getting told
> that yes, the write had been successful - all 4096 bytes of it.
>
> qemu virtio-9p for server; from my reading of qemu side of things, it can't
> be sending reply with count greater than that in request.

Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
this under kvmtool as well. It just takes a bit longer to reproduce
this in kvmtool.

> The bug I suspected to be the cause of that is in tag allocation in
> net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> requests and that would have triggered that kind of mess.  However, Andrey
> doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> BTW, was that on the run where debugging printk in p9_client_write() *did*
> trigger?

Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
but debug printk in p9_client_write() *did* trigger.
On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> this under kvmtool as well. It just takes a bit longer to reproduce
> this in kvmtool.
>
> > The bug I suspected to be the cause of that is in tag allocation in
> > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > requests and that would have triggered that kind of mess.  However, Andrey
> > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > trigger?
>
> Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> but debug printk in p9_client_write() *did* trigger.

Bloody wonderful...  Could you check if v9fs_write() in qemu
hw/9pfs/virtio-9p.c ever gets to

	offset = 7;
	err = pdu_marshal(pdu, offset, "d", total);

with total > count on your testcase?
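One possible way to instrument that on the qemu side (hypothetical debug
code; it assumes v9fs_write()'s locals are named total for what has been
written back and count for what the client asked for, as in that era's
hw/9pfs/virtio-9p.c):

    /* hypothetical debug check, placed just before the existing reply
     * marshalling quoted above */
    if (total > count) {
        fprintf(stderr, "9p: v9fs_write replying with total %zd > count %u\n",
                (ssize_t)total, (unsigned)count);
        abort();    /* or just log, if a core dump is too drastic */
    }
    offset = 7;
    err = pdu_marshal(pdu, offset, "d", total);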
On Thu, Jul 02, 2015 at 09:25:30AM +0100, Al Viro wrote:
> On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> > Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> > this under kvmtool as well. It just takes a bit longer to reproduce
> > this in kvmtool.
> >
> > > The bug I suspected to be the cause of that is in tag allocation in
> > > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > > requests and that would have triggered that kind of mess.  However, Andrey
> > > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > > trigger?
> >
> > Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> > but debug printk in p9_client_write() *did* trigger.
>
> Bloody wonderful...  Could you check if v9fs_write() in qemu
> hw/9pfs/virtio-9p.c ever gets to
>
> 	offset = 7;
> 	err = pdu_marshal(pdu, offset, "d", total);
>
> with total > count on your testcase?

Another thing that might be worth checking: in p9_tag_alloc() (net/9p/client.c)
before

	req->status = REQ_STATUS_ALLOC;

check that req->status == REQ_STATUS_IDLE and yell if it isn't.

BTW, the loop in there (

	/* check again since original check was outside of lock */
	while (tag >= c->max_tag) {

) looks fishy.  If we get more than P9_ROW_MAXTAG allocations at once, we'll
have trouble, but I doubt that this is what we are hitting.  In any case,
adding WARN_ON(c->req[row]); right after

	row = (tag / P9_ROW_MAXTAG);

wouldn't hurt.  I would be very surprised if that one triggered, though.
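Put together, the two checks being suggested would land in p9_tag_alloc()
roughly like this (untested sketch; it assumes the request array is
c->reqs[row][col], as in mainline net/9p/client.c of the time):

	/* inside the row-allocation loop, right after the row is computed: */
	row = (tag / P9_ROW_MAXTAG);
	WARN_ON(c->reqs[row]);		/* row should not already be populated */

	/* and where the slot is handed out, before it is marked in use: */
	req = &c->reqs[row][col];
	WARN_ON(req->status != REQ_STATUS_IDLE);	/* yell if the slot is still busy */
	req->status = REQ_STATUS_ALLOC;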
On Thu, 2 Jul 2015 04:20:42 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
>
> ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
>
> 	tag = P9_NOTAG;
> 	if (type != P9_TVERSION) {
> 		tag = p9_idpool_get(c->tagpool);
> 		if (tag < 0)
> 			return ERR_PTR(-ENOMEM);
> 	}
>
> tag is int here.  Then we pass tag to
>
> 	req = p9_tag_alloc(c, tag, req_size);
>
> and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> is u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
>
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3.  And on the client
> there are two threads waiting for responses on the same p9_req_t.  Both
> happen to be TWRITE.  Response to the first request arrives and we happen
> to let the second thread go at it first.  Voila - the first request had
> been for page-sized write() and got successfully handled.  The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
>
> It should be easy to confirm - in p9_client_prepare_req() add
> 	if (WARN_ON_ONCE(tag != (u16)tag)) {
> 		p9_idpool_put(tag, c->tagpool);
> 		return ERR_PTR(-ENOMEM);
> 	}
> right after
> 	tag = p9_idpool_get(c->tagpool);
> 	if (tag < 0)
> 		return ERR_PTR(-ENOMEM);
>
> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.

ISTM that p9_idpool_get ought to be using idr_alloc_cyclic instead.
That should ensure that it's only allocating values from within the
given range.
On Thu, 2 Jul 2015 08:00:26 -0400
Jeff Layton <jlayton@poochiereds.net> wrote:

> On Thu, 2 Jul 2015 04:20:42 +0100
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
>
> > On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > > Mismatched reply could also be a possibility, but only if we end up with
> > > sending more than one request with the same tag without waiting for response
> > > for the first one.
> >
> > ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> > server stalls for some reason *and* we keep piling the requests up.
> > New tags keep being grabbed by this:
> >
> > 	tag = P9_NOTAG;
> > 	if (type != P9_TVERSION) {
> > 		tag = p9_idpool_get(c->tagpool);
> > 		if (tag < 0)
> > 			return ERR_PTR(-ENOMEM);
> > 	}
> >
> > tag is int here.  Then we pass tag to
> >
> > 	req = p9_tag_alloc(c, tag, req_size);
> >
> > and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> > is u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> > will happily return values greater than 65535 - it's using idr and it's
> > used (with different pools) for 16bit tags and 32bit FIDs.
> >
> > Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> > p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> > concerned, we'd just sent another request with tag 3.  And on the client
> > there are two threads waiting for responses on the same p9_req_t.  Both
> > happen to be TWRITE.  Response to the first request arrives and we happen
> > to let the second thread go at it first.  Voila - the first request had
> > been for page-sized write() and got successfully handled.  The _second_ one
> > had been short and is very surprised to see confirmation of 4Kb worth of
> > data having been written.
> >
> > It should be easy to confirm - in p9_client_prepare_req() add
> > 	if (WARN_ON_ONCE(tag != (u16)tag)) {
> > 		p9_idpool_put(tag, c->tagpool);
> > 		return ERR_PTR(-ENOMEM);
> > 	}
> > right after
> > 	tag = p9_idpool_get(c->tagpool);
> > 	if (tag < 0)
> > 		return ERR_PTR(-ENOMEM);
> >
> > and see if it triggers.  I'm not sure if failing with ENOMEM is the
> > right response (another variant is to sleep there until the pile
> > gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> > not for the real work, but it will do for confirming that this is what
> > we are hitting.
>
> ISTM that p9_idpool_get ought to be using idr_alloc_cyclic instead.
> That should ensure that it's only allocating values from within the
> given range.
>

Erm...and why is it passing in '0' to idr_alloc for the end value if it
can't deal with more than 16 bits?  That seems like a plain old bug...

The other stuff you've noted should also be fixed of course, but the IDR
usage here could use a little rework.
On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:
> Erm...and why is it passing in '0' to idr_alloc for the end value if it
> can't deal with more than 16 bits?  That seems like a plain old bug...

Because they are using the same function (with different pool, obviously)
for FID allocation, and those are 32bit...
On Thu, 2 Jul 2015 17:45:35 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:
> > Erm...and why is it passing in '0' to idr_alloc for the end value if it
> > can't deal with more than 16 bits?  That seems like a plain old bug...
>
> Because they are using the same function (with different pool, obviously)
> for FID allocation, and those are 32bit...

Ahh, right...

So p9_idpool_create should take an argument for the "end" value, and
then store that in a new field in p9_idpool.  Then they can pass that in
as the "end" parm in idr_alloc.  Or, they could give up using the same
function there and use a different one for tags and FIDs.

In any case...allowing this thing to allocate tag values that can
collide seems fundamentally wrong.  Using idr_alloc_cyclic might also
not hurt either, particularly given that these tag values are supposed
to function something like an XID and you probably don't want to be
reusing them too quickly.
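In code, that suggestion might look something like this (untested sketch;
the existing struct p9_idpool only carries the lock and the idr, the end
field is the addition being proposed):

	struct p9_idpool {
		spinlock_t lock;
		struct idr pool;
		int end;	/* exclusive upper bound: P9_NOTAG for tags, 0 (no limit) for FIDs */
	};

	/* in p9_idpool_get(), instead of idr_alloc(&p->pool, p, 0, 0, GFP_NOWAIT): */
	i = idr_alloc_cyclic(&p->pool, p, 0, p->end, GFP_NOWAIT);

Using P9_NOTAG (0xffff) as the exclusive bound would also keep the reserved
tag out of circulation, and idr_alloc_cyclic gives the "don't reuse too
quickly" behaviour for free.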
Jeff Layton wrote on Thu, Jul 02, 2015:
> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool.  Then they can pass that in
> as the "end" parm in idr_alloc.  Or, they could give up using the same
> function there and use a different one for tags and FIDs.
>
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong.  Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.

Using cache=none here so behaviour is likely different with cache, but
basically you can't get more than one tag per user thread accessing the
9P mount...
And in RDMA there's a credit so I can't get past whatever sq option was
given (defaults to 32) -- tbh even with other transports I doubt it's
going to get much higher.

Still definitely needs fixing, but I think the issue is somewhere
else...  If Andrey could share the workload he uses I can try with other
servers, would be nice if we can rule a qemu bug out completely :)
On Thu, Jul 02, 2015 at 01:01:39PM -0400, Jeff Layton wrote:
> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool.  Then they can pass that in
> as the "end" parm in idr_alloc.  Or, they could give up using the same
> function there and use a different one for tags and FIDs.
>
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong.  Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.

All they are used for is matching response to request.  Basically, you
can have up to 65535 pending requests.  Reusing it right after getting
the response is fine.

Keep in mind that it's not supposed to be used on top of something like
UDP - *all* retransmits, etc., are the responsibility of the transport.
It's connection-oriented, reliable and order-preserving, with a shitload
of state tied to the connection, starting with FIDs - unlike FHANDLE, FID
meaning depends upon connection history.  Tags are even more transient.
Basically, the rules are

	* request bears a 16bit tag.
	* server can process pending requests in any order (with one
	  exception) and it must put the same tag into responses.
	* client can ask to abort a pending request by issuing
	  Tflush[old_tag];
	* server must handle Tflush immediately.  It must drop any pending
	  request matching old_tag and send Rflush confirming that.  No
	  response to any request matching old_tag sent before Tflush
	  should be issued after issuing Rflush.
	* if client has not issued Tflush, it must not reuse the tag until
	  getting a response bearing that tag.
	* if client *has* issued Tflush, it must not reuse the tag until
	  getting Rflush, even if it does get a response to the request it
	  has tried to abort.

BTW, failure to send Tflush means that we should leave the tag in use,
period.  As it is, p9_client_rpc()/p9_client_zc_rpc() ignore such
situations - failure from p9_client_flush() is simply not noticed.  I
seriously doubt that this is what we are hitting here, but it's a bug all
the same.

We also must _not_ let p9_client_cb() do anything unless req is non-NULL
and req->status is REQ_STATUS_SENT - stray tags from the server shouldn't
be acted upon.  As it is, the whole thing is trivial to oops - just have
the server send _any_ R-message with something like 0xfff0 for tag.  End
of story, p9_tag_lookup() returns NULL, p9_client_cb() oopses.
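A sketch of the kind of guard that would avoid that oops (untested), shown
at the point where the virtio transport looks the tag up - req_done() in
net/9p/trans_virtio.c, quoted later in this thread:

	req = p9_tag_lookup(chan->client, rc->tag);
	if (!req || req->status != REQ_STATUS_SENT) {
		/* stray or duplicate tag from the server: don't oops, and
		 * don't complete some unrelated request */
		pr_err("9p: unexpected reply, tag %d\n", rc->tag);
		continue;
	}
	p9_client_cb(chan->client, req, REQ_STATUS_RCVD);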
On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:
> Using cache=none here so behaviour is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...

Yes, and...?  You can get a lot more than one user thread...  Andrey is
using trinity(1) on the client, and that's *definitely* not single-threaded -
the whole point is stressing the damn thing.
On Thu, 2 Jul 2015 19:56:29 +0200
Dominique Martinet <dominique.martinet@cea.fr> wrote:

> Jeff Layton wrote on Thu, Jul 02, 2015:
> > So p9_idpool_create should take an argument for the "end" value, and
> > then store that in a new field in p9_idpool.  Then they can pass that in
> > as the "end" parm in idr_alloc.  Or, they could give up using the same
> > function there and use a different one for tags and FIDs.
> >
> > In any case...allowing this thing to allocate tag values that can
> > collide seems fundamentally wrong.  Using idr_alloc_cyclic might also
> > not hurt either, particularly given that these tag values are supposed
> > to function something like an XID and you probably don't want to be
> > reusing them too quickly.
>
> Using cache=none here so behaviour is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...
> And in RDMA there's a credit so I can't get past whatever sq option was
> given (defaults to 32) -- tbh even with other transports I doubt it's
> going to get much higher.
>
> Still definitely needs fixing, but I think the issue is somewhere
> else...  If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
>

Fair enough...

If you're in there and decide to fix this up, then consider moving this
over to IDA instead of IDR.  The pointers stored are not terribly
interesting (always the same as the p9_idpool), so by doing that you'll
save quite a bit of memory as well.
On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> All they are used for is matching response to request.  Basically, you
> can have up to 65535 pending requests.  Reusing it right after getting
> the response is fine.

Reusing a tag right after getting the completion may be fine in
theory, but it still sounds like a bad idea.  Sure, it's used to match
the command with the reply, but using those kinds of things for
matching re-sends and to index into various "current data structures"
is also very common (not having looked at p9 I don't know how much it
does), and basically reusing tags "soon" tends to make those kinds of
things fragile.

Which can easily turn a "this _should_ work" into "it doesn't _actually_
work" just because it ends up making things like race conditions and
re-ordering of replies trigger worse behavior.

For example, things like "done with previous command X" and "now starting
new command X" - if the tag is the same and those *independent* messages
get re-ordered, the tag just failed in what it was supposed to do.

So circular allocators are likely a good idea even if there are other
layers that should handle retransmits etc.

So it does sound like it would be better to use a circular tag allocator
rather than a "lowest tag first" allocator.

               Linus
2015-07-02 20:56 GMT+03:00 Dominique Martinet <dominique.martinet@cea.fr>:
>
> Still definitely needs fixing, but I think the issue is somewhere
> else...  If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
>

I simply run trinity from a 9p rootfs:

	./trinity -qqq --dangerous -xinit_module -C100

qemu guest, virtio transport.

BTW, I've discovered that all these bogus writes come from the trinity
logger (log.c), which just opens one file per thread (trinity-childX.log)
and writes to it.
On Thu, Jul 02, 2015 at 12:16:14PM -0700, Linus Torvalds wrote:
> On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > All they are used for is matching response to request.  Basically, you
> > can have up to 65535 pending requests.  Reusing it right after getting
> > the response is fine.
>
> Reusing a tag right after getting the completion may be fine in
> theory, but it still sounds like a bad idea.  Sure, it's used to match
> the command with the reply, but using those kinds of things for
> matching re-sends and to index into various "current data structures"
> is also very common (not having looked at p9 I don't know how much it
> does), and basically reusing tags "soon" tends to make those kinds of
> things fragile.

_All_ retransmits are done in the transport layer there.  It's not NFS -
it really expects a reliable, ordered connection for transport.  No
retransmits, no duplicates, etc.

I'm not dead against circular allocation, but I would really like to
figure out what's going on first.  I still wonder if we are seeing
wraparound (should've posted a diff instead of a verbal description - mea
culpa).  If we are not, it smells like a response to a request having
arrived while the tag had not been in use from the client POV, _or_
buggered barriers of some kind.  Maybe buggered ordering of replies
somewhere, but that's only if Tflush had been involved (as in

	-> Twrite tag = 3
	-> Tflush tag = 42 old_tag = 3
	<- Rwrite tag = 3
	<- Rflush tag = 42
	mark tag 3 free to be reused
	reuse tag 3
	... somehow get to seeing Rwrite only now

But I don't see where such ordering violation could've happened at the
moment.  The way it's supposed to work is that the sequence

	-> Twhatever tag = N
	-> Tflush old_tag = N

must either end up with no response to the former arriving at all, or
arriving before the response to the latter.  Transport itself does
preserve ordering (TCP certainly would, but the virtio queue also does,
AFAICS) and we really need to have p9_client_cb() called in order of
arrival.

Hmm...  This is a stab in the dark, but... we have vring_interrupt()
calling req_done(), which does

	while (1) {
		spin_lock_irqsave(&chan->lock, flags);
		rc = virtqueue_get_buf(chan->vq, &len);
		if (rc == NULL) {
			spin_unlock_irqrestore(&chan->lock, flags);
			break;
		}
		chan->ring_bufs_avail = 1;
		spin_unlock_irqrestore(&chan->lock, flags);
		/* Wakeup if anyone waiting for VirtIO ring space. */
		wake_up(chan->vc_wq);
		p9_debug(P9_DEBUG_TRANS, ": rc %p\n", rc);
		p9_debug(P9_DEBUG_TRANS, ": lookup tag %d\n", rc->tag);
		req = p9_tag_lookup(chan->client, rc->tag);
		p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
	}

What's to prevent *another* vring_interrupt() (called from some kind of
IRQ handler) hitting on another CPU and competing with this one for the
queue?  While we are at it, both p9_tag_lookup() and p9_client_cb() should
be fine with being called under spin_lock_irqsave, so why not hold it
outside of the loop?
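For the last point, a sketch of what holding the lock across the whole
drain might look like (untested; declarations and field names are from
memory of that era's trans_virtio.c, and where exactly the debug prints
and the wake_up() should go is left aside):

	static void req_done(struct virtqueue *vq)
	{
		struct virtio_chan *chan = vq->vdev->priv;
		struct p9_fcall *rc;
		struct p9_req_t *req;
		unsigned long flags;
		unsigned int len;

		/* one CPU drains the queue at a time; lookups and callbacks
		 * happen in arrival order under the lock */
		spin_lock_irqsave(&chan->lock, flags);
		while ((rc = virtqueue_get_buf(chan->vq, &len)) != NULL) {
			chan->ring_bufs_avail = 1;
			req = p9_tag_lookup(chan->client, rc->tag);
			p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
		}
		spin_unlock_irqrestore(&chan->lock, flags);

		/* wake anyone waiting for VirtIO ring space, once, after draining */
		wake_up(chan->vc_wq);
	}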
Al Viro wrote on Thu, Jul 02, 2015:
> On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:
> > Using cache=none here so behaviour is likely different with cache, but
> > basically you can't get more than one tag per user thread accessing the
> > 9P mount...
>
> Yes, and...?  You can get a lot more than one user thread...  Andrey is
> using trinity(1) on the client, and that's *definitely* not single-threaded -
> the whole point is stressing the damn thing.

I have run trinity quite a bit and it doesn't fork bomb as far as I can
recall; with him running it with -C100 we're not quite at 2^16 yet?

I do agree it's a problem, just don't think it's the one we're hitting --
I'll try again on a recent kernel to see if anything changed with rdma/tcp
as well, but I'm starting to doubt I'll get any luck with anything other
than virtio; which doesn't really help since it's not the same order of
latencies.

FWIW I don't *think* trinity can issue Tflush either without user
interaction, that's a really special call.  It can only happen in rpc()
or zc_rpc() if it's interrupted by ERESTARTSYS, which I understand as ^C?

(I'll look into making the pools use IDA unless someone else steps up,
sure.  Thanks Jeff)
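For reference, a rough sketch of what the IDA variant could look like for
the tag side of the pool (illustration only: the tag_ida field and the
helper names are made up here, and note that IDA hands out lowest-free
rather than cyclic ids, so it addresses Jeff's memory point rather than
the reuse-soon concern):

	/* hypothetical replacement for the tagpool; c->tag_ida would be a
	 * new struct ida in struct p9_client */
	static int p9_tag_get(struct p9_client *c)
	{
		/* [0, P9_NOTAG): the reserved P9_NOTAG (0xffff) is never handed out */
		return ida_simple_get(&c->tag_ida, 0, P9_NOTAG, GFP_KERNEL);
	}

	static void p9_tag_put(struct p9_client *c, u16 tag)
	{
		ida_simple_remove(&c->tag_ida, tag);
	}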
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..ca3b342 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1647,6 +1647,7 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 		if (*err) {
 			trace_9p_protocol_dump(clnt, req->rc);
 			p9_free_req(clnt, req);
+			break;
 		}
 
 		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);