diff mbox

forgetting to cancel request in interrupted zero-copy 9P RPC (was Re: [git pull] vfs part 2)

Message ID 20150703150000.GS17109@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro July 3, 2015, 3 p.m. UTC
On Fri, Jul 03, 2015 at 10:42:10AM +0100, Al Viro wrote:

> AFAICS, we get occasional stray responses from somewhere.  And no, it doesn't
> seem to be related to flushes or to dropping chan->lock in req_done() (this
> run had been with chan->lock taken on the outside of the loop).
> 
> What I really don't understand is WTF is it playing with p9_tag_lookup() -
> it's stashing req->tc via virtqueue_add_sgs() opaque data argument, fetches
> it back in req_done(), then picks ->tag from it and uses p9_tag_lookup() to
> find req.  Why not simply pass req instead?  I had been wrong about that
> p9_tag_lookup() being able to return NULL, but why bother with it at all?


Got it.  What happens is that on zero-copy path a signal hitting in the
end of p9_virtio_zc_request() is treated as "it hadn't been sent, got
an error, fuck off and mark the tag ready for reuse".  No TFLUSH issued,
etc.  As the result, when reply finally *does* arrive (we had actually
sent the request), it plays hell on the entire thing - tag might very
well have been reused by then and an unrelated request sent with the
same tag.  Depending on the timing, results can get rather ugly.

There are still other bogosities found in this thread, and at the very
least we need to cope with genuine corrupted response from server, but
the patch below fixes the problem with stray responses here and stops the
"what do you mean, you'd written 4K?  I've only sent 30 bytes!" problems
here.  10 minutes of trinity running without triggering it, while without
that patch it triggers in 2-3 minutes.

Could you verify that the patch below deals with your setup as well?
If it does, I'm going to put it into tonight's pull request, after I get
some sleep...  Right now I'm about to crawl in direction of bed - 25 hours
of uptime is a bit too much... ;-/

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andrey Ryabinin July 3, 2015, 7:56 p.m. UTC | #1
2015-07-03 18:00 GMT+03:00 Al Viro <viro@zeniv.linux.org.uk>:
> Could you verify that the patch below deals with your setup as well?
> If it does, I'm going to put it into tonight's pull request, after I get
> some sleep...  Right now I'm about to crawl in direction of bed - 25 hours
> of uptime is a bit too much... ;-/
>

Works for me.
Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..8c4941d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -843,7 +843,8 @@  static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
 	if (err < 0) {
 		if (err == -EIO)
 			c->status = Disconnected;
-		goto reterr;
+		if (err != -ERESTARTSYS)
+			goto reterr;
 	}
 	if (req->status == REQ_STATUS_ERROR) {
 		p9_debug(P9_DEBUG_ERROR, "req_status error %d\n", req->t_err);
@@ -1647,7 +1648,10 @@  p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 		if (*err) {
 			trace_9p_protocol_dump(clnt, req->rc);
 			p9_free_req(clnt, req);
+			break;
 		}
+		if (rsize < count)
+			pr_err("mismatched reply [tag = %d]\n", req->tc->tag);
 
 		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);