From patchwork Wed Aug  3 03:49:26 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Curry <rlwinm@sdf.org>
X-Patchwork-Id: 9260733
X-Patchwork-Delegate: kvalo@adurom.com
Return-Path: <linux-wireless-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	62EC960754 for <patchwork-linux-wireless@patchwork.kernel.org>;
	Wed,  3 Aug 2016 03:50:46 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 54470284FB
	for <patchwork-linux-wireless@patchwork.kernel.org>;
	Wed,  3 Aug 2016 03:50:46 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 4857128542; Wed,  3 Aug 2016 03:50:46 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=unavailable version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E960A284FB
	for <patchwork-linux-wireless@patchwork.kernel.org>;
	Wed,  3 Aug 2016 03:50:45 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754701AbcHCDuW (ORCPT
	<rfc822;patchwork-linux-wireless@patchwork.kernel.org>);
	Tue, 2 Aug 2016 23:50:22 -0400
Received: from mx.sdf.org ([192.94.73.20]:51434 "EHLO sdf.lonestar.org"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751654AbcHCDuV (ORCPT <rfc822; linux-wireless@vger.kernel.org>);
	Tue, 2 Aug 2016 23:50:21 -0400
Received: from sdf.org (IDENT:rlwinm@otaku.freeshell.org [192.94.73.9])
	by sdf.lonestar.org (8.15.2/8.14.5) with ESMTPS id u733nRLH007550
	(using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256 bits)
	verified NO); Wed, 3 Aug 2016 03:49:27 GMT
Received: (from rlwinm@localhost)
	by sdf.org (8.15.2/8.12.8/Submit) id u733nRPn000595;
	Wed, 3 Aug 2016 03:49:27 GMT
From: Alan Curry <rlwinm@sdf.org>
Message-Id: <201608030349.u733nRPn000595@sdf.org>
Subject: Re: PROBLEM: network data corruption (bisected to e5a4b0bb803b)
In-Reply-To: <20160728012253.GT2356@ZenIV.linux.org.uk>
To: Al Viro <viro@ZenIV.linux.org.uk>
Date: Wed, 3 Aug 2016 03:49:26 +0000 (UTC)
CC: alexmcwhirter@triadic.us, David Miller <davem@davemloft.net>,
	rlwinm@sdf.org, chunkeey@googlemail.com,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
X-Mailer: ELM [version 2.4ME+ PL93 (25)]
MIME-Version: 1.0
Sender: linux-wireless-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-wireless.vger.kernel.org>
X-Mailing-List: linux-wireless@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Al Viro wrote:
> 
> Which just might mean that we have *three* issues here -
> 	(1) buggered __copy_to_user_inatomic() (and friends) on some sparcs
> 	(2) your ssl-only corruption
> 	(3) Alan's x86_64 corruption on plain TCP read - no ssl *or* sparc
> anywhere, and no multi-segment recvmsg().  Which would strongly argue in
> favour of some kind of copy_page_to_iter() breakage triggered when handling
> a fragmented skb, as in (1).  Except that I don't see anything similar in
> x86_64 uaccess primitives...
> 

I think I've solved (3) at least...

Using the twin weapons of printk and stubbornness, I have built a working
theory of the bug. I haven't traced it all the way through, so my explanation
may be partly wrong. I do have a patch that eliminates the symptom in all my
tests though. Here's what happens:

A corrupted packet somehow arrives in skb_copy_and_csum_datagram_msg().
During downloads at reasonably high speed, about 0.1% of my incoming
packets are bad. Probably because the access point is that suspicious
Comcast thing.

skb_copy_and_csum_datagram_msg() does this:

		if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter,
					       chunk, &csum))
			goto fault;
		if (csum_fold(csum))
			goto csum_error;

skb_copy_and_csum_datagram() copies the bad data, computes the checksum,
and *advances the iterator*. The checksum is bad, so it goes to
csum_error, which returns without indicating success to userspace, but the
bad data is in the userspace buffer, and since the iterator has advanced,
the proper data doesn't get written to the proper place when it arrives in a
retransmission. The same iterator is still used because we're still in the
same syscall (I guess - this is one of the parts I didn't check out).

My ugly patch fixes this in the most obvious way: make a local copy of
msg->msg_iter before the call to skb_copy_and_csum_datagram(), and copy it
back if the checksum is bad, just before "goto csum_error;". (I wonder if the
other failure exits from this function might need to do the same thing.)

You can probably reproduce this problem if you deliberately inject some
bad TCP checksums into a stream. Just make sure the receiving machine is
in a blocking read() on the socket when the bad packet arrives. You may
need to resend the offending packet afterward with the checksum corrected.

diff --git a/net/core/datagram.c b/net/core/datagram.c
index b7de71f..574d4bf 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -730,6 +730,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 {
 	__wsum csum;
 	int chunk = skb->len - hlen;
+	struct iov_iter save_iter;
 
 	if (!chunk)
 		return 0;
@@ -741,11 +742,14 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 			goto fault;
 	} else {
 		csum = csum_partial(skb->data, hlen, skb->csum);
+		memcpy(&save_iter, &msg->msg_iter, sizeof save_iter);
 		if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter,
 					       chunk, &csum))
 			goto fault;
-		if (csum_fold(csum))
+		if (csum_fold(csum)) {
+			memcpy(&msg->msg_iter, &save_iter, sizeof save_iter);
 			goto csum_error;
+		}
 		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE))
 			netdev_rx_csum_fault(skb->dev);
 	}