From patchwork Wed Aug 3 03:49:26 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alan Curry X-Patchwork-Id: 9260733 X-Patchwork-Delegate: kvalo@adurom.com Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 62EC960754 for ; Wed, 3 Aug 2016 03:50:46 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 54470284FB for ; Wed, 3 Aug 2016 03:50:46 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4857128542; Wed, 3 Aug 2016 03:50:46 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E960A284FB for ; Wed, 3 Aug 2016 03:50:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754701AbcHCDuW (ORCPT ); Tue, 2 Aug 2016 23:50:22 -0400 Received: from mx.sdf.org ([192.94.73.20]:51434 "EHLO sdf.lonestar.org" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751654AbcHCDuV (ORCPT ); Tue, 2 Aug 2016 23:50:21 -0400 Received: from sdf.org (IDENT:rlwinm@otaku.freeshell.org [192.94.73.9]) by sdf.lonestar.org (8.15.2/8.14.5) with ESMTPS id u733nRLH007550 (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256 bits) verified NO); Wed, 3 Aug 2016 03:49:27 GMT Received: (from rlwinm@localhost) by sdf.org (8.15.2/8.12.8/Submit) id u733nRPn000595; Wed, 3 Aug 2016 03:49:27 GMT From: Alan Curry Message-Id: <201608030349.u733nRPn000595@sdf.org> Subject: Re: PROBLEM: network data corruption (bisected to e5a4b0bb803b) In-Reply-To: <20160728012253.GT2356@ZenIV.linux.org.uk> To: Al Viro Date: Wed, 3 Aug 2016 03:49:26 +0000 (UTC) CC: alexmcwhirter@triadic.us, David Miller , rlwinm@sdf.org, chunkeey@googlemail.com, linux-wireless@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org X-Mailer: ELM [version 2.4ME+ PL93 (25)] MIME-Version: 1.0 Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-wireless@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Al Viro wrote: > > Which just might mean that we have *three* issues here - > (1) buggered __copy_to_user_inatomic() (and friends) on some sparcs > (2) your ssl-only corruption > (3) Alan's x86_64 corruption on plain TCP read - no ssl *or* sparc > anywhere, and no multi-segment recvmsg(). Which would strongly argue in > favour of some kind of copy_page_to_iter() breakage triggered when handling > a fragmented skb, as in (1). Except that I don't see anything similar in > x86_64 uaccess primitives... > I think I've solved (3) at least... Using the twin weapons of printk and stubbornness, I have built a working theory of the bug. I haven't traced it all the way through, so my explanation may be partly wrong. I do have a patch that eliminates the symptom in all my tests though. Here's what happens: A corrupted packet somehow arrives in skb_copy_and_csum_datagram_msg(). During downloads at reasonably high speed, about 0.1% of my incoming packets are bad. Probably because the access point is that suspicious Comcast thing. skb_copy_and_csum_datagram_msg() does this: if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter, chunk, &csum)) goto fault; if (csum_fold(csum)) goto csum_error; skb_copy_and_csum_datagram() copies the bad data, computes the checksum, and *advances the iterator*. The checksum is bad, so it goes to csum_error, which returns without indicating success to userspace, but the bad data is in the userspace buffer, and since the iterator has advanced, the proper data doesn't get written to the proper place when it arrives in a retransmission. The same iterator is still used because we're still in the same syscall (I guess - this is one of the parts I didn't check out). My ugly patch fixes this in the most obvious way: make a local copy of msg->msg_iter before the call to skb_copy_and_csum_datagram(), and copy it back if the checksum is bad, just before "goto csum_error;". (I wonder if the other failure exits from this function might need to do the same thing.) You can probably reproduce this problem if you deliberately inject some bad TCP checksums into a stream. Just make sure the receiving machine is in a blocking read() on the socket when the bad packet arrives. You may need to resend the offending packet afterward with the checksum corrected. diff --git a/net/core/datagram.c b/net/core/datagram.c index b7de71f..574d4bf 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -730,6 +730,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, { __wsum csum; int chunk = skb->len - hlen; + struct iov_iter save_iter; if (!chunk) return 0; @@ -741,11 +742,14 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, goto fault; } else { csum = csum_partial(skb->data, hlen, skb->csum); + memcpy(&save_iter, &msg->msg_iter, sizeof save_iter); if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter, chunk, &csum)) goto fault; - if (csum_fold(csum)) + if (csum_fold(csum)) { + memcpy(&msg->msg_iter, &save_iter, sizeof save_iter); goto csum_error; + } if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE)) netdev_rx_csum_fault(skb->dev); }