From patchwork Thu Oct 24 20:50:27 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11210735
Date: Thu, 24 Oct 2019 13:50:27 -0700
From: Tejun Heo
To: "David S. Miller"
Cc: netdev@vger.kernel.org, kernel-team@fb.com, linux-kernel@vger.kernel.org,
    Josef Bacik, Eric Dumazet, Jakub Kicinski, Johannes Weiner,
    linux-mm@kvack.org, Mel Gorman, Andrew Morton
Subject: [PATCH v2] net: fix sk_page_frag() recursion from memory reclaim
Message-ID: <20191024205027.GF3622521@devbig004.ftw2.facebook.com>
References: <20191019170141.GQ18794@devbig004.ftw2.facebook.com>
In-Reply-To: <20191019170141.GQ18794@devbig004.ftw2.facebook.com>
User-Agent: Mutt/1.5.21 (2010-09-15)

sk_page_frag() optimizes skb_frag allocations by using per-task skb_frag cache when it knows it's the only user.
The condition is determined by whether the socket allocation mask allows blocking - if the allocation may block, it obviously owns the task's context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through the memory reclaim path. Please take a look at the following backtrace.

 [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
     ...
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     sock_xmit.isra.24+0xa1/0x170 [nbd]
     nbd_send_cmd+0x1d2/0x690 [nbd]
     nbd_queue_rq+0x1b5/0x3b0 [nbd]
     __blk_mq_try_issue_directly+0x108/0x1b0
     blk_mq_request_issue_directly+0xbd/0xe0
     blk_mq_try_issue_list_directly+0x41/0xb0
     blk_mq_sched_insert_requests+0xa2/0xe0
     blk_mq_flush_plug_list+0x205/0x2a0
     blk_flush_plug_list+0xc3/0xf0
 [1] blk_finish_plug+0x21/0x2e
     _xfs_buf_ioapply+0x313/0x460
     __xfs_buf_submit+0x67/0x220
     xfs_buf_read_map+0x113/0x1a0
     xfs_trans_read_buf_map+0xbf/0x330
     xfs_btree_read_buf_block.constprop.42+0x95/0xd0
     xfs_btree_lookup_get_block+0x95/0x170
     xfs_btree_lookup+0xcc/0x470
     xfs_bmap_del_extent_real+0x254/0x9a0
     __xfs_bunmapi+0x45c/0xab0
     xfs_bunmapi+0x15/0x30
     xfs_itruncate_extents_flags+0xca/0x250
     xfs_free_eofblocks+0x181/0x1e0
     xfs_fs_destroy_inode+0xa8/0x1b0
     destroy_inode+0x38/0x70
     dispose_list+0x35/0x50
     prune_icache_sb+0x52/0x70
     super_cache_scan+0x120/0x1a0
     do_shrink_slab+0x120/0x290
     shrink_slab+0x216/0x2b0
     shrink_node+0x1b6/0x4a0
     do_try_to_free_pages+0xc6/0x370
     try_to_free_mem_cgroup_pages+0xe3/0x1e0
     try_charge+0x29e/0x790
     mem_cgroup_charge_skmem+0x6a/0x100
     __sk_mem_raise_allocated+0x18e/0x390
     __sk_mem_schedule+0x2a/0x40
 [0] tcp_sendmsg_locked+0x8eb/0xe10
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     ___sys_sendmsg+0x26d/0x2b0
     __sys_sendmsg+0x57/0xa0
     do_syscall_64+0x42/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

In [0], tcp_sendmsg_locked() was using current->page_frag when it called sk_wmem_schedule(). It had already calculated how many bytes can fit into current->page_frag.
Due to memory pressure, sk_wmem_schedule() called into the memory reclaim path, which called into xfs and then the IO issue path. Because the filesystem in question is backed by nbd, control goes back into the tcp layer - back into tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes sense - it's in the process of freeing memory and wants to be able to, e.g., drop clean pages to make forward progress. However, this confused sk_page_frag() called from [2]. Because it only tests whether the allocation allows blocking - which it does - it now thinks current->page_frag can be used again although it was already being used in [0].

After [2] used current->page_frag, the offset would be increased by the used amount. When control returns to [0], current->page_frag's offset is increased and the previously calculated number of bytes may now overrun the end of the allocated memory, leading to silent memory corruptions.

Fix it by adding gfpflags_normal_context() which tests sleepable && !reclaim and using it to determine whether to use current->task_frag.

v2: Eric didn't like gfp flags being tested twice. Introduce a new helper gfpflags_normal_context() and combine the two tests.

Signed-off-by: Tejun Heo
Cc: Josef Bacik
Cc: Eric Dumazet
Cc: stable@vger.kernel.org
---
 include/linux/gfp.h | 23 +++++++++++++++++++++++
 include/net/sock.h  | 11 ++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..61f2f6ff9467 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+/**
+ * gfpflags_normal_context - is gfp_flags a normal sleepable context?
+ * @gfp_flags: gfp_flags to test
+ *
+ * Test whether @gfp_flags indicates that the allocation is from the
+ * %current context and allowed to sleep.
+ *
+ * An allocation being allowed to block doesn't mean it owns the %current
+ * context.  When the direct reclaim path tries to allocate memory, the
+ * allocation context is nested inside whatever %current was doing at the
+ * time of the original allocation.  The nested allocation may be allowed
+ * to block but modifying anything %current owns can corrupt the outer
+ * context's expectations.
+ *
+ * A %true result from this function indicates that the allocation context
+ * can sleep and use anything that's associated with %current.
+ */
+static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
+{
+	return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
+		__GFP_DIRECT_RECLAIM;
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else

diff --git a/include/net/sock.h b/include/net/sock.h
index f69b58bff7e5..c31a9ed86d5a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2242,12 +2242,17 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
  *
- * If socket allocation mode allows current thread to sleep, it means its
- * safe to use the per task page_frag instead of the per socket one.
+ * Use the per task page_frag instead of the per socket one for
+ * optimization when we know that we're in the normal context and own
+ * everything that's associated with %current.
+ *
+ * gfpflags_allow_blocking() isn't enough here as direct reclaim may nest
+ * inside other socket operations and end up recursing into sk_page_frag()
+ * while it's already in use.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (gfpflags_allow_blocking(sk->sk_allocation))
+	if (gfpflags_normal_context(sk->sk_allocation))
 		return &current->task_frag;
 	return &sk->sk_frag;
 }