From patchwork Mon Mar 23 05:07:47 2015
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 6069871
From: Tejun Heo
To: axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org,
	hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com,
	lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org,
	mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com,
	david@fromorbit.com, gthelen@google.com, Tejun Heo, Vladimir Davydov
Subject: [PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim
Date: Mon, 23 Mar 2015 01:07:47 -0400
Message-Id: <1427087267-16592-19-git-send-email-tj@kernel.org>
In-Reply-To: <1427087267-16592-1-git-send-email-tj@kernel.org>
References: <1427087267-16592-1-git-send-email-tj@kernel.org>
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

Because writeback wasn't cgroup aware before, the usual dirty
throttling mechanism in balance_dirty_pages() didn't work for
processes under memcg limit.
The writeback path didn't know how much memory was available or how
fast dirty pages were being written out for a given memcg, and
balance_dirty_pages() didn't have any measure of IO back pressure for
the memcg.

To work around the issue, memcg implemented an ad-hoc dirty throttling
mechanism in the direct reclaim path by stalling on pages under
writeback which are encountered during direct reclaim scan.  This is
rather ugly and crude - none of the configurability, fairness, or
bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg aware dirty throttling
and the ad-hoc mechanism is no longer necessary.  Remove it.

Note: I removed the parts which seemed obvious and it behaves fine in
testing, but my understanding of this code path is rudimentary and
it's quite possible that I got something wrong.  Please let me know if
I got something wrong or if more global_reclaim() sites should be
updated.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
Cc: Vladimir Davydov
---
 mm/vmscan.c | 109 ++++++++++++++++++------------------------------------------
 1 file changed, 33 insertions(+), 76 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f8d3c0..d084c95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,53 +929,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			nr_congested++;
 
 		/*
-		 * If a page at the tail of the LRU is under writeback, there
-		 * are three cases to consider.
-		 *
-		 * 1) If reclaim is encountering an excessive number of pages
-		 *    under writeback and this page is both under writeback and
-		 *    PageReclaim then it indicates that pages are being queued
-		 *    for IO but are being recycled through the LRU before the
-		 *    IO can complete. Waiting on the page itself risks an
-		 *    indefinite stall if it is impossible to writeback the
-		 *    page due to IO error or disconnected storage so instead
-		 *    note that the LRU is being scanned too quickly and the
-		 *    caller can stall after page list has been processed.
-		 *
-		 * 2) Global reclaim encounters a page, memcg encounters a
-		 *    page that is not marked for immediate reclaim or
-		 *    the caller does not have __GFP_IO. In this case mark
-		 *    the page for immediate reclaim and continue scanning.
-		 *
-		 *    __GFP_IO is checked because a loop driver thread might
-		 *    enter reclaim, and deadlock if it waits on a page for
-		 *    which it is needed to do the write (loop masks off
-		 *    __GFP_IO|__GFP_FS for this reason); but more thought
-		 *    would probably show more reasons.
-		 *
-		 *    Don't require __GFP_FS, since we're not going into the
-		 *    FS, just waiting on its writeback completion. Worryingly,
-		 *    ext4 gfs2 and xfs allocate pages with
-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
-		 *    may_enter_fs here is liable to OOM on them.
-		 *
-		 * 3) memcg encounters a page that is not already marked
-		 *    PageReclaim. memcg does not have any dirty pages
-		 *    throttling so we could easily OOM just because too many
-		 *    pages are in writeback and there is nothing else to
-		 *    reclaim. Wait for the writeback to complete.
+		 * A page at the tail of the LRU is under writeback.  If
+		 * reclaim is encountering an excessive number of pages
+		 * under writeback and this page is both under writeback
+		 * and PageReclaim then it indicates that pages are being
+		 * queued for IO but are being recycled through the LRU
+		 * before the IO can complete.  Waiting on the page itself
+		 * risks an indefinite stall if it is impossible to
+		 * writeback the page due to IO error or disconnected
+		 * storage so instead note that the LRU is being scanned
+		 * too quickly and the caller can stall after page list has
+		 * been processed.
 		 */
 		if (PageWriteback(page)) {
-			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
 				nr_immediate++;
-				goto keep_locked;
-
-			/* Case 2 above */
-			} else if (global_reclaim(sc) ||
-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+			} else {
 				/*
 				 * This is slightly racy - end_page_writeback()
 				 * might have just cleared PageReclaim, then
@@ -989,13 +960,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-
-				goto keep_locked;
-
-			/* Case 3 above */
-			} else {
-				wait_on_page_writeback(page);
 			}
+			goto keep_locked;
 		}
 
 		if (!force_reclaim)
@@ -1423,9 +1389,6 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;
 
-	if (!global_reclaim(sc))
-		return 0;
-
 	if (file) {
 		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
@@ -1615,35 +1578,29 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		set_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
-	 * memcg will stall in page writeback so only consider forcibly
-	 * stalling for global reclaim
+	 * Tag a zone as congested if all the dirty pages scanned were
+	 * backed by a congested BDI and wait_iff_congested will stall.
 	 */
-	if (global_reclaim(sc)) {
-		/*
-		 * Tag a zone as congested if all the dirty pages scanned were
-		 * backed by a congested BDI and wait_iff_congested will stall.
-		 */
-		if (nr_dirty && nr_dirty == nr_congested)
-			set_bit(ZONE_CONGESTED, &zone->flags);
+	if (nr_dirty && nr_dirty == nr_congested)
+		set_bit(ZONE_CONGESTED, &zone->flags);
 
-		/*
-		 * If dirty pages are scanned that are not queued for IO, it
-		 * implies that flushers are not keeping up. In this case, flag
-		 * the zone ZONE_DIRTY and kswapd will start writing pages from
-		 * reclaim context.
-		 */
-		if (nr_unqueued_dirty == nr_taken)
-			set_bit(ZONE_DIRTY, &zone->flags);
+	/*
+	 * If dirty pages are scanned that are not queued for IO, it
+	 * implies that flushers are not keeping up. In this case, flag the
+	 * zone ZONE_DIRTY and kswapd will start writing pages from reclaim
+	 * context.
+	 */
+	if (nr_unqueued_dirty == nr_taken)
+		set_bit(ZONE_DIRTY, &zone->flags);
 
-		/*
-		 * If kswapd scans pages marked marked for immediate
-		 * reclaim and under writeback (nr_immediate), it implies
-		 * that pages are cycling through the LRU faster than
-		 * they are written so also forcibly stall.
-		 */
-		if (nr_immediate && current_may_throttle())
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-	}
+	/*
+	 * If kswapd scans pages marked marked for immediate reclaim and
+	 * under writeback (nr_immediate), it implies that pages are
+	 * cycling through the LRU faster than they are written so also
+	 * forcibly stall.
+	 */
+	if (nr_immediate && current_may_throttle())
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 	/*
 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
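
For readers skimming the series, the behavioural change in
shrink_page_list() boils down to dropping the memcg-only
wait_on_page_writeback() path.  The small stand-alone C model below is
an illustrative sketch only, not kernel code: old_behavior(),
new_behavior(), the enum and the boolean parameters are hypothetical
stand-ins for current_is_kswapd(), PageReclaim(),
test_bit(ZONE_WRITEBACK, ...), global_reclaim() and the __GFP_IO
check; only the branch structure mirrors the diff above.

/*
 * Stand-alone model of the PageWriteback handling in shrink_page_list(),
 * before and after this patch.  The booleans are stand-ins for the kernel
 * predicates named in the diff; only the branch structure mirrors the code.
 */
#include <stdbool.h>
#include <stdio.h>

enum action { STALL_LATER, MARK_RECLAIM, WAIT_ON_WRITEBACK };

/* Before: memcg reclaim (case 3) could block on writeback completion. */
static enum action old_behavior(bool kswapd, bool page_reclaim,
				bool zone_writeback, bool global, bool gfp_io)
{
	if (kswapd && page_reclaim && zone_writeback)
		return STALL_LATER;		/* case 1: caller stalls later */
	if (global || !page_reclaim || !gfp_io)
		return MARK_RECLAIM;		/* case 2: mark and keep scanning */
	return WAIT_ON_WRITEBACK;		/* case 3: memcg-only stall */
}

/* After: the memcg-only stall is gone; everyone takes the same two paths. */
static enum action new_behavior(bool kswapd, bool page_reclaim,
				bool zone_writeback)
{
	if (kswapd && page_reclaim && zone_writeback)
		return STALL_LATER;
	return MARK_RECLAIM;
}

int main(void)
{
	/* memcg direct reclaim hitting a PageReclaim page with __GFP_IO set */
	printf("old: %d, new: %d\n",
	       old_behavior(false, true, false, false, true),
	       new_behavior(false, true, false));
	return 0;
}

Run for a memcg direct reclaimer that hits a PageReclaim page with
__GFP_IO set, this prints "old: 2, new: 1": the old code would have
blocked on writeback completion, while the new code just marks the
page for immediate reclaim and moves on, relying on the memcg aware
balance_dirty_pages() throttling added earlier in the series.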