From patchwork Fri Nov 25 14:06:33 2016
X-Patchwork-Submitter: Eli V
X-Patchwork-Id: 9447633
From: E V
Date: Fri, 25 Nov 2016 09:06:33 -0500
Subject: Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache
To: Vlastimil Babka
Cc: Tetsuo Handa, Andrew Morton, bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org, Michal Hocko, linux-btrfs, "Kirill A.
Shutemov", Joonsoo Kim
List: linux-btrfs@vger.kernel.org

So I rebooted with 4.9rc6 plus the patch inspired by the thread "[PATCH]
btrfs: limit the number of asynchronous delalloc pages to reasonable
value", but at 512K pages, ie:

diff -u2 fs/btrfs/inode.c ../linux-4.9-rc6/fs/btrfs/

(the full diff is appended at the end of this mail). The system still
OOM'd after a few hours of rsync copying & deleting files, but it didn't
panic this time, which was nice ;-) I then set:

echo 500 >> /proc/sys/vm/watermark_scale_factor
echo 3 >> /proc/sys/vm/dirty_background_ratio

and the system has been running rsync fine for most of a day. System
memory load is noticeably different in sar -r after changing the vm
params.

rsync during OOM:

12:00:01 AM kbmemfree kbmemused %memused kbbuffers  kbcached kbcommit %commit kbactive  kbinact kbdirty
03:25:05 AM    158616  32836640    99.52     72376  30853268  2942644    3.62 26048876  4977872 4936488
03:30:36 AM    157700  32837556    99.52     72468  30944468  2940028    3.62 26070084  4957328 4957432
03:35:02 AM   1802144  31193112    94.54     72560  29266432  2944352    3.62 26184324  3182048  187784
03:40:32 AM    157272  32837984    99.52     72648  30934432  3007244    3.70 26102636  4930744 4927832
03:45:05 AM    158288  32836968    99.52     72896  30980504   412108    0.51 26089920  4959668 4977556

running rsync after tuning VM params:

08:35:01 AM   1903352  31091904    94.23    232772  26603624  2680952    3.30 24133864  5019748 1229964
08:40:01 AM   2878552  30116704    91.28    232800  25641520  2697356    3.32 24158248  4039372 2864656
08:45:01 AM   3482616  29512640    89.45    232656  25043068  2696144    3.32 24087376  3526164 1897192
08:50:01 AM   3590672  29404584    89.12    232856  24962856  2704196    3.33 24078188  3451400  666760
08:55:01 AM   2064900  30930356    93.74    234800  26480996  2730384    3.36 24009244  5044012   50028

On Tue, Nov 22, 2016 at 9:48 AM, Vlastimil Babka wrote:
> On 11/22/2016 02:58 PM, E V wrote:
>> System OOM'd several times last night with 4.8.10, I attached the
>> page_owner output from a morning
cat ~8 hours after the OOM's to the
>> bugzilla case, split and compressed to fit under the 5M attachment
>> limit. Let me know if you need anything else.
>
> Looks like for some reason, the stack saving produces garbage stacks
> that only repeat the save_stack_trace and save_stack functions :/
>
> But judging from gfp flags and page flags, most pages seem to be
> allocated with:
>
> mask 0x2400840(GFP_NOFS|__GFP_NOFAIL)
>
> and page flags:
>
> 0x20000000000006c(referenced|uptodate|lru|active)
> or
> 0x20000000000016c(referenced|uptodate|lru|active|owner_priv_1)
> or
> 0x20000000000086c(referenced|uptodate|lru|active|private)
>
> while GFP_HIGHUSER_MOVABLE pages (which I would expect on the lru) are
> less frequent.
>
> Example:
>> grep GFP_NOFS page_owner_after_af | wc -l
> 973596
>> grep GFP_HIGHUSER_MOVABLE page_owner_after_af | wc -l
> 158879
>> grep GFP_NOFAIL page_owner_after_af | wc -l
> 971442
>
> Grepping for btrfs shows that at least some stacks for NOFS/NOFAIL
> pages imply it:
> clear_state_bit+0x135/0x1c0 [btrfs]
> or
> add_delayed_tree_ref+0xbf/0x170 [btrfs]
> or
> __btrfs_map_block+0x6a8/0x1200 [btrfs]
> or
> btrfs_buffer_uptodate+0x48/0x70 [btrfs]
> or
> btrfs_set_path_blocking+0x34/0x60 [btrfs]
>
> and some more variants.
>
> So it looks like the pages contain btrfs metadata, are on the file lru,
> and from previous checks of /proc/kpagecount we know that they most
> likely have page_count() == 0 but are not freed. Could the btrfs guys
> provide some insight here?
>
>> On Fri, Nov 18, 2016 at 10:02 AM, E V wrote:
>>> Yes, the short window between the stalls and the panic makes it
>>> difficult to manually check much. I could set up a cron every 5
>>> minutes or so if you want. Also, I see the OOM's in 4.8, but it has
>>> yet to panic on me, whereas 4.9rc has panic'd both times I've booted
>>> it, so depending on what you want to look at it might be easier to
>>> investigate on 4.8. Let me know, I can turn on a couple of the DEBUG
>>> configs and build a new 4.8.8.
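[Editor's aside on the grep counts in Vlastimil's reply above: the per-flag counting generalizes to a one-liner that tallies every gfp mask seen in a page_owner dump. A minimal sketch; the sample file and its lines below are fabricated for illustration — against a real system you would point it at the dump saved from /sys/kernel/debug/page_owner instead.]

```shell
# Fabricated page_owner-style sample, just to show the pipeline shape.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Page allocated via order 0, mask 0x2400840(GFP_NOFS|__GFP_NOFAIL)
Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
Page allocated via order 0, mask 0x2400840(GFP_NOFS|__GFP_NOFAIL)
EOF

# Pull out each "mask 0x...(FLAGS)" token, then count per distinct mask,
# most frequent first. On the sample this ranks GFP_NOFS|__GFP_NOFAIL
# (2 hits) above GFP_HIGHUSER_MOVABLE (1 hit).
grep -o 'mask 0x[0-9a-f]*([^)]*)' "$sample" | sort | uniq -c | sort -rn

rm -f "$sample"
```

This avoids running one grep per flag and shows skew across all masks at once.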
I've never looked into a netconsole or
>>> serial console. I think just getting the system to use a higher-res
>>> console would be an improvement, but the OOM's seemed to be the root
>>> cause of the panic, so I haven't spent any time looking into that as
>>> of yet.
>>>
>>> Thanks,
>>> -Eli
>>>
>>> On Fri, Nov 18, 2016 at 6:54 AM, Tetsuo Handa wrote:
>>>> On 2016/11/18 6:49, Vlastimil Babka wrote:
>>>>> On 11/16/2016 02:39 PM, E V wrote:
>>>>>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo
>>>>>> of the stack trace, and the 38 call traces in a 2 minute window
>>>>>> shortly before, to the bugzilla case for those not on its e-mail
>>>>>> list:
>>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>>>>>
>>>>> The panic screenshot has only the last part, but the end marker says
>>>>> it's OOM with no killable processes. The DEBUG_VM config thus didn't
>>>>> trigger anything, and still there's tons of pagecache, mostly clean,
>>>>> that's not being reclaimed.
>>>>>
>>>>> Could you now try this?
>>>>> - enable CONFIG_PAGE_OWNER
>>>>> - boot with kernel option: page_owner=on
>>>>> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
>>>>> - provide the file (compressed, it will be quite large)
>>>>
>>>> Excuse me for the noise, but do we really need to do
>>>> "cat /sys/kernel/debug/page_owner > file" after the first OOM killer
>>>> invocation? I worry that it might be too difficult to do.
>>>> Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file"
>>>> hourly and compare the tendency between the latest one and the
>>>> previous one?
>>>>
>>>> This system has swap, and /var/log/messages before the panic
>>>> reports that swapin was stalling at memory allocation.
>>>>
>>>> ----------------------------------------
>>>> [130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
>>>> [130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: G W I 4.9.0-rc5 #2
>>>> [130346.262662] 0000000000000000 ffffffff8129ba69 ffffffff8170e4c8 ffffc90003ccb8d8
>>>> [130346.262714] ffffffff8113449f 024200ca1ca11b40 ffffffff8170e4c8 ffffc90003ccb880
>>>> [130346.262765] ffffffff00000010 ffffc90003ccb8e8 ffffc90003ccb898 ffff88041f226e80
>>>> [130346.262817] Call Trace:
>>>> [130346.262843] [] ? dump_stack+0x46/0x5d
>>>> [130346.262872] [] ? warn_alloc+0x11f/0x140
>>>> [130346.262899] [] ? __alloc_pages_slowpath+0x84b/0xa80
>>>> [130346.262929] [] ? __alloc_pages_nodemask+0x2b0/0x2f0
>>>> [130346.262960] [] ? alloc_pages_vma+0xbe/0x260
>>>> [130346.262989] [] ? pagecache_get_page+0x22/0x280
>>>> [130346.263019] [] ? __read_swap_cache_async+0x118/0x1a0
>>>> [130346.263048] [] ? read_swap_cache_async+0xf/0x30
>>>> [130346.263077] [] ? swapin_readahead+0x16e/0x1c0
>>>> [130346.263106] [] ? radix_tree_lookup_slot+0xe/0x20
>>>> [130346.263135] [] ? find_get_entry+0x14/0x130
>>>> [130346.263162] [] ? pagecache_get_page+0x22/0x280
>>>> [130346.263193] [] ? do_swap_page+0x44f/0x5f0
>>>> [130346.263220] [] ? __radix_tree_lookup+0x62/0xc0
>>>> [130346.263249] [] ? handle_mm_fault+0x66a/0xf00
>>>> [130346.263277] [] ? find_get_entry+0x14/0x130
>>>> [130346.263305] [] ? __do_page_fault+0x1c5/0x490
>>>> [130346.263336] [] ? page_fault+0x22/0x30
>>>> [130346.263364] [] ? copy_user_generic_string+0x2c/0x40
>>>> [130346.263395] [] ? set_fd_set+0x1d/0x30
>>>> [130346.263422] [] ? core_sys_select+0x1a5/0x260
>>>> [130346.263450] [] ? getname_flags+0x6a/0x1e0
>>>> [130346.263479] [] ? cp_new_stat+0x115/0x130
>>>> [130346.263509] [] ? ktime_get_ts64+0x3f/0xf0
>>>> [130346.263537] [] ? SyS_select+0xa5/0xe0
>>>> [130346.263564] [] ? entry_SYSCALL_64_fastpath+0x13/0x94
>>>> ----------------------------------------
>>>>
>>>> Under such a situation, trying to log in and execute /bin/cat could
>>>> take minutes. Also, writing to btrfs and ext4 seems to be stalling.
>>>> The btrfs one is a situation where a WQ_MEM_RECLAIM kernel workqueue
>>>> is unable to make progress.
>>>>
>>>> ----------------------------------------
>>>> [130420.008231] kworker/u34:21: page allocation stalls for 35028ms, order:0, mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
>>>> [130420.008287] CPU: 5 PID: 24286 Comm: kworker/u34:21 Tainted: G W I 4.9.0-rc5 #2
>>>> [130420.008401] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>>> [130420.008432] 0000000000000000 ffffffff8129ba69 ffffffff8170e4c8 ffffc900087836a0
>>>> [130420.008483] ffffffff8113449f 024008401e3f1b40 ffffffff8170e4c8 ffffc90008783648
>>>> [130420.008534] ffffffff00000010 ffffc900087836b0 ffffc90008783660 ffff88041ecc4340
>>>> [130420.008586] Call Trace:
>>>> [130420.008611] [] ? dump_stack+0x46/0x5d
>>>> [130420.008640] [] ? warn_alloc+0x11f/0x140
>>>> [130420.008667] [] ? __alloc_pages_slowpath+0x84b/0xa80
>>>> [130420.008707] [] ? search_bitmap+0xc2/0x140 [btrfs]
>>>> [130420.008736] [] ? __alloc_pages_nodemask+0x2b0/0x2f0
>>>> [130420.008766] [] ? alloc_pages_current+0x8a/0x110
>>>> [130420.008796] [] ? pagecache_get_page+0xec/0x280
>>>> [130420.008836] [] ? alloc_extent_buffer+0x108/0x430 [btrfs]
>>>> [130420.008875] [] ? btrfs_alloc_tree_block+0x118/0x4d0 [btrfs]
>>>> [130420.008927] [] ? __btrfs_cow_block+0x148/0x5d0 [btrfs]
>>>> [130420.008964] [] ? btrfs_cow_block+0x114/0x1d0 [btrfs]
>>>> [130420.009001] [] ? btrfs_search_slot+0x206/0xa40 [btrfs]
>>>> [130420.009039] [] ? lookup_inline_extent_backref+0xd9/0x620 [btrfs]
>>>> [130420.009095] [] ? set_extent_bit+0x24/0x30 [btrfs]
>>>> [130420.009124] [] ? kmem_cache_alloc+0x17f/0x1b0
>>>> [130420.009161] [] ? __btrfs_free_extent.isra.69+0xef/0xd10 [btrfs]
>>>> [130420.009215] [] ? btrfs_merge_delayed_refs+0x56/0x6f0 [btrfs]
>>>> [130420.009269] [] ? __btrfs_run_delayed_refs+0x745/0x1320 [btrfs]
>>>> [130420.009314] [] ? ttwu_do_wakeup+0xf/0xe0
>>>> [130420.009351] [] ? btrfs_run_delayed_refs+0x90/0x2b0 [btrfs]
>>>> [130420.009404] [] ? delayed_ref_async_start+0x84/0xa0 [btrfs]
>>>> [130420.009459] [] ? normal_work_helper+0xc3/0x2f0 [btrfs]
>>>> [130420.009490] [] ? process_one_work+0x14b/0x400
>>>> [130420.009518] [] ? worker_thread+0x5d/0x470
>>>> [130420.009546] [] ? rescuer_thread+0x310/0x310
>>>> [130420.009573] [] ? do_group_exit+0x34/0xb0
>>>> [130420.009601] [] ? kthread+0xcb/0xf0
>>>> [130420.009627] [] ? kthread_park+0x50/0x50
>>>> [130420.009655] [] ? ret_from_fork+0x22/0x30
>>>> ----------------------------------------
>>>>
>>>> ----------------------------------------
>>>> [130438.436025] jbd2/dm-0-8: page allocation stalls for 10492ms, order:0, mode:0x2420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE)
>>>> [130438.436095] CPU: 2 PID: 1838 Comm: jbd2/dm-0-8 Tainted: G W I 4.9.0-rc5 #2
>>>> [130438.436184] 0000000000000000 ffffffff8129ba69 ffffffff8170e4c8 ffffc90003e13728
>>>> [130438.436237] ffffffff8113449f 0242084800000200 ffffffff8170e4c8 ffffc90003e136d0
>>>> [130438.436289] 0000000100000010 ffffc90003e13738 ffffc90003e136e8 0000000000000001
>>>> [130438.436340] Call Trace:
>>>> [130438.436368] [] ? dump_stack+0x46/0x5d
>>>> [130438.436399] [] ? warn_alloc+0x11f/0x140
>>>> [130438.436426] [] ? __alloc_pages_slowpath+0x84b/0xa80
>>>> [130438.436455] [] ? __alloc_pages_nodemask+0x2b0/0x2f0
>>>> [130438.436488] [] ? alloc_pages_current+0x8a/0x110
>>>> [130438.436518] [] ? pagecache_get_page+0xec/0x280
>>>> [130438.436549] [] ? __getblk_gfp+0xf1/0x320
>>>> [130438.436593] [] ? ext4_get_branch+0xa4/0x130 [ext4]
>>>> [130438.436628] [] ? ext4_ind_map_blocks+0xcb/0xb10 [ext4]
>>>> [130438.436658] [] ? update_curr+0x7e/0x100
>>>> [130438.436688] [] ? ktime_get+0x31/0xa0
>>>> [130438.436716] [] ? mempool_alloc+0x59/0x170
>>>> [130438.436743] [] ? update_curr+0x7e/0x100
>>>> [130438.436775] [] ? ext4_map_blocks+0x3c3/0x630 [ext4]
>>>> [130438.436808] [] ? _ext4_get_block+0x64/0xc0 [ext4]
>>>> [130438.436838] [] ? generic_block_bmap+0x37/0x50
>>>> [130438.436870] [] ? ext4_bmap+0x37/0xd0 [ext4]
>>>> [130438.436901] [] ? jbd2_journal_bmap+0x21/0x70 [jbd2]
>>>> [130438.436932] [] ? jbd2_journal_get_descriptor_buffer+0x1e/0xc0 [jbd2]
>>>> [130438.436979] [] ? jbd2_journal_write_revoke_records+0x198/0x2b0 [jbd2]
>>>> [130438.437026] [] ? jbd2_journal_commit_transaction+0x5d6/0x19f0 [jbd2]
>>>> [130438.437071] [] ? update_curr+0x7e/0x100
>>>> [130438.437099] [] ? dequeue_task_fair+0x5dc/0x1120
>>>> [130438.437127] [] ? pick_next_task_fair+0x12c/0x420
>>>> [130438.437157] [] ? kjournald2+0xc8/0x250 [jbd2]
>>>> [130438.437187] [] ? wake_up_atomic_t+0x30/0x30
>>>> [130438.437216] [] ? commit_timeout+0x10/0x10 [jbd2]
>>>> [130438.437247] [] ? kthread+0xcb/0xf0
>>>> [130438.437273] [] ? kthread_park+0x50/0x50
>>>> [130438.437304] [] ? ret_from_fork+0x22/0x30
>>>> ----------------------------------------
>>>>
>>>> Under such a situation, saving /sys/kernel/debug/page_owner to a file
>>>> might be impossible. And, once the stalling started, it took less
>>>> than 5 minutes before the kernel panicked due to "Out of memory and
>>>> no killable process". This could happen while E V is offline.
>>>>
>>>> Since rsyslogd is likely to be killed by the OOM killer in situations
>>>> like this, E V might want to try a serial console or netconsole for
>>>> saving kernel messages reliably.
>>>>
>>>> I don't know what we will find by analyzing
>>>> /sys/kernel/debug/page_owner, but if something is wrong, can't we try
>>>> whether "echo 3 > /proc/sys/vm/drop_caches" before the stalling
>>>> starts helps?
>>>>
>>>> I guess that this problem became visible by the OOM detection rework
>>>> which went into Linux 4.7.
I don't know what "free_pcp:0kB local_pcp:0kB" means
>>>> (is get_page_from_freelist() failing for any order?), but in general
>>>> I think this /var/log/messages showed that free_pcp: and local_pcp:
>>>> remain small.

--- fs/btrfs/inode.c	2016-11-13 13:32:32.000000000 -0500
+++ ../linux-4.9-rc6/fs/btrfs/inode.c	2016-11-23 08:31:02.145669550 -0500
@@ -1159,5 +1159,5 @@
 	unsigned long nr_pages;
 	u64 cur_end;
-	int limit = 10 * SZ_1M;
+	int limit = SZ_512K;
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, EXTENT_LOCKED,
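[Editor's postscript on Tetsuo's hourly-snapshot suggestion earlier in the thread: a periodic capture could be cron'd along these lines. This is a sketch, not something from the thread — the destination path and 48-snapshot retention are assumptions, SRC/DST are made overridable purely so the logic can be exercised on an ordinary file, and as Tetsuo notes, on a badly stalling box even this may fail to run.]

```shell
#!/bin/sh
# Sketch: periodic snapshot of the page_owner dump for later diffing.
# Real source is /sys/kernel/debug/page_owner (requires debugfs mounted,
# CONFIG_PAGE_OWNER=y, and booting with page_owner=on).
SRC="${SRC:-/sys/kernel/debug/page_owner}"
DST="${DST:-/var/tmp/page_owner}"

mkdir -p "$DST"
stamp=$(date +%Y%m%d-%H%M)

# Compress on the way out; these dumps are large.
gzip -c "$SRC" > "$DST/page_owner-$stamp.gz"

# Keep only the newest 48 snapshots so the disk doesn't fill.
ls -1t "$DST"/page_owner-*.gz 2>/dev/null | tail -n +49 | xargs -r rm -f
```

Run hourly from cron (e.g. `0 * * * * /usr/local/sbin/page_owner_snap.sh`); successive snapshots can then be gunzipped and compared even after a panic.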