Message ID | 1577264666-246071-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
---|---
Series | per lruvec lru_lock for memcg
On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> This patchset moves lru_lock into lruvec, giving a lru_lock for each
> lruvec, and thus a lru_lock for each memcg per node.

I see that there has been plenty of feedback on previous versions, but
no acked/reviewed tags as yet.

I think I'll take a pass for now, see what the audience feedback looks
like ;)
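For readers coming to the thread cold, a toy user-space sketch of the locking change the cover letter describes may help: one lru_lock per node before the series, one per lruvec (a memcg/node pair) after it. All names below are illustrative models, not the kernel code from the patchset.

#include <pthread.h>

/*
 * Old scheme (model): one lock per node serializes every LRU list on
 * that node, no matter which memcg the pages belong to.
 */
struct pgdat_model {
	pthread_mutex_t lru_lock;
};

/*
 * New scheme (model): each memcg/node pair, the lruvec, carries its
 * own lock, so LRU work in one cgroup need not contend with another.
 */
struct lruvec_model {
	pthread_mutex_t lru_lock;
	/* the per-lruvec LRU lists would live alongside the lock */
};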
On 2020/1/1 7:05 AM, Andrew Morton wrote:
> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>> This patchset moves lru_lock into lruvec, giving a lru_lock for each
>> lruvec, and thus a lru_lock for each memcg per node.
>
> I see that there has been plenty of feedback on previous versions, but
> no acked/reviewed tags as yet.
>
> I think I'll take a pass for now, see what the audience feedback looks
> like ;)
>

Thanks a lot, Andrew!

Please drop the 10th patch, since it is for debugging only and costs a
performance drop.

Best regards & Happy new year! :)
Alex
On 2020/1/2 6:21 PM, Alex Shi wrote:
>
> On 2020/1/1 7:05 AM, Andrew Morton wrote:
>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>> This patchset moves lru_lock into lruvec, giving a lru_lock for each
>>> lruvec, and thus a lru_lock for each memcg per node.
>>
>> I see that there has been plenty of feedback on previous versions, but
>> no acked/reviewed tags as yet.
>>
>> I think I'll take a pass for now, see what the audience feedback looks
>> like ;)
>>

Hi Johannes,

Any comments on this version? :)

Thanks
Alex

> Thanks a lot, Andrew!
>
> Please drop the 10th patch, since it is for debugging only and costs a
> performance drop.
>
> Best regards & Happy new year! :)
> Alex
>
On Fri, 10 Jan 2020, Alex Shi wrote:
> On 2020/1/2 6:21 PM, Alex Shi wrote:
> > On 2020/1/1 7:05 AM, Andrew Morton wrote:
> >> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>> This patchset moves lru_lock into lruvec, giving a lru_lock for each
> >>> lruvec, and thus a lru_lock for each memcg per node.
> >>
> >> I see that there has been plenty of feedback on previous versions, but
> >> no acked/reviewed tags as yet.
> >>
> >> I think I'll take a pass for now, see what the audience feedback looks
> >> like ;)
> >>
> >
> Hi Johannes,
>
> Any comments on this version? :)

I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
perhaps because my particular interest tends towards tmpfs and swap,
and swap always made trouble for the lruvec lock - one of the reasons why
our patches were more complicated than you thought necessary.

Booted a smallish kernel in mem=700M with 1.5G of swap, with the intention
of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
(losetup was the last command started, but I doubt it played much part):

mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst
losetup /dev/loop0 /tst/zero

and the kernel crashed on the

VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
kernel BUG at mm/memcontrol.c:1268!
lock_page_lruvec_irqsave
relock_page_lruvec_irqsave
pagevec_lru_move_fn
__pagevec_lru_add
lru_add_drain_cpu
lru_add_drain
swap_cluster_readahead
shmem_swapin
shmem_swapin_page
shmem_getpage_gfp
shmem_getpage
shmem_write_begin
generic_perform_write
__generic_file_write_iter
generic_file_write_iter
new_sync_write
__vfs_write
vfs_write
ksys_write
__x86_sys_write
do_syscall_64

Hugh
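A small user-space model of the check this trace hits may help follow the discussion: the series' relock helpers re-derive the lruvec from page->mem_cgroup after taking the lock and assert that it matches the lruvec actually held. If the page is (re)charged concurrently, for example around swapin readahead, the two disagree and the VM_BUG_ON fires. Types and names below are illustrative, not the kernel API; running the model aborts on the second check, mirroring the crash.

#include <assert.h>

struct memcg { int id; };
struct page { struct memcg *mem_cgroup; };
struct lruvec { struct memcg *memcg; };

/* models VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page):
 * the lruvec we hold locked must still belong to the page's current memcg */
static void check_locked_lruvec(struct lruvec *locked, struct page *page)
{
	assert(locked->memcg == page->mem_cgroup);
}

int main(void)
{
	struct memcg root = { 0 }, child = { 1 };
	struct lruvec locked = { .memcg = &root };
	struct page page = { .mem_cgroup = &root };

	check_locked_lruvec(&locked, &page);	/* fine: memcgs agree */

	page.mem_cgroup = &child;		/* page charged to another memcg meanwhile */
	check_locked_lruvec(&locked, &page);	/* aborts: stale lruvec, like the VM_BUG_ON */
	return 0;
}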
On 2020/1/13 4:48 PM, Hugh Dickins wrote:
> On Fri, 10 Jan 2020, Alex Shi wrote:
>> On 2020/1/2 6:21 PM, Alex Shi wrote:
>>> On 2020/1/1 7:05 AM, Andrew Morton wrote:
>>>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>> This patchset moves lru_lock into lruvec, giving a lru_lock for each
>>>>> lruvec, and thus a lru_lock for each memcg per node.
>>>>
>>>> I see that there has been plenty of feedback on previous versions, but
>>>> no acked/reviewed tags as yet.
>>>>
>>>> I think I'll take a pass for now, see what the audience feedback looks
>>>> like ;)
>>>>
>>>
>>
>> Hi Johannes,
>>
>> Any comments on this version? :)
>
> I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> perhaps because my particular interest tends towards tmpfs and swap,
> and swap always made trouble for the lruvec lock - one of the reasons why
> our patches were more complicated than you thought necessary.
>
> Booted a smallish kernel in mem=700M with 1.5G of swap, with the intention
> of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> (losetup was the last command started, but I doubt it played much part):
>
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst
> losetup /dev/loop0 /tst/zero

Hi Hugh,

Many thanks for the testing!

I am trying to reproduce your test: I do the above 3 steps and then build
a kernel with 'make -j 8' in my qemu guest, but I cannot reproduce the
problem with this v7 version, nor with the v8 version,
https://github.com/alexshi/linux/tree/lru-next, which fixed the bug KK
mentioned, like the following. My qemu VM looks like this:

[root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
[root@debug010000002015 ~]# cp /dev/zero /tst
cp: error writing ‘/tst/zero’: No space left on device
cp: failed to extend ‘/tst/zero’: No space left on device
[root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
[root@debug010000002015 ~]# cat /proc/cmdline
earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on

My kernel is configured with MEMCG/MEMCG_SWAP, with an xfs root image, and
the kernel build runs on ext4. Could you share your kernel config and
detailed reproduce steps with me? And would you like to try my new version
from the above github link at your convenience?

Thanks a lot!
Alex

 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
                           bool lrucare)
 {
-       int isolated;
+       struct lruvec *lruvec = NULL;

        VM_BUG_ON_PAGE(page->mem_cgroup, page);

@@ -2612,8 +2617,16 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
         * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
         * may already be on some other mem_cgroup's LRU. Take care of it.
         */
-       if (lrucare)
-               lock_page_lru(page, &isolated);
+       if (lrucare) {
+               lruvec = lock_page_lruvec_irq(page);
+               if (likely(PageLRU(page))) {
+                       ClearPageLRU(page);
+                       del_page_from_lru_list(page, lruvec, page_lru(page));
+               } else {
+                       unlock_page_lruvec_irq(lruvec);
+                       lruvec = NULL;
+               }
+       }

        /*
         * Nobody should be changing or seriously looking at
@@ -2631,8 +2644,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
         */
        page->mem_cgroup = memcg;

-       if (lrucare)
-               unlock_page_lru(page, isolated);
+       if (lrucare && lruvec) {
+               unlock_page_lruvec_irq(lruvec);
+               lruvec = lock_page_lruvec_irq(page);
+
+               VM_BUG_ON_PAGE(PageLRU(page), page);
+               SetPageLRU(page);
+               add_page_to_lru_list(page, lruvec, page_lru(page));
+               unlock_page_lruvec_irq(lruvec);
+       }
 }

>
> and the kernel crashed on the
>
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> kernel BUG at mm/memcontrol.c:1268!
> lock_page_lruvec_irqsave
> relock_page_lruvec_irqsave
> pagevec_lru_move_fn
> __pagevec_lru_add
> lru_add_drain_cpu
> lru_add_drain
> swap_cluster_readahead
> shmem_swapin
> shmem_swapin_page
> shmem_getpage_gfp
> shmem_getpage
> shmem_write_begin
> generic_perform_write
> __generic_file_write_iter
> generic_file_write_iter
> new_sync_write
> __vfs_write
> vfs_write
> ksys_write
> __x86_sys_write
> do_syscall_64
>
> Hugh
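For readers skimming the diff above, here is a self-contained user-space model of the lrucare pattern it implements (names and types are illustrative, and the kernel's NULL mem_cgroup starting state is modeled with an explicit "before" memcg): take the page off the lruvec it currently sits on, switch page->mem_cgroup, then re-derive and re-lock the lruvec before putting the page back, because changing the page's memcg changes which lru_lock protects it.

#include <pthread.h>
#include <stdbool.h>

struct memcg { int id; };
struct lruvec { pthread_mutex_t lru_lock; };
struct page { struct memcg *mem_cgroup; bool on_lru; };

static struct lruvec lruvecs[2] = {
	{ .lru_lock = PTHREAD_MUTEX_INITIALIZER },
	{ .lru_lock = PTHREAD_MUTEX_INITIALIZER },
};

/* the lruvec is derived from the page's current memcg */
static struct lruvec *lruvec_of(struct page *p)
{
	return &lruvecs[p->mem_cgroup->id];
}

static void model_commit_charge_lrucare(struct page *p, struct memcg *to)
{
	struct lruvec *lv = lruvec_of(p);	/* lruvec the page sits on now */

	pthread_mutex_lock(&lv->lru_lock);
	p->on_lru = false;			/* ClearPageLRU + del_page_from_lru_list */

	p->mem_cgroup = to;			/* the charge: the page changes memcg */

	pthread_mutex_unlock(&lv->lru_lock);	/* drop the old lruvec's lock */
	lv = lruvec_of(p);			/* re-derive: it is a different lruvec now */
	pthread_mutex_lock(&lv->lru_lock);
	p->on_lru = true;			/* SetPageLRU + add_page_to_lru_list */
	pthread_mutex_unlock(&lv->lru_lock);
}

int main(void)
{
	struct memcg before = { 0 }, after = { 1 };
	struct page p = { .mem_cgroup = &before, .on_lru = true };

	model_commit_charge_lrucare(&p, &after);
	return 0;
}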
On Mon, 13 Jan 2020, Alex Shi wrote:
> On 2020/1/13 4:48 PM, Hugh Dickins wrote:
> >
> > I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> > perhaps because my particular interest tends towards tmpfs and swap,
> > and swap always made trouble for the lruvec lock - one of the reasons why
> > our patches were more complicated than you thought necessary.
> >
> > Booted a smallish kernel in mem=700M with 1.5G of swap, with the intention
> > of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> > (losetup was the last command started, but I doubt it played much part):
> >
> > mount -t tmpfs -o size=470M tmpfs /tst
> > cp /dev/zero /tst
> > losetup /dev/loop0 /tst/zero
>
> Hi Hugh,
>
> Many thanks for the testing!
>
> I am trying to reproduce your test: I do the above 3 steps and then build
> a kernel with 'make -j 8' in my qemu guest, but I cannot reproduce the
> problem with this v7 version, nor with the v8 version,
> https://github.com/alexshi/linux/tree/lru-next, which fixed the bug KK
> mentioned, like the following. My qemu VM looks like this:
>
> [root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
> [root@debug010000002015 ~]# cp /dev/zero /tst
> cp: error writing ‘/tst/zero’: No space left on device
> cp: failed to extend ‘/tst/zero’: No space left on device
> [root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
> [root@debug010000002015 ~]# cat /proc/cmdline
> earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on
>
> My kernel is configured with MEMCG/MEMCG_SWAP, with an xfs root image, and
> the kernel build runs on ext4. Could you share your kernel config and
> detailed reproduce steps with me? And would you like to try my new version
> from the above github link at your convenience?

I tried with the mods you had appended, from the [PATCH v7 02/10]
discussion with Konstantin: no, it still crashes in a similar way.

Does your github tree have other changes too? I see it says "Latest
commit e05d0dd 22 days ago", which doesn't seem to fit. Afraid I
don't have time to test many variations.

It looks like, in my case, systemd was usually jumping in and doing
something with shmem (perhaps via memfd) that read back from swap
and triggered the crash without any further intervention from me.

So please try booting with mem=700M and 1.5G of swap, then:

mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst; cp /tst/zero /dev/null

That's enough to crash it for me, without getting into any losetup or
systemd complications. But you might have to adjust the numbers to be
sure of writing out and reading back from swap.

It's swap to SSD in my case, but I don't think that matters. I happen to
run with swappiness 100 (precisely to help generate swap problems),
but swappiness 60 is good enough to get these crashes.

Hugh
> I tried with the mods you had appended, from the [PATCH v7 02/10]
> discussion with Konstantin: no, it still crashes in a similar way.
>
> Does your github tree have other changes too? I see it says "Latest
> commit e05d0dd 22 days ago", which doesn't seem to fit. Afraid I
> don't have time to test many variations.

Thanks a lot for testing! The github version is the same as the one you
tested. The github branches page has a bug and doesn't show the correct
update time,
https://github.com/alexshi/linux/branches
while the detailed page does:
https://github.com/alexshi/linux/tree/lru-next

>
> It looks like, in my case, systemd was usually jumping in and doing
> something with shmem (perhaps via memfd) that read back from swap
> and triggered the crash without any further intervention from me.
>
> So please try booting with mem=700M and 1.5G of swap, then:
>
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst; cp /tst/zero /dev/null
>
> That's enough to crash it for me, without getting into any losetup or
> systemd complications. But you might have to adjust the numbers to be
> sure of writing out and reading back from swap.
>
> It's swap to SSD in my case, but I don't think that matters. I happen to
> run with swappiness 100 (precisely to help generate swap problems),
> but swappiness 60 is good enough to get these crashes.
>

I did use 700M of memory and 1.5G of swap in my qemu guest, but with a
swapfile rather than a swap disk:

qemu-system-x86_64 -smp 4 -enable-kvm -cpu SandyBridge \
 -m 700M -kernel /home/kuiliang.as/linux/qemulru/arch/x86/boot/bzImage \
 -append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on " \
 -hda /home/kuiliang.as/rootimages/CentOS-7-x86_64-Azure-1703.qcow2 \
 -hdb /home/kuiliang.as/rootimages/hdb.qcow2 \
 --nographic \

Anyway, although I didn't reproduce the bug, I did find a bug in my debug
function:

	VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);

If !page->mem_cgroup, this check can be triggered, so it seems to be a bug
in the debug function, not a real issue. The 9th patch should be replaced
by the following new patch.

Many thanks for testing!
Alex

From ac6d3e2bcfba5727d5c03f9655bb0c7443f655eb Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Mon, 23 Dec 2019 13:33:54 +0800
Subject: [PATCH v8 8/9] mm/lru: add debug checking for page memcg moving

This debug patch could give some clues if there is something out of
consideration.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h |  5 +++++
 mm/compaction.c            |  2 ++
 mm/memcontrol.c            | 13 +++++++++++++
 3 files changed, 20 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 09e861df48e8..ece88bb11d0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -421,6 +421,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *);
 struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
 void unlock_page_lruvec_irq(struct lruvec *);
 void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);

 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

@@ -1183,6 +1184,10 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
 					 enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */

 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0a2da217d8..151242817bf4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -971,6 +971,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
 			locked_lruvec = lruvec;

+			lruvec_memcg_debug(lruvec, page);
+
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
 				skip_updated = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00fef8ddbd08..a473da8d2275 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1238,6 +1238,17 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 	return lruvec;
 }

+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+
 struct lruvec *lock_page_lruvec_irq(struct page *page)
 {
 	struct lruvec *lruvec;
@@ -1247,6 +1258,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irq(&lruvec->lru_lock);
+	lruvec_memcg_debug(lruvec, page);

 	return lruvec;
 }
@@ -1259,6 +1271,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	lruvec_memcg_debug(lruvec, page);

 	return lruvec;
 }
On 2020/1/14 5:14 PM, Alex Shi wrote:
> Anyway, although I didn't reproduce the bug, I did find a bug in my debug
> function:
>
> 	VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
>
> If !page->mem_cgroup, this check can be triggered, so it seems to be a bug
> in the debug function, not a real issue. The 9th patch should be replaced
> by the following new patch.

If !page->mem_cgroup, the page is on the root_mem_cgroup, so the lruvec's
memcg is root_mem_cgroup, not NULL. That is what triggers the issue.

Hi Johannes,

I have a question about lock_page_memcg in this scenario: should we lock
the page to root_mem_cgroup, or is there no need, as no tasks move to a
leaf memcg from root?

Thanks
Alex
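To make the corrected check concrete, here is a self-contained user-space model of the lruvec_memcg_debug() logic from the v8 patch above (types and the root fallback are modeled with illustrative names; this is not the kernel code): a page whose mem_cgroup is NULL is treated as accounted to the root memcg, so a root lruvec is accepted instead of tripping the assertion.

#include <assert.h>
#include <stddef.h>

struct memcg { int id; };
struct lruvec { struct memcg *memcg; };
struct page { struct memcg *mem_cgroup; };

static struct memcg root_memcg = { 0 };

/* models lruvec_memcg_debug(): a NULL page->mem_cgroup means the page
 * is accounted to the root memcg, so the locked lruvec must be root's */
static void lruvec_memcg_check(struct lruvec *lruvec, struct page *page)
{
	if (!page->mem_cgroup)
		assert(lruvec->memcg == &root_memcg);
	else
		assert(lruvec->memcg == page->mem_cgroup);
}

int main(void)
{
	struct lruvec root_lruvec = { .memcg = &root_memcg };
	struct page page = { .mem_cgroup = NULL };

	/* the original check compared NULL against the root memcg and fired
	 * spuriously; the corrected check accepts this combination */
	lruvec_memcg_check(&root_lruvec, &page);
	return 0;
}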