[029/158] mm/swap.c: piggyback lru_add_drain_all() calls

Message ID	20191201015040.dGbXkKv8r%akpm@linux-foundation.org (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=vaHI=ZX=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9AE902086A
Date: Sat, 30 Nov 2019 17:50:40 -0800
From: akpm@linux-foundation.org
To: akpm@linux-foundation.org, khlebnikov@yandex-team.ru,
 linux-mm@kvack.org, mhocko@kernel.org, mm-commits@vger.kernel.org,
 torvalds@linux-foundation.org, willy@infradead.org
Subject: [patch 029/158] mm/swap.c: piggyback lru_add_drain_all() calls
Message-ID: <20191201015040.dGbXkKv8r%akpm@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk


Commit Message

Andrew Morton Dec. 1, 2019, 1:50 a.m. UTC

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Subject: mm/swap.c: piggyback lru_add_drain_all() calls

lru_add_drain_all() is a very slow operation.  Right now
POSIX_FADV_DONTNEED is the top user because it has to freeze page
references when removing pages from the cache.  invalidate_bdev() calls it
for the same reason.  Both are triggered from userspace, so it's easy to
generate a storm.
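
For illustration only, this is roughly how an unprivileged program reaches
that path.  It is a minimal userspace sketch, not part of the patch;
drop_file_cache() is a hypothetical helper name, while posix_fadvise() and
POSIX_FADV_DONTNEED are the standard POSIX interface.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): ask the kernel to drop the
 * cached pages of one file.  Returns 0 on success or an errno value.
 */
int drop_file_cache(const char *path)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return errno;

	/* offset 0 with len 0 covers the whole file */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	close(fd);
	return err;	/* posix_fadvise() returns the error number directly */
}
```

Nothing limits how often a process may do this, so many tasks calling it
in a loop over many files is enough to produce the storm described above.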

mlock/mlockall no longer call lru_add_drain_all(); I have seen serious
slowdowns there on older kernels.

There are some less obvious paths in memory migration/CMA/offlining which
shouldn't call it frequently.

The worst case requires a non-trivial workload because lru_add_drain_all()
skips cpus whose vectors are empty.  Something must constantly generate a
flow of pages on each cpu.  The cpus must also be busy, so that the per-cpu
work items are scheduled slowly.  And the machine must be big enough (64+
cpus in our case).
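
The "skips cpus" part can be pictured with a toy model.  This is only a
sketch under stated assumptions: cpus_with_work() and the pending[] array
are hypothetical stand-ins for the has_work cpumask and the per-cpu
vectors, and the 64-cpu limit is an artifact of using one 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64	/* one bit per cpu in a 64-bit mask (sketch only) */

/*
 * Toy model: pending[cpu] is the number of pages sitting in that cpu's
 * vectors.  Returns a mask with one bit set per cpu that needs a flush.
 */
uint64_t cpus_with_work(const unsigned int pending[], unsigned int ncpus)
{
	uint64_t has_work = 0;
	unsigned int cpu;

	assert(ncpus <= MAX_CPUS);

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (pending[cpu] == 0)
			continue;	/* empty vectors: nothing to queue */
		has_work |= UINT64_C(1) << cpu;
	}
	return has_work;
}
```

The caller then waits only for the cpus in the mask, so the slow case is
the one where every bit is set and every one of those cpus is busy.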

In our case that was a massive series of mlock calls in map-reduce while
other tasks wrote logs (and generated flows of new pages in the per-cpu
vectors).  The mlock calls were serialized by the mutex and accumulated
latency of up to 10 seconds or more.

The kernel has not called lru_add_drain_all() on mlock paths since 4.15,
but the same scenario could be triggered by fadvise(POSIX_FADV_DONTNEED)
or any other remaining user.

There is no reason to do the drain again if somebody else already drained
all the per-cpu vectors while we waited for the lock.

Piggyback on a drain starting and finishing while we wait for the lock:
all pages pending at the time of our entry were drained from the vectors.
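
The idea can be sketched in userspace as follows.  This is an illustration
under assumptions, not the kernel code: it swaps the seqcount helpers for
a plain C11 atomic counter and the kernel mutex for a pthread mutex, and
all names (drain_gen, drain_sample(), drain_after_sample(), drain_all())
are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t drain_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint drain_gen;		/* bumped when a real drain starts */
static unsigned int real_drains;	/* protected by drain_lock */

/* Stand-in for queueing and flushing the per-cpu work items. */
static void expensive_drain(void)
{
	real_drains++;
}

/* Step 1: remember which generation was current when we entered. */
unsigned int drain_sample(void)
{
	return atomic_load(&drain_gen);
}

/*
 * Step 2: take the lock and drain, unless somebody else started a drain
 * after our sample and has finished it by now.  Returns true if this call
 * did the drain itself, false if it piggybacked.
 */
bool drain_after_sample(unsigned int seen)
{
	bool drained = false;
	int rc = pthread_mutex_lock(&drain_lock);

	assert(rc == 0);
	(void)rc;

	if (atomic_load(&drain_gen) == seen) {
		atomic_fetch_add(&drain_gen, 1);
		expensive_drain();
		drained = true;
	}

	pthread_mutex_unlock(&drain_lock);
	return drained;
}

bool drain_all(void)
{
	return drain_after_sample(drain_sample());
}
```

The counter is bumped before the drain begins, so a changed value seen
under the lock means a complete drain both started and finished after the
sample was taken.  A drain that was already running at sample time leaves
the counter unchanged and does not count, which is what makes skipping
safe.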

Callers like POSIX_FADV_DONTNEED retry their operations once after
draining per-cpu vectors when pages have unexpected references.
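
The caller-side pattern looks roughly like this.  It is a toy model with
hypothetical names (struct fake_page, try_invalidate(), fake_drain_all()),
meant only to show why a single retry after the drain is all a caller
needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_page {
	int vector_refs;	/* references held by per-cpu vectors */
	int other_refs;		/* references a drain cannot release */
};

/* Succeeds only when nobody else holds a reference. */
static bool try_invalidate(const struct fake_page *page)
{
	return page->vector_refs == 0 && page->other_refs == 0;
}

/* Stand-in for the drain: only the per-cpu vector references go away. */
static void fake_drain_all(struct fake_page *page)
{
	page->vector_refs = 0;
}

bool invalidate_with_one_retry(struct fake_page *page)
{
	assert(page != NULL);

	if (try_invalidate(page))
		return true;

	fake_drain_all(page);
	return try_invalidate(page);	/* retry once, then give up */
}
```

Because the caller only needs the pages that were pending when it asked
for the drain to be flushed, a drain completed by somebody else in the
meantime serves it just as well as its own.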

Link: http://lkml.kernel.org/r/157019456205.3142.3369423180908482020.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/mm/swap.c~mm-swap-piggyback-lru_add_drain_all-calls
+++ a/mm/swap.c
@@ -713,9 +713,10 @@  static void lru_add_drain_per_cpu(struct
  */
 void lru_add_drain_all(void)
 {
+	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu;
+	int cpu, seq;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -724,7 +725,19 @@  void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
+	seq = raw_read_seqcount_latch(&seqcount);
+
 	mutex_lock(&lock);
+
+	/*
+	 * Piggyback on a drain started and finished while we waited for lock:
+	 * all pages pending at the time of our entry were drained from vectors.
+	 */
+	if (__read_seqcount_retry(&seqcount, seq))
+		goto done;
+
+	raw_write_seqcount_latch(&seqcount);
+
 	cpumask_clear(&has_work);
 
 	for_each_online_cpu(cpu) {
@@ -745,6 +758,7 @@  void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+done:
 	mutex_unlock(&lock);
 }
 #else
