Message ID | 1631518709-42881-1-git-send-email-feng.tang@intel.com (mailing list archive)
---|---
State | New
Series | [v2] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
On Mon 13-09-21 15:38:29, Feng Tang wrote: > There was report that starting an Ubuntu in docker while using cpuset > to bind it to movable nodes (a node only has movable zone, like a node > for hotplug or a Persistent Memory node in normal usage) will fail > due to memory allocation failure, and then OOM is involved and many > other innocent processes got killed. It can be reproduced with command: > $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c > "grep Mems_allowed /proc/self/status" (node 4 is a movable node) > > > runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 > CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) > Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 > Call Trace: > dump_stack+0x6b/0x88 > dump_header+0x4a/0x1e2 > oom_kill_process.cold+0xb/0x10 > out_of_memory.part.0+0xaf/0x230 > out_of_memory+0x3d/0x80 > __alloc_pages_slowpath.constprop.0+0x954/0xa20 > __alloc_pages_nodemask+0x2d3/0x300 > pipe_write+0x322/0x590 > new_sync_write+0x196/0x1b0 > vfs_write+0x1c3/0x1f0 > ksys_write+0xa7/0xe0 > do_syscall_64+0x52/0xd0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Mem-Info: > active_anon:392832 inactive_anon:182 isolated_anon:0 > active_file:68130 inactive_file:151527 isolated_file:0 > unevictable:2701 dirty:0 writeback:7 > slab_reclaimable:51418 slab_unreclaimable:116300 > mapped:45825 shmem:735 pagetables:2540 bounce:0 > free:159849484 free_pcp:73 free_cma:0 > Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? 
no > Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 0 > Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB > > oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 > Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 > oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB > oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > > The reason is, in the case, the target cpuset nodes only have movable > zone, while the creation of an OS in docker sometimes needs to allocate > memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and > the cpuset limit forbids the allocation, then out-of-memory killing is > involved even when normal nodes and movable nodes both have many free > memory. > > The OOM killer cannot help to resolve the situation as there is no > usable memory for the request in the cpuset scope. The only reasonable > measure to take is to fail the allocation right away and have the caller > to deal with it. > > So add a check for cases like this in the slowpath of allocation, and > bail out early returning NULL for the allocation. > > As page allocation is one of the hottest path in kernel, this check > will hurt all users with sane cpuset configuration, add a static branch > check and detect the abnormal config in cpuset memory binding setup so > that the extra check in page allocation is not paid by everyone. 
> > [thanks to Micho Hocko and David Rientjes for suggesting not handle > it inside OOM code, adding cpuset check, refining comments] > > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > Changelog: > > v2: > * add a static branch detection in cpuset code to reduce > the overhead in allocation hotpath (Michal Hocko) > > v1 (since RFC): > * move the handling from oom code to page allocation > path (Michal/David) > > include/linux/cpuset.h | 15 +++++++++++++++ > include/linux/mmzone.h | 12 ++++++++++++ > kernel/cgroup/cpuset.c | 14 ++++++++++++++ > mm/page_alloc.c | 13 +++++++++++++ > 4 files changed, 54 insertions(+) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d2b9c41..403ccf9 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -34,6 +34,8 @@ > */ > extern struct static_key_false cpusets_pre_enable_key; > extern struct static_key_false cpusets_enabled_key; > +extern struct static_key_false cpusets_insane_config_key; > + > static inline bool cpusets_enabled(void) > { > return static_branch_unlikely(&cpusets_enabled_key); > @@ -51,6 +53,19 @@ static inline void cpuset_dec(void) > static_branch_dec_cpuslocked(&cpusets_pre_enable_key); > } > > +/* > + * This will get enabled whenever a cpuset configuration is considered > + * unsupportable in general. E.g. movable only node which cannot satisfy > + * any non movable allocations (see update_nodemask). Page allocator > + * needs to make additional checks for those configurations and this > + * check is meant to guard those checks without any overhead for sane > + * configurations. > + */ > +static inline bool cpusets_insane_config(void) > +{ > + return static_branch_unlikely(&cpusets_insane_config_key); > +} > + > extern int cpuset_init(void); > extern void cpuset_init_smp(void); > extern void cpuset_force_rebuild(void); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6a1d79d..b69b871 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1220,6 +1220,18 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, > #define for_each_zone_zonelist(zone, z, zlist, highidx) \ > for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) > > +/* Whether the 'nodes' are all movable nodes */ > +static inline bool movable_only_nodes(nodemask_t *nodes) > +{ > + struct zonelist *zonelist; > + struct zoneref *z; > + > + zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK]; This will work but it just begs a question why you haven't chosen a node from the given nodemask. So I believe it would be easier to read if you did zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK] > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > + return (!z->zone) ? true : false; > +} > + > + > #ifdef CONFIG_SPARSEMEM > #include <asm/sparsemem.h> > #endif > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index df1ccf4..03eb40c 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -69,6 +69,13 @@ > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > +/* > + * There could be abnormal cpuset configurations for cpu or memory > + * node binding, add this key to provide a quick low-cost judgement > + * of the situation. > + */ > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > + > /* See "Frequency meter" comments, below. 
*/ > > struct fmeter { > @@ -1868,6 +1875,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > if (retval < 0) > goto done; > > + if (movable_only_nodes(&trialcs->mems_allowed)) { You can skip the check if cpusets_insane_config(). The question is whether you want to report all potential users or only the first one. > + static_branch_enable(&cpusets_insane_config_key); > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! " > + "Cpuset allocations might fail even with a lot of memory available.\n", > + nodemask_pr_args(&trialcs->mems_allowed)); > + } > + > spin_lock_irq(&callback_lock); > cs->mems_allowed = trialcs->mems_allowed; > spin_unlock_irq(&callback_lock); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b37435c..a7e0854 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (!ac->preferred_zoneref->zone) > goto nopage; > > + /* > + * Check for insane configurations where the cpuset doesn't contain > + * any suitable zone to satisfy the request - e.g. non-movable > + * GFP_HIGHUSER allocations from MOVABLE nodes only. > + */ > + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { > + struct zoneref *z = first_zones_zonelist(ac->zonelist, > + ac->highest_zoneidx, > + &cpuset_current_mems_allowed); > + if (!z->zone) > + goto nopage; > + } > + > if (alloc_flags & ALLOC_KSWAPD) > wake_all_kswapds(order, gfp_mask, ac); The rest looks sensible to me.
Hi Feng, Thank you for the patch! Yet something to improve: [auto build test ERROR on hnaz-linux-mm/master] url: https://github.com/0day-ci/linux/commits/Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016 base: https://github.com/hnaz/linux-mm master config: arc-randconfig-r043-20210913 (attached as .config) compiler: arc-elf-gcc (GCC) 11.2.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/276fb2292fa199777b3e9a394c8737e4c618cd23 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016 git checkout 276fb2292fa199777b3e9a394c8737e4c618cd23 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=arc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): mm/page_alloc.c:3810:15: warning: no previous prototype for 'should_fail_alloc_page' [-Wmissing-prototypes] 3810 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) | ^~~~~~~~~~~~~~~~~~~~~~ mm/page_alloc.c: In function '__alloc_pages_slowpath': >> mm/page_alloc.c:4922:13: error: implicit declaration of function 'cpusets_insane_config' [-Werror=implicit-function-declaration] 4922 | if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { | ^~~~~~~~~~~~~~~~~~~~~ cc1: some warnings being treated as errors vim +/cpusets_insane_config +4922 mm/page_alloc.c 4868 4869 static inline struct page * 4870 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, 4871 struct alloc_context *ac) 4872 { 4873 bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; 4874 const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; 4875 struct page *page = NULL; 4876 unsigned int alloc_flags; 4877 unsigned long did_some_progress; 4878 enum compact_priority compact_priority; 4879 enum compact_result compact_result; 4880 int compaction_retries; 4881 int no_progress_loops; 4882 unsigned int cpuset_mems_cookie; 4883 int reserve_flags; 4884 4885 /* 4886 * We also sanity check to catch abuse of atomic reserves being used by 4887 * callers that are not in atomic context. 4888 */ 4889 if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) == 4890 (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM))) 4891 gfp_mask &= ~__GFP_ATOMIC; 4892 4893 retry_cpuset: 4894 compaction_retries = 0; 4895 no_progress_loops = 0; 4896 compact_priority = DEF_COMPACT_PRIORITY; 4897 cpuset_mems_cookie = read_mems_allowed_begin(); 4898 4899 /* 4900 * The fast path uses conservative alloc_flags to succeed only until 4901 * kswapd needs to be woken up, and to avoid the cost of setting up 4902 * alloc_flags precisely. So we do that now. 4903 */ 4904 alloc_flags = gfp_to_alloc_flags(gfp_mask); 4905 4906 /* 4907 * We need to recalculate the starting point for the zonelist iterator 4908 * because we might have used different nodemask in the fast path, or 4909 * there was a cpuset modification and we are retrying - otherwise we 4910 * could end up iterating over non-eligible zones endlessly. 
4911 */ 4912 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, 4913 ac->highest_zoneidx, ac->nodemask); 4914 if (!ac->preferred_zoneref->zone) 4915 goto nopage; 4916 4917 /* 4918 * Check for insane configurations where the cpuset doesn't contain 4919 * any suitable zone to satisfy the request - e.g. non-movable 4920 * GFP_HIGHUSER allocations from MOVABLE nodes only. 4921 */ > 4922 if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { 4923 struct zoneref *z = first_zones_zonelist(ac->zonelist, 4924 ac->highest_zoneidx, 4925 &cpuset_current_mems_allowed); 4926 if (!z->zone) 4927 goto nopage; 4928 } 4929 4930 if (alloc_flags & ALLOC_KSWAPD) 4931 wake_all_kswapds(order, gfp_mask, ac); 4932 4933 /* 4934 * The adjusted alloc_flags might result in immediate success, so try 4935 * that first 4936 */ 4937 page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); 4938 if (page) 4939 goto got_pg; 4940 4941 /* 4942 * For costly allocations, try direct compaction first, as it's likely 4943 * that we have enough base pages and don't need to reclaim. For non- 4944 * movable high-order allocations, do that as well, as compaction will 4945 * try prevent permanent fragmentation by migrating from blocks of the 4946 * same migratetype. 4947 * Don't try this for allocations that are allowed to ignore 4948 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen. 4949 */ 4950 if (can_direct_reclaim && 4951 (costly_order || 4952 (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) 4953 && !gfp_pfmemalloc_allowed(gfp_mask)) { 4954 page = __alloc_pages_direct_compact(gfp_mask, order, 4955 alloc_flags, ac, 4956 INIT_COMPACT_PRIORITY, 4957 &compact_result); 4958 if (page) 4959 goto got_pg; 4960 4961 /* 4962 * Checks for costly allocations with __GFP_NORETRY, which 4963 * includes some THP page fault allocations 4964 */ 4965 if (costly_order && (gfp_mask & __GFP_NORETRY)) { 4966 /* 4967 * If allocating entire pageblock(s) and compaction 4968 * failed because all zones are below low watermarks 4969 * or is prohibited because it recently failed at this 4970 * order, fail immediately unless the allocator has 4971 * requested compaction and reclaim retry. 4972 * 4973 * Reclaim is 4974 * - potentially very expensive because zones are far 4975 * below their low watermarks or this is part of very 4976 * bursty high order allocations, 4977 * - not guaranteed to help because isolate_freepages() 4978 * may not iterate over freed pages as part of its 4979 * linear scan, and 4980 * - unlikely to make entire pageblocks free on its 4981 * own. 4982 */ 4983 if (compact_result == COMPACT_SKIPPED || 4984 compact_result == COMPACT_DEFERRED) 4985 goto nopage; 4986 4987 /* 4988 * Looks like reclaim/compaction is worth trying, but 4989 * sync compaction could be very expensive, so keep 4990 * using async compaction. 4991 */ 4992 compact_priority = INIT_COMPACT_PRIORITY; 4993 } 4994 } 4995 4996 retry: 4997 /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */ 4998 if (alloc_flags & ALLOC_KSWAPD) 4999 wake_all_kswapds(order, gfp_mask, ac); 5000 5001 reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); 5002 if (reserve_flags) 5003 alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags); 5004 5005 /* 5006 * Reset the nodemask and zonelist iterators if memory policies can be 5007 * ignored. These allocations are high priority and system rather than 5008 * user oriented. 
5009 */ 5010 if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) { 5011 ac->nodemask = NULL; 5012 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, 5013 ac->highest_zoneidx, ac->nodemask); 5014 } 5015 5016 /* Attempt with potentially adjusted zonelist and alloc_flags */ 5017 page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); 5018 if (page) 5019 goto got_pg; 5020 5021 /* Caller is not willing to reclaim, we can't balance anything */ 5022 if (!can_direct_reclaim) 5023 goto nopage; 5024 5025 /* Avoid recursion of direct reclaim */ 5026 if (current->flags & PF_MEMALLOC) 5027 goto nopage; 5028 5029 /* Try direct reclaim and then allocating */ 5030 page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, 5031 &did_some_progress); 5032 if (page) 5033 goto got_pg; 5034 5035 /* Try direct compaction and then allocating */ 5036 page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, 5037 compact_priority, &compact_result); 5038 if (page) 5039 goto got_pg; 5040 5041 /* Do not loop if specifically requested */ 5042 if (gfp_mask & __GFP_NORETRY) 5043 goto nopage; 5044 5045 /* 5046 * Do not retry costly high order allocations unless they are 5047 * __GFP_RETRY_MAYFAIL 5048 */ 5049 if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) 5050 goto nopage; 5051 5052 if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, 5053 did_some_progress > 0, &no_progress_loops)) 5054 goto retry; 5055 5056 /* 5057 * It doesn't make any sense to retry for the compaction if the order-0 5058 * reclaim is not able to make any progress because the current 5059 * implementation of the compaction depends on the sufficient amount 5060 * of free memory (see __compaction_suitable) 5061 */ 5062 if (did_some_progress > 0 && 5063 should_compact_retry(ac, order, alloc_flags, 5064 compact_result, &compact_priority, 5065 &compaction_retries)) 5066 goto retry; 5067 5068 5069 /* Deal with possible cpuset update races before we start OOM killing */ 5070 if (check_retry_cpuset(cpuset_mems_cookie, ac)) 5071 goto retry_cpuset; 5072 5073 /* Reclaim has failed us, start killing things */ 5074 page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); 5075 if (page) 5076 goto got_pg; 5077 5078 /* Avoid allocations with no watermarks from looping endlessly */ 5079 if (tsk_is_oom_victim(current) && 5080 (alloc_flags & ALLOC_OOM || 5081 (gfp_mask & __GFP_NOMEMALLOC))) 5082 goto nopage; 5083 5084 /* Retry as long as the OOM killer is making progress */ 5085 if (did_some_progress) { 5086 no_progress_loops = 0; 5087 goto retry; 5088 } 5089 5090 nopage: 5091 /* Deal with possible cpuset update races before we fail */ 5092 if (check_retry_cpuset(cpuset_mems_cookie, ac)) 5093 goto retry_cpuset; 5094 5095 /* 5096 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure 5097 * we always retry 5098 */ 5099 if (gfp_mask & __GFP_NOFAIL) { 5100 /* 5101 * All existing users of the __GFP_NOFAIL are blockable, so warn 5102 * of any new users that actually require GFP_NOWAIT 5103 */ 5104 if (WARN_ON_ONCE(!can_direct_reclaim)) 5105 goto fail; 5106 5107 /* 5108 * PF_MEMALLOC request from this context is rather bizarre 5109 * because we cannot reclaim anything and only can loop waiting 5110 * for somebody to do a work for us 5111 */ 5112 WARN_ON_ONCE(current->flags & PF_MEMALLOC); 5113 5114 /* 5115 * non failing costly orders are a hard requirement which we 5116 * are not prepared for much so let's warn about these users 5117 * so that we can identify them and convert them to something 
5118 * else. 5119 */ 5120 WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); 5121 5122 /* 5123 * Help non-failing allocations by giving them access to memory 5124 * reserves but do not use ALLOC_NO_WATERMARKS because this 5125 * could deplete whole memory reserves which would just make 5126 * the situation worse 5127 */ 5128 page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac); 5129 if (page) 5130 goto got_pg; 5131 5132 cond_resched(); 5133 goto retry; 5134 } 5135 fail: 5136 warn_alloc(gfp_mask, ac->nodemask, 5137 "page allocation failure: order:%u", order); 5138 got_pg: 5139 return page; 5140 } 5141 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
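[Editorial note] Both build failures come from randconfigs with CONFIG_CPUSETS disabled: cpusets_insane_config() is only declared in the CONFIG_CPUSETS section of include/linux/cpuset.h, so mm/page_alloc.c sees no prototype. A minimal sketch of one way to address it, assuming the same stub pattern the header already uses for its other !CONFIG_CPUSETS helpers (illustrative only, not taken from a posted revision):

```c
/* include/linux/cpuset.h, in the !CONFIG_CPUSETS (#else) branch -- sketch only */
static inline bool cpusets_insane_config(void)
{
	/* Without cpusets there can be no movable-only cpuset binding. */
	return false;
}
```

With such a stub, the new check in __alloc_pages_slowpath() folds away to "false" on kernels built without cpuset support.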
Hi Feng, Thank you for the patch! Yet something to improve: [auto build test ERROR on hnaz-linux-mm/master] url: https://github.com/0day-ci/linux/commits/Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016 base: https://github.com/hnaz/linux-mm master config: riscv-randconfig-r042-20210913 (attached as .config) compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install riscv cross compiling tool for clang build # apt-get install binutils-riscv64-linux-gnu # https://github.com/0day-ci/linux/commit/276fb2292fa199777b3e9a394c8737e4c618cd23 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Feng-Tang/mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early/20210913-154016 git checkout 276fb2292fa199777b3e9a394c8737e4c618cd23 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): In file included from mm/page_alloc.c:20: In file included from include/linux/highmem.h:10: In file included from include/linux/hardirq.h:11: In file included from ./arch/riscv/include/generated/asm/hardirq.h:1: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/riscv/include/asm/io.h:136: include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __raw_readb(PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:36:51: note: expanded from macro '__le16_to_cpu' #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) ^ In file included from mm/page_alloc.c:20: In file included from include/linux/highmem.h:10: In file included from include/linux/hardirq.h:11: In file included from ./arch/riscv/include/generated/asm/hardirq.h:1: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/riscv/include/asm/io.h:136: include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:34:51: note: expanded from macro '__le32_to_cpu' #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) ^ In file included from mm/page_alloc.c:20: In file included from include/linux/highmem.h:10: In file included from include/linux/hardirq.h:11: In file included from ./arch/riscv/include/generated/asm/hardirq.h:1: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/riscv/include/asm/io.h:136: include/asm-generic/io.h:501:33: warning: performing 
pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writeb(value, PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:1024:55: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] return (port > MMIO_UPPER_LIMIT) ? NULL : PCI_IOBASE + port; ~~~~~~~~~~ ^ mm/page_alloc.c:3810:15: warning: no previous prototype for function 'should_fail_alloc_page' [-Wmissing-prototypes] noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) ^ mm/page_alloc.c:3810:10: note: declare 'static' if the function is not intended to be used outside of this translation unit noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) ^ static >> mm/page_alloc.c:4922:6: error: implicit declaration of function 'cpusets_insane_config' [-Werror,-Wimplicit-function-declaration] if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { ^ 8 warnings and 1 error generated. vim +/cpusets_insane_config +4922 mm/page_alloc.c 4868 4869 static inline struct page * 4870 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, 4871 struct alloc_context *ac) 4872 { 4873 bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; 4874 const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; 4875 struct page *page = NULL; 4876 unsigned int alloc_flags; 4877 unsigned long did_some_progress; 4878 enum compact_priority compact_priority; 4879 enum compact_result compact_result; 4880 int compaction_retries; 4881 int no_progress_loops; 4882 unsigned int cpuset_mems_cookie; 4883 int reserve_flags; 4884 4885 /* 4886 * We also sanity check to catch abuse of atomic reserves being used by 4887 * callers that are not in atomic context. 4888 */ 4889 if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) == 4890 (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM))) 4891 gfp_mask &= ~__GFP_ATOMIC; 4892 4893 retry_cpuset: 4894 compaction_retries = 0; 4895 no_progress_loops = 0; 4896 compact_priority = DEF_COMPACT_PRIORITY; 4897 cpuset_mems_cookie = read_mems_allowed_begin(); 4898 4899 /* 4900 * The fast path uses conservative alloc_flags to succeed only until 4901 * kswapd needs to be woken up, and to avoid the cost of setting up 4902 * alloc_flags precisely. So we do that now. 4903 */ 4904 alloc_flags = gfp_to_alloc_flags(gfp_mask); 4905 4906 /* 4907 * We need to recalculate the starting point for the zonelist iterator 4908 * because we might have used different nodemask in the fast path, or 4909 * there was a cpuset modification and we are retrying - otherwise we 4910 * could end up iterating over non-eligible zones endlessly. 4911 */ 4912 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, 4913 ac->highest_zoneidx, ac->nodemask); 4914 if (!ac->preferred_zoneref->zone) 4915 goto nopage; 4916 4917 /* 4918 * Check for insane configurations where the cpuset doesn't contain 4919 * any suitable zone to satisfy the request - e.g. non-movable 4920 * GFP_HIGHUSER allocations from MOVABLE nodes only. 
  4921  	 */
> 4922  	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
  4923  		struct zoneref *z = first_zones_zonelist(ac->zonelist,
  4924  					ac->highest_zoneidx,
  4925  					&cpuset_current_mems_allowed);
  4926  		if (!z->zone)
  4927  			goto nopage;
  4928  	}

[The remainder of the quoted __alloc_pages_slowpath() context is identical to the listing in the first report above.]

--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote: [...] > > +/* > > + * This will get enabled whenever a cpuset configuration is considered > > + * unsupportable in general. E.g. movable only node which cannot satisfy > > + * any non movable allocations (see update_nodemask). Page allocator > > + * needs to make additional checks for those configurations and this > > + * check is meant to guard those checks without any overhead for sane > > + * configurations. > > + */ > > +static inline bool cpusets_insane_config(void) > > +{ > > + return static_branch_unlikely(&cpusets_insane_config_key); > > +} > > + > > extern int cpuset_init(void); > > extern void cpuset_init_smp(void); > > extern void cpuset_force_rebuild(void); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 6a1d79d..b69b871 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -1220,6 +1220,18 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, > > #define for_each_zone_zonelist(zone, z, zlist, highidx) \ > > for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) > > > > +/* Whether the 'nodes' are all movable nodes */ > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > +{ > > + struct zonelist *zonelist; > > + struct zoneref *z; > > + > > + zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK]; > > This will work but it just begs a question why you haven't chosen a node > from the given nodemask. So I believe it would be easier to read if you > did > zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK] This was also my first try to get the 'zonelist', but from the update_nodemask(), the nodemask could be NULL. /* * An empty mems_allowed is ok iff there are no tasks in the cpuset. * Since nodelist_parse() fails on an empty mask, we special case * that parsing. The validate_change() call ensures that cpusets * with tasks have memory. */ if (!*buf) { nodes_clear(trialcs->mems_allowed); > > + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); > > + return (!z->zone) ? true : false; > > +} > > + > > + > > #ifdef CONFIG_SPARSEMEM > > #include <asm/sparsemem.h> > > #endif > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index df1ccf4..03eb40c 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -69,6 +69,13 @@ > > DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); > > DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); > > > > +/* > > + * There could be abnormal cpuset configurations for cpu or memory > > + * node binding, add this key to provide a quick low-cost judgement > > + * of the situation. > > + */ > > +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); > > + > > /* See "Frequency meter" comments, below. */ > > > > struct fmeter { > > @@ -1868,6 +1875,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, > > if (retval < 0) > > goto done; > > > > + if (movable_only_nodes(&trialcs->mems_allowed)) { > > You can skip the check if cpusets_insane_config(). The question is > whether you want to report all potential users or only the first one. Yes, I missed that, will add this check > > + static_branch_enable(&cpusets_insane_config_key); > > + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! 
" > > + "Cpuset allocations might fail even with a lot of memory available.\n", > > + nodemask_pr_args(&trialcs->mems_allowed)); > > + } > > + > > spin_lock_irq(&callback_lock); > > cs->mems_allowed = trialcs->mems_allowed; > > spin_unlock_irq(&callback_lock); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index b37435c..a7e0854 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > if (!ac->preferred_zoneref->zone) > > goto nopage; > > > > + /* > > + * Check for insane configurations where the cpuset doesn't contain > > + * any suitable zone to satisfy the request - e.g. non-movable > > + * GFP_HIGHUSER allocations from MOVABLE nodes only. > > + */ > > + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { > > + struct zoneref *z = first_zones_zonelist(ac->zonelist, > > + ac->highest_zoneidx, > > + &cpuset_current_mems_allowed); > > + if (!z->zone) > > + goto nopage; > > + } > > + > > if (alloc_flags & ALLOC_KSWAPD) > > wake_all_kswapds(order, gfp_mask, ac); > > The rest looks sensible to me. Thanks for the review! - Feng > -- > Michal Hocko > SUSE Labs
On Mon 13-09-21 19:34:23, Feng Tang wrote: > On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote: [...] > > > +/* Whether the 'nodes' are all movable nodes */ > > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > > +{ > > > + struct zonelist *zonelist; > > > + struct zoneref *z; > > > + > > > + zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK]; > > > > This will work but it just begs a question why you haven't chosen a node > > from the given nodemask. So I believe it would be easier to read if you > > did > > zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK] > > This was also my first try to get the 'zonelist', but from the > update_nodemask(), the nodemask could be NULL. I guess you meant to say s@NULL@empty@ While this complicates things a bit it is nothing really hard to work around. You simply check for nodes_empty() and return false because such a nodemask cannot by definition be movable only.
On Mon, Sep 13, 2021 at 01:45:51PM +0200, Michal Hocko wrote: > On Mon 13-09-21 19:34:23, Feng Tang wrote: > > On Mon, Sep 13, 2021 at 11:15:54AM +0200, Michal Hocko wrote: > [...] > > > > +/* Whether the 'nodes' are all movable nodes */ > > > > +static inline bool movable_only_nodes(nodemask_t *nodes) > > > > +{ > > > > + struct zonelist *zonelist; > > > > + struct zoneref *z; > > > > + > > > > + zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK]; > > > > > > This will work but it just begs a question why you haven't chosen a node > > > from the given nodemask. So I believe it would be easier to read if you > > > did > > > zonelist = NODE_DATA(first_node(nodes))->node_zonelists[ZONELIST_FALLBACK] > > > > This was also my first try to get the 'zonelist', but from the > > update_nodemask(), the nodemask could be NULL. > > I guess you meant to say s@NULL@empty@ > While this complicates things a bit it is nothing really hard to work > around. You simply check for nodes_empty() and return false because such > a nodemask cannot by definition be movable only. Yes, a nodes_empty() check can solve it, thanks, - Feng > -- > Michal Hocko > SUSE Labs
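[Editorial note] Putting the review points together (the cpusets_insane_config() early skip from the earlier exchange, the nodes_empty() bail-out, and taking the zonelist from a node in the mask itself), a sketch of how the two hunks might end up looking — illustrative only, not taken from a posted revision:

```c
/* Whether the 'nodes' are all movable nodes -- sketch incorporating the review */
static inline bool movable_only_nodes(nodemask_t *nodes)
{
	struct zonelist *zonelist;
	struct zoneref *z;
	int nid;

	/* An empty nodemask cannot, by definition, be movable-only. */
	if (nodes_empty(*nodes))
		return false;

	/*
	 * Use the fallback zonelist of a node from the mask: if it has no
	 * zone at or below ZONE_NORMAL for these nodes, only ZONE_MOVABLE
	 * memory is usable.
	 */
	nid = first_node(*nodes);
	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
	z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes);
	return !z->zone;
}
```

And the corresponding update_nodemask() hunk, assuming the v2 pr_info() text is kept; note that with the early skip this variant only reports the first offending cpuset, which was the open question in Michal's comment:

```c
	/* sketch only -- the v2 hunk with the cpusets_insane_config() skip added */
	if (!cpusets_insane_config() &&
	    movable_only_nodes(&trialcs->mems_allowed)) {
		static_branch_enable(&cpusets_insane_config_key);
		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! Cpuset allocations might fail even with a lot of memory available.\n",
			nodemask_pr_args(&trialcs->mems_allowed));
	}
```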
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index d2b9c41..403ccf9 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -34,6 +34,8 @@ */ extern struct static_key_false cpusets_pre_enable_key; extern struct static_key_false cpusets_enabled_key; +extern struct static_key_false cpusets_insane_config_key; + static inline bool cpusets_enabled(void) { return static_branch_unlikely(&cpusets_enabled_key); @@ -51,6 +53,19 @@ static inline void cpuset_dec(void) static_branch_dec_cpuslocked(&cpusets_pre_enable_key); } +/* + * This will get enabled whenever a cpuset configuration is considered + * unsupportable in general. E.g. movable only node which cannot satisfy + * any non movable allocations (see update_nodemask). Page allocator + * needs to make additional checks for those configurations and this + * check is meant to guard those checks without any overhead for sane + * configurations. + */ +static inline bool cpusets_insane_config(void) +{ + return static_branch_unlikely(&cpusets_insane_config_key); +} + extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d..b69b871 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1220,6 +1220,18 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, #define for_each_zone_zonelist(zone, z, zlist, highidx) \ for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) +/* Whether the 'nodes' are all movable nodes */ +static inline bool movable_only_nodes(nodemask_t *nodes) +{ + struct zonelist *zonelist; + struct zoneref *z; + + zonelist = &(first_online_pgdat())->node_zonelists[ZONELIST_FALLBACK]; + z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); + return (!z->zone) ? true : false; +} + + #ifdef CONFIG_SPARSEMEM #include <asm/sparsemem.h> #endif diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index df1ccf4..03eb40c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -69,6 +69,13 @@ DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); +/* + * There could be abnormal cpuset configurations for cpu or memory + * node binding, add this key to provide a quick low-cost judgement + * of the situation. + */ +DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); + /* See "Frequency meter" comments, below. */ struct fmeter { @@ -1868,6 +1875,13 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, if (retval < 0) goto done; + if (movable_only_nodes(&trialcs->mems_allowed)) { + static_branch_enable(&cpusets_insane_config_key); + pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! " + "Cpuset allocations might fail even with a lot of memory available.\n", + nodemask_pr_args(&trialcs->mems_allowed)); + } + spin_lock_irq(&callback_lock); cs->mems_allowed = trialcs->mems_allowed; spin_unlock_irq(&callback_lock); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b37435c..a7e0854 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (!ac->preferred_zoneref->zone) goto nopage; + /* + * Check for insane configurations where the cpuset doesn't contain + * any suitable zone to satisfy the request - e.g. non-movable + * GFP_HIGHUSER allocations from MOVABLE nodes only. 
+ */ + if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) { + struct zoneref *z = first_zones_zonelist(ac->zonelist, + ac->highest_zoneidx, + &cpuset_current_mems_allowed); + if (!z->zone) + goto nopage; + } + if (alloc_flags & ALLOC_KSWAPD) wake_all_kswapds(order, gfp_mask, ac);
There was report that starting an Ubuntu in docker while using cpuset to bind it to movable nodes (a node only has movable zone, like a node for hotplug or a Persistent Memory node in normal usage) will fail due to memory allocation failure, and then OOM is involved and many other innocent processes got killed. It can be reproduced with command: $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" (node 4 is a movable node) runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 Call Trace: dump_stack+0x6b/0x88 dump_header+0x4a/0x1e2 oom_kill_process.cold+0xb/0x10 out_of_memory.part.0+0xaf/0x230 out_of_memory+0x3d/0x80 __alloc_pages_slowpath.constprop.0+0x954/0xa20 __alloc_pages_nodemask+0x2d3/0x300 pipe_write+0x322/0x590 new_sync_write+0x196/0x1b0 vfs_write+0x1c3/0x1f0 ksys_write+0xa7/0xe0 do_syscall_64+0x52/0xd0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Mem-Info: active_anon:392832 inactive_anon:182 isolated_anon:0 active_file:68130 inactive_file:151527 isolated_file:0 unevictable:2701 dirty:0 writeback:7 slab_reclaimable:51418 slab_unreclaimable:116300 mapped:45825 shmem:735 pagetables:2540 bounce:0 free:159849484 free_pcp:73 free_cma:0 Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB The reason is, in the case, the target cpuset nodes only have movable zone, while the creation of an OS in docker sometimes needs to allocate memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and the cpuset limit forbids the allocation, then out-of-memory killing is involved even when normal nodes and movable nodes both have many free memory. The OOM killer cannot help to resolve the situation as there is no usable memory for the request in the cpuset scope. 
The only reasonable measure to take is to fail the allocation right away and have the caller deal with it.

So add a check for cases like this in the slowpath of allocation, and bail out early returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel, this check would hurt all users with a sane cpuset configuration. Add a static branch and detect the abnormal config during cpuset memory binding setup, so that the extra check in the page allocation path is not paid by everyone.

[thanks to Michal Hocko and David Rientjes for suggesting not handling it inside OOM code, adding the cpuset check, and refining comments]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:

v2:
 * add a static branch detection in cpuset code to reduce the overhead in the allocation hotpath (Michal Hocko)

v1 (since RFC):
 * move the handling from oom code to the page allocation path (Michal/David)

 include/linux/cpuset.h | 15 +++++++++++++++
 include/linux/mmzone.h | 12 ++++++++++++
 kernel/cgroup/cpuset.c | 14 ++++++++++++++
 mm/page_alloc.c        | 13 +++++++++++++
 4 files changed, 54 insertions(+)
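[Editorial note] On the "no overhead for sane configurations" point: a static key compiles down to a patched jump/no-op in the hot path, so the new check costs essentially nothing until the key is flipped. A condensed, generic illustration of the pattern the patch relies on (the identifiers below are made up for the example):

```c
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(example_key);	/* disabled by default */

static void on_bad_config_detected(void)
{
	/* flipped once, in the slow configuration path */
	static_branch_enable(&example_key);
}

static bool hot_path_check(void)
{
	/* effectively a no-op branch until the key has been enabled */
	if (static_branch_unlikely(&example_key))
		return true;	/* run the extra, normally-skipped checks */
	return false;
}
```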