Message ID | 20240116071302.2282230-1-wangkefeng.wang@huawei.com (mailing list archive) |
---|---|
State | New |
Series | mm: memory: move mem_cgroup_charge() into alloc_anon_folio() |
On 16/01/2024 07:13, Kefeng Wang wrote:
> In order to allocate as much as possible of large folio, move
> the mem charge into alloc_anon_folio() and try the next order
> if mem_cgroup_charge() fails, also we change the GFP_KERNEL
> to gfp to be consistent with PMD THP.

I agree that changing gfp gives you consistency. But it's not entirely clear to
me why THP should use one set of flags for this case, and since pages another.
Why does this difference exist?

Otherwise, LGTM!

>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  mm/memory.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e88d5379127..2e31a407e6f9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4206,15 +4206,21 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>  		folio = vma_alloc_folio(gfp, order, vma, addr, true);
>  		if (folio) {
> +			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> +				folio_put(folio);
> +				goto next;
> +			}
> +			folio_throttle_swaprate(folio, gfp);
>  			clear_huge_page(&folio->page, vmf->address, 1 << order);
>  			return folio;
>  		}
> +next:
>  		order = next_order(&orders, order);
>  	}
>
>  fallback:
>  #endif
> -	return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
> +	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>  }
>
>  /*
> @@ -4281,10 +4287,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	nr_pages = folio_nr_pages(folio);
>  	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>
> -	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> -		goto oom_free_page;
> -	folio_throttle_swaprate(folio, GFP_KERNEL);
> -
>  	/*
>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>  	 * preceding stores to the page contents become visible before
> @@ -4338,8 +4340,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  release:
>  	folio_put(folio);
>  	goto unlock;
> -oom_free_page:
> -	folio_put(folio);
>  oom:
>  	return VM_FAULT_OOM;
>  }
On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
> On 16/01/2024 07:13, Kefeng Wang wrote:
> > In order to allocate as much as possible of large folio, move
> > the mem charge into alloc_anon_folio() and try the next order
> > if mem_cgroup_charge() fails, also we change the GFP_KERNEL
> > to gfp to be consistent with PMD THP.
>
> I agree that changing gfp gives you consistency. But it's not entirely clear to
> me why THP should use one set of flags for this case, and since pages another.
> Why does this difference exist?

I think it needs to be spelled out much better in the changelog.  Here's
my attempt at explaining why we might want this change.

mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
In addition to checking gfpflags_allow_blocking(), it pays attention to
__GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
this memcg do not exceed their quotas.  Using the same GFP flags ensures
that we handle large anonymous folios correctly, including falling back
to smaller orders when there is plenty of memory available in the system
but this memcg is close to its limits.

... I remain not-an-expert in memcg and anonymous memory and welcome
improvements to that text.
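To make the flag semantics described above concrete, here is a small stand-alone sketch in plain userspace C. The flag values and the charge_over_limit() helper are made up for illustration, and the branch order is a simplification of the behaviour being described, not a copy of the mm/memcontrol.c logic:

```c
#include <stdio.h>

/* Illustrative bit values only; the real __GFP_* constants live in linux/gfp_types.h. */
#define __GFP_DIRECT_RECLAIM	(1u << 0)	/* caller may block and reclaim */
#define __GFP_NORETRY		(1u << 1)	/* fail fast, no retry loop */
#define __GFP_RETRY_MAYFAIL	(1u << 2)	/* retry, but never OOM-kill */

enum charge_result {
	CHARGE_FAIL_NO_RECLAIM,		/* atomic caller: fail immediately */
	CHARGE_FAIL_AFTER_ONE_TRY,	/* one reclaim pass, then give up */
	CHARGE_RETRY_NO_OOM,		/* keep reclaiming, but never OOM-kill */
	CHARGE_RECLAIM_MAY_OOM,		/* reclaim hard, may invoke memcg OOM */
};

/* What happens when a charge finds the memcg already at its limit? */
static enum charge_result charge_over_limit(unsigned int gfp)
{
	if (!(gfp & __GFP_DIRECT_RECLAIM))
		return CHARGE_FAIL_NO_RECLAIM;
	if (gfp & __GFP_NORETRY)
		return CHARGE_FAIL_AFTER_ONE_TRY;
	if (gfp & __GFP_RETRY_MAYFAIL)
		return CHARGE_RETRY_NO_OOM;
	return CHARGE_RECLAIM_MAY_OOM;
}

int main(void)
{
	/* GFP_KERNEL-style mask: blocks, retries, can end in a memcg OOM. */
	printf("GFP_KERNEL-like: %d\n", charge_over_limit(__GFP_DIRECT_RECLAIM));
	/* THP-style mask: __GFP_NORETRY means the charge bails out quickly. */
	printf("THP-like:        %d\n",
	       charge_over_limit(__GFP_DIRECT_RECLAIM | __GFP_NORETRY));
	return 0;
}
```

With a GFP_KERNEL-style mask the charge path may reclaim repeatedly and even trigger a memcg OOM before giving up, whereas a THP-style mask containing __GFP_NORETRY fails quickly, which is what lets alloc_anon_folio() drop to the next smaller order instead.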
On 16/01/2024 14:51, Matthew Wilcox wrote:
> On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
>> On 16/01/2024 07:13, Kefeng Wang wrote:
>>> In order to allocate as much as possible of large folio, move
>>> the mem charge into alloc_anon_folio() and try the next order
>>> if mem_cgroup_charge() fails, also we change the GFP_KERNEL
>>> to gfp to be consistent with PMD THP.
>>
>> I agree that changing gfp gives you consistency. But it's not entirely clear to
>> me why THP should use one set of flags for this case, and since pages another.
>> Why does this difference exist?
>
> I think it needs to be spelled out much better in the changelog.  Here's
> my attempt at explaining why we might want this change.
>
> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
> In addition to checking gfpflags_allow_blocking(), it pays attention to
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
> this memcg do not exceed their quotas.  Using the same GFP flags ensures
> that we handle large anonymous folios correctly, including falling back
> to smaller orders when there is plenty of memory available in the system
> but this memcg is close to its limits.

Thanks for the explanation. Please add to the commit log.

Essentially you are saying that previously, all mTHP allocations would cause
reclaim from the memcg if the allocation caused the quota to be used up. But
with this change, it might now avoid that reclaim and just OOM, if the flags are
as such? So then we retry with the next lowest available size. Makes sense!

>
> ... I remain not-an-expert in memcg and anonymous memory and welcome
> improvements to that text.

Me too...
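As for where the difference in flags comes from in the first place: roughly speaking, vma_thp_gfp_mask() builds the mask from the THP "defrag" policy, and most modes deliberately include __GFP_NORETRY or leave out direct reclaim so a huge allocation fails fast rather than stalling the fault. The sketch below is from memory and simplified — the helper name and its two boolean parameters are stand-ins, and the authoritative logic lives in mm/huge_memory.c:

```c
#include <linux/gfp.h>

/*
 * Simplified, illustrative sketch -- not the real vma_thp_gfp_mask().
 * The point: THP allocations get a mask tuned by the defrag policy,
 * typically with __GFP_NORETRY or without direct reclaim, so they can
 * fail fast; GFP_KERNEL always blocks and retries hard.
 */
static inline gfp_t sketch_thp_gfp_mask(bool defrag_direct, bool vma_madvised)
{
	if (defrag_direct)
		/* "always": synchronous compaction, but non-madvised VMAs
		 * still bail out early rather than retrying hard. */
		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);

	/* "defer"/"madvise"-style modes: lightweight allocation that kicks
	 * kcompactd; only madvised VMAs get direct reclaim. */
	return GFP_TRANSHUGE_LIGHT |
	       (vma_madvised ? __GFP_DIRECT_RECLAIM : __GFP_KSWAPD_RECLAIM);
}
```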
Hi Kefeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20240117/202401170535.2TfJ7u74-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/202401170535.2TfJ7u74-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401170535.2TfJ7u74-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   mm/memory.c: In function 'alloc_anon_folio':
>> mm/memory.c:4223:31: error: 'vma' undeclared (first use in this function); did you mean 'vmf'?
    4223 |         return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
         |                               ^~~
         |                               vmf
   mm/memory.c:4223:31: note: each undeclared identifier is reported only once for each function it appears in
>> mm/memory.c:4224:1: warning: control reaches end of non-void function [-Wreturn-type]
    4224 | }
         | ^


vim +4223 mm/memory.c

  4153
  4154  static struct folio *alloc_anon_folio(struct vm_fault *vmf)
  4155  {
  4156  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  4157          struct vm_area_struct *vma = vmf->vma;
  4158          unsigned long orders;
  4159          struct folio *folio;
  4160          unsigned long addr;
  4161          pte_t *pte;
  4162          gfp_t gfp;
  4163          int order;
  4164
  4165          /*
  4166           * If uffd is active for the vma we need per-page fault fidelity to
  4167           * maintain the uffd semantics.
  4168           */
  4169          if (unlikely(userfaultfd_armed(vma)))
  4170                  goto fallback;
  4171
  4172          /*
  4173           * Get a list of all the (large) orders below PMD_ORDER that are enabled
  4174           * for this vma. Then filter out the orders that can't be allocated over
  4175           * the faulting address and still be fully contained in the vma.
  4176           */
  4177          orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
  4178                                            BIT(PMD_ORDER) - 1);
  4179          orders = thp_vma_suitable_orders(vma, vmf->address, orders);
  4180
  4181          if (!orders)
  4182                  goto fallback;
  4183
  4184          pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
  4185          if (!pte)
  4186                  return ERR_PTR(-EAGAIN);
  4187
  4188          /*
  4189           * Find the highest order where the aligned range is completely
  4190           * pte_none(). Note that all remaining orders will be completely
  4191           * pte_none().
  4192           */
  4193          order = highest_order(orders);
  4194          while (orders) {
  4195                  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
  4196                  if (pte_range_none(pte + pte_index(addr), 1 << order))
  4197                          break;
  4198                  order = next_order(&orders, order);
  4199          }
  4200
  4201          pte_unmap(pte);
  4202
  4203          /* Try allocating the highest of the remaining orders. */
  4204          gfp = vma_thp_gfp_mask(vma);
  4205          while (orders) {
  4206                  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
  4207                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
  4208                  if (folio) {
  4209                          if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
  4210                                  folio_put(folio);
  4211                                  goto next;
  4212                          }
  4213                          folio_throttle_swaprate(folio, gfp);
  4214                          clear_huge_page(&folio->page, vmf->address, 1 << order);
  4215                          return folio;
  4216                  }
  4217  next:
  4218                  order = next_order(&orders, order);
  4219          }
  4220
  4221  fallback:
  4222  #endif
> 4223          return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
> 4224  }
  4225
Hi Kefeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
config: i386-buildonly-randconfig-002-20240116 (https://download.01.org/0day-ci/archive/20240117/202401170708.XjVa9Trj-lkp@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/202401170708.XjVa9Trj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401170708.XjVa9Trj-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/memory.c:4223:24: error: use of undeclared identifier 'vma'
    4223 |         return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
         |                               ^
   mm/memory.c:4223:36: error: use of undeclared identifier 'vma'
    4223 |         return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
         |                                           ^
   2 errors generated.


vim +/vma +4223 mm/memory.c

  4153
  4154  static struct folio *alloc_anon_folio(struct vm_fault *vmf)
  4155  {
  4156  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  4157          struct vm_area_struct *vma = vmf->vma;
  4158          unsigned long orders;
  4159          struct folio *folio;
  4160          unsigned long addr;
  4161          pte_t *pte;
  4162          gfp_t gfp;
  4163          int order;
  4164
  4165          /*
  4166           * If uffd is active for the vma we need per-page fault fidelity to
  4167           * maintain the uffd semantics.
  4168           */
  4169          if (unlikely(userfaultfd_armed(vma)))
  4170                  goto fallback;
  4171
  4172          /*
  4173           * Get a list of all the (large) orders below PMD_ORDER that are enabled
  4174           * for this vma. Then filter out the orders that can't be allocated over
  4175           * the faulting address and still be fully contained in the vma.
  4176           */
  4177          orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
  4178                                            BIT(PMD_ORDER) - 1);
  4179          orders = thp_vma_suitable_orders(vma, vmf->address, orders);
  4180
  4181          if (!orders)
  4182                  goto fallback;
  4183
  4184          pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
  4185          if (!pte)
  4186                  return ERR_PTR(-EAGAIN);
  4187
  4188          /*
  4189           * Find the highest order where the aligned range is completely
  4190           * pte_none(). Note that all remaining orders will be completely
  4191           * pte_none().
  4192           */
  4193          order = highest_order(orders);
  4194          while (orders) {
  4195                  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
  4196                  if (pte_range_none(pte + pte_index(addr), 1 << order))
  4197                          break;
  4198                  order = next_order(&orders, order);
  4199          }
  4200
  4201          pte_unmap(pte);
  4202
  4203          /* Try allocating the highest of the remaining orders. */
  4204          gfp = vma_thp_gfp_mask(vma);
  4205          while (orders) {
  4206                  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
  4207                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
  4208                  if (folio) {
  4209                          if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
  4210                                  folio_put(folio);
  4211                                  goto next;
  4212                          }
  4213                          folio_throttle_swaprate(folio, gfp);
  4214                          clear_huge_page(&folio->page, vmf->address, 1 << order);
  4215                          return folio;
  4216                  }
  4217  next:
  4218                  order = next_order(&orders, order);
  4219          }
  4220
  4221  fallback:
  4222  #endif
> 4223          return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
  4224  }
  4225
On 2024/1/17 5:26, kernel test robot wrote:
> Hi Kefeng,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
> patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20240117/202401170535.2TfJ7u74-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/202401170535.2TfJ7u74-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202401170535.2TfJ7u74-lkp@intel.com/
>
> All error/warnings (new ones prefixed by >>):
>

Thanks, will fix the build for !CONFIG_TRANSPARENT_HUGEPAGE.

>    mm/memory.c: In function 'alloc_anon_folio':
>>> mm/memory.c:4223:31: error: 'vma' undeclared (first use in this function); did you mean 'vmf'?
>     4223 |         return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>          |                               ^~~
>          |                               vmf
>    mm/memory.c:4223:31: note: each undeclared identifier is reported only once for each function it appears in
>>> mm/memory.c:4224:1: warning: control reaches end of non-void function [-Wreturn-type]
>     4224 | }
>          | ^
On 2024/1/16 23:07, Ryan Roberts wrote:
> On 16/01/2024 14:51, Matthew Wilcox wrote:
>> On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
>>> On 16/01/2024 07:13, Kefeng Wang wrote:
>>>> In order to allocate as much as possible of large folio, move
>>>> the mem charge into alloc_anon_folio() and try the next order
>>>> if mem_cgroup_charge() fails, also we change the GFP_KERNEL
>>>> to gfp to be consistent with PMD THP.
>>>
>>> I agree that changing gfp gives you consistency. But it's not entirely clear to
>>> me why THP should use one set of flags for this case, and since pages another.
>>> Why does this difference exist?
>>
>> I think it needs to be spelled out much better in the changelog.  Here's
>> my attempt at explaining why we might want this change.
>>
>> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
>> In addition to checking gfpflags_allow_blocking(), it pays attention to
>> __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
>> this memcg do not exceed their quotas.  Using the same GFP flags ensures
>> that we handle large anonymous folios correctly, including falling back
>> to smaller orders when there is plenty of memory available in the system
>> but this memcg is close to its limits.
>
> Thanks for the explanation. Please add to the commit log.

Thanks, that is much better, will update. There is a similar change for PMD
THP, see commit 3b3636924dfe ("mm, memcg: sync allocation and memcg charge
gfp flags for THP").

>
> Essentially you are saying that previously, all mTHP allocations would cause
> reclaim from the memcg if the allocation caused the quota to be used up. But
> with this change, it might now avoid that reclaim and just OOM, if the flags are
> as such? So then we retry with the next lowest available size. Makes sense!
>

With the correct GFP flags we get less reclaim and a faster fallback to the
next order, which is what I want too.

>
>>
>> ... I remain not-an-expert in memcg and anonymous memory and welcome
>> improvements to that text.
>
> Me too...
diff --git a/mm/memory.c b/mm/memory.c
index 5e88d5379127..2e31a407e6f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4206,15 +4206,21 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
 		if (folio) {
+			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+				folio_put(folio);
+				goto next;
+			}
+			folio_throttle_swaprate(folio, gfp);
 			clear_huge_page(&folio->page, vmf->address, 1 << order);
 			return folio;
 		}
+next:
 		order = next_order(&orders, order);
 	}
 
 fallback:
 #endif
-	return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+	return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
 }
 
 /*
@@ -4281,10 +4287,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	nr_pages = folio_nr_pages(folio);
 	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
 
-	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
-		goto oom_free_page;
-	folio_throttle_swaprate(folio, GFP_KERNEL);
-
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
 	 * preceding stores to the page contents become visible before
@@ -4338,8 +4340,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 release:
 	folio_put(folio);
 	goto unlock;
-oom_free_page:
-	folio_put(folio);
 oom:
 	return VM_FAULT_OOM;
 }
In order to allocate as large a folio as possible, move the memcg charge
into alloc_anon_folio() and try the next lower order if mem_cgroup_charge()
fails. Also change GFP_KERNEL to gfp so that the charge uses the same flags
as the allocation, consistent with PMD-sized THP.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/memory.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)