diff mbox series

MM: discard __GFP_ATOMIC

Message ID 163712397076.13692.4727608274002939094@noble.neil.brown.name (mailing list archive)
State New
Headers show
Series MM: discard __GFP_ATOMIC | expand

Commit Message

NeilBrown Nov. 17, 2021, 4:39 a.m. UTC
__GFP_ATOMIC serves little purpose.
It's main effect is to set ALLOC_HARDER which adds a few little boosts to
increase the chance of an allocation succeeding, one of which is to
lower the water-mark at which it will succeed.

It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark.  It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.

__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this.  We already get a might_sleep() warning
if __GFP_DIRECT_RECLAIM is set.

__GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
probable that testing ALLOC_HARDER is a better fit here.

__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep.  This should test __GFP_DIRECT_RECLAIM instead.

This patch:
 - removes __GFP_ATOMIC
 - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
   (as well as ALLOC_HIGH).
 - makes other adjustments as suggested by the above.

The net result is not change to GFP_ATOMIC allocations.  Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges.  This affects:
  xen, dm, md, ntfs3
  the vermillion frame buffer
  hibernation
  ksm
  swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.

Signed-off-by: NeilBrown <neilb@suse.de>
---

This patch came out of my attempts to review and document some of the
__GFP flags.  I couldn't find that __GFP_ATOMIC meant anything really
useful, so I thought that discarding it would be best.

Obviously GFP_ATOMIC is useful - just not __GFP_ATOMIC (and the fact
that both names exist could be seen as a warning sign).

Thanks,
NeilBrown


 Documentation/vm/balance.rst         |  2 +-
 drivers/iommu/tegra-smmu.c           |  4 ++--
 include/linux/gfp.h                  | 12 ++++--------
 include/trace/events/mmflags.h       |  1 -
 lib/test_printf.c                    |  8 ++++----
 mm/internal.h                        |  3 +--
 mm/memcontrol.c                      |  2 +-
 mm/page_alloc.c                      | 16 ++++------------
 tools/perf/builtin-kmem.c            |  1 -
 tools/testing/radix-tree/linux/gfp.h |  3 +--
 10 files changed, 18 insertions(+), 34 deletions(-)

Comments

Matthew Wilcox (Oracle) Nov. 17, 2021, 1:18 p.m. UTC | #1
On Wed, Nov 17, 2021 at 03:39:30PM +1100, NeilBrown wrote:
> +++ b/drivers/iommu/tegra-smmu.c
> @@ -676,12 +676,12 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
>  	 * allocate page in a sleeping context if GFP flags permit. Hence
>  	 * spinlock needs to be unlocked and re-locked after allocation.
>  	 */
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfp & __GFP_DIRECT_RECLAIM)
>  		spin_unlock_irqrestore(&as->lock, *flags);
>  
>  	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
>  
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfp & __GFP_DIRECT_RECLAIM)
>  		spin_lock_irqsave(&as->lock, *flags);
>  
>  	/*

Surely this should be gfpflags_allow_blocking() instead of poking about
in the innards of gfp flags?

This patch seems like a good simplification to me.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Michal Hocko Nov. 18, 2021, 9:22 a.m. UTC | #2
[Cc Mel]

On Wed 17-11-21 15:39:30, Neil Brown wrote:
> 
> __GFP_ATOMIC serves little purpose.
> It's main effect is to set ALLOC_HARDER which adds a few little boosts to
> increase the chance of an allocation succeeding, one of which is to
> lower the water-mark at which it will succeed.
> 
> It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> adjusts this watermark.  It is probable that other users of __GFP_HIGH
> should benefit from the other little bonuses that __GFP_ATOMIC gets.
> 
> __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
> There is little point to this.  We already get a might_sleep() warning
> if __GFP_DIRECT_RECLAIM is set.
> 
> __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> probable that testing ALLOC_HARDER is a better fit here.
> 
> __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> 
> This patch:
>  - removes __GFP_ATOMIC
>  - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
>    (as well as ALLOC_HIGH).
>  - makes other adjustments as suggested by the above.
> 
> The net result is not change to GFP_ATOMIC allocations.  Other
> allocations that use __GFP_HIGH will benefit from a few different extra
> privileges.  This affects:
>   xen, dm, md, ntfs3
>   the vermillion frame buffer
>   hibernation
>   ksm
>   swap
> all of which likely produce more benefit than cost if these selected
> allocation are more likely to succeed quickly.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
> This patch came out of my attempts to review and document some of the
> __GFP flags.  I couldn't find that __GFP_ATOMIC meant anything really
> useful, so I thought that discarding it would be best.
> 
> Obviously GFP_ATOMIC is useful - just not __GFP_ATOMIC (and the fact
> that both names exist could be seen as a warning sign).
> 
> Thanks,
> NeilBrown
> 
> 
>  Documentation/vm/balance.rst         |  2 +-
>  drivers/iommu/tegra-smmu.c           |  4 ++--
>  include/linux/gfp.h                  | 12 ++++--------
>  include/trace/events/mmflags.h       |  1 -
>  lib/test_printf.c                    |  8 ++++----
>  mm/internal.h                        |  3 +--
>  mm/memcontrol.c                      |  2 +-
>  mm/page_alloc.c                      | 16 ++++------------
>  tools/perf/builtin-kmem.c            |  1 -
>  tools/testing/radix-tree/linux/gfp.h |  3 +--
>  10 files changed, 18 insertions(+), 34 deletions(-)
> 
> diff --git a/Documentation/vm/balance.rst b/Documentation/vm/balance.rst
> index 6a1fadf3e173..e38e9d83c1c7 100644
> --- a/Documentation/vm/balance.rst
> +++ b/Documentation/vm/balance.rst
> @@ -6,7 +6,7 @@ Memory Balancing
>  
>  Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
>  
> -Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
>  well as for non __GFP_IO allocations.
>  
>  The first reason why a caller may avoid reclaim is that the caller can not
> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
> index e900e3c46903..c5fa8b8673b6 100644
> --- a/drivers/iommu/tegra-smmu.c
> +++ b/drivers/iommu/tegra-smmu.c
> @@ -676,12 +676,12 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
>  	 * allocate page in a sleeping context if GFP flags permit. Hence
>  	 * spinlock needs to be unlocked and re-locked after allocation.
>  	 */
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfp & __GFP_DIRECT_RECLAIM)
>  		spin_unlock_irqrestore(&as->lock, *flags);
>  
>  	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
>  
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfp & __GFP_DIRECT_RECLAIM)
>  		spin_lock_irqsave(&as->lock, *flags);
>  
>  	/*
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b976c4177299..676c813bd93f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,7 +39,7 @@ struct vm_area_struct;
>  #define ___GFP_IO		0x40u
>  #define ___GFP_FS		0x80u
>  #define ___GFP_ZERO		0x100u
> -#define ___GFP_ATOMIC		0x200u
> +/* 0x200u unused */
>  #define ___GFP_DIRECT_RECLAIM	0x400u
>  #define ___GFP_KSWAPD_RECLAIM	0x800u
>  #define ___GFP_WRITE		0x1000u
> @@ -116,11 +116,8 @@ struct vm_area_struct;
>   *
>   * %__GFP_HIGH indicates that the caller is high-priority and that granting
>   * the request is necessary before the system can make forward progress.
> - * For example, creating an IO context to clean pages.
> - *
> - * %__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
> - * high priority. Users are typically interrupt handlers. This may be
> - * used in conjunction with %__GFP_HIGH
> + * For example creating an IO context to clean pages and requests
> + * from atomic context.
>   *
>   * %__GFP_MEMALLOC allows access to all memory. This should only be used when
>   * the caller guarantees the allocation will allow more memory to be freed
> @@ -135,7 +132,6 @@ struct vm_area_struct;
>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>   */
> -#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
>  #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
>  #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
>  #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
> @@ -320,7 +316,7 @@ struct vm_area_struct;
>   * version does not attempt reclaim/compaction at all and is by default used
>   * in page fault path, while the non-light is used by khugepaged.
>   */
> -#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
>  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
>  #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 116ed4d5d0f8..30f492256b8c 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -29,7 +29,6 @@
>  	{(unsigned long)__GFP_HIGHMEM,		"__GFP_HIGHMEM"},	\
>  	{(unsigned long)GFP_DMA32,		"GFP_DMA32"},		\
>  	{(unsigned long)__GFP_HIGH,		"__GFP_HIGH"},		\
> -	{(unsigned long)__GFP_ATOMIC,		"__GFP_ATOMIC"},	\
>  	{(unsigned long)__GFP_IO,		"__GFP_IO"},		\
>  	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
>  	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
> diff --git a/lib/test_printf.c b/lib/test_printf.c
> index 07309c45f327..8010de49b6c5 100644
> --- a/lib/test_printf.c
> +++ b/lib/test_printf.c
> @@ -673,17 +673,17 @@ flags(void)
>  	gfp = GFP_ATOMIC|__GFP_DMA;
>  	test("GFP_ATOMIC|GFP_DMA", "%pGg", &gfp);
>  
> -	gfp = __GFP_ATOMIC;
> -	test("__GFP_ATOMIC", "%pGg", &gfp);
> +	gfp = __GFP_HIGH;
> +	test("__GFP_HIGH", "%pGg", &gfp);
>  
>  	/* Any flags not translated by the table should remain numeric */
>  	gfp = ~__GFP_BITS_MASK;
>  	snprintf(cmp_buffer, BUF_SIZE, "%#lx", (unsigned long) gfp);
>  	test(cmp_buffer, "%pGg", &gfp);
>  
> -	snprintf(cmp_buffer, BUF_SIZE, "__GFP_ATOMIC|%#lx",
> +	snprintf(cmp_buffer, BUF_SIZE, "__GFP_HIGH|%#lx",
>  							(unsigned long) gfp);
> -	gfp |= __GFP_ATOMIC;
> +	gfp |= __GFP_HIGH;
>  	test(cmp_buffer, "%pGg", &gfp);
>  
>  	kfree(cmp_buffer);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3b79a5c9427a..4564067065f9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -20,8 +20,7 @@
>   */
>  #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
>  			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
> -			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
> -			__GFP_ATOMIC)
> +			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
>  
>  /* The GFP flags allowed during early boot */
>  #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 781605e92015..1ff55347e4b9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2621,7 +2621,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * put the burden of reclaim on regular allocation requests
>  	 * and let these go through as privileged allocations.
>  	 */
> -	if (gfp_mask & __GFP_ATOMIC)
> +	if (gfp_mask & __GFP_HIGH)
>  		goto force;
>  
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..4fff90ca145b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3932,12 +3932,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  					free_pages))
>  		return true;
>  	/*
> -	 * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
> +	 * Ignore watermark boosting for GFP_HIGH order-0 allocations
>  	 * when checking the min watermark. The min watermark is the
>  	 * point where boosting is ignored so that kswapd is woken up
>  	 * when below the low watermark.
>  	 */
> -	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
> +	if (unlikely(!order && (alloc_flags & ALLOC_HARDER) && z->watermark_boost
>  		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
>  		mark = z->_watermark[WMARK_MIN];
>  		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
> @@ -4661,12 +4661,12 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
> +	 * set both ALLOC_HARDER (unless __GFP_NOMEMALLOC) and ALLOC_HIGH.
>  	 */
>  	alloc_flags |= (__force int)
>  		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
>  
> -	if (gfp_mask & __GFP_ATOMIC) {
> +	if (gfp_mask & __GFP_HIGH) {
>  		/*
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
>  		 * if it can't schedule.
> @@ -4859,14 +4859,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned int cpuset_mems_cookie;
>  	int reserve_flags;
>  
> -	/*
> -	 * We also sanity check to catch abuse of atomic reserves being used by
> -	 * callers that are not in atomic context.
> -	 */
> -	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> -				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> -		gfp_mask &= ~__GFP_ATOMIC;
> -
>  retry_cpuset:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
> diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
> index da03a341c63c..17e6eeb6e196 100644
> --- a/tools/perf/builtin-kmem.c
> +++ b/tools/perf/builtin-kmem.c
> @@ -640,7 +640,6 @@ static const struct {
>  	{ "__GFP_HIGHMEM",		"HM" },
>  	{ "GFP_DMA32",			"D32" },
>  	{ "__GFP_HIGH",			"H" },
> -	{ "__GFP_ATOMIC",		"_A" },
>  	{ "__GFP_IO",			"I" },
>  	{ "__GFP_FS",			"F" },
>  	{ "__GFP_NOWARN",		"NWR" },
> diff --git a/tools/testing/radix-tree/linux/gfp.h b/tools/testing/radix-tree/linux/gfp.h
> index 32159c08a52e..0a0741104dfe 100644
> --- a/tools/testing/radix-tree/linux/gfp.h
> +++ b/tools/testing/radix-tree/linux/gfp.h
> @@ -12,7 +12,6 @@
>  #define __GFP_FS		0x80u
>  #define __GFP_NOWARN		0x200u
>  #define __GFP_ZERO		0x8000u
> -#define __GFP_ATOMIC		0x80000u
>  #define __GFP_ACCOUNT		0x100000u
>  #define __GFP_DIRECT_RECLAIM	0x400000u
>  #define __GFP_KSWAPD_RECLAIM	0x2000000u
> @@ -20,7 +19,7 @@
>  #define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
>  
>  #define GFP_ZONEMASK	0x0fu
> -#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
>  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
>  
> -- 
> 2.33.1
Mel Gorman Nov. 18, 2021, 1:27 p.m. UTC | #3
On Thu, Nov 18, 2021 at 10:22:36AM +0100, Michal Hocko wrote:
> [Cc Mel]
> 

I think this patch should be ok. There are few direct users of __GFP_HIGH
and some of them are borderline silly (e.g. mm/shmem.c specifying
__GFP_HIGH|__GFP_NOMEMALLOC) while others just look questionable (
drivers/md/raid10.c seems to assume __GFP_HIGH guarantees allocation
success). Xen appears to be the worst abuser of __GFP_HIGH.
NeilBrown Nov. 18, 2021, 11:02 p.m. UTC | #4
On Fri, 19 Nov 2021, Mel Gorman wrote:
> On Thu, Nov 18, 2021 at 10:22:36AM +0100, Michal Hocko wrote:
> > [Cc Mel]
> > 
> 
> I think this patch should be ok. There are few direct users of __GFP_HIGH
> and some of them are borderline silly (e.g. mm/shmem.c specifying
> __GFP_HIGH|__GFP_NOMEMALLOC) while others just look questionable (
> drivers/md/raid10.c seems to assume __GFP_HIGH guarantees allocation
> success). Xen appears to be the worst abuser of __GFP_HIGH.

That __GFP_HIGH in raid10.c is passed to mempool_alloc(), so there is no
assumption that __GFP_HIGH will provide guarantees - the mempool does
that.
The comment - which I wrote 4 years ago and don't recall at all -
suggest it was purely about performance - get error handling out of the
way quickly.  I doubt I could justify it if challenged...

Thanks,
NeilBrown


> 
> -- 
> Mel Gorman
> SUSE Labs
> 
>
NeilBrown Nov. 18, 2021, 11:14 p.m. UTC | #5
On Thu, 18 Nov 2021, Matthew Wilcox wrote:
> On Wed, Nov 17, 2021 at 03:39:30PM +1100, NeilBrown wrote:
> > +++ b/drivers/iommu/tegra-smmu.c
> > @@ -676,12 +676,12 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
> >  	 * allocate page in a sleeping context if GFP flags permit. Hence
> >  	 * spinlock needs to be unlocked and re-locked after allocation.
> >  	 */
> > -	if (!(gfp & __GFP_ATOMIC))
> > +	if (gfp & __GFP_DIRECT_RECLAIM)
> >  		spin_unlock_irqrestore(&as->lock, *flags);
> >  
> >  	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
> >  
> > -	if (!(gfp & __GFP_ATOMIC))
> > +	if (gfp & __GFP_DIRECT_RECLAIM)
> >  		spin_lock_irqsave(&as->lock, *flags);
> >  
> >  	/*
> 
> Surely this should be gfpflags_allow_blocking() instead of poking about
> in the innards of gfp flags?

Possibly.  Didn't know about gfpflags_allow_blocking().  From a quick
grep in the kernel, a whole lot of other people don't know about it
either, though clearly some do.

Maybe we should reaname "__GFP_DIRECT_RECLAIM" to
"__GFP_ALLOW_BLOCKING", because that is what most users seems to care
about.

If not, then we probably want a gfpflags_without_block() function that
removes that flag, as lots of code wants to do that - and using the flag
for one, and an inline for the other is not consistent.

My leaning would be to __GFP_ALLOW_BLOCKING

NeilBrown


> 
> This patch seems like a good simplification to me.
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Thanks,
NeilBrown
Matthew Wilcox (Oracle) Nov. 19, 2021, 2:10 p.m. UTC | #6
On Fri, Nov 19, 2021 at 10:14:38AM +1100, NeilBrown wrote:
> On Thu, 18 Nov 2021, Matthew Wilcox wrote:
> > Surely this should be gfpflags_allow_blocking() instead of poking about
> > in the innards of gfp flags?
> 
> Possibly.  Didn't know about gfpflags_allow_blocking().  From a quick
> grep in the kernel, a whole lot of other people don't know about it
> either, though clearly some do.
> 
> Maybe we should reaname "__GFP_DIRECT_RECLAIM" to
> "__GFP_ALLOW_BLOCKING", because that is what most users seems to care
> about.

I tend towards the school of thought that the __GFP flags should make
sense to the implementation and users should use either GFP_ or functions.
When we see users adding or subtracting __GFP flags, that's a problem.

> If not, then we probably want a gfpflags_without_block() function that
> removes that flag, as lots of code wants to do that - and using the flag
> for one, and an inline for the other is not consistent.

It's not a _lot_ of code ...

block/bio.c
drivers/net/ethernet/mellanox/mlx4/icm.c
drivers/vhost/net.c
fs/btrfs/disk-io.c
fs/btrfs/volumes.c
fs/erofs/zdata.c
fs/fscache/cookie.c
fs/fscache/page.c
kernel/cgroup/cgroup.c
net/core/skbuff.c
net/core/sock.c
net/netlink/af_netlink.c

(excluding mm -- if mm wants to meddle with GFP flags, that's fine)

I think a lot of these are probably misguided, eg the filesystems should
probably be using GFP_NOFS to prevent recursion.
NeilBrown Nov. 20, 2021, 10:51 a.m. UTC | #7
On Sat, 20 Nov 2021, Matthew Wilcox wrote:
> On Fri, Nov 19, 2021 at 10:14:38AM +1100, NeilBrown wrote:
> > On Thu, 18 Nov 2021, Matthew Wilcox wrote:
> > > Surely this should be gfpflags_allow_blocking() instead of poking about
> > > in the innards of gfp flags?
> > 
> > Possibly.  Didn't know about gfpflags_allow_blocking().  From a quick
> > grep in the kernel, a whole lot of other people don't know about it
> > either, though clearly some do.
> > 
> > Maybe we should reaname "__GFP_DIRECT_RECLAIM" to
> > "__GFP_ALLOW_BLOCKING", because that is what most users seems to care
> > about.
> 
> I tend towards the school of thought that the __GFP flags should make
> sense to the implementation and users should use either GFP_ or functions.
> When we see users adding or subtracting __GFP flags, that's a problem.

Except __GFP_NOWARN of course, or __GFP_ZERO, or __GFP_NOFAIL.
What about __GFP_HIGHMEM?  __GFP_DMA?  __GFP_HIGH?

They all seem to be quite meaningful to the caller - explicitly
specifying properties of the memory or properties of the service.
(But maybe you would prefer __GFP_HIGH be spelled "__GFP_LOW_WATERMARK"
so it would make more sense to the implementation).

__GFP_DIRECTRECLAIM seems to me to be more the exception than the rule -
specifying internal implementation details.

Actually ... I take it back about __GFP_NOWARN.  That probably shouldn't
exist at all.  Warnings should be based on how stressed the mm system is,
not on whether the caller wants thinks failure is manageable.

NeilBrown
Michal Hocko Nov. 22, 2021, 4:43 p.m. UTC | #8
On Wed 17-11-21 15:39:30, Neil Brown wrote:
> 
> __GFP_ATOMIC serves little purpose.
> It's main effect is to set ALLOC_HARDER which adds a few little boosts to
> increase the chance of an allocation succeeding, one of which is to
> lower the water-mark at which it will succeed.
> 
> It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> adjusts this watermark.  It is probable that other users of __GFP_HIGH
> should benefit from the other little bonuses that __GFP_ATOMIC gets.

While I like to see __GFP_ATOMIC going away I am not really sure about
this particular part. We have 3 ways to get to memory reserves. One of
thme is directly controlable by __GFP_HIGH and two are internal to the
allocator to handle different situations - ALLOC_OOM is to help the oom
victim to make a fwd progress and ALLOC_HARDER should be for contexts
which cannot rely on the memory reclaim to continue.

What is the point of having ALLOC_HIGH and ALLOC_HARDER if you just
add both of them for __GFP_HIGH? I think you should be instead really
get back to pre d0164adc89f6b and allow ALLOC_HARDER for requests which
have neither of the reclaim allowed. That would require tweaking
GFP_ATOMIC as well I suspect and drop __GFP_KSWAPD_RECLAIM. Or do
something else.
 
> __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
> There is little point to this.  We already get a might_sleep() warning
> if __GFP_DIRECT_RECLAIM is set.

I believe the point of the warning was to stop any abuse of an
additional memory reserves for context which can reclaim and to spare
those to interrupt handlers - which usually use GFP_ATOMIC. A lack of
any reports suggests this hasn't happened and the warning can be
dropped. Would be worth a patch on its own with this explanation.
 
> __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> probable that testing ALLOC_HARDER is a better fit here.

This has been introduced by f80b08fc44536 but I have to say that I
haven't understood why this couldn't check for __GFP_DIRECT_RECLAIM
or one ALLOC_$FOO boosters rather than __GFP_ATOMIC. Again something for
a separate patch.
 
> __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> sleep.  This should test __GFP_DIRECT_RECLAIM instead.

Willy has already proposed a better alternative.

Thanks!
Michal Hocko Nov. 22, 2021, 4:54 p.m. UTC | #9
On Sat 20-11-21 21:51:20, Neil Brown wrote:
> On Sat, 20 Nov 2021, Matthew Wilcox wrote:
> > On Fri, Nov 19, 2021 at 10:14:38AM +1100, NeilBrown wrote:
> > > On Thu, 18 Nov 2021, Matthew Wilcox wrote:
> > > > Surely this should be gfpflags_allow_blocking() instead of poking about
> > > > in the innards of gfp flags?
> > > 
> > > Possibly.  Didn't know about gfpflags_allow_blocking().  From a quick
> > > grep in the kernel, a whole lot of other people don't know about it
> > > either, though clearly some do.
> > > 
> > > Maybe we should reaname "__GFP_DIRECT_RECLAIM" to
> > > "__GFP_ALLOW_BLOCKING", because that is what most users seems to care
> > > about.
> > 
> > I tend towards the school of thought that the __GFP flags should make
> > sense to the implementation and users should use either GFP_ or functions.
> > When we see users adding or subtracting __GFP flags, that's a problem.
> 
> Except __GFP_NOWARN of course, or __GFP_ZERO, or __GFP_NOFAIL.
> What about __GFP_HIGHMEM?  __GFP_DMA?  __GFP_HIGH?
> 
> They all seem to be quite meaningful to the caller - explicitly
> specifying properties of the memory or properties of the service.
> (But maybe you would prefer __GFP_HIGH be spelled "__GFP_LOW_WATERMARK"
> so it would make more sense to the implementation).
> 
> __GFP_DIRECTRECLAIM seems to me to be more the exception than the rule -
> specifying internal implementation details.

I do not think it is viable to fix up gfp flags to be consistent :/
Both __GFP_DIRECT_RECLAIM and __GFP_KSWAPD_RECLAIM are way too lowlevel
but historically we've had requests to inhibit kswapd for a particular
requests because that has led to problems - fun reading caf491916b1c1.
__GFP_ALLOW_BLOCKING would make a lot of sense but I am not sure it
would be a good match to __GFP_KSWAPD_RECLAIM.

> Actually ... I take it back about __GFP_NOWARN.  That probably shouldn't
> exist at all.  Warnings should be based on how stressed the mm system is,
> not on whether the caller wants thinks failure is manageable.

Unless we change the way when allocation warnings are triggered then we
really need this. There are many opportunistic allocations with a
fallback behavior which do not want to swamp kernel logs with failures
that are of no use. Think of a THP allocation that really want to be
just very quick and falls back to normal base pages otherwise. Deducing
context which is just fine to not report failures is quite tricky and it
can get wrong easily. Callers should know whether warning can be of any
use in many cases.
NeilBrown Nov. 23, 2021, 4:15 a.m. UTC | #10
On Tue, 23 Nov 2021, Michal Hocko wrote:
> On Sat 20-11-21 21:51:20, Neil Brown wrote:
> > On Sat, 20 Nov 2021, Matthew Wilcox wrote:
> > > On Fri, Nov 19, 2021 at 10:14:38AM +1100, NeilBrown wrote:
> > > > On Thu, 18 Nov 2021, Matthew Wilcox wrote:
> > > > > Surely this should be gfpflags_allow_blocking() instead of poking about
> > > > > in the innards of gfp flags?
> > > > 
> > > > Possibly.  Didn't know about gfpflags_allow_blocking().  From a quick
> > > > grep in the kernel, a whole lot of other people don't know about it
> > > > either, though clearly some do.
> > > > 
> > > > Maybe we should reaname "__GFP_DIRECT_RECLAIM" to
> > > > "__GFP_ALLOW_BLOCKING", because that is what most users seems to care
> > > > about.
> > > 
> > > I tend towards the school of thought that the __GFP flags should make
> > > sense to the implementation and users should use either GFP_ or functions.
> > > When we see users adding or subtracting __GFP flags, that's a problem.
> > 
> > Except __GFP_NOWARN of course, or __GFP_ZERO, or __GFP_NOFAIL.
> > What about __GFP_HIGHMEM?  __GFP_DMA?  __GFP_HIGH?
> > 
> > They all seem to be quite meaningful to the caller - explicitly
> > specifying properties of the memory or properties of the service.
> > (But maybe you would prefer __GFP_HIGH be spelled "__GFP_LOW_WATERMARK"
> > so it would make more sense to the implementation).
> > 
> > __GFP_DIRECTRECLAIM seems to me to be more the exception than the rule -
> > specifying internal implementation details.
> 
> I do not think it is viable to fix up gfp flags to be consistent :/

You may be right :-)  Of course if we don't try - you'll definitely be right.

> Both __GFP_DIRECT_RECLAIM and __GFP_KSWAPD_RECLAIM are way too lowlevel
> but historically we've had requests to inhibit kswapd for a particular
> requests because that has led to problems - fun reading caf491916b1c1.

Unfortunately that commit doesn't provide any reasoning, just an
assertion.
The best reasoning I could find was in caf491916b1c1 which was the initial
revert.  There the primary reasoning was "there is a bug that we don't
have time for a proper fix before the next release, so let's just use
this quick fix".
...  and maybe "the quick fix" was "the right fix", but I cannot tell from
the commit logs :-(

> __GFP_ALLOW_BLOCKING would make a lot of sense but I am not sure it
> would be a good match to __GFP_KSWAPD_RECLAIM.

So? __GFP_ALLOW_BLOCKING makes it clear what is, or is not, acceptable
to the caller.  How much reclaim, or other activity, alloc_page()
engages in is largely irrelevant to the caller as lock as it doesn't
block if asked not to (and doesn't enter an FS if asked not to, etc).

> 
> > Actually ... I take it back about __GFP_NOWARN.  That probably shouldn't
> > exist at all.  Warnings should be based on how stressed the mm system is,
> > not on whether the caller wants thinks failure is manageable.
> 
> Unless we change the way when allocation warnings are triggered then we
> really need this. There are many opportunistic allocations with a
> fallback behavior which do not want to swamp kernel logs with failures
> that are of no use. Think of a THP allocation that really want to be
> just very quick and falls back to normal base pages otherwise. Deducing
> context which is just fine to not report failures is quite tricky and it
> can get wrong easily. Callers should know whether warning can be of any
> use in many cases.

"Unless" being the key work.
It makes sense to warn when a __GFP_HIGH or __GFP_MEMALLOC allocation
fails,  because they are clearly important.
It makes sense to warning if direct reclaim and retrying were enabled,
as then alloc_page() has tried really hard, but failed anyway.  Thought
maybe if COSTLY_ORDER is exceeded, then the warning is unlikely to be
interesting.
But does it ever make sense to warn if either of
__GFP_RETRY_MAYFAIL __GFP_NORETRY are present?

If we always suppressed warning when those flags were present, then many
(most?) uses for __GFP_NOWARN can be discarded.

I can see that some of the __GFP flags are designed to each perform a
single well-defined function and internally to mm/ that makes sense.
But exposing those flags to all users appears to be a recipe for
trouble.  Hiding them all behind "__" doesn't stop people from using and
misusing them.  Others are externally meaningful.  Making them visually
similar to the ones we want to hide isn't helping anyone.

When Willy wrote:
  > When we see users adding or subtracting __GFP flags, that's a problem.
the "problem" is not so much in the fact that they *do* but in the fact
that they *can*.

I would be greatly in favour of GFP flags which made sense to callers,
and a mapping to ALLOC_ flags in mm/ which makes sense to allocators.

I doubt anything outside mm/ cares about whether KSWAPD is woken or not.
IT probably should be for small-order allocations, and not so much for
large-order allocations.  but mm/khugepaged.c might make other decisions.

Thanks,
NeilBrown
NeilBrown Nov. 23, 2021, 4:33 a.m. UTC | #11
On Tue, 23 Nov 2021, Michal Hocko wrote:
> On Wed 17-11-21 15:39:30, Neil Brown wrote:
> > 
> > __GFP_ATOMIC serves little purpose.
> > It's main effect is to set ALLOC_HARDER which adds a few little boosts to
> > increase the chance of an allocation succeeding, one of which is to
> > lower the water-mark at which it will succeed.
> > 
> > It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> > adjusts this watermark.  It is probable that other users of __GFP_HIGH
> > should benefit from the other little bonuses that __GFP_ATOMIC gets.
> 
> While I like to see __GFP_ATOMIC going away I am not really sure about
> this particular part. We have 3 ways to get to memory reserves. One of
> thme is directly controlable by __GFP_HIGH and two are internal to the
> allocator to handle different situations - ALLOC_OOM is to help the oom
> victim to make a fwd progress and ALLOC_HARDER should be for contexts
> which cannot rely on the memory reclaim to continue.

ALLOC_OOM certainly makes sense.

> 
> What is the point of having ALLOC_HIGH and ALLOC_HARDER if you just
> add both of them for __GFP_HIGH? I think you should be instead really
> get back to pre d0164adc89f6b and allow ALLOC_HARDER for requests which
> have neither of the reclaim allowed. That would require tweaking
> GFP_ATOMIC as well I suspect and drop __GFP_KSWAPD_RECLAIM. Or do
> something else.

NONONONO.  Tying ALLOC_HARDER to "no reclaim" is a mistake.  From the
caller's perspective they are very different statements, which might
sometimes go together (and GFP_ATOMIC is exactly where they go
together).

"no reclaim" is a question "am I willing to pay the price of performing
reclaim", and there might be various different reasons for choosing not
to.

"ALLOC_HARDER" is a question of "can I justify imposing on other threads
by taking memory that they might want".  Again there may be different
reasons, but they will not always align with the first set.

With my patch there is still a difference between ALLOC_HIGH and
ALLOC_HARDER, but not much.
__GFP_HIGH combined with __GFP_NOMEMALLOC - which could be seen as "high
priority, but not too high" delivers ALLOC_HIGH without ALLOC_HARDER.
It may not be a useful distinction, but it seems to preserve most of
what I didn't want to change.

>  
> > __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
> > There is little point to this.  We already get a might_sleep() warning
> > if __GFP_DIRECT_RECLAIM is set.
> 
> I believe the point of the warning was to stop any abuse of an
> additional memory reserves for context which can reclaim and to spare
> those to interrupt handlers - which usually use GFP_ATOMIC. A lack of
> any reports suggests this hasn't happened and the warning can be
> dropped. Would be worth a patch on its own with this explanation.
>  
> > __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> > probable that testing ALLOC_HARDER is a better fit here.
> 
> This has been introduced by f80b08fc44536 but I have to say that I
> haven't understood why this couldn't check for __GFP_DIRECT_RECLAIM
> or one ALLOC_$FOO boosters rather than __GFP_ATOMIC. Again something for
> a separate patch.

Yes - the commit description contrasts "atomic" with "sleepable"
allocations.  Given that, __GFP_DIRECT_RECLAIM would make make sense.

>  
> > __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> > sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> 
> Willy has already proposed a better alternative.

I'm happy to resend the patch with this change (despite it seeming like
a bandaid on a blowhole).

Thanks for the thorough review!!

NeilBrown

> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 
>
Michal Hocko Nov. 23, 2021, 1:41 p.m. UTC | #12
On Tue 23-11-21 15:33:19, Neil Brown wrote:
[...]
> "ALLOC_HARDER" is a question of "can I justify imposing on other threads
> by taking memory that they might want".  Again there may be different
> reasons, but they will not always align with the first set.
> 
> With my patch there is still a difference between ALLOC_HIGH and
> ALLOC_HARDER, but not much.
> __GFP_HIGH combined with __GFP_NOMEMALLOC - which could be seen as "high
> priority, but not too high" delivers ALLOC_HIGH without ALLOC_HARDER.
> It may not be a useful distinction, but it seems to preserve most of
> what I didn't want to change.

I am not sure this is really a helpful distinction. I would even say that
an explicit use of __GFP_NOMEMALLOC | __GFP_HIGH is actively confusing
as that would mean that you do not allow access to reserves while you
want to dip into them anyway.

Anyway, I still think that ALLOC_HARDER should stay under control of the
allocator as a heuristic rather being imprinted into gfp flags
directly. Having two levels of memory reserves access is just too
complicated for users and I wouldn't be surprised if most callers would
just consider their usecase important enough to justify as much reserves
as possible.

Allocation from an interrupt context sounds like a good usecase for
ALLOC_HARDER. I am not sure about rt_task one but that one can be
reasoned about as well. All/most __GFP_HIGH allocations just look like
an overuse and conflation of the two modes. Both these were the primary
usecase for ALLOC_HARDER historically we just tried to find a way how to
express the former by gfp flags.
Michal Hocko Nov. 23, 2021, 2:27 p.m. UTC | #13
On Tue 23-11-21 15:15:20, Neil Brown wrote:
> On Tue, 23 Nov 2021, Michal Hocko wrote:
[...]
> > Both __GFP_DIRECT_RECLAIM and __GFP_KSWAPD_RECLAIM are way too lowlevel
> > but historically we've had requests to inhibit kswapd for a particular
> > requests because that has led to problems - fun reading caf491916b1c1.
> 
> Unfortunately that commit doesn't provide any reasoning, just an
> assertion.
> The best reasoning I could find was in caf491916b1c1 which was the initial
> revert.  There the primary reasoning was "there is a bug that we don't
> have time for a proper fix before the next release, so let's just use
> this quick fix".
> ...  and maybe "the quick fix" was "the right fix", but I cannot tell from
> the commit logs :-(

Yeah, that was not entirely fair from me but I just found it a nice
example of how fun our process around gpf has been historically.
A more fair would be to point you at 32dba98e085f ("thp: _GFP_NO_KSWAPD")
which has introduced for THP use. Mostly as a workaround to existing
reclaim problems because THPs have been enabled by default for everybody
and that had backfired. Rik has tried to remove the flag c654345924f7
("mm: remove __GFP_NO_KSWAPD") because most problems had been fixed - he
believed. But that has turned out to be not the case 82b212f40059
("Revert "mm: remove __GFP_NO_KSWAPD"") and swap storms triggered by THP
peak loads were still observed.

THP still seem to remain to be the biggest user of the flag (read only
to care to not have the flag. Maybe another round of the check whether
we need it...

> > __GFP_ALLOW_BLOCKING would make a lot of sense but I am not sure it
> > would be a good match to __GFP_KSWAPD_RECLAIM.
> 
> So? __GFP_ALLOW_BLOCKING makes it clear what is, or is not, acceptable
> to the caller.  How much reclaim, or other activity, alloc_page()
> engages in is largely irrelevant to the caller as lock as it doesn't
> block if asked not to (and doesn't enter an FS if asked not to, etc).

Hmm, maybe you are right.
 
> > > Actually ... I take it back about __GFP_NOWARN.  That probably shouldn't
> > > exist at all.  Warnings should be based on how stressed the mm system is,
> > > not on whether the caller wants thinks failure is manageable.
> > 
> > Unless we change the way when allocation warnings are triggered then we
> > really need this. There are many opportunistic allocations with a
> > fallback behavior which do not want to swamp kernel logs with failures
> > that are of no use. Think of a THP allocation that really want to be
> > just very quick and falls back to normal base pages otherwise. Deducing
> > context which is just fine to not report failures is quite tricky and it
> > can get wrong easily. Callers should know whether warning can be of any
> > use in many cases.
> 
> "Unless" being the key work.
> It makes sense to warn when a __GFP_HIGH or __GFP_MEMALLOC allocation
> fails,  because they are clearly important.
>
> It makes sense to warning if direct reclaim and retrying were enabled,
> as then alloc_page() has tried really hard, but failed anyway.  Thought
> maybe if COSTLY_ORDER is exceeded, then the warning is unlikely to be
> interesting.

For "normal" small allocations we usually get an OOM report if the
memory is depleted. That will provide quite a lot of potentially useful
context to debug memory usage. Non reclaiming allocations can be just
opportunistic that choose to not reclaim with an other approach as a
fallback but there are others that really cannot reclaim because they
are in an atomic context. I do not see an easy way to tell one from the
other. Simirarly for higher order allocations it can be useful to see
whether the memory is depletely or just fragmented.

> But does it ever make sense to warn if either of
> __GFP_RETRY_MAYFAIL __GFP_NORETRY are present?

> If we always suppressed warning when those flags were present, then many
> (most?) uses for __GFP_NOWARN can be discarded.

Yes __GFP_NORETRY is mostly (maybe always) used with __GFP_NOWARN.
Coccinelle would be a good way to check. I do remember MAYFAIL is used
for page migration to allocate target memory. It is often useful to see
that the migration is failing because of lack of memory.
 
> I can see that some of the __GFP flags are designed to each perform a
> single well-defined function and internally to mm/ that makes sense.
> But exposing those flags to all users appears to be a recipe for
> trouble.  Hiding them all behind "__" doesn't stop people from using and
> misusing them.  Others are externally meaningful.  Making them visually
> similar to the ones we want to hide isn't helping anyone.

I do agree here.
Andrew Morton April 30, 2022, 6:30 p.m. UTC | #14
Folks,

Discussion seemed to meander then peter out.  What's the thinking on
this patch now?  

Thanks.



From: "NeilBrown" <neilb@suse.de>
Subject: mm: discard __GFP_ATOMIC

__GFP_ATOMIC serves little purpose.  Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.

It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark.  It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.

__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. 
There is little point to this.  We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.

__GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
probable that testing ALLOC_HARDER is a better fit here.

__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep.  This should test __GFP_DIRECT_RECLAIM instead.

This patch:
 - removes __GFP_ATOMIC
 - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
   (as well as ALLOC_HIGH).
 - makes other adjustments as suggested by the above.

The net result is not change to GFP_ATOMIC allocations.  Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges.  This affects:
  xen, dm, md, ntfs3
  the vermillion frame buffer
  hibernation
  ksm
  swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.

Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/balance.rst   |    2 +-
 drivers/iommu/tegra-smmu.c     |    4 ++--
 include/linux/gfp.h            |   12 ++++--------
 include/trace/events/mmflags.h |    1 -
 lib/test_printf.c              |    8 ++++----
 mm/internal.h                  |    2 +-
 mm/page_alloc.c                |   16 ++++------------
 tools/include/linux/gfp.h      |    3 +--
 tools/perf/builtin-kmem.c      |    1 -
 9 files changed, 17 insertions(+), 32 deletions(-)

--- a/Documentation/vm/balance.rst~mm-discard-__gfp_atomic
+++ a/Documentation/vm/balance.rst
@@ -6,7 +6,7 @@ Memory Balancing
 
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 
-Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
 well as for non __GFP_IO allocations.
 
 The first reason why a caller may avoid reclaim is that the caller can not
--- a/drivers/iommu/tegra-smmu.c~mm-discard-__gfp_atomic
+++ a/drivers/iommu/tegra-smmu.c
@@ -671,12 +671,12 @@ static struct page *as_get_pde_page(stru
 	 * allocate page in a sleeping context if GFP flags permit. Hence
 	 * spinlock needs to be unlocked and re-locked after allocation.
 	 */
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfp & __GFP_DIRECT_RECLAIM)
 		spin_unlock_irqrestore(&as->lock, *flags);
 
 	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
 
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfp & __GFP_DIRECT_RECLAIM)
 		spin_lock_irqsave(&as->lock, *flags);
 
 	/*
--- a/include/linux/gfp.h~mm-discard-__gfp_atomic
+++ a/include/linux/gfp.h
@@ -39,7 +39,7 @@ struct vm_area_struct;
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
 #define ___GFP_ZERO		0x100u
-#define ___GFP_ATOMIC		0x200u
+/* 0x200u unused */
 #define ___GFP_DIRECT_RECLAIM	0x400u
 #define ___GFP_KSWAPD_RECLAIM	0x800u
 #define ___GFP_WRITE		0x1000u
@@ -124,11 +124,8 @@ struct vm_area_struct;
  *
  * %__GFP_HIGH indicates that the caller is high-priority and that granting
  * the request is necessary before the system can make forward progress.
- * For example, creating an IO context to clean pages.
- *
- * %__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
- * high priority. Users are typically interrupt handlers. This may be
- * used in conjunction with %__GFP_HIGH
+ * For example creating an IO context to clean pages and requests
+ * from atomic context.
  *
  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
  * the caller guarantees the allocation will allow more memory to be freed
@@ -143,7 +140,6 @@ struct vm_area_struct;
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
  */
-#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
@@ -337,7 +333,7 @@ struct vm_area_struct;
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
  */
-#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
--- a/include/trace/events/mmflags.h~mm-discard-__gfp_atomic
+++ a/include/trace/events/mmflags.h
@@ -29,7 +29,6 @@
 	{(unsigned long)__GFP_HIGHMEM,		"__GFP_HIGHMEM"},	\
 	{(unsigned long)GFP_DMA32,		"GFP_DMA32"},		\
 	{(unsigned long)__GFP_HIGH,		"__GFP_HIGH"},		\
-	{(unsigned long)__GFP_ATOMIC,		"__GFP_ATOMIC"},	\
 	{(unsigned long)__GFP_IO,		"__GFP_IO"},		\
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
--- a/lib/test_printf.c~mm-discard-__gfp_atomic
+++ a/lib/test_printf.c
@@ -673,17 +673,17 @@ flags(void)
 	gfp = GFP_ATOMIC|__GFP_DMA;
 	test("GFP_ATOMIC|GFP_DMA", "%pGg", &gfp);
 
-	gfp = __GFP_ATOMIC;
-	test("__GFP_ATOMIC", "%pGg", &gfp);
+	gfp = __GFP_HIGH;
+	test("__GFP_HIGH", "%pGg", &gfp);
 
 	/* Any flags not translated by the table should remain numeric */
 	gfp = ~__GFP_BITS_MASK;
 	snprintf(cmp_buffer, BUF_SIZE, "%#lx", (unsigned long) gfp);
 	test(cmp_buffer, "%pGg", &gfp);
 
-	snprintf(cmp_buffer, BUF_SIZE, "__GFP_ATOMIC|%#lx",
+	snprintf(cmp_buffer, BUF_SIZE, "__GFP_HIGH|%#lx",
 							(unsigned long) gfp);
-	gfp |= __GFP_ATOMIC;
+	gfp |= __GFP_HIGH;
 	test(cmp_buffer, "%pGg", &gfp);
 
 	kfree(cmp_buffer);
--- a/mm/internal.h~mm-discard-__gfp_atomic
+++ a/mm/internal.h
@@ -24,7 +24,7 @@ struct folio_batch;
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
-			__GFP_ATOMIC|__GFP_NOLOCKDEP)
+			__GFP_NOLOCKDEP)
 
 /* The GFP flags allowed during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
--- a/mm/page_alloc.c~mm-discard-__gfp_atomic
+++ a/mm/page_alloc.c
@@ -3918,12 +3918,12 @@ static inline bool zone_watermark_fast(s
 					free_pages))
 		return true;
 	/*
-	 * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
+	 * Ignore watermark boosting for GFP_HIGH order-0 allocations
 	 * when checking the min watermark. The min watermark is the
 	 * point where boosting is ignored so that kswapd is woken up
 	 * when below the low watermark.
 	 */
-	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
+	if (unlikely(!order && (alloc_flags & ALLOC_HARDER) && z->watermark_boost
 		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
 		mark = z->_watermark[WMARK_MIN];
 		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
@@ -4657,12 +4657,12 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (unless __GFP_NOMEMALLOC) and ALLOC_HIGH.
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
 
-	if (gfp_mask & __GFP_ATOMIC) {
+	if (gfp_mask & __GFP_HIGH) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
@@ -4855,14 +4855,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
 
-	/*
-	 * We also sanity check to catch abuse of atomic reserves being used by
-	 * callers that are not in atomic context.
-	 */
-	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
-				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
-		gfp_mask &= ~__GFP_ATOMIC;
-
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;
--- a/tools/include/linux/gfp.h~mm-discard-__gfp_atomic
+++ a/tools/include/linux/gfp.h
@@ -12,7 +12,6 @@
 #define __GFP_FS		0x80u
 #define __GFP_NOWARN		0x200u
 #define __GFP_ZERO		0x8000u
-#define __GFP_ATOMIC		0x80000u
 #define __GFP_ACCOUNT		0x100000u
 #define __GFP_DIRECT_RECLAIM	0x400000u
 #define __GFP_KSWAPD_RECLAIM	0x2000000u
@@ -20,7 +19,7 @@
 #define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
 
 #define GFP_ZONEMASK	0x0fu
-#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
+#define GFP_ATOMIC	(__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
 
--- a/tools/perf/builtin-kmem.c~mm-discard-__gfp_atomic
+++ a/tools/perf/builtin-kmem.c
@@ -640,7 +640,6 @@ static const struct {
 	{ "__GFP_HIGHMEM",		"HM" },
 	{ "GFP_DMA32",			"D32" },
 	{ "__GFP_HIGH",			"H" },
-	{ "__GFP_ATOMIC",		"_A" },
 	{ "__GFP_IO",			"I" },
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_NOWARN",		"NWR" },
Michal Hocko May 1, 2022, 3:45 p.m. UTC | #15
On Sat 30-04-22 11:30:28, Andrew Morton wrote:
> Folks,
> 
> Discussion seemed to meander then peter out.  What's the thinking on
> this patch now?  

I like the simplification. This requires a deeper review. I cannot
promise anything before LSFMM ends and who knows what my inbox will look
like afterwards. In general I think we have just too many levels of
memory reserves access and most people have no clue how to use them
properly.
Michal Hocko Sept. 6, 2022, 7:35 a.m. UTC | #16
On Sat 30-04-22 11:30:28, Andrew Morton wrote:

Sorry, this got lost in my inbox. Thanks Andrew for poking me.

> From: "NeilBrown" <neilb@suse.de>
> Subject: mm: discard __GFP_ATOMIC
> 
> __GFP_ATOMIC serves little purpose.  Its main effect is to set
> ALLOC_HARDER which adds a few little boosts to increase the chance of an
> allocation succeeding, one of which is to lower the water-mark at which it
> will succeed.
> 
> It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> adjusts this watermark.  It is probable that other users of __GFP_HIGH
> should benefit from the other little bonuses that __GFP_ATOMIC gets.
> 
> __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. 
> There is little point to this.  We already get a might_sleep() warning if
> __GFP_DIRECT_RECLAIM is set.
> 
> __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> probable that testing ALLOC_HARDER is a better fit here.
> 
> __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> 
> This patch:
>  - removes __GFP_ATOMIC
>  - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
>    (as well as ALLOC_HIGH).
>  - makes other adjustments as suggested by the above.
> 
> The net result is not change to GFP_ATOMIC allocations.  Other
> allocations that use __GFP_HIGH will benefit from a few different extra
> privileges.  This affects:
>   xen, dm, md, ntfs3
>   the vermillion frame buffer
>   hibernation
>   ksm
>   swap
> all of which likely produce more benefit than cost if these selected
> allocation are more likely to succeed quickly.

This is a good summary of the current usage and existing issues. It also
shows that the naming is tricky and allows people to make wrong calls
(tegra-smmu.c). I also thing that it is wrong to couple memory reserves
access to the reclaim constrains/expectations of the caller.

> Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
> Signed-off-by: NeilBrown <neilb@suse.de>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Thierry Reding <thierry.reding@gmail.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Yes, I am all for dropping the gfp flag. One thing that is not really
entirely clear to me, though, is whether we still need 3 levels of
memory reserves access. Can we just drop ALLOC_HARDER? With this patch
applied it serves RT tasks and conflates it with __GFP_HIGH users
essentially. So why do we need that additional level of reserves?
Mel Gorman Sept. 7, 2022, 9:47 a.m. UTC | #17
On Tue, Sep 06, 2022 at 09:35:41AM +0200, Michal Hocko wrote:
> > From: "NeilBrown" <neilb@suse.de>
> > Subject: mm: discard __GFP_ATOMIC
> > 
> > __GFP_ATOMIC serves little purpose.  Its main effect is to set
> > ALLOC_HARDER which adds a few little boosts to increase the chance of an
> > allocation succeeding, one of which is to lower the water-mark at which it
> > will succeed.
> > 
> > It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> > adjusts this watermark.  It is probable that other users of __GFP_HIGH
> > should benefit from the other little bonuses that __GFP_ATOMIC gets.
> > 
> > __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. 
> > There is little point to this.  We already get a might_sleep() warning if
> > __GFP_DIRECT_RECLAIM is set.
> > 
> > __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> > probable that testing ALLOC_HARDER is a better fit here.
> > 
> > __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> > sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> > 
> > This patch:
> >  - removes __GFP_ATOMIC
> >  - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
> >    (as well as ALLOC_HIGH).
> >  - makes other adjustments as suggested by the above.
> > 
> > The net result is not change to GFP_ATOMIC allocations.  Other
> > allocations that use __GFP_HIGH will benefit from a few different extra
> > privileges.  This affects:
> >   xen, dm, md, ntfs3
> >   the vermillion frame buffer
> >   hibernation
> >   ksm
> >   swap
> > all of which likely produce more benefit than cost if these selected
> > allocation are more likely to succeed quickly.
> 
> This is a good summary of the current usage and existing issues. It also
> shows that the naming is tricky and allows people to make wrong calls
> (tegra-smmu.c). I also thing that it is wrong to couple memory reserves
> access to the reclaim constrains/expectations of the caller.
> 

I think it's worth trying to get rid of __GFP_ATOMIC although this patch
needs to be rebased. Without rebasing it, I suspect there is a corner case
for reserving high order atomic blocks. A high-order atomic allocation
might get confused with a __GFP_HIGH high-order allocation that can sleep.
It would not be completely irrational to have such a caller if it was in a
path that can tolerate a stall but stalling might have visible consequences.
I'm also worried that the patch might allow __GFP_HIGH to ignore cpusets
which is probably not intended by direct users like ksm.

> > Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Thierry Reding <thierry.reding@gmail.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> Yes, I am all for dropping the gfp flag. One thing that is not really
> entirely clear to me, though, is whether we still need 3 levels of
> memory reserves access. Can we just drop ALLOC_HARDER? With this patch
> applied it serves RT tasks and conflates it with __GFP_HIGH users
> essentially. So why do we need that additional level of reserves?

I think this would fall under the "naming is hard". If __GFP_ATOMIC was
removed, the ALLOC_ flags might need renaming to detect differences in
high priority allocations (RT + GFP_ATOMIC), critical allocations (OOM)
and ones that can access special reserves (GFP_ATOMIC high-order).
Andrew Morton Oct. 17, 2022, 2:38 a.m. UTC | #18
On Wed, 7 Sep 2022 10:47:24 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Tue, Sep 06, 2022 at 09:35:41AM +0200, Michal Hocko wrote:
> > > From: "NeilBrown" <neilb@suse.de>
> > > Subject: mm: discard __GFP_ATOMIC
> > > 
> > > __GFP_ATOMIC serves little purpose.  Its main effect is to set
> > > ALLOC_HARDER which adds a few little boosts to increase the chance of an
> > > allocation succeeding, one of which is to lower the water-mark at which it
> > > will succeed.
> > > 
> > > It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> > > adjusts this watermark.  It is probable that other users of __GFP_HIGH
> > > should benefit from the other little bonuses that __GFP_ATOMIC gets.
> > > 
> > > __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. 
> > > There is little point to this.  We already get a might_sleep() warning if
> > > __GFP_DIRECT_RECLAIM is set.
> > > 
> > > __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> > > probable that testing ALLOC_HARDER is a better fit here.
> > > 
> > > __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> > > sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> > > 
> > > This patch:
> > >  - removes __GFP_ATOMIC
> > >  - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
> > >    (as well as ALLOC_HIGH).
> > >  - makes other adjustments as suggested by the above.
> > > 
> > > The net result is not change to GFP_ATOMIC allocations.  Other
> > > allocations that use __GFP_HIGH will benefit from a few different extra
> > > privileges.  This affects:
> > >   xen, dm, md, ntfs3
> > >   the vermillion frame buffer
> > >   hibernation
> > >   ksm
> > >   swap
> > > all of which likely produce more benefit than cost if these selected
> > > allocation are more likely to succeed quickly.
> > 
> > This is a good summary of the current usage and existing issues. It also
> > shows that the naming is tricky and allows people to make wrong calls
> > (tegra-smmu.c). I also thing that it is wrong to couple memory reserves
> > access to the reclaim constrains/expectations of the caller.
> > 
> 
> I think it's worth trying to get rid of __GFP_ATOMIC although this patch
> needs to be rebased. Without rebasing it, I suspect there is a corner case
> for reserving high order atomic blocks. A high-order atomic allocation
> might get confused with a __GFP_HIGH high-order allocation that can sleep.
> It would not be completely irrational to have such a caller if it was in a
> path that can tolerate a stall but stalling might have visible consequences.
> I'm also worried that the patch might allow __GFP_HIGH to ignore cpusets
> which is probably not intended by direct users like ksm.

Unclear what you mean by "rebased".  You're saying the patch might have
issues - doesn't that mean it needs to be "fixed"?

Anyway, I've been maintaining this change for nearly a year - if
nothing happens soon I guess I'll drop it so it doesn't get in people's
way.
Mel Gorman Oct. 18, 2022, 12:11 p.m. UTC | #19
On Sun, Oct 16, 2022 at 07:38:27PM -0700, Andrew Morton wrote:
> On Wed, 7 Sep 2022 10:47:24 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Tue, Sep 06, 2022 at 09:35:41AM +0200, Michal Hocko wrote:
> > > > From: "NeilBrown" <neilb@suse.de>
> > > > Subject: mm: discard __GFP_ATOMIC
> > > > 
> > > > __GFP_ATOMIC serves little purpose.  Its main effect is to set
> > > > ALLOC_HARDER which adds a few little boosts to increase the chance of an
> > > > allocation succeeding, one of which is to lower the water-mark at which it
> > > > will succeed.
> > > > 
> > > > It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> > > > adjusts this watermark.  It is probable that other users of __GFP_HIGH
> > > > should benefit from the other little bonuses that __GFP_ATOMIC gets.
> > > > 
> > > > __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. 
> > > > There is little point to this.  We already get a might_sleep() warning if
> > > > __GFP_DIRECT_RECLAIM is set.
> > > > 
> > > > __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> > > > probable that testing ALLOC_HARDER is a better fit here.
> > > > 
> > > > __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> > > > sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> > > > 
> > > > This patch:
> > > >  - removes __GFP_ATOMIC
> > > >  - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
> > > >    (as well as ALLOC_HIGH).
> > > >  - makes other adjustments as suggested by the above.
> > > > 
> > > > The net result is not change to GFP_ATOMIC allocations.  Other
> > > > allocations that use __GFP_HIGH will benefit from a few different extra
> > > > privileges.  This affects:
> > > >   xen, dm, md, ntfs3
> > > >   the vermillion frame buffer
> > > >   hibernation
> > > >   ksm
> > > >   swap
> > > > all of which likely produce more benefit than cost if these selected
> > > > allocation are more likely to succeed quickly.
> > > 
> > > This is a good summary of the current usage and existing issues. It also
> > > shows that the naming is tricky and allows people to make wrong calls
> > > (tegra-smmu.c). I also thing that it is wrong to couple memory reserves
> > > access to the reclaim constrains/expectations of the caller.
> > > 
> > 
> > I think it's worth trying to get rid of __GFP_ATOMIC although this patch
> > needs to be rebased. Without rebasing it, I suspect there is a corner case
> > for reserving high order atomic blocks. A high-order atomic allocation
> > might get confused with a __GFP_HIGH high-order allocation that can sleep.
> > It would not be completely irrational to have such a caller if it was in a
> > path that can tolerate a stall but stalling might have visible consequences.
> > I'm also worried that the patch might allow __GFP_HIGH to ignore cpusets
> > which is probably not intended by direct users like ksm.
> 
> Unclear what you mean by "rebased". 

I was looking at the original patch, not what was in mm-unstable.

> You're saying the patch might have
> issues - doesn't that mean it needs to be "fixed"?
> 

Yes, I think it needs some follow-on work.

> Anyway, I've been maintaining this change for nearly a year - if
> nothing happens soon I guess I'll drop it so it doesn't get in people's
> way.

I think it's easier to highlight my concerns by renaming ALLOC_HARDER
to ALLOC_MIN_RESERVE because that's effectively what it means.  It then
becomes a bit more obvious where the problems might be.

They're all fixable but do people agree that these problems are real or
my imagination?

---
 mm/internal.h   |  2 +-
 mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4b44ced87fff..8c7fd034b277 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -743,7 +743,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif
 
-#define ALLOC_HARDER		 0x10 /* try to alloc harder */
+#define ALLOC_MIN_RESERVE	 0x10 /* allow allocations below min watermark */
 #define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
 #define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a9a844dc197..9ff4cd4c5d11 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3708,8 +3708,16 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		 * due to non-CMA allocation context. HIGHATOMIC area is
 		 * reserved for high-order atomic allocation, so order-0
 		 * request should skip it.
+		 *
+		 * BZZT: __GFP_HIGH sets ALLOC_MIN_RESERVE and while __GFP_HIGH
+		 * 	 is set for GFP_ATOMIC, it does not mean that all
+		 * 	 __GFP_HIGH allocations also do not specify
+		 * 	 _GFP_DIRECT_RECLAIM. While there do not appear to
+		 * 	 be any in-tree users that allocation high-order pages
+		 * 	 using __GFP_HIGH && !__GFP_DIRECT_RECLAIM, it's a
+		 * 	 potential issue.
 		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER)
+		if (order > 0 && alloc_flags & ALLOC_MIN_RESERVE)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
 			page = __rmqueue(zone, order, migratetype, alloc_flags);
@@ -3943,11 +3951,11 @@ ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
 static inline long __zone_watermark_unusable_free(struct zone *z,
 				unsigned int order, unsigned int alloc_flags)
 {
-	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	const bool alloc_harder = (alloc_flags & (ALLOC_MIN_RESERVE|ALLOC_OOM));
 	long unusable_free = (1 << order) - 1;
 
 	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * If the caller does not have rights to ALLOC_MIN_RESERVE then subtract
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
@@ -3975,17 +3983,21 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	const bool alloc_harder = (alloc_flags & (ALLOC_MIN_RESERVE|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
+	/*
+	 * BZZT: ALLOC_HIGH is now effectively an alias of ALLOC_MIN_RESERVE
+	 *       so we end up double dipping below.
+	 */
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 
 	if (unlikely(alloc_harder)) {
 		/*
-		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * OOM victims can try even harder than normal ALLOC_MIN_RESERVE
 		 * users on the grounds that it's definitely going to be in
 		 * the exit path shortly and free memory. Any allocation it
 		 * makes during the free path will be small and short-lived.
@@ -4027,6 +4039,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			return true;
 		}
 #endif
+		/* BZZT: ALLOC_MIN_RESERVE does not imply access to highatomic reserves */
 		if (alloc_harder && !free_area_empty(area, MIGRATE_HIGHATOMIC))
 			return true;
 	}
@@ -4069,12 +4082,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 					free_pages))
 		return true;
 	/*
-	 * Ignore watermark boosting for GFP_HIGH order-0 allocations
+	 * Ignore watermark boosting for __GFP_HIGH order-0 allocations
 	 * when checking the min watermark. The min watermark is the
 	 * point where boosting is ignored so that kswapd is woken up
 	 * when below the low watermark.
 	 */
-	if (unlikely(!order && (alloc_flags & ALLOC_HARDER) && z->watermark_boost
+	if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost
 		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
 		mark = z->_watermark[WMARK_MIN];
 		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
@@ -4289,8 +4302,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			/*
 			 * If this is a high-order atomic allocation then check
 			 * if the pageblock should be reserved for the future
+			 *
+			 * BZZT: __GFP_HIGH does not imply high-order atomic
 			 */
-			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+			if (unlikely(order && (alloc_flags & ALLOC_MIN_RESERVE)))
 				reserve_highatomic_pageblock(page, zone, order);
 
 			return page;
@@ -4834,6 +4849,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
 	 * set both ALLOC_HARDER (unless __GFP_NOMEMALLOC) and ALLOC_HIGH.
+	 *
+	 * BZZT: Minor conflation issue. __GFP_HIGH == ALLOC_HIGH but they are
+	 * 	 now very similar to each other. Probably should remove
+	 * 	 ALLOC_HIGH and modify __GFP_HIGH == ALLOC_MIN_RESERVE
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
@@ -4844,14 +4863,24 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 * if it can't schedule.
 		 */
 		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			alloc_flags |= ALLOC_HARDER;
+			alloc_flags |= ALLOC_MIN_RESERVE;
 		/*
 		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
 		 * comment for __cpuset_node_allowed().
+		 *
+		 * BZZT: Not necessarily "atomic", should have checked
+		 * 	 __GFP_DIRECT_RECLAIM? This is a matter of
+		 * 	 definition as __GFP_HIGH is documented as being
+		 * 	 necessary for the system to make progress.
+		 * 	 Either enforce cpusets or update __GFP_HIGH
+		 * 	 documentation stating that CPUSETS may be
+		 * 	 broken. Based on the current __GFP_HIGH
+		 * 	 direct users, documenting that cpusets can
+		 * 	 be broken appears to be the safest option.
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && in_task())
-		alloc_flags |= ALLOC_HARDER;
+		alloc_flags |= ALLOC_MIN_RESERVE;
 
 	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
@@ -5277,7 +5306,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * could deplete whole memory reserves which would just make
 		 * the situation worse
 		 */
-		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);
 		if (page)
 			goto got_pg;
diff mbox series

Patch

diff --git a/Documentation/vm/balance.rst b/Documentation/vm/balance.rst
index 6a1fadf3e173..e38e9d83c1c7 100644
--- a/Documentation/vm/balance.rst
+++ b/Documentation/vm/balance.rst
@@ -6,7 +6,7 @@  Memory Balancing
 
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 
-Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
 well as for non __GFP_IO allocations.
 
 The first reason why a caller may avoid reclaim is that the caller can not
diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index e900e3c46903..c5fa8b8673b6 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -676,12 +676,12 @@  static struct page *as_get_pde_page(struct tegra_smmu_as *as,
 	 * allocate page in a sleeping context if GFP flags permit. Hence
 	 * spinlock needs to be unlocked and re-locked after allocation.
 	 */
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfp & __GFP_DIRECT_RECLAIM)
 		spin_unlock_irqrestore(&as->lock, *flags);
 
 	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
 
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfp & __GFP_DIRECT_RECLAIM)
 		spin_lock_irqsave(&as->lock, *flags);
 
 	/*
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..676c813bd93f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,7 +39,7 @@  struct vm_area_struct;
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
 #define ___GFP_ZERO		0x100u
-#define ___GFP_ATOMIC		0x200u
+/* 0x200u unused */
 #define ___GFP_DIRECT_RECLAIM	0x400u
 #define ___GFP_KSWAPD_RECLAIM	0x800u
 #define ___GFP_WRITE		0x1000u
@@ -116,11 +116,8 @@  struct vm_area_struct;
  *
  * %__GFP_HIGH indicates that the caller is high-priority and that granting
  * the request is necessary before the system can make forward progress.
- * For example, creating an IO context to clean pages.
- *
- * %__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
- * high priority. Users are typically interrupt handlers. This may be
- * used in conjunction with %__GFP_HIGH
+ * For example creating an IO context to clean pages and requests
+ * from atomic context.
  *
  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
  * the caller guarantees the allocation will allow more memory to be freed
@@ -135,7 +132,6 @@  struct vm_area_struct;
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
  */
-#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
@@ -320,7 +316,7 @@  struct vm_area_struct;
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
  */
-#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..30f492256b8c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -29,7 +29,6 @@ 
 	{(unsigned long)__GFP_HIGHMEM,		"__GFP_HIGHMEM"},	\
 	{(unsigned long)GFP_DMA32,		"GFP_DMA32"},		\
 	{(unsigned long)__GFP_HIGH,		"__GFP_HIGH"},		\
-	{(unsigned long)__GFP_ATOMIC,		"__GFP_ATOMIC"},	\
 	{(unsigned long)__GFP_IO,		"__GFP_IO"},		\
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
diff --git a/lib/test_printf.c b/lib/test_printf.c
index 07309c45f327..8010de49b6c5 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -673,17 +673,17 @@  flags(void)
 	gfp = GFP_ATOMIC|__GFP_DMA;
 	test("GFP_ATOMIC|GFP_DMA", "%pGg", &gfp);
 
-	gfp = __GFP_ATOMIC;
-	test("__GFP_ATOMIC", "%pGg", &gfp);
+	gfp = __GFP_HIGH;
+	test("__GFP_HIGH", "%pGg", &gfp);
 
 	/* Any flags not translated by the table should remain numeric */
 	gfp = ~__GFP_BITS_MASK;
 	snprintf(cmp_buffer, BUF_SIZE, "%#lx", (unsigned long) gfp);
 	test(cmp_buffer, "%pGg", &gfp);
 
-	snprintf(cmp_buffer, BUF_SIZE, "__GFP_ATOMIC|%#lx",
+	snprintf(cmp_buffer, BUF_SIZE, "__GFP_HIGH|%#lx",
 							(unsigned long) gfp);
-	gfp |= __GFP_ATOMIC;
+	gfp |= __GFP_HIGH;
 	test(cmp_buffer, "%pGg", &gfp);
 
 	kfree(cmp_buffer);
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..4564067065f9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -20,8 +20,7 @@ 
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
-			__GFP_ATOMIC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* The GFP flags allowed during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e92015..1ff55347e4b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2621,7 +2621,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * put the burden of reclaim on regular allocation requests
 	 * and let these go through as privileged allocations.
 	 */
-	if (gfp_mask & __GFP_ATOMIC)
+	if (gfp_mask & __GFP_HIGH)
 		goto force;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..4fff90ca145b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3932,12 +3932,12 @@  static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 					free_pages))
 		return true;
 	/*
-	 * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
+	 * Ignore watermark boosting for GFP_HIGH order-0 allocations
 	 * when checking the min watermark. The min watermark is the
 	 * point where boosting is ignored so that kswapd is woken up
 	 * when below the low watermark.
 	 */
-	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
+	if (unlikely(!order && (alloc_flags & ALLOC_HARDER) && z->watermark_boost
 		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
 		mark = z->_watermark[WMARK_MIN];
 		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
@@ -4661,12 +4661,12 @@  gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (unless __GFP_NOMEMALLOC) and ALLOC_HIGH.
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
 
-	if (gfp_mask & __GFP_ATOMIC) {
+	if (gfp_mask & __GFP_HIGH) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
@@ -4859,14 +4859,6 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
 
-	/*
-	 * We also sanity check to catch abuse of atomic reserves being used by
-	 * callers that are not in atomic context.
-	 */
-	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
-				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
-		gfp_mask &= ~__GFP_ATOMIC;
-
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index da03a341c63c..17e6eeb6e196 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -640,7 +640,6 @@  static const struct {
 	{ "__GFP_HIGHMEM",		"HM" },
 	{ "GFP_DMA32",			"D32" },
 	{ "__GFP_HIGH",			"H" },
-	{ "__GFP_ATOMIC",		"_A" },
 	{ "__GFP_IO",			"I" },
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_NOWARN",		"NWR" },
diff --git a/tools/testing/radix-tree/linux/gfp.h b/tools/testing/radix-tree/linux/gfp.h
index 32159c08a52e..0a0741104dfe 100644
--- a/tools/testing/radix-tree/linux/gfp.h
+++ b/tools/testing/radix-tree/linux/gfp.h
@@ -12,7 +12,6 @@ 
 #define __GFP_FS		0x80u
 #define __GFP_NOWARN		0x200u
 #define __GFP_ZERO		0x8000u
-#define __GFP_ATOMIC		0x80000u
 #define __GFP_ACCOUNT		0x100000u
 #define __GFP_DIRECT_RECLAIM	0x400000u
 #define __GFP_KSWAPD_RECLAIM	0x2000000u
@@ -20,7 +19,7 @@ 
 #define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
 
 #define GFP_ZONEMASK	0x0fu
-#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)