mm: io_uring: allow oom-killer from io_uring_setup

Message ID	20220125051736.2981459-1-shakeelb@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Mon, 24 Jan 2022 21:17:36 -0800 Message-Id: <20220125051736.2981459-1-shakeelb@google.com> Mime-Version: 1.0 Subject: [PATCH] mm: io_uring: allow oom-killer from io_uring_setup From: Shakeel Butt <shakeelb@google.com> To: Jens Axboe <axboe@kernel.dk>, Pavel Begunkov <asml.silence@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, io-uring@vger.kernel.org, Shakeel Butt <shakeelb@google.com> Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: io_uring: allow oom-killer from io_uring_setup \| expand mm: io_uring: allow oom-killer from io_uring_setup

Shakeel Butt Jan. 25, 2022, 5:17 a.m. UTC

On an overcommitted system which is running multiple workloads of
varying priorities, it is preferred to trigger an oom-killer to kill a
low priority workload than to let the high priority workload receiving
ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
ENOMEMs instead of oom-kills because io_uring_setup callchain is using
__GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
allow the oom-killer to kill a lower priority job.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 fs/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

David Rientjes Jan. 25, 2022, 6:35 p.m. UTC | #1

On Mon, 24 Jan 2022, Shakeel Butt wrote:

> On an overcommitted system which is running multiple workloads of
> varying priorities, it is preferred to trigger an oom-killer to kill a
> low priority workload than to let the high priority workload receiving
> ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
> ENOMEMs instead of oom-kills because io_uring_setup callchain is using
> __GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
> allow the oom-killer to kill a lower priority job.
> 

What is the size of the allocations that io_mem_alloc() is doing?

If get_order(size) > PAGE_ALLOC_COSTLY_ORDER, then this will fail even 
without the __GFP_NORETRY.  To make the guarantee that workloads are not 
receiving ENOMEM, it seems like we'd need to guarantee that allocations 
going through io_mem_alloc() are sufficiently small.

(And if we're really serious about it, then even something like a 
BUILD_BUG_ON().)

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  fs/io_uring.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e54c4127422e..d9eeb202363c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8928,10 +8928,9 @@ static void io_mem_free(void *ptr)
>  
>  static void *io_mem_alloc(size_t size)
>  {
> -	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
> -				__GFP_NORETRY | __GFP_ACCOUNT;
> +	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
>  
> -	return (void *) __get_free_pages(gfp_flags, get_order(size));
> +	return (void *) __get_free_pages(gfp, get_order(size));
>  }
>  
>  static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
> -- 
> 2.35.0.rc0.227.g00780c9af4-goog
> 
> 
>

Shakeel Butt Jan. 25, 2022, 10:57 p.m. UTC | #2

On Tue, Jan 25, 2022 at 10:35 AM David Rientjes <rientjes@google.com> wrote:
>
> On Mon, 24 Jan 2022, Shakeel Butt wrote:
>
> > On an overcommitted system which is running multiple workloads of
> > varying priorities, it is preferred to trigger an oom-killer to kill a
> > low priority workload than to let the high priority workload receiving
> > ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
> > ENOMEMs instead of oom-kills because io_uring_setup callchain is using
> > __GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
> > allow the oom-killer to kill a lower priority job.
> >
>
> What is the size of the allocations that io_mem_alloc() is doing?
>
> If get_order(size) > PAGE_ALLOC_COSTLY_ORDER, then this will fail even
> without the __GFP_NORETRY.  To make the guarantee that workloads are not
> receiving ENOMEM, it seems like we'd need to guarantee that allocations
> going through io_mem_alloc() are sufficiently small.
>
> (And if we're really serious about it, then even something like a
> BUILD_BUG_ON().)
>

The test case provided to me for which the user was seeing ENOMEMs was
io_uring_setup() with 64 entries (nothing else).

If I understand rings_size() calculations correctly then the 0 order
allocation was requested in io_mem_alloc().

For order > PAGE_ALLOC_COSTLY_ORDER, maybe we can use
__GFP_RETRY_MAYFAIL. It will at least do more aggressive reclaim
though I think that is a separate discussion. For this issue, we are
seeing ENOMEMs even for order 0 allocations.

David Rientjes Jan. 26, 2022, 1:42 a.m. UTC | #3

On Tue, 25 Jan 2022, Shakeel Butt wrote:

> > > On an overcommitted system which is running multiple workloads of
> > > varying priorities, it is preferred to trigger an oom-killer to kill a
> > > low priority workload than to let the high priority workload receiving
> > > ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
> > > ENOMEMs instead of oom-kills because io_uring_setup callchain is using
> > > __GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
> > > allow the oom-killer to kill a lower priority job.
> > >
> >
> > What is the size of the allocations that io_mem_alloc() is doing?
> >
> > If get_order(size) > PAGE_ALLOC_COSTLY_ORDER, then this will fail even
> > without the __GFP_NORETRY.  To make the guarantee that workloads are not
> > receiving ENOMEM, it seems like we'd need to guarantee that allocations
> > going through io_mem_alloc() are sufficiently small.
> >
> > (And if we're really serious about it, then even something like a
> > BUILD_BUG_ON().)
> >
> 
> The test case provided to me for which the user was seeing ENOMEMs was
> io_uring_setup() with 64 entries (nothing else).
> 
> If I understand rings_size() calculations correctly then the 0 order
> allocation was requested in io_mem_alloc().
> 
> For order > PAGE_ALLOC_COSTLY_ORDER, maybe we can use
> __GFP_RETRY_MAYFAIL. It will at least do more aggressive reclaim
> though I think that is a separate discussion. For this issue, we are
> seeing ENOMEMs even for order 0 allocations.
> 

Ah, gotcha, thanks for the background.  IIUC, io_uring_setup() can be done 
with anything with CAP_SYS_NICE so my only concern would be whether this 
could be used maliciously on a system not using memcg, but in that case we 
can already fork many small processes that consume all memory and oom kill 
everything else on the system already.

Shakeel Butt Feb. 5, 2022, 6:32 a.m. UTC | #4

On Mon, Jan 24, 2022 at 9:17 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On an overcommitted system which is running multiple workloads of
> varying priorities, it is preferred to trigger an oom-killer to kill a
> low priority workload than to let the high priority workload receiving
> ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
> ENOMEMs instead of oom-kills because io_uring_setup callchain is using
> __GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
> allow the oom-killer to kill a lower priority job.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Jens, any comments or concerns on this patch?

> ---
>  fs/io_uring.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e54c4127422e..d9eeb202363c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8928,10 +8928,9 @@ static void io_mem_free(void *ptr)
>
>  static void *io_mem_alloc(size_t size)
>  {
> -       gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
> -                               __GFP_NORETRY | __GFP_ACCOUNT;
> +       gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
>
> -       return (void *) __get_free_pages(gfp_flags, get_order(size));
> +       return (void *) __get_free_pages(gfp, get_order(size));
>  }
>
>  static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
> --
> 2.35.0.rc0.227.g00780c9af4-goog
>

Jens Axboe Feb. 7, 2022, 3:44 p.m. UTC | #5

On Mon, 24 Jan 2022 21:17:36 -0800, Shakeel Butt wrote:
> On an overcommitted system which is running multiple workloads of
> varying priorities, it is preferred to trigger an oom-killer to kill a
> low priority workload than to let the high priority workload receiving
> ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
> ENOMEMs instead of oom-kills because io_uring_setup callchain is using
> __GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
> allow the oom-killer to kill a lower priority job.
> 
> [...]

Applied, thanks!

[1/1] mm: io_uring: allow oom-killer from io_uring_setup
      commit: 0a3f1e0beacf6cc8ae5f846b0641c1df476e83d6

Best regards,

mm: io_uring: allow oom-killer from io_uring_setup

Commit Message

Comments

Patch