io_uring: avoid page allocation warnings

Message ID 20190430132405.8268-1-mark.rutland@arm.com (mailing list archive)
State New, archived
Series io_uring: avoid page allocation warnings

Commit Message

Mark Rutland April 30, 2019, 1:24 p.m. UTC
In io_sqe_buffer_register() we allocate a number of arrays based on the
iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
we can still attempt to allocate arrays exceeding MAX_ORDER.

On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
iov_len = SZ_1G, we'll calculate that nr_pages = 262145. The corresponding
array of (16-byte) bio_vecs then requires 4194320 bytes, which is greater
than 4MiB. This triggers a SLUB warning that we're trying to allocate an
order greater than MAX_ORDER, and the allocation fails.

Avoid this by passing __GFP_NOWARN when allocating arrays for the
user-provided iov_len. We'll gracefully handle the failed allocation,
returning -ENOMEM to userspace.

We should probably consider lowering the limit below SZ_1G, or reworking
the array allocations.
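
For reference, the arithmetic above can be reproduced with a small
userspace sketch (a minimal sketch assuming 4KiB pages and a 16-byte
struct bio_vec, as in the example; the rounding mirrors counting the
partially-covered first and last pages):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Values from the example above. */
	const uint64_t page_size = 4096;		/* 4KiB pages */
	const uint64_t bvec_size = 16;			/* sizeof(struct bio_vec) on 64-bit */
	const uint64_t iov_base  = 0x10;
	const uint64_t iov_len   = 1ULL << 30;		/* SZ_1G */

	/* The first and last pages are partially covered, so round outwards. */
	uint64_t first    = iov_base / page_size;
	uint64_t last     = (iov_base + iov_len + page_size - 1) / page_size;
	uint64_t nr_pages = last - first;

	printf("nr_pages   = %llu\n", (unsigned long long)nr_pages);
	printf("bvec array = %llu bytes\n",
	       (unsigned long long)(nr_pages * bvec_size));
	/* Prints 262145 and 4194320: just over the 4MiB (order-10) limit. */
	return 0;
}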

Full splat from before this patch:

WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x110/0x190 lib/dump_stack.c:113
 panic+0x384/0x68c kernel/panic.c:214
 __warn+0x2bc/0x2c0 kernel/panic.c:571
 report_bug+0x228/0x2d8 lib/bug.c:186
 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
 brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
 el1_dbg+0x18/0x8c
 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
 alloc_pages include/linux/gfp.h:509 [inline]
 kmalloc_order+0x20/0x50 mm/slab_common.c:1231
 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
 kmalloc_large include/linux/slab.h:480 [inline]
 __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
 kmalloc_array include/linux/slab.h:670 [inline]
 io_sqe_buffer_register fs/io_uring.c:2472 [inline]
 __io_uring_register fs/io_uring.c:2962 [inline]
 __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
 __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
 __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
 el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/io_uring.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Matthew Wilcox April 30, 2019, 2:18 p.m. UTC | #1
On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> In io_sqe_buffer_register() we allocate a number of arrays based on the
> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> we can still attempt to allocate arrays exceeding MAX_ORDER.
> 
> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
> bytes, which is greater than 4MiB. This results in SLUB warning that
> we're trying to allocate greater than MAX_ORDER, and failing the
> allocation.
> 
> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> user-provided iov_len. We'll gracefully handle the failed allocation,
> returning -ENOMEM to userspace.
> 
> We should probably consider lowering the limit below SZ_1G, or reworking
> the array allocations.

I'd suggest that kvmalloc is probably our friend here ... we don't really
want to return -ENOMEM to userspace for this case, I don't think.
Mark Rutland April 30, 2019, 2:59 p.m. UTC | #2
On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> > In io_sqe_buffer_register() we allocate a number of arrays based on the
> > iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> > we can still attempt to allocate arrays exceeding MAX_ORDER.
> > 
> > On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> > iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> > allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
> > bytes, which is greater than 4MiB. This results in SLUB warning that
> > we're trying to allocate greater than MAX_ORDER, and failing the
> > allocation.
> > 
> > Avoid this by passing __GFP_NOWARN when allocating arrays for the
> > user-provided iov_len. We'll gracefully handle the failed allocation,
> > returning -ENOMEM to userspace.
> > 
> > We should probably consider lowering the limit below SZ_1G, or reworking
> > the array allocations.
> 
> I'd suggest that kvmalloc is probably our friend here ... we don't really
> want to return -ENOMEM to userspace for this case, I don't think.

Sure. I'll go verify that the uring code doesn't assume this memory is
physically contiguous.

I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
GFP_KERNEL.
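
(For illustration only, a minimal sketch of the flag combination I mean;
the helper name here is hypothetical and this is not from any posted
patch:)

#include <linux/slab.h>
#include <linux/bvec.h>

/*
 * Hypothetical helper: allocate a per-buffer bio_vec array with the
 * accounted, no-warn flags, leaving the caller to handle failure by
 * returning -ENOMEM as before.
 */
static struct bio_vec *io_alloc_bvec_array(unsigned int nr_pages)
{
	return kmalloc_array(nr_pages, sizeof(struct bio_vec),
			     GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
}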

Thanks,
Mark.
Jens Axboe April 30, 2019, 4:21 p.m. UTC | #3
On 4/30/19 8:59 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>
>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>> we're trying to allocate greater than MAX_ORDER, and failing the
>>> allocation.
>>>
>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>> returning -ENOMEM to userspace.
>>>
>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>> the array allocations.
>>
>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>> want to return -ENOMEM to userspace for this case, I don't think.
> 
> Sure. I'll go verify that the uring code doesn't assume this memory is
> physically contiguous.
> 
> I also guess we should be passing GFP_KERNEL_ACCOUNT rateh than a plain
> GFP_KERNEL.

kvmalloc() is fine, the io_uring code doesn't care about the layout of
the memory, it just uses it as an index.
Mark Rutland April 30, 2019, 5:03 p.m. UTC | #4
On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> On 4/30/19 8:59 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>
> >>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> >>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> >>> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
> >>> bytes, which is greater than 4MiB. This results in SLUB warning that
> >>> we're trying to allocate greater than MAX_ORDER, and failing the
> >>> allocation.
> >>>
> >>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>> returning -ENOMEM to userspace.
> >>>
> >>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>> the array allocations.
> >>
> >> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >> want to return -ENOMEM to userspace for this case, I don't think.
> > 
> > Sure. I'll go verify that the uring code doesn't assume this memory is
> > physically contiguous.
> > 
> > I also guess we should be passing GFP_KERNEL_ACCOUNT rateh than a plain
> > GFP_KERNEL.
> 
> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> the memory, it just uses it as an index.

I've just had a go at that, but when using kvmalloc() with or without
GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
syzkaller prog below:

----
Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
r0 = io_uring_setup(0x378, &(0x7f00000000c0))
sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
----

... I'm a bit worried that this opens up a trivial DoS.

Thoughts?

Thanks,
Mark.
Jens Axboe April 30, 2019, 6:11 p.m. UTC | #5
On 4/30/19 11:03 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>
>>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>>>> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
>>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>>>> we're trying to allocate greater than MAX_ORDER, and failing the
>>>>> allocation.
>>>>>
>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>> returning -ENOMEM to userspace.
>>>>>
>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>> the array allocations.
>>>>
>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>
>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>> physically contiguous.
>>>
>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rateh than a plain
>>> GFP_KERNEL.
>>
>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>> the memory, it just uses it as an index.
> 
> I've just had a go at that, but when using kvmalloc() with or without
> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> syzkaller prog below:
> 
> ----
> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> ----
> 
> ... I'm a bit worried that opens up a trivial DoS.
> 
> Thoughts?

Can you post the patch you used?
Mark Rutland May 1, 2019, 10:30 a.m. UTC | #6
On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> On 4/30/19 11:03 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> >> On 4/30/19 8:59 AM, Mark Rutland wrote:
> >>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>>>
> >>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> >>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> >>>>> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
> >>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
> >>>>> we're trying to allocate greater than MAX_ORDER, and failing the
> >>>>> allocation.
> >>>>>
> >>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>>>> returning -ENOMEM to userspace.
> >>>>>
> >>>>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>>>> the array allocations.
> >>>>
> >>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >>>> want to return -ENOMEM to userspace for this case, I don't think.
> >>>
> >>> Sure. I'll go verify that the uring code doesn't assume this memory is
> >>> physically contiguous.
> >>>
> >>> I also guess we should be passing GFP_KERNEL_ACCOUNT rateh than a plain
> >>> GFP_KERNEL.
> >>
> >> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> >> the memory, it just uses it as an index.
> > 
> > I've just had a go at that, but when using kvmalloc() with or without
> > GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> > syzkaller prog below:
> > 
> > ----
> > Syzkaller reproducer:
> > # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> > r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> > sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> > io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> > ----
> > 
> > ... I'm a bit worried that opens up a trivial DoS.
> > 
> > Thoughts?
> 
> Can you post the patch you used?

Diff below.

Mark.

---->8----
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 578b5494f9e5..c6dad90fab14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2364,7 +2364,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2456,9 +2456,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		if (!pages || nr_pages > got_pages) {
 			kfree(vmas);
 			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2470,7 +2470,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2531,12 +2531,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ctx->nr_user_bufs++;
 	}
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
Jens Axboe May 1, 2019, 12:41 p.m. UTC | #7
On 5/1/19 4:30 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>>>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>>>
>>>>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>>>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>>>>>> allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
>>>>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>>>>>> we're trying to allocate greater than MAX_ORDER, and failing the
>>>>>>> allocation.
>>>>>>>
>>>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>>>> returning -ENOMEM to userspace.
>>>>>>>
>>>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>>>> the array allocations.
>>>>>>
>>>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>>>
>>>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>>>> physically contiguous.
>>>>>
>>>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rateh than a plain
>>>>> GFP_KERNEL.
>>>>
>>>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>>>> the memory, it just uses it as an index.
>>>
>>> I've just had a go at that, but when using kvmalloc() with or without
>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>> syzkaller prog below:
>>>
>>> ----
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>> ----
>>>
>>> ... I'm a bit worried that opens up a trivial DoS.
>>>
>>> Thoughts?
>>
>> Can you post the patch you used?
> 
> Diff below.

And the reproducer, that was never posted. Patch looks fine to me. Note
that buffer registration is under the protection of RLIMIT_MEMLOCK.
That's usually very limited for non-root, as root you can of course
consume as much as you want and OOM the system.
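
(As a point of reference, a minimal userspace sketch for inspecting that
limit; the actual soft/hard values depend on the distribution's defaults:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	/* Registered buffers are charged against RLIMIT_MEMLOCK, which is
	 * usually small for unprivileged users.
	 */
	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}

	printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}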
Mark Rutland May 1, 2019, 3:09 p.m. UTC | #8
On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
> On 5/1/19 4:30 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> >> On 4/30/19 11:03 AM, Mark Rutland wrote:
> >>> I've just had a go at that, but when using kvmalloc() with or without
> >>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> >>> syzkaller prog below:
> >>>
> >>> ----
> >>> Syzkaller reproducer:
> >>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> >>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> >>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> >>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> >>> ----
> >>>
> >>> ... I'm a bit worried that opens up a trivial DoS.
> >>>
> >>> Thoughts?
> >>
> >> Can you post the patch you used?
> > 
> > Diff below.
> 
> And the reproducer, that was never posted.

It was; the "Syzkaller reproducer" above is the reproducer I used with
syz-repro.

I've manually minimized that to C below. AFAICT, that hits a leak, which
is what's triggering the OOM after the program is run a number of times
with the previously posted kvmalloc patch.

Per /proc/meminfo, that memory isn't accounted anywhere.

> Patch looks fine to me. Note
> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> That's usually very limited for non-root, as root you can of course
> consume as much as you want and OOM the system.

Sure.

As above, it looks like there's a leak, regardless.

Thanks,
Mark.

---->8----
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/uio.h>

// NOTE: arm64 syscall numbers
#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

#define __IORING_REGISTER_BUFFERS         0
	
struct __io_sqring_offsets {
	uint32_t head;
	uint32_t tail;
	uint32_t ring_mask;
	uint32_t ring_entries;
	uint32_t flags;
	uint32_t dropped;
	uint32_t array;
	uint32_t resv1;
	uint64_t resv2;
};

struct __io_uring_params {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t flags;
	uint32_t sq_thread_cpu;
	uint32_t sq_thread_idle;
	uint32_t resv[5];
	struct __io_sqring_offsets sq_off;
	struct __io_sqring_offsets cq_off;

};

static struct __io_uring_params params; 

static struct iovec iov = {
	.iov_base = (void *)0x10,
	.iov_len  = 1024 * 1024 * 1024,
};

int main(void)
{
	int fd;

	fd = syscall(__NR_io_uring_setup, 0x1, &params);
	syscall(__NR_io_uring_register, fd, __IORING_REGISTER_BUFFERS, &iov, 1);

	return 0;
}
Jens Axboe May 1, 2019, 3:29 p.m. UTC | #9
On 5/1/19 9:09 AM, Mark Rutland wrote:
> On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
>> On 5/1/19 4:30 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>>>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>>>> I've just had a go at that, but when using kvmalloc() with or without
>>>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>>>> syzkaller prog below:
>>>>>
>>>>> ----
>>>>> Syzkaller reproducer:
>>>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>>>> ----
>>>>>
>>>>> ... I'm a bit worried that opens up a trivial DoS.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Can you post the patch you used?
>>>
>>> Diff below.
>>
>> And the reproducer, that was never posted.
> 
> It was; the "Syzakller reproducer" above is the reproducer I used with
> syz-repro.
> 
> I've manually minimized that to C below. AFAICT, that hits a leak, which
> is what's triggering the OOM after the program is run a number of times
> with the previously posted kvmalloc patch.
> 
> Per /proc/meminfo, that memory isn't accounted anywhere.
> 
>> Patch looks fine to me. Note
>> that buffer registration is under the protection of RLIMIT_MEMLOCK.
>> That's usually very limited for non-root, as root you can of course
>> consume as much as you want and OOM the system.
> 
> Sure.
> 
> As above, it looks like there's a leak, regardless.

The leak is that we're not releasing imu->bvec in case of error. I fixed
a missing kfree -> kvfree as well in your patch; with this rolled-up
version it works for me.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18cecb6a0151..3e817d40fb96 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2443,7 +2443,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2533,11 +2533,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ret = 0;
 		if (!pages || nr_pages > got_pages) {
-			kfree(vmas);
-			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			kvfree(vmas);
+			kvfree(pages);
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2549,7 +2549,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2588,6 +2588,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
+			kvfree(imu->bvec);
 			goto err;
 		}
 
@@ -2610,12 +2611,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ctx->nr_user_bufs++;
 	}
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
Mark Rutland May 1, 2019, 3:55 p.m. UTC | #10
On Wed, May 01, 2019 at 09:29:25AM -0600, Jens Axboe wrote:
> On 5/1/19 9:09 AM, Mark Rutland wrote:
> > I've manually minimized that to C below. AFAICT, that hits a leak, which
> > is what's triggering the OOM after the program is run a number of times
> > with the previously posted kvmalloc patch.
> > 
> > Per /proc/meminfo, that memory isn't accounted anywhere.
> > 
> >> Patch looks fine to me. Note
> >> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> >> That's usually very limited for non-root, as root you can of course
> >> consume as much as you want and OOM the system.
> > 
> > Sure.
> > 
> > As above, it looks like there's a leak, regardless.
> 
> The leak is that we're not releasing imu->bvec in case of error. I fixed
> a missing kfree -> kvfree as well in your patch, with this rolled up
> version it works for me.

That works for me too.

I'll fold that into v2, and send that out momentarily.

Thanks,
Mark.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 48fa6e86bfd6..25fc8cb56fc5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2453,10 +2453,10 @@  static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			kfree(vmas);
 			kfree(pages);
 			pages = kmalloc_array(nr_pages, sizeof(struct page *),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 			vmas = kmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
-					GFP_KERNEL);
+					GFP_KERNEL | __GFP_NOWARN);
 			if (!pages || !vmas) {
 				ret = -ENOMEM;
 				if (ctx->account_mem)
@@ -2467,7 +2467,7 @@  static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		}
 
 		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
 			if (ctx->account_mem)