| Message ID | 20190430132405.8268-1-mark.rutland@arm.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | io_uring: avoid page allocation warnings |
On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> In io_sqe_buffer_register() we allocate a number of arrays based on the
> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> we can still attempt to allocate arrays exceeding MAX_ORDER.
>
> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> allocate a corresponding array of (16-byte) bio_vecs, we'll require
> 4194320 bytes, which is greater than 4MiB. This results in SLUB warning
> that we're trying to allocate greater than MAX_ORDER, and failing the
> allocation.
>
> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> user-provided iov_len. We'll gracefully handle the failed allocation,
> returning -ENOMEM to userspace.
>
> We should probably consider lowering the limit below SZ_1G, or reworking
> the array allocations.

I'd suggest that kvmalloc is probably our friend here ... we don't really
want to return -ENOMEM to userspace for this case, I don't think.
On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> > [...]
> > We should probably consider lowering the limit below SZ_1G, or reworking
> > the array allocations.
>
> I'd suggest that kvmalloc is probably our friend here ... we don't really
> want to return -ENOMEM to userspace for this case, I don't think.

Sure. I'll go verify that the uring code doesn't assume this memory is
physically contiguous.

I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
GFP_KERNEL.

Thanks,
Mark.
On 4/30/19 8:59 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>> [...]
>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>> want to return -ENOMEM to userspace for this case, I don't think.
>
> Sure. I'll go verify that the uring code doesn't assume this memory is
> physically contiguous.
>
> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> GFP_KERNEL.

kvmalloc() is fine, the io_uring code doesn't care about the layout of
the memory, it just uses it as an index.
On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> On 4/30/19 8:59 AM, Mark Rutland wrote:
> > [...]
>
> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> the memory, it just uses it as an index.

I've just had a go at that, but when using kvmalloc() with or without
GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds
with the syzkaller prog below:

----
Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
r0 = io_uring_setup(0x378, &(0x7f00000000c0))
sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
----

... I'm a bit worried that opens up a trivial DoS.

Thoughts?

Thanks,
Mark.
On 4/30/19 11:03 AM, Mark Rutland wrote:
> [...]
> I've just had a go at that, but when using kvmalloc() with or without
> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds
> with the syzkaller prog below:
> [...]
> ... I'm a bit worried that opens up a trivial DoS.
>
> Thoughts?

Can you post the patch you used?
On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> [...]
> Can you post the patch you used?

Diff below.

Mark.

---->8----
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 578b5494f9e5..c6dad90fab14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2364,7 +2364,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2456,9 +2456,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		if (!pages || nr_pages > got_pages) {
 			kfree(vmas);
 			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2470,7 +2470,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2531,12 +2531,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		ctx->nr_user_bufs++;
 	}
 
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
On 5/1/19 4:30 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>> [...]
>> Can you post the patch you used?
>
> Diff below.

And the reproducer, that was never posted.

Patch looks fine to me. Note that buffer registration is under the
protection of RLIMIT_MEMLOCK. That's usually very limited for non-root;
as root you can of course consume as much as you want and OOM the
system.
On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
> On 5/1/19 4:30 AM, Mark Rutland wrote:
> > [...]
> > Diff below.
>
> And the reproducer, that was never posted.

It was; the "Syzkaller reproducer" above is the reproducer I used with
syz-repro.

I've manually minimized that to C below. AFAICT, that hits a leak, which
is what's triggering the OOM after the program is run a number of times
with the previously posted kvmalloc patch.

Per /proc/meminfo, that memory isn't accounted anywhere.

> Patch looks fine to me. Note that buffer registration is under the
> protection of RLIMIT_MEMLOCK. That's usually very limited for non-root;
> as root you can of course consume as much as you want and OOM the
> system.

Sure.

As above, it looks like there's a leak, regardless.

Thanks,
Mark.
---->8----
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/uio.h>

// NOTE: arm64 syscall numbers
#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

#define __IORING_REGISTER_BUFFERS 0

struct __io_sqring_offsets {
	uint32_t head;
	uint32_t tail;
	uint32_t ring_mask;
	uint32_t ring_entries;
	uint32_t flags;
	uint32_t dropped;
	uint32_t array;
	uint32_t resv1;
	uint64_t resv2;
};

struct __io_uring_params {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t flags;
	uint32_t sq_thread_cpu;
	uint32_t sq_thread_idle;
	uint32_t resv[5];
	struct __io_sqring_offsets sq_off;
	struct __io_sqring_offsets cq_off;
};

static struct __io_uring_params params;

static struct iovec iov = {
	.iov_base = (void *)0x10,
	.iov_len = 1024 * 1024 * 1024,
};

int main(void)
{
	int fd;

	fd = syscall(__NR_io_uring_setup, 0x1, &params);
	syscall(__NR_io_uring_register, fd, __IORING_REGISTER_BUFFERS,
		&iov, 1);

	return 0;
}
On 5/1/19 9:09 AM, Mark Rutland wrote:
> On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
>> [...]
>> And the reproducer, that was never posted.
>
> It was; the "Syzkaller reproducer" above is the reproducer I used with
> syz-repro.
>
> I've manually minimized that to C below. AFAICT, that hits a leak, which
> is what's triggering the OOM after the program is run a number of times
> with the previously posted kvmalloc patch.
>
> Per /proc/meminfo, that memory isn't accounted anywhere.
> [...]
> As above, it looks like there's a leak, regardless.

The leak is that we're not releasing imu->bvec in case of error. I fixed
a missing kfree -> kvfree as well in your patch; with this rolled up
version it works for me.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18cecb6a0151..3e817d40fb96 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2443,7 +2443,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2533,11 +2533,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		ret = 0;
 		if (!pages || nr_pages > got_pages) {
-			kfree(vmas);
-			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			kvfree(vmas);
+			kvfree(pages);
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2549,7 +2549,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2588,6 +2588,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
+			kvfree(imu->bvec);
 			goto err;
 		}
 
@@ -2610,12 +2611,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		ctx->nr_user_bufs++;
 	}
 
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
On Wed, May 01, 2019 at 09:29:25AM -0600, Jens Axboe wrote:
> On 5/1/19 9:09 AM, Mark Rutland wrote:
> > [...]
> > As above, it looks like there's a leak, regardless.
>
> The leak is that we're not releasing imu->bvec in case of error. I fixed
> a missing kfree -> kvfree as well in your patch; with this rolled up
> version it works for me.

That works for me too. I'll fold that into v2, and send that out
momentarily.

Thanks,
Mark.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 48fa6e86bfd6..25fc8cb56fc5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2453,10 +2453,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			kfree(vmas);
 			kfree(pages);
 			pages = kmalloc_array(nr_pages, sizeof(struct page *),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 			vmas = kmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
-					GFP_KERNEL);
+					GFP_KERNEL | __GFP_NOWARN);
 			if (!pages || !vmas) {
 				ret = -ENOMEM;
 				if (ctx->account_mem)
@@ -2467,7 +2467,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		}
 
 		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
 			if (ctx->account_mem)
In io_sqe_buffer_register() we allocate a number of arrays based on the
iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we
can still attempt to allocate arrays exceeding MAX_ORDER.

On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
allocate a corresponding array of (16-byte) bio_vecs, we'll require
4194320 bytes, which is greater than 4MiB. This results in SLUB warning
that we're trying to allocate greater than MAX_ORDER, and failing the
allocation.

Avoid this by passing __GFP_NOWARN when allocating arrays for the
user-provided iov_len. We'll gracefully handle the failed allocation,
returning -ENOMEM to userspace.

We should probably consider lowering the limit below SZ_1G, or reworking
the array allocations.

Full splat from before this patch:

WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x110/0x190 lib/dump_stack.c:113
 panic+0x384/0x68c kernel/panic.c:214
 __warn+0x2bc/0x2c0 kernel/panic.c:571
 report_bug+0x228/0x2d8 lib/bug.c:186
 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
 brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
 el1_dbg+0x18/0x8c
 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
 alloc_pages include/linux/gfp.h:509 [inline]
 kmalloc_order+0x20/0x50 mm/slab_common.c:1231
 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
 kmalloc_large include/linux/slab.h:480 [inline]
 __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
 kmalloc_array include/linux/slab.h:670 [inline]
 io_sqe_buffer_register fs/io_uring.c:2472 [inline]
 __io_uring_register fs/io_uring.c:2962 [inline]
 __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
 __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
 __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
 el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/io_uring.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)