From patchwork Fri Jan 18 16:12:22 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770817
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 14/17] io_uring: add submission polling
Date: Fri, 18 Jan 2019 09:12:22 -0700
Message-Id: <20190118161225.4545-15-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

Proof of concept. If the thread has been idle for 1 second, it will set
sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will then have
to call io_uring_enter() to start things back up again. If IO is kept
busy, that will never be needed. Basically, an application that has this
feature enabled will guard its io_uring_enter(2) call with:

	read_barrier();
	if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
		io_uring_enter(fd, to_submit, 0, 0);

instead of calling it unconditionally.

Improvements:

1) Maybe have smarter backoff. Busy loop for X time, then go to
   monitor/mwait, finally the schedule we have now after an idle
   second. Might not be worth the complexity.

2) Probably want the application to pass in the appropriate grace
   period, not hard code it at 1 second.

Signed-off-by: Jens Axboe
---
 fs/io_uring.c                 | 220 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  10 +-
 2 files changed, 223 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6aaa0bf3648c..cdd9873edfe3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -87,8 +88,10 @@ struct io_ring_ctx {
 
 	/* IO offload */
 	struct workqueue_struct	*sqo_wq;
+	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
 	struct files_struct	*sqo_files;
+	wait_queue_head_t	sqo_wait;
 
 	struct {
 		/* CQ ring */
@@ -264,6 +267,9 @@ static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
 
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
+	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
+	    waitqueue_active(&ctx->sqo_wait))
+		wake_up(&ctx->sqo_wait);
 }
 
 static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
@@ -1102,6 +1108,169 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return false;
 }
 
+static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
+			  unsigned int nr, bool mm_fault)
+{
+	struct io_submit_state state, *statep = NULL;
+	int ret, i, submitted = 0;
+
+	if (nr > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, nr);
+		statep = &state;
+	}
+
+	for (i = 0; i < nr; i++) {
+		if (unlikely(mm_fault))
+			ret = -EFAULT;
+		else
+			ret = io_submit_sqe(ctx, &sqes[i], statep);
+		if (!ret) {
+			submitted++;
+			continue;
+		}
+
+		io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0);
+	}
+
+	if (statep)
+		io_submit_state_end(&state);
+
+	return submitted;
+}
+
+static int io_sq_thread(void *data)
+{
+	struct sqe_submit sqes[IO_IOPOLL_BATCH];
+	struct io_ring_ctx *ctx = data;
+	struct mm_struct *cur_mm = NULL;
+	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	DEFINE_WAIT(wait);
+	unsigned inflight;
+	unsigned long timeout;
+
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+
+	old_fs = get_fs();
+	set_fs(USER_DS);
+
+	timeout = inflight = 0;
+	while (!kthread_should_stop()) {
+		bool all_fixed, mm_fault = false;
+		int i;
+
+		if (inflight) {
+			unsigned int nr_events = 0;
+
+			/*
+			 * Normal IO, just pretend everything completed.
+			 * We don't have to poll completions for that.
+			 */
+			if (ctx->flags & IORING_SETUP_IOPOLL) {
+				/*
+				 * App should not use IORING_ENTER_GETEVENTS
+				 * with thread polling, but if it does, then
+				 * ensure we are mutually exclusive.
+				 */
+				if (mutex_trylock(&ctx->uring_lock)) {
+					io_iopoll_check(ctx, &nr_events, 0);
+					mutex_unlock(&ctx->uring_lock);
+				}
+			} else {
+				nr_events = inflight;
+			}
+
+			inflight -= nr_events;
+			if (!inflight)
+				timeout = jiffies + HZ;
+		}
+
+		if (!io_get_sqring(ctx, &sqes[0])) {
+			/*
+			 * We're polling, let us spin for a second without
+			 * work before going to sleep.
+			 */
+			if (inflight || !time_after(jiffies, timeout)) {
+				cpu_relax();
+				continue;
+			}
+
+			/*
+			 * Drop cur_mm before scheduling, we can't hold it for
+			 * long periods (or over schedule()). Do this before
+			 * adding ourselves to the waitqueue, as the unuse/drop
+			 * may sleep.
+			 */
+			if (cur_mm) {
+				unuse_mm(cur_mm);
+				mmput(cur_mm);
+				cur_mm = NULL;
+			}
+
+			prepare_to_wait(&ctx->sqo_wait, &wait,
+						TASK_INTERRUPTIBLE);
+
+			/* Tell userspace we may need a wakeup call */
+			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+
+			if (!io_get_sqring(ctx, &sqes[0])) {
+				if (kthread_should_park())
+					kthread_parkme();
+				if (kthread_should_stop()) {
+					finish_wait(&ctx->sqo_wait, &wait);
+					break;
+				}
+				if (signal_pending(current))
+					flush_signals(current);
+				schedule();
+				finish_wait(&ctx->sqo_wait, &wait);
+
+				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+				smp_wmb();
+				continue;
+			}
+			finish_wait(&ctx->sqo_wait, &wait);
+
+			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+		}
+
+		i = 0;
+		all_fixed = true;
+		do {
+			if (sqes[i].sqe->opcode != IORING_OP_READ_FIXED &&
+			    sqes[i].sqe->opcode != IORING_OP_WRITE_FIXED)
+				all_fixed = false;
+
+			i++;
+			if (i == ARRAY_SIZE(sqes))
+				break;
+		} while (io_get_sqring(ctx, &sqes[i]));
+
+		io_commit_sqring(ctx);
+
+		/* Unless all new commands are FIXED regions, grab mm */
+		if (!all_fixed && !cur_mm) {
+			mm_fault = !mmget_not_zero(ctx->sqo_mm);
+			if (!mm_fault) {
+				use_mm(ctx->sqo_mm);
+				cur_mm = ctx->sqo_mm;
+			}
+		}
+
+		inflight += io_submit_sqes(ctx, sqes, i, mm_fault);
+	}
+	current->files = old_files;
+	set_fs(old_fs);
+	if (cur_mm) {
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
+	}
+	return 0;
+}
+
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
@@ -1175,9 +1344,14 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 	int ret = 0;
 
 	if (to_submit) {
-		ret = io_ring_submit(ctx, to_submit);
-		if (ret < 0)
-			return ret;
+		if (ctx->flags & IORING_SETUP_SQPOLL) {
+			wake_up(&ctx->sqo_wait);
+			ret = to_submit;
+		} else {
+			ret = io_ring_submit(ctx, to_submit);
+			if (ret < 0)
+				return ret;
+		}
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		unsigned nr_events = 0;
@@ -1250,10 +1424,12 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-static int io_sq_offload_start(struct io_ring_ctx *ctx)
+static int io_sq_offload_start(struct io_ring_ctx *ctx,
+			       struct io_uring_params *p)
 {
 	int ret;
 
+	init_waitqueue_head(&ctx->sqo_wait);
 	ctx->sqo_mm = current->mm;
 
 	/*
@@ -1266,6 +1442,27 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 	if (!ctx->sqo_files)
 		goto err;
 
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (p->flags & IORING_SETUP_SQ_AFF) {
+			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
+							ctx, p->sq_thread_cpu,
+							"io_uring-sq");
+		} else {
+			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+							"io_uring-sq");
+		}
+		if (IS_ERR(ctx->sqo_thread)) {
+			ret = PTR_ERR(ctx->sqo_thread);
+			ctx->sqo_thread = NULL;
+			goto err;
+		}
+		wake_up_process(ctx->sqo_thread);
+	} else if (p->flags & IORING_SETUP_SQ_AFF) {
+		/* Can't have SQ_AFF without SQPOLL */
+		ret = -EINVAL;
+		goto err;
+	}
+
 	/* Do QD, or 2 * CPUS, whatever is smallest */
 	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
@@ -1276,6 +1473,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 	return 0;
 err:
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	if (ctx->sqo_files)
 		ctx->sqo_files = NULL;
 	ctx->sqo_mm = NULL;
@@ -1284,6 +1486,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 static void io_sq_offload_stop(struct io_ring_ctx *ctx)
 {
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	if (ctx->sqo_wq) {
 		destroy_workqueue(ctx->sqo_wq);
 		ctx->sqo_wq = NULL;
@@ -1772,7 +1979,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
-	ret = io_sq_offload_start(ctx);
+	ret = io_sq_offload_start(ctx, p);
 	if (ret)
 		goto err;
 
@@ -1807,7 +2014,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params,
 			return -EINVAL;
 	}
 
-	if (p.flags & ~IORING_SETUP_IOPOLL)
+	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
+			IORING_SETUP_SQ_AFF))
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p, compat);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 8323320077ec..37c7402be9ca 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -44,6 +44,8 @@ struct io_uring_sqe {
  * io_uring_setup() flags
  */
 #define IORING_SETUP_IOPOLL	(1 << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1 << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1 << 2)	/* sq_thread_cpu is valid */
 
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
@@ -87,6 +89,11 @@ struct io_sqring_offsets {
 	__u32 resv[3];
 };
 
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1 << 0)	/* needs io_uring_enter wakeup */
+
 struct io_cqring_offsets {
 	__u32 head;
 	__u32 tail;
@@ -109,7 +116,8 @@ struct io_uring_params {
 	__u32 sq_entries;
 	__u32 cq_entries;
 	__u32 flags;
-	__u16 resv[10];
+	__u16 sq_thread_cpu;
+	__u16 resv[9];
 	struct io_sqring_offsets sq_off;
 	struct io_cqring_offsets cq_off;
 };
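
The setup flag rules introduced by the hunks above (only IOPOLL, SQPOLL
and SQ_AFF are accepted, and SQ_AFF is rejected without SQPOLL) can be
summarized in a standalone sketch. This is an illustration, not kernel
code: check_setup_flags() is a hypothetical helper, the SETUP_* macros
are local copies of the values defined in this patch, and in the patch
itself the two checks live in different functions (io_uring_setup() and
io_sq_offload_start()).

```c
#include <assert.h>
#include <errno.h>

/* Local copies of the io_uring_setup() flag bits from this patch. */
#define SETUP_IOPOLL (1U << 0)	/* io_context is polled */
#define SETUP_SQPOLL (1U << 1)	/* SQ poll thread */
#define SETUP_SQ_AFF (1U << 2)	/* sq_thread_cpu is valid */

/* The three flags must occupy distinct bits for the mask test below to work. */
static_assert((SETUP_IOPOLL ^ SETUP_SQPOLL ^ SETUP_SQ_AFF) == 0x7,
	      "setup flags must be distinct bits");

/*
 * Hypothetical helper combining both checks from the patch.
 * Returns 0 if the flag combination is accepted, -EINVAL otherwise.
 */
static int check_setup_flags(unsigned int flags)
{
	/* Unknown bits are rejected, as in io_uring_setup(). */
	if (flags & ~(SETUP_IOPOLL | SETUP_SQPOLL | SETUP_SQ_AFF))
		return -EINVAL;

	/* Can't have SQ_AFF without SQPOLL, as in io_sq_offload_start(). */
	if ((flags & SETUP_SQ_AFF) && !(flags & SETUP_SQPOLL))
		return -EINVAL;

	return 0;
}
```

Under these rules, SQPOLL alone and SQPOLL combined with SQ_AFF pass,
while SQ_AFF on its own or any bit above (1 << 2) is rejected.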