Message ID | 150a0b4905f1d7274b4c2c7f5e3f4d8df5dda1d7.1452549431.git.bcrl@kvack.org (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Another blocking operation used by applications that want aio
> functionality is that of opening files that are not resident in memory.
> Using the thread based aio helper, add support for IOCB_CMD_OPENAT.

So I think this is ridiculously ugly.

AIO is a horrible ad-hoc design, with the main excuse being "other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it".

But AIO was always really really ugly.

Now you introduce the notion of doing almost arbitrary system calls asynchronously in threads, but then you use that ass-backwards nasty interface to do so.

Why?

If you want to do arbitrary asynchronous system calls, just *do* it. But do _that_, not "let's extend this horrible interface in arbitrary random ways one special system call at a time".

In other words, why is the interface not simply: "do arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread".

That's something that a lot of people might use. In fact, if they can avoid the nasty AIO interface, maybe they'll even use it for things like read() and write().

So I really think it would be a nice thing to allow some kind of arbitrary "queue up asynchronous system call" model.

But I do not think the AIO model should be the model used for that, even if I think there might be some shared infrastructure.

So I would seriously suggest:

 - how about we add a true "asynchronous system call" interface

 - make it be a list of system calls with a futex completion for each
   list entry, so that you can easily wait for the end result that way.

 - maybe (and this is where it gets really iffy) you could even pass
   in the result of one system call to the next, so that you can do
   things like

        fd = openat(..)
        ret = read(fd, ..)

   asynchronously and then just wait for the read() to complete.

and let us *not* tie this to the aio interface.

In fact, if we do it well, we can go the other way, and try to implement the nasty AIO interface on top of the generic "just do things asynchronously".

And I actually think many of your kernel thread parts are good for a generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic all makes sense, although I do suspect you could just make it unconditional. The cost of a few atomics shouldn't be excessive when we're talking "use a thread to do op X".

What do you think? Do you think it might be possible to aim for a generic "do system call asynchronously" model instead?

I'm adding Ingo to the cc, because I think Ingo had a "run this list of system calls" patch at one point - in order to avoid system call overhead. I don't think that was very interesting (because system call overhead is seldom all that noticeable for any interesting system calls), but with the "let's do the list asynchronously" addition it might be much more intriguing. Ingo, do I remember correctly that it was you? I might be confused about who wrote that patch, and I can't find it now.

              Linus
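To make the shape of that proposal concrete, here is a minimal userspace sketch. struct async_syscall, its layout, and async_submit() are hypothetical names invented purely for illustration -- no such interface exists; only the futex wait uses a real syscall:

    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>

    /* Hypothetical submission record: one asynchronous system call. */
    struct async_syscall {
            uint64_t nr;        /* system call number, e.g. SYS_openat     */
            uint64_t args[6];   /* arguments A, B, C, D, ...               */
            int64_t  ret;       /* return value, filled in on completion   */
            uint32_t done;      /* futex word, set and woken on completion */
    };

    /* Hypothetical syscall: hand a list of entries to kernel threads. */
    extern long async_submit(struct async_syscall *list, unsigned int nr_entries);

    int main(void)
    {
            struct async_syscall call = {
                    .nr   = SYS_openat,
                    .args = { (uint64_t)AT_FDCWD,
                              (uint64_t)(uintptr_t)"/etc/hostname", O_RDONLY },
            };

            async_submit(&call, 1);

            /* Each list entry has its own futex completion to wait on. */
            while (__atomic_load_n(&call.done, __ATOMIC_ACQUIRE) == 0)
                    syscall(SYS_futex, &call.done, FUTEX_WAIT, 0, NULL, NULL, 0);

            return call.ret < 0;
    }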
On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
>
> So I think this is ridiculously ugly.
>
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
>
> But AIO was always really really ugly.
>
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.
>
> Why?

Understood, but there are some reasons behind this. The core aio submit mechanism is modeled after the lio_listio() call in POSIX. While the cost of performing syscalls has decreased substantially over the last 10 years, the cost of context switches has not. Some AIO operations really want to do part of the work in the context of the original submitter of the work. That was/is a critical piece of the async readahead functionality in this series -- without being able to do a quick return to the caller when all the cached data is already resident in the kernel, there is a significant performance degradation in my tests. For other operations that are going to do blocking i/o anyway, the cost of the context switch often becomes noise.

The async readahead also fills a hole in the proposed extensions to preadv()/pwritev() -- they need some way to trigger and know when a readahead operation has completed. One needs a completion queue of some sort to figure out which operation has completed in a reasonably efficient manner. The futex doesn't really have the ability to do this.

Thread dispatching is another problem the applications I work on encounter, and AIO helps in this particular area because a thread that is running hot can simply check the AIO event ring buffer for new events in its main event loop. Userspace fundamentally *cannot* do a good job of dispatching work to threads. The code I've seen other developers come up with ends up doing things like epoll() in one thread followed by dispatching the received events to different threads. This ends up making multiple expensive syscalls (since locking and cross-CPU bouncing is required) when the kernel could just direct things to the right thread in the first place. There are a lot of requirements bringing additional complexity that start to surface once you look at how some of these applications are actually written.

> If you want to do arbitrary asynchronous system calls, just *do* it.
> But do _that_, not "let's extend this horrible interface in arbitrary
> random ways one special system call at a time".
>
> In other words, why is the interface not simply: "do arbitrary system
> call X with arguments A, B, C, D asynchronously using a kernel
> thread".

We've had a few proposals to do this, none of which have really managed to tackle all the problems that arose. If we go down this path, we will end up needing a table of what syscalls can actually be performed asynchronously, and flags indicating what bits of context those syscalls require. This does end up looking a bit like how AIO does things, depending on how hard you squint. I'm not opposed to reworking how AIO dispatches things.
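One possible shape for such a per-syscall table, sketched for illustration only -- the table, its entries, and the descriptor struct are hypothetical, though the AIO_THREAD_NEED_* values are the ones defined by the patch later in this thread:

    #include <stdbool.h>
    #include <asm/unistd.h>                 /* __NR_* syscall numbers */

    /* Context-requirement bits, as defined by the patch (fs/aio.c). */
    #define AIO_THREAD_NEED_TASK    0x0001
    #define AIO_THREAD_NEED_FS      0x0002
    #define AIO_THREAD_NEED_FILES   0x0004
    #define AIO_THREAD_NEED_CRED    0x0008

    /* Hypothetical per-syscall descriptor: may it run asynchronously, and
     * which pieces of submitter context must the helper thread borrow? */
    struct async_syscall_desc {
            bool            allowed;
            unsigned int    context_flags;
    };

    static const struct async_syscall_desc async_syscall_table[] = {
            [__NR_openat] = { .allowed = true,
                              .context_flags = AIO_THREAD_NEED_TASK |
                                               AIO_THREAD_NEED_FILES |
                                               AIO_THREAD_NEED_CRED },
            [__NR_fsync]  = { .allowed = true },
            /* syscalls without an entry default to { .allowed = false } */
    };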
If we're willing to relax some constraints (like the hard-enforced limits on the number of AIOs in flight), things can be substantially simplified. Again, worries about things like memory usage today are vastly different than they were back in the early '00s, so the decisions that make sense now will certainly differ from the original design.

Cancellation is also a concern, and it is not something that can be sacrificed. Without some mechanism to cancel operations that are in flight, there is no way for a process to cleanly exit. This patch series nicely proves that signals work very well for cancellation, and fit in with a lot of the code we already have. This implies we would need to treat threads doing async operations differently from normal threads. What happens with the pid namespace?

> That's something that a lot of people might use. In fact, if they can
> avoid the nasty AIO interface, maybe they'll even use it for things
> like read() and write().
>
> So I really think it would be a nice thing to allow some kind of
> arbitrary "queue up asynchronous system call" model.
>
> But I do not think the AIO model should be the model used for that,
> even if I think there might be some shared infrastructure.
>
> So I would seriously suggest:
>
>  - how about we add a true "asynchronous system call" interface
>
>  - make it be a list of system calls with a futex completion for each
>    list entry, so that you can easily wait for the end result that way.
>
>  - maybe (and this is where it gets really iffy) you could even pass
>    in the result of one system call to the next, so that you can do
>    things like
>
>        fd = openat(..)
>        ret = read(fd, ..)
>
>    asynchronously and then just wait for the read() to complete.
>
> and let us *not* tie this to the aio interface.
>
> In fact, if we do it well, we can go the other way, and try to
> implement the nasty AIO interface on top of the generic "just do
> things asynchronously".
>
> And I actually think many of your kernel thread parts are good for a
> generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic
> all makes sense, although I do suspect you could just make it
> unconditional. The cost of a few atomics shouldn't be excessive when
> we're talking "use a thread to do op X".
>
> What do you think? Do you think it might be possible to aim for a
> generic "do system call asynchronously" model instead?

Maybe it's not too bad to do -- the syscall() primitive is reasonably well defined and is supported across architectures, but we're going to need new wrappers for *every* syscall supported. Odds are the work will have to be done incrementally to weed out which syscalls are safe and which are not, but there is certainly no reason we can't reuse syscall numbers and the same argument layout.

Chaining things becomes messy. There are some cases where that works, but at least in the applications I've worked on, there tends to be a fair amount of logic that needs to be run before you can figure out what and where the next operation is. The canonical example I can think of is retrieving data from disk: the first operation is a read into some table to find out where the data is located, the next operation is a search (a binary search in the case I'm thinking of) in the data that was just read to figure out which record actually contains the data the app cares about, followed by a read to fetch the data the user actually requires.
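Written synchronously, that pattern looks roughly like the sketch below (the index and record layout are invented for illustration); the point is the application logic sitting between the two reads, which a static list of chained syscalls cannot express:

    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>

    struct index_entry {
            uint64_t key;
            uint64_t offset;        /* where the record lives on disk */
            uint32_t len;
            uint32_t pad;
    };

    static int cmp_key(const void *k, const void *e)
    {
            uint64_t key = *(const uint64_t *)k;
            const struct index_entry *ent = e;

            return (key > ent->key) - (key < ent->key);
    }

    /* Illustrative only: read the index, search it, then read the record. */
    static ssize_t lookup_record(int fd, uint64_t key, void *out, size_t outlen)
    {
            struct index_entry idx[512];
            ssize_t n = pread(fd, idx, sizeof(idx), 0);             /* 1st i/o */

            if (n <= 0)
                    return -1;

            /* Logic between the two i/os decides what to read next. */
            struct index_entry *hit = bsearch(&key, idx, n / sizeof(*idx),
                                              sizeof(*idx), cmp_key);
            if (!hit)
                    return -1;

            return pread(fd, out, hit->len < outlen ? hit->len : outlen,
                         hit->offset);                              /* 2nd i/o */
    }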
And it gets more complicated: different disk i/os need to be issued with different priorities (something that was not included in what I just posted today, but is work I plan to propose for merging in the future). In some cases the priority is known beforehand, but in other cases it needs to be adjusted dynamically depending on the information fetched (users don't like it if huge i/os completely starve their smaller i/os for significant amounts of time).

> I'm adding Ingo to the cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

I'd certainly be interested in hearing more ideas concerning requirements. Sorry for the giant wall of text... Nothing is simple! =-)

		-ben

> 			Linus
On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
>
> So I think this is ridiculously ugly.
>
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
>
> But AIO was always really really ugly.
>
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.

[ ... ]

> I'm adding Ingo to the cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

Zach Brown and Ingo traded a bunch of ideas. There were chicklets and syslets? After a little searching, it looks like acall was a slightly different iteration, but the patches didn't make it off oss.oracle.com:

https://lwn.net/Articles/316806/

-chris
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> What do you think? Do you think it might be possible to aim for a generic "do
> system call asynchronously" model instead?
>
> I'm adding Ingo to the cc, because I think Ingo had a "run this list of system
> calls" patch at one point - in order to avoid system call overhead. I don't
> think that was very interesting (because system call overhead is seldom all that
> noticeable for any interesting system calls), but with the "let's do the list
> asynchronously" addition it might be much more intriguing. Ingo, do I remember
> correctly that it was you? I might be confused about who wrote that patch, and I
> can't find it now.

Yeah, it was the whole 'syslets' and 'threadlets' stuff - I had both implemented and prototyped into a 'list directory entries asynchronously' testcase. Threadlets was pretty close to what you are suggesting now.

Here's a very good (as usual!) writeup from LWN:

https://lwn.net/Articles/223899/

Thanks,

	Ingo
diff --git a/fs/aio.c b/fs/aio.c
index 4384df4..346786b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,8 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -204,6 +206,9 @@ struct aio_kiocb {
         unsigned long           ki_rlimit_fsize;
         aio_thread_work_fn_t    ki_work_fn;
         struct work_struct      ki_work;
+        struct fs_struct        *ki_fs;
+        struct files_struct     *ki_files;
+        const struct cred       *ki_cred;
 #endif
 };
 
@@ -227,6 +232,7 @@ static const struct address_space_operations aio_ctx_aops;
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 long aio_poll(struct aio_kiocb *iocb);
+long aio_openat(struct aio_kiocb *req);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1496,6 +1502,9 @@ static int aio_thread_queue_iocb_cancel(struct kiocb *kiocb)
 static void aio_thread_fn(struct work_struct *work)
 {
         struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, ki_work);
+        struct files_struct *old_files = current->files;
+        const struct cred *old_cred = current_cred();
+        struct fs_struct *old_fs = current->fs;
         kiocb_cancel_fn *old_cancel;
         long ret;
 
@@ -1503,6 +1512,13 @@ static void aio_thread_fn(struct work_struct *work)
         current->kiocb = &iocb->common;         /* For io_send_sig(). */
         WARN_ON(atomic_read(&current->signal->sigcnt) != 1);
 
+        if (iocb->ki_fs)
+                current->fs = iocb->ki_fs;
+        if (iocb->ki_files)
+                current->files = iocb->ki_files;
+        if (iocb->ki_cred)
+                current->cred = iocb->ki_cred;
+
         /* Check for early stage cancellation and switch to late stage
          * cancellation if it has not already occurred.
          */
@@ -1519,6 +1535,19 @@ static void aio_thread_fn(struct work_struct *work)
                     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
                 ret = -EINTR;
 
+        if (iocb->ki_cred) {
+                current->cred = old_cred;
+                put_cred(iocb->ki_cred);
+        }
+        if (iocb->ki_files) {
+                current->files = old_files;
+                put_files_struct(iocb->ki_files);
+        }
+        if (iocb->ki_fs) {
+                exit_fs(current);
+                current->fs = old_fs;
+        }
+
         /* Completion serializes cancellation by taking ctx_lock, so
          * aio_complete() will not return until after force_sig() in
          * aio_thread_queue_iocb_cancel().  This should ensure that
@@ -1530,6 +1559,9 @@ static void aio_thread_fn(struct work_struct *work)
 }
 
 #define AIO_THREAD_NEED_TASK    0x0001  /* Need aio_kiocb->ki_submit_task */
+#define AIO_THREAD_NEED_FS      0x0002  /* Need aio_kiocb->ki_fs */
+#define AIO_THREAD_NEED_FILES   0x0004  /* Need aio_kiocb->ki_files */
+#define AIO_THREAD_NEED_CRED    0x0008  /* Need aio_kiocb->ki_cred */
 
 /* aio_thread_queue_iocb
  *      Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
@@ -1547,6 +1579,20 @@ static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
                 iocb->ki_submit_task = current;
                 get_task_struct(iocb->ki_submit_task);
         }
+        if (flags & AIO_THREAD_NEED_FS) {
+                struct fs_struct *fs = current->fs;
+
+                iocb->ki_fs = fs;
+                spin_lock(&fs->lock);
+                fs->users++;
+                spin_unlock(&fs->lock);
+        }
+        if (flags & AIO_THREAD_NEED_FILES) {
+                iocb->ki_files = current->files;
+                atomic_inc(&iocb->ki_files->count);
+        }
+        if (flags & AIO_THREAD_NEED_CRED)
+                iocb->ki_cred = get_current_cred();
 
         /* Cancellation needs to be always available for operations performed
          * using helper threads.  Prior to the iocb being assigned to a worker
@@ -1716,22 +1762,54 @@ long aio_poll(struct aio_kiocb *req)
 {
         return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
+
+static long aio_thread_op_openat(struct aio_kiocb *req)
+{
+        u64 buf, offset;
+        long ret;
+        u32 fd;
+
+        use_mm(req->ki_ctx->mm);
+        if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
+                ret = -EFAULT;
+        else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
+                ret = -EFAULT;
+        else if (unlikely(__get_user(offset, &req->ki_user_iocb->aio_offset)))
+                ret = -EFAULT;
+        else {
+                ret = do_sys_open((s32)fd,
+                                  (const char __user *)(long)buf,
+                                  (int)offset,
+                                  (unsigned short)(offset >> 32));
+        }
+        unuse_mm(req->ki_ctx->mm);
+        return ret;
+}
+
+long aio_openat(struct aio_kiocb *req)
+{
+        return aio_thread_queue_iocb(req, aio_thread_op_openat,
+                                     AIO_THREAD_NEED_TASK |
+                                     AIO_THREAD_NEED_FILES |
+                                     AIO_THREAD_NEED_CRED);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
  * aio_run_iocb:
  *      Performs the initial checks and io submission.
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
-                            char __user *buf, size_t len, bool compat)
+static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
+                            bool compat)
 {
         struct file *file = req->common.ki_filp;
         ssize_t ret = -EINVAL;
+        char __user *buf;
         int rw;
         fmode_t mode;
         rw_iter_op *iter_op;
 
-        switch (opcode) {
+        switch (user_iocb->aio_lio_opcode) {
         case IOCB_CMD_PREAD:
         case IOCB_CMD_PREADV:
                 mode = FMODE_READ;
@@ -1768,12 +1846,17 @@ rw_common:
                 if (!iter_op)
                         return -EINVAL;
 
-                if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV)
-                        ret = aio_setup_vectored_rw(rw, buf, len,
+                buf = (char __user *)(unsigned long)user_iocb->aio_buf;
+                if (user_iocb->aio_lio_opcode == IOCB_CMD_PREADV ||
+                    user_iocb->aio_lio_opcode == IOCB_CMD_PWRITEV)
+                        ret = aio_setup_vectored_rw(rw, buf,
+                                                    user_iocb->aio_nbytes,
                                                     &req->ki_iovec, compat,
                                                     &req->ki_iter);
                 else {
-                        ret = import_single_range(rw, buf, len, req->ki_iovec,
+                        ret = import_single_range(rw, buf,
+                                                  user_iocb->aio_nbytes,
+                                                  req->ki_iovec,
                                                   &req->ki_iter);
                 }
                 if (!ret)
@@ -1810,6 +1893,11 @@ rw_common:
                 ret = aio_poll(req);
                 break;
 
+        case IOCB_CMD_OPENAT:
+                if (aio_may_use_threads())
+                        ret = aio_openat(req);
+                break;
+
         default:
                 pr_debug("EINVAL: no operation provided\n");
                 return -EINVAL;
@@ -1856,14 +1944,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
         if (unlikely(!req))
                 return -EAGAIN;
 
-        req->common.ki_filp = fget(iocb->aio_fildes);
-        if (unlikely(!req->common.ki_filp)) {
-                ret = -EBADF;
-                goto out_put_req;
+        if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
+                req->common.ki_filp = NULL;
+        else {
+                req->common.ki_filp = fget(iocb->aio_fildes);
+                if (unlikely(!req->common.ki_filp)) {
+                        ret = -EBADF;
+                        goto out_put_req;
+                }
         }
         req->common.ki_pos = iocb->aio_offset;
         req->common.ki_complete = aio_complete;
-        req->common.ki_flags = iocb_flags(req->common.ki_filp);
+        if (req->common.ki_filp)
+                req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
         if (iocb->aio_flags & IOCB_FLAG_RESFD) {
                 /*
@@ -1891,10 +1984,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
         req->ki_user_iocb = user_iocb;
         req->ki_user_data = iocb->aio_data;
 
-        ret = aio_run_iocb(req, iocb->aio_lio_opcode,
-                           (char __user *)(unsigned long)iocb->aio_buf,
-                           iocb->aio_nbytes,
-                           compat);
+        ret = aio_run_iocb(req, iocb, compat);
         if (ret)
                 goto out_put_req;
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 7639fb1..0e16988 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
         IOCB_CMD_NOOP = 6,
         IOCB_CMD_PREADV = 7,
         IOCB_CMD_PWRITEV = 8,
+
+        IOCB_CMD_OPENAT = 9,
 };
 
 /*
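For reference, based on how aio_thread_op_openat() above unpacks the iocb (dirfd in aio_fildes, pathname pointer in aio_buf, open flags in the low 32 bits of aio_offset, mode in the next 16 bits), a userspace submission would look roughly like the sketch below. This targets the proposed, unmerged ABI and uses raw syscalls, since libaio knows nothing of the new opcode:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/aio_abi.h>

    #define IOCB_CMD_OPENAT 9   /* from the aio_abi.h hunk above; not upstream */

    int main(void)
    {
            aio_context_t ctx = 0;
            struct iocb cb;
            struct iocb *cbs[1] = { &cb };
            struct io_event ev;

            if (syscall(SYS_io_setup, 128, &ctx))
                    return 1;

            memset(&cb, 0, sizeof(cb));
            cb.aio_lio_opcode = IOCB_CMD_OPENAT;
            cb.aio_fildes     = AT_FDCWD;                           /* dirfd    */
            cb.aio_buf        = (uint64_t)(uintptr_t)"/etc/hostname"; /* pathname */
            cb.aio_offset     = O_RDONLY;       /* flags; mode in bits 32-47 */

            if (syscall(SYS_io_submit, ctx, 1, cbs) != 1)
                    return 1;
            if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
                    printf("openat returned %lld\n", (long long)ev.res);

            syscall(SYS_io_destroy, ctx);
            return 0;
    }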