| Message ID | 80934665e0dd2360e2583522c7c7569e5a92be0e.1452549431.git.bcrl@kvack.org (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> Enable a fully asynchronous fsync and fdatasync operations in aio using
> the aio thread queuing mechanism.
>
> Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

Insufficient. Needs the range to be passed through and call vfs_fsync_range(), as I implemented here:

https://lkml.org/lkml/2015/10/28/878

Cheers,

Dave.
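[Editor's note: for reference, a range-aware version of the helper in the posted patch would look roughly like the sketch below. This is not code from either series: vfs_fsync_range() is the real VFS helper, but the fsync_start/fsync_end fields used to carry the range are invented here purely for illustration, since how the range gets plumbed through the iocb is exactly the part the posted patch does not do.]

```c
/*
 * Sketch only, not from the posted patch: a range-aware fsync helper in
 * the style Dave is asking for, reusing the thread-offload machinery
 * from this series.  The fsync_start/fsync_end fields on struct
 * aio_kiocb are hypothetical.
 */
static long aio_thread_op_fsync_range(struct aio_kiocb *iocb)
{
	return vfs_fsync_range(iocb->common.ki_filp,
			       iocb->fsync_start,	/* hypothetical field */
			       iocb->fsync_end,		/* hypothetical field */
			       0 /* datasync */);
}
```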
On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

And I think that's insufficient *also*.

What you actually want is "sync_file_range()", with the full set of arguments.

Yes, really. Sometimes you want to start the writeback, sometimes you want to wait for it. Sometimes you want both.

For example, if you are doing your own manual write-behind logic, it is not sufficient for "wait for data". What you want is "start IO on new data" followed by "wait for old data to have been written out".

I think this only strengthens my "stop with the idiotic special-case-AIO magic already" argument. If we want something more generic than the usual aio, then we should go all in. Not "let's make more limited special cases".

Linus
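[Editor's note: a minimal userspace sketch of the write-behind pattern Linus describes, using plain sync_file_range(2). The flag names are the real ones from the man page; the helper itself and its parameters are illustrative. As Dave points out further down, this controls page-cache writeback only and gives no durability guarantee.]

```c
#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Illustrative write-behind step: kick off writeback of the chunk just
 * written, then wait only for the chunk written one iteration earlier,
 * which has had time to complete.
 */
static void writebehind_step(int fd, off_t prev_off, off_t cur_off, size_t len)
{
	/* start async writeback of the newest data, don't wait */
	sync_file_range(fd, cur_off, len, SYNC_FILE_RANGE_WRITE);

	/* now wait for the older chunk to be written out */
	sync_file_range(fd, prev_off, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
}
```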
On Tue, Jan 12, 2016 at 12:11:28PM +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> >
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
>
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

Noted.

> https://lkml.org/lkml/2015/10/28/878

Please at least Cc the aio list in the future on aio patches, as I do not have the time to read linux-kernel these days unless prodded to do so...

-ben

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Mon, Jan 11, 2016 at 05:20:42PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Insufficient. Needs the range to be passed through and call
> > vfs_fsync_range(), as I implemented here:
>
> And I think that's insufficient *also*.
>
> What you actually want is "sync_file_range()", with the full set of arguments.

That's a different interface. The aio fsync interface has been exposed to userspace for years, we just haven't implemented it in the kernel. That's a major difference to everything else being proposed in this patch set, especially this one.

FYI, sync_file_range() is definitely not an fsync/fdatasync replacement as it does not guarantee data durability in any way. i.e. you can call sync_file_range, have it wait for data to be written, return to userspace, then lose power and lose the data that sync_file_range said it wrote. That's because sync_file_range() does not:

a) write the metadata needed to reference the data to disk; and
b) flush volatile storage caches after data and metadata is written.

Hence sync_file_range is useless to applications that need to guarantee data durability. Not to mention that most AIO applications use direct IO, and so have no use for fine grained control over page cache writeback semantics. They only require a) and b) above, so implementing the AIO fsync primitive is exactly what they want.

> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.

Without durability guarantees such application level optimisations are pretty much worthless.

> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument. If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

No, I don't think this specific case does, because the AIO fsync interface already exists....

Cheers,

Dave.
On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> That's a different interface.

So is openat. So is readahead.

My point is that this idiotic "let's expose special cases" must end. It's broken. It inevitably only exposes a subset of what different people would want.

Making "aio_read()" and friends a special interface had historical reasons for it. But expanding willy-nilly on that model does not.

Linus
On Mon, Jan 11, 2016 at 06:38:15PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > That's a different interface.
>
> So is openat. So is readahead.
>
> My point is that this idiotic "let's expose special cases" must end.
> It's broken. It inevitably only exposes a subset of what different
> people would want.
>
> Making "aio_read()" and friends a special interface had historical
> reasons for it. But expanding willy-nilly on that model does not.

Yes, I heard you the first time, but you haven't acknowledged that the aio fsync interface is indeed different because it already exists. What's the problem with implementing an AIO call that we've advertised as supported for many years now that people are asking us to implement it?

As for a generic async syscall interface, why not just add IOCB_CMD_SYSCALL that encodes the syscall number and parameters into the iovec structure and let the existing aio subsystem handle demultiplexing it and handing them off to threads/workqueues/etc? That way we get contexts, events, signals, completions, cancelations, etc from the existing infrastructure, and there's really only a dispatch/collection layer that needs to be added?

If we then provide the userspace interface via the libaio library to call the async syscalls with an AIO context handle, then there's little more that needs to be done to support just about everything as an async syscall...

Cheers,

Dave.
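[Editor's note: to make the shape of that proposal concrete, a hypothetical submission-side sketch might look like the following. IOCB_CMD_SYSCALL does not exist in the kernel; the opcode value and the parameter block are invented here purely to illustrate Dave's suggestion.]

```c
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>

#define IOCB_CMD_SYSCALL 42	/* hypothetical opcode, not in linux/aio_abi.h */

/* hypothetical parameter block referenced by the iocb */
struct aio_syscall_args {
	uint64_t nr;		/* syscall number */
	uint64_t args[6];	/* syscall arguments */
};

static void prep_async_syscall(struct iocb *cb, struct aio_syscall_args *sc)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_lio_opcode = IOCB_CMD_SYSCALL;	/* kernel would demultiplex on this */
	cb->aio_buf = (uintptr_t)sc;		/* parameter block to unpack */
	cb->aio_nbytes = sizeof(*sc);
}
```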
On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I heard you the first time, but you haven't acknowledged that
> the aio fsync interface is indeed different because it already
> exists. What's the problem with implementing an AIO call that we've
> advertised as supported for many years now that people are asking us
> to implement it?

Oh, I don't disagree with that. I think it should be exposed, my point was that that too was not enough.

I don't see why you argue. You said "that's not enough". And I just said that your expansion wasn't sufficient either, and that I think we should strive to expand things even more.

And preferably not in some ad-hoc manner. Expand it to *everything* we can do.

> As for a generic async syscall interface, why not just add
> IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> into the iovec structure and let the existing aio subsystem handle
> demultiplexing it and handing them off to threads/workqueues/etc?

That would likely be the simplest approach, yes.

There's a few arguments against it, though:

 - doing the indirect system call thing does end up being architecture-specific, so now you do need the AIO code to call into some arch wrapper. Not a huge deal, since the arch wrapper will be pretty simple (and we can have a default one that just returns ENOSYS, so that we don't have to synchronize all architectures)

 - the aio interface really is horrible crap. Really really. For example, the whole "send signal as a completion model" is so f*cking broken that I really don't want to extend the aio interface too much. I think it's unfixable.

So I really think we'd be *much* better off with a new interface entirely - preferably one that allows the old aio interfaces to fall out fairly naturally.

Ben mentioned lio_listio() as a reason for why he wanted to extend the AIO interface, but I think it works the other way around: yes, we should look at lio_listio(), but we should look at it mainly as a way to ask ourselves: "can we implement a new asynchronous system call submission model that would also make it possible to implement lio_listio() as a user space wrapper around it".

For example, if we had an actual _good_ way to queue up things, you could probably make that "struct sigevent" completion for lio_listio() just be another asynchronous system call at the end of the list - a system call that sends the completion signal.

And the aiocb_list[] itself? Maybe those could just be done as normal (individual) aio calls (so that you end up having the aiocb that you can wait on with aio_suspend() etc).

But then people who do *not* want the crazy aiocb, and do *not* want some SIGIO or whatever, could just fire off asynchronous system calls without that cruddy interface.

So my argument is really that I think it would be better to at least look into maybe creating something less crapulent, and striving to make it easy to make the old legacy interfaces be just wrappers around a more capable model.

And hey, it may be that in the end nobody cares enough, and the right thing (or at least the prudent thing) to do is to just pile the crap on deeper and higher, and just add a single IOCB_CMD_SYSCALL indirection entry. So I'm not dismissing that as a solution - I just don't think it's a particularly clean one. It does have the advantage of likely being a fairly simple hack. But it smells like a hack.
Linus
On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So my argument is really that I think it would be better to at least
> look into maybe creating something less crapulent, and striving to
> make it easy to make the old legacy interfaces be just wrappers around
> a more capable model.

Hmm. Thinking more about this makes me worry about all the system call versioning and extra work done by libc.

At least glibc has traditionally decided to munge and extend on kernel system call interfaces, to the point where even fairly core data structures (like "struct stat") may not always look the same to the kernel as they do to user space.

So with that worry, I have to admit that maybe a limited interface - rather than allowing arbitrary generic async system calls - might have advantages. Less room for mismatches.

I'll have to think about this some more.

Linus
On Jan 11, 2016 8:04 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
>
> On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yes, I heard you the first time, but you haven't acknowledged that
> > the aio fsync interface is indeed different because it already
> > exists. What's the problem with implementing an AIO call that we've
> > advertised as supported for many years now that people are asking us
> > to implement it?
>
> Oh, I don't disagree with that. I think it should be exposed, my point
> was that that too was not enough.
>
> I don't see why you argue. You said "that's not enough". And I jjust
> said that your expansion wasn't sufficient either, and that I think we
> should strive to expand things even more.
>
> And preferably not in some ad-hoc manner. Expand it to *everything* we can do.
>
> > As for a generic async syscall interface, why not just add
> > IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> > into the iovec structure and let the existing aio subsystem handle
> > demultiplexing it and handing them off to threads/workqueues/etc?
>
> That would likely be the simplest approach, yes.
>
> There's a few arguments against it, though:
>
>  - doing the indirect system call thing does end up being
> architecture-specific, so now you do need the AIO code to call into
> some arch wrapper.

How many arches *can* do it? As of 4.4, x86_32 can, but x86_64 can't yet.

We'd also need a whitelist of acceptable indirect syscalls (e.g. exit is bad). And we have to worry about things that depend on the mm or creds.

It would be extra nice if we could avoid switch_mm for things that don't need it (fsync) and only do it for things like read that do.

--Andy
On 12/01/2016 02:20, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Insufficient. Needs the range to be passed through and call
>> vfs_fsync_range(), as I implemented here:
>
> And I think that's insufficient *also*.
>
> What you actually want is "sync_file_range()", with the full set of arguments.
>
> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.
>
> For example, if you are doing your own manual write-behind logic, it
> is not sufficient for "wait for data". What you want is "start IO on
> new data" followed by "wait for old data to have been written out".
>
> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument. If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

The question is, do we really want something more generic than the usual AIO? Virt is one of the 10 (that's a binary number) users of AIO, and we don't even use it by default because in most cases it's really a wash.

Let's compare AIO with a simple userspace thread pool. AIO has the ability to submit and retrieve the results of multiple operations at once. Thread pools do not have the ability to submit multiple operations at a time (you could play games with FUTEX_WAKE, but then all the threads in the pool would have cacheline bounces on the futex). The syscall overhead on the critical path is comparable. For AIO it's io_submit+io_getevents, for a thread pool it's FUTEX_WAKE plus invoking the actual syscall. Again, the only difference for AIO is batching.

Unless userspace is submitting tens of thousands of operations per second, which is pretty much the case only for read/write, there's no real benefit in asynchronous system calls over a userspace thread pool. That applies to openat, unlinkat, fadvise (for readahead). It also applies to msync and fsync, etc. because if your workload is doing tons of those you'd better buy yourself a disk with a battery-backed cache, or a UPS, and remove the msync/fsync altogether.

So I'm really happy if we can move the thread creation overhead for such a thread pool to the kernel. It keeps the benefits of batching, it uses the optimized kernel workqueues, it doesn't incur the cost of pthreads, it makes it easy to remove the cases where AIO is blocking, it makes it easy to add support for !O_DIRECT. But everything else seems overkill.

Paolo
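[Editor's note: for concreteness, the batching Paolo refers to looks roughly like this with libaio. This is a sketch; it assumes an io_context_t already created with io_setup() and N buffers suitably allocated by the caller.]

```c
#include <libaio.h>
#include <stdlib.h>

#define N 32

/*
 * Submit N reads in one io_submit() and reap them in one io_getevents():
 * two syscalls on the critical path regardless of N.
 */
static void batch_reads(io_context_t ctx, int fd, char **bufs, size_t len)
{
	struct iocb cbs[N], *cbps[N];
	struct io_event events[N];
	int i;

	for (i = 0; i < N; i++) {
		io_prep_pread(&cbs[i], fd, bufs[i], len, (long long)i * len);
		cbps[i] = &cbs[i];
	}

	if (io_submit(ctx, N, cbps) != N)		/* one syscall for N ops */
		exit(EXIT_FAILURE);
	if (io_getevents(ctx, N, N, events, NULL) != N)	/* one syscall to reap */
		exit(EXIT_FAILURE);
}
```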
On Mon, Jan 11, 2016 at 08:48:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So my argument is really that I think it would be better to at least
> > look into maybe creating something less crapulent, and striving to
> > make it easy to make the old legacy interfaces be just wrappers around
> > a more capable model.
>
> Hmm. Thinking more about this makes me worry about all the system call
> versioning and extra work done by libc.
>
> At least glibc has traditionally decided to munge and extend on kernel
> system call interfaces, to the point where even fairly core data
> structures (like "struct stat") may not always look the same to the
> kernel as they do to user space.
>
> So with that worry, I have to admit that maybe a limited interface -
> rather than allowing arbitrary generic async system calls - might have
> advantages. Less room for mismatches.
>
> I'll have to think about this some more.

Any further thoughts on this after a few days worth of pondering?

-ben

> Linus
On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>>
>> I'll have to think about this some more.
>
> Any further thoughts on this after a few days worth of pondering?

Sorry about the delay, with the merge window and me being sick for a couple of days I didn't get around to this.

After thinking it over some more, I guess I'm ok with your approach. The table-driven patch makes me a bit happier, and I guess not very many people end up ever wanting to do async system calls anyway.

Are there other users outside of Solace? It would be good to get comments..

Linus
On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
>
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
>
> Are there other users outside of Solace? It would be good to get comments..

For async I/O? We're using it inside Google, for networking and for storage I/O's. We don't need async fsync/fdatasync, but we do need very fast, low overhead I/O's. To that end, we have some patches to batch block layer completion handling, which Kent tried upstreaming a few years back but which everyone thought was too ugly to live. (It *was* ugly, but we had access to some very fast storage devices where it really mattered. With upcoming NVMe devices, that sort of hardware should be available to more folks, so it's something that I've been meaning to revisit from an upstreaming perspective, especially if I can get my hands on some publicly available hardware for benchmarking purposes to demonstrate why it's useful, even if it is ugly.)

The other thing which we have which is a bit more experimental is that we've plumbed through the aio priority bits to the block layer, as well as aio_cancel. The idea for the latter is if you are interested in low latency access to a clustered file system, where sometimes a read request can get stuck behind other I/O requests if a server has a long queue of requests to service. So the client for which low latency is very important fires off the request to more than one server, and as soon as it gets an answer it sends a "never mind" message to the other server(s). The code to do aio cancellation in the block layer is fairly well tested, and was in Kent's git trees, but never got formally pushed upstream. The code to push the cancellation request all the way to the HDD (for those hard disks / storage devices that support I/O cancellation) is even more experimental, and needs a lot of cleanup before it could be sent for review (it was done by someone who isn't used to upstream coding standards).

The reason why we haven't tried to push more of these changes upstream has been lack of resources, and the fact that the AIO code *is* ugly, which means extensions tend to make the code at the very least, more complex. Especially since some of the folks working on it, such as Kent, were really worried about performance at all costs, and Kernighan's "it's twice as hard to debug code as to write it" comment really applies here. And since very few people outside of Google seem to use AIO, and even fewer seem eager to review or work on AIO, and our team is quite small for the work we need to do, it just hasn't risen to the top of the priority list. Still, it's fair to say that if you are using Google Hangouts, or Google Mail, or Google Docs, AIO is most definitely getting used to process your queries.

As far as comments, aside from the "we really care about performance", and "the code is scary complex and barely on the edge of being maintainable", the other comment I'd make is that libaio is pretty awful, and so as a result a number (most?) of our AIO users have elected to use the raw system call interfaces and are *not* using the libaio abstractions --- which, as near as I can tell, don't really buy you much anyway. (Do we really need to keep code that provides backwards compatibility with kernels over 10+ years old at this point?)
Cheers,

- Ted
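[Editor's note: for readers unfamiliar with the "raw system call interfaces" Ted mentions, the sketch below shows the general shape: the io_* syscalls invoked through syscall(2) with the ABI structures from <linux/aio_abi.h>, no libaio at all. The helper is illustrative, with error handling trimmed; it is not Google's code.]

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>
#include <string.h>

/* One synchronous-looking pread built from the raw aio syscalls. */
static long raw_aio_pread(int fd, void *buf, size_t len, long long off)
{
	aio_context_t ctx = 0;
	struct iocb cb, *cbp = &cb;
	struct io_event ev;
	long ret = -1;

	if (syscall(SYS_io_setup, 1, &ctx) < 0)
		return -1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes = fd;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	if (syscall(SYS_io_submit, ctx, 1, &cbp) == 1 &&
	    syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
		ret = (long)ev.res;

	syscall(SYS_io_destroy, ctx);
	return ret;
}
```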
On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
> On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >>
> >> I'll have to think about this some more.
> >
> > Any further thoughts on this after a few days worth of pondering?
>
> Sorry about the delay, with the merge window and me being sick for a
> couple of days I didn't get around to this.
>
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
>
> Are there other users outside of Solace? It would be good to get comments..

I know of quite a few storage/db products that use AIO. The most recent high profile project that has been reporting issues with AIO on XFS is http://www.scylladb.com/. That project is architected around non-blocking AIO for scalability reasons...

Cheers,

Dave.
On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Are there other users outside of Solace? It would be good to get comments..
>
> I know of quite a few storage/db products that use AIO. The most
> recent high profile project that have been reporting issues with AIO
> on XFS is http://www.scylladb.com/. That project is architected
> around non-blocking AIO for scalability reasons...

I was more wondering about the new interfaces, making sure that the feature set actually matches what people want to do..

That said, I also agree that it would be interesting to hear what the performance impact is for existing performance-sensitive users. Could we make that "aio_may_use_threads()" case be unconditional, making things simpler?

Linus
On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
>
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

I suspect this will be an ongoing learning exercise as people start to use the new functionality and find gaps in terms of what is needed. Certainly there is a bunch of stuff we need to add to cover the cases where disk i/o is required. getdents() is one example, but the ABI issues we have with it are somewhat more complicated given the history associated with that interface.

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

Making it unconditional is a goal, but some work is required before that can be the case. The O_DIRECT issue is one such matter -- it requires some changes to the filesystems to ensure that they adhere to the non-blocking nature of the new interface (ie taking i_mutex is a Bad Thing that users really do not want to be exposed to; if taking it blocks, the code should punt to a helper thread). Additional auditing of some of the read/write implementations is also required, which will likely need some minor changes in things like sysfs and other weird functionality we have. Having the flag reflects that while the functionality is useful, not all of the bugs have been worked out yet.

What's the desired approach to merge these changes? Does it make sense to merge what is ready now and prepare the next round of changes for 4.6? Or is it more important to grow things to a more complete state before merging?

Regards,

-ben

> Linus
On Wed, Jan 20, 2016 at 03:44:49PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> Are there other users outside of Solace? It would be good to get comments..
> > >
> > > I know of quite a few storage/db products that use AIO. The most
> > > recent high profile project that have been reporting issues with AIO
> > > on XFS is http://www.scylladb.com/. That project is architected
> > > around non-blocking AIO for scalability reasons...
> >
> > I was more wondering about the new interfaces, making sure that the
> > feature set actually matches what people want to do..
>
> I suspect this will be an ongoing learning exercise as people start to use
> the new functionality and find gaps in terms of what is needed. Certainly
> there is a bunch of stuff we need to add to cover the cases where disk i/o
> is required. getdents() is one example, but the ABI issues we have with it
> are somewhat more complicated given the history associated with that
> interface.
>
> > That said, I also agree that it would be interesting to hear what the
> > performance impact is for existing performance-sensitive users. Could
> > we make that "aio_may_use_threads()" case be unconditional, making
> > things simpler?
>
> Making it unconditional is a goal, but some work is required before that
> can be the case. The O_DIRECT issue is one such matter -- it requires some
> changes to the filesystems to ensure that they adhere to the non-blocking
> nature of the new interface (ie taking i_mutex is a Bad Thing that users
> really do not want to be exposed to; if taking it blocks, the code should
> punt to a helper thread).

Filesystems *must take locks* in the IO path. We have to serialise against truncate and other operations at some point in the IO path (e.g. block mapping vs concurrent allocation and/or removal), and that can only be done sanely with sleeping locks. There is no way of knowing in advance if we are going to block, and so either we always use threads for IO submission or we accept that occasionally the AIO submission will block.

Cheers,

Dave.
On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> Filesystems *must take locks* in the IO path. We have to serialise
> against truncate and other operations at some point in the IO path
> (e.g. block mapping vs concurrent allocation and/or removal), and
> that can only be done sanely with sleeping locks. There is no way
> of knowing in advance if we are going to block, and so either we
> always use threads for IO submission or we accept that occasionally
> the AIO submission will block.

I never said we don't take locks. Still, we can be more intelligent about when and where we do so. With the nonblocking pread() and pwrite() changes being proposed elsewhere, we can do the part of the I/O that doesn't block in the submitter, which is a huge win when possible.

As it stands today, *every* buffered write takes i_mutex immediately on entering ->write(). That one issue alone accounts for a nearly 10x performance difference between an O_SYNC write and an O_DIRECT write, and using O_SYNC writes is a legitimate use-case for users who want caching of data by the kernel (duplicating that functionality is a huge amount of work for an application, plus if you want the cache to be persistent between runs of an app, you have to get the kernel to do it).

-ben

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
>
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

Well, they have mentioned that openat() can block, as will the first operation after open that requires reading the file extent map from disk. There are some ways of hacking around this (e.g. running FIEMAP with a zero extent count or ext4's special extent prefetch ioctl in a separate thread to prefetch the extent list into memory before IO is required) so I suspect we may actually need some interfaces that don't currently exist at all....

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

That would make things a lot simpler in the kernel and AIO submission a lot more predictable/deterministic for userspace. I'd suggest that, at minimum, it should be the default behaviour...

Cheers,

Dave.
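[Editor's note: the FIEMAP workaround Dave alludes to looks roughly like this in userspace. FS_IOC_FIEMAP and the structures are the real kernel ABI; treating the call purely as an extent-map prefetch from a helper thread is the hack being described, not a documented interface.]

```c
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include <string.h>

/*
 * Issue FIEMAP with fm_extent_count == 0 so no extent records are copied
 * out; the useful side effect is that the filesystem reads the file's
 * extent map into memory, making a later AIO on this fd less likely to
 * block on metadata I/O.
 */
static void prefetch_extent_map(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_extent_count = 0;		/* count extents only */

	ioctl(fd, FS_IOC_FIEMAP, &fm);	/* best effort, errors ignored */
}
```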
On 2016-01-12 12:11:28 +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> >
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
>
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:
>
> https://lkml.org/lkml/2015/10/28/878

FWIW, I finally started to play around with this (or more precisely https://lkml.org/lkml/2015/10/29/517). There were some prerequisite changes in postgres required, to actually be able to benefit, delaying things. First results are good, increasing OLTP throughput considerably.

It'd also be rather helpful to be able to do sync_file_range(SYNC_FILE_RANGE_WRITE) asynchronously, i.e. flush without an implied barrier. Currently this blocks very frequently, even if there's actually IO bandwidth available.

Regards,

Andres
On 2016-01-19 19:59:35 -0800, Linus Torvalds wrote:
> Are there other users outside of Solace? It would be good to get comments..
PostgreSQL is a potential user of async fdatasync, fsync,
sync_file_range and potentially readahead, write, read. First tests with
Dave's async fsync/fsync_range are positive, so are the results with a
self-hacked async sync_file_range (although I'm kinda thinking that it
shouldn't really require to be used asynchronously).
I rather doubt openat, unlink et al are going to be interesting for
*us*, the required structural changes would be too big. But obviously
that doesn't mean anything for others.
Andres
On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > Filesystems *must take locks* in the IO path. We have to serialise
> > against truncate and other operations at some point in the IO path
> > (e.g. block mapping vs concurrent allocation and/or removal), and
> > that can only be done sanely with sleeping locks. There is no way
> > of knowing in advance if we are going to block, and so either we
> > always use threads for IO submission or we accept that occasionally
> > the AIO submission will block.
>
> I never said we don't take locks. Still, we can be more intelligent
> about when and where we do so. With the nonblocking pread() and pwrite()
> changes being proposed elsewhere, we can do the part of the I/O that
> doesn't block in the submitter, which is a huge win when possible.
>
> As it stands today, *every* buffered write takes i_mutex immediately
> on entering ->write(). That one issue alone accounts for a nearly 10x
> performance difference between an O_SYNC write and an O_DIRECT write,

Yes, that locking is for correct behaviour, not for performance reasons. The i_mutex is providing the required semantics for POSIX write(2) functionality - writes must serialise against other reads and writes so that they are completed atomically w.r.t. other IO. i.e. writes to the same offset must not interleave, nor should reads be able to see partial data from a write in progress.

Direct IO does not conform to POSIX concurrency standards, so we don't have to serialise concurrent IO against each other.

> and using O_SYNC writes is a legitimate use-case for users who want
> caching of data by the kernel (duplicating that functionality is a huge
> amount of work for an application, plus if you want the cache to be
> persistent between runs of an app, you have to get the kernel to do it).

Yes, but you take what you get given. Buffered IO sucks in many ways; this is just one of them.

Cheers,

Dave.
On Wed, Jan 20, 2016 at 03:07:26PM -0800, Linus Torvalds wrote:
> On Jan 20, 2016 1:46 PM, "Dave Chinner" <david@fromorbit.com> wrote:
> >
> > > > That said, I also agree that it would be interesting to hear what the
> > > > performance impact is for existing performance-sensitive users. Could
> > > > we make that "aio_may_use_threads()" case be unconditional, making
> > > > things simpler?
> > >
> > > Making it unconditional is a goal, but some work is required before that
> > > can be the case. The O_DIRECT issue is one such matter -- it requires some
> > > changes to the filesystems to ensure that they adhere to the non-blocking
> > > nature of the new interface (ie taking i_mutex is a Bad Thing that users
> > > really do not want to be exposed to; if taking it blocks, the code should
> > > punt to a helper thread).
> >
> > Filesystems *must take locks* in the IO path.
>
> I agree.
>
> I also would prefer to make the aio code have as little interaction and
> magic flags with the filesystem code as humanly possible.
>
> I wonder if we could make the rough rule be that the only synchronous case
> the aio code ever has is more or less entirely in the generic vfs caches?
> IOW, could we possibly aim to make the rule be that if we call down to the
> filesystem layer, we do that within a thread?

We have to go through the filesystem layer locking even on page cache hits, and even if we get into the page cache copy-in/copy-out code we can still get stuck on things like page locks and page faults. Even if the pages are cached, we can still get caught on deeper filesystem locks for block mapping. e.g. read from a hole, get zeros back, page cache is populated. Write data into range, fetch page, realise it's unmapped, need to do block/delayed allocation which requires filesystem locks and potentially transactions and IO....

> We could do things like that for the name loopkup for openat() too, where
> we could handle the successful RCU loopkup synchronously, but then if we
> fall out of RCU mode we'd do the thread.

We'd have to do quite a bit of work to unwind back out to the AIO layer before we can dispatch the open operation again in a thread, wouldn't we? So I'm not convinced that conditional thread dispatch makes sense. I think the simplest thing to do is make all AIO use threads/workqueues by default, and if the application is smart enough to only do things that minimise blocking they can turn off the threaded dispatch and get the same behaviour they get now.

Cheers,

Dave.
On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > Filesystems *must take locks* in the IO path. We have to serialise
> > > against truncate and other operations at some point in the IO path
> > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > that can only be done sanely with sleeping locks. There is no way
> > > of knowing in advance if we are going to block, and so either we
> > > always use threads for IO submission or we accept that occasionally
> > > the AIO submission will block.
> >
> > I never said we don't take locks. Still, we can be more intelligent
> > about when and where we do so. With the nonblocking pread() and pwrite()
> > changes being proposed elsewhere, we can do the part of the I/O that
> > doesn't block in the submitter, which is a huge win when possible.
> >
> > As it stands today, *every* buffered write takes i_mutex immediately
> > on entering ->write(). That one issue alone accounts for a nearly 10x
> > performance difference between an O_SYNC write and an O_DIRECT write,
>
> Yes, that locking is for correct behaviour, not for performance
> reasons. The i_mutex is providing the required semantics for POSIX
> write(2) functionality - writes must serialise against other reads
> and writes so that they are completed atomically w.r.t. other IO.
> i.e. writes to the same offset must not interleave, not should reads
> be able to see partial data from a write in progress.

No, the locks are not *required* for POSIX semantics, they are a legacy of how Linux filesystem code has been implemented and how we ensure the necessary internal consistency needed inside our filesystems is provided. There are other ways to achieve the required semantics that do not involve a single giant lock for the entire file/inode. And no, I am not saying that doing this is simple or easy to do.

-ben

> Direct IO does not conform to POSIX concurrency standards, so we
> don't have to serialise concurrent IO against each other.
>
> > and using O_SYNC writes is a legitimate use-case for users who want
> > caching of data by the kernel (duplicating that functionality is a huge
> > amount of work for an application, plus if you want the cache to be
> > persistent between runs of an app, you have to get the kernel to do it).
>
> Yes, but you take what you get given. Buffered IO sucks in many ways;
> this is just one of them.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Jan 22, 2016 at 11:50:24PM -0500, Benjamin LaHaise wrote:
> On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > > Filesystems *must take locks* in the IO path. We have to serialise
> > > > against truncate and other operations at some point in the IO path
> > > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > > that can only be done sanely with sleeping locks. There is no way
> > > > of knowing in advance if we are going to block, and so either we
> > > > always use threads for IO submission or we accept that occasionally
> > > > the AIO submission will block.
> > >
> > > I never said we don't take locks. Still, we can be more intelligent
> > > about when and where we do so. With the nonblocking pread() and pwrite()
> > > changes being proposed elsewhere, we can do the part of the I/O that
> > > doesn't block in the submitter, which is a huge win when possible.
> > >
> > > As it stands today, *every* buffered write takes i_mutex immediately
> > > on entering ->write(). That one issue alone accounts for a nearly 10x
> > > performance difference between an O_SYNC write and an O_DIRECT write,
> >
> > Yes, that locking is for correct behaviour, not for performance
> > reasons. The i_mutex is providing the required semantics for POSIX
> > write(2) functionality - writes must serialise against other reads
> > and writes so that they are completed atomically w.r.t. other IO.
> > i.e. writes to the same offset must not interleave, not should reads
> > be able to see partial data from a write in progress.
>
> No, the locks are not *required* for POSIX semantics, they are a legacy
> of how Linux filesystem code has been implemented and how we ensure the
> necessary internal consistency needed inside our filesystems is
> provided.

That may be the case, but I really don't see how you can provide such required functionality without some kind of exclusion barrier in place. No matter how you implement that exclusion, it can be seen effectively as a lock. Even if the filesystem doesn't use the i_mutex for exclusion to the page cache, it has to use some kind of lock as that IO still needs to be serialised against any truncate, hole punch or other extent manipulation that is currently in progress on the inode...

> There are other ways to achieve the required semantics that
> do not involve a single giant lock for the entire file/inode.

Most performant filesystems don't have a "single giant lock" anymore. The problem is that the VFS expects the i_mutex to be held for certain operations in the IO path and the VFS lock order hierarchy makes it impossible to do anything but "get i_mutex first". That's the problem that needs to be solved - the VFS enforces the "one giant lock" model, even when underlying filesystems do not require it.

i.e. we could quite happily remove the i_mutex completely from the XFS buffered IO path without breaking anything, but we can't because that results in the VFS throwing warnings that we don't hold the i_mutex (e.g. like when removing the SUID bits on write). So there's lots of VFS functionality that needs to be turned on its head before the i_mutex can be removed from the IO path.

> And no, I
> am not saying that doing this is simple or easy to do.

Sure. That's always been the problem.
Even when a split IO/metadata locking strategy like what XFS uses (and other modern filesystems are moving to internally) is suggested as a model for solving these problems, the usual response is instant dismissal with "no way, that's unworkable" and so nothing ever changes...

Cheers,

Dave.
diff --git a/fs/aio.c b/fs/aio.c
index 88af450..576b780 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -224,8 +224,9 @@ static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
+ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 
-static bool aio_may_use_threads(void)
+static __always_inline bool aio_may_use_threads(void)
 {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
 	return !!(aio_auto_threads & 1);
@@ -1654,6 +1655,26 @@ ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 				     AIO_THREAD_NEED_TASK);
 }
 EXPORT_SYMBOL(generic_async_write_iter);
+
+static long aio_thread_op_fsync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 0);
+}
+
+static long aio_thread_op_fdatasync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 1);
+}
+
+ssize_t aio_fsync(struct kiocb *iocb, int datasync)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+
+	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
+						   : aio_thread_op_fsync, 0);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -1664,7 +1685,7 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 			    char __user *buf, size_t len, bool compat)
 {
 	struct file *file = req->common.ki_filp;
-	ssize_t ret;
+	ssize_t ret = -EINVAL;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
@@ -1730,17 +1751,17 @@ rw_common:
 		break;
 
 	case IOCB_CMD_FDSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 1);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 1);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 1);
 		break;
 
 	case IOCB_CMD_FSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 0);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 0);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 0);
 		break;
 
 	default:
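[Editor's note: for context, on a kernel carrying this change userspace would exercise it through the long-standing IOCB_CMD_FSYNC/IOCB_CMD_FDSYNC opcodes from <linux/aio_abi.h>. A minimal sketch, assuming an aio_context_t already set up with io_setup() and, per the patch, aio_auto_threads enabled:]

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>
#include <string.h>

/* Submit an asynchronous fsync on fd and wait for its completion event. */
static long async_fsync(aio_context_t ctx, int fd)
{
	struct iocb cb, *cbp = &cb;
	struct io_event ev;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_FSYNC;	/* IOCB_CMD_FDSYNC for fdatasync */
	cb.aio_fildes = fd;

	if (syscall(SYS_io_submit, ctx, 1, &cbp) != 1)
		return -1;
	/* with this patch, the fsync itself runs in a kernel helper thread */
	if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1)
		return -1;
	return (long)ev.res;
}
```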