Message ID | 20191211152943.2933-1-axboe@kernel.dk
---|---
Series | Support for RWF_UNCACHED
On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. Patches are against current git, and can also be found here:

I don't think this is conceptually wrong, but the implementation smells a bit.

I commented on the trivial part (the horrendous argument list to iomap_actor), but I wonder how much of the explicit invalidation is actually needed?

Because active invalidation really is a horrible horrible thing to do. It immediately means that you can't use this interface for normal everyday things that may actually cache perfectly fine.

What happens if you simply never _activate_ the page? Then they should get invalidated on their own, without impacting any other load - but only when there is _some_ memory pressure. They'll just stay on the inactive lru list, and get re-used quickly.

Note that there are two ways to activate a page: the "accessed while on the inactive list" will activate it, but these days we also have a "pre-activate" path in the workingset code (see workingset_refault()).

Even if you might also want an explicit invalidate path, I would like to hear what it looks like if you have a "don't activate" flag instead of - or in addition to - invalidating.

We don't have all _that_ many places where we activate pages, and they should be easy to find (just grep for "SetPageActive()"), although the call chain may make it a bit painful to add a "don't do it for this access" kind of thing.

But I think most of the regular IO call chains come through "mark_page_accessed()". So _that_ is the part you want to avoid (and maybe the workingset code). And that should be fairly straightforward, I think.

In fact, that you say that just a pure random read case causes lots of kswapd activity makes me think that maybe we've screwed up page activation in general, and never noticed (because if you have enough memory, you don't really see it that often)? So this might not be an io_uring issue, but an issue in general.

Linus
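The activation Linus is describing is the two-touch promotion done by mark_page_accessed(). A simplified sketch of that logic, condensed from mm/swap.c of roughly this era (compound-page and pagevec handling dropped, so treat it as illustrative rather than exact):

#include <linux/swap.h>
#include <linux/page-flags.h>

/* Illustrative only: a second access to an inactive page promotes it */
static void mark_page_accessed_sketch(struct page *page)
{
	if (!PageReferenced(page)) {
		/* first touch: tag the page, leave it on the inactive list */
		SetPageReferenced(page);
	} else if (!PageActive(page) && !PageUnevictable(page)) {
		/* second touch: move to the active list, tell the workingset code */
		activate_page(page);
		ClearPageReferenced(page);
		workingset_activation(page);
	}
}

A "don't activate" hint would essentially mean skipping this call (or the SetPageActive() sites) for the flagged IO, which is what the RWF_NOACCESS experiment later in the thread does.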
On 12/11/19 10:37 AM, Linus Torvalds wrote: > On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote: >> >> Comments appreciated! This should work on any standard file system, >> using either the generic helpers or iomap. I have tested ext4 and xfs >> for the right read/write behavior, but no further validation has been >> done yet. Patches are against current git, and can also be found here: > > I don't think this is conceptually wrong, but the implementation > smells a bit. > > I commented on the trivial part (the horrendous argument list to > iomap_actor), but I wonder how much of the explicit invalidation is > actually needed? Agree on the other email on that part, if we continue on this path, then I'll clean that up and shove the arguments in an actor struct. > Because active invalidation really is a horrible horrible thing to do. > It immediately means that you can't use this interface for normal > everyday things that may actually cache perfectly fine. > > What happens if you simply never _activate_ the page? Then they should > get invalidated on their own, without impacting any other load - but > only when there is _some_ memory pressure. They'll just stay on the > inactive lru list, and get re-used quickly. > > Note that there are two ways to activate a page: the "accessed while > on the inactive list" will activate it, but these days we also have a > "pre-activate" path in the workingset code (see workingset_refault()). > > Even if you might also want an explicit invalidate path, I would like > to hear what it looks like if you instead of - or in addition to - > invalidating, have a "don't activate" flag. > > We don't have all _that_ many places where we activate pages, and they > should be easy to find (just grep for "SetPageActive()"), although the > call chain may make it a bit painful to add a "don't do it for this > access" kind of things. > > But I think most of the regular IO call chains come through > "mark_page_accessed()". So _that_ is the part you want to avoid (and > maybe the workingset code). And that should be fairly straightforward, > I think. Sure, I can give that a go and see how that behaves. > In fact, that you say that just a pure random read case causes lots of > kswapd activity makes me think that maybe we've screwed up page > activation in general, and never noticed (because if you have enough > memory, you don't really see it that often)? So this might not be an > io_ring issue, but an issue in general. This is very much not an io_uring issue, you can see exactly the same kind of behavior with normal buffered reads or mmap'ed IO. I do wonder if streamed reads are as bad in terms of making kswapd go crazy, I forget if I tested that explicitly as well. I'll run some streamed and random read testing on both and see how they behave, then report back.
[ Adding Johannes Weiner to the cc, I think he's looked at the working set and the inactive/active LRU lists the most ] On Wed, Dec 11, 2019 at 9:56 AM Jens Axboe <axboe@kernel.dk> wrote: > > > In fact, that you say that just a pure random read case causes lots of > > kswapd activity makes me think that maybe we've screwed up page > > activation in general, and never noticed (because if you have enough > > memory, you don't really see it that often)? So this might not be an > > io_ring issue, but an issue in general. > > This is very much not an io_uring issue, you can see exactly the same > kind of behavior with normal buffered reads or mmap'ed IO. I do wonder > if streamed reads are as bad in terms of making kswapd go crazy, I > forget if I tested that explicitly as well. We definitely used to have people test things like "read the same much-bigger-than-memory file over and over", and it wasn't supposed to be all _that_ painful, because the pages never activated, and they got moved out of the cache quickly and didn't disturb other activities (other than the constant IO, of course, which can be a big deal in itself). But maybe that was just the streaming case. With read-around and random accesses, maybe we end up activating too much (and maybe we always did). But I wouldn't be surprised if we've lost that as people went from having 16-32MB to having that many GB instead - simply because a lot of loads are basically entirely cached, and the few things that are not tend to be explicitly uncached (ie O_DIRECT etc). I think the workingset changes actually were maybe kind of related to this - the inactive list can become too small to ever give people time to do a good job of picking the _right_ thing to activate. So this might either be the reverse situation - maybe we let the inactive list grow too large, and then even a big random load will activate pages that really shouldn't be activated? Or it might be related to the workingset issue in that we've activated pages too eagerly and not ever moved things back to the inactive list (which then in some situations causes the inactive list to be very small). Who knows. But this is definitely an area that I suspect hasn't gotten all that much attention simply because memory has become much more plentiful, and a lot of regular loads basically have enough memory that almost all IO is cached anyway, and the old "you needed to be more clever about balancing swap/inactive/active even under normal loads" thing may have gone away a bit. These days, even if you do somewhat badly in that balancing act, a lot of loads probably won't notice that much. Either there is still so much memory for caching that the added IO isn't really ever dominant, or you had such a big load to begin with that it was long since rewritten to use O_DIRECT. Linus
On 12/11/19 10:56 AM, Jens Axboe wrote:
>> But I think most of the regular IO call chains come through
>> "mark_page_accessed()". So _that_ is the part you want to avoid (and
>> maybe the workingset code). And that should be fairly straightforward,
>> I think.
>
> Sure, I can give that a go and see how that behaves.

Before doing that, I ran a streamed read test instead of just random reads, and the behavior is roughly the same. kswapd consumes a bit less CPU, but it's still very active once the page cache has been filled. For specifics on the setup, I deliberately boot the box with 32G of RAM, and the dataset is 320G. My initial tests were with 1 320G file, but Johannes complained about that so I went to 32 10G files instead. That's what I'm currently using.

For the random test case, top of profile for kswapd is:

+   33.49%  kswapd0  [kernel.vmlinux]  [k] xas_create
+    7.93%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
+    7.18%  kswapd0  [kernel.vmlinux]  [k] unlock_page
+    5.90%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
+    5.64%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+    5.57%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
+    3.48%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
+    3.35%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
+    3.14%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache

Next I ran with NOT calling mark_page_accessed() to see if that makes a difference. See patch below, I just applied this on top of this patchset and added a new RWF_NOACCESS flag for it for ease of testing. I verified that we are indeed skipping the mark_page_accessed() call in generic_file_buffered_read().

I can't tell a difference in the results, there's no discernible difference between NOT calling mark_page_accessed() or calling it. Behavior seems about the same, in terms of pre and post page cache full, and kswapd still churns a lot once the page cache is filled up.
On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I can't tell a difference in the results, there's no discernible
> difference between NOT calling mark_page_accessed() or calling it.
> Behavior seems about the same, in terms of pre and post page cache full,
> and kswapd still churns a lot once the page cache is filled up.

Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it when you do the IO that way, but it seems to be a bug regardless.

Does /proc/meminfo have everything inactive for file data (ie the "Active(file)" line is basically zero?). Maybe pages got activated other ways (eg a problem with the workingset code)? You said "See patch below", but there wasn't any.

That said, it's also entirely possible that even with everything in the inactive list, we might try to shrink other things first for whatever odd reason..

The fact that you see that xas_create() so prominently would imply perhaps add_to_swap_cache(), which certainly implies that the page shrinking isn't hitting the file pages...

Linus
On 12/11/19 12:34 PM, Jens Axboe wrote: > On 12/11/19 10:56 AM, Jens Axboe wrote: >>> But I think most of the regular IO call chains come through >>> "mark_page_accessed()". So _that_ is the part you want to avoid (and >>> maybe the workingset code). And that should be fairly straightforward, >>> I think. >> >> Sure, I can give that a go and see how that behaves. > > Before doing that, I ran a streamed read test instead of just random > reads, and the behavior is roughly the same. kswapd consumes a bit less > CPU, but it's still very active once the page cache has been filled. For > specifics on the setup, I deliberately boot the box with 32G of RAM, and > the dataset is 320G. My initial tests were with 1 320G file, but > Johannes complained about that so I went to 32 10G files instead. That's > what I'm currently using. > > For the random test case, top of profile for kswapd is: > > + 33.49% kswapd0 [kernel.vmlinux] [k] xas_create > + 7.93% kswapd0 [kernel.vmlinux] [k] __isolate_lru_page > + 7.18% kswapd0 [kernel.vmlinux] [k] unlock_page > + 5.90% kswapd0 [kernel.vmlinux] [k] free_pcppages_bulk > + 5.64% kswapd0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave > + 5.57% kswapd0 [kernel.vmlinux] [k] shrink_page_list > + 3.48% kswapd0 [kernel.vmlinux] [k] __remove_mapping > + 3.35% kswapd0 [kernel.vmlinux] [k] isolate_lru_pages > + 3.14% kswapd0 [kernel.vmlinux] [k] __delete_from_page_cache Here's the profile for the !mark_page_accessed() run, looks very much the same: + 32.84% kswapd0 [kernel.vmlinux] [k] xas_create + 8.05% kswapd0 [kernel.vmlinux] [k] unlock_page + 7.68% kswapd0 [kernel.vmlinux] [k] __isolate_lru_page + 6.08% kswapd0 [kernel.vmlinux] [k] free_pcppages_bulk + 5.96% kswapd0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave + 5.56% kswapd0 [kernel.vmlinux] [k] shrink_page_list + 4.02% kswapd0 [kernel.vmlinux] [k] __remove_mapping + 3.70% kswapd0 [kernel.vmlinux] [k] __delete_from_page_cache + 3.55% kswapd0 [kernel.vmlinux] [k] isolate_lru_pages
On 12/11/19 1:03 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I can't tell a difference in the results, there's no discernible
>> difference between NOT calling mark_page_accessed() or calling it.
>> Behavior seems about the same, in terms of pre and post page cache full,
>> and kswapd still churns a lot once the page cache is filled up.
>
> Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it
> when you do the IO that way, but it seems to be a bug regardless.

Hard to disagree with that.

> Does /proc/meminfo have everything inactive for file data (ie the
> "Active(file)" line is basically zero?).

$ cat /proc/meminfo | grep -i active
Active:           134136 kB
Inactive:       28683916 kB
Active(anon):      97064 kB
Inactive(anon):        4 kB
Active(file):      37072 kB
Inactive(file): 28683912 kB

This is after a run with RWF_NOACCESS.

> Maybe pages got activated other ways (eg a problem with the workingset
> code)? You said "See patch below", but there wasn't any.

Oops, now below.

> That said, it's also entirely possible that even with everything in
> the inactive list, we might try to shrink other things first for
> whatever odd reason..
>
> The fact that you see that xas_create() so prominently would imply
> perhaps add_to_swap_cache(), which certainly implies that the page
> shrinking isn't hitting the file pages...

That's presumably misleading, as it's just lookups. But yes, confusing...


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5ea5fc167524..b2ecc66f5bd5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ enum rw_hint {
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
 #define IOCB_UNCACHED		(1 << 8)
+#define IOCB_NOACCESS		(1 << 9)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3423,6 +3424,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= IOCB_APPEND;
 	if (flags & RWF_UNCACHED)
 		ki->ki_flags |= IOCB_UNCACHED;
+	if (flags & RWF_NOACCESS)
+		ki->ki_flags |= IOCB_NOACCESS;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 357ebb0e0c5d..f20f0048d5c5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -302,8 +302,10 @@ typedef int __bitwise __kernel_rwf_t;
 /* drop cache after reading or writing data */
 #define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
 
+#define RWF_NOACCESS	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_UNCACHED)
+			 RWF_APPEND | RWF_UNCACHED | RWF_NOACCESS)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dadd1a4ca7c..c37b0e221a8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2058,7 +2058,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
 			/* UNCACHED implies no read-ahead */
-			if (iocb->ki_flags & IOCB_UNCACHED)
+			if (iocb->ki_flags & (IOCB_UNCACHED|IOCB_NOACCESS))
 				goto no_cached_page;
 			page_cache_sync_readahead(mapping, ra, filp,
 					index, last_index - index);
@@ -2144,7 +2144,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if ((prev_index != index || offset != prev_offset) &&
+		    !(iocb->ki_flags & IOCB_NOACCESS))
 			mark_page_accessed(page);
 		prev_index = index;
On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote: > > $ cat /proc/meminfo | grep -i active > Active: 134136 kB > Inactive: 28683916 kB > Active(anon): 97064 kB > Inactive(anon): 4 kB > Active(file): 37072 kB > Inactive(file): 28683912 kB Yeah, that should not put pressure on some swap activity. We have 28 GB of basically free inactive file data, and the VM is doing something very very bad if it then doesn't just quickly free it with no real drama. In fact, I don't think it should even trigger kswapd at all, it should all be direct reclaim. Of course, some of the mm people hate that with a passion, but this does look like a prime example of why it should just be done. MM people - mind giving this a look? Jens, if you have that NOACCESS flag in a git tree too and a trivial way to recreate your load, that would be good for people to be able to just try things out. Linus
On Wed, Dec 11, 2019 at 01:08:32PM -0700, Jens Axboe wrote:
> > The fact that you see that xas_create() so prominently would imply
> > perhaps add_to_swap_cache(), which certainly implies that the page
> > shrinking isn't hitting the file pages...
>
> That's presumably misleading, as it's just lookups. But yes,
> confusing...

While xas_create() could be called directly, it's more often called through xas_store() ... which would be called if we're storing a shadow entry to replace a page, which this workload would presumably be doing.
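For context, the deletion path Willy is referring to looks roughly like this (condensed from page_cache_delete() in mm/filemap.c; statistics, huge-page handling and assertions dropped). The page's slot is not erased but overwritten with a shadow entry, so the bulk of the cost is the xas_store() walk down to the right node:

#include <linux/xarray.h>
#include <linux/pagemap.h>

/* Condensed sketch, not the full function */
static void page_cache_delete_sketch(struct address_space *mapping,
				     struct page *page, void *shadow)
{
	XA_STATE(xas, &mapping->i_pages, page->index);

	xas_store(&xas, shadow);	/* replace the page pointer with the shadow entry */
	xas_init_marks(&xas);

	page->mapping = NULL;
	mapping->nrpages--;
}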
On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > $ cat /proc/meminfo | grep -i active
> > Active:           134136 kB
> > Inactive:       28683916 kB
> > Active(anon):      97064 kB
> > Inactive(anon):        4 kB
> > Active(file):      37072 kB
> > Inactive(file): 28683912 kB
>
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.

I was looking at this with Jens offline last week. One thing to note is the rate of IO that Jens is working with: combined with the low cache hit rate, it was pushing upwards of half a million pages through the page cache each second.

There isn't anything obvious sticking out in the kswapd profile: it's dominated by cache tree deletions (or rather replacing pages with shadow entries, hence the misleading xas_store()), tree lock contention, etc. - all work that a direct reclaimer would have to do as well, with one exception: RWF_UNCACHED doesn't need to go through the LRU list, and 8-9% of kswapd cycles alone are going into physically getting pages off the list. (And I suspect part of that is also contention over the LRU lock as kswapd gets overwhelmed and direct reclaim kicks in).

Jens, how much throughput difference does kswapd vs RWF_UNCACHED make?
On 12/11/19 1:18 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> $ cat /proc/meminfo | grep -i active
>> Active:           134136 kB
>> Inactive:       28683916 kB
>> Active(anon):      97064 kB
>> Inactive(anon):        4 kB
>> Active(file):      37072 kB
>> Inactive(file): 28683912 kB
>
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.
>
> In fact, I don't think it should even trigger kswapd at all, it should
> all be direct reclaim. Of course, some of the mm people hate that with
> a passion, but this does look like a prime example of why it should
> just be done.

For giggles, I ran just a single thread on the file set. We're only doing about 100K IOPS at that point, yet when the page cache fills, kswapd still eats 10% cpu. That seems like a lot for something that slow.

> MM people - mind giving this a look? Jens, if you have that NOACCESS
> flag in a git tree too and a trivial way to recreate your load, that
> would be good for people to be able to just try things out.

I've pushed the NOACCESS thing to my buffered-uncached branch as well, and fio has a 'noaccess' branch that enables it for pvsync2 (which is preadv2/pwritev2) and the io_uring engine.

Here's what I did to reproduce:

- Boot the box with 32G of memory.

- On a fast device, create 10x RAM size of files. I used 32 files, each
  10G. Mine are in /data, and they are named file[1-32].

- Run a buffered read workload on those files. For pvsync2, something
  ala:

$ cat job.fio
[test]
ioengine=pvsync2
#uncached=1
#noaccess=1
iodepth=4
bs=4k
group_reporting=1
rw=randread
norandommap
buffered=1
directory=/data
filename=file1:file2:file3:file4:file5:file6:file7:file8:file9:file10:file11:file12:file13:file14:file15:file16:file17:file18:file19:file20:file21:file22:file23:file24:file25:file26:file27:file28:file29:file30:file31:file32

If you want to use more than one thread, add:

numjobs=4

for 4 threads. Uncomment the 'uncached=1' and/or 'noaccess=1' to enable either RWF_UNCACHED or RWF_NOACCESS.
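For anyone without fio handy, the pvsync2 case boils down to random 4k buffered reads via preadv2(2). A minimal standalone sketch of the same load; the RWF_UNCACHED/RWF_NOACCESS values are the ones from the patches in this thread (0x40/0x80), not mainline flags, and the /data/file1 path matches the setup above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* from this patchset */
#endif
#ifndef RWF_NOACCESS
#define RWF_NOACCESS	0x00000080	/* from the test patch above */
#endif

int main(int argc, char **argv)
{
	int flags = 0;			/* set to RWF_UNCACHED or RWF_NOACCESS to test */
	static char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	off_t file_size = 10ULL << 30;	/* 10G files, as in the fio job */
	int fd;

	fd = open(argc > 1 ? argv[1] : "/data/file1", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* run until interrupted, like the fio job */
	for (;;) {
		off_t off = ((off_t)rand() % (file_size / 4096)) * 4096;

		if (preadv2(fd, &iov, 1, off, flags) < 0) {
			perror("preadv2");
			break;
		}
	}
	close(fd);
	return 0;
}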
On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Here's what I did to reproduce:

Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970 Pro - I'm assuming yours is a different class), but I encrypt my disk, so I only get about 15k iops and I see kcryptd in my profiles, not kswapd.

I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I wanted to see the problem. But I think that with my setup, I can't really see it just due to my IO being slow ;(

I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the very first signs of this problem, but it's certainly not all that noticeable.

Linus
On 12/11/19 4:41 PM, Jens Axboe wrote:
> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> $ cat /proc/meminfo | grep -i active
>>> Active:           134136 kB
>>> Inactive:       28683916 kB
>>> Active(anon):      97064 kB
>>> Inactive(anon):        4 kB
>>> Active(file):      37072 kB
>>> Inactive(file): 28683912 kB
>>
>> Yeah, that should not put pressure on some swap activity. We have 28
>> GB of basically free inactive file data, and the VM is doing something
>> very very bad if it then doesn't just quickly free it with no real
>> drama.
>>
>> In fact, I don't think it should even trigger kswapd at all, it should
>> all be direct reclaim. Of course, some of the mm people hate that with
>> a passion, but this does look like a prime example of why it should
>> just be done.
>
> For giggles, I ran just a single thread on the file set. We're only
> doing about 100K IOPS at that point, yet when the page cache fills,
> kswapd still eats 10% cpu. That seems like a lot for something that
> slow.

Warning, the below is from the really crazy department...

Anyway, I took a closer look at the profiles for the uncached case. We're spending a lot of time doing memsets (this is the xa_node init, from the radix tree constructor), and call_rcu for the node free later on. All wasted time, and something that meant we weren't as close to the performance of O_DIRECT as we could be.

So Chris and I started talking about this, and pondered "what would happen if we simply bypassed the page cache completely?". Case in point, see below incremental patch. We still do the page cache lookup, and use that page to copy from if it's there. If the page isn't there, allocate one and do IO to it, but DON'T add it to the page cache. With that, we're almost at O_DIRECT levels of performance for the 4k read case, within 1-2%. I think 512b would look awesome, but we're reading full pages, so that won't really help us much. Compared to the previous uncached method, this is 30% faster on this device. That's substantial.

Obviously this has issues with truncate that would need to be resolved, and it's definitely dirtier. But the performance is very enticing...
diff --git a/mm/filemap.c b/mm/filemap.c index c37b0e221a8a..9834c394fffd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -933,8 +933,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, } EXPORT_SYMBOL(add_to_page_cache_locked); -static int __add_to_page_cache(struct page *page, struct address_space *mapping, - pgoff_t offset, gfp_t gfp_mask, bool lru) +int add_to_page_cache_lru(struct page *page, struct address_space *mapping, + pgoff_t offset, gfp_t gfp_mask) { void *shadow = NULL; int ret; @@ -956,18 +956,11 @@ static int __add_to_page_cache(struct page *page, struct address_space *mapping, WARN_ON_ONCE(PageActive(page)); if (!(gfp_mask & __GFP_WRITE) && shadow) workingset_refault(page, shadow); - if (lru) - lru_cache_add(page); + lru_cache_add(page); } return ret; } - -int add_to_page_cache_lru(struct page *page, struct address_space *mapping, - pgoff_t offset, gfp_t gfp_mask) -{ - return __add_to_page_cache(page, mapping, offset, gfp_mask, true); -} EXPORT_SYMBOL_GPL(add_to_page_cache_lru); #ifdef CONFIG_NUMA @@ -1998,6 +1991,13 @@ static void shrink_readahead_size_eio(struct file *filp, ra->ra_pages /= 4; } +static void buffered_put_page(struct page *page, bool clear_mapping) +{ + if (clear_mapping) + page->mapping = NULL; + put_page(page); +} + /** * generic_file_buffered_read - generic file read routine * @iocb: the iocb to read @@ -2040,7 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, offset = *ppos & ~PAGE_MASK; for (;;) { - bool drop_page = false; + bool clear_mapping = false; struct page *page; pgoff_t end_index; loff_t isize; @@ -2118,7 +2118,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, isize = i_size_read(inode); end_index = (isize - 1) >> PAGE_SHIFT; if (unlikely(!isize || index > end_index)) { - put_page(page); + buffered_put_page(page, clear_mapping); goto out; } @@ -2127,7 +2127,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, if (index == end_index) { nr = ((isize - 1) & ~PAGE_MASK) + 1; if (nr <= offset) { - put_page(page); + buffered_put_page(page, clear_mapping); goto out; } } @@ -2160,27 +2160,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, offset &= ~PAGE_MASK; prev_offset = offset; - /* - * If we're dropping this page due to drop-behind, then - * lock it first. Ignore errors here, we can just leave it - * in the page cache. Note that we didn't add this page to - * the LRU when we added it to the page cache. So if we - * fail removing it, or lock it, add to the LRU. - */ - if (drop_page) { - bool addlru = true; - - if (!lock_page_killable(page)) { - if (page->mapping == mapping) - addlru = !remove_mapping(mapping, page); - else - addlru = false; - unlock_page(page); - } - if (addlru) - lru_cache_add(page); - } - put_page(page); + buffered_put_page(page, clear_mapping); written += ret; if (!iov_iter_count(iter)) goto out; @@ -2222,7 +2202,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, if (unlikely(error)) { if (error == AOP_TRUNCATED_PAGE) { - put_page(page); + buffered_put_page(page, clear_mapping); error = 0; goto find_page; } @@ -2239,7 +2219,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, * invalidate_mapping_pages got it */ unlock_page(page); - put_page(page); + buffered_put_page(page, clear_mapping); goto find_page; } unlock_page(page); @@ -2254,7 +2234,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, readpage_error: /* UHHUH! A synchronous read error occurred. 
Report it */ - put_page(page); + buffered_put_page(page, clear_mapping); goto out; no_cached_page: @@ -2267,12 +2247,16 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, error = -ENOMEM; goto out; } - if (iocb->ki_flags & IOCB_UNCACHED) - drop_page = true; + if (iocb->ki_flags & IOCB_UNCACHED) { + __SetPageLocked(page); + page->mapping = mapping; + page->index = index; + clear_mapping = true; + goto readpage; + } - error = __add_to_page_cache(page, mapping, index, - mapping_gfp_constraint(mapping, GFP_KERNEL), - !drop_page); + error = add_to_page_cache(page, mapping, index, + mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error) { put_page(page); if (error == -EEXIST) {
On 12/11/19 6:08 PM, Linus Torvalds wrote: > On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> Here's what I did to reproduce: > > Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970 > Pro - I'm assuming yours is a different class), but I encrypt my disk, > so I only get about 15k iops and I see kcyrptd in my profiles, not > kswapd. > > I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I > wanted to see the problem. But I think that with my setup, I can't > really see it just due to my IO being slow ;( > > I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the > very first signs of this problem, but it's certainly not all that > noticeable. 15K is likely too slow to really show an issue, I'm afraid. The 970 is no slouch, but your crypt setup will likely hamper it a lot. You don't have a non-encrypted partition on it? I always set aside a decently sized empty partition on my laptop to be able to run these kinds of things without having to resort to a test box. Latency is where it's down the most, you'll likely do better with more threads to get the IOPS up.
On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote: > > 15K is likely too slow to really show an issue, I'm afraid. The 970 > is no slouch, but your crypt setup will likely hamper it a lot. You > don't have a non-encrypted partition on it? No. I normally don't need all that much disk, so I've never upgraded my ssd from the 512G size. Which means that it's actually half full or so, and I never felt like "I should keep an unencrypted partition for IO testing", since I don't generally _do_ any IO testing. I can get my load up with "numjobs=8" and get my iops up to the 100k range, though. But kswapd doesn't much seem to care, the CPU percentage actually does _down_ to 0.39% when I try that. Probably simply because now my CPU's are busy, so they are running at 4.7Ghz instead of the 800Mhz "mostly idle" state ... I guess I should be happy. It does mean that the situation you see isn't exactly the normal case. I understand why you want to do the non-cached case, but the case I think it the worrisome one is the regular buffered one, so that's what I'm testing (not even trying the noaccess patches). So from your report I went "uhhuh, that sounds like a bug". And it appears that it largely isn't - you're seeing it because of pushing the IO subsystem by another order of magnitude (and then I agree that "under those kinds of IO loads, caching just won't help") Linus
On 12/11/19 6:22 PM, Linus Torvalds wrote: > On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> 15K is likely too slow to really show an issue, I'm afraid. The 970 >> is no slouch, but your crypt setup will likely hamper it a lot. You >> don't have a non-encrypted partition on it? > > No. I normally don't need all that much disk, so I've never upgraded > my ssd from the 512G size. > > Which means that it's actually half full or so, and I never felt like > "I should keep an unencrypted partition for IO testing", since I don't > generally _do_ any IO testing. > > I can get my load up with "numjobs=8" and get my iops up to the 100k > range, though. > > But kswapd doesn't much seem to care, the CPU percentage actually does > _down_ to 0.39% when I try that. Probably simply because now my CPU's > are busy, so they are running at 4.7Ghz instead of the 800Mhz "mostly > idle" state ... > > I guess I should be happy. It does mean that the situation you see > isn't exactly the normal case. I understand why you want to do the > non-cached case, but the case I think it the worrisome one is the > regular buffered one, so that's what I'm testing (not even trying the > noaccess patches). > > So from your report I went "uhhuh, that sounds like a bug". And it > appears that it largely isn't - you're seeing it because of pushing > the IO subsystem by another order of magnitude (and then I agree that > "under those kinds of IO loads, caching just won't help") I'd very much argue that it IS a bug, maybe just doesn't show on your system. My test box is a pretty standard 2 socket system, 24 cores / 48 threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing crazy, and granted that's only 10% kswapd cpu time, but that still seems very high for those kinds of rates. I'm surprised you see essentially no kswapd time for the same data rate. We'll keep poking here, I know Johannes is spending some time looking into the reclaim side.
On 12/11/19 2:04 PM, Johannes Weiner wrote: > On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote: >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote: >>> >>> $ cat /proc/meminfo | grep -i active >>> Active: 134136 kB >>> Inactive: 28683916 kB >>> Active(anon): 97064 kB >>> Inactive(anon): 4 kB >>> Active(file): 37072 kB >>> Inactive(file): 28683912 kB >> >> Yeah, that should not put pressure on some swap activity. We have 28 >> GB of basically free inactive file data, and the VM is doing something >> very very bad if it then doesn't just quickly free it with no real >> drama. > > I was looking at this with Jens offline last week. One thing to note > is the rate of IO that Jens is working with: combined with the low > cache hit rate, it was pushing upwards of half a million pages through > the page cache each second. > > There isn't anything obvious sticking out in the kswapd profile: it's > dominated by cache tree deletions (or rather replacing pages with > shadow entries, hence the misleading xas_store()), tree lock > contention, etc. - all work that a direct reclaimer would have to do > as well, with one exceptions: RWC_UNCACHED doesn't need to go through > the LRU list, and 8-9% of kswapd cycles alone are going into > physically getting pages off the list. (And I suspect part of that is > also contention over the LRU lock as kswapd gets overwhelmed and > direct reclaim kicks in). > > Jens, how much throughput difference does kswapd vs RWC_UNCACHED make? It's not huge, like 5-10%. The CPU usage is the most noticable, particularly at the higher IO rates.
On Wed, Dec 11, 2019 at 5:29 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> I'd very much argue that it IS a bug, maybe just doesn't show on your
> system.

Oh, I agree. But I also understand why people hadn't noticed, and I don't think it's all that critical - because if you do 1M iops cached, you're doing something really really strange.

I too can see xas_create using 30% CPU time, but that's when I do a perf record on just kswapd - and when I actually look at it on a system level, it looks nowhere near that bad.

So I think people should look at this. Part of it might be for Willy: does that xas_create() need to be that expensive? I hate how "perf" callchains work, but it does look like it is probably page_cache_delete -> xas_store -> xas_create that is the bulk of the cost there.

Replacing the real page with the shadow entry shouldn't be that big of a deal, I would really hope.

Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace thing, is there perhaps some simple optimization that could be done on the XArray case here?

Linus
On 12/11/19 6:29 PM, Jens Axboe wrote: > On 12/11/19 6:22 PM, Linus Torvalds wrote: >> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>> >>> 15K is likely too slow to really show an issue, I'm afraid. The 970 >>> is no slouch, but your crypt setup will likely hamper it a lot. You >>> don't have a non-encrypted partition on it? >> >> No. I normally don't need all that much disk, so I've never upgraded >> my ssd from the 512G size. >> >> Which means that it's actually half full or so, and I never felt like >> "I should keep an unencrypted partition for IO testing", since I don't >> generally _do_ any IO testing. >> >> I can get my load up with "numjobs=8" and get my iops up to the 100k >> range, though. >> >> But kswapd doesn't much seem to care, the CPU percentage actually does >> _down_ to 0.39% when I try that. Probably simply because now my CPU's >> are busy, so they are running at 4.7Ghz instead of the 800Mhz "mostly >> idle" state ... >> >> I guess I should be happy. It does mean that the situation you see >> isn't exactly the normal case. I understand why you want to do the >> non-cached case, but the case I think it the worrisome one is the >> regular buffered one, so that's what I'm testing (not even trying the >> noaccess patches). >> >> So from your report I went "uhhuh, that sounds like a bug". And it >> appears that it largely isn't - you're seeing it because of pushing >> the IO subsystem by another order of magnitude (and then I agree that >> "under those kinds of IO loads, caching just won't help") > > I'd very much argue that it IS a bug, maybe just doesn't show on your > system. My test box is a pretty standard 2 socket system, 24 cores / 48 > threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing > crazy, and granted that's only 10% kswapd cpu time, but that still seems > very high for those kinds of rates. I'm surprised you see essentially no > kswapd time for the same data rate. > > We'll keep poking here, I know Johannes is spending some time looking > into the reclaim side. Out of curiosity, just tried it on my laptop, which also has some samsung drive. Using 8 jobs, I get around 100K IOPS too, and this is my top listing: 23308 axboe 20 0 623156 1304 8 D 10.3 0.0 0:03.81 fio 23309 axboe 20 0 623160 1304 8 D 10.3 0.0 0:03.81 fio 23311 axboe 20 0 623168 1304 8 D 10.3 0.0 0:03.82 fio 23313 axboe 20 0 623176 1304 8 D 10.3 0.0 0:03.82 fio 23314 axboe 20 0 623180 1304 8 D 10.3 0.0 0:03.81 fio 162 root 20 0 0 0 0 S 9.9 0.0 0:12.97 kswapd0 23307 axboe 20 0 623152 1304 8 D 9.9 0.0 0:03.84 fio 23310 axboe 20 0 623164 1304 8 D 9.9 0.0 0:03.81 fio 23312 axboe 20 0 623172 1304 8 D 9.9 0.0 0:03.80 fio kswapd is between 9-11% the whole time, and the profile looks very similar to what I saw on my test box: 35.79% kswapd0 [kernel.vmlinux] [k] xas_create 9.97% kswapd0 [kernel.vmlinux] [k] free_pcppages_bulk 9.94% kswapd0 [kernel.vmlinux] [k] isolate_lru_pages 7.78% kswapd0 [kernel.vmlinux] [k] shrink_page_list 3.78% kswapd0 [kernel.vmlinux] [k] xas_clear_mark 3.08% kswapd0 [kernel.vmlinux] [k] workingset_eviction 2.48% kswapd0 [kernel.vmlinux] [k] __isolate_lru_page 2.06% kswapd0 [kernel.vmlinux] [k] page_mapping 1.95% kswapd0 [kernel.vmlinux] [k] __remove_mapping So now I'm even more puzzled why your (desktop?) doesn't show it, it must be more potent than my x1 laptop. But for me, the laptop and 2 socket test box show EXACTLY the same behavior, laptop is just too slow to make it really pathological.
On Wed, Dec 11, 2019 at 5:41 PM Jens Axboe <axboe@kernel.dk> wrote: > > So now I'm even more puzzled why your (desktop?) doesn't show it, it > must be more potent than my x1 laptop. But for me, the laptop and 2 > socket test box show EXACTLY the same behavior, laptop is just too slow > to make it really pathological. I see the exact same thing. And now, go do "perf record -a sleep 100" while this is going on, to see the big picture. Suddenly that number is a lot smaller. Because the kswapd cost is less than 10% of one cpu, and the xas_create is 30% of that. IOW, it's not all that dominant until you zoom in on it. Considering the amount of IO going on, it's kind of understandable. But as mentioned, it does look like something that should be fixed. Maybe it's even something where it's a bad XArray case? Willy? Linus
On Wed, Dec 11, 2019 at 05:41:16PM -0800, Linus Torvalds wrote: > I too can see xas_create using 30% CPU time, but that's when I do a > perf record on just kswapd - and when I actually look at it on a > system level, it looks nowhere near that bad. > > So I think people should look at this. Part of it might be for Willy: > does that xas_create() need to be that expensive? I hate how "perf" > callchains work, but it does look like it is probably > page_cache_delete -> xas_store -> xas_create that is the bulk of the > cost there. > > Replacing the real page with the shadow entry shouldn't be that big of > a deal, I would really hope. > > Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace > thing, is there perhaps some simple optmization that could be done on > the XArray case here? It _should_ be the same order of complexity. Since there's already a page in the page cache, xas_create() just walks its way down to the right node calling xas_descend() and then xas_store() does the equivalent of __radix_tree_replace(). I don't see a bug that would make it more expensive than the old code ... a 10GB file is going to have four levels of radix tree node, so it shouldn't even spend that long looking for the right node.
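As a quick sanity check on the "four levels" figure: with the default XA_CHUNK_SHIFT of 6 the tree fans out 64-way per level, and a 10GB file spans 2,621,440 4kB pages, which is more than 64^3 (262,144) but less than 64^4 (16,777,216). A throwaway userspace calculation, assuming 4kB pages and the default chunk size:

#include <stdio.h>

int main(void)
{
	unsigned long long pages = (10ULL << 30) / 4096;	/* 2,621,440 page indices */
	unsigned long long span = 64;				/* indices covered by one level */
	int levels = 1;

	while (span < pages) {
		span *= 64;
		levels++;
	}
	printf("pages=%llu levels=%d\n", pages, levels);	/* prints levels=4 */
	return 0;
}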
On 12/11/19 6:09 PM, Jens Axboe wrote: > On 12/11/19 4:41 PM, Jens Axboe wrote: >> On 12/11/19 1:18 PM, Linus Torvalds wrote: >>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> $ cat /proc/meminfo | grep -i active >>>> Active: 134136 kB >>>> Inactive: 28683916 kB >>>> Active(anon): 97064 kB >>>> Inactive(anon): 4 kB >>>> Active(file): 37072 kB >>>> Inactive(file): 28683912 kB >>> >>> Yeah, that should not put pressure on some swap activity. We have 28 >>> GB of basically free inactive file data, and the VM is doing something >>> very very bad if it then doesn't just quickly free it with no real >>> drama. >>> >>> In fact, I don't think it should even trigger kswapd at all, it should >>> all be direct reclaim. Of course, some of the mm people hate that with >>> a passion, but this does look like a prime example of why it should >>> just be done. >> >> For giggles, I ran just a single thread on the file set. We're only >> doing about 100K IOPS at that point, yet when the page cache fills, >> kswapd still eats 10% cpu. That seems like a lot for something that >> slow. > > Warning, the below is from the really crazy department... > > Anyway, I took a closer look at the profiles for the uncached case. > We're spending a lot of time doing memsets (this is the xa_node init, > from the radix tree constructor), and call_rcu for the node free later > on. All wasted time, and something that meant we weren't as close to the > performance of O_DIRECT as we could be. > > So Chris and I started talking about this, and pondered "what would > happen if we simply bypassed the page cache completely?". Case in point, > see below incremental patch. We still do the page cache lookup, and use > that page to copy from if it's there. If the page isn't there, allocate > one and do IO to it, but DON'T add it to the page cache. With that, > we're almost at O_DIRECT levels of performance for the 4k read case, > without 1-2%. I think 512b would look awesome, but we're reading full > pages, so that won't really help us much. Compared to the previous > uncached method, this is 30% faster on this device. That's substantial. > > Obviously this has issues with truncate that would need to be resolved, > and it's definitely dirtier. But the performance is very enticing... Tested and cleaned a bit, and added truncate protection through inode_dio_begin()/inode_dio_end(). https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87 This is much faster than the previous page cache dance, and I _think_ we're ok as long as we block truncate and hole punching. Comments?
On 12/11/19 7:03 PM, Jens Axboe wrote: > On 12/11/19 6:09 PM, Jens Axboe wrote: >> On 12/11/19 4:41 PM, Jens Axboe wrote: >>> On 12/11/19 1:18 PM, Linus Torvalds wrote: >>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>> >>>>> $ cat /proc/meminfo | grep -i active >>>>> Active: 134136 kB >>>>> Inactive: 28683916 kB >>>>> Active(anon): 97064 kB >>>>> Inactive(anon): 4 kB >>>>> Active(file): 37072 kB >>>>> Inactive(file): 28683912 kB >>>> >>>> Yeah, that should not put pressure on some swap activity. We have 28 >>>> GB of basically free inactive file data, and the VM is doing something >>>> very very bad if it then doesn't just quickly free it with no real >>>> drama. >>>> >>>> In fact, I don't think it should even trigger kswapd at all, it should >>>> all be direct reclaim. Of course, some of the mm people hate that with >>>> a passion, but this does look like a prime example of why it should >>>> just be done. >>> >>> For giggles, I ran just a single thread on the file set. We're only >>> doing about 100K IOPS at that point, yet when the page cache fills, >>> kswapd still eats 10% cpu. That seems like a lot for something that >>> slow. >> >> Warning, the below is from the really crazy department... >> >> Anyway, I took a closer look at the profiles for the uncached case. >> We're spending a lot of time doing memsets (this is the xa_node init, >> from the radix tree constructor), and call_rcu for the node free later >> on. All wasted time, and something that meant we weren't as close to the >> performance of O_DIRECT as we could be. >> >> So Chris and I started talking about this, and pondered "what would >> happen if we simply bypassed the page cache completely?". Case in point, >> see below incremental patch. We still do the page cache lookup, and use >> that page to copy from if it's there. If the page isn't there, allocate >> one and do IO to it, but DON'T add it to the page cache. With that, >> we're almost at O_DIRECT levels of performance for the 4k read case, >> without 1-2%. I think 512b would look awesome, but we're reading full >> pages, so that won't really help us much. Compared to the previous >> uncached method, this is 30% faster on this device. That's substantial. >> >> Obviously this has issues with truncate that would need to be resolved, >> and it's definitely dirtier. But the performance is very enticing... > > Tested and cleaned a bit, and added truncate protection through > inode_dio_begin()/inode_dio_end(). > > https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87 > > This is much faster than the previous page cache dance, and I _think_ > we're ok as long as we block truncate and hole punching. > > Comments? Incremental was unnecessarily hard to read, here's the original RWF_UNCACHED with it folded in instead for easier reading: commit d1c9eec7b958760ea4d4e75483b0f2e3cfec53d5 Author: Jens Axboe <axboe@kernel.dk> Date: Thu Dec 5 17:45:00 2019 -0700 fs: add read support for RWF_UNCACHED If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private pages for the buffered reads. These pages will never be inserted into the page cache, and they are simply droped when we have done the copy at the end of IO. If pages in the read range are already in the page cache, then use those for just copying the data instead of starting IO on private pages. A previous solution used the page cache even for non-cached ranges, but the cost of doing so was too high. 
Removing nodes at the end is expensive, even with LRU bypass. On top of that, repeatedly instantiating new xarray nodes is very costly, as it needs to memset 576 bytes of data, and freeing said nodes involve an RCU call per node as well. All that adds up, making uncached somewhat slower than O_DIRECT. With the current solition, we're basically at O_DIRECT levels of performance for RWF_UNCACHED IO. Protect against truncate the same way O_DIRECT does, by calling inode_dio_begin() to elevate the inode->i_dio_count. Signed-off-by: Jens Axboe <axboe@kernel.dk> diff --git a/include/linux/fs.h b/include/linux/fs.h index 98e0349adb52..092ea2a4319b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -314,6 +314,7 @@ enum rw_hint { #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) #define IOCB_NOWAIT (1 << 7) +#define IOCB_UNCACHED (1 << 8) struct kiocb { struct file *ki_filp; @@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); if (flags & RWF_APPEND) ki->ki_flags |= IOCB_APPEND; + if (flags & RWF_UNCACHED) + ki->ki_flags |= IOCB_UNCACHED; return 0; } diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 379a612f8f1d..357ebb0e0c5d 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* drop cache after reading or writing data */ +#define RWF_UNCACHED ((__force __kernel_rwf_t)0x00000040) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_UNCACHED) #endif /* _UAPI_LINUX_FS_H */ diff --git a/mm/filemap.c b/mm/filemap.c index bf6aa30be58d..5d299d69b185 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1990,6 +1990,13 @@ static void shrink_readahead_size_eio(struct file *filp, ra->ra_pages /= 4; } +static void buffered_put_page(struct page *page, bool clear_mapping) +{ + if (clear_mapping) + page->mapping = NULL; + put_page(page); +} + /** * generic_file_buffered_read - generic file read routine * @iocb: the iocb to read @@ -2013,6 +2020,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, struct address_space *mapping = filp->f_mapping; struct inode *inode = mapping->host; struct file_ra_state *ra = &filp->f_ra; + bool did_dio_begin = false; loff_t *ppos = &iocb->ki_pos; pgoff_t index; pgoff_t last_index; @@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, offset = *ppos & ~PAGE_MASK; for (;;) { + bool clear_mapping = false; struct page *page; pgoff_t end_index; loff_t isize; @@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, if (!page) { if (iocb->ki_flags & IOCB_NOWAIT) goto would_block; + /* UNCACHED implies no read-ahead */ + if (iocb->ki_flags & IOCB_UNCACHED) { + did_dio_begin = true; + /* block truncate for UNCACHED reads */ + inode_dio_begin(inode); + goto no_cached_page; + } page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); @@ -2106,7 +2122,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, isize = i_size_read(inode); end_index = (isize - 1) >> PAGE_SHIFT; if (unlikely(!isize || index > end_index)) { - put_page(page); + buffered_put_page(page, clear_mapping); goto out; } @@ -2115,7 +2131,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, if (index == end_index) { nr = ((isize - 1) & ~PAGE_MASK) + 1; if (nr <= 
offset) { - put_page(page); + buffered_put_page(page, clear_mapping); goto out; } } @@ -2147,7 +2163,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, offset &= ~PAGE_MASK; prev_offset = offset; - put_page(page); + buffered_put_page(page, clear_mapping); written += ret; if (!iov_iter_count(iter)) goto out; @@ -2189,7 +2205,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, if (unlikely(error)) { if (error == AOP_TRUNCATED_PAGE) { - put_page(page); + buffered_put_page(page, clear_mapping); error = 0; goto find_page; } @@ -2206,7 +2222,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, * invalidate_mapping_pages got it */ unlock_page(page); - put_page(page); + buffered_put_page(page, clear_mapping); goto find_page; } unlock_page(page); @@ -2221,7 +2237,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, readpage_error: /* UHHUH! A synchronous read error occurred. Report it */ - put_page(page); + buffered_put_page(page, clear_mapping); goto out; no_cached_page: @@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, error = -ENOMEM; goto out; } - error = add_to_page_cache_lru(page, mapping, index, + if (iocb->ki_flags & IOCB_UNCACHED) { + __SetPageLocked(page); + page->mapping = mapping; + page->index = index; + clear_mapping = true; + goto readpage; + } + + error = add_to_page_cache(page, mapping, index, mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error) { put_page(page); @@ -2250,6 +2274,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, would_block: error = -EAGAIN; out: + if (did_dio_begin) + inode_dio_end(inode); ra->prev_pos = prev_index; ra->prev_pos <<= PAGE_SHIFT; ra->prev_pos |= prev_offset;
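For completeness, the io_uring side of the fio job earlier in the thread amounts to setting the per-IO rwf flags on the SQE. A rough liburing-based sketch, assuming a kernel with this patch applied (RWF_UNCACHED is the value from the patch, not a mainline header):

#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>
#include <liburing.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* from this patchset */
#endif

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	fd = open("/data/file1", O_RDONLY);	/* file from the repro setup */
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	sqe->rw_flags = RWF_UNCACHED;		/* per-IO rwf flags, as with preadv2() */

	io_uring_submit(&ring);
	if (!io_uring_wait_cqe(&ring, &cqe)) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}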
On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote: > Tested and cleaned a bit, and added truncate protection through > inode_dio_begin()/inode_dio_end(). > > https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87 > > This is much faster than the previous page cache dance, and I _think_ > we're ok as long as we block truncate and hole punching. What about races between UNCACHED and regular buffered IO? Seems like we might end up sending overlapping writes to the device?
On 12/11/19 7:21 PM, Matthew Wilcox wrote: > On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote: >> Tested and cleaned a bit, and added truncate protection through >> inode_dio_begin()/inode_dio_end(). >> >> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87 >> >> This is much faster than the previous page cache dance, and I _think_ >> we're ok as long as we block truncate and hole punching. > > What about races between UNCACHED and regular buffered IO? Seems like > we might end up sending overlapping writes to the device? The UNCACHED out-of-page-cache IO is just for reads, it's not writes. The write part is still using the same approach as earlier, though now I bet it looks kinda slow compared to the read side...
On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote: > > It _should_ be the same order of complexity. Since there's already > a page in the page cache, xas_create() just walks its way down to the > right node calling xas_descend() and then xas_store() does the equivalent > of __radix_tree_replace(). I don't see a bug that would make it more > expensive than the old code ... a 10GB file is going to have four levels > of radix tree node, so it shouldn't even spend that long looking for > the right node. The profile seems to say that 85% of the cost of xas_create() is two instructions: # lib/xarray.c:143: return (index >> node->shift) & XA_CHUNK_MASK; movzbl (%rsi), %ecx # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13] ... # ./include/linux/xarray.h:1145: return rcu_dereference_check(node->slots[offset], # ./include/linux/compiler.h:199: __READ_ONCE_SIZE; movq (%rax), %rax # MEM[(volatile __u64 *)_80], <retval> where that first load is "node->shift", and the second load seems to be just the node dereference. I think both of them are basically just xas_descend(), but it's a bit hard to read the several levels of inlining. I suspect the real issue is that going through kswapd means we've lost almost all locality, and with the random walking the LRU list is basically entirely unordered so you don't get any advantage of the xarray having (otherwise) possibly good cache patterns. So we're just missing in the caches all the time, making it expensive. Linus
Hi Jens. Jens Axboe - 11.12.19, 16:29:38 CET: > Recently someone asked me how io_uring buffered IO compares to mmaped > IO in terms of performance. So I ran some tests with buffered IO, and > found the experience to be somewhat painful. The test case is pretty > basic, random reads over a dataset that's 10x the size of RAM. > Performance starts out fine, and then the page cache fills up and we > hit a throughput cliff. CPU usage of the IO threads go up, and we have > kswapd spending 100% of a core trying to keep up. Seeing that, I was > reminded of the many complaints I here about buffered IO, and the > fact that most of the folks complaining will ultimately bite the > bullet and move to O_DIRECT to just get the kernel out of the way. > > But I don't think it needs to be like that. Switching to O_DIRECT > isn't always easily doable. The buffers have different life times, > size and alignment constraints, etc. On top of that, mixing buffered > and O_DIRECT can be painful. > > Seems to me that we have an opportunity to provide something that sits > somewhere in between buffered and O_DIRECT, and this is where > RWF_UNCACHED enters the picture. If this flag is set on IO, we get > the following behavior: > > - If the data is in cache, it remains in cache and the copy (in or > out) is served to/from that. > > - If the data is NOT in cache, we add it while performing the IO. When > the IO is done, we remove it again. > > With this, I can do 100% smooth buffered reads or writes without > pushing the kernel to the state where kswapd is sweating bullets. In > fact it doesn't even register. A question from a user or Linux Performance trainer perspective: How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED that for example the nocache¹ command is using? Excerpt from manpage posix_fadvice(2): POSIX_FADV_DONTNEED The specified data will not be accessed in the near future. POSIX_FADV_DONTNEED attempts to free cached pages as‐ sociated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead. [1] packaged in Debian as nocache or available herehttps://github.com/ Feh/nocache In any way, would be nice to have some option in rsync… I still did not change my backup script to call rsync via nocache. Thanks, Martin > Comments appreciated! This should work on any standard file system, > using either the generic helpers or iomap. I have tested ext4 and xfs > for the right read/write behavior, but no further validation has been > done yet. Patches are against current git, and can also be found here: > > https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached > > fs/ceph/file.c | 2 +- > fs/dax.c | 2 +- > fs/ext4/file.c | 2 +- > fs/iomap/apply.c | 26 ++++++++++- > fs/iomap/buffered-io.c | 54 ++++++++++++++++------- > fs/iomap/direct-io.c | 3 +- > fs/iomap/fiemap.c | 5 ++- > fs/iomap/seek.c | 6 ++- > fs/iomap/swapfile.c | 2 +- > fs/nfs/file.c | 2 +- > include/linux/fs.h | 7 ++- > include/linux/iomap.h | 10 ++++- > include/uapi/linux/fs.h | 5 ++- > mm/filemap.c | 95 > ++++++++++++++++++++++++++++++++++++----- 14 files changed, 181 > insertions(+), 40 deletions(-) > > Changes since v2: > - Rework the write side according to Chinners suggestions. Much > cleaner this way. It does mean that we invalidate the full write > region if just ONE page (or more) had to be created, where before it > was more granular. 
I don't think that's a concern, and on the plus > side, we now no longer have to chunk invalidations into 15/16 pages > at the time. > - Cleanups > > Changes since v1: > - Switch to pagevecs for write_drop_cached_pages() > - Use page_offset() instead of manual shift > - Ensure we hold a reference on the page between calling ->write_end() > and checking the mapping on the locked page > - Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag > after the first page
On 12/12/19 3:44 AM, Martin Steigerwald wrote: > Hi Jens. > > Jens Axboe - 11.12.19, 16:29:38 CET: >> Recently someone asked me how io_uring buffered IO compares to mmaped >> IO in terms of performance. So I ran some tests with buffered IO, and >> found the experience to be somewhat painful. The test case is pretty >> basic, random reads over a dataset that's 10x the size of RAM. >> Performance starts out fine, and then the page cache fills up and we >> hit a throughput cliff. CPU usage of the IO threads go up, and we have >> kswapd spending 100% of a core trying to keep up. Seeing that, I was >> reminded of the many complaints I here about buffered IO, and the >> fact that most of the folks complaining will ultimately bite the >> bullet and move to O_DIRECT to just get the kernel out of the way. >> >> But I don't think it needs to be like that. Switching to O_DIRECT >> isn't always easily doable. The buffers have different life times, >> size and alignment constraints, etc. On top of that, mixing buffered >> and O_DIRECT can be painful. >> >> Seems to me that we have an opportunity to provide something that sits >> somewhere in between buffered and O_DIRECT, and this is where >> RWF_UNCACHED enters the picture. If this flag is set on IO, we get >> the following behavior: >> >> - If the data is in cache, it remains in cache and the copy (in or >> out) is served to/from that. >> >> - If the data is NOT in cache, we add it while performing the IO. When >> the IO is done, we remove it again. >> >> With this, I can do 100% smooth buffered reads or writes without >> pushing the kernel to the state where kswapd is sweating bullets. In >> fact it doesn't even register. > > A question from a user or Linux Performance trainer perspective: > > How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED that > for example the nocache¹ command is using? Excerpt from manpage > posix_fadvice(2): > > POSIX_FADV_DONTNEED > The specified data will not be accessed in the near > future. > > POSIX_FADV_DONTNEED attempts to free cached pages as‐ > sociated with the specified region. This is useful, > for example, while streaming large files. A program > may periodically request the kernel to free cached > data that has already been used, so that more useful > cached pages are not discarded instead. > > [1] packaged in Debian as nocache or available herehttps://github.com/ > Feh/nocache > > In any way, would be nice to have some option in rsync… I still did not > change my backup script to call rsync via nocache. I don't know the nocache tool, but I'm guessing it just does the writes (or reads) and then uses FADV_DONTNEED to drop behind those pages? That's fine for slower use cases, it won't work very well for fast IO. The write side currently works pretty much like that internally, whereas the read side doesn't use the page cache at all.
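For reference, a rough sketch of the drop-behind pattern described above: buffered writes, then sync_file_range() to start writeback, then POSIX_FADV_DONTNEED on the already-written range so it does not linger in the page cache. File name, chunk size, and loop count are arbitrary and error handling is trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[1 << 20];                /* 1MB chunks */
    off_t off = 0;
    int fd = open("dump.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return 1;

    memset(buf, 0xab, sizeof(buf));
    for (int i = 0; i < 64; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            break;
        /* kick off writeback for the chunk we just wrote ... */
        sync_file_range(fd, off, sizeof(buf), SYNC_FILE_RANGE_WRITE);
        /* ... and drop the previous chunk, which is likely clean by now */
        if (off >= (off_t)sizeof(buf))
            posix_fadvise(fd, off - sizeof(buf), sizeof(buf),
                          POSIX_FADV_DONTNEED);
        off += sizeof(buf);
    }
    close(fd);
    return 0;
}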
On Wed, Dec 11, 2019 at 06:47:25PM -0800, Linus Torvalds wrote: > On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote: > > It _should_ be the same order of complexity. Since there's already > > a page in the page cache, xas_create() just walks its way down to the > > right node calling xas_descend() and then xas_store() does the equivalent > > of __radix_tree_replace(). I don't see a bug that would make it more > > expensive than the old code ... a 10GB file is going to have four levels > > of radix tree node, so it shouldn't even spend that long looking for > > the right node. > > The profile seems to say that 85% of the cost of xas_create() is two > instructions: > > # lib/xarray.c:143: return (index >> node->shift) & XA_CHUNK_MASK; > movzbl (%rsi), %ecx # MEM[(unsigned char *)node_13], > MEM[(unsigned char *)node_13] > ... > # ./include/linux/xarray.h:1145: return > rcu_dereference_check(node->slots[offset], > # ./include/linux/compiler.h:199: __READ_ONCE_SIZE; > movq (%rax), %rax # MEM[(volatile __u64 *)_80], <retval> > > where that first load is "node->shift", and the second load seems to > be just the node dereference. > > I think both of them are basically just xas_descend(), but it's a bit > hard to read the several levels of inlining. Yep, that's basically the two cache misses you get for each level of the tree. I'd expect the node->shift to be slightly more costly because sometimes the node->slots[offset] is going to be in the same cacheline or the next cacheline after node->shift. > I suspect the real issue is that going through kswapd means we've lost > almost all locality, and with the random walking the LRU list is > basically entirely unordered so you don't get any advantage of the > xarray having (otherwise) possibly good cache patterns. > > So we're just missing in the caches all the time, making it expensive. I have some bad ideas for improving it. 1. We could semi-sort the pages on the LRU list. If we know we're going to remove a bunch of pages, we could take a batch of them off the list, sort them and remove them in-order. This probably wouldn't be terribly effective. 2. We could change struct page to point to the xa_node that holds them. Looking up the page mapping would be page->xa_node->array and then offsetof(i_pages) to get the mapping. 3. Once the maple tree is ready, a 10GB file can actually be held more efficiently in a maple tree than in the radix tree. Because each level of the tree is 128 bytes and (Intel) CPUs fetch a pair of cachelines, there's only one cache miss per level of the tree. So despite the tree being deeper, there are fewer cache misses. With an average of 6 pointers per level of the tree, terminating in a dense node, we'd see: : 6 * 6 * 6 * 6 * 6 * 6 * 6 * 15 : 4199040 (a 10GB file contains 2621440 4kB pages) which is still eight cache misses, so to see an improvement, we'd need to implement another planned improvement, which is allocating a large, dense leaf node of 4kB. That would reduce the height of the tree by 2: : 6 * 6 * 6 * 6 * 6 * 500 : 3888000 4. For consecutive accesses, I'm working on allocating larger pages (16kB, 64kB, 256kB, ...). That will help by reducing the number of deletions from the page cache by a factor of 4/16/64/...
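A quick back-of-the-envelope check of the fanout arithmetic above. The fanouts (64-way radix nodes, roughly 6 useful pointers per maple node, 15- or 500-entry leaves) are taken from the mail and treated as estimates rather than exact structure sizes:

#include <stdio.h>

/* number of tree levels needed to index 'entries' items */
static int levels(unsigned long entries, unsigned long fanout,
                  unsigned long leaf)
{
    int l = 1;                       /* the leaf level itself */
    unsigned long reach = leaf;

    while (reach < entries) {
        reach *= fanout;
        l++;
    }
    return l;
}

int main(void)
{
    unsigned long pages = 10UL << 30 >> 12;     /* 4kB pages in 10GB */

    printf("pages: %lu\n", pages);
    printf("radix, 64-way, 64-entry leaf:  %d levels\n",
           levels(pages, 64, 64));
    printf("maple, 6-way, 15-entry leaf:   %d levels\n",
           levels(pages, 6, 15));
    printf("maple, 6-way, 500-entry leaf:  %d levels\n",
           levels(pages, 6, 500));
    return 0;
}

This reproduces the numbers in the mail: four radix levels, eight maple levels with the small dense leaf, and six with a 4kB leaf node.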
On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote: > > 1. We could semi-sort the pages on the LRU list. If we know we're going > to remove a bunch of pages, we could take a batch of them off the list, > sort them and remove them in-order. This probably wouldn't be terribly > effective. I don't think the sorting is relevant. Once you batch things, you already would get most of the locality advantage in the cache if it exists (and the batch isn't insanely large so that one batch already causes cache overflows). The problem - I suspect - is that we don't batch at all. Or rather, the "batching" does exist at a high level, but it's so high that there's just tons of stuff going on between single pages. It is at the shrink_page_list() level, which is pretty high up and basically does one page at a time with locking and a lot of tests for each page, and then we do "__remove_mapping()" (which does some more work) one at a time before we actually get to __delete_from_page_cache(). So it's "batched", but it's in a huge loop, and even at that huge loop level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX, which is just 32. Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other circumstances, but not necessarily in the "shrink clean inactive pages" thing. I wonder if we could just batch clean pages a _lot_ more aggressively. Yes, our batching loop is still very big and it might not help at an L1 level, but it might help in the L2, at least. In kswapd, when we have 28 GB of pages on the inactive list, a batch of 32 pages at a time is pretty small ;) > 2. We could change struct page to point to the xa_node that holds them. > Looking up the page mapping would be page->xa_node->array and then > offsetof(i_pages) to get the mapping. I don't think we have space in 'struct page', and I'm pretty sure we don't want to grow it. That's one of the more common data structures in the kernel. Linus
On Thu, Dec 12, 2019 at 10:29:02AM -0800, Linus Torvalds wrote: > On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote: > > 1. We could semi-sort the pages on the LRU list. If we know we're going > > to remove a bunch of pages, we could take a batch of them off the list, > > sort them and remove them in-order. This probably wouldn't be terribly > > effective. > > I don't think the sorting is relevant. > > Once you batch things, you already would get most of the locality > advantage in the cache if it exists (and the batch isn't insanely > large so that one batch already causes cache overflows). > > The problem - I suspect - is that we don't batch at all. Or rather, > the "batching" does exist at a high level, but it's so high that > there's just tons of stuff going on between single pages. It is at the > shrink_page_list() level, which is pretty high up and basically does > one page at a time with locking and a lot of tests for each page, and > then we do "__remove_mapping()" (which does some more work) one at a > time before we actually get to __delete_from_page_cache(). > > So it's "batched", but it's in a huge loop, and even at that huge loop > level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX, > which is just 32. > > Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other > circumstances, but not necessarily in the "shrink clean inactive > pages" thing. I wonder if we could just batch clean pages a _lot_ more > aggressively. Yes, our batching loop is still very big and it might > not help at an L1 level, but it might help in the L2, at least. > > In kswapd, when we have 28 GB of pages on the inactive list, a batch > of 32 pages at a time is pretty small ;) Yeah, that's pretty poor. I just read through it, and even if pages are in order on the page list, they're not going to batch nicely. It'd be nice to accumulate them and call delete_from_page_cache_batch(), but we need to put shadow entries in to replace them, so we'd need a variant of that which took two pagevecs. > > 2. We could change struct page to point to the xa_node that holds them. > > Looking up the page mapping would be page->xa_node->array and then > > offsetof(i_pages) to get the mapping. > > I don't think we have space in 'struct page', and I'm pretty sure we > don't want to grow it. That's one of the more common data structures > in the kernel. Oh, I wasn't clear. I meant replace page->mapping with page->xa_node. We could still get from page to mapping, but it would be an extra dereference. I did say it was a _bad_ idea.
Jens Axboe - 12.12.19, 16:16:31 CET: > On 12/12/19 3:44 AM, Martin Steigerwald wrote: > > Jens Axboe - 11.12.19, 16:29:38 CET: > >> Recently someone asked me how io_uring buffered IO compares to > >> mmaped > >> IO in terms of performance. So I ran some tests with buffered IO, > >> and > >> found the experience to be somewhat painful. The test case is > >> pretty > >> basic, random reads over a dataset that's 10x the size of RAM. > >> Performance starts out fine, and then the page cache fills up and > >> we > >> hit a throughput cliff. CPU usage of the IO threads go up, and we > >> have kswapd spending 100% of a core trying to keep up. Seeing > >> that, I was reminded of the many complaints I here about buffered > >> IO, and the fact that most of the folks complaining will > >> ultimately bite the bullet and move to O_DIRECT to just get the > >> kernel out of the way. > >> > >> But I don't think it needs to be like that. Switching to O_DIRECT > >> isn't always easily doable. The buffers have different life times, > >> size and alignment constraints, etc. On top of that, mixing > >> buffered > >> and O_DIRECT can be painful. > >> > >> Seems to me that we have an opportunity to provide something that > >> sits somewhere in between buffered and O_DIRECT, and this is where > >> RWF_UNCACHED enters the picture. If this flag is set on IO, we get > >> the following behavior: > >> > >> - If the data is in cache, it remains in cache and the copy (in or > >> out) is served to/from that. > >> > >> - If the data is NOT in cache, we add it while performing the IO. > >> When the IO is done, we remove it again. > >> > >> With this, I can do 100% smooth buffered reads or writes without > >> pushing the kernel to the state where kswapd is sweating bullets. > >> In > >> fact it doesn't even register. > > > > A question from a user or Linux Performance trainer perspective: > > > > How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED > > that for example the nocache¹ command is using? Excerpt from > > manpage> > > posix_fadvice(2): > > POSIX_FADV_DONTNEED > > > > The specified data will not be accessed in the near > > future. > > > > POSIX_FADV_DONTNEED attempts to free cached pages as‐ > > sociated with the specified region. This is useful, > > for example, while streaming large files. A program > > may periodically request the kernel to free cached > > data that has already been used, so that more useful > > cached pages are not discarded instead. > > > > [1] packaged in Debian as nocache or available > > herehttps://github.com/ Feh/nocache > > > > In any way, would be nice to have some option in rsync… I still did > > not change my backup script to call rsync via nocache. > > I don't know the nocache tool, but I'm guessing it just does the > writes (or reads) and then uses FADV_DONTNEED to drop behind those > pages? That's fine for slower use cases, it won't work very well for > fast IO. The write side currently works pretty much like that > internally, whereas the read side doesn't use the page cache at all. Yes, it does that. And yeah I saw you changed the read site to bypass the cache entirely. Also as I understand it this is for asynchronous using io uring primarily?
On 12/12/19 2:45 PM, Martin Steigerwald wrote: > Jens Axboe - 12.12.19, 16:16:31 CET: >> On 12/12/19 3:44 AM, Martin Steigerwald wrote: >>> Jens Axboe - 11.12.19, 16:29:38 CET: >>>> Recently someone asked me how io_uring buffered IO compares to >>>> mmaped >>>> IO in terms of performance. So I ran some tests with buffered IO, >>>> and >>>> found the experience to be somewhat painful. The test case is >>>> pretty >>>> basic, random reads over a dataset that's 10x the size of RAM. >>>> Performance starts out fine, and then the page cache fills up and >>>> we >>>> hit a throughput cliff. CPU usage of the IO threads go up, and we >>>> have kswapd spending 100% of a core trying to keep up. Seeing >>>> that, I was reminded of the many complaints I here about buffered >>>> IO, and the fact that most of the folks complaining will >>>> ultimately bite the bullet and move to O_DIRECT to just get the >>>> kernel out of the way. >>>> >>>> But I don't think it needs to be like that. Switching to O_DIRECT >>>> isn't always easily doable. The buffers have different life times, >>>> size and alignment constraints, etc. On top of that, mixing >>>> buffered >>>> and O_DIRECT can be painful. >>>> >>>> Seems to me that we have an opportunity to provide something that >>>> sits somewhere in between buffered and O_DIRECT, and this is where >>>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get >>>> the following behavior: >>>> >>>> - If the data is in cache, it remains in cache and the copy (in or >>>> out) is served to/from that. >>>> >>>> - If the data is NOT in cache, we add it while performing the IO. >>>> When the IO is done, we remove it again. >>>> >>>> With this, I can do 100% smooth buffered reads or writes without >>>> pushing the kernel to the state where kswapd is sweating bullets. >>>> In >>>> fact it doesn't even register. >>> >>> A question from a user or Linux Performance trainer perspective: >>> >>> How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED >>> that for example the nocache¹ command is using? Excerpt from >>> manpage> >>> posix_fadvice(2): >>> POSIX_FADV_DONTNEED >>> >>> The specified data will not be accessed in the near >>> future. >>> >>> POSIX_FADV_DONTNEED attempts to free cached pages as‐ >>> sociated with the specified region. This is useful, >>> for example, while streaming large files. A program >>> may periodically request the kernel to free cached >>> data that has already been used, so that more useful >>> cached pages are not discarded instead. >>> >>> [1] packaged in Debian as nocache or available >>> herehttps://github.com/ Feh/nocache >>> >>> In any way, would be nice to have some option in rsync… I still did >>> not change my backup script to call rsync via nocache. >> >> I don't know the nocache tool, but I'm guessing it just does the >> writes (or reads) and then uses FADV_DONTNEED to drop behind those >> pages? That's fine for slower use cases, it won't work very well for >> fast IO. The write side currently works pretty much like that >> internally, whereas the read side doesn't use the page cache at all. > > Yes, it does that. And yeah I saw you changed the read site to bypass > the cache entirely. > > Also as I understand it this is for asynchronous using io uring > primarily? Or preadv2/pwritev2, they also allow passing in RWF_* flags.
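For reference, a sketch of passing such a flag through preadv2(). RWF_UNCACHED is the flag proposed in this series and is not in released uapi headers at this point, so the value below is only a placeholder; the file name is made up:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED 0x00000040     /* placeholder, see the patch series */
#endif

int main(void)
{
    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int fd = open("datafile", O_RDONLY);        /* hypothetical file */

    if (fd < 0)
        return 1;

    ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
    if (ret < 0)
        perror("preadv2");   /* likely EOPNOTSUPP on kernels without the flag */
    else
        printf("read %zd bytes uncached\n", ret);

    close(fd);
    return 0;
}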
On Thu, Dec 12, 2019 at 7:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > I don't know the nocache tool, but I'm guessing it just does the writes > (or reads) and then uses FADV_DONTNEED to drop behind those pages? > That's fine for slower use cases, it won't work very well for fast IO. > The write side currently works pretty much like that internally, whereas > the read side doesn't use the page cache at all. Well, I think that if we have this RWF/IOC_UNCACHED flag, maybe we should make FADV_NOREUSE file descriptors just use it even for regular read/write.. Right now FADV_NOREUSE is a no-op for the kernel, I think, because we historically didn't have the facility. But FADV_DONTNEED is different. That's for dropping pages after use (ie invalidate_mapping_pages()) Linus
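For reference, how an application could already express that hint today: a whole-file POSIX_FADV_NOREUSE. As noted above the kernel currently ignores this hint, so the sketch only marks the call site that an uncached-IO wiring could honor; the file name is made up:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    if (fd < 0)
        return 1;

    /* offset 0 with len 0 covers the whole file, per posix_fadvise(2) */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    close(fd);
    return 0;
}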
On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote: > On 12/11/19 4:41 PM, Jens Axboe wrote: > > On 12/11/19 1:18 PM, Linus Torvalds wrote: > >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote: > >>> > >>> $ cat /proc/meminfo | grep -i active > >>> Active: 134136 kB > >>> Inactive: 28683916 kB > >>> Active(anon): 97064 kB > >>> Inactive(anon): 4 kB > >>> Active(file): 37072 kB > >>> Inactive(file): 28683912 kB > >> > >> Yeah, that should not put pressure on some swap activity. We have 28 > >> GB of basically free inactive file data, and the VM is doing something > >> very very bad if it then doesn't just quickly free it with no real > >> drama. > >> > >> In fact, I don't think it should even trigger kswapd at all, it should > >> all be direct reclaim. Of course, some of the mm people hate that with > >> a passion, but this does look like a prime example of why it should > >> just be done. > > > > For giggles, I ran just a single thread on the file set. We're only > > doing about 100K IOPS at that point, yet when the page cache fills, > > kswapd still eats 10% cpu. That seems like a lot for something that > > slow. > > Warning, the below is from the really crazy department... > > Anyway, I took a closer look at the profiles for the uncached case. > We're spending a lot of time doing memsets (this is the xa_node init, > from the radix tree constructor), and call_rcu for the node free later > on. All wasted time, and something that meant we weren't as close to the > performance of O_DIRECT as we could be. > > So Chris and I started talking about this, and pondered "what would > happen if we simply bypassed the page cache completely?". Case in point, > see below incremental patch. We still do the page cache lookup, and use > that page to copy from if it's there. If the page isn't there, allocate > one and do IO to it, but DON'T add it to the page cache. With that, > we're almost at O_DIRECT levels of performance for the 4k read case, > without 1-2%. I think 512b would look awesome, but we're reading full > pages, so that won't really help us much. Compared to the previous > uncached method, this is 30% faster on this device. That's substantial. Interesting idea, but this seems like it is just direct IO with kernel pages and a memcpy() rather than just mapping user pages, but has none of the advantages of direct IO in that we can run reads and writes concurrently because it's going through the buffered IO path. It also needs all the special DIO truncate/hole punch serialisation mechanisms to be propagated into the buffered IO path - the requirement for inode_dio_wait() serialisation is something I'm trying to remove from XFS, not have to add into more paths. And it introduces the same issues with other buffered read/mmap access to the same file ranges as direct IO has. > Obviously this has issues with truncate that would need to be resolved, > and it's definitely dirtier. But the performance is very enticing... At which point I have to ask: why are we considering repeating the mistakes that were made with direct IO? Yes, it might be faster than a coherent RWF_UNCACHED IO implementation, but I don't think making it more like O_DIRECT is worth the price. And, ultimately, RWF_UNCACHED will never be as fast as direct IO because it *requires* the CPU to copy the data at least once. Direct IO is zero-copy, and so it's always going to have lower overhead than RWF_UNCACHED, and so when CPU or memory bandwidth is the limiting facter, O_DIRECT will always be faster. 
IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a fool's game and attempting to do so is taking a step in the wrong direction architecturally. I'd much prefer a sane IO model for RWF_UNCACHED that provides coherency w/ mmap and other buffered IO than compromise these things in the chase for ultimate performance.

Speaking of IO path architecture, perhaps what we really need here is an iomap_apply()->iomap_read_actor loop similar to the write side. This would allow us to bypass all the complex readahead shenanigans that generic_file_buffered_read() has to deal with and directly control page cache residency and build the exact IOs we need when RWF_UNCACHED is set. This moves it much closer to the direct IO path in terms of IO setup overhead and physical IO patterns, but still has all the benefits of being fully cache coherent....

And, really, when we are talking about high end nvme drives that can do 5-10GB/s read each, and we can put 20+ of them in a single machine, there's no real value in doing readahead. i.e. there's little read IO latency to hide in the first place, and such systems have little memory bandwidth to spare to waste on readahead IO that we don't end up using... Cheers, Dave.
On 12 Dec 2019, at 17:18, Dave Chinner wrote: > On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote: >> >> So Chris and I started talking about this, and pondered "what would >> happen if we simply bypassed the page cache completely?". Case in >> point, >> see below incremental patch. We still do the page cache lookup, and >> use >> that page to copy from if it's there. If the page isn't there, >> allocate >> one and do IO to it, but DON'T add it to the page cache. With that, >> we're almost at O_DIRECT levels of performance for the 4k read case, >> without 1-2%. I think 512b would look awesome, but we're reading full >> pages, so that won't really help us much. Compared to the previous >> uncached method, this is 30% faster on this device. That's >> substantial. > > Interesting idea, but this seems like it is just direct IO with > kernel pages and a memcpy() rather than just mapping user pages, but > has none of the advantages of direct IO in that we can run reads and > writes concurrently because it's going through the buffered IO path. > > It also needs all the special DIO truncate/hole punch serialisation > mechanisms to be propagated into the buffered IO path - the > requirement for inode_dio_wait() serialisation is something I'm > trying to remove from XFS, not have to add into more paths. And it > introduces the same issues with other buffered read/mmap access to > the same file ranges as direct IO has. > >> Obviously this has issues with truncate that would need to be >> resolved, >> and it's definitely dirtier. But the performance is very enticing... > > At which point I have to ask: why are we considering repeating the > mistakes that were made with direct IO? Yes, it might be faster > than a coherent RWF_UNCACHED IO implementation, but I don't think > making it more like O_DIRECT is worth the price. > > And, ultimately, RWF_UNCACHED will never be as fast as direct IO > because it *requires* the CPU to copy the data at least once. They just have different tradeoffs. O_DIRECT actively blows away caches and can also force writes during reads, making RWF_UNCACHED a more natural fit for some applications. There are fewer surprises, and some services are willing to pay for flexibility with a memcpy. In general, they still want to do some cache management because it reduces p90+ latencies across the board, and gives them more control over which pages stay in cache. Most services using buffered IO here as part of their main workload are pairing it with sync_file_range() and sometimes fadvise DONT_NEED. We've seen kswapd saturating cores with much slower flash than the fancy stuff Jens is using, and the solution is usually O_DIRECT or fadvise. Grepping through the code shows a wonderful assortment of helpers to control the cache, and RWF_UNCACHED would be both cleaner and faster than what we have today. I'm on the fence about asking for RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without pitching pages, but we can talk to some service owners to see how useful that would be. They can always chain a sync_file_range() in io_uring, but RWF_ would be lower overhead if it were a common pattern. With all of that said, I really agree that xfs+O_DIRECT wins on write concurrency. Jens's current patches are a great first step, but I think that if he really loved us, Jens would carve up a concurrent pageless write patch series before Christmas. 
> Direct > IO is zero-copy, and so it's always going to have lower overhead > than RWF_UNCACHED, and so when CPU or memory bandwidth is the > limiting facter, O_DIRECT will always be faster. > > IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a > fool's game and attempting to do so is taking a step in the wrong > direction architecturally. I'd much prefer a sane IO model for > RWF_UNCACHED that provides coherency w/ mmap and other buffered IO > than compromise these things in the chase for ultimate performance. No matter what I wrote in my letters to Santa this year, I agree that we shouldn't compromise on avoiding the warts from O_DIRECT. -chris
On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote: > They just have different tradeoffs. O_DIRECT actively blows away caches > and can also force writes during reads, making RWF_UNCACHED a more > natural fit for some applications. There are fewer surprises, and some > services are willing to pay for flexibility with a memcpy. In general, > they still want to do some cache management because it reduces p90+ > latencies across the board, and gives them more control over which pages > stay in cache.

We can always have a variant of O_DIRECT that doesn't do that and instead checks if the data is in the cache and then also copies to / from it in that case. I need some time to actually look through this series, so it might be pretty similar to the implementation, but if defined the right way it could be concurrent for at least the fast path of no cached pages.
On 7 Jan 2020, at 12:42, Christoph Hellwig wrote: > On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote: >> They just have different tradeoffs. O_DIRECT actively blows away >> caches >> and can also force writes during reads, making RWF_UNCACHED a more >> natural fit for some applications. There are fewer surprises, and >> some >> services are willing to pay for flexibility with a memcpy. In >> general, >> they still want to do some cache management because it reduces p90+ >> latencies across the board, and gives them more control over which >> pages >> stay in cache. > > We can always have a variant of O_DIRECT that doesn't do that and > instead checks if the data is in the cache and then also copies to / from > it in that case. I need some time to actually look through this > series, > so it might be pretty similar to the implementation, but if defined > the right way it could be concurrent for at least the fast path of no > cached pages.

Yeah, I really do think we can end up with a fairly unified solution through iomap:

* Allowing concurrent writes (xfs DIO does this now)
* Optionally doing zero copy if alignment is good (btrfs DIO does this now)
* Optionally tossing pages at the end (requires a separate syscall now)
* Supporting aio via io_uring

We could just call this O_DIRECT, but I like RWF_UNCACHED as a way to avoid surprises for people that know and love the existing O_DIRECT semantics. -chris
Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote: > Grepping through the code shows a wonderful assortment of helpers to > control the cache, and RWF_UNCACHED would be both cleaner and faster > than what we have today. I'm on the fence about asking for > RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without > pitching pages, but we can talk to some service owners to see how useful > that would be. They can always chain a sync_file_range() in io_uring, > but RWF_ would be lower overhead if it were a common pattern.

FWIW, for postgres, something that'd allow us to do writes that:

a) Don't remove pages from the pagecache if they're already there.

b) Don't delay writeback to some unpredictable point later. The later write both causes latency issues and often under-utilizes write bandwidth for a while. For most cases where we write, we know that we're not likely to write the same page again soon.

c) Don't (except maybe temporarily) bring pages into the pagecache, if they weren't there before. In the cases where the page previously wasn't in the page cache and we wrote it out, it's very likely to have been resident in our cache for long enough that the kernel caching it for the future isn't useful.

would be really helpful. Right now we simulate that to some degree by doing normal buffered writes followed by sync_file_range(WRITE).

For most environments, always using O_DIRECT isn't really an option for us, as we can't rely on settings being tuned well enough (i.e. using a large enough application cache), and we also want to keep supporting setups where using a large enough postgres buffer cache isn't an option because it'd prevent putting a number of variably used database servers on one piece of hardware. (There are also postgres-side issues preventing us from doing O_DIRECT performantly, partially because we so far couldn't rely on AIO, due to also using buffered IO, but we're fixing that now.)

For us, a per-request interface where we'd have to fulfill all the requirements for O_DIRECT, but where neither reads nor writes would cause a page to move in/out of the pagecache, would be optimal for a good part of our IO. Especially when we could still get zero-copy IO for the pretty common case that there's no pagecache presence for a file at all. That'd allow us to use zero copy writes for the common case of a file's data fitting entirely in our cache, with us only occasionally writing the data out at checkpoints. And to do zero copy reads for the cases where we know it's unnecessary for the kernel to cache (e.g. because we are scanning a few TB of data on a machine with less memory, because we're re-filling our cache after a restart, or because it's a maintenance operation doing the reading). But still rely on the kernel page cache for other reads, where the kernel caching when memory is available is a huge benefit. Some well tuned workloads would turn that off to only use O_DIRECT, but everyone else would benefit from that being the default.

We can concoct an approximation of that behaviour with a mix of sync_file_range() (to force writeback), RWF_NOWAIT (to see if we should read with O_DIRECT) and mmap()/mincore()/munmap() (to determine if writes should use O_DIRECT). But that's quite a bit of overhead. The reason that specifying this on a per-request basis would be useful is mainly that it would allow us to avoid either having two sets of FDs, or having to turn O_DIRECT on/off with fcntl.

Greetings,

Andres Freund
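For reference, a rough sketch of the mincore() residency probe described above: map the range read-only, ask mincore() which pages are resident, and use the answer to decide between buffered and O_DIRECT writes. Purely illustrative; the file name is made up and error handling is minimal:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* returns 1 if any page in [off, off + len) is resident in the page cache */
static int range_in_page_cache(int fd, off_t off, size_t len)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (len + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
    int resident = 0;

    if (map != MAP_FAILED && vec && mincore(map, len, vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1) {
                resident = 1;
                break;
            }
    }
    if (map != MAP_FAILED)
        munmap(map, len);
    free(vec);
    return resident;
}

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "relfile", O_RDONLY); /* hypothetical */

    if (fd < 0)
        return 1;
    printf("first 1MB resident in page cache: %s\n",
           range_in_page_cache(fd, 0, 1 << 20) ? "yes" : "no");
    close(fd);
    return 0;
}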