Message ID | 20190808190300.GA9067@cmpxchg.org (mailing list archive)
---|---
State | New, archived
Series | [RESEND] block: annotate refault stalls from IO submission
On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is

Or the wbt is throttling.

> spent in submit_bio(). In that case, we underreport memory pressure.
>
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.

Patch looks fine to me, but it raises another question w.r.t. IO
stalls and reclaim pressure feedback to the VM: how do we make use of
the pressure stall infrastructure to track inode cache pressure and
stalls?

With congestion_wait() and wait_iff_congested() being entirely
non-functional for block devices since 5.0, there is no IO-load-based
feedback going into memory reclaim from shrinkers that might require
IO to free objects before they can be reclaimed. This is directly
analogous to page reclaim writing back dirty pages from the LRU, and
as I understand it, one of the things PSI is supposed to be tracking.

Lots of workloads create inode cache pressure, and often it can
dominate the time spent in memory reclaim, so it would seem to me that
having PSI only track/calculate pressure and stalls from LRU pages
misses a fair chunk of the memory pressure and reclaim stalls that can
be occurring.

Any thoughts on how we might be able to integrate more of the system
caches into the PSI infrastructure, Johannes?

Cheers,

Dave.
On Thu, Aug 8, 2019 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.
>
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  block/bio.c               |  3 +++
>  block/blk-core.c          | 23 ++++++++++++++++++++++-
>  include/linux/blk_types.h |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..4196865dd300 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -806,6 +806,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
>
>         bio->bi_iter.bi_size += len;
>         bio->bi_vcnt++;
> +
> +       if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
> +               bio_set_flag(bio, BIO_WORKINGSET);
>  }
>  EXPORT_SYMBOL_GPL(__bio_add_page);
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d0cc6e14d2f0..1b1705b7dde7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -36,6 +36,7 @@
>  #include <linux/blk-cgroup.h>
>  #include <linux/debugfs.h>
>  #include <linux/bpf.h>
> +#include <linux/psi.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> @@ -1128,6 +1129,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
>   */
>  blk_qc_t submit_bio(struct bio *bio)
>  {
> +       bool workingset_read = false;
> +       unsigned long pflags;
> +       blk_qc_t ret;
> +
>         if (blkcg_punt_bio_submit(bio))
>                 return BLK_QC_T_NONE;
>
> @@ -1146,6 +1151,8 @@ blk_qc_t submit_bio(struct bio *bio)
>                 if (op_is_write(bio_op(bio))) {
>                         count_vm_events(PGPGOUT, count);
>                 } else {
> +                       if (bio_flagged(bio, BIO_WORKINGSET))
> +                               workingset_read = true;
>                         task_io_account_read(bio->bi_iter.bi_size);
>                         count_vm_events(PGPGIN, count);
>                 }
> @@ -1160,7 +1167,21 @@ blk_qc_t submit_bio(struct bio *bio)
>                 }
>         }
>
> -       return generic_make_request(bio);
> +       /*
> +        * If we're reading data that is part of the userspace
> +        * workingset, count submission time as memory stall. When the
> +        * device is congested, or the submitting cgroup IO-throttled,
> +        * submission can be a significant part of overall IO time.
> +        */
> +       if (workingset_read)
> +               psi_memstall_enter(&pflags);
> +
> +       ret = generic_make_request(bio);
> +
> +       if (workingset_read)
> +               psi_memstall_leave(&pflags);
> +
> +       return ret;
>  }
>  EXPORT_SYMBOL(submit_bio);
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 1b1fa1557e68..a9dadfc16a92 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -209,6 +209,7 @@ enum {
>         BIO_BOUNCED,            /* bio is a bounce bio */
>         BIO_USER_MAPPED,        /* contains user pages */
>         BIO_NULL_MAPPED,        /* contains invalid user pages */
> +       BIO_WORKINGSET,         /* contains userspace workingset pages */
>         BIO_QUIET,              /* Make BIO Quiet */
>         BIO_CHAIN,              /* chained bio, ->bi_remaining in effect */
>         BIO_REFFED,             /* bio has elevated ->bi_cnt */
> --
> 2.22.0

Running a memory stress test with and without the patch, the change
shows up as additional recorded stall time, as intended. I did not
notice any performance regressions so far. Thanks!

Tested-by: Suren Baghdasaryan <surenb@google.com>
On Sat, Aug 10, 2019 at 08:12:48AM +1000, Dave Chinner wrote:
> On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> > psi tracks the time tasks wait for refaulting pages to become
> > uptodate, but it does not track the time spent submitting the IO. The
> > submission part can be significant if backing storage is contended or
> > when cgroup throttling (io.latency) is in effect - a lot of time is
>
> Or the wbt is throttling.
>
> > spent in submit_bio(). In that case, we underreport memory pressure.
> >
> > Annotate submit_bio() to account submission time as memory stall when
> > the bio is reading userspace workingset pages.
>
> Patch looks fine to me, but it raises another question w.r.t. IO
> stalls and reclaim pressure feedback to the VM: how do we make use
> of the pressure stall infrastructure to track inode cache pressure
> and stalls?
>
> With congestion_wait() and wait_iff_congested() being entirely
> non-functional for block devices since 5.0, there is no IO-load-based
> feedback going into memory reclaim from shrinkers that might
> require IO to free objects before they can be reclaimed. This is
> directly analogous to page reclaim writing back dirty pages from
> the LRU, and as I understand it, one of the things PSI is supposed
> to be tracking.
>
> Lots of workloads create inode cache pressure, and often it can
> dominate the time spent in memory reclaim, so it would seem to me
> that having PSI only track/calculate pressure and stalls from LRU
> pages misses a fair chunk of the memory pressure and reclaim stalls
> that can be occurring.

psi already tracks the entire reclaim operation. So if reclaim calls
into the shrinker and the shrinker scans inodes, initiates IO, or even
waits on IO, that time is accounted for as memory pressure stalling.

If you can think of asynchronous events that are initiated from
reclaim but cause indirect stalls in other contexts - contexts which
can clearly link the stall back to reclaim activity - we can annotate
them using psi_memstall_enter() / psi_memstall_leave().

In that vein, what would be great to have is a distinction between
read stalls on dentries/inodes that have never been touched before
versus those that have been recently reclaimed - analogous to cold
page faults vs refaults.

It would help psi, sure, but more importantly it would help us better
balance pressure between filesystem metadata and the data pages. We
would be able to tell the difference between a `find /' and actual
thrashing, where hot inodes are getting kicked out and reloaded
repeatedly - and we could feed that pressure back to the LRU pages to
allow the metadata caches to grow as needed.

For example, it could make sense to swap out a couple of completely
unused anonymous pages if it means we could hold the metadata
workingset fully in memory. But right now we cannot do that, because
we cannot risk swapping just because somebody runs `find /'.

I have semi-seriously talked to Josef about this before, but it wasn't
quite obvious where we could track non-residency or eviction
information for inodes, dentries, etc. Maybe you have an idea?
On Tue, Aug 13, 2019 at 01:46:25PM -0400, Johannes Weiner wrote:
> On Sat, Aug 10, 2019 at 08:12:48AM +1000, Dave Chinner wrote:
> > [...]
> > Lots of workloads create inode cache pressure, and often it can
> > dominate the time spent in memory reclaim, so it would seem to me
> > that having PSI only track/calculate pressure and stalls from LRU
> > pages misses a fair chunk of the memory pressure and reclaim stalls
> > that can be occurring.
>
> psi already tracks the entire reclaim operation. So if reclaim calls
> into the shrinker and the shrinker scans inodes, initiates IO, or even
> waits on IO, that time is accounted for as memory pressure stalling.

hmmmm - reclaim _scanning_ is considered a stall event? i.e. even if
scanning does not block, it's still accounting that _time_ as a
memory pressure stall?

I'm probably missing it, but I don't see anything in vmpressure()
that actually accounts for time spent scanning. AFAICT it accounts
for LRU objects scanned and reclaimed from memcgs, and then the
memory freed from the shrinkers is accounted only to the
sc->target_mem_cgroup once all memcgs have been iterated.

So, AFAICT, there's no time aspect to this, and the amount of
scanning that shrinkers do is not taken into account, so pressure
can't really be determined properly there. It seems like what the
shrinkers reclaim will actually give an incorrect interpretation of
pressure right now...

> If you can think of asynchronous events that are initiated from
> reclaim but cause indirect stalls in other contexts, contexts which
> can clearly link the stall back to reclaim activity, we can annotate
> them using psi_memstall_enter() / psi_memstall_leave().

Well, I was more thinking that issuing/waiting on IOs is a stall
event, not scanning.

The IO-less inode reclaim stuff for XFS really needs the main
reclaim loop to back off under heavy IO load, but we cannot put the
entire metadata writeback path under psi_memstall_enter/leave()
because:

	a) it's not linked to any user context - it's a
	per-superblock kernel thread; and

	b) it's designed to always be stalled on IO when there is
	metadata writeback pressure. That pressure most often comes
	from running out of journal space rather than memory
	pressure, and really there is no way to distinguish between
	the two from the writeback context.

Hence I don't think the vmpressure mechanism does what the memory
reclaim scanning loops really need, because they do not feed back a
clear picture of the load on the IO subsystem into the reclaim
loops...
> In that vein, what would be great to have is a distinction between
> read stalls on dentries/inodes that have never been touched before
> versus those that have been recently reclaimed - analogous to cold
> page faults vs refaults.

See my "nonblocking inode reclaim for XFS" series. It adds new
measures that the shrinkers feed back to the main reclaim loop. One
of those measures is the number of objects scanned. Shrinkers
already report the number of objects they free, but that is tossed
away and not used by the main reclaim loop.

As for cold faults vs refaults, we could only report that for
dentries - if the inode is still cached in memory, then the dentry
hits the inode cache (hot fault), otherwise it has to fetch the
inode from disk (cold fault). There is no mechanism for tracking
inodes that have been recently reclaimed - the VFS uses a hash for
tracking cached inodes, so there is nothing you can drop exceptional
entries into to track reclaim state.

That said, we /could/ do this with the XFS inode cache. It uses
radix trees to index the cache, not the VFS-level inode hash. Hence
we could place exceptional entries into the tree on reclaim to do
working set tracking, similar to the way the page cache is used to
track the working set of pages.

The other thing we could do here is similar to the dentry cache - we
often have inode metadata buffers in the buffer cache (we have a
multi-level cache hierarchy that spans most of the metadata in the
active working set in XFS), and so if we miss the inode cache we
might hit the inode buffer in the buffer cache (hot fault). If we
miss the inode cache and have to do IO to read the inodes, then it's
a cold fault.

That might be misleading, however, because the inode buffers cache
32 physical inodes, and so reading 32 sequential cold inodes would
give 1 cold fault and 31 hot faults, even though those 31 inodes
have never been referenced by the workload before, and that's not
ideal.
To complicate matters further, we can thrash the buffer cache,
resulting in cached inodes that have no backing buffer in memory.
Then, when we go to write the inode, we have to read in the inode
buffer before we can write it. This may have nothing to do with
working set thrashing, too. E.g. we have an inode that has been
referenced over and over again by the workload for several hours,
then a relatime update occurs and the inode is dirtied. When
writeback occurs, the inode buffer is nowhere to be found because it
was cycled out of the buffer cache hours ago and hasn't been
referenced since.

Hence I think we're probably best off ignoring the underlying
filesystem metadata cache for the purposes of measuring and
detecting inode cache working set thrashing...

> It would help psi, sure, but more importantly it would help us better
> balance pressure between filesystem metadata and the data pages. We
> would be able to tell the difference between a `find /' and actual
> thrashing, where hot inodes are getting kicked out and reloaded
> repeatedly - and we could feed that pressure back to the LRU pages to
> allow the metadata caches to grow as needed.

Well, hot inodes getting kicked out and immediately re-used is
something we largely already handle via caching inode buffers in the
buffer cache. So inode cache misses on XFS likely happen a lot more
often than is obvious, as we only see inode cache thrashing when we
miss the metadata cache, not just the inode cache.

> For example, it could make sense to swap out a couple of completely
> unused anonymous pages if it means we could hold the metadata
> workingset fully in memory. But right now we cannot do that, because
> we cannot risk swapping just because somebody runs `find /'.

I have workloads that run find to produce slab cache memory
pressure. On 5.2, they cause the system to swap madly because
there's no file page cache to reclaim, and so the only reclaim that
can be done is inodes/dentries and swapping anonymous pages.
And swap it does - if we don't throttle reclaim sufficiently to
allow IO to make progress, then direct reclaim ends up in priority
windup and I see lots of OOM kills on swap-in. I found quite a few
ways to end up in "reclaim doesn't throttle on IO sufficiently and
OOM kills" in the last 3-4 weeks...

> I have semi-seriously talked to Josef about this before, but it wasn't
> quite obvious where we could track non-residency or eviction
> information for inodes, dentries etc. Maybe you have an idea?

See above - I think only XFS could track working set inodes because
of its unique caching infrastructure. Dentries largely don't matter,
because dentry cache misses either hit or miss the inode cache, and
that's the working set that largely matters in terms of detecting
IO-related thrashing...

Cheers,

Dave.
On Wed, Aug 14, 2019 at 12:51:30PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 01:46:25PM -0400, Johannes Weiner wrote:
> > [...]
> > psi already tracks the entire reclaim operation. So if reclaim calls
> > into the shrinker and the shrinker scans inodes, initiates IO, or even
> > waits on IO, that time is accounted for as memory pressure stalling.
>
> hmmmm - reclaim _scanning_ is considered a stall event? i.e. even if
> scanning does not block, it's still accounting that _time_ as a
> memory pressure stall?

Yes. Reclaim doesn't need to block; the entire operation itself is an
interruption of the workload that only happens due to a lack of RAM.

Of course, as long as kswapd is just picking up one-off cache, it does
not take a whole lot of time, and it will barely register as pressure.
But as memory demand mounts and we have to look harder for unused
pages, reclaim time can become significant, even without IO.

> I'm probably missing it, but I don't see anything in vmpressure()
> that actually accounts for time spent scanning. AFAICT it accounts
> for LRU objects scanned and reclaimed from memcgs, and then the
> memory freed from the shrinkers is accounted only to the
> sc->target_mem_cgroup once all memcgs have been iterated.

vmpressure is an orthogonal feature that is based purely on reclaim
efficiency (reclaimed/scanned). psi accounting begins when we first
call into try_to_free_pages() and friends. psi_memstall_enter() marks
the task, and it's the scheduler part of psi that aggregates task
state time into pressure ratios.

> > If you can think of asynchronous events that are initiated from
> > reclaim but cause indirect stalls in other contexts, contexts which
> > can clearly link the stall back to reclaim activity, we can annotate
> > them using psi_memstall_enter() / psi_memstall_leave().
>
> Well, I was more thinking that issuing/waiting on IOs is a stall
> event, not scanning.
>
> The IO-less inode reclaim stuff for XFS really needs the main
> reclaim loop to back off under heavy IO load, but we cannot put the
> entire metadata writeback path under psi_memstall_enter/leave()
> because:
>
>	a) it's not linked to any user context - it's a
>	per-superblock kernel thread; and
>
>	b) it's designed to always be stalled on IO when there is
>	metadata writeback pressure. That pressure most often comes
>	from running out of journal space rather than memory
>	pressure, and really there is no way to distinguish between
>	the two from the writeback context.
>
> Hence I don't think the vmpressure mechanism does what the memory
> reclaim scanning loops really need, because they do not feed back a
> clear picture of the load on the IO subsystem into the reclaim
> loops...

Memory pressure metrics really seem unrelated to this problem, and
that's not what vmpressure or psi try to solve in the first place.
When you say we need better IO pressure feedback / congestion
throttling in reclaim, I can believe it, even though it's not
something we necessarily observed in our fleet.
On 8/8/19 1:03 PM, Johannes Weiner wrote:
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.
>
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.

Applied for 5.4.
diff --git a/block/bio.c b/block/bio.c
index 299a0e7651ec..4196865dd300 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -806,6 +806,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
+
+	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
+		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
diff --git a/block/blk-core.c b/block/blk-core.c
index d0cc6e14d2f0..1b1705b7dde7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -36,6 +36,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
+#include <linux/psi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -1128,6 +1129,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	bool workingset_read = false;
+	unsigned long pflags;
+	blk_qc_t ret;
+
 	if (blkcg_punt_bio_submit(bio))
 		return BLK_QC_T_NONE;
 
@@ -1146,6 +1151,8 @@ blk_qc_t submit_bio(struct bio *bio)
 		if (op_is_write(bio_op(bio))) {
 			count_vm_events(PGPGOUT, count);
 		} else {
+			if (bio_flagged(bio, BIO_WORKINGSET))
+				workingset_read = true;
 			task_io_account_read(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
@@ -1160,7 +1167,21 @@ blk_qc_t submit_bio(struct bio *bio)
 		}
 	}
 
-	return generic_make_request(bio);
+	/*
+	 * If we're reading data that is part of the userspace
+	 * workingset, count submission time as memory stall. When the
+	 * device is congested, or the submitting cgroup IO-throttled,
+	 * submission can be a significant part of overall IO time.
+	 */
+	if (workingset_read)
+		psi_memstall_enter(&pflags);
+
+	ret = generic_make_request(bio);
+
+	if (workingset_read)
+		psi_memstall_leave(&pflags);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1b1fa1557e68..a9dadfc16a92 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -209,6 +209,7 @@ enum {
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
 	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  3 +++
 block/blk-core.c          | 23 ++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)