psi: annotate refault stalls from IO submission

Message ID 20190722201337.19180-1-hannes@cmpxchg.org (mailing list archive)
State New, archived
Series psi: annotate refault stalls from IO submission

Commit Message

Johannes Weiner July 22, 2019, 8:13 p.m. UTC
psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate the submit_bio() paths (or the indirection through readpage)
for refaults and swapin to get proper psi coverage of delays there.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/extent_io.c | 14 ++++++++++++--
 fs/ext4/readpage.c   |  9 +++++++++
 fs/f2fs/data.c       |  8 ++++++++
 fs/mpage.c           |  9 +++++++++
 mm/filemap.c         | 20 ++++++++++++++++++++
 mm/page_io.c         | 11 ++++++++---
 mm/readahead.c       | 24 +++++++++++++++++++++++-
 7 files changed, 89 insertions(+), 6 deletions(-)
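
For orientation, the annotation pattern applied at each site boils down to the sketch below. It is schematic only: psi_memstall_enter()/psi_memstall_leave() and PageWorkingset() are the existing kernel interfaces, while example_read_page() is a placeholder for the various call sites in the diff, not a function from the patch:

	#include <linux/psi.h>
	#include <linux/pagemap.h>

	static int example_read_page(struct address_space *mapping,
				     struct file *filp, struct page *page)
	{
		unsigned long pflags;
		bool refault = PageWorkingset(page);	/* evicted while hot? */
		int err;

		if (refault)
			psi_memstall_enter(&pflags);	/* count submission as stall */

		err = mapping->a_ops->readpage(filp, page);

		if (refault)
			psi_memstall_leave(&pflags);

		return err;
	}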

Comments

Andrew Morton July 22, 2019, 10:26 p.m. UTC | #1
On Mon, 22 Jul 2019 16:13:37 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.

It's a somewhat broad patch.  How significant is this problem in the
real world?  Can we be confident that the end-user benefit is worth the
code changes?

> Annotate the submit_bio() paths (or the indirection through readpage)
> for refaults and swapin to get proper psi coverage of delays there.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/extent_io.c | 14 ++++++++++++--
>  fs/ext4/readpage.c   |  9 +++++++++
>  fs/f2fs/data.c       |  8 ++++++++
>  fs/mpage.c           |  9 +++++++++
>  mm/filemap.c         | 20 ++++++++++++++++++++
>  mm/page_io.c         | 11 ++++++++---
>  mm/readahead.c       | 24 +++++++++++++++++++++++-

We touch three filesystems.  Why these three?  Are all other
filesystems OK or will they need work as well?

> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
>
> ...
>
> @@ -2753,11 +2763,14 @@ static struct page *do_read_cache_page(struct address_space *mapping,
>  				void *data,
>  				gfp_t gfp)
>  {
> +	bool refault = false;
>  	struct page *page;
>  	int err;
>  repeat:
>  	page = find_get_page(mapping, index);
>  	if (!page) {
> +		unsigned long pflags;
> +

That was a bit odd.  This?

--- a/mm/filemap.c~psi-annotate-refault-stalls-from-io-submission-fix
+++ a/mm/filemap.c
@@ -2815,12 +2815,12 @@ static struct page *do_read_cache_page(s
 				void *data,
 				gfp_t gfp)
 {
-	bool refault = false;
 	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
+		bool refault = false;
 		unsigned long pflags;
 
 		page = __page_cache_alloc(gfp);
Johannes Weiner July 22, 2019, 11:35 p.m. UTC | #2
On Mon, Jul 22, 2019 at 03:26:07PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 16:13:37 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > psi tracks the time tasks wait for refaulting pages to become
> > uptodate, but it does not track the time spent submitting the IO. The
> > submission part can be significant if backing storage is contended or
> > when cgroup throttling (io.latency) is in effect - a lot of time is
> > spent in submit_bio(). In that case, we underreport memory pressure.
> 
> It's a somewhat broad patch.  How significant is this problem in the
> real world?  Can we be confident that the end-user benefit is worth the
> code changes?

The error scales with how aggressively IO is throttled compared to the
device's capability.

For example, we have system maintenance software throttled down pretty
hard on IO compared to the workload. When memory is contended, the
system software starts thrashing cache, but since the backing device
is actually pretty fast, the majority of "io time" is from injected
throttling delays during submit_bio().

As a result we barely see memory pressure, when the reality is that
there is almost no progress due to the thrashing and we should be
killing misbehaving stuff.

> > Annotate the submit_bio() paths (or the indirection through readpage)
> > for refaults and swapin to get proper psi coverage of delays there.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  fs/btrfs/extent_io.c | 14 ++++++++++++--
> >  fs/ext4/readpage.c   |  9 +++++++++
> >  fs/f2fs/data.c       |  8 ++++++++
> >  fs/mpage.c           |  9 +++++++++
> >  mm/filemap.c         | 20 ++++++++++++++++++++
> >  mm/page_io.c         | 11 ++++++++---
> >  mm/readahead.c       | 24 +++++++++++++++++++++++-
> 
> We touch three filesystems.  Why these three?  Are all other
> filesystems OK or will they need work as well?

These are the ones I found open-coding add_to_page_cache_lru()
followed by submit_bio(). All other filesystems either go through
generic code like mpage, use read_cache_pages(), or implement
->readpage only.
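
The open-coded shape in question, distilled from the ext4/f2fs hunks further down (a schematic, not a verbatim excerpt from any of the three filesystems):

	static void example_open_coded_readpages(struct address_space *mapping,
						 struct list_head *pages,
						 struct bio *bio)
	{
		gfp_t gfp = readahead_gfp_mask(mapping);

		while (!list_empty(pages)) {
			struct page *page = lru_to_page(pages);

			list_del(&page->lru);
			if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
				put_page(page);
				continue;	/* raced, already cached */
			}
			/* ... map blocks, attach the page to the private bio ... */
		}
		if (bio)
			submit_bio(bio);	/* the delay psi was missing */
	}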

> > @@ -2753,11 +2763,14 @@ static struct page *do_read_cache_page(struct address_space *mapping,
> >  				void *data,
> >  				gfp_t gfp)
> >  {
> > +	bool refault = false;
> >  	struct page *page;
> >  	int err;
> >  repeat:
> >  	page = find_get_page(mapping, index);
> >  	if (!page) {
> > +		unsigned long pflags;
> > +
> 
> That was a bit odd.  This?
> 
> --- a/mm/filemap.c~psi-annotate-refault-stalls-from-io-submission-fix
> +++ a/mm/filemap.c
> @@ -2815,12 +2815,12 @@ static struct page *do_read_cache_page(s
>  				void *data,
>  				gfp_t gfp)
>  {
> -	bool refault = false;
>  	struct page *page;
>  	int err;
>  repeat:
>  	page = find_get_page(mapping, index);
>  	if (!page) {
> +		bool refault = false;
>  		unsigned long pflags;
>  
>  		page = __page_cache_alloc(gfp);
> _
> 

It's so that when we jump to 'filler:' from outside the branch, the
'refault' variable is initialized from the first time through:

	bool refault = false;
	struct page *page;

	page = find_get_page(mapping, index);
	if (!page) {
		unsigned long pflags;

		__page_cache_alloc()
		add_to_page_cache_lru()
		refault = PageWorkingset(page);
filler:
		if (refault)
			psi_memstall_enter(&pflags);

		readpage()

		if (refault)
			psi_memstall_leave(&pflags);
	}
	lock_page()
	if (PageUptodate())
		goto out;
	goto filler;
Dave Chinner July 23, 2019, 12:02 a.m. UTC | #3
On Mon, Jul 22, 2019 at 04:13:37PM -0400, Johannes Weiner wrote:
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.
> 
> Annotate the submit_bio() paths (or the indirection through readpage)
> for refaults and swapin to get proper psi coverage of delays there.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/extent_io.c | 14 ++++++++++++--
>  fs/ext4/readpage.c   |  9 +++++++++
>  fs/f2fs/data.c       |  8 ++++++++
>  fs/mpage.c           |  9 +++++++++
>  mm/filemap.c         | 20 ++++++++++++++++++++
>  mm/page_io.c         | 11 ++++++++---
>  mm/readahead.c       | 24 +++++++++++++++++++++++-
>  7 files changed, 89 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1eb671c16ff1..2d2b3239965a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -13,6 +13,7 @@
>  #include <linux/pagevec.h>
>  #include <linux/prefetch.h>
>  #include <linux/cleancache.h>
> +#include <linux/psi.h>
>  #include "extent_io.h"
>  #include "extent_map.h"
>  #include "ctree.h"
> @@ -4267,6 +4268,9 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
>  	int nr = 0;
>  	u64 prev_em_start = (u64)-1;
> +	int ret = 0;
> +	bool refault = false;
> +	unsigned long pflags;
>  
>  	while (!list_empty(pages)) {
>  		u64 contig_end = 0;
> @@ -4281,6 +4285,10 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  				put_page(page);
>  				break;
>  			}
> +			if (PageWorkingset(page) && !refault) {
> +				psi_memstall_enter(&pflags);
> +				refault = true;
> +			}
>  
>  			pagepool[nr++] = page;
>  			contig_end = page_offset(page) + PAGE_SIZE - 1;
> @@ -4301,8 +4309,10 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  		free_extent_map(em_cached);
>  
>  	if (bio)
> -		return submit_one_bio(bio, 0, bio_flags);
> -	return 0;
> +		ret = submit_one_bio(bio, 0, bio_flags);
> +	if (refault)
> +		psi_memstall_leave(&pflags);
> +	return ret;

This all seems extremely fragile to me. Sprinkling magic,
undocumented pixie dust through the IO paths to account for
something nobody can actually determine is working correctly is a
bad idea.  People are going to break this without knowing it, nobody
is going to notice because there are no regression tests for it,
and this will all end up in frustration for users because it
constantly gets broken and doesn't work reliably.

e.g. If this is needed around all calls to ->readpage(), then please
write a readpage wrapper function and convert all the callers to use
that wrapper.

Even better: If this memstall and "refault" check is needed to
account for bio submission blocking, then page cache iteration is
the wrong place to be doing this check. It should be done entirely
in the bio code when adding pages to the bio because we'll only ever
be doing page cache read IO on page cache misses. i.e. this isn't
dependent on adding a new page to the LRU or not - if we add a new
page then we are going to be doing IO and so this does not require
magic pixie dust at the page cache iteration level.

e.g. bio_add_page_memstall() can do the working set check and then
set a flag on the bio to say it contains a memstall page. Then on
submission of the bio the memstall condition can be cleared.
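
For concreteness, a sketch of that helper might look like the following; neither bio_add_page_memstall() nor a BIO_WORKINGSET flag exists at this point, so both names are assumptions of the sketch:

	/* hypothetical: tag the bio when a refaulting page is added */
	static inline void bio_add_page_memstall(struct bio *bio,
						 struct page *page,
						 unsigned int len,
						 unsigned int off)
	{
		if (PageWorkingset(page))
			bio_set_flag(bio, BIO_WORKINGSET);	/* assumed new flag */
		__bio_add_page(bio, page, len, off);
	}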

Cheers,

-Dave.
Johannes Weiner July 23, 2019, 7:04 p.m. UTC | #4
CCing Jens for bio layer stuff

On Tue, Jul 23, 2019 at 10:02:26AM +1000, Dave Chinner wrote:
> Even better: If this memstall and "refault" check is needed to
> account for bio submission blocking, then page cache iteration is
> the wrong place to be doing this check. It should be done entirely
> in the bio code when adding pages to the bio because we'll only ever
> be doing page cache read IO on page cache misses. i.e. this isn't
> dependent on adding a new page to the LRU or not - if we add a new
> page then we are going to be doing IO and so this does not require
> magic pixie dust at the page cache iteration level

That could work. I had it at the page cache level because that's
logically where the refault occurs. But PG_workingset encodes
everything we need from the page cache layer and is available where
the actual stall occurs, so we should be able to push it down.

> e.g. bio_add_page_memstall() can do the working set check and then
> set a flag on the bio to say it contains a memstall page. Then on
> submission of the bio the memstall condition can be cleared.

A separate bio_add_page_memstall() would have all the problems you
pointed out with the original patch: it's magic, people will get it
wrong, and it'll be hard to verify and notice regressions.

How about just doing it in __bio_add_page()? PG_workingset is not
overloaded - when we see it set, we can generally and unconditionally
flag the bio as containing userspace workingset pages.

At submission time, in conjunction with the IO direction, we can
clearly tell whether we are reloading userspace workingset data,
i.e. stalling on memory.

This?

---
From 033e0c4789ef4ceefb2d8038b4e162dfb434d03d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 11 Jul 2019 16:01:40 -0400
Subject: [PATCH] psi: annotate refault stalls from IO submission

psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  3 +++
 block/blk-core.c          | 23 ++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 29cd6cf4da51..6156cb1b9c2c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -805,6 +805,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
+
+	if (PageWorkingset(page))
+		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 5d1fc8e17dd1..5993922d63fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -36,6 +36,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
+#include <linux/psi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -1127,6 +1128,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	bool workingset_read = false;
+	unsigned long pflags;
+	blk_qc_t ret;
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
@@ -1142,6 +1147,8 @@ blk_qc_t submit_bio(struct bio *bio)
 		if (op_is_write(bio_op(bio))) {
 			count_vm_events(PGPGOUT, count);
 		} else {
+			if (bio_flagged(bio, BIO_WORKINGSET))
+				workingset_read = true;
 			task_io_account_read(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
@@ -1156,7 +1163,21 @@ blk_qc_t submit_bio(struct bio *bio)
 		}
 	}
 
-	return generic_make_request(bio);
+	/*
+	 * If we're reading data that is part of the userspace
+	 * workingset, count submission time as memory stall. When the
+	 * device is congested, or the submitting cgroup IO-throttled,
+	 * submission can be a significant part of overall IO time.
+	 */
+	if (workingset_read)
+		psi_memstall_enter(&pflags);
+
+	ret = generic_make_request(bio);
+
+	if (workingset_read)
+		psi_memstall_leave(&pflags);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6a53799c3fe2..2f77e3446760 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -209,6 +209,7 @@ enum {
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
 	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
Jens Axboe July 23, 2019, 7:34 p.m. UTC | #5
On 7/23/19 1:04 PM, Johannes Weiner wrote:
> CCing Jens for bio layer stuff
> 
> On Tue, Jul 23, 2019 at 10:02:26AM +1000, Dave Chinner wrote:
>> Even better: If this memstall and "refault" check is needed to
>> account for bio submission blocking, then page cache iteration is
>> the wrong place to be doing this check. It should be done entirely
>> in the bio code when adding pages to the bio because we'll only ever
>> be doing page cache read IO on page cache misses. i.e. this isn't
>> dependent on adding a new page to the LRU or not - if we add a new
>> page then we are going to be doing IO and so this does not require
>> magic pixie dust at the page cache iteration level
> 
> That could work. I had it at the page cache level because that's
> logically where the refault occurs. But PG_workingset encodes
> everything we need from the page cache layer and is available where
> the actual stall occurs, so we should be able to push it down.
> 
>> e.g. bio_add_page_memstall() can do the working set check and then
>> set a flag on the bio to say it contains a memstall page. Then on
>> submission of the bio the memstall condition can be cleared.
> 
> A separate bio_add_page_memstall() would have all the problems you
> pointed out with the original patch: it's magic, people will get it
> wrong, and it'll be hard to verify and notice regressions.
> 
> How about just doing it in __bio_add_page()? PG_workingset is not
> overloaded - when we see it set, we can generally and unconditionally
> flag the bio as containing userspace workingset pages.
> 
> At submission time, in conjunction with the IO direction, we can
> clearly tell whether we are reloading userspace workingset data,
> i.e. stalling on memory.
> 
> This?

Not vehemently opposed to it, even if it sucks having to test page flags
in the hot path. Maybe even do:

	if (!bio_flagged(bio, BIO_WORKINGSET) && PageWorkingset(page))
		bio_set_flag(bio, BIO_WORKINGSET);

to at least avoid it for the (common?) case where multiple pages are
marked as workingset.
Johannes Weiner July 23, 2019, 8:42 p.m. UTC | #6
On Tue, Jul 23, 2019 at 01:34:50PM -0600, Jens Axboe wrote:
> On 7/23/19 1:04 PM, Johannes Weiner wrote:
> > CCing Jens for bio layer stuff
> > 
> > On Tue, Jul 23, 2019 at 10:02:26AM +1000, Dave Chinner wrote:
> >> Even better: If this memstall and "refault" check is needed to
> >> account for bio submission blocking, then page cache iteration is
> >> the wrong place to be doing this check. It should be done entirely
> >> in the bio code when adding pages to the bio because we'll only ever
> >> be doing page cache read IO on page cache misses. i.e. this isn't
> >> dependent on adding a new page to the LRU or not - if we add a new
> >> page then we are going to be doing IO and so this does not require
> >> magic pixie dust at the page cache iteration level
> > 
> > That could work. I had it at the page cache level because that's
> > logically where the refault occurs. But PG_workingset encodes
> > everything we need from the page cache layer and is available where
> > the actual stall occurs, so we should be able to push it down.
> > 
> >> e.g. bio_add_page_memstall() can do the working set check and then
> >> set a flag on the bio to say it contains a memstall page. Then on
> >> submission of the bio the memstall condition can be cleared.
> > 
> > A separate bio_add_page_memstall() would have all the problems you
> > pointed out with the original patch: it's magic, people will get it
> > wrong, and it'll be hard to verify and notice regressions.
> > 
> > How about just doing it in __bio_add_page()? PG_workingset is not
> > overloaded - when we see it set, we can generally and unconditionally
> > flag the bio as containing userspace workingset pages.
> > 
> > At submission time, in conjunction with the IO direction, we can
> > clearly tell whether we are reloading userspace workingset data,
> > i.e. stalling on memory.
> > 
> > This?
> 
> Not vehemently opposed to it, even if it sucks having to test page flags
> in the hot path.

Yeah, it's not great :/ Just seems marginally better than annotating
all the callsites and maintaining correctness there in the future.

> Maybe even do:
> 
> 	if (!bio_flagged(bio, BIO_WORKINGSET) && PageWorkingset(page))
> 		bio_set_flag(bio, BIO_WORKINGSET);
> 
> to at least avoid it for the (common?) case where multiple pages are
> marked as workingset.

Sounds good. If refaults occur, most likely the whole readahead batch
has that flag set, so I've added that. I've also marked the page test
unlikely.

This way we have no jumps in the most common path (no refaults), one
jump in the second most common (bit already set), and the double jump
only for the least likely case of hitting the first refault page in a
batch.
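
Spelled out, the branch behavior of the combined check (a reading of the line as it appears in the updated patch, not new code):

	/*
	 * 1. No refaults (most common): both tests are false; unlikely()
	 *    lets the compiler lay this out as straight fall-through.
	 * 2. Flag already set: !bio_flagged() fails and short-circuits.
	 * 3. First refault page in a batch (least common): both tests
	 *    pass and the flag gets set once for the whole bio.
	 */
	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
		bio_set_flag(bio, BIO_WORKINGSET);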

Updated patch below.

---
From 1b3888bdf075f86f226af4e350c8a88435d1fe8e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 11 Jul 2019 16:01:40 -0400
Subject: [PATCH] psi: annotate refault stalls from IO submission

psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  3 +++
 block/blk-core.c          | 23 ++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 29cd6cf4da51..4dd9ea0b068b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -805,6 +805,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
+
+	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
+		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 5d1fc8e17dd1..5993922d63fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -36,6 +36,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
+#include <linux/psi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -1127,6 +1128,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	bool workingset_read = false;
+	unsigned long pflags;
+	blk_qc_t ret;
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
@@ -1142,6 +1147,8 @@ blk_qc_t submit_bio(struct bio *bio)
 		if (op_is_write(bio_op(bio))) {
 			count_vm_events(PGPGOUT, count);
 		} else {
+			if (bio_flagged(bio, BIO_WORKINGSET))
+				workingset_read = true;
 			task_io_account_read(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
@@ -1156,7 +1163,21 @@ blk_qc_t submit_bio(struct bio *bio)
 		}
 	}
 
-	return generic_make_request(bio);
+	/*
+	 * If we're reading data that is part of the userspace
+	 * workingset, count submission time as memory stall. When the
+	 * device is congested, or the submitting cgroup IO-throttled,
+	 * submission can be a significant part of overall IO time.
+	 */
+	if (workingset_read)
+		psi_memstall_enter(&pflags);
+
+	ret = generic_make_request(bio);
+
+	if (workingset_read)
+		psi_memstall_leave(&pflags);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6a53799c3fe2..2f77e3446760 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -209,6 +209,7 @@ enum {
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
 	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
Dave Chinner July 23, 2019, 10:10 p.m. UTC | #7
On Tue, Jul 23, 2019 at 01:34:50PM -0600, Jens Axboe wrote:
> On 7/23/19 1:04 PM, Johannes Weiner wrote:
> > CCing Jens for bio layer stuff
> > 
> > On Tue, Jul 23, 2019 at 10:02:26AM +1000, Dave Chinner wrote:
> >> Even better: If this memstall and "refault" check is needed to
> >> account for bio submission blocking, then page cache iteration is
> >> the wrong place to be doing this check. It should be done entirely
> >> in the bio code when adding pages to the bio because we'll only ever
> >> be doing page cache read IO on page cache misses. i.e. this isn't
> >> dependent on adding a new page to the LRU or not - if we add a new
> >> page then we are going to be doing IO and so this does not require
> >> magic pixie dust at the page cache iteration level
> > 
> > That could work. I had it at the page cache level because that's
> > logically where the refault occurs. But PG_workingset encodes
> > everything we need from the page cache layer and is available where
> > the actual stall occurs, so we should be able to push it down.
> > 
> >> e.g. bio_add_page_memstall() can do the working set check and then
> >> set a flag on the bio to say it contains a memstall page. Then on
> >> submission of the bio the memstall condition can be cleared.
> > 
> > A separate bio_add_page_memstall() would have all the problems you
> > pointed out with the original patch: it's magic, people will get it
> > wrong, and it'll be hard to verify and notice regressions.
> > 
> > How about just doing it in __bio_add_page()? PG_workingset is not
> > overloaded - when we see it set, we can generally and unconditionally
> > flag the bio as containing userspace workingset pages.
> > 
> > At submission time, in conjunction with the IO direction, we can
> > clearly tell whether we are reloading userspace workingset data,
> > i.e. stalling on memory.
> > 
> > This?
> 
> Not vehemently opposed to it, even if it sucks having to test page flags
> in the hot path.

That's kinda why I suggested the bio_add_page_memstall() variant for
the page cache read IO paths where this check would be required.

Not fussed either way, this is much cleaner and easier to maintain
IMO....

-Dave.

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1eb671c16ff1..2d2b3239965a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -13,6 +13,7 @@ 
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/psi.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "ctree.h"
@@ -4267,6 +4268,9 @@  int extent_readpages(struct address_space *mapping, struct list_head *pages,
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
 	int nr = 0;
 	u64 prev_em_start = (u64)-1;
+	int ret = 0;
+	bool refault = false;
+	unsigned long pflags;
 
 	while (!list_empty(pages)) {
 		u64 contig_end = 0;
@@ -4281,6 +4285,10 @@  int extent_readpages(struct address_space *mapping, struct list_head *pages,
 				put_page(page);
 				break;
 			}
+			if (PageWorkingset(page) && !refault) {
+				psi_memstall_enter(&pflags);
+				refault = true;
+			}
 
 			pagepool[nr++] = page;
 			contig_end = page_offset(page) + PAGE_SIZE - 1;
@@ -4301,8 +4309,10 @@  int extent_readpages(struct address_space *mapping, struct list_head *pages,
 		free_extent_map(em_cached);
 
 	if (bio)
-		return submit_one_bio(bio, 0, bio_flags);
-	return 0;
+		ret = submit_one_bio(bio, 0, bio_flags);
+	if (refault)
+		psi_memstall_leave(&pflags);
+	return ret;
 }
 
 /*
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c916017db334..f28385900b64 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -44,6 +44,7 @@ 
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/cleancache.h>
+#include <linux/psi.h>
 
 #include "ext4.h"
 
@@ -116,6 +117,8 @@  int ext4_mpage_readpages(struct address_space *mapping,
 	int length;
 	unsigned relative_block = 0;
 	struct ext4_map_blocks map;
+	bool refault = false;
+	unsigned long pflags;
 
 	map.m_pblk = 0;
 	map.m_lblk = 0;
@@ -134,6 +137,10 @@  int ext4_mpage_readpages(struct address_space *mapping,
 			if (add_to_page_cache_lru(page, mapping, page->index,
 				  readahead_gfp_mask(mapping)))
 				goto next_page;
+			if (PageWorkingset(page) && !refault) {
+				psi_memstall_enter(&pflags);
+				refault = true;
+			}
 		}
 
 		if (page_has_buffers(page))
@@ -291,5 +298,7 @@  int ext4_mpage_readpages(struct address_space *mapping,
 	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		submit_bio(bio);
+	if (refault)
+		psi_memstall_leave(&pflags);
 	return 0;
 }
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 2d0a1a97d3fd..fe9c34247be4 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1699,6 +1699,8 @@  static int f2fs_mpage_readpages(struct address_space *mapping,
 	sector_t last_block_in_bio = 0;
 	struct inode *inode = mapping->host;
 	struct f2fs_map_blocks map;
+	bool refault = false;
+	unsigned long pflags;
 	int ret = 0;
 
 	map.m_pblk = 0;
@@ -1720,6 +1722,10 @@  static int f2fs_mpage_readpages(struct address_space *mapping,
 						  page_index(page),
 						  readahead_gfp_mask(mapping)))
 				goto next_page;
+			if (PageWorkingset(page) && !refault) {
+				psi_memstall_enter(&pflags);
+				refault = true;
+			}
 		}
 
 		ret = f2fs_read_single_page(inode, page, nr_pages, &map, &bio,
@@ -1736,6 +1742,8 @@  static int f2fs_mpage_readpages(struct address_space *mapping,
 	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		__submit_bio(F2FS_I_SB(inode), bio, DATA);
+	if (refault)
+		psi_memstall_leave(&pflags);
 	return pages ? 0 : ret;
 }
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 436a85260394..f4ef57f1ea06 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -30,6 +30,7 @@ 
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/cleancache.h>
+#include <linux/psi.h>
 #include "internal.h"
 
 /*
@@ -389,6 +390,8 @@  mpage_readpages(struct address_space *mapping, struct list_head *pages,
 		.get_block = get_block,
 		.is_readahead = true,
 	};
+	bool refault = false;
+	unsigned long pflags;
 	unsigned page_idx;
 
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
@@ -404,10 +407,16 @@  mpage_readpages(struct address_space *mapping, struct list_head *pages,
 			args.bio = do_mpage_readpage(&args);
 		}
 		put_page(page);
+		if (PageWorkingset(page) && !refault) {
+			psi_memstall_enter(&pflags);
+			refault = true;
+		}
 	}
 	BUG_ON(!list_empty(pages));
 	if (args.bio)
 		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
+	if (refault)
+		psi_memstall_leave(&pflags);
 	return 0;
 }
 EXPORT_SYMBOL(mpage_readpages);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8129eaa5f257..667fbd3f7eb2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2009,6 +2009,8 @@  static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		pgoff_t end_index;
 		loff_t isize;
 		unsigned long nr, ret;
+		unsigned long pflags;
+		bool refault;
 
 		cond_resched();
 find_page:
@@ -2157,9 +2159,17 @@  static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		 * PG_error will be set again if readpage fails.
 		 */
 		ClearPageError(page);
+
+		refault = PageWorkingset(page);
+		if (refault)
+			psi_memstall_enter(&pflags);
+
 		/* Start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
 
+		if (refault)
+			psi_memstall_leave(&pflags);
+
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
 				put_page(page);
@@ -2753,11 +2763,14 @@  static struct page *do_read_cache_page(struct address_space *mapping,
 				void *data,
 				gfp_t gfp)
 {
+	bool refault = false;
 	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
+		unsigned long pflags;
+
 		page = __page_cache_alloc(gfp);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
@@ -2770,12 +2783,19 @@  static struct page *do_read_cache_page(struct address_space *mapping,
 			return ERR_PTR(err);
 		}
 
+		refault = PageWorkingset(page);
 filler:
+		if (refault)
+			psi_memstall_enter(&pflags);
+
 		if (filler)
 			err = filler(data, page);
 		else
 			err = mapping->a_ops->readpage(data, page);
 
+		if (refault)
+			psi_memstall_leave(&pflags);
+
 		if (err < 0) {
 			put_page(page);
 			return ERR_PTR(err);
diff --git a/mm/page_io.c b/mm/page_io.c
index 24ee600f9131..e878e9559015 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -24,6 +24,7 @@ 
 #include <linux/blkdev.h>
 #include <linux/uio.h>
 #include <linux/sched/task.h>
+#include <linux/psi.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags,
@@ -354,10 +355,14 @@  int swap_readpage(struct page *page, bool synchronous)
 	struct swap_info_struct *sis = page_swap_info(page);
 	blk_qc_t qc;
 	struct gendisk *disk;
+	unsigned long pflags;
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageUptodate(page), page);
+
+	psi_memstall_enter(&pflags);
+
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
 		unlock_page(page);
@@ -371,7 +376,7 @@  int swap_readpage(struct page *page, bool synchronous)
 		ret = mapping->a_ops->readpage(swap_file, page);
 		if (!ret)
 			count_vm_event(PSWPIN);
-		return ret;
+		goto out;
 	}
 
 	ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
@@ -382,7 +387,7 @@  int swap_readpage(struct page *page, bool synchronous)
 		}
 
 		count_vm_event(PSWPIN);
-		return 0;
+		goto out;
 	}
 
 	ret = 0;
@@ -416,8 +421,8 @@  int swap_readpage(struct page *page, bool synchronous)
 	}
 	__set_current_state(TASK_RUNNING);
 	bio_put(bio);
-
 out:
+	psi_memstall_leave(&pflags);
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..a89522a053ce 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -22,6 +22,7 @@ 
 #include <linux/mm_inline.h>
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
+#include <linux/psi.h>
 
 #include "internal.h"
 
@@ -92,6 +93,9 @@  int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 	int ret = 0;
 
 	while (!list_empty(pages)) {
+		unsigned long pflags;
+		bool refault;
+
 		page = lru_to_page(pages);
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping, page->index,
@@ -101,7 +105,15 @@  int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		}
 		put_page(page);
 
+		refault = PageWorkingset(page);
+		if (refault)
+			psi_memstall_enter(&pflags);
+
 		ret = filler(data, page);
+
+		if (refault)
+			psi_memstall_leave(&pflags);
+
 		if (unlikely(ret)) {
 			read_cache_pages_invalidate_pages(mapping, pages);
 			break;
@@ -132,8 +144,18 @@  static int read_pages(struct address_space *mapping, struct file *filp,
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = lru_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+			bool refault = PageWorkingset(page);
+			unsigned long pflags;
+
+			if (refault)
+				psi_memstall_enter(&pflags);
+
 			mapping->a_ops->readpage(filp, page);
+
+			if (refault)
+				psi_memstall_leave(&pflags);
+		}
 		put_page(page);
 	}
 	ret = 0;