
mm: fix long time stall from mm_populate

Message ID 20200211001958.170261-1-minchan@kernel.org (mailing list archive)
State New, archived

Commit Message

Minchan Kim Feb. 11, 2020, 12:19 a.m. UTC
Basically, the fault handler releases mmap_sem before requesting readahead
and is then supposed to retry the page cache lookup with FAULT_FLAG_TRIED
so that it avoids the livelock of infinite retry.

However, what happens if the fault handler finds a page in the page cache
and the page has the readahead marker but is under writeback? Add one more
condition: it happens under mm_populate, which repeats faulting until it
encounters an error. Let's assemble the conditions below.

__mm_populate
for (...)
  get_user_pages(faulty_address)
    handle_mm_fault
      filemap_fault
        finds a page (PG_uptodate|PG_readahead|PG_writeback) in page cache
        it will return VM_FAULT_RETRY
  continue with faulty_address

IOW, it will repeat the fault retry logic until the page is written back,
which in the long run causes big latency spikes of several seconds.

This patch solves the issue by turning off the fault retry logic on the
second trial.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
This originated from code review; I have seen several user reports but
haven't yet confirmed this is the root cause.

 mm/gup.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox (Oracle) Feb. 11, 2020, 1:10 a.m. UTC | #1
On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
>       filemap_fault
>         find a page form page(PG_uptodate|PG_readahead|PG_writeback)

Uh ... That shouldn't be possible.

        /*
         * Same bit is used for PG_readahead and PG_reclaim.
         */
        if (PageWriteback(page))
                return;

        ClearPageReadahead(page);
Minchan Kim Feb. 11, 2020, 3:50 a.m. UTC | #2
On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> >       filemap_fault
> >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> 
> Uh ... That shouldn't be possible.

Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate page
reclaim once the writeback is done, so the page will have both flags at
the same time, and PG_reclaim can be mistaken for PG_readahead in fault
context.

> 
>         /*
>          * Same bit is used for PG_readahead and PG_reclaim.
>          */
>         if (PageWriteback(page))
>                 return;
> 
>         ClearPageReadahead(page);
>
Matthew Wilcox (Oracle) Feb. 11, 2020, 3:54 a.m. UTC | #3
On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > >       filemap_fault
> > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > 
> > Uh ... That shouldn't be possible.
> 
> Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> page reclaim when the writeback is done so the page will have both
> flags at the same time and the PG reclaim could be regarded as
> PG_readahead in fault conext.

What part of fault context can make that mistake?  The snippet I quoted
below is from page_cache_async_readahead() where it will clearly not
make that mistake.  There's a lot of code here; please don't presume I
know all the areas you're talking about.

> > 
> >         /*
> >          * Same bit is used for PG_readahead and PG_reclaim.
> >          */
> >         if (PageWriteback(page))
> >                 return;
> > 
> >         ClearPageReadahead(page);
Minchan Kim Feb. 11, 2020, 4:25 a.m. UTC | #4
On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > >       filemap_fault
> > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > 
> > > Uh ... That shouldn't be possible.
> > 
> > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > page reclaim when the writeback is done so the page will have both
> > flags at the same time and the PG reclaim could be regarded as
> > PG_readahead in fault conext.
> 
> What part of fault context can make that mistake?  The snippet I quoted
> below is from page_cache_async_readahead() where it will clearly not
> make that mistake.  There's a lot of code here; please don't presume I
> know all the areas you're talking about.

Sorry for not being clear. I am talking about filemap_fault ->
do_async_mmap_readahead.

Let's assume the page is hit in the page cache and vmf->flags is
!FAULT_FLAG_TRIED, so it calls do_async_mmap_readahead. Since the page has
PG_reclaim and PG_writeback set by shrink_page_list, it goes to

do_async_mmap_readahead
  if (PageReadahead(page))
    fpin = maybe_unlock_mmap_for_io();
    page_cache_async_readahead
      if (PageWriteback(page))
        return;
      ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
      
So, mm_populate will repeat the loop until the writeback is done.
It's just my theory; I haven't confirmed it by testing.
If I'm missing something obvious, please let me know.

Thanks!

> 
> > > 
> > >         /*
> > >          * Same bit is used for PG_readahead and PG_reclaim.
> > >          */
> > >         if (PageWriteback(page))
> > >                 return;
> > > 
> > >         ClearPageReadahead(page);
Matthew Wilcox (Oracle) Feb. 11, 2020, 12:23 p.m. UTC | #5
On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > >       filemap_fault
> > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > 
> > > > Uh ... That shouldn't be possible.
> > > 
> > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > page reclaim when the writeback is done so the page will have both
> > > flags at the same time and the PG reclaim could be regarded as
> > > PG_readahead in fault conext.
> > 
> > What part of fault context can make that mistake?  The snippet I quoted
> > below is from page_cache_async_readahead() where it will clearly not
> > make that mistake.  There's a lot of code here; please don't presume I
> > know all the areas you're talking about.
> 
> Sorry about being not clear. I am saying  filemap_fault ->
> do_async_mmap_readahead
> 
> Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> and PG_writeback by shrink_page_list, it goes to 
> 
> do_async_mmap_readahead
>   if (PageReadahead(page))
>     fpin = maybe_unlock_mmap_for_io();
>     page_cache_async_readahead
>       if (PageWriteback(page))
>         return;
>       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
>       
> So, mm_populate will repeat the loop until the writeback is done.
> It's my just theory but didn't comfirm it by the testing.
> If I miss something clear, let me know it.

Ah!  Surely the right way to fix this is ...

+++ b/mm/filemap.c
@@ -2420,7 +2420,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
                return fpin;
        if (ra->mmap_miss > 0)
                ra->mmap_miss--;
-       if (PageReadahead(page)) {
+       if (!PageWriteback(page) && PageReadahead(page)) {
                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                page_cache_async_readahead(mapping, ra, file,
                                           page, offset, ra->ra_pages);
Minchan Kim Feb. 11, 2020, 4:34 p.m. UTC | #6
On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > >       filemap_fault
> > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > 
> > > > > Uh ... That shouldn't be possible.
> > > > 
> > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > page reclaim when the writeback is done so the page will have both
> > > > flags at the same time and the PG reclaim could be regarded as
> > > > PG_readahead in fault conext.
> > > 
> > > What part of fault context can make that mistake?  The snippet I quoted
> > > below is from page_cache_async_readahead() where it will clearly not
> > > make that mistake.  There's a lot of code here; please don't presume I
> > > know all the areas you're talking about.
> > 
> > Sorry about being not clear. I am saying  filemap_fault ->
> > do_async_mmap_readahead
> > 
> > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > and PG_writeback by shrink_page_list, it goes to 
> > 
> > do_async_mmap_readahead
> >   if (PageReadahead(page))
> >     fpin = maybe_unlock_mmap_for_io();
> >     page_cache_async_readahead
> >       if (PageWriteback(page))
> >         return;
> >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> >       
> > So, mm_populate will repeat the loop until the writeback is done.
> > It's my just theory but didn't comfirm it by the testing.
> > If I miss something clear, let me know it.
> 
> Ah!  Surely the right way to fix this is ...

I'm not sure it's the right fix. Actually, I wanted to remove the PageWriteback
check in page_cache_async_readahead because I don't see the correlation. From a
design PoV, why couldn't we do readahead if the marker page is
PG_readahead|PG_writeback? The only reason I can think of is that clearing the
PG_reclaim bit delays freeing of *a page*, which seems like over-optimization
to me.

Another concern: isn't it racy? IOW, the page was !PG_writeback at the check
below in your snippet, but it was under PG_writeback in
page_cache_async_readahead, and then the IO was done before the refault reached
the code again. It could repeat *theoretically*, even though it's very hard to
hit in practice. Thus, I think it would be better to remove the PageWriteback
check from page_cache_async_readahead if we really want to go with that
approach.

However, page_cache_async_readahead has another condition to bail out:
ra_pages. I think it's also racy with fadvise or another task shrinking the
window size.

That's why I thought a second trial with the fault retry logic turned off in
the caller would fix all the potential issues at once, as the page fault
handler does.

> 
> +++ b/mm/filemap.c
> @@ -2420,7 +2420,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>                 return fpin;
>         if (ra->mmap_miss > 0)
>                 ra->mmap_miss--;
> -       if (PageReadahead(page)) {
> +       if (!PageWriteback(page) && PageReadahead(page)) {
>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>                 page_cache_async_readahead(mapping, ra, file,
>                                            page, offset, ra->ra_pages);
>
Matthew Wilcox (Oracle) Feb. 11, 2020, 5:28 p.m. UTC | #7
On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > >       filemap_fault
> > > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > 
> > > > > > Uh ... That shouldn't be possible.
> > > > > 
> > > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > > page reclaim when the writeback is done so the page will have both
> > > > > flags at the same time and the PG reclaim could be regarded as
> > > > > PG_readahead in fault conext.
> > > > 
> > > > What part of fault context can make that mistake?  The snippet I quoted
> > > > below is from page_cache_async_readahead() where it will clearly not
> > > > make that mistake.  There's a lot of code here; please don't presume I
> > > > know all the areas you're talking about.
> > > 
> > > Sorry about being not clear. I am saying  filemap_fault ->
> > > do_async_mmap_readahead
> > > 
> > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > and PG_writeback by shrink_page_list, it goes to 
> > > 
> > > do_async_mmap_readahead
> > >   if (PageReadahead(page))
> > >     fpin = maybe_unlock_mmap_for_io();
> > >     page_cache_async_readahead
> > >       if (PageWriteback(page))
> > >         return;
> > >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > >       
> > > So, mm_populate will repeat the loop until the writeback is done.
> > > It's my just theory but didn't comfirm it by the testing.
> > > If I miss something clear, let me know it.
> > 
> > Ah!  Surely the right way to fix this is ...
> 
> I'm not sure it's right fix. Actually, I wanted to remove PageWriteback check
> in page_cache_async_readahead because I don't see corelation. Why couldn't we
> do readahead if the marker page is PG_readahead|PG_writeback design PoV?
> Only reason I can think of is it makes *a page* will be delayed for freeing
> since we removed PG_reclaim bit, which would be over-optimization for me.

You're confused.  Because we have a shortage of bits in the page flags,
we use the same bit for both PageReadahead and PageReclaim.  That doesn't
mean that a page marked as PageReclaim should be treated as PageReadahead.

> Other concern is isn't it's racy? IOW, page was !PG_writeback at the check below
> in your snippet but it was under PG_writeback in page_cache_async_readahead and
> then the IO was done before refault reaching the code again. It could be repeated
> *theoretically* even though it's very hard to happen in real practice.
> Thus, I think it would be better to remove PageWriteback check from
> page_cache_async_readahead if we really want to go the approach.

PageReclaim is always cleared before PageWriteback.  eg here:

void end_page_writeback(struct page *page)
...
        if (PageReclaim(page)) {
                ClearPageReclaim(page);
                rotate_reclaimable_page(page);
        }

        if (!test_clear_page_writeback(page))
                BUG();

so if PageWriteback is clear, PageReclaim must already be observable as clear.
Minchan Kim Feb. 11, 2020, 5:57 p.m. UTC | #8
On Tue, Feb 11, 2020 at 09:28:03AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> > On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > > >       filemap_fault
> > > > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > > 
> > > > > > > Uh ... That shouldn't be possible.
> > > > > > 
> > > > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > > > page reclaim when the writeback is done so the page will have both
> > > > > > flags at the same time and the PG reclaim could be regarded as
> > > > > > PG_readahead in fault conext.
> > > > > 
> > > > > What part of fault context can make that mistake?  The snippet I quoted
> > > > > below is from page_cache_async_readahead() where it will clearly not
> > > > > make that mistake.  There's a lot of code here; please don't presume I
> > > > > know all the areas you're talking about.
> > > > 
> > > > Sorry about being not clear. I am saying  filemap_fault ->
> > > > do_async_mmap_readahead
> > > > 
> > > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > > and PG_writeback by shrink_page_list, it goes to 
> > > > 
> > > > do_async_mmap_readahead
> > > >   if (PageReadahead(page))
> > > >     fpin = maybe_unlock_mmap_for_io();
> > > >     page_cache_async_readahead
> > > >       if (PageWriteback(page))
> > > >         return;
> > > >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > > >       
> > > > So, mm_populate will repeat the loop until the writeback is done.
> > > > It's my just theory but didn't comfirm it by the testing.
> > > > If I miss something clear, let me know it.
> > > 
> > > Ah!  Surely the right way to fix this is ...
> > 
> > I'm not sure it's right fix. Actually, I wanted to remove PageWriteback check
> > in page_cache_async_readahead because I don't see corelation. Why couldn't we
> > do readahead if the marker page is PG_readahead|PG_writeback design PoV?
> > Only reason I can think of is it makes *a page* will be delayed for freeing
> > since we removed PG_reclaim bit, which would be over-optimization for me.
> 
> You're confused.  Because we have a shortage of bits in the page flags,
> we use the same bit for both PageReadahead and PageReclaim.  That doesn't
> mean that a page marked as PageReclaim should be treated as PageReadahead.

My point is: why couldn't we do readahead if the marker page is under
PG_writeback? The check has been there for a long time and you were adding one
more, so I was curious where the reasoning comes from. Let me find out why the
PageWriteback check has been in page_cache_async_readahead from the beginning:

	fe3cba17c4947, mm: share PG_readahead and PG_reclaim

The reason comes from the description:

    b) clear PG_readahead => implicit clear of PG_reclaim
            one(and only one) page will not be reclaimed in time
            it can be avoided by checking PageWriteback(page) in readahead first

The goal was to avoid delaying the freeing of the page by clearing PG_reclaim.
I feel that's over-optimization. IOW, it would be okay to lose accelerated
reclaim for one page.

> 
> > Other concern is isn't it's racy? IOW, page was !PG_writeback at the check below
> > in your snippet but it was under PG_writeback in page_cache_async_readahead and
> > then the IO was done before refault reaching the code again. It could be repeated
> > *theoretically* even though it's very hard to happen in real practice.
> > Thus, I think it would be better to remove PageWriteback check from
> > page_cache_async_readahead if we really want to go the approach.
> 
> PageReclaim is always cleared before PageWriteback.  eg here:
> 
> void end_page_writeback(struct page *page)
> ...
>         if (PageReclaim(page)) {
>                 ClearPageReclaim(page);
>                 rotate_reclaimable_page(page);
>         }
> 
>         if (!test_clear_page_writeback(page))
>                 BUG();
> 
> so if PageWriteback is clear, PageReclaim must already be observable as clear.
> 

I'm describing the livelock situation below.
It would be hard to trigger since IO is very slow, but isn't it possible
theoretically?


                 CPU 1                                                CPU 2
mm_populate
1st trial
  __get_user_pages
    handle_mm_fault
      filemap_fault
        do_async_mmap_readahead
        if (!PageWriteback(page) && PageReadahead(page)) {
          fpin = maybe_unlock_mmap_for_io
          page_cache_async_readahead
                                                                    set_page_writeback here
            if (PageWriteback(page))
              return; <- hit

                                                                    writeback completed and the page reclaimed
                                                                    ..
                                                                    ondemand readahead allocates a new page and marks it PG_readahead
2nd trial
  __get_user_pages
    handle_mm_fault
      filemap_fault
        do_async_mmap_readahead
        if (!PageWriteback(page) && PageReadahead(page)) {
          fpin = maybe_unlock_mmap_for_io
          page_cache_async_readahead
                                                                    set_page_writeback here
            if (PageWriteback(page))
              return; <- hit

                                                                    writeback completed and the page reclaimed
                                                                    ..
                                                                    ondemand readahead allocates a new page and marks it PG_readahead

3rd trial
..


Let's consider ra_pages too, as I mentioned. Isn't it another hole that can
cause such a livelock if another task suddenly resets it to zero?

void page_cache_async_readahead(..)
{
        /* no read-ahead */
        if (!ra->ra_pages)
                return;
Yang Shi Feb. 11, 2020, 6:14 p.m. UTC | #9
On Tue, Feb 11, 2020 at 9:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> > On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > > >       filemap_fault
> > > > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > >
> > > > > > > Uh ... That shouldn't be possible.
> > > > > >
> > > > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > > > page reclaim when the writeback is done so the page will have both
> > > > > > flags at the same time and the PG reclaim could be regarded as
> > > > > > PG_readahead in fault conext.
> > > > >
> > > > > What part of fault context can make that mistake?  The snippet I quoted
> > > > > below is from page_cache_async_readahead() where it will clearly not
> > > > > make that mistake.  There's a lot of code here; please don't presume I
> > > > > know all the areas you're talking about.
> > > >
> > > > Sorry about being not clear. I am saying  filemap_fault ->
> > > > do_async_mmap_readahead
> > > >
> > > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > > and PG_writeback by shrink_page_list, it goes to
> > > >
> > > > do_async_mmap_readahead
> > > >   if (PageReadahead(page))
> > > >     fpin = maybe_unlock_mmap_for_io();
> > > >     page_cache_async_readahead
> > > >       if (PageWriteback(page))
> > > >         return;
> > > >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > > >
> > > > So, mm_populate will repeat the loop until the writeback is done.
> > > > It's my just theory but didn't comfirm it by the testing.
> > > > If I miss something clear, let me know it.
> > >
> > > Ah!  Surely the right way to fix this is ...
> >
> > I'm not sure it's right fix. Actually, I wanted to remove PageWriteback check
> > in page_cache_async_readahead because I don't see corelation. Why couldn't we
> > do readahead if the marker page is PG_readahead|PG_writeback design PoV?
> > Only reason I can think of is it makes *a page* will be delayed for freeing
> > since we removed PG_reclaim bit, which would be over-optimization for me.
>
> You're confused.  Because we have a shortage of bits in the page flags,
> we use the same bit for both PageReadahead and PageReclaim.  That doesn't
> mean that a page marked as PageReclaim should be treated as PageReadahead.
>
> > Other concern is isn't it's racy? IOW, page was !PG_writeback at the check below
> > in your snippet but it was under PG_writeback in page_cache_async_readahead and
> > then the IO was done before refault reaching the code again. It could be repeated
> > *theoretically* even though it's very hard to happen in real practice.
> > Thus, I think it would be better to remove PageWriteback check from
> > page_cache_async_readahead if we really want to go the approach.
>
> PageReclaim is always cleared before PageWriteback.  eg here:
>
> void end_page_writeback(struct page *page)
> ...
>         if (PageReclaim(page)) {
>                 ClearPageReclaim(page);
>                 rotate_reclaimable_page(page);
>         }
>
>         if (!test_clear_page_writeback(page))
>                 BUG();
>
> so if PageWriteback is clear, PageReclaim must already be observable as clear.

Not sure if the below race in vmscan matters or not.

               if (PageWriteback(page)) {
                      [snip]
                        /* Case 2 above */
                        } else if (writeback_throttling_sane(sc) ||
                            !PageReclaim(page) || !may_enter_fs) {
                                /*
                                 * This is slightly racy - end_page_writeback()
                                 * might have just cleared PageReclaim, then
                                 * setting PageReclaim here end up interpreted
                                 * as PageReadahead - but that does not matter
                                 * enough to care.  What we do want is for this
                                 * page to have PageReclaim set next time memcg
                                 * reclaim reaches the tests above, so it will
                                 * then wait_on_page_writeback() to avoid OOM;
                                 * and it's also appropriate in global reclaim.
                                 */
                                SetPageReclaim(page);
                                stat->nr_writeback++;
                                goto activate_locked;


>
>
Jan Kara Feb. 12, 2020, 10:18 a.m. UTC | #10
On Tue 11-02-20 09:57:31, Minchan Kim wrote:
> On Tue, Feb 11, 2020 at 09:28:03AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> > > On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > > > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > > > >       filemap_fault
> > > > > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > > > 
> > > > > > > > Uh ... That shouldn't be possible.
> > > > > > > 
> > > > > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > > > > page reclaim when the writeback is done so the page will have both
> > > > > > > flags at the same time and the PG reclaim could be regarded as
> > > > > > > PG_readahead in fault conext.
> > > > > > 
> > > > > > What part of fault context can make that mistake?  The snippet I quoted
> > > > > > below is from page_cache_async_readahead() where it will clearly not
> > > > > > make that mistake.  There's a lot of code here; please don't presume I
> > > > > > know all the areas you're talking about.
> > > > > 
> > > > > Sorry about being not clear. I am saying  filemap_fault ->
> > > > > do_async_mmap_readahead
> > > > > 
> > > > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > > > and PG_writeback by shrink_page_list, it goes to 
> > > > > 
> > > > > do_async_mmap_readahead
> > > > >   if (PageReadahead(page))
> > > > >     fpin = maybe_unlock_mmap_for_io();
> > > > >     page_cache_async_readahead
> > > > >       if (PageWriteback(page))
> > > > >         return;
> > > > >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > > > >       
> > > > > So, mm_populate will repeat the loop until the writeback is done.
> > > > > It's my just theory but didn't comfirm it by the testing.
> > > > > If I miss something clear, let me know it.
> > > > 
> > > > Ah!  Surely the right way to fix this is ...
> > > 
> > > I'm not sure it's right fix. Actually, I wanted to remove PageWriteback check
> > > in page_cache_async_readahead because I don't see corelation. Why couldn't we
> > > do readahead if the marker page is PG_readahead|PG_writeback design PoV?
> > > Only reason I can think of is it makes *a page* will be delayed for freeing
> > > since we removed PG_reclaim bit, which would be over-optimization for me.
> > 
> > You're confused.  Because we have a shortage of bits in the page flags,
> > we use the same bit for both PageReadahead and PageReclaim.  That doesn't
> > mean that a page marked as PageReclaim should be treated as PageReadahead.
> 
> My point is why we couldn't do readahead if the marker page is under PG_writeback.

Well, as far as I'm reading the code, this shouldn't usually happen -
PageReadahead is set on a page that was preread into memory. Once it is
used for the first time (either by page fault or normal read), the readahead
logic triggers further readahead and PageReadahead gets cleared.

What could happen, though, is that the page gets written into (say just a few
bytes). That would keep PageReadahead set although the page will become
dirty and can later be written back. I don't find not triggering readahead in
this case too serious, though, since it should be very rare.

So I'd be OK with the change Matthew suggested, although I'd prefer if we
had this magic "!PageWriteback && PageReadahead" test in some helper
function (like page_should_trigger_readahead()?) with a good comment explaining
the details.

> It was there for a long time and you were adding one more so I was curious what's
> reasoning comes from. Let me find why PageWriteback check in
> page_cache_async_readahead from the beginning.
> 
> 	fe3cba17c4947, mm: share PG_readahead and PG_reclaim
> 
> The reason comes from the description
> 
>     b) clear PG_readahead => implicit clear of PG_reclaim
>             one(and only one) page will not be reclaimed in time
>             it can be avoided by checking PageWriteback(page) in readahead first
> 
> The goal was to avoid delay freeing of the page by clearing PG_reclaim.
> I'm saying that I feel it's over optimization. IOW, it would be okay to
> lose a page to be accelerated reclaim.
> 
> > 
> > > Other concern is isn't it's racy? IOW, page was !PG_writeback at the check below
> > > in your snippet but it was under PG_writeback in page_cache_async_readahead and
> > > then the IO was done before refault reaching the code again. It could be repeated
> > > *theoretically* even though it's very hard to happen in real practice.
> > > Thus, I think it would be better to remove PageWriteback check from
> > > page_cache_async_readahead if we really want to go the approach.
> > 
> > PageReclaim is always cleared before PageWriteback.  eg here:
> > 
> > void end_page_writeback(struct page *page)
> > ...
> >         if (PageReclaim(page)) {
> >                 ClearPageReclaim(page);
> >                 rotate_reclaimable_page(page);
> >         }
> > 
> >         if (!test_clear_page_writeback(page))
> >                 BUG();
> > 
> > so if PageWriteback is clear, PageReclaim must already be observable as clear.
> > 
> 
> I'm describing the livelock situation below.
> It would be hard to trigger since IO is very slow, but isn't it possible
> theoretically?
> 
> 
>                  CPU 1                                                CPU 2
> mm_populate
> 1st trial
>   __get_user_pages
>     handle_mm_fault
>       filemap_fault
>         do_async_mmap_readahead 
>         if (!PageWriteback(page) && PageReadahead(page)) {
>           fpin = maybe_unlock_mmap_for_io
>           page_cache_async_readahead
>                                                                     set_page_writeback here
>             if (PageWriteback(page))
> 	      return; <- hit
> 
>                                                                      writeback completed and reclaimed the page
> 								     ..
> 								     ondemand readahead allocates new page and mark it to PG_readahead
> 2nd trial
>  __get_user_pages
>     handle_mm_fault
>       filemap_fault
>         do_async_mmap_readahead 
>         if (!PageWriteback(page) && PageReadahead(page)) {
>           fpin = maybe_unlock_mmap_for_io
>           page_cache_async_readahead
>                                                                     set_page_writeback here
>             if (PageWriteback(page))
> 	      return; <- hit
> 
>                                                                      writeback completed and reclaimed the page
> 								     ..
> 								     ondemand readahead allocates new page and mark it to PG_readahead
> 
> 3rd trial
> ..
> 
> 
> Let's consider ra_pages too, as I mentioned. Isn't it another hole that can make
> such a livelock if another task suddenly resets it to zero?
> 
> void page_cache_async_readahead(..)
> {
>         /* no read-ahead */
>         if (!ra->ra_pages)
>                 return;

So this is definitively a bug which was already reported previously. I've
just sent out a patch to fix this which has somehow fallen through the
cracks.

Now I agree that regardless of this fix and the fix Matthew has proposed,
mm_populate() would benefit from being more robust like you suggested so
I'll check that separately (but I'm no expert in that area).

								Honza
Jan Kara Feb. 12, 2020, 10:22 a.m. UTC | #11
On Mon 10-02-20 16:19:58, Minchan Kim wrote:
> Basically, fault handler releases mmap_sem before requesting readahead
> and then it is supposed to retry lookup the page from page cache with
> FAULT_FLAG_TRIED so that it avoids the live lock of infinite retry.
> 
> However, what happens if the fault handler find a page from page
> cache and the page has readahead marker but are waiting under
> writeback? Plus one more condition, it happens under mm_populate
> which repeats faulting unless it encounters error. So let's assemble
> conditions below.
> 
> __mm_populate
> for (...)
>   get_user_pages(faluty_address)
>     handle_mm_fault
>       filemap_fault
>         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> 	it will return VM_FAULT_RETRY
>   continue with faulty_address
> 
> IOW, it will repeat fault retry logic until the page will be written
> back in the long run. It makes big spike latency of several seconds.
> 
> This patch solves the issue by turning off fault retry logic in second
> trial.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> It was orignated from code review once I have seen several user reports
> but didn't confirm yet it's the root cause.

Yes, I think the immediate problem is actually elsewhere but I agree that
__mm_populate() should follow the general protocol of retrying only once
so your change should make it more robust. The patch looks good to me, you
can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
>  mm/gup.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 1b521e0ac1de..b3f825092abf 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1196,6 +1196,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
>  	struct vm_area_struct *vma = NULL;
>  	int locked = 0;
>  	long ret = 0;
> +	bool tried = false;
>  
>  	end = start + len;
>  
> @@ -1226,14 +1227,18 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
>  		 * double checks the vma flags, so that it won't mlock pages
>  		 * if the vma was already munlocked.
>  		 */
> -		ret = populate_vma_page_range(vma, nstart, nend, &locked);
> +		ret = populate_vma_page_range(vma, nstart, nend,
> +						tried ? NULL : &locked);
>  		if (ret < 0) {
>  			if (ignore_errors) {
>  				ret = 0;
>  				continue;	/* continue at next VMA */
>  			}
>  			break;
> -		}
> +		} else if (ret == 0)
> +			tried = true;
> +		else
> +			tried = false;
>  		nend = nstart + ret * PAGE_SIZE;
>  		ret = 0;
>  	}
> -- 
> 2.25.0.225.g125e21ebc7-goog
>
Minchan Kim Feb. 12, 2020, 5:40 p.m. UTC | #12
Hello Jan,

On Wed, Feb 12, 2020 at 11:18:04AM +0100, Jan Kara wrote:
> On Tue 11-02-20 09:57:31, Minchan Kim wrote:
> > On Tue, Feb 11, 2020 at 09:28:03AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> > > > On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > > > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > > > > >       filemap_fault
> > > > > > > > > >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > > > > 
> > > > > > > > > Uh ... That shouldn't be possible.
> > > > > > > > 
> > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > page reclaim when the writeback is done, so the page will have both
> > > > flags at the same time, and PG_reclaim could be regarded as
> > > > PG_readahead in the fault context.
> > > > > > > 
> > > > > > > What part of fault context can make that mistake?  The snippet I quoted
> > > > > > > below is from page_cache_async_readahead() where it will clearly not
> > > > > > > make that mistake.  There's a lot of code here; please don't presume I
> > > > > > > know all the areas you're talking about.
> > > > > > 
> > > > > > Sorry about being not clear. I am saying  filemap_fault ->
> > > > > > do_async_mmap_readahead
> > > > > > 
> > > > > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > > > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > > > > and PG_writeback by shrink_page_list, it goes to 
> > > > > > 
> > > > > > do_async_mmap_readahead
> > > > > >   if (PageReadahead(page))
> > > > > >     fpin = maybe_unlock_mmap_for_io();
> > > > > >     page_cache_async_readahead
> > > > > >       if (PageWriteback(page))
> > > > > >         return;
> > > > > >       ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > > > > >       
> > > > > > So, mm_populate will repeat the loop until the writeback is done.
> > > > > > It's just my theory; I didn't confirm it by testing.
> > > > > > If I'm missing something, let me know.
> > > > > 
> > > > > Ah!  Surely the right way to fix this is ...
> > > > 
> > > > I'm not sure it's the right fix. Actually, I wanted to remove the PageWriteback check
> > > > in page_cache_async_readahead because I don't see the correlation. Why couldn't we
> > > > do readahead if the marker page is PG_readahead|PG_writeback, from a design PoV?
> > > > The only reason I can think of is that *a page* will be delayed for freeing
> > > > since we removed the PG_reclaim bit, which would be over-optimization for me.
> > > 
> > > You're confused.  Because we have a shortage of bits in the page flags,
> > > we use the same bit for both PageReadahead and PageReclaim.  That doesn't
> > > mean that a page marked as PageReclaim should be treated as PageReadahead.
> > 
> > My point is why we couldn't do readahead if the marker page is under PG_writeback.
> 
> Well, as far as I'm reading the code, this shouldn't usually happen -
> PageReadahead is set on a page that was preread into memory. Once it is
> used for the first time (either by page fault or normal read), readahead
> logic triggers further readahead and PageReadahead gets cleared.
> 
> What could happen though is that the page gets written into (say just a few
> bytes). That would keep PageReadahead set although the page will become
> dirty and can later be written back. I don't find skipping the readahead trigger in
> this case too serious though, since it should be very rare.

Please see pageout in vmscan.c, which sets PG_reclaim right before calling
writepage. Since PG_reclaim and PG_readahead share the same bit of the
page flags, do_async_mmap_readahead will interpret PG_reclaim as
PageReadahead, so it releases mmap_sem, but page_cache_async_readahead just
returns because of the PageWriteback check. This repeats until the writeback
is done.

> 
> So I'd be OK with the change Matthew suggested although I'd prefer if we
> had this magic "!PageWriteback && PageReadahead" test in some helper
> function (like page_should_trigger_readahead()?) with a good comment explaining
> the details.

How about this?

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..d07d602476df 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -363,8 +363,28 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+/*
+ * Since PG_readahead is shared with PG_reclaim of the page flags,
+ * PageReadahead should double check whether it's readahead marker
+ * or PG_reclaim. It could be done by PageWriteback check because
+ * PG_reclaim is always with PG_writeback.
+ */
+static inline int PageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+	return test_bit(PG_reclaim, &(page)->flags) && !PageWriteback(page);
+}
+
+static inline int TestClearPageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+	return test_and_clear_bit(PG_reclaim, &page->flags) && !PageWriteback(page);
+}
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..85b15e5a1d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -553,12 +553,6 @@ page_cache_async_readahead(struct address_space *mapping,
 	if (!ra->ra_pages)
 		return;
 
-	/*
-	 * Same bit is used for PG_readahead and PG_reclaim.
-	 */
-	if (PageWriteback(page))
-		return;
-
 	ClearPageReadahead(page);
 
 	/*
Minchan Kim Feb. 12, 2020, 5:43 p.m. UTC | #13
On Wed, Feb 12, 2020 at 11:22:06AM +0100, Jan Kara wrote:
> On Mon 10-02-20 16:19:58, Minchan Kim wrote:
> > Basically, fault handler releases mmap_sem before requesting readahead
> > and then it is supposed to retry lookup the page from page cache with
> > FAULT_FLAG_TRIED so that it avoids the live lock of infinite retry.
> > 
> > However, what happens if the fault handler find a page from page
> > cache and the page has readahead marker but are waiting under
> > writeback? Plus one more condition, it happens under mm_populate
> > which repeats faulting unless it encounters error. So let's assemble
> > conditions below.
> > 
> > __mm_populate
> > for (...)
> >   get_user_pages(faluty_address)
> >     handle_mm_fault
> >       filemap_fault
> >         find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > 	it will return VM_FAULT_RETRY
> >   continue with faulty_address
> > 
> > IOW, it will repeat fault retry logic until the page will be written
> > back in the long run. It makes big spike latency of several seconds.
> > 
> > This patch solves the issue by turning off fault retry logic in second
> > trial.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > It was orignated from code review once I have seen several user reports
> > but didn't confirm yet it's the root cause.
> 
> Yes, I think the immediate problem is actually elsewhere but I agree that
> __mm_populate() should follow the general protocol of retrying only once
> so your change should make it more robust. The patch looks good to me, you
> can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks for the review!
Matthew Wilcox (Oracle) Feb. 12, 2020, 6:28 p.m. UTC | #14
On Wed, Feb 12, 2020 at 09:40:15AM -0800, Minchan Kim wrote:
> How about this?
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1bf83c8fcaa7..d07d602476df 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -363,8 +363,28 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
>  /* PG_readahead is only used for reads; PG_reclaim is only for writes */
>  PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
>  	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
> -PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> -	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> +
> +SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> +CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> +
> +/*
> + * Since PG_readahead is shared with PG_reclaim of the page flags,
> + * PageReadahead should double check whether it's readahead marker
> + * or PG_reclaim. It could be done by PageWriteback check because
> + * PG_reclaim is always with PG_writeback.
> + */
> +static inline int PageReadahead(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> +	return test_bit(PG_reclaim, &(page)->flags) && !PageWriteback(page);

Why not ...

	return page->flags & (1UL << PG_reclaim | 1UL << PG_writeback) ==
		(1UL << PG_reclaim);

> +static inline int TestClearPageReadahead(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> +
> +	return test_and_clear_bit(PG_reclaim, &page->flags) && !PageWriteback(page);

That's definitely wrong.  It'll clear PageReclaim and then pretend it did
nothing wrong.

	return !PageWriteback(page) ||
		test_and_clear_bit(PG_reclaim, &page->flags);
Minchan Kim Feb. 12, 2020, 7:53 p.m. UTC | #15
On Wed, Feb 12, 2020 at 10:28:51AM -0800, Matthew Wilcox wrote:
> On Wed, Feb 12, 2020 at 09:40:15AM -0800, Minchan Kim wrote:
> > How about this?
> > 
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 1bf83c8fcaa7..d07d602476df 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -363,8 +363,28 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
> >  /* PG_readahead is only used for reads; PG_reclaim is only for writes */
> >  PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
> >  	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
> > -PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> > -	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> > +
> > +SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> > +CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> > +
> > +/*
> > + * Since PG_readahead is shared with PG_reclaim of the page flags,
> > + * PageReadahead should double check whether it's readahead marker
> > + * or PG_reclaim. It could be done by PageWriteback check because
> > + * PG_reclaim is always with PG_writeback.
> > + */
> > +static inline int PageReadahead(struct page *page)
> > +{
> > +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> > +	return test_bit(PG_reclaim, &(page)->flags) && !PageWriteback(page);
> 
> Why not ...
> 
> 	return page->flags & (1UL << PG_reclaim | 1UL << PG_writeback) ==
> 		(1UL << PG_reclaim);
> 
> > +static inline int TestClearPageReadahead(struct page *page)
> > +{
> > +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> > +
> > +	return test_and_clear_bit(PG_reclaim, &page->flags) && !PageWriteback(page);
> 
> That's definitely wrong.  It'll clear PageReclaim and then pretend it did
> nothing wrong.
> 
> 	return !PageWriteback(page) ||
> 		test_and_clear_bit(PG_reclaim, &page->flags);
> 

Much better. Thanks for the review, Matthew!
If there is no objection, I will send two patches to Andrew.
One makes PageReadahead more strict; the other limits the retries from mm_populate.

From 351236413beda22cb7fec1713cad4360de930188 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Wed, 12 Feb 2020 09:28:21 -0800
Subject: [PATCH] mm: make PageReadahead more strict

The PG_readahead flag shares a bit with PG_reclaim, but PG_reclaim is only
used in the write context while PG_readahead is used in the read context.

To make this clear, let's introduce a PageReadahead wrapper with a
!PageWriteback check. It makes the code clearer and lets us drop the
PageWriteback check in page_cache_async_readahead, which removes a
pointless drop of mmap_sem.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h | 28 ++++++++++++++++++++++++++--
 mm/readahead.c             |  6 ------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..f91a9b2a49bd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -363,8 +363,32 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+/*
+ * Since PG_readahead is shared with PG_reclaim of the page flags,
+ * PageReadahead should double check whether it's readahead marker
+ * or PG_reclaim. It could be done by PageWriteback check because
+ * PG_reclaim is always with PG_writeback.
+ */
+static inline int PageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+	return (page->flags & (1UL << PG_reclaim | 1UL << PG_writeback)) ==
+		(1UL << PG_reclaim);
+}
+
+/* Clear PG_readahead only if it's PG_readahead, not PG_reclaim */
+static inline int TestClearPageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+	return !PageWriteback(page) ||
+			test_and_clear_bit(PG_reclaim, &page->flags);
+}
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..85b15e5a1d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -553,12 +553,6 @@ page_cache_async_readahead(struct address_space *mapping,
 	if (!ra->ra_pages)
 		return;
 
-	/*
-	 * Same bit is used for PG_readahead and PG_reclaim.
-	 */
-	if (PageWriteback(page))
-		return;
-
 	ClearPageReadahead(page);
 
 	/*
Andrew Morton Feb. 12, 2020, 10:24 p.m. UTC | #16
On Wed, 12 Feb 2020 11:53:22 -0800 Minchan Kim <minchan@kernel.org> wrote:

> > That's definitely wrong.  It'll clear PageReclaim and then pretend it did
> > nothing wrong.
> > 
> > 	return !PageWriteback(page) ||
> > 		test_and_clear_bit(PG_reclaim, &page->flags);
> > 
> 
> Much better, Thanks for the review, Matthew!
> If there is no objection, I will send two patches to Andrew.
> One is PageReadahead strict, the other is limit retry from mm_populate.

With much more detailed changelogs, please!

This all seems rather screwy.  If a page is under writeback then it is
uptodate and we should be able to fault it in immediately.
Minchan Kim Feb. 12, 2020, 11:12 p.m. UTC | #17
On Wed, Feb 12, 2020 at 02:24:35PM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2020 11:53:22 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > > That's definitely wrong.  It'll clear PageReclaim and then pretend it did
> > > nothing wrong.
> > > 
> > > 	return !PageWriteback(page) ||
> > > 		test_and_clear_bit(PG_reclaim, &page->flags);
> > > 
> > 
> > Much better, Thanks for the review, Matthew!
> > If there is no objection, I will send two patches to Andrew.
> > One is PageReadahead strict, the other is limit retry from mm_populate.
> 
> With much more detailed changelogs, please!
> 
> This all seems rather screwy.  if a page is under writeback then it is
> uptodate and we should be able to fault it in immediately.

Hi Andrew,

Will this description in the cover letter work? If so, I will add each part
below to its respective patch.

Subject: [PATCH 0/3] fixing mm_populate long stall

I have gotten several reports that a major page fault sometimes takes several
seconds. While reviewing the mmap_sem-dropping logic in the page fault handler,
I found several bugs.

   CPU 1							CPU 2
mm_populate
 for ()
   ..
   ret = populate_vma_page_range
     __get_user_pages
       faultin_page
         handle_mm_fault
	   filemap_fault
	     do_async_mmap_readahead
	     						shrink_page_list
							  pageout
							    SetPageReclaim(=SetPageReadahead)
							      writepage
							        SetPageWriteback
	       if (PageReadahead(page))
	         maybe_unlock_mmap_for_io
		   up_read(mmap_sem)
		 page_cache_async_readahead()
		   if (PageWriteback(page))
		     return;

    here, since ret from populate_vma_page_range is zero,
    the loop continues to run with the same address as the previous
    iteration. It will repeat the loop until the page's
    writeout is done (i.e., PG_writeback or PG_reclaim is clear).

We could fix the above specific case by adding a PageWriteback check. IOW,

   ret = populate_vma_page_range
   	   ...
	   ...
	   filemap_fault
	     do_async_mmap_readahead
	       if (!PageWriteback(page) && PageReadahead(page))
	         maybe_unlock_mmap_for_io
		   up_read(mmap_sem)
		 page_cache_async_readahead()
		   if (PageWriteback(page))
		     return;

That is what [3/3] fixes here. Even though it fixes the
problem effectively, a livelock is still theoretically possible,
because the page at the faulting address could be reclaimed and then
reallocated and become a readahead marker again on other CPUs while the
faulting process is retrying in mm_populate's loop. [2/3] fixes
such a livelock by limiting the retry count.
There is another hole that can livelock or hang the process, besides
PageWriteback: ra_pages.

mm_populate
 for ()
   ..
   ret = populate_vma_page_range
     __get_user_pages
       faultin_page
         handle_mm_fault
	   filemap_fault
	     do_async_mmap_readahead
	       if (PageReadahead(page))
	         maybe_unlock_mmap_for_io
		   up_read(mmap_sem)
		 page_cache_async_readahead()
		   if (!ra->ra_pages)
		     return;

It will repeat the loop until ra->ra_pages becomes non-zero.
[1/3] fixes that problem.

Jan Kara (1):
  mm: Don't bother dropping mmap_sem for zero size readahead

Minchan Kim (2):
  mm: fix long time stall from mm_populate
  mm: make PageReadahead more strict

 include/linux/page-flags.h | 28 ++++++++++++++++++++++++++--
 mm/filemap.c               |  2 +-
 mm/gup.c                   |  9 +++++++--
 mm/readahead.c             |  6 ------
 4 files changed, 34 insertions(+), 11 deletions(-)
Andrew Morton Feb. 13, 2020, 2 a.m. UTC | #18
On Wed, 12 Feb 2020 15:12:10 -0800 Minchan Kim <minchan@kernel.org> wrote:

> On Wed, Feb 12, 2020 at 02:24:35PM -0800, Andrew Morton wrote:
> > On Wed, 12 Feb 2020 11:53:22 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > > That's definitely wrong.  It'll clear PageReclaim and then pretend it did
> > > > nothing wrong.
> > > > 
> > > > 	return !PageWriteback(page) ||
> > > > 		test_and_clear_bit(PG_reclaim, &page->flags);
> > > > 
> > > 
> > > Much better, Thanks for the review, Matthew!
> > > If there is no objection, I will send two patches to Andrew.
> > > One is PageReadahead strict, the other is limit retry from mm_populate.
> > 
> > With much more detailed changelogs, please!
> > 
> > This all seems rather screwy.  if a page is under writeback then it is
> > uptodate and we should be able to fault it in immediately.
> 
> Hi Andrew,
> 
> This description in cover-letter will work? If so, I will add each part
> below in each patch.
> 
> Subject: [PATCH 0/3] fixing mm_populate long stall
> 
> I got several reports major page fault takes several seconds sometime.
> When I review drop mmap_sem in page fault hanlder, I found several bugs.
> 
>    CPU 1							CPU 2
> mm_populate
>  for ()
>    ..
>    ret = populate_vma_page_range
>      __get_user_pages
>        faultin_page
>          handle_mm_fault
> 	   filemap_fault
> 	     do_async_mmap_readahead
> 	     						shrink_page_list
> 							  pageout
> 							    SetPageReclaim(=SetPageReadahead)
> 							      writepage
> 							        SetPageWriteback
> 	       if (PageReadahead(page))
> 	         maybe_unlock_mmap_for_io
> 		   up_read(mmap_sem)
> 		 page_cache_async_readahead()
> 		   if (PageWriteback(page))
> 		     return;
> 
>     here, since ret from populate_vma_page_range is zero,
>     the loop continue to run with same address with previous
>     iteration. It will repeat the loop until the page's
>     writeout is done(ie, PG_writeback or PG_reclaim clear).

The populate_vma_page_range() kerneldoc is wrong.  "return 0 on
success, negative error code on error".  Care to fix that please?

> We could fix the above specific case via adding PageWriteback. IOW,
> 
>    ret = populate_vma_page_range
>    	   ...
> 	   ...
> 	   filemap_fault
> 	     do_async_mmap_readahead
> 	       if (!PageWriteback(page) && PageReadahead(page))
> 	         maybe_unlock_mmap_for_io
> 		   up_read(mmap_sem)
> 		 page_cache_async_readahead()
> 		   if (PageWriteback(page))
> 		     return;

Well yes, but the testing of PageWriteback() is a hack added in
fe3cba17c49471 to permit the sharing of PG_reclaim and PG_readahead. 
If we didn't need that hack then we could avoid adding new hacks to
hack around the old hack :(.  Have you considered anything along those
lines?  Rework how we handle PG_reclaim/PG_readahead?

> That's a thing [3/3] is fixing here. Even though it could fix the
> problem effectively, it has still livelock problem theoretically
> because the page of faulty address could be reclaimed and then
> allocated/become readahead marker on other CPUs during faulty
> process is retrying in mm_populate's loop.

Really?  filemap_fault()'s

	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
		goto out_retry;

	/* Did it get truncated? */
	if (unlikely(compound_head(page)->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto retry_find;
	}

should handle such cases?

> [2/3] is fixing the
> such livelock via limiting retry count.

I wouldn't call that "fixing" :(

> There is another hole for the livelock or hang of the process as well
> as ageWriteback - ra_pages.
> 
> mm_populate
>  for ()
>    ..
>    ret = populate_vma_page_range
>      __get_user_pages
>        faultin_page
>          handle_mm_fault
> 	   filemap_fault
> 	     do_async_mmap_readahead
> 	       if (PageReadahead(page))
> 	         maybe_unlock_mmap_for_io
> 		   up_read(mmap_sem)
> 		 page_cache_async_readahead()
> 		   if (!ra->ra_pages)
> 		     return;
> 
> It will repeat the loop until ra->ra_pages become non-zero.
> [1/3] is fixing the problem.
>
Minchan Kim Feb. 13, 2020, 5:24 p.m. UTC | #19
Hi Andrew,

On Wed, Feb 12, 2020 at 06:00:30PM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2020 15:12:10 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > On Wed, Feb 12, 2020 at 02:24:35PM -0800, Andrew Morton wrote:
> > > On Wed, 12 Feb 2020 11:53:22 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > > > That's definitely wrong.  It'll clear PageReclaim and then pretend it did
> > > > > nothing wrong.
> > > > > 
> > > > > 	return !PageWriteback(page) ||
> > > > > 		test_and_clear_bit(PG_reclaim, &page->flags);
> > > > > 
> > > > 
> > > > Much better, Thanks for the review, Matthew!
> > > > If there is no objection, I will send two patches to Andrew.
> > > > One is PageReadahead strict, the other is limit retry from mm_populate.
> > > 
> > > With much more detailed changelogs, please!
> > > 
> > > This all seems rather screwy.  if a page is under writeback then it is
> > > uptodate and we should be able to fault it in immediately.
> > 
> > Hi Andrew,
> > 
> > This description in cover-letter will work? If so, I will add each part
> > below in each patch.
> > 
> > Subject: [PATCH 0/3] fixing mm_populate long stall
> > 
> > I got several reports major page fault takes several seconds sometime.
> > When I review drop mmap_sem in page fault hanlder, I found several bugs.
> > 
> >    CPU 1							CPU 2
> > mm_populate
> >  for ()
> >    ..
> >    ret = populate_vma_page_range
> >      __get_user_pages
> >        faultin_page
> >          handle_mm_fault
> > 	   filemap_fault
> > 	     do_async_mmap_readahead
> > 	     						shrink_page_list
> > 							  pageout
> > 							    SetPageReclaim(=SetPageReadahead)
> > 							      writepage
> > 							        SetPageWriteback
> > 	       if (PageReadahead(page))
> > 	         maybe_unlock_mmap_for_io
> > 		   up_read(mmap_sem)
> > 		 page_cache_async_readahead()
> > 		   if (PageWriteback(page))
> > 		     return;
> > 
> >     here, since ret from populate_vma_page_range is zero,
> >     the loop continue to run with same address with previous
> >     iteration. It will repeat the loop until the page's
> >     writeout is done(ie, PG_writeback or PG_reclaim clear).
> 
> The populate_vma_page_range() kerneldoc is wrong.  "return 0 on
> success, negative error code on error".  Care to fix that please?

Sure.

> 
> > We could fix the above specific case via adding PageWriteback. IOW,
> > 
> >    ret = populate_vma_page_range
> >    	   ...
> > 	   ...
> > 	   filemap_fault
> > 	     do_async_mmap_readahead
> > 	       if (!PageWriteback(page) && PageReadahead(page))
> > 	         maybe_unlock_mmap_for_io
> > 		   up_read(mmap_sem)
> > 		 page_cache_async_readahead()
> > 		   if (PageWriteback(page))
> > 		     return;
> 
> Well yes, but the testing of PageWriteback() is a hack added in
> fe3cba17c49471 to permit the sharing of PG_reclaim and PG_readahead. 
> If we didn't need that hack then we could avoid adding new hacks to
> hack around the old hack :(.  Have you considered anything along those
> lines?  Rework how we handle PG_reclaim/PG_readahead?

https://lore.kernel.org/linux-mm/20200211175731.GA185752@google.com/
"
My point is: why couldn't we do readahead if the marker page is under PG_writeback?
It was there for a long time and you were adding one more, so I was curious where
the reasoning comes from. Let me trace why the PageWriteback check has been in
page_cache_async_readahead from the beginning.

	fe3cba17c4947, mm: share PG_readahead and PG_reclaim

The reason comes from the description

    b) clear PG_readahead => implicit clear of PG_reclaim
            one(and only one) page will not be reclaimed in time
            it can be avoided by checking PageWriteback(page) in readahead first

The goal was to avoid delaying the freeing of the page by clearing PG_reclaim.
I'm saying that I feel it's over-optimization. IOW, it would be okay to
lose accelerated reclaim for a single page.
"

I wanted to remove the PageWriteback check in page_cache_async_readahead,
but I didn't hear feedback, and reviewers wanted to add a PageWriteback check
along with PageReadahead. That's why [2/3] was born.

> 
> > That's a thing [3/3] is fixing here. Even though it could fix the
> > problem effectively, it has still livelock problem theoretically
> > because the page of faulty address could be reclaimed and then
> > allocated/become readahead marker on other CPUs during faulty
> > process is retrying in mm_populate's loop.
> 
> Really?  filemap_fault()'s
> 
> 	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
> 		goto out_retry;
> 
> 	/* Did it get truncated? */
> 	if (unlikely(compound_head(page)->mapping != mapping)) {
> 		unlock_page(page);
> 		put_page(page);
> 		goto retry_find;
> 	}
> 
> should handle such cases?

I don't think so, because once we release mmap_sem we start fault
handling from the beginning again, and the page we find in the new
iteration is a newly allocated valid page, but in the same situation
(e.g., PG_readahead), which could reproduce the same condition as the
previous iteration.
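The shape of that livelock, and of the bounded-retry fix in [2/3], can be
sketched in user space (names and the retry bound are hypothetical; the
real code retries inside __mm_populate's loop):

```c
#include <assert.h>

enum fault_ret { FAULT_DONE, FAULT_RETRY };

/* Hypothetical fault handler: every lookup finds a page that again
 * carries the readahead marker, so it always asks for a retry. */
static enum fault_ret fake_fault(void)
{
	return FAULT_RETRY;
}

/* Bounded variant of the mm_populate-style loop: give up after
 * max_tries retries instead of spinning forever. Returns the number
 * of tries used on success, or -1 if the bound was hit. */
static int populate_bounded(int max_tries)
{
	for (int i = 0; i < max_tries; i++)
		if (fake_fault() == FAULT_DONE)
			return i + 1;
	return -1;	/* would have livelocked without the bound */
}
```

An unbounded loop over the same fake_fault() would never terminate; the
bound converts the livelock into a bounded number of wasted iterations.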

> 
> > [2/3] is fixing such a livelock by limiting the retry count.
> 
> I wouldn't call that "fixing" :(

Unless I'm wrong, it is fixing it. :)

Furthermore, please consider ra_pages dancing via a parallel fadvise
attack, which could suddenly make ra_pages zero through the race.

> 
> > There is another hole for the livelock or hang of the process, as
> > well as PageWriteback: ra_pages.
> > 
> > mm_populate
> >  for ()
> >    ..
> >    ret = populate_vma_page_range
> >      __get_user_pages
> >        faultin_page
> >          handle_mm_fault
> > 	   filemap_fault
> > 	     do_async_mmap_readahead
> > 	       if (PageReadahead(page))
> > 	         maybe_unlock_mmap_for_io
> > 		   up_read(mmap_sem)
> > 		 page_cache_async_readahead()
> > 		   if (!ra->ra_pages)
> > 		     return;
> > 
> > It will repeat the loop until ra->ra_pages becomes non-zero.
> > [1/3] is fixing the problem.
> > 
>

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..b3f825092abf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1196,6 +1196,7 @@  int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
 	long ret = 0;
+	bool tried = false;
 
 	end = start + len;
 
@@ -1226,14 +1227,18 @@  int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		 * double checks the vma flags, so that it won't mlock pages
 		 * if the vma was already munlocked.
 		 */
-		ret = populate_vma_page_range(vma, nstart, nend, &locked);
+		ret = populate_vma_page_range(vma, nstart, nend,
+						tried ? NULL : &locked);
 		if (ret < 0) {
 			if (ignore_errors) {
 				ret = 0;
 				continue;	/* continue at next VMA */
 			}
 			break;
-		}
+		} else if (ret == 0)
+			tried = true;
+		else
+			tried = false;
 		nend = nstart + ret * PAGE_SIZE;
 		ret = 0;
 	}
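The effect of the `tried` flag above can be modeled in user space (a toy
sketch; function names and return values are hypothetical, not the real
populate_vma_page_range contract): after one zero-progress pass, the next
call gets a NULL `locked` pointer, which disables the drop-mmap_sem retry
path and forces forward progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of populate_vma_page_range(): with a non-NULL `locked`
 * the callee may drop the lock and report zero pages (retry); with
 * NULL, retry is disabled and it populates synchronously. */
static long toy_populate_range(int *locked)
{
	if (locked) {
		*locked = 0;	/* pretend we dropped mmap_sem for readahead */
		return 0;	/* zero pages populated -> caller may retry */
	}
	return 1;		/* retry disabled: forward progress */
}

/* The patch's loop shape: a second attempt passes NULL so the fault
 * path cannot bounce forever on a sticky readahead marker. */
static long toy_mm_populate(void)
{
	bool tried = false;
	int locked = 1;
	long total = 0;

	for (int i = 0; i < 16 && total == 0; i++) {
		long ret = toy_populate_range(tried ? NULL : &locked);
		if (ret == 0) {
			tried = true;	/* no progress: disable retry next time */
		} else {
			tried = false;
			total += ret;
		}
	}
	return total;
}
```

In this model the first iteration makes no progress and sets `tried`,
and the second iteration completes, so the loop exits after two passes
instead of spinning.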