diff mbox series

mm/thp: flush file for !is_shmem PageDirty() case in collapse_file()

Message ID 20191030200736.3455046-1-songliubraving@fb.com (mailing list archive)
State New, archived
Headers show
Series mm/thp: flush file for !is_shmem PageDirty() case in collapse_file() | expand

Commit Message

Song Liu Oct. 30, 2019, 8:07 p.m. UTC
For non-shmem file THPs, khugepaged only collapses read only .text mapping
(VM_DENYWRITE). These pages should not be dirty except the case where the
file hasn't been flushed since first write.

Call filemap_flush() in collapse_file() to accelerate the write back in
such cases.

Also add warning if PageDirty() triggered for pages from readahead path.

Reported-and-tested-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 mm/khugepaged.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Andrew Morton Oct. 31, 2019, 1:51 a.m. UTC | #1
On Wed, 30 Oct 2019 13:07:36 -0700 Song Liu <songliubraving@fb.com> wrote:

> For non-shmem file THPs, khugepaged only collapses read only .text mapping
> (VM_DENYWRITE). These pages should not be dirty except the case where the
> file hasn't been flushed since first write.
> 
> Call filemap_flush() in collapse_file() to accelerate the write back in
> such cases.
> 
> Also add warning if PageDirty() triggered for pages from readahead path.
> 
> Reported-and-tested-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com

If sysbot reported something then we definitely want to be able to
review that report when reviewing the patch!  So please do include a
Link: for that sort of thing.

A bit of sleuthing leads me to
http://lkml.kernel.org/r/000000000000c50fd70595fdd5b2@google.com

which shows that this patch is in fact a fix against
mmthp-recheck-each-page-before-collapsing-file-thp.patch.  This very
important information wasn't in the changelog.

And this patch doesn't really fix the sysbot hang, does it?  The patch
restores the flush which was removed in order to fix the sysbot hang.

In general, please do try to include all this sort of information when
preparing changelogs.


> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1601,6 +1601,33 @@ static void collapse_file(struct mm_struct *mm,
>  					result = SCAN_FAIL;
>  					goto xa_unlocked;
>  				}
> +				if (WARN_ON_ONCE(PageDirty(page))) {

I'm not understanding what guarantees this.  Can't another process
which has the file open for writing come in and dirty the page after
the readahead has completed and before this process locks the page?

> +					/*
> +					 * page from readahead should not
> +					 * be dirty. Show warning if this
> +					 * somehow happens.
> +					 */
> +					result = SCAN_FAIL;
> +					goto out_unlock;
> +				}
> +			} else if (PageDirty(page)) {
> +				/*
> +				 * khugepaged only works on read-only fd,
> +				 * so this page is dirty because it hasn't
> +				 * been flushed since first write. There
> +				 * won't be new dirty pages.
> +				 *
> +				 * Trigger async flush here and hope the
> +				 * writeback is done when khugepaged
> +				 * revisits this page.
> +				 *
> +				 * This is a one-off situation. We are not
> +				 * forcing writeback in loop.
> +				 */
> +				xas_unlock_irq(&xas);
> +				filemap_flush(mapping);
> +				result = SCAN_FAIL;
> +				goto xa_unlocked;
>  			} else if (trylock_page(page)) {
>  				get_page(page);
>  				xas_unlock_irq(&xas);

The patch mmthp-recheck-each-page-before-collapsing-file-thp.patch has
undergone quite a bit of churn so I don't think it should be mainlined
without more testing and review.  But it fixes a significant issue.  So
could the appropriate developers please take some time to recheck and
retest it all?

To that end, here's the combination of
mmthp-recheck-each-page-before-collapsing-file-thp.patch and this
patch:

From: Song Liu <songliubraving@fb.com>
Subject: mm,thp: recheck each page before collapsing file THP

In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date.  Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.  

With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data.  This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date.  This will NOT corrupt data in the disk,
because the page is read-only and never written back.  Fix this by
properly checking PageUptodate() after locking the page.  This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".  

Also, move PageDirty() check after locking the page.  Current
khugepaged should not try to collapse dirty file THP, because it is
limited to read-only .text.  Add a warning with the PageDirty() check
as it should not happen.  This warning is added after page_mapping()
check, because if the page is truncated, it might be dirty.

[songliubraving@fb.com: flush file for !is_shmem PageDirty() case in collapse_file()]
  Link: http://lkml.kernel.org/r/20191030200736.3455046-1-songliubraving@fb.com
[songliubraving@fb.com: v4]
  Link: http://lkml.kernel.org/r/20191022191006.411277-1-songliubraving@fb.com
[songliubraving@fb.com: fix deadlock in collapse_file()]
  Link: http://lkml.kernel.org/r/20191028221414.3685035-1-songliubraving@fb.com
Link: http://lkml.kernel.org/r/20191018180345.4188310-1-songliubraving@fb.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/khugepaged.c |   49 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

--- a/mm/khugepaged.c~mmthp-recheck-each-page-before-collapsing-file-thp
+++ a/mm/khugepaged.c
@@ -1601,17 +1601,33 @@ static void collapse_file(struct mm_stru
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
-			} else if (!PageUptodate(page)) {
-				xas_unlock_irq(&xas);
-				wait_on_page_locked(page);
-				if (!trylock_page(page)) {
-					result = SCAN_PAGE_LOCK;
-					goto xa_unlocked;
+				if (WARN_ON_ONCE(PageDirty(page))) {
+					/*
+					 * page from readahead should not
+					 * be dirty. Show warning if this
+					 * somehow happens.
+					 */
+					result = SCAN_FAIL;
+					goto out_unlock;
 				}
-				get_page(page);
 			} else if (PageDirty(page)) {
+				/*
+				 * khugepaged only works on read-only fd,
+				 * so this page is dirty because it hasn't
+				 * been flushed since first write. There
+				 * won't be new dirty pages.
+				 *
+				 * Trigger async flush here and hope the
+				 * writeback is done when khugepaged
+				 * revisits this page.
+				 *
+				 * This is a one-off situation. We are not
+				 * forcing writeback in loop.
+				 */
+				xas_unlock_irq(&xas);
+				filemap_flush(mapping);
 				result = SCAN_FAIL;
-				goto xa_locked;
+				goto xa_unlocked;
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);
@@ -1626,7 +1642,12 @@ static void collapse_file(struct mm_stru
 		 * without racing with truncate.
 		 */
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		VM_BUG_ON_PAGE(!PageUptodate(page), page);
+
+		/* double check the page is up to date */
+		if (unlikely(!PageUptodate(page))) {
+			result = SCAN_FAIL;
+			goto out_unlock;
+		}
 
 		/*
 		 * If file was truncated then extended, or hole-punched, before
@@ -1642,6 +1663,16 @@ static void collapse_file(struct mm_stru
 			goto out_unlock;
 		}
 
+		if (!is_shmem && PageDirty(page)) {
+			/*
+			 * khugepaged only works on read-only fd, so this
+			 * page is dirty because it hasn't been flushed
+			 * since first write.
+			 */
+			result = SCAN_FAIL;
+			goto out_unlock;
+		}
+
 		if (isolate_lru_page(page)) {
 			result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
Song Liu Oct. 31, 2019, 4:55 a.m. UTC | #2
On Oct 30, 2019, at 6:51 PM, Andrew Morton <akpm@linux-foundation.org<mailto:akpm@linux-foundation.org>> wrote:

On Wed, 30 Oct 2019 13:07:36 -0700 Song Liu <songliubraving@fb.com<mailto:songliubraving@fb.com>> wrote:

For non-shmem file THPs, khugepaged only collapses read only .text mapping
(VM_DENYWRITE). These pages should not be dirty except the case where the
file hasn't been flushed since first write.

Call filemap_flush() in collapse_file() to accelerate the write back in
such cases.

Also add warning if PageDirty() triggered for pages from readahead path.

Reported-and-tested-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com<mailto:syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com>

If sysbot reported something then we definitely want to be able to
review that report when reviewing the patch!  So please do include a
Link: for that sort of thing.

A bit of sleuthing leads me to
http://lkml.kernel.org/r/000000000000c50fd70595fdd5b2@google.com

which shows that this patch is in fact a fix against
mmthp-recheck-each-page-before-collapsing-file-thp.patch.  This very
important information wasn't in the changelog.

And this patch doesn't really fix the sysbot hang, does it?  The patch
restores the flush which was removed in order to fix the sysbot hang.

In general, please do try to include all this sort of information when
preparing changelogs.

Sorry for the confusion. This one is a bit tricky.

syzbot reported hang issue with filemap_flush(), which is already fixed
by previous patch which removes filemap_flush().

This patch tries to add filemap_flush() back, in a proper way. So this
patch is an improvement, not a fix. That's why I didn't include the
fixes tag.

However, I used syzbot to test this patch, by replying "#syz test: ...".
to the previous report (the one you found). I didn't cc the mail list
for those tests. syzbot did find bug with an earlier version, and gave
green light for this version. This is the reason I would like to give
syzbot credit with the tag. Maybe I should just make it "Tested-by:
syzbot"?



--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1601,6 +1601,33 @@ static void collapse_file(struct mm_struct *mm,
result = SCAN_FAIL;
goto xa_unlocked;
}
+ if (WARN_ON_ONCE(PageDirty(page))) {

I'm not understanding what guarantees this.  Can't another process
which has the file open for writing come in and dirty the page after
the readahead has completed and before this process locks the page?

For non-shmem file, khugepaged only work for VM_DENYWRITE mappings.
Therefore, the page cannot be opened for write by another process.


+ /*
+  * page from readahead should not
+  * be dirty. Show warning if this
+  * somehow happens.
+  */
+ result = SCAN_FAIL;
+ goto out_unlock;
+ }
+ } else if (PageDirty(page)) {
+ /*
+  * khugepaged only works on read-only fd,
+  * so this page is dirty because it hasn't
+  * been flushed since first write. There
+  * won't be new dirty pages.
+  *
+  * Trigger async flush here and hope the
+  * writeback is done when khugepaged
+  * revisits this page.
+  *
+  * This is a one-off situation. We are not
+  * forcing writeback in loop.
+  */
+ xas_unlock_irq(&xas);
+ filemap_flush(mapping);
+ result = SCAN_FAIL;
+ goto xa_unlocked;
} else if (trylock_page(page)) {
get_page(page);
xas_unlock_irq(&xas);

The patch mmthp-recheck-each-page-before-collapsing-file-thp.patch has
undergone quite a bit of churn so I don't think it should be mainlined
without more testing and review.  But it fixes a significant issue.  So
could the appropriate developers please take some time to recheck and
retest it all?

We are running production tests with previous version (no filemap_flush).
syzbot also gives green light to that one. Maybe we should ship the
version without filemap_flush with 5.4, and this improvement with 5.5?

This improvement also passed syzbot's test. If this looks good, we will
also add this to production tests.

Thanks,
Song
Song Liu Oct. 31, 2019, 6:10 p.m. UTC | #3
Resending the mail, as I sent previous one in "Rich Text". 

> On Oct 30, 2019, at 6:51 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Wed, 30 Oct 2019 13:07:36 -0700 Song Liu <songliubraving@fb.com> wrote:
> 
>> For non-shmem file THPs, khugepaged only collapses read only .text mapping
>> (VM_DENYWRITE). These pages should not be dirty except the case where the
>> file hasn't been flushed since first write.
>> 
>> Call filemap_flush() in collapse_file() to accelerate the write back in
>> such cases.
>> 
>> Also add warning if PageDirty() triggered for pages from readahead path.
>> 
>> Reported-and-tested-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
> 
> If sysbot reported something then we definitely want to be able to
> review that report when reviewing the patch!  So please do include a
> Link: for that sort of thing.
> 
> A bit of sleuthing leads me to
> http://lkml.kernel.org/r/000000000000c50fd70595fdd5b2@google.com
> 
> which shows that this patch is in fact a fix against
> mmthp-recheck-each-page-before-collapsing-file-thp.patch.  This very
> important information wasn't in the changelog.
> 
> And this patch doesn't really fix the sysbot hang, does it?  The patch
> restores the flush which was removed in order to fix the sysbot hang.
> 
> In general, please do try to include all this sort of information when
> preparing changelogs.

Sorry for the confusion. This one is a bit tricky.

syzbot reported hang issue with filemap_flush(), which is already fixed
by previous patch which removes filemap_flush(). 

This patch tries to add filemap_flush() back, in a proper way. So this 
patch is an improvement, not a fix. That's why I didn't include the 
fixes tag. 

However, I used syzbot to test this patch, by replying "#syz test: ...". 
to the previous report (the one you found). I didn't cc the mail list
for those tests. syzbot did find bug with an earlier version, and gave 
green light for this version. This is the reason I would like to give 
syzbot credit with the tag. Maybe I should just make it "Tested-by: 
syzbot"? 

> 
> 
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1601,6 +1601,33 @@ static void collapse_file(struct mm_struct *mm,
>> 					result = SCAN_FAIL;
>> 					goto xa_unlocked;
>> 				}
>> +				if (WARN_ON_ONCE(PageDirty(page))) {
> 
> I'm not understanding what guarantees this.  Can't another process
> which has the file open for writing come in and dirty the page after
> the readahead has completed and before this process locks the page?

For non-shmem file, khugepaged only work for VM_DENYWRITE mappings. 
Therefore, the page cannot be opened for write by another process.

> 
>> +					/*
>> +					 * page from readahead should not
>> +					 * be dirty. Show warning if this
>> +					 * somehow happens.
>> +					 */
>> +					result = SCAN_FAIL;
>> +					goto out_unlock;
>> +				}
>> +			} else if (PageDirty(page)) {
>> +				/*
>> +				 * khugepaged only works on read-only fd,
>> +				 * so this page is dirty because it hasn't
>> +				 * been flushed since first write. There
>> +				 * won't be new dirty pages.
>> +				 *
>> +				 * Trigger async flush here and hope the
>> +				 * writeback is done when khugepaged
>> +				 * revisits this page.
>> +				 *
>> +				 * This is a one-off situation. We are not
>> +				 * forcing writeback in loop.
>> +				 */
>> +				xas_unlock_irq(&xas);
>> +				filemap_flush(mapping);
>> +				result = SCAN_FAIL;
>> +				goto xa_unlocked;
>> 			} else if (trylock_page(page)) {
>> 				get_page(page);
>> 				xas_unlock_irq(&xas);
> 
> The patch mmthp-recheck-each-page-before-collapsing-file-thp.patch has
> undergone quite a bit of churn so I don't think it should be mainlined
> without more testing and review.  But it fixes a significant issue.  So
> could the appropriate developers please take some time to recheck and
> retest it all?

We are running production tests with previous version (no filemap_flush). 
syzbot also gives green light to that one. Maybe we should ship the 
version without filemap_flush with 5.4, and this improvement with 5.5?

This improvement also passed syzbot's test. If this looks good, we will 
also add this to production tests. 

Thanks,
Song
Johannes Weiner Nov. 1, 2019, 6:31 p.m. UTC | #4
On Wed, Oct 30, 2019 at 06:51:15PM -0700, Andrew Morton wrote:
> To that end, here's the combination of
> mmthp-recheck-each-page-before-collapsing-file-thp.patch and this
> patch:
> 
> From: Song Liu <songliubraving@fb.com>
> Subject: mm,thp: recheck each page before collapsing file THP
> 
> In collapse_file(), for !is_shmem case, current check cannot guarantee
> the locked page is up-to-date.  Specifically, xas_unlock_irq() should
> not be called before lock_page() and get_page(); and it is necessary to
> recheck PageUptodate() after locking the page.  
> 
> With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
> may contain corrupted data.  This is because khugepaged mistakenly
> collapses some not up-to-date sub pages into a huge page, and assumes
> the huge page is up-to-date.  This will NOT corrupt data in the disk,
> because the page is read-only and never written back.  Fix this by
> properly checking PageUptodate() after locking the page.  This check
> replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".  
> 
> Also, move PageDirty() check after locking the page.  Current
> khugepaged should not try to collapse dirty file THP, because it is
> limited to read-only .text.  Add a warning with the PageDirty() check
> as it should not happen.  This warning is added after page_mapping()
> check, because if the page is truncated, it might be dirty.
> 
> [songliubraving@fb.com: flush file for !is_shmem PageDirty() case in collapse_file()]
>   Link: http://lkml.kernel.org/r/20191030200736.3455046-1-songliubraving@fb.com
> [songliubraving@fb.com: v4]
>   Link: http://lkml.kernel.org/r/20191022191006.411277-1-songliubraving@fb.com
> [songliubraving@fb.com: fix deadlock in collapse_file()]
>   Link: http://lkml.kernel.org/r/20191028221414.3685035-1-songliubraving@fb.com
> Link: http://lkml.kernel.org/r/20191018180345.4188310-1-songliubraving@fb.com
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: William Kucharski <william.kucharski@oracle.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/khugepaged.c |   49 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 40 insertions(+), 9 deletions(-)
> 
> --- a/mm/khugepaged.c~mmthp-recheck-each-page-before-collapsing-file-thp
> +++ a/mm/khugepaged.c
> @@ -1601,17 +1601,33 @@ static void collapse_file(struct mm_stru
>  					result = SCAN_FAIL;
>  					goto xa_unlocked;
>  				}
> -			} else if (!PageUptodate(page)) {
> -				xas_unlock_irq(&xas);
> -				wait_on_page_locked(page);
> -				if (!trylock_page(page)) {
> -					result = SCAN_PAGE_LOCK;
> -					goto xa_unlocked;
> +				if (WARN_ON_ONCE(PageDirty(page))) {
> +					/*
> +					 * page from readahead should not
> +					 * be dirty. Show warning if this
> +					 * somehow happens.
> +					 */
> +					result = SCAN_FAIL;
> +					goto out_unlock;

I'm not sure we need this check here. We have a dirty check on the
locked page below, and it's not clear why there is a suspicion of
readahead producing dirty pages...?

>  				}
> -				get_page(page);
>  			} else if (PageDirty(page)) {
> +				/*
> +				 * khugepaged only works on read-only fd,
> +				 * so this page is dirty because it hasn't
> +				 * been flushed since first write. There
> +				 * won't be new dirty pages.
> +				 *
> +				 * Trigger async flush here and hope the
> +				 * writeback is done when khugepaged
> +				 * revisits this page.
> +				 *
> +				 * This is a one-off situation. We are not
> +				 * forcing writeback in loop.
> +				 */
> +				xas_unlock_irq(&xas);
> +				filemap_flush(mapping);
>  				result = SCAN_FAIL;
> -				goto xa_locked;
> +				goto xa_unlocked;

We should do this to improve the file THP success rate. But we
shouldn't conflate it with the data corruption bug fixed in the hunks
below:

>  			} else if (trylock_page(page)) {
>  				get_page(page);
>  				xas_unlock_irq(&xas);
> @@ -1626,7 +1642,12 @@ static void collapse_file(struct mm_stru
>  		 * without racing with truncate.
>  		 */
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
> -		VM_BUG_ON_PAGE(!PageUptodate(page), page);
> +
> +		/* double check the page is up to date */
> +		if (unlikely(!PageUptodate(page))) {
> +			result = SCAN_FAIL;
> +			goto out_unlock;
> +		}

This...

> @@ -1642,6 +1663,16 @@ static void collapse_file(struct mm_stru
>  			goto out_unlock;
>  		}
>  
> +		if (!is_shmem && PageDirty(page)) {
> +			/*
> +			 * khugepaged only works on read-only fd, so this
> +			 * page is dirty because it hasn't been flushed
> +			 * since first write.
> +			 */
> +			result = SCAN_FAIL;
> +			goto out_unlock;
> +		}

...and this are pretty urgent chunks because they protect the
integrity of the page cache. They should go in asap.

So I suggest we do two patches instead:

1. These two locked page checks to fix the corruption in 5.4
2. The filemap_flush() bit to improve THP coverage in 5.5
Song Liu Nov. 1, 2019, 9:48 p.m. UTC | #5
Thanks Johannes!

> On Nov 1, 2019, at 11:31 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> On Wed, Oct 30, 2019 at 06:51:15PM -0700, Andrew Morton wrote:
>> To that end, here's the combination of
>> mmthp-recheck-each-page-before-collapsing-file-thp.patch and this
>> patch:
>> 
>> From: Song Liu <songliubraving@fb.com>
>> Subject: mm,thp: recheck each page before collapsing file THP
>> 
>> In collapse_file(), for !is_shmem case, current check cannot guarantee
>> the locked page is up-to-date.  Specifically, xas_unlock_irq() should
>> not be called before lock_page() and get_page(); and it is necessary to
>> recheck PageUptodate() after locking the page.  
>> 
>> With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
>> may contain corrupted data.  This is because khugepaged mistakenly
>> collapses some not up-to-date sub pages into a huge page, and assumes
>> the huge page is up-to-date.  This will NOT corrupt data in the disk,
>> because the page is read-only and never written back.  Fix this by
>> properly checking PageUptodate() after locking the page.  This check
>> replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".  
>> 
>> Also, move PageDirty() check after locking the page.  Current
>> khugepaged should not try to collapse dirty file THP, because it is
>> limited to read-only .text.  Add a warning with the PageDirty() check
>> as it should not happen.  This warning is added after page_mapping()
>> check, because if the page is truncated, it might be dirty.
>> 
>> [songliubraving@fb.com: flush file for !is_shmem PageDirty() case in collapse_file()]
>>  Link: http://lkml.kernel.org/r/20191030200736.3455046-1-songliubraving@fb.com
>> [songliubraving@fb.com: v4]
>>  Link: http://lkml.kernel.org/r/20191022191006.411277-1-songliubraving@fb.com
>> [songliubraving@fb.com: fix deadlock in collapse_file()]
>>  Link: http://lkml.kernel.org/r/20191028221414.3685035-1-songliubraving@fb.com
>> Link: http://lkml.kernel.org/r/20191018180345.4188310-1-songliubraving@fb.com
>> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: William Kucharski <william.kucharski@oracle.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> 
>> mm/khugepaged.c |   49 +++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 40 insertions(+), 9 deletions(-)
>> 
>> --- a/mm/khugepaged.c~mmthp-recheck-each-page-before-collapsing-file-thp
>> +++ a/mm/khugepaged.c
>> @@ -1601,17 +1601,33 @@ static void collapse_file(struct mm_stru
>> 					result = SCAN_FAIL;
>> 					goto xa_unlocked;
>> 				}
>> -			} else if (!PageUptodate(page)) {
>> -				xas_unlock_irq(&xas);
>> -				wait_on_page_locked(page);
>> -				if (!trylock_page(page)) {
>> -					result = SCAN_PAGE_LOCK;
>> -					goto xa_unlocked;
>> +				if (WARN_ON_ONCE(PageDirty(page))) {
>> +					/*
>> +					 * page from readahead should not
>> +					 * be dirty. Show warning if this
>> +					 * somehow happens.
>> +					 */
>> +					result = SCAN_FAIL;
>> +					goto out_unlock;
> 
> I'm not sure we need this check here. We have a dirty check on the
> locked page below, and it's not clear why there is a suspicion of
> readahead producing dirty pages...?

I was thinking whether we can remove the 

    if (!is_shmem && PageDirty(page)) {
	xxxx
    }

check after this. But I agree we can keep it. 

> 
>> 				}
>> -				get_page(page);
>> 			} else if (PageDirty(page)) {
>> +				/*
>> +				 * khugepaged only works on read-only fd,
>> +				 * so this page is dirty because it hasn't
>> +				 * been flushed since first write. There
>> +				 * won't be new dirty pages.
>> +				 *
>> +				 * Trigger async flush here and hope the
>> +				 * writeback is done when khugepaged
>> +				 * revisits this page.
>> +				 *
>> +				 * This is a one-off situation. We are not
>> +				 * forcing writeback in loop.
>> +				 */
>> +				xas_unlock_irq(&xas);
>> +				filemap_flush(mapping);
>> 				result = SCAN_FAIL;
>> -				goto xa_locked;
>> +				goto xa_unlocked;
> 
> We should do this to improve the file THP success rate. But we
> shouldn't conflate it with the data corruption bug fixed in the hunks
> below:
> 
>> 			} else if (trylock_page(page)) {
>> 				get_page(page);
>> 				xas_unlock_irq(&xas);
>> @@ -1626,7 +1642,12 @@ static void collapse_file(struct mm_stru
>> 		 * without racing with truncate.
>> 		 */
>> 		VM_BUG_ON_PAGE(!PageLocked(page), page);
>> -		VM_BUG_ON_PAGE(!PageUptodate(page), page);
>> +
>> +		/* double check the page is up to date */
>> +		if (unlikely(!PageUptodate(page))) {
>> +			result = SCAN_FAIL;
>> +			goto out_unlock;
>> +		}
> 
> This...
> 
>> @@ -1642,6 +1663,16 @@ static void collapse_file(struct mm_stru
>> 			goto out_unlock;
>> 		}
>> 
>> +		if (!is_shmem && PageDirty(page)) {
>> +			/*
>> +			 * khugepaged only works on read-only fd, so this
>> +			 * page is dirty because it hasn't been flushed
>> +			 * since first write.
>> +			 */
>> +			result = SCAN_FAIL;
>> +			goto out_unlock;
>> +		}
> 
> ...and this are pretty urgent chunks because they protect the
> integrity of the page cache. They should go in asap.
> 
> So I suggest we do two patches instead:

Here are the two patches:

> 
> 1. These two locked page checks to fix the corruption in 5.4

From 6836c7c91396c7b661b2447609ec236badba62b0 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Wed, 23 Oct 2019 11:24:28 +1100
Subject: [PATCH 1/2] mm,thp: recheck each page before collapsing file THP

In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date.  Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.

With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data.  This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date.  This will NOT corrupt data in the disk,
because the page is read-only and never written back.  Fix this by
properly checking PageUptodate() after locking the page.  This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".

Also, move PageDirty() check after locking the page.  Current
khugepaged should not try to collapse dirty file THP, because it is
limited to read-only .text. The only case we hit a dirty page here is
when the page hasn't been written since write. Bail out and retry when
this happens.

syzbot reported bug on previous version of this patch.

[songliubraving@fb.com: v4]
 Link: http://lkml.kernel.org/r/20191022191006.411277-1-songliubraving@fb.com
[songliubraving@fb.com: fix deadlock in collapse_file()]
 Link: http://lkml.kernel.org/r/20191028221414.3685035-1-songliubraving@fb.com
Link: http://lkml.kernel.org/r/20191018180345.4188310-1-songliubraving@fb.com
[songliubraving@fb.com: flush file for !is_shmem PageDirty() case in collapse_file()]
https://lkml.kernel.org/linux-mm/20191030200736.3455046-1-songliubraving@fb.com/
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 mm/khugepaged.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0a1b4b484ac5..40215795d641 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1601,17 +1601,6 @@ static void collapse_file(struct mm_struct *mm,
                                        result = SCAN_FAIL;
                                        goto xa_unlocked;
                                }
-                       } else if (!PageUptodate(page)) {
-                               xas_unlock_irq(&xas);
-                               wait_on_page_locked(page);
-                               if (!trylock_page(page)) {
-                                       result = SCAN_PAGE_LOCK;
-                                       goto xa_unlocked;
-                               }
-                               get_page(page);
-                       } else if (PageDirty(page)) {
-                               result = SCAN_FAIL;
-                               goto xa_locked;
                        } else if (trylock_page(page)) {
                                get_page(page);
                                xas_unlock_irq(&xas);
@@ -1626,7 +1615,12 @@ static void collapse_file(struct mm_struct *mm,
                 * without racing with truncate.
                 */
                VM_BUG_ON_PAGE(!PageLocked(page), page);
-               VM_BUG_ON_PAGE(!PageUptodate(page), page);
+
+               /* make sure the page is up to date */
+               if (unlikely(!PageUptodate(page))) {
+                       result = SCAN_FAIL;
+                       goto out_unlock;
+               }

                /*
                 * If file was truncated then extended, or hole-punched, before
@@ -1642,6 +1636,16 @@ static void collapse_file(struct mm_struct *mm,
                        goto out_unlock;
                }

+               if (!is_shmem && PageDirty(page)) {
+                       /*
+                        * khugepaged only works on read-only fd, so this
+                        * page is dirty because it hasn't been flushed
+                        * since first write.
+                        */
+                       result = SCAN_FAIL;
+                       goto out_unlock;
+               }
+
                if (isolate_lru_page(page)) {
                        result = SCAN_DEL_PAGE_LRU;
                        goto out_unlock;
--
2.17.1



> 2. The filemap_flush() bit to improve THP coverage in 5.5


From 70092bc81f255314c1d3b1ebb7718ae2be569495 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Tue, 29 Oct 2019 10:08:48 -0700
Subject: [PATCH 2/2] mm/thp: flush file for !is_shmem PageDirty() case in
 collapse_file()

For non-shmem file THPs, khugepaged only collapses read only .text mapping
(VM_DENYWRITE). These pages should not be dirty except the case where the
file hasn't been flushed since first write.

Call filemap_flush() in collapse_file() to accelerate the write back in
such cases.

[songliubraving@fb.com: flush file for !is_shmem PageDirty() case in collapse_file()]
https://lkml.kernel.org/linux-mm/20191030200736.3455046-1-songliubraving@fb.com/

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 mm/khugepaged.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40215795d641..513cce897794 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1601,6 +1601,24 @@ static void collapse_file(struct mm_struct *mm,
                                        result = SCAN_FAIL;
                                        goto xa_unlocked;
                                }
+                       } else if (PageDirty(page)) {
+                               /*
+                                * khugepaged only works on read-only fd,
+                                * so this page is dirty because it hasn't
+                                * been flushed since first write. There
+                                * won't be new dirty pages.
+                                *
+                                * Trigger async flush here and hope the
+                                * writeback is done when khugepaged
+                                * revisits this page.
+                                *
+                                * This is a one-off situation. We are not
+                                * forcing writeback in loop.
+                                */
+                               xas_unlock_irq(&xas);
+                               filemap_flush(mapping);
+                               result = SCAN_FAIL;
+                               goto xa_unlocked;
                        } else if (trylock_page(page)) {
                                get_page(page);
                                xas_unlock_irq(&xas);
--
2.17.1
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3ec5333ae94d..d61368b8251e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1601,6 +1601,33 @@  static void collapse_file(struct mm_struct *mm,
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
+				if (WARN_ON_ONCE(PageDirty(page))) {
+					/*
+					 * page from readahead should not
+					 * be dirty. Show warning if this
+					 * somehow happens.
+					 */
+					result = SCAN_FAIL;
+					goto out_unlock;
+				}
+			} else if (PageDirty(page)) {
+				/*
+				 * khugepaged only works on read-only fd,
+				 * so this page is dirty because it hasn't
+				 * been flushed since first write. There
+				 * won't be new dirty pages.
+				 *
+				 * Trigger async flush here and hope the
+				 * writeback is done when khugepaged
+				 * revisits this page.
+				 *
+				 * This is a one-off situation. We are not
+				 * forcing writeback in loop.
+				 */
+				xas_unlock_irq(&xas);
+				filemap_flush(mapping);
+				result = SCAN_FAIL;
+				goto xa_unlocked;
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);