diff mbox series

[v2,1/2] mm, thp: check page mapping when truncating page cache

Message ID 20210922070645.47345-2-rongwei.wang@linux.alibaba.com (mailing list archive)
State New
Headers show
Series [v2,1/2] mm, thp: check page mapping when truncating page cache | expand

Commit Message

Rongwei Wang Sept. 22, 2021, 7:06 a.m. UTC
Transparent huge page has supported read-only non-shmem files. The file-
backed THP is collapsed by khugepaged and truncated when written (for
shared libraries).

However, there is race in two possible places.

1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back when writer truncates the page cache;

In both cases, subpage(s) of file THP can be revealed by find_get_entry
in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows.

[40326.247034] page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
[40326.247041] head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
[40326.247046] flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
[40326.247051] raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
[40326.247053] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[40326.247055] head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
[40326.247057] head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
[40326.247058] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[40326.247077] ------------[ cut here ]------------
[40326.247080] kernel BUG at mm/truncate.c:213!
[40326.280581] Internal error: Oops - BUG: 0 [#1] SMP
[40326.281077] Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
[40326.285130] CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
[40326.286202] Hardware name: ECS, BIOS 0.0.0 02/06/2015
[40326.286968] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[40326.287584] pc : truncate_inode_page+0x64/0x70
[40326.288040] lr : truncate_inode_page+0x64/0x70
[40326.288498] sp : ffff80001b60b900
[40326.288837] x29: ffff80001b60b900 x28: 00000000000007ff
[40326.289377] x27: ffff80001b60b9a0 x26: 0000000000000000
[40326.289943] x25: 000000000000000f x24: ffff80001b60b9a0
[40326.290485] x23: ffff80001b60ba18 x22: ffff0001e0999ea8
[40326.291027] x21: ffff0000c21db300 x20: ffffffffffffffff
[40326.291566] x19: fffffe001310ffc0 x18: 0000000000000020
[40326.292106] x17: 0000000000000000 x16: 0000000000000000
[40326.292655] x15: ffff0000c21db960 x14: 3030306666666620
[40326.293197] x13: 6666666666666666 x12: 3130303030303030
[40326.293746] x11: ffff8000117b69b8 x10: 00000000ffff8000
[40326.294313] x9 : ffff80001012690c x8 : 0000000000000000
[40326.294851] x7 : ffff8000114f69b8 x6 : 0000000000017ffd
[40326.295392] x5 : ffff0007fffbcbc8 x4 : ffff80001b60b5c0
[40326.295942] x3 : 0000000000000001 x2 : 0000000000000000
[40326.296497] x1 : 0000000000000000 x0 : 0000000000000000
[40326.297047] Call trace:
[40326.297304]  truncate_inode_page+0x64/0x70
[40326.297724]  truncate_inode_pages_range+0x550/0x7e4
[40326.298251]  truncate_pagecache+0x58/0x80
[40326.298662]  do_dentry_open+0x1e4/0x3c0
[40326.299052]  vfs_open+0x38/0x44
[40326.299377]  do_open+0x1f0/0x310
[40326.299709]  path_openat+0x114/0x1dc
[40326.300077]  do_filp_open+0x84/0x134
[40326.300444]  do_sys_openat2+0xbc/0x164
[40326.300825]  __arm64_sys_openat+0x74/0xc0
[40326.301236]  el0_svc_common.constprop.0+0x88/0x220
[40326.301723]  do_el0_svc+0x30/0xa0
[40326.302089]  el0_svc+0x20/0x30
[40326.302404]  el0_sync_handler+0x1a4/0x1b0
[40326.302814]  el0_sync+0x180/0x1c0
[40326.303157] Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
[40326.303775] ---[ end trace f70cdb42cb7c2d42 ]---
[40326.304244] Kernel panic - not syncing: Oops - BUG: Fatal exception

This checks the page mapping and retries when subpage of file THP is
found, in truncate_inode_pages_range.

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
---
 mm/filemap.c  |  7 ++++++-
 mm/truncate.c | 17 ++++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox (Oracle) Sept. 22, 2021, 11:37 a.m. UTC | #1
On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> Transparent huge page has supported read-only non-shmem files. The file-
> backed THP is collapsed by khugepaged and truncated when written (for
> shared libraries).
> 
> However, there is race in two possible places.
> 
> 1) multiple writers truncate the same page cache concurrently;
> 2) collapse_file rolls back when writer truncates the page cache;

As I've said before, the bug here is that somehow there is a writable fd
to a file with THPs.  That's what we need to track down and fix.

https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
Rongwei Wang Sept. 22, 2021, 5:04 p.m. UTC | #2
> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>> Transparent huge page has supported read-only non-shmem files. The file-
>> backed THP is collapsed by khugepaged and truncated when written (for
>> shared libraries).
>> 
>> However, there is race in two possible places.
>> 
>> 1) multiple writers truncate the same page cache concurrently;
>> 2) collapse_file rolls back when writer truncates the page cache;
> 
> As I've said before, the bug here is that somehow there is a writable fd
> to a file with THPs.  That's what we need to track down and fix.
Hi, Matthew
I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. So, the
Pagecache of DSO need to be cleaned when someone opened this DSO in a writeable way. The process of
Pagecache cleaned mainly refers to ’truncate_inode_pages_range’.

Of course, the above is consensus for us.

The ’somehow’ that your said is sure for us. Maybe I can try to describe roughly here:

In collapse_file <khugepaged>, when PTEs scan failed (e.g. SCAN_FAIL, SCAN_TRUNCATED..), the following
code will be run:

        } else {
                struct page *page;

                /* Something went wrong: roll back page cache changes */
                xas_lock_irq(&xas);
                mapping->nrpages -= nr_none;

                if (is_shmem)
                        shmem_uncharge(mapping->host, nr_none);

                xas_set(&xas, start);
                xas_for_each(&xas, page, end - 1) {
                        page = list_first_entry_or_null(&pagelist,
                                        struct page, lru);
                        if (!page || xas.xa_index < page->index) {
                                if (!nr_none)
                                        break;
                                nr_none--;
                                /* Put holes back where they were */
                                xas_store(&xas, NULL);
                                continue;
                        }

                        VM_BUG_ON_PAGE(page->index != xas.xa_index, page);

                        /* Unfreeze the page. */
                        list_del(&page->lru);
                        page_ref_unfreeze(page, 2);
  line1               xas_store(&xas, page);
                        xas_pause(&xas);
                        xas_unlock_irq(&xas);
                        unlock_page(page);
                        putback_lru_page(page);
                        xas_lock_irq(&xas);
                }
		….
		new_page->mapping = NULL;
	   }
line2   unlock_page(new_page);
---

The above code refers to roll back. When someone starts open a DSO in a writeable way, and the collapse_file
is scanning PTEs. The hugepage had been locked and was mapped in page cache before line1. And the hugepage
not be in pagecache and be unlocked at line2. The race that between ‘collapse_file’ and ’truncate_inode_pages_range’
will happen when ‘collapse_file' is executing line1 and line2. 
This race can be shown concisely:

Khugepaged:                                                writer
Collapse_file:                                                truncate_inode_pages_range

lock_page(hugepage)                                   skip hugepage because locked by others

					                              scan_left_pages() and wait_lock_page(hugepage)
scan_fail() & xas_store(&xas, 4k_page)

unlock_page(hugepage)
						                      get_lock_page(hugepage)
						                      hugepage is not in pagecache here, but not be checked

This situation triggers easily if we setting a very small ‘scan_sleep_millisecs’.

The above descriptions are roughly my understanding of ’somehow’. It says a lot of nonsense, maybe it helps!
> 
> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
All in all, what you mean is that we should solve this race at the source? However, a writer happens to appear in the middle
of a process in ‘collapse_file', so this bug solved when roll back. The above is my understanding, but it may be
wrong.

Thanks
Andrew Morton Sept. 24, 2021, 2:43 a.m. UTC | #3
On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:

> 
> 
> > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >> Transparent huge page has supported read-only non-shmem files. The file-
> >> backed THP is collapsed by khugepaged and truncated when written (for
> >> shared libraries).
> >> 
> >> However, there is race in two possible places.
> >> 
> >> 1) multiple writers truncate the same page cache concurrently;
> >> 2) collapse_file rolls back when writer truncates the page cache;
> > 
> > As I've said before, the bug here is that somehow there is a writable fd
> > to a file with THPs.  That's what we need to track down and fix.
> Hi, Matthew
> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>
> ...
>
> > https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> All in all, what you mean is that we should solve this race at the source?

Matthew is being pretty clear here: we shouldn't be permitting
userspace to get a writeable fd for a thp-backed file.

Why are we permitting the DSO to be opened writeably?  If there's a
legitimate case for doing this then presumably "mm, thp: relax the
VM_DENYWRITE constraint on file-backed THPs: should be fixed or
reverted.

If there is no legitimate use case for returning a writeable fd for a
thp-backed file then we should fail such an attempt at open().  This
approach has back-compatibility issues which need to be thought about. 
Perhaps we should permit the open-writeably attempt to appear to
succeed, but to really return a read-only fd?
Yang Shi Sept. 24, 2021, 3:08 a.m. UTC | #4
On Thu, Sep 23, 2021 at 7:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
>
> >
> >
> > > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> > >> Transparent huge page has supported read-only non-shmem files. The file-
> > >> backed THP is collapsed by khugepaged and truncated when written (for
> > >> shared libraries).
> > >>
> > >> However, there is race in two possible places.
> > >>
> > >> 1) multiple writers truncate the same page cache concurrently;
> > >> 2) collapse_file rolls back when writer truncates the page cache;
> > >
> > > As I've said before, the bug here is that somehow there is a writable fd
> > > to a file with THPs.  That's what we need to track down and fix.
> > Hi, Matthew
> > I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> > Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >
> > ...
> >
> > > https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> > All in all, what you mean is that we should solve this race at the source?
>
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.

No, he doesn't mean it IIRC. Actually we had the same conversation for
another patch. Quoted below:

" > > Things have already gone wrong before we get to this point.  See
> > do_dentry_open().  You aren't supposed to be able to get a writable file
> > descriptor on a file which has had huge pages added to the page cache
> > without the filesystem's knowledge.  That's the problem that needs to
> > be fixed.
>
> I don't quite understand your point here. Do you mean do_dentry_open()
> should fail for such cases instead of truncating the page cache?

No, do_dentry_open() should have truncated the page cache when it was
called and found that there were THPs in the cache.  Then khugepaged
should see that someone has the file open for write and decline to create
new THPs.  So it shouldn't be possible to get here with THPs in the cache."

Please see https://lore.kernel.org/linux-mm/YUkCI2I085Sos%2F64@casper.infradead.org/

But actually "mm, thp: relax the VM_DENYWRITE constraint on
file-backed THPs" did so exactly.

>
> Why are we permitting the DSO to be opened writeably?  If there's a
> legitimate case for doing this then presumably "mm, thp: relax the
> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.

Unfortunately we can't revert this commit anymore since VM_DENYWRITE
is gone due to commit 8d0920bde5eb ("mm: remove VM_DENYWRITE")

>
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open().  This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?
>
>
Rongwei Wang Sept. 24, 2021, 3:35 a.m. UTC | #5
On 9/24/21 10:43 AM, Andrew Morton wrote:
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
>
>>
>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>> shared libraries).
>>>>
>>>> However, there is race in two possible places.
>>>>
>>>> 1) multiple writers truncate the same page cache concurrently;
>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>> As I've said before, the bug here is that somehow there is a writable fd
>>> to a file with THPs.  That's what we need to track down and fix.
>> Hi, Matthew
>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>
>> ...
>>
>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
>> All in all, what you mean is that we should solve this race at the source?
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.
>
> Why are we permitting the DSO to be opened writeably?  If there's a
> legitimate case for doing this then presumably "mm, thp: relax the

Hi, we have written a user case to trigger this race mentioned above. 
this case just create one reader

to open DSO in RDONLY mode, and keep making the mapping page of DSO use 
huge pages by madvise,

then multiple writer to open and write the same DSO.

I will send it later after simple adjust.

Thanks

> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.
>
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open().  This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?
Rongwei Wang Sept. 24, 2021, 7:12 a.m. UTC | #6
On 9/24/21 10:43 AM, Andrew Morton wrote:
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> 
>>
>>
>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>> shared libraries).
>>>>
>>>> However, there is race in two possible places.
>>>>
>>>> 1) multiple writers truncate the same page cache concurrently;
>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>
>>> As I've said before, the bug here is that somehow there is a writable fd
>>> to a file with THPs.  That's what we need to track down and fix.
>> Hi, Matthew
>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>
>> ...
>>
>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
>> All in all, what you mean is that we should solve this race at the source?
> 
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.
> 
> Why are we permitting the DSO to be opened writeably?  If there's a
> legitimate case for doing this then presumably "mm, thp: relax the
There is a use case to stress file-backed THP within attachment.
I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:

$ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
$ ulimit -s unlimited
$ ./stress_madvise_dso 10000 <libtest.so>

the meaning of above parameters:
10000: the max test time;
<libtest.so>: the DSO that will been mapped into file-backed THP by 
madvise. It recommended that the text segment of DSO to be tested is 
greater than 2M.

The crash will been triggered at once in the latest kernel. And this
case also can used to trigger the bug that mentioned in our another patch.

> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.
> 
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open().  This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?
>
/*
 * case: stress file-backed THP
 */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sched.h>
#include <time.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h> /* for signal */
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <errno.h>

#define PATH_MAX	1024
#define BUFF_MAX	1024
#define TIME_DFL	180	/* seconds */

void signal_handler(int signo)
{
	/* Restore env */
	system("echo never > /sys/kernel/mm/transparent_hugepage/enabled");
	system("echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");

	printf("\nrestore env:\n");
	printf("	echo never > /sys/kernel/mm/transparent_hugepage/enabled\n");
	printf("	echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs\n");
	exit(-1);
}

/* in KB */
#define text_size	(14UL << 10)

#define PROCMAP_SZ	8
struct procmap {
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t pgoff;
	uint32_t maj;
	uint32_t min;
	uint32_t ino;
#define PROT_SZ		5
	char     prot[PROT_SZ];
	char     fname[PATH_MAX];
};

unsigned long sleep_secs = 0;

/*
 * Routines of procmap, i.e., /proc/pid/(s)maps
 */
static int get_memory_map(pid_t pid, struct procmap *procmap,
		const char *fname)
{
	char path[PATH_MAX];
	char line[BUFF_MAX];
	FILE *fp = NULL;
	char *end = NULL;
	char *pos, *sp = NULL, *in[PROCMAP_SZ];
	char dlm[] = "-   :   ";
	uint64_t counter;
	int i;

	snprintf(path, PATH_MAX, "/proc/%u/maps", pid);

	fp = fopen(path, "r");
	if (fp == NULL) {
		printf("fopen: %s: %s\n", path, strerror(errno));
		return -1;
	}

	if (procmap == NULL || fname == NULL) {
		perror("fail: procmap or fname is NULL");
		goto failed;
	}

	while (fgets(line, BUFF_MAX, fp)) {
		/* Split line into fields */
		pos = line;
		for (i = 0; i < PROCMAP_SZ; i++) {
			in[i] = strtok_r(pos, &dlm[i], &sp);
			if (in[i] == NULL)
				break;
			pos = NULL;
		}

		/* Check this line is procmap item header */
		if (i != PROCMAP_SZ)
			continue;

		memcpy(procmap->prot, in[2], PROT_SZ);
		memcpy(procmap->fname, in[7], PATH_MAX);

		/* Find the target entry */
		if (strcmp(procmap->prot, "r-xp") ||
				!strstr(procmap->fname, fname))
			continue;

		/* Convert/Copy each field as needed */
		errno = 0;
		procmap->vm_start = strtoull(in[0], &end, 16);
		if ((in[0] == '\0') || (end == NULL) || (*end != '\0') ||
				(errno != 0))
			goto failed;

		procmap->vm_end = strtoull(in[1], &end, 16);
		if ((in[1] == '\0') || (end == NULL) || (*end != '\0') ||
				(errno != 0))
			goto failed;

		procmap->pgoff = strtoull(in[3], &end, 16);
		if ((in[3] == '\0') || (end == NULL) || (*end != '\0') ||
				(errno != 0))
			goto failed;

		procmap->ino = strtoul(in[6], &end, 16);
		if ((in[6] == '\0') || (end == NULL) || (*end != '\0') ||
				(errno != 0))
			goto failed;
	}

	if (fp)
		fclose(fp);
	return 0;

failed:
	if (fp)
		fclose(fp);
	printf("fail: exit\n");

	return -1;
}

#define NR_CPU 32
uint64_t gettimeofday_sec(void);
inline uint64_t gettimeofday_sec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec;
}

void thread_read(int cpu, char *args)
{
	int fd;
	char *dso_path = args;
	char buf[0x800000];
	struct procmap maps;
	pid_t pid = getpid();

	cpu_set_t mask;
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
		printf("warning: can not set CPU affinity\n");
	}

	printf("read %s\n", dso_path);
	fd = open(dso_path, O_RDONLY);
	/* The start addr must be alignment with 2M */
	void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | PROT_EXEC,
			MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		goto out;
	}

	/* get the mapping address (ONLY r-xp) of the DSO */
	get_memory_map(pid, &maps, dso_path);
	printf("pid: %d\n", pid);
	printf("text vm_start: 0x%lx\n", maps.vm_start);
	printf("text vm_end: 0x%lx\n", maps.vm_end);
	madvise((void *)maps.vm_start, maps.vm_end - maps.vm_start, MADV_HUGEPAGE);
	lseek(fd, 0, SEEK_SET);
	for(;;) {
		memcpy(buf, p, 0x800000 - 1);
		sleep(1);
	}

	sleep(100);

out:
	/* Restore env */
	system("echo never > /sys/kernel/mm/transparent_hugepage/enabled");
	system("echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");

	printf("read exit %s\n", dso_path);
}

void thread_write(int cpu, char *args)
{
	void *p = NULL;
	char buf[32];
	uint64_t sec = 1;
	uint64_t count = 0;
	char *dso_path = args;

	cpu_set_t mask;
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
		printf("warning: can not set CPU affinity\n");
	}

	sleep(3);
	printf("write %s\n", dso_path);
	for (;;) {
		sec = gettimeofday_sec();
		while ((sec % 10) >= 3) {
			sec = gettimeofday_sec();
		}

		int fd = open(dso_path, O_RDWR);
		p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			goto out; /* fail */
		}

		lseek(fd, 0x1600, SEEK_SET);
		for(long i=1; i <= 2; i++){
			memcpy(p + 0x10, buf, 16);
		}

		munmap(p, 0x800000);
		close(fd);

		sleep(2);
		count++;
		if (count >= sleep_secs) {
			printf("test finish: %ld\n", count);
			break;
		}
	} /* end for */

out:
	printf("write exit %s\n", dso_path);
}

/*
 * usage:
 *		stress_madvise_dso <test time> <libtest.so>
 */
int main(int argc, char *argv[])
{
	struct timeval start, end;
	char dso_path[80];
	int ret = 0;
	pid_t pid;

	if (argc > 2) {
		sleep_secs = strtoul(argv[1], NULL, 10);
		realpath(argv[2], dso_path);
	}
	else {
		printf("usage error:\n"\
				"	stress_madvise_dso <test time> <libtest.so>\n"\
				"	e.g. stress_madvise_dso 10000 libtest.so\n");
		exit(-1);
	}

	/* Set env */
	system("ulimit -s unlimited");
	system("echo madvise > /sys/kernel/mm/transparent_hugepage/enabled");
	system("echo 1000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");

	gettimeofday(&start, NULL);

	/*
	 * fork 32 task to read and write the same DSO:
	 *		task 0: read dso;
	 *		task 1 - 31: write dso;
	 */
	for (int i = 0; i < NR_CPU; ++i) {
		pid = fork();
		if (pid == 0) {
			if (i == 0)
				thread_read(i, dso_path);
			else
				thread_write(i, dso_path);
			break; /* forbid child fork */
		}
		else {
			/* parent */
		}
	}

	if (pid != 0) {
		signal(SIGINT, signal_handler);
		signal(SIGSEGV, signal_handler);
		signal(SIGABRT, signal_handler);
		/* wait */
		while (1) {
			int status;

			pid_t done = wait(&status);
			if (done == -1) {
				if (errno == ECHILD)
					break; /* No more child processes */
			} else {
				if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
					printf("Pid:%d failed\n", done);
					goto out;
				}
			}
		}
	}

out:
	if (ret == 0) {
		gettimeofday(&end, NULL);
		printf("time to collapse file thp: %ld ms\n",
				1000 * (end.tv_sec - start.tv_sec) +
				(end.tv_usec - start.tv_usec) / 1000);
	}

	return ret;
}
Song Liu Sept. 27, 2021, 10:24 p.m. UTC | #7
On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 9/24/21 10:43 AM, Andrew Morton wrote:
> > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> >
> >>
> >>
> >>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >>>
> >>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>> shared libraries).
> >>>>
> >>>> However, there is race in two possible places.
> >>>>
> >>>> 1) multiple writers truncate the same page cache concurrently;
> >>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>
> >>> As I've said before, the bug here is that somehow there is a writable fd
> >>> to a file with THPs.  That's what we need to track down and fix.
> >> Hi, Matthew
> >> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>
> >> ...
> >>
> >>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> >> All in all, what you mean is that we should solve this race at the source?
> >
> > Matthew is being pretty clear here: we shouldn't be permitting
> > userspace to get a writeable fd for a thp-backed file.
> >
> > Why are we permitting the DSO to be opened writeably?  If there's a
> > legitimate case for doing this then presumably "mm, thp: relax the
> There is a use case to stress file-backed THP within attachment.
> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>
> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> $ ulimit -s unlimited
> $ ./stress_madvise_dso 10000 <libtest.so>
>
> the meaning of above parameters:
> 10000: the max test time;
> <libtest.so>: the DSO that will been mapped into file-backed THP by
> madvise. It recommended that the text segment of DSO to be tested is
> greater than 2M.
>
> The crash will been triggered at once in the latest kernel. And this
> case also can used to trigger the bug that mentioned in our another patch.

Hmm.. I am not able to use the repro program to crash the system. Not
sure what I did wrong.

OTOH, does it make sense to block writes within khugepaged, like:

diff --git i/mm/khugepaged.c w/mm/khugepaged.c
index 045cc579f724e..ad7c41ec15027 100644
--- i/mm/khugepaged.c
+++ w/mm/khugepaged.c
@@ -51,6 +51,7 @@ enum scan_result {
        SCAN_CGROUP_CHARGE_FAIL,
        SCAN_TRUNCATED,
        SCAN_PAGE_HAS_PRIVATE,
+       SCAN_BUSY_WRITE,
 };

 #define CREATE_TRACE_POINTS
@@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
        /* Only allocate from the target node */
        gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

+       if (deny_write_access(file)) {
+               result = SCAN_BUSY_WRITE;
+               return;
+       }
+
        new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm,
        else {
                __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
                filemap_nr_thps_inc(mapping);
-               /*
-                * Paired with smp_mb() in do_dentry_open() to ensure
-                * i_writecount is up to date and the update to nr_thps is
-                * visible. Ensures the page cache will be truncated if the
-                * file is opened writable.
-                */
-               smp_mb();
-               if (inode_is_open_for_write(mapping->host)) {
-                       result = SCAN_FAIL;
-                       __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
-                       filemap_nr_thps_dec(mapping);
-                       goto xa_locked;
-               }
        }

        if (nr_none) {
@@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm,
        VM_BUG_ON(!list_empty(&pagelist));
        if (!IS_ERR_OR_NULL(*hpage))
                mem_cgroup_uncharge(*hpage);
+       allow_write_access(file);
        /* TODO: tracepoints */
 }
Matthew Wilcox (Oracle) Sept. 28, 2021, 12:06 p.m. UTC | #8
On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote:
> OTOH, does it make sense to block writes within khugepaged, like:
> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
>         /* Only allocate from the target node */
>         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> 
> +       if (deny_write_access(file)) {
> +               result = SCAN_BUSY_WRITE;
> +               return;
> +       }

The problem is that it denies, rather than blocking.  That means that the
writer gets ETXTBSY instead of waiting until khugepaged is done.
Rongwei Wang Sept. 28, 2021, 4:20 p.m. UTC | #9
On 9/28/21 6:24 AM, Song Liu wrote:
> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 9/24/21 10:43 AM, Andrew Morton wrote:
>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
>>>
>>>>
>>>>
>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>>>
>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>>>> shared libraries).
>>>>>>
>>>>>> However, there is race in two possible places.
>>>>>>
>>>>>> 1) multiple writers truncate the same page cache concurrently;
>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>>>
>>>>> As I've said before, the bug here is that somehow there is a writable fd
>>>>> to a file with THPs.  That's what we need to track down and fix.
>>>> Hi, Matthew
>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>>>
>>>> ...
>>>>
>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
>>>> All in all, what you mean is that we should solve this race at the source?
>>>
>>> Matthew is being pretty clear here: we shouldn't be permitting
>>> userspace to get a writeable fd for a thp-backed file.
>>>
>>> Why are we permitting the DSO to be opened writeably?  If there's a
>>> legitimate case for doing this then presumably "mm, thp: relax the
>> There is a use case to stress file-backed THP within attachment.
>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>>
>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
>> $ ulimit -s unlimited
>> $ ./stress_madvise_dso 10000 <libtest.so>
>>
>> the meaning of above parameters:
>> 10000: the max test time;
>> <libtest.so>: the DSO that will been mapped into file-backed THP by
>> madvise. It recommended that the text segment of DSO to be tested is
>> greater than 2M.
>>
>> The crash will been triggered at once in the latest kernel. And this
>> case also can used to trigger the bug that mentioned in our another patch.
> 
> Hmm.. I am not able to use the repro program to crash the system. Not
> sure what I did wrong.
> 
Hi
I have tried to check my test case again. Can you make sure the DSO that 
you test have THP mapping?

If you are willing to try again, I can send my libtest.c which is used 
to test by myself (actually, it shouldn't be target DSO problem).

Thanks very much!
> OTOH, does it make sense to block writes within khugepaged, like:
> 
> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> index 045cc579f724e..ad7c41ec15027 100644
> --- i/mm/khugepaged.c
> +++ w/mm/khugepaged.c
> @@ -51,6 +51,7 @@ enum scan_result {
>          SCAN_CGROUP_CHARGE_FAIL,
>          SCAN_TRUNCATED,
>          SCAN_PAGE_HAS_PRIVATE,
> +       SCAN_BUSY_WRITE,
>   };
> 
>   #define CREATE_TRACE_POINTS
> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
>          /* Only allocate from the target node */
>          gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> 
> +       if (deny_write_access(file)) {
> +               result = SCAN_BUSY_WRITE;
> +               return;
> +       }
> +
This can indeed avoid some possible races from source.

But, I am thinking about whether this will lead to DDoS attack?
I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
is that DDoS attack. In addition, 'deny_write_access' will change
the behavior, such as user will get 'Text file busy' during 
collapse_file. I am not sure whether the behavior changing is acceptable 
in user space.

If it is acceptable, I am very willing to fix the races like your way.

Thanks!
>          new_page = khugepaged_alloc_page(hpage, gfp, node);
>          if (!new_page) {
>                  result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> @@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm,
>          else {
>                  __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
>                  filemap_nr_thps_inc(mapping);
> -               /*
> -                * Paired with smp_mb() in do_dentry_open() to ensure
> -                * i_writecount is up to date and the update to nr_thps is
> -                * visible. Ensures the page cache will be truncated if the
> -                * file is opened writable.
> -                */
> -               smp_mb();
> -               if (inode_is_open_for_write(mapping->host)) {
> -                       result = SCAN_FAIL;
> -                       __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
> -                       filemap_nr_thps_dec(mapping);
> -                       goto xa_locked;
> -               }
>          }
> 
>          if (nr_none) {
> @@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm,
>          VM_BUG_ON(!list_empty(&pagelist));
>          if (!IS_ERR_OR_NULL(*hpage))
>                  mem_cgroup_uncharge(*hpage);
> +       allow_write_access(file);
>          /* TODO: tracepoints */
>   }
>
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* "volatile" to forbid compiler optimization */
volatile static int x;

#define DO_0 ++x;
#define DO_1 {DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0}
#define DO_2 {DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1}
#define DO_3 {DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2}
#define DO_4 {DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3}
#define DO_5 {DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4}
#define DO_6 {DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5}
#define DO_7 {DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6}

void libtest_work1(void)
{
	printf("work 1\n");
	DO_0;
}

void libtest_work2(void)
{
	printf("work 2\n");
	DO_2;
}

void libtest_work3(void)
{
	printf("work 3\n");
	DO_4;
}

void libtest_work4(void)
{
	printf("work 4\n");
	DO_6;
}
Song Liu Sept. 28, 2021, 4:59 p.m. UTC | #10
On Tue, Sep 28, 2021 at 5:07 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote:
> > OTOH, does it make sense to block writes within khugepaged, like:
> > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> >         /* Only allocate from the target node */
> >         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> > +       if (deny_write_access(file)) {
> > +               result = SCAN_BUSY_WRITE;
> > +               return;
> > +       }
>
> The problem is that it denies, rather than blocking.  That means that the
> writer gets ETXTBSY instead of waiting until khugepaged is done.
>

Yes, I was thinking about the same problem last night and this morning.
Unfortunately, I haven't got a good solution yet. Do you have some
suggestions on this?

Thanks,
Song
Song Liu Sept. 29, 2021, 7:14 a.m. UTC | #11
On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 9/28/21 6:24 AM, Song Liu wrote:
> > On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 9/24/21 10:43 AM, Andrew Morton wrote:
> >>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> >>>
> >>>>
> >>>>
> >>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >>>>>
> >>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>>>> shared libraries).
> >>>>>>
> >>>>>> However, there is race in two possible places.
> >>>>>>
> >>>>>> 1) multiple writers truncate the same page cache concurrently;
> >>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>>>
> >>>>> As I've said before, the bug here is that somehow there is a writable fd
> >>>>> to a file with THPs.  That's what we need to track down and fix.
> >>>> Hi, Matthew
> >>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>>>
> >>>> ...
> >>>>
> >>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> >>>> All in all, what you mean is that we should solve this race at the source?
> >>>
> >>> Matthew is being pretty clear here: we shouldn't be permitting
> >>> userspace to get a writeable fd for a thp-backed file.
> >>>
> >>> Why are we permitting the DSO to be opened writeably?  If there's a
> >>> legitimate case for doing this then presumably "mm, thp: relax the
> >> There is a use case to stress file-backed THP within attachment.
> >> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> >>
> >> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> >> $ ulimit -s unlimited
> >> $ ./stress_madvise_dso 10000 <libtest.so>
> >>
> >> the meaning of above parameters:
> >> 10000: the max test time;
> >> <libtest.so>: the DSO that will been mapped into file-backed THP by
> >> madvise. It recommended that the text segment of DSO to be tested is
> >> greater than 2M.
> >>
> >> The crash will been triggered at once in the latest kernel. And this
> >> case also can used to trigger the bug that mentioned in our another patch.
> >
> > Hmm.. I am not able to use the repro program to crash the system. Not
> > sure what I did wrong.
> >
> Hi
> I have tried to check my test case again. Can you make sure the DSO that
> you test have THP mapping?
>
> If you are willing to try again, I can send my libtest.c which is used
> to test by myself (actually, it shouldn't be target DSO problem).
>
> Thanks very much!
> > OTOH, does it make sense to block writes within khugepaged, like:
> >
> > diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> > index 045cc579f724e..ad7c41ec15027 100644
> > --- i/mm/khugepaged.c
> > +++ w/mm/khugepaged.c
> > @@ -51,6 +51,7 @@ enum scan_result {
> >          SCAN_CGROUP_CHARGE_FAIL,
> >          SCAN_TRUNCATED,
> >          SCAN_PAGE_HAS_PRIVATE,
> > +       SCAN_BUSY_WRITE,
> >   };
> >
> >   #define CREATE_TRACE_POINTS
> > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> >          /* Only allocate from the target node */
> >          gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> > +       if (deny_write_access(file)) {
> > +               result = SCAN_BUSY_WRITE;
> > +               return;
> > +       }
> > +
> This can indeed avoid some possible races from source.
>
> But, I am thinking about whether this will lead to DDoS attack?
> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> is that DDoS attack. In addition, 'deny_write_access' will change
> the behavior, such as user will get 'Text file busy' during
> collapse_file. I am not sure whether the behavior changing is acceptable
> in user space.
>
> If it is acceptable, I am very willing to fix the races like your way.

I guess we should not let the write get ETXTBUSY for khugepaged work.

I am getting some segfault on stress_madvise_dso. And it doesn't really
generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
version of it?

Thanks,
Song
Rongwei Wang Sept. 29, 2021, 7:50 a.m. UTC | #12
On 9/29/21 3:14 PM, Song Liu wrote:
> On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 9/28/21 6:24 AM, Song Liu wrote:
>>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
>>> <rongwei.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
>>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>>>>>> shared libraries).
>>>>>>>>
>>>>>>>> However, there is race in two possible places.
>>>>>>>>
>>>>>>>> 1) multiple writers truncate the same page cache concurrently;
>>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>>>>>
>>>>>>> As I've said before, the bug here is that somehow there is a writable fd
>>>>>>> to a file with THPs.  That's what we need to track down and fix.
>>>>>> Hi, Matthew
>>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
>>>>>> All in all, what you mean is that we should solve this race at the source?
>>>>>
>>>>> Matthew is being pretty clear here: we shouldn't be permitting
>>>>> userspace to get a writeable fd for a thp-backed file.
>>>>>
>>>>> Why are we permitting the DSO to be opened writeably?  If there's a
>>>>> legitimate case for doing this then presumably "mm, thp: relax the
>>>> There is a use case to stress file-backed THP within attachment.
>>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>>>>
>>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
>>>> $ ulimit -s unlimited
>>>> $ ./stress_madvise_dso 10000 <libtest.so>
>>>>
>>>> the meaning of above parameters:
>>>> 10000: the max test time;
>>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
>>>> madvise. It recommended that the text segment of DSO to be tested is
>>>> greater than 2M.
>>>>
>>>> The crash will been triggered at once in the latest kernel. And this
>>>> case also can used to trigger the bug that mentioned in our another patch.
>>>
>>> Hmm.. I am not able to use the repro program to crash the system. Not
>>> sure what I did wrong.
>>>
>> Hi
>> I have tried to check my test case again. Can you make sure the DSO that
>> you test have THP mapping?
>>
>> If you are willing to try again, I can send my libtest.c which is used
>> to test by myself (actually, it shouldn't be target DSO problem).
>>
>> Thanks very much!
>>> OTOH, does it make sense to block writes within khugepaged, like:
>>>
>>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
>>> index 045cc579f724e..ad7c41ec15027 100644
>>> --- i/mm/khugepaged.c
>>> +++ w/mm/khugepaged.c
>>> @@ -51,6 +51,7 @@ enum scan_result {
>>>           SCAN_CGROUP_CHARGE_FAIL,
>>>           SCAN_TRUNCATED,
>>>           SCAN_PAGE_HAS_PRIVATE,
>>> +       SCAN_BUSY_WRITE,
>>>    };
>>>
>>>    #define CREATE_TRACE_POINTS
>>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
>>>           /* Only allocate from the target node */
>>>           gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>>>
>>> +       if (deny_write_access(file)) {
>>> +               result = SCAN_BUSY_WRITE;
>>> +               return;
>>> +       }
>>> +
>> This can indeed avoid some possible races from source.
>>
>> But, I am thinking about whether this will lead to DDoS attack?
>> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
>> is that DDoS attack. In addition, 'deny_write_access' will change
>> the behavior, such as user will get 'Text file busy' during
>> collapse_file. I am not sure whether the behavior changing is acceptable
>> in user space.
>>
>> If it is acceptable, I am very willing to fix the races like your way.
> 
> I guess we should not let the write get ETXTBUSY for khugepaged work.
> 
> I am getting some segfault on stress_madvise_dso. And it doesn't really
> generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
Hi, I can sure I am not update the stress_madvise_dso.c.

My test environment is vm (qemu-system-aarch64, 32 cores). And I can 
think of the following possibilities:

(1) in thread_read()

printf("read %s\n", dso_path);
d = open(dso_path, O_RDONLY);
/* The start addr must be alignment with 2M */
void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | 
PROT_EXEC,MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) {
	perror("mmap");
	goto out;
}

0x40000dc00000 is random setting by myself. I am not sure this address 
is available in your vm.

(2) in thread_write()
int fd = open(dso_path, O_RDWR);
p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
	perror("mmap");
         goto out; /* fail */
}

because of I am sure the DSO is bigger than 0x800000, so directly map 
the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to 
compile the DSO? likes:
$ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o

If you don't mind, you can send the segment fault log to me. And I will
find x86 environment to test.

Thanks!
> version of it?
> 
> Thanks,
> Song
>
Song Liu Sept. 29, 2021, 4:59 p.m. UTC | #13
On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 9/29/21 3:14 PM, Song Liu wrote:
> > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 9/28/21 6:24 AM, Song Liu wrote:
> >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> >>> <rongwei.wang@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
> >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >>>>>>>
> >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>>>>>> shared libraries).
> >>>>>>>>
> >>>>>>>> However, there is race in two possible places.
> >>>>>>>>
> >>>>>>>> 1) multiple writers truncate the same page cache concurrently;
> >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>>>>>
> >>>>>>> As I've said before, the bug here is that somehow there is a writable fd
> >>>>>>> to a file with THPs.  That's what we need to track down and fix.
> >>>>>> Hi, Matthew
> >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> >>>>>> All in all, what you mean is that we should solve this race at the source?
> >>>>>
> >>>>> Matthew is being pretty clear here: we shouldn't be permitting
> >>>>> userspace to get a writeable fd for a thp-backed file.
> >>>>>
> >>>>> Why are we permitting the DSO to be opened writeably?  If there's a
> >>>>> legitimate case for doing this then presumably "mm, thp: relax the
> >>>> There is a use case to stress file-backed THP within attachment.
> >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> >>>>
> >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> >>>> $ ulimit -s unlimited
> >>>> $ ./stress_madvise_dso 10000 <libtest.so>
> >>>>
> >>>> the meaning of above parameters:
> >>>> 10000: the max test time;
> >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
> >>>> madvise. It recommended that the text segment of DSO to be tested is
> >>>> greater than 2M.
> >>>>
> >>>> The crash will been triggered at once in the latest kernel. And this
> >>>> case also can used to trigger the bug that mentioned in our another patch.
> >>>
> >>> Hmm.. I am not able to use the repro program to crash the system. Not
> >>> sure what I did wrong.
> >>>
> >> Hi
> >> I have tried to check my test case again. Can you make sure the DSO that
> >> you test have THP mapping?
> >>
> >> If you are willing to try again, I can send my libtest.c which is used
> >> to test by myself (actually, it shouldn't be target DSO problem).
> >>
> >> Thanks very much!
> >>> OTOH, does it make sense to block writes within khugepaged, like:
> >>>
> >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> >>> index 045cc579f724e..ad7c41ec15027 100644
> >>> --- i/mm/khugepaged.c
> >>> +++ w/mm/khugepaged.c
> >>> @@ -51,6 +51,7 @@ enum scan_result {
> >>>           SCAN_CGROUP_CHARGE_FAIL,
> >>>           SCAN_TRUNCATED,
> >>>           SCAN_PAGE_HAS_PRIVATE,
> >>> +       SCAN_BUSY_WRITE,
> >>>    };
> >>>
> >>>    #define CREATE_TRACE_POINTS
> >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> >>>           /* Only allocate from the target node */
> >>>           gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >>>
> >>> +       if (deny_write_access(file)) {
> >>> +               result = SCAN_BUSY_WRITE;
> >>> +               return;
> >>> +       }
> >>> +
> >> This can indeed avoid some possible races from source.
> >>
> >> But, I am thinking about whether this will lead to DDoS attack?
> >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> >> is that DDoS attack. In addition, 'deny_write_access' will change
> >> the behavior, such as user will get 'Text file busy' during
> >> collapse_file. I am not sure whether the behavior changing is acceptable
> >> in user space.
> >>
> >> If it is acceptable, I am very willing to fix the races like your way.
> >
> > I guess we should not let the write get ETXTBUSY for khugepaged work.
> >
> > I am getting some segfault on stress_madvise_dso. And it doesn't really
> > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
> Hi, I can sure I am not update the stress_madvise_dso.c.
>
> My test environment is vm (qemu-system-aarch64, 32 cores). And I can
> think of the following possibilities:
>
> (1) in thread_read()
>
> printf("read %s\n", dso_path);
> d = open(dso_path, O_RDONLY);
> /* The start addr must be alignment with 2M */
> void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ |
> PROT_EXEC,MAP_PRIVATE, fd, 0);
> if (p == MAP_FAILED) {
>         perror("mmap");
>         goto out;
> }
>
> 0x40000dc00000 is random setting by myself. I am not sure this address
> is available in your vm.
>
> (2) in thread_write()
> int fd = open(dso_path, O_RDWR);
> p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> if (p == MAP_FAILED) {
>         perror("mmap");
>          goto out; /* fail */
> }
>
> because of I am sure the DSO is bigger than 0x800000, so directly map
> the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to
> compile the DSO? likes:
> $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o
>
> If you don't mind, you can send the segment fault log to me. And I will
> find x86 environment to test.

I fixed the segfault with
1. malloc buf (as it is too big for stack) in thread_read
2. reduce memcpy() size in thread_read.

Now, I am able to crash the system on
    find_lock_entries () {
     ...
       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
    }
I guess it is related. I will test more.

Thanks,
Song
Matthew Wilcox (Oracle) Sept. 29, 2021, 5:55 p.m. UTC | #14
On Wed, Sep 29, 2021 at 09:59:38AM -0700, Song Liu wrote:
> On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 9/29/21 3:14 PM, Song Liu wrote:
> > > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> > > <rongwei.wang@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 9/28/21 6:24 AM, Song Liu wrote:
> > >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> > >>> <rongwei.wang@linux.alibaba.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
> > >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > >>>>>>>
> > >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> > >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> > >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> > >>>>>>>> shared libraries).
> > >>>>>>>>
> > >>>>>>>> However, there is race in two possible places.
> > >>>>>>>>
> > >>>>>>>> 1) multiple writers truncate the same page cache concurrently;
> > >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> > >>>>>>>
> > >>>>>>> As I've said before, the bug here is that somehow there is a writable fd
> > >>>>>>> to a file with THPs.  That's what we need to track down and fix.
> > >>>>>> Hi, Matthew
> > >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> > >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> > >>>>>>
> > >>>>>> ...
> > >>>>>>
> > >>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> > >>>>>> All in all, what you mean is that we should solve this race at the source?
> > >>>>>
> > >>>>> Matthew is being pretty clear here: we shouldn't be permitting
> > >>>>> userspace to get a writeable fd for a thp-backed file.
> > >>>>>
> > >>>>> Why are we permitting the DSO to be opened writeably?  If there's a
> > >>>>> legitimate case for doing this then presumably "mm, thp: relax the
> > >>>> There is a use case to stress file-backed THP within attachment.
> > >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> > >>>>
> > >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> > >>>> $ ulimit -s unlimited
> > >>>> $ ./stress_madvise_dso 10000 <libtest.so>
> > >>>>
> > >>>> the meaning of above parameters:
> > >>>> 10000: the max test time;
> > >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
> > >>>> madvise. It recommended that the text segment of DSO to be tested is
> > >>>> greater than 2M.
> > >>>>
> > >>>> The crash will been triggered at once in the latest kernel. And this
> > >>>> case also can used to trigger the bug that mentioned in our another patch.
> > >>>
> > >>> Hmm.. I am not able to use the repro program to crash the system. Not
> > >>> sure what I did wrong.
> > >>>
> > >> Hi
> > >> I have tried to check my test case again. Can you make sure the DSO that
> > >> you test have THP mapping?
> > >>
> > >> If you are willing to try again, I can send my libtest.c which is used
> > >> to test by myself (actually, it shouldn't be target DSO problem).
> > >>
> > >> Thanks very much!
> > >>> OTOH, does it make sense to block writes within khugepaged, like:
> > >>>
> > >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> > >>> index 045cc579f724e..ad7c41ec15027 100644
> > >>> --- i/mm/khugepaged.c
> > >>> +++ w/mm/khugepaged.c
> > >>> @@ -51,6 +51,7 @@ enum scan_result {
> > >>>           SCAN_CGROUP_CHARGE_FAIL,
> > >>>           SCAN_TRUNCATED,
> > >>>           SCAN_PAGE_HAS_PRIVATE,
> > >>> +       SCAN_BUSY_WRITE,
> > >>>    };
> > >>>
> > >>>    #define CREATE_TRACE_POINTS
> > >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> > >>>           /* Only allocate from the target node */
> > >>>           gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > >>>
> > >>> +       if (deny_write_access(file)) {
> > >>> +               result = SCAN_BUSY_WRITE;
> > >>> +               return;
> > >>> +       }
> > >>> +
> > >> This can indeed avoid some possible races from source.
> > >>
> > >> But, I am thinking about whether this will lead to DDoS attack?
> > >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> > >> is that DDoS attack. In addition, 'deny_write_access' will change
> > >> the behavior, such as user will get 'Text file busy' during
> > >> collapse_file. I am not sure whether the behavior changing is acceptable
> > >> in user space.
> > >>
> > >> If it is acceptable, I am very willing to fix the races like your way.
> > >
> > > I guess we should not let the write get ETXTBUSY for khugepaged work.
> > >
> > > I am getting some segfault on stress_madvise_dso. And it doesn't really
> > > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
> > Hi, I can sure I am not update the stress_madvise_dso.c.
> >
> > My test environment is vm (qemu-system-aarch64, 32 cores). And I can
> > think of the following possibilities:
> >
> > (1) in thread_read()
> >
> > printf("read %s\n", dso_path);
> > d = open(dso_path, O_RDONLY);
> > /* The start addr must be alignment with 2M */
> > void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ |
> > PROT_EXEC,MAP_PRIVATE, fd, 0);
> > if (p == MAP_FAILED) {
> >         perror("mmap");
> >         goto out;
> > }
> >
> > 0x40000dc00000 is random setting by myself. I am not sure this address
> > is available in your vm.
> >
> > (2) in thread_write()
> > int fd = open(dso_path, O_RDWR);
> > p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > if (p == MAP_FAILED) {
> >         perror("mmap");
> >          goto out; /* fail */
> > }
> >
> > because of I am sure the DSO is bigger than 0x800000, so directly map
> > the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to
> > compile the DSO? likes:
> > $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o
> >
> > If you don't mind, you can send the segment fault log to me. And I will
> > find x86 environment to test.
> 
> I fixed the segfault with
> 1. malloc buf (as it is too big for stack) in thread_read
> 2. reduce memcpy() size in thread_read.
> 
> Now, I am able to crash the system on
>     find_lock_entries () {
>      ...
>        VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>     }
> I guess it is related. I will test more.

That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
Andrew has it too, but for some reason, he hasn't sent it on to Linus.

+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
                if (!xa_is_value(page)) {
                        if (page->index < start)
                                goto put;
-                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
                        if (page->index + thp_nr_pages(page) - 1 > end)
                                goto put;
                        if (!trylock_page(page))
Song Liu Sept. 29, 2021, 11:41 p.m. UTC | #15
On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
[...]
> > Now, I am able to crash the system on
> >     find_lock_entries () {
> >      ...
> >        VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >     }
> > I guess it is related. I will test more.
>
> That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
>
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
>                 if (!xa_is_value(page)) {
>                         if (page->index < start)
>                                 goto put;
> -                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>                         if (page->index + thp_nr_pages(page) - 1 > end)
>                                 goto put;
>                         if (!trylock_page(page))

Yes, after removing this line, I am able to see the same bug.

Here is my finding so far:

The issue is NOT caused by concurrent khugepaged:collapse_file() and
truncate_pagecache(inode, 0). With some printks, we can see a clear
time gap (>2 second )  between collapse_file() finishes, and
truncate_pagecache() (which crashes soon). Therefore, my earlier
suggestion that adds deny_write_access() to collapse_file() does NOT
work.

The crash is actually caused by concurrent truncate_pagecache(inode, 0).
If I change the number of write thread in stress_madvise_dso.c to one,
(IOW, one thread_read and one thread_write), I cannot reproduce the
crash anymore.

I think this means we cannot fix this issue in collapse_file(), because it
finishes long before the crash.

Thanks,
Song
Matthew Wilcox (Oracle) Sept. 30, 2021, midnight UTC | #16
On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
> The issue is NOT caused by concurrent khugepaged:collapse_file() and
> truncate_pagecache(inode, 0). With some printks, we can see a clear
> time gap (>2 second )  between collapse_file() finishes, and
> truncate_pagecache() (which crashes soon). Therefore, my earlier
> suggestion that adds deny_write_access() to collapse_file() does NOT
> work.
> 
> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> If I change the number of write thread in stress_madvise_dso.c to one,
> (IOW, one thread_read and one thread_write), I cannot reproduce the
> crash anymore.
> 
> I think this means we cannot fix this issue in collapse_file(), because it
> finishes long before the crash.

Ah!  So are we missing one or more of these locks:

	inode_lock(inode);
	filemap_invalidate_lock(mapping);

in the open path?
Song Liu Sept. 30, 2021, 12:41 a.m. UTC | #17
On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
> > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > time gap (>2 second )  between collapse_file() finishes, and
> > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > suggestion that adds deny_write_access() to collapse_file() does NOT
> > work.
> >
> > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > If I change the number of write thread in stress_madvise_dso.c to one,
> > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > crash anymore.
> >
> > I think this means we cannot fix this issue in collapse_file(), because it
> > finishes long before the crash.
>
> Ah!  So are we missing one or more of these locks:
>
>         inode_lock(inode);
>         filemap_invalidate_lock(mapping);
>
> in the open path?

The following fixes the crash in my test. But I am not sure whether this is the
best fix.

Rongwei, could you please run more tests on it?

Thanks,
Song


diff --git i/fs/open.c w/fs/open.c
index daa324606a41f..d13c4668b2e53 100644
--- i/fs/open.c
+++ w/fs/open.c
@@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
                 * of THPs into the page cache will fail.
                 */
                smp_mb();
-               if (filemap_nr_thps(inode->i_mapping))
+               if (filemap_nr_thps(inode->i_mapping)) {
+                       filemap_invalidate_lock(inode->i_mapping);
                        truncate_pagecache(inode, 0);
+                       filemap_invalidate_unlock(inode->i_mapping);
+               }
        }

        return 0;
Rongwei Wang Sept. 30, 2021, 1:54 a.m. UTC | #18
On 9/30/21 7:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
> [...]
>>> Now, I am able to crash the system on
>>>      find_lock_entries () {
>>>       ...
>>>         VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>>>      }
>>> I guess it is related. I will test more.
>>
>> That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
>> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
>>
>> +++ b/mm/filemap.c
>> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
>>                  if (!xa_is_value(page)) {
>>                          if (page->index < start)
>>                                  goto put;
>> -                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>>                          if (page->index + thp_nr_pages(page) - 1 > end)
>>                                  goto put;
>>                          if (!trylock_page(page))
> 
> Yes, after removing this line, I am able to see the same bug.
> 
> Here is my finding so far:
> 
> The issue is NOT caused by concurrent khugepaged:collapse_file() and
> truncate_pagecache(inode, 0). With some printks, we can see a clear
> time gap (>2 second )  between collapse_file() finishes, and
> truncate_pagecache() (which crashes soon). Therefore, my earlier
> suggestion that adds deny_write_access() to collapse_file() does NOT
> work.
> 
> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> If I change the number of write thread in stress_madvise_dso.c to one,
> (IOW, one thread_read and one thread_write), I cannot reproduce the
> crash anymore.
Whether CONFIG_DEBUG_VM is enabled in your vm?

I think the second possibility mentioned above will been found if you 
enable CONFIG_DEBUG_VM:

1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back when writer truncates the page cache;

The following log will be print after enable CONFIG_DEBUG_VM:

[22216.789904]  do_idle+0xb4/0x104
[22216.789906]  cpu_startup_entry+0x34/0x9c
[22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 
0.0.0 02/06/2015
[22216.790553]  secondary_start_kernel+0x104/0x180
[22216.790778] Call trace:
[22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
[22216.791662]  dump_backtrace+0x0/0x1ec
[22216.791664]  show_stack+0x24/0x30
[22216.791956] ---[ end trace dc769a61c1af087b ]---
[22216.792295]  dump_stack+0xd0/0x128
[22216.792299]  bad_page+0xe4/0x110
[22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception 
in interrupt
[22216.792937]  check_free_page_bad+0x84/0x90
[22216.792940]  free_pcp_prepare+0x1fc/0x21c
[22216.793253] SMP: stopping secondary CPUs
[22216.793525]  free_unref_page+0x2c/0xec
[22216.805537]  __put_page+0x60/0x70
[22216.805931]  collapse_file+0xdc8/0x12f0
[22216.806385]  khugepaged_scan_file+0x2dc/0x37c
[22216.806900]  khugepaged_scan_mm_slot+0x2e0/0x380
[22216.807450]  khugepaged_do_scan+0x2dc/0x2fc
[22216.807946]  khugepaged+0x38/0x100
[22216.808342]  kthread+0x11c/0x120
[22216.808735] Kernel Offset: disabled
[22216.809153] CPU features: 0x0040002,62208238
[22216.809681] Memory Limit: none
[22216.813477] Starting crashdump kernel...

So I think the race also exists between collapse_file and 
truncate_pagecache.

> 
> I think this means we cannot fix this issue in collapse_file(), because it
> finishes long before the crash.
> 
> Thanks,
> Song
>
Rongwei Wang Sept. 30, 2021, 2:14 a.m. UTC | #19
On 9/30/21 8:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
>>> The issue is NOT caused by concurrent khugepaged:collapse_file() and
>>> truncate_pagecache(inode, 0). With some printks, we can see a clear
>>> time gap (>2 second )  between collapse_file() finishes, and
>>> truncate_pagecache() (which crashes soon). Therefore, my earlier
>>> suggestion that adds deny_write_access() to collapse_file() does NOT
>>> work.
>>>
>>> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
>>> If I change the number of write thread in stress_madvise_dso.c to one,
>>> (IOW, one thread_read and one thread_write), I cannot reproduce the
>>> crash anymore.
>>>
>>> I think this means we cannot fix this issue in collapse_file(), because it
>>> finishes long before the crash.
>>
>> Ah!  So are we missing one or more of these locks:
>>
>>          inode_lock(inode);
>>          filemap_invalidate_lock(mapping);
>>
>> in the open path?
> 
> The following fixes the crash in my test. But I am not sure whether this is the
> best fix.
> 
> Rongwei, could you please run more tests on it?
Yes, I'd like to.
> 
> Thanks,
> Song
> 
> 
> diff --git i/fs/open.c w/fs/open.c
> index daa324606a41f..d13c4668b2e53 100644
> --- i/fs/open.c
> +++ w/fs/open.c
> @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
>                   * of THPs into the page cache will fail.
>                   */
>                  smp_mb();
> -               if (filemap_nr_thps(inode->i_mapping))
> +               if (filemap_nr_thps(inode->i_mapping)) {
> +                       filemap_invalidate_lock(inode->i_mapping);
Learned something, Thanks!

But, the race between collapse_file and truncate_pagecache, I am not 
sure whether it exists or not. If exists, whether this patch only can 
fix truncate_pagecache concurrent?

Anyway, I will run more tests on it at first.

Thanks!
>                          truncate_pagecache(inode, 0);
> +                       filemap_invalidate_unlock(inode->i_mapping);
> +               }
>          }
> 
>          return 0;
>
Song Liu Sept. 30, 2021, 3:26 a.m. UTC | #20
On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 9/30/21 7:41 AM, Song Liu wrote:
> > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> > [...]
> >>> Now, I am able to crash the system on
> >>>      find_lock_entries () {
> >>>       ...
> >>>         VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >>>      }
> >>> I guess it is related. I will test more.
> >>
> >> That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
> >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
> >>
> >> +++ b/mm/filemap.c
> >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> >>                  if (!xa_is_value(page)) {
> >>                          if (page->index < start)
> >>                                  goto put;
> >> -                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >>                          if (page->index + thp_nr_pages(page) - 1 > end)
> >>                                  goto put;
> >>                          if (!trylock_page(page))
> >
> > Yes, after removing this line, I am able to see the same bug.
> >
> > Here is my finding so far:
> >
> > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > time gap (>2 second )  between collapse_file() finishes, and
> > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > suggestion that adds deny_write_access() to collapse_file() does NOT
> > work.
> >
> > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > If I change the number of write thread in stress_madvise_dso.c to one,
> > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > crash anymore.
> Whether CONFIG_DEBUG_VM is enabled in your vm?
>
> I think the second possibility mentioned above will been found if you
> enable CONFIG_DEBUG_VM:
>
> 1) multiple writers truncate the same page cache concurrently;
> 2) collapse_file rolls back when writer truncates the page cache;
>
> The following log will be print after enable CONFIG_DEBUG_VM:
>
> [22216.789904]  do_idle+0xb4/0x104
> [22216.789906]  cpu_startup_entry+0x34/0x9c
> [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> 0.0.0 02/06/2015
> [22216.790553]  secondary_start_kernel+0x104/0x180
> [22216.790778] Call trace:
> [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> [22216.791662]  dump_backtrace+0x0/0x1ec
> [22216.791664]  show_stack+0x24/0x30
> [22216.791956] ---[ end trace dc769a61c1af087b ]---
> [22216.792295]  dump_stack+0xd0/0x128
> [22216.792299]  bad_page+0xe4/0x110
> [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> in interrupt
> [22216.792937]  check_free_page_bad+0x84/0x90
> [22216.792940]  free_pcp_prepare+0x1fc/0x21c
> [22216.793253] SMP: stopping secondary CPUs
> [22216.793525]  free_unref_page+0x2c/0xec
> [22216.805537]  __put_page+0x60/0x70
> [22216.805931]  collapse_file+0xdc8/0x12f0
> [22216.806385]  khugepaged_scan_file+0x2dc/0x37c
> [22216.806900]  khugepaged_scan_mm_slot+0x2e0/0x380
> [22216.807450]  khugepaged_do_scan+0x2dc/0x2fc
> [22216.807946]  khugepaged+0x38/0x100
> [22216.808342]  kthread+0x11c/0x120
> [22216.808735] Kernel Offset: disabled
> [22216.809153] CPU features: 0x0040002,62208238
> [22216.809681] Memory Limit: none
> [22216.813477] Starting crashdump kernel...
>
> So I think the race also exists between collapse_file and
> truncate_pagecache.

I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.

Thanks,
Song
Hugh Dickins Sept. 30, 2021, 5:24 a.m. UTC | #21
On Wed, 29 Sep 2021, Song Liu wrote:
> On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
> > On 9/30/21 7:41 AM, Song Liu wrote:
> > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >>
> > > [...]
> > >>> Now, I am able to crash the system on
> > >>>      find_lock_entries () {
> > >>>       ...
> > >>>         VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > >>>      }
> > >>> I guess it is related. I will test more.
> > >>
> > >> That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
> > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.

I think Andrew has held back from sending it on to Linus because I pointed
out that the example Matthew cited (shmem and swap cache) was wrong, and
could not explain it: we wanted to understand what syzbot had actually
hit before sending on.

Would this bug be a good explanation for it?

In the meantime, independently, I was looking at fuse_try_move_page(),
which appears to do the old splice thievery that got disabled from splice
itself, stealing a page from one mapping to put into another.  I suspect
that could result in a page->index != xas.xa_index too (outside page lock).

> > >>
> > >> +++ b/mm/filemap.c
> > >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> > >>                  if (!xa_is_value(page)) {
> > >>                          if (page->index < start)
> > >>                                  goto put;
> > >> -                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > >>                          if (page->index + thp_nr_pages(page) - 1 > end)
> > >>                                  goto put;
> > >>                          if (!trylock_page(page))
> > >
> > > Yes, after removing this line, I am able to see the same bug.
> > >
> > > Here is my finding so far:
> > >
> > > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > > time gap (>2 second )  between collapse_file() finishes, and
> > > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > > suggestion that adds deny_write_access() to collapse_file() does NOT
> > > work.
> > >
> > > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > > If I change the number of write thread in stress_madvise_dso.c to one,
> > > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > > crash anymore.
> > Whether CONFIG_DEBUG_VM is enabled in your vm?
> >
> > I think the second possibility mentioned above will been found if you
> > enable CONFIG_DEBUG_VM:
> >
> > 1) multiple writers truncate the same page cache concurrently;
> > 2) collapse_file rolls back when writer truncates the page cache;
> >
> > The following log will be print after enable CONFIG_DEBUG_VM:
> >
> > [22216.789904]  do_idle+0xb4/0x104
> > [22216.789906]  cpu_startup_entry+0x34/0x9c
> > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> > 0.0.0 02/06/2015
> > [22216.790553]  secondary_start_kernel+0x104/0x180
> > [22216.790778] Call trace:
> > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> > [22216.791662]  dump_backtrace+0x0/0x1ec
> > [22216.791664]  show_stack+0x24/0x30
> > [22216.791956] ---[ end trace dc769a61c1af087b ]---
> > [22216.792295]  dump_stack+0xd0/0x128
> > [22216.792299]  bad_page+0xe4/0x110
> > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > in interrupt
> > [22216.792937]  check_free_page_bad+0x84/0x90
> > [22216.792940]  free_pcp_prepare+0x1fc/0x21c
> > [22216.793253] SMP: stopping secondary CPUs
> > [22216.793525]  free_unref_page+0x2c/0xec
> > [22216.805537]  __put_page+0x60/0x70
> > [22216.805931]  collapse_file+0xdc8/0x12f0
> > [22216.806385]  khugepaged_scan_file+0x2dc/0x37c
> > [22216.806900]  khugepaged_scan_mm_slot+0x2e0/0x380
> > [22216.807450]  khugepaged_do_scan+0x2dc/0x2fc
> > [22216.807946]  khugepaged+0x38/0x100
> > [22216.808342]  kthread+0x11c/0x120
> > [22216.808735] Kernel Offset: disabled
> > [22216.809153] CPU features: 0x0040002,62208238
> > [22216.809681] Memory Limit: none
> > [22216.813477] Starting crashdump kernel...
> >
> > So I think the race also exists between collapse_file and
> > truncate_pagecache.
> 
> I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.

Sorry, it's taken me a long time to get into this bug(s).

I haven't tried to reproduce it, but I do think that Rongwei will
be proved right.

In doing a lockless lookup of the page cache, we tend to imagine
that the THP head page will be encountered first, and the special
treatment for the head will do the right thing to skip the tails.

But when racing against collapse_file()'s rewind, I agree with
Rongwei that it is possible for truncation (or page cache cleanup
in this instance) to collect a pagevec which starts with a PageTail
some way into the compound page.

shmem_undo_range() (which shmem uses rather than truncate_pagecache())
would not call truncate_inode_page() on a THP tail: if it encounters a
tail, it will try to split the THP, and not call truncate_inode_page()
if it cannot.  Unless I'm inventing the memory, I now think that I did
encounter this race between truncate and collapse on huge shmem, and
had to re-craft my shmem_punch_compound() to handle it correctly.

If it is just a matter of collapse_file() rewind, I suppose we could
reverse the direction in which that proceeds; but I'm not convinced
that would be enough, and don't think we need such a "big" change.

Aside from the above page->index mischeck in find_lock_entries(),
I now think this bug needs nothing more than simply removing the
VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().

(You could say move its page->mapping check before its PageTail
check; but the PageTail check would then never catch anything.)

Rongwei's patch went in the right direction, but why add N lines
if deleting one is good?  And for a while I thought that idea of
using filemap_invalidate_lock() was very good; but now think
the page lock we already have is better, and solve both races.

Hugh
Matthew Wilcox (Oracle) Sept. 30, 2021, 3:28 p.m. UTC | #22
On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> On Wed, 29 Sep 2021, Song Liu wrote:
> > On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> > > On 9/30/21 7:41 AM, Song Liu wrote:
> > > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >>
> > > > [...]
> > > >>> Now, I am able to crash the system on
> > > >>>      find_lock_entries () {
> > > >>>       ...
> > > >>>         VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > > >>>      }
> > > >>> I guess it is related. I will test more.
> > > >>
> > > >> That's a bogus VM_BUG_ON.  I have a patch in my tree to delete it.
> > > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
> 
> I think Andrew has held back from sending it on to Linus because I pointed
> out that the example Matthew cited (shmem and swap cache) was wrong, and
> could not explain it: we wanted to understand what syzbot had actually
> hit before sending on.
> 
> Would this bug be a good explanation for it?

I don't think so?  Even if that specific example is not what's happening,
the general principle is that you can't verify page->index until you're
holding the page lock.  So the VM_BUG_ON is definitely in the wrong place.

> In the meantime, independently, I was looking at fuse_try_move_page(),
> which appears to do the old splice thievery that got disabled from splice
> itself, stealing a page from one mapping to put into another.  I suspect
> that could result in a page->index != xas.xa_index too (outside page lock).

I think there are other examples too where page->index gets changed ...
they're not springing to mind right now, but I have some other intricate
details occupying that bit of my brain at the moment.

> > > I think the second possibility mentioned above will been found if you
> > > enable CONFIG_DEBUG_VM:
> > >
> > > 1) multiple writers truncate the same page cache concurrently;
> > > 2) collapse_file rolls back when writer truncates the page cache;
> > >
> > > The following log will be print after enable CONFIG_DEBUG_VM:
> > >
> > > [22216.789904]  do_idle+0xb4/0x104
> > > [22216.789906]  cpu_startup_entry+0x34/0x9c
> > > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> > > 0.0.0 02/06/2015
> > > [22216.790553]  secondary_start_kernel+0x104/0x180
> > > [22216.790778] Call trace:
> > > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> > > [22216.791662]  dump_backtrace+0x0/0x1ec
> > > [22216.791664]  show_stack+0x24/0x30
> > > [22216.791956] ---[ end trace dc769a61c1af087b ]---
> > > [22216.792295]  dump_stack+0xd0/0x128
> > > [22216.792299]  bad_page+0xe4/0x110
> > > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > in interrupt
> > > [22216.792937]  check_free_page_bad+0x84/0x90
> > > [22216.792940]  free_pcp_prepare+0x1fc/0x21c
> > > [22216.793253] SMP: stopping secondary CPUs
> > > [22216.793525]  free_unref_page+0x2c/0xec
> > > [22216.805537]  __put_page+0x60/0x70
> > > [22216.805931]  collapse_file+0xdc8/0x12f0
> > > [22216.806385]  khugepaged_scan_file+0x2dc/0x37c
> > > [22216.806900]  khugepaged_scan_mm_slot+0x2e0/0x380
> > > [22216.807450]  khugepaged_do_scan+0x2dc/0x2fc
> > > [22216.807946]  khugepaged+0x38/0x100
> > > [22216.808342]  kthread+0x11c/0x120
> > > [22216.808735] Kernel Offset: disabled
> > > [22216.809153] CPU features: 0x0040002,62208238
> > > [22216.809681] Memory Limit: none
> > > [22216.813477] Starting crashdump kernel...
> > >
> > > So I think the race also exists between collapse_file and
> > > truncate_pagecache.
> > 
> > I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.
> 
> Sorry, it's taken me a long time to get into this bug(s).
> 
> I haven't tried to reproduce it, but I do think that Rongwei will
> be proved right.
> 
> In doing a lockless lookup of the page cache, we tend to imagine
> that the THP head page will be encountered first, and the special
> treatment for the head will do the right thing to skip the tails.
> 
> But when racing against collapse_file()'s rewind, I agree with
> Rongwei that it is possible for truncation (or page cache cleanup
> in this instance) to collect a pagevec which starts with a PageTail
> some way into the compound page.
> 
> shmem_undo_range() (which shmem uses rather than truncate_pagecache())
> would not call truncate_inode_page() on a THP tail: if it encounters a
> tail, it will try to split the THP, and not call truncate_inode_page()
> if it cannot.  Unless I'm inventing the memory, I now think that I did
> encounter this race between truncate and collapse on huge shmem, and
> had to re-craft my shmem_punch_compound() to handle it correctly.
> 
> If it is just a matter of collapse_file() rewind, I suppose we could
> reverse the direction in which that proceeds; but I'm not convinced
> that would be enough, and don't think we need such a "big" change.
> 
> Aside from the above page->index mischeck in find_lock_entries(),
> I now think this bug needs nothing more than simply removing the
> VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().

I don't think that's right.  This bug was also observed when calling
truncate(), so there's clearly a situation where two concurrent calls
to truncate_pagecache() leaves a THP in the cache.

Unless there's a case where mapping->nr_thps gets corrupted, so the
open() thinks there's no THPs in the page cache, when there's actually
one or more?  That might be a problem that we're solving by locking
around the truncate_pagecache() call?

> (You could say move its page->mapping check before its PageTail
> check; but the PageTail check would then never catch anything.)
> 
> Rongwei's patch went in the right direction, but why add N lines
> if deleting one is good?  And for a while I thought that idea of
> using filemap_invalidate_lock() was very good; but now think
> the page lock we already have is better, and solve both races.
> 
> Hugh
Hugh Dickins Sept. 30, 2021, 4:49 p.m. UTC | #23
On Thu, 30 Sep 2021, Matthew Wilcox wrote:
> On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> > 
> > Aside from the above page->index mischeck in find_lock_entries(),
> > I now think this bug needs nothing more than simply removing the
> > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
> 
> I don't think that's right.  This bug was also observed when calling
> truncate(), so there's clearly a situation where two concurrent calls
> to truncate_pagecache() leaves a THP in the cache.

I assume you're thinking of one of the fuzzer blkdev ones:
https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
or
https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/

I haven't started on those ones yet: yes, I imagine one or both of those
will need a further fix (S_ISREG() check somewhere if we're lucky; but
could well be nastier); but for the bug in this thread, I expect
removing the VM_BUG_ON_PAGE(PageTail) to be enough.

If you're thinking of something else, please send a link if you can - thanks.

Hugh
Yang Shi Sept. 30, 2021, 5:39 p.m. UTC | #24
On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 30 Sep 2021, Matthew Wilcox wrote:
> > On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> > >
> > > Aside from the above page->index mischeck in find_lock_entries(),
> > > I now think this bug needs nothing more than simply removing the
> > > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
> >
> > I don't think that's right.  This bug was also observed when calling
> > truncate(), so there's clearly a situation where two concurrent calls
> > to truncate_pagecache() leaves a THP in the cache.
>
> I assume you're thinking of one of the fuzzer blkdev ones:
> https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> or
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
>
> I haven't started on those ones yet: yes, I imagine one or both of those
> will need a further fix (S_ISREG() check somewhere if we're lucky; but
> could well be nastier); but for the bug in this thread, I expect

Makes sense to me. We should be able to check S_ISREG() in khugepaged,
if it is not a regular file, just bail out. Sounds not that nasty to
me AFAIU.

> removing the VM_BUG_ON_PAGE(PageTail) to be enough.
>
> If you're thinking of something else, please send a link if you can - thanks.
>
> Hugh
>
Rongwei Wang Oct. 2, 2021, 2:22 a.m. UTC | #25
On 10/1/21 12:49 AM, Hugh Dickins wrote:
> On Thu, 30 Sep 2021, Matthew Wilcox wrote:
>> On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
>>>
>>> Aside from the above page->index mischeck in find_lock_entries(),
>>> I now think this bug needs nothing more than simply removing the
>>> VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
>>
>> I don't think that's right.  This bug was also observed when calling
>> truncate(), so there's clearly a situation where two concurrent calls
>> to truncate_pagecache() leaves a THP in the cache.
> 
> I assume you're thinking of one of the fuzzer blkdev ones:
> https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> or
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> 
> I haven't started on those ones yet: yes, I imagine one or both of those
> will need a further fix (S_ISREG() check somewhere if we're lucky; but
> could well be nastier); but for the bug in this thread, I expect
> removing the VM_BUG_ON_PAGE(PageTail) to be enough.
Thanks for your advices!

I'm so sorry that delay the test due to my recent vacation. I plan to 
start further test tomorrow. I think removing the 
VM_BUG_ON_PAGE(PageTail) is a good idea, but also think using 
filemap_invalidate_lock is safer and necessary. And of course, this is 
just my own view! So, now, I tend to use filemap_invalidate_lock and
either check page mapping again or remove VM_BUG_ON_PAGE(PageTail).

Anyway, I will run more tests and then give feedback.

Thanks!
> 
> If you're thinking of something else, please send a link if you can - thanks.
> 
> Hugh
>
Matthew Wilcox (Oracle) Oct. 2, 2021, 5:08 p.m. UTC | #26
On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote:
> > I assume you're thinking of one of the fuzzer blkdev ones:
> > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > or
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> >
> > I haven't started on those ones yet: yes, I imagine one or both of those
> > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > could well be nastier); but for the bug in this thread, I expect
> 
> Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> if it is not a regular file, just bail out. Sounds not that nasty to
> me AFAIU.

I don't see why we should have an S_ISREG() check.  I agree it's not the
intended usecase, but it ought to work fine.  Unless there's something
I'm missing?
Rongwei Wang Oct. 4, 2021, 5:26 p.m. UTC | #27
Hi,
I have run our cases these two days to stress test new Patch #1. The new 
Patch #1 mainly add filemap_invalidate_{un}lock before and after 
truncate_pagecache(), basing on original Patch #1. And the crash has not 
happened.

Now, I keep the original Patch #1, then adding the code below which 
suggested by liu song (I'm not sure which one I should add in the next 
version, Suggested-by or Signed-off-by? If you know, please remind me).

-               if (filemap_nr_thps(inode->i_mapping))
+               if (filemap_nr_thps(inode->i_mapping)) {
+                       filemap_invalidate_lock(inode->i_mapping);
                         truncate_pagecache(inode, 0);
+                       filemap_invalidate_unlock(inode->i_mapping);
+               }

And the reason for keeping the original Patch #1 is mainly to fix the 
race between collapse_file and truncate_pagecache. It seems necessary. 
Despite the two-day test, I did not reproduce this race any more.

In addition, I also test the below method:

diff --git a/mm/truncate.c b/mm/truncate.c
index 3f47190f98a8..33604e4ce60a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space 
*mapping, struct page *page)

  int truncate_inode_page(struct address_space *mapping, struct page *page)
  {
-       VM_BUG_ON_PAGE(PageTail(page), page);
-
         if (page->mapping != mapping)
                 return -EIO;

I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
the test results show that only removing this VM_BUG_ON_PAGE(PageTail) 
has no effect. So, I still keep the original Patch #1 to fix one race.

I plan to send Patch v3 after receiving your reply.

Thanks!

On 9/30/21 8:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
>>> The issue is NOT caused by concurrent khugepaged:collapse_file() and
>>> truncate_pagecache(inode, 0). With some printks, we can see a clear
>>> time gap (>2 second )  between collapse_file() finishes, and
>>> truncate_pagecache() (which crashes soon). Therefore, my earlier
>>> suggestion that adds deny_write_access() to collapse_file() does NOT
>>> work.
>>>
>>> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
>>> If I change the number of write thread in stress_madvise_dso.c to one,
>>> (IOW, one thread_read and one thread_write), I cannot reproduce the
>>> crash anymore.
>>>
>>> I think this means we cannot fix this issue in collapse_file(), because it
>>> finishes long before the crash.
>>
>> Ah!  So are we missing one or more of these locks:
>>
>>          inode_lock(inode);
>>          filemap_invalidate_lock(mapping);
>>
>> in the open path?
> 
> The following fixes the crash in my test. But I am not sure whether this is the
> best fix.
> 
> Rongwei, could you please run more tests on it?
> 
> Thanks,
> Song
> 
> 
> diff --git i/fs/open.c w/fs/open.c
> index daa324606a41f..d13c4668b2e53 100644
> --- i/fs/open.c
> +++ w/fs/open.c
> @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
>                   * of THPs into the page cache will fail.
>                   */
>                  smp_mb();
> -               if (filemap_nr_thps(inode->i_mapping))
> +               if (filemap_nr_thps(inode->i_mapping)) {
> +                       filemap_invalidate_lock(inode->i_mapping);
>                          truncate_pagecache(inode, 0);
> +                       filemap_invalidate_unlock(inode->i_mapping);
> +               }
>          }
> 
>          return 0;
>
Yang Shi Oct. 4, 2021, 6:28 p.m. UTC | #28
On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote:
> > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > or
> > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > >
> > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > could well be nastier); but for the bug in this thread, I expect
> >
> > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > if it is not a regular file, just bail out. Sounds not that nasty to
> > me AFAIU.
>
> I don't see why we should have an S_ISREG() check.  I agree it's not the
> intended usecase, but it ought to work fine.  Unless there's something
> I'm missing?

Check out this bug report:
https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
and the patch from me:
https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/

I don't think we handle buffers correctly for file THP, right? My
patch is ad hoc, so I thought Hugh's suggestion makes some sense to
me. Why do we have THP collapsed for unintended usecase in the first
place?
Matthew Wilcox (Oracle) Oct. 4, 2021, 7:05 p.m. UTC | #29
On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote:
> Hi,
> I have run our cases these two days to stress test new Patch #1. The new
> Patch #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.

You shouldn't need most of patch 1.

In fact, the only two patches you should need would be this:

+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
                if (!xa_is_value(page)) {
                        if (page->index < start)
                                goto put;
-                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
                        if (page->index + thp_nr_pages(page) - 1 > end)
                                goto put;
                        if (!trylock_page(page))

(already in Andrew's tree) and:

> -               if (filemap_nr_thps(inode->i_mapping))
> +               if (filemap_nr_thps(inode->i_mapping)) {
> +                       filemap_invalidate_lock(inode->i_mapping);
>                         truncate_pagecache(inode, 0);
> +                       filemap_invalidate_unlock(inode->i_mapping);
> +               }

If you can still hit a bug with just those two patches, then something
else is going wrong, and needs to be investigated.
Matthew Wilcox (Oracle) Oct. 4, 2021, 7:31 p.m. UTC | #30
On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote:
> > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > > or
> > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > > >
> > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > could well be nastier); but for the bug in this thread, I expect
> > >
> > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > me AFAIU.
> >
> > I don't see why we should have an S_ISREG() check.  I agree it's not the
> > intended usecase, but it ought to work fine.  Unless there's something
> > I'm missing?
> 
> Check out this bug report:
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> and the patch from me:
> https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/
> 
> I don't think we handle buffers correctly for file THP, right? My
> patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> me. Why do we have THP collapsed for unintended usecase in the first
> place?

OK, I've done some more digging.  I think what's going on with this
report is userspace opens the block device RO, causes the page cache to
be loaded with data, then khugepaged comes in and creates THPs.
What confuses me is that these THPs have private data attached to them.
I don't know how that happens.  If it's block device specific, then
yes, something like your S_ISREG() idea should work fine.  Otherwise,
we might need to track down another problem.

Hao Sun, can you try this patch and see what comes out of it?

diff --git a/fs/buffer.c b/fs/buffer.c
index c615387aedca..fefe05a9973d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -872,6 +872,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head)
 		bh = bh->b_this_page;
 	} while (bh);
 	tail->b_this_page = head;
+	VM_BUG_ON_PAGE(PageCompound(page), page);
 	attach_page_private(page, head);
 }
 
@@ -1577,6 +1578,7 @@ void create_empty_buffers(struct page *page,
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
+	VM_BUG_ON_PAGE(PageCompound(page), page);
 	attach_page_private(page, head);
 	spin_unlock(&page->mapping->private_lock);
 }
@@ -2565,6 +2567,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
 			bh->b_this_page = head;
 		bh = bh->b_this_page;
 	} while (bh != head);
+	VM_BUG_ON_PAGE(PageCompound(page), page);
 	attach_page_private(page, head);
 	spin_unlock(&page->mapping->private_lock);
 }
Song Liu Oct. 4, 2021, 8:26 p.m. UTC | #31
On Mon, Oct 4, 2021 at 10:26 AM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
> Hi,
> I have run our cases these two days to stress test new Patch #1. The new
> Patch #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.
>
> Now, I keep the original Patch #1, then adding the code below which
> suggested by liu song (I'm not sure which one I should add in the next
> version, Suggested-by or Signed-off-by? If you know, please remind me).
>
> -               if (filemap_nr_thps(inode->i_mapping))
> +               if (filemap_nr_thps(inode->i_mapping)) {
> +                       filemap_invalidate_lock(inode->i_mapping);
>                          truncate_pagecache(inode, 0);
> +                       filemap_invalidate_unlock(inode->i_mapping);
> +               }

It is mostly suggested by Matthew. If the patch goes that way, you can add

Tested-by: Song Liu <song@kernel.org>
Rongwei Wang Oct. 5, 2021, 1:58 a.m. UTC | #32
On 10/5/21 3:05 AM, Matthew Wilcox wrote:
> On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote:
>> Hi,
>> I have run our cases these two days to stress test new Patch #1. The new
>> Patch #1 mainly add filemap_invalidate_{un}lock before and after
>> truncate_pagecache(), basing on original Patch #1. And the crash has not
>> happened.
> 
> You shouldn't need most of patch 1.
> 
> In fact, the only two patches you should need would be this:
> 
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
>                  if (!xa_is_value(page)) {
>                          if (page->index < start)
>                                  goto put;
> -                       VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>                          if (page->index + thp_nr_pages(page) - 1 > end)
>                                  goto put;
>                          if (!trylock_page(page))
> 
> (already in Andrew's tree) and:
> 
>> -               if (filemap_nr_thps(inode->i_mapping))
>> +               if (filemap_nr_thps(inode->i_mapping)) {
>> +                       filemap_invalidate_lock(inode->i_mapping);
>>                          truncate_pagecache(inode, 0);
>> +                       filemap_invalidate_unlock(inode->i_mapping);
>> +               }
> 
> If you can still hit a bug with just those two patches, then something
> else is going wrong, and needs to be investigated.
OK, I see what your mean. I will send Patch v3 and only keep 
filemap_invalidate_{un}lock in Patch #1.

Thanks!
>
Hugh Dickins Oct. 5, 2021, 2:26 a.m. UTC | #33
On Mon, 4 Oct 2021, Matthew Wilcox wrote:
> On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote:
> > > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > > > or
> > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > > > >
> > > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > > could well be nastier); but for the bug in this thread, I expect
> > > >
> > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > > me AFAIU.
> > >
> > > I don't see why we should have an S_ISREG() check.  I agree it's not the
> > > intended usecase, but it ought to work fine.  Unless there's something
> > > I'm missing?
> > 
> > Check out this bug report:
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > and the patch from me:
> > https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/
> > 
> > I don't think we handle buffers correctly for file THP, right? My
> > patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> > me. Why do we have THP collapsed for unintended usecase in the first
> > place?
> 
> OK, I've done some more digging.  I think what's going on with this
> report is userspace opens the block device RO, causes the page cache to
> be loaded with data, then khugepaged comes in and creates THPs.

Yes.

> What confuses me is that these THPs have private data attached to them.
> I don't know how that happens.  If it's block device specific, then
> yes, something like your S_ISREG() idea should work fine.  Otherwise,
> we might need to track down another problem.

Agreed, the file THP is created without PagePrivate, so the puzzle was
why the read-only cached page would later become page_has_private().

The C repro showed that it uses (a BTRFS_IOC_ADD_DEV ioctl which might
not be relevant and) a BLKRRPART ioctl 0x125f: I didn't follow BLKRRPART
all the way down, but imagine it has to attach buffer-heads to re-read
the partition table.  Which would explain it.

Aside from that particular ioctl, it seems a good idea to insist on
S_ISREG just to shrink the attack surface: as Yang Shi says, executable
THP on block device was never an intended usecase, and not a usecase
anyone is likely to miss! And that fuzzer appears to delight in
tormenting /dev/nullb0, so let's just seal off that avenue.

You're right to have some doubt, as to whether there might be other
ways for buffer-heads to get attached, even on a read-only regular
file; but no way has sprung to my mind, and READ_ONLY_THP_FOR_FS has
survived well in its intended usage: so I think we should proceed on
the assumption that no further bugs remain - then fix them when found.

I wasn't able to reproduce the problem with the repro, would need to
waste many hours to do so.  But here's the untested S_ISREG patch I
came up with.  Sorry, I've mixed something else in: in moving the
alignment part to clarify the conditions, I was alarmed to see that
shmem with !shmem_huge_enabled was falling through to THP_FOR_FS to
give unexpected huge pages: fixed that, though later found there's
a separate shmem_huge_enabled() check which should exclude it.

--- 5.15-rc4/mm/khugepaged.c	2021-09-12 17:39:21.943438422 -0700
+++ linux/khugepaged.c	2021-10-03 20:41:13.194822795 -0700
@@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm
 	if (!transhuge_vma_enabled(vma, vm_flags))
 		return false;
 
+	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+					vma->vm_pgoff, HPAGE_PMD_NR))
+		return false;
+
 	/* Enabled via shmem mount options or sysfs settings. */
-	if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
-	}
+	if (shmem_file(vma->vm_file))
+		return shmem_huge_enabled(vma);
 
 	/* THP settings require madvise. */
 	if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
 		return false;
 
 	/* Read-only file mappings need to be aligned for THP to work. */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
-	    !inode_is_open_for_write(vma->vm_file->f_inode) &&
-	    (vm_flags & VM_EXEC)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
+	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+	    (vm_flags & VM_EXEC) && vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_inode;
+
+	        return !inode_is_open_for_write(inode) &&
+			S_ISREG(inode->i_mode);
 	}
 
 	if (!vma->anon_vma || vma->vm_ops)
Hugh Dickins Oct. 5, 2021, 2:58 a.m. UTC | #34
On Tue, 5 Oct 2021, Rongwei Wang wrote:

> Hi,
> I have run our cases these two days to stress test new Patch #1. The new Patch
> #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.
> 
> Now, I keep the original Patch #1, then adding the code below which suggested
> by liu song (I'm not sure which one I should add in the next version,
> Suggested-by or Signed-off-by? If you know, please remind me).
> 
> -               if (filemap_nr_thps(inode->i_mapping))
> +               if (filemap_nr_thps(inode->i_mapping)) {
> +                       filemap_invalidate_lock(inode->i_mapping);
>                         truncate_pagecache(inode, 0);
> +                       filemap_invalidate_unlock(inode->i_mapping);
> +               }

I won't NAK that patch; but I still believe it's unnecessary, and don't
see how it protects against all the races (collapse_file() does not use
that lock, whereas collapse_file() does use page lock).  And if you're
hoping to fix 5.10, then you will have to backport those invalidate_lock
patches there too (they're really intended to protect hole-punching).

> 
> And the reason for keeping the original Patch #1 is mainly to fix the race
> between collapse_file and truncate_pagecache. It seems necessary. Despite the
> two-day test, I did not reproduce this race any more.
> 
> In addition, I also test the below method:
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 3f47190f98a8..33604e4ce60a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
> struct page *page)
> 
>  int truncate_inode_page(struct address_space *mapping, struct page *page)
>  {
> -       VM_BUG_ON_PAGE(PageTail(page), page);
> -
>         if (page->mapping != mapping)
>                 return -EIO;
> 
> I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
> the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
> effect. So, I still keep the original Patch #1 to fix one race.

Yes, that's exactly what I meant, and thank you for intending to try it.

But if that patch had "no effect", then I think you were not running the
kernel with that patch applied: because it deletes the BUG on line 213
of mm/truncate.c, which is what you reported in the first mail!

Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
something else?  I've been looking at 5.15-rc.

But I wasn't proposing to delete it merely to hide the BUG: as I hope
I explained, we could move it below the page->mapping check, but it
wouldn't really be of any value there since tails have NULL page->mapping
anyway (well, I didn't check first and second tails, maybe mapping gets
reused for some compound page field in those). I was proposing to delete
it because the page->mapping check then weeds out the racy case once
we're holding page lock, without the need for adding anything special.

Hugh
Matthew Wilcox (Oracle) Oct. 5, 2021, 3:07 a.m. UTC | #35
On Mon, Oct 04, 2021 at 07:58:10PM -0700, Hugh Dickins wrote:
> On Tue, 5 Oct 2021, Rongwei Wang wrote:
> 
> > Hi,
> > I have run our cases these two days to stress test new Patch #1. The new Patch
> > #1 mainly add filemap_invalidate_{un}lock before and after
> > truncate_pagecache(), basing on original Patch #1. And the crash has not
> > happened.
> > 
> > Now, I keep the original Patch #1, then adding the code below which suggested
> > by liu song (I'm not sure which one I should add in the next version,
> > Suggested-by or Signed-off-by? If you know, please remind me).
> > 
> > -               if (filemap_nr_thps(inode->i_mapping))
> > +               if (filemap_nr_thps(inode->i_mapping)) {
> > +                       filemap_invalidate_lock(inode->i_mapping);
> >                         truncate_pagecache(inode, 0);
> > +                       filemap_invalidate_unlock(inode->i_mapping);
> > +               }
> 
> I won't NAK that patch; but I still believe it's unnecessary, and don't
> see how it protects against all the races (collapse_file() does not use
> that lock, whereas collapse_file() does use page lock).  And if you're
> hoping to fix 5.10, then you will have to backport those invalidate_lock
> patches there too (they're really intended to protect hole-punching).

I believe all we really need to do is protect against calling
truncate_pagecache() simultaneously to avoid one of the calls seeing
a tail page.  i_mutex would work for this purpose just as well as
filemap_invalidate_lock().  See, for example, ext4_zero_range() which
first takes inode_lock(), then filemap_invalidate_lock() before calling
truncate_pagecache_range().

> > And the reason for keeping the original Patch #1 is mainly to fix the race
> > between collapse_file and truncate_pagecache. It seems necessary. Despite the
> > two-day test, I did not reproduce this race any more.
> > 
> > In addition, I also test the below method:
> > 
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 3f47190f98a8..33604e4ce60a 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
> > struct page *page)
> > 
> >  int truncate_inode_page(struct address_space *mapping, struct page *page)
> >  {
> > -       VM_BUG_ON_PAGE(PageTail(page), page);
> > -
> >         if (page->mapping != mapping)
> >                 return -EIO;
> > 
> > I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
> > the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
> > effect. So, I still keep the original Patch #1 to fix one race.
> 
> Yes, that's exactly what I meant, and thank you for intending to try it.
> 
> But if that patch had "no effect", then I think you were not running the
> kernel with that patch applied: because it deletes the BUG on line 213
> of mm/truncate.c, which is what you reported in the first mail!
> 
> Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
> something else?  I've been looking at 5.15-rc.
> 
> But I wasn't proposing to delete it merely to hide the BUG: as I hope
> I explained, we could move it below the page->mapping check, but it
> wouldn't really be of any value there since tails have NULL page->mapping
> anyway (well, I didn't check first and second tails, maybe mapping gets
> reused for some compound page field in those). I was proposing to delete
> it because the page->mapping check then weeds out the racy case once
> we're holding page lock, without the need for adding anything special.

I think if we remove the race with the above mutex lock then we'll never
see a tail page in this routine.
Rongwei Wang Oct. 5, 2021, 9:03 a.m. UTC | #36
On 10/5/21 10:58 AM, Hugh Dickins wrote:
> On Tue, 5 Oct 2021, Rongwei Wang wrote:
> 
>> Hi,
>> I have run our cases these two days to stress test new Patch #1. The new Patch
>> #1 mainly add filemap_invalidate_{un}lock before and after
>> truncate_pagecache(), basing on original Patch #1. And the crash has not
>> happened.
>>
>> Now, I keep the original Patch #1, then adding the code below which suggested
>> by liu song (I'm not sure which one I should add in the next version,
>> Suggested-by or Signed-off-by? If you know, please remind me).
>>
>> -               if (filemap_nr_thps(inode->i_mapping))
>> +               if (filemap_nr_thps(inode->i_mapping)) {
>> +                       filemap_invalidate_lock(inode->i_mapping);
>>                          truncate_pagecache(inode, 0);
>> +                       filemap_invalidate_unlock(inode->i_mapping);
>> +               }
> 
> I won't NAK that patch; but I still believe it's unnecessary, and don't
> see how it protects against all the races (collapse_file() does not use
> that lock, whereas collapse_file() does use page lock).  And if you're
> hoping to fix 5.10, then you will have to backport those invalidate_lock
> patches there too (they're really intended to protect hole-punching).
> 
>>
>> And the reason for keeping the original Patch #1 is mainly to fix the race
>> between collapse_file and truncate_pagecache. It seems necessary. Despite the
>> two-day test, I did not reproduce this race any more.
>>
>> In addition, I also test the below method:
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 3f47190f98a8..33604e4ce60a 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
>> struct page *page)
>>
>>   int truncate_inode_page(struct address_space *mapping, struct page *page)
>>   {
>> -       VM_BUG_ON_PAGE(PageTail(page), page);
>> -
>>          if (page->mapping != mapping)
>>                  return -EIO;
>>
>> I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
>> the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
>> effect. So, I still keep the original Patch #1 to fix one race.
> 
> Yes, that's exactly what I meant, and thank you for intending to try it.
> 
> But if that patch had "no effect", then I think you were not running the
> kernel with that patch applied: because it deletes the BUG on line 213
> of mm/truncate.c, which is what you reported in the first mail!
> 
> Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
> something else?  I've been looking at 5.15-rc.
Hi, Hugh

I'm sorry the confusing '5.10.46-hugetext+'. I am also look and test at 
5.15-rc.
> 
> But I wasn't proposing to delete it merely to hide the BUG: as I hope
> I explained, we could move it below the page->mapping check, but it
> wouldn't really be of any value there since tails have NULL page->mapping
> anyway (well, I didn't check first and second tails, maybe mapping gets
> reused for some compound page field in those). I was proposing to delete
> it because the page->mapping check then weeds out the racy case once
> we're holding page lock, without the need for adding anything special.
> 
> Hugh
Today, I try again to create some cases to reproduce the race, such as 
ensuring that multiple processes are always executing 
‘truncate_pagecache’ and only mapping 2M DSO. In this way, I try to 
ensure that the target of 'collapse_file' and 'truncate_pagecache' can 
only be the same VMA, to increase the probability of reproducing that 
race. But, I can't reproduce that race any more.

In fact, according to the previous experience, the current number of 
attempts should be able to reproduce that race.

If you have the idea about creating this case, please tell me, and I can 
try again. Or we can solve it when it appears again.

Thanks!
>
diff mbox series

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293..a3af2ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@  unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
 		if (!xa_is_value(page)) {
 			if (page->index < start)
 				goto put;
-			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 			if (page->index + thp_nr_pages(page) - 1 > end)
 				goto put;
 			if (!trylock_page(page))
@@ -2102,6 +2101,12 @@  unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
 				goto unlock;
 			VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
 					page);
+			/*
+			 * We can find and get head page of file THP with
+			 * non-head index. The head page should have already
+			 * be truncated with page->mapping reset to NULL.
+			 */
+			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 		}
 		indices[pvec->nr] = xas.xa_index;
 		if (!pagevec_add(pvec, page))
diff --git a/mm/truncate.c b/mm/truncate.c
index 714eaf1..3f47190 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -319,7 +319,8 @@  void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	while (index < end && find_lock_entries(mapping, index, end - 1,
 			&pvec, indices)) {
-		index = indices[pagevec_count(&pvec) - 1] + 1;
+		index = indices[pagevec_count(&pvec) - 1] +
+			thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices);
 		for (i = 0; i < pagevec_count(&pvec); i++)
 			truncate_cleanup_page(pvec.pages[i]);
@@ -392,6 +393,20 @@  void truncate_inode_pages_range(struct address_space *mapping,
 				continue;
 
 			lock_page(page);
+			/*
+			 * Already truncated? We can find and get subpage
+			 * of file THP, of which the head page is truncated.
+			 *
+			 * In addition, another race will be avoided, where
+			 * collapse_file rolls back when writer truncates the
+			 * page cache.
+			 */
+			if (page_mapping(page) != mapping) {
+				unlock_page(page);
+				/* Restart to make sure all gone */
+				index = start - 1;
+				break;
+			}
 			WARN_ON(page_to_index(page) != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);