diff mbox series

mm/mincore: allow for making sys_mincore() privileged

Message ID nycvar.YFH.7.76.1901051817390.16954@cbobk.fhfr.pm (mailing list archive)
State New, archived
Headers show
Series mm/mincore: allow for making sys_mincore() privileged | expand

Commit Message

Jiri Kosina Jan. 5, 2019, 5:27 p.m. UTC
From: Jiri Kosina <jkosina@suse.cz>

There are possibilities [1] how mincore() could be used as a converyor of 
a sidechannel information about pagecache metadata.

Provide vm.mincore_privileged sysctl, which makes it possible to mincore() 
start returning -EPERM in case it's invoked by a process lacking 
CAP_SYS_ADMIN.

The default behavior stays "mincore() can be used by anybody" in order to 
be conservative with respect to userspace behavior.

[1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
 Documentation/sysctl/vm.txt | 9 +++++++++
 kernel/sysctl.c             | 8 ++++++++
 mm/mincore.c                | 5 +++++
 3 files changed, 22 insertions(+)

Comments

Vlastimil Babka Jan. 5, 2019, 7:14 p.m. UTC | #1
On 5.1.2019 18:27, Jiri Kosina wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> 
> There are possibilities [1] how mincore() could be used as a converyor of 
> a sidechannel information about pagecache metadata.
> 
> Provide vm.mincore_privileged sysctl, which makes it possible to mincore() 
> start returning -EPERM in case it's invoked by a process lacking 
> CAP_SYS_ADMIN.

Haven't checked the details yet, but wouldn't it be safe if anonymous private
mincore() kept working, and restrictions were applied only to page cache?

> The default behavior stays "mincore() can be used by anybody" in order to 
> be conservative with respect to userspace behavior.

What if we lied instead of returned -EPERM, to not break userspace so obviously?
I guess false positive would be the safer lie?

> [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/
> 
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> ---
>  Documentation/sysctl/vm.txt | 9 +++++++++
>  kernel/sysctl.c             | 8 ++++++++
>  mm/mincore.c                | 5 +++++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 187ce4f599a2..afb8635e925e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm:
>  - min_free_kbytes
>  - min_slab_ratio
>  - min_unmapped_ratio
> +- mincore_privileged
>  - mmap_min_addr
>  - mmap_rnd_bits
>  - mmap_rnd_compat_bits
> @@ -485,6 +486,14 @@ files and similar are considered.
>  The default is 1 percent.
>  
>  ==============================================================
> +mincore_privileged:
> +
> +mincore() could be potentially used to mount a side-channel attack against
> +pagecache metadata. This sysctl provides system administrators means to
> +make it available only to processess that own CAP_SYS_ADMIN capability.
> +
> +The default is 0, which means mincore() can be used without restrictions.
> +==============================================================
>  
>  mmap_min_addr
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1825f712e73b..f03cb07c8dd4 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
>  #ifndef CONFIG_MMU
>  extern int sysctl_nr_trim_pages;
>  #endif
> +extern int sysctl_mincore_privileged;
>  
>  /* Constants used for minimum and  maximum */
>  #ifdef CONFIG_LOCKUP_DETECTOR
> @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = {
>  		.extra2		= (void *)&mmap_rnd_compat_bits_max,
>  	},
>  #endif
> +	{
> +		.procname	= "mincore_privileged",
> +		.data		= &sysctl_mincore_privileged,
> +		.maxlen		= sizeof(sysctl_mincore_privileged),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  	{ }
>  };
>  
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 218099b5ed31..77d4928cdfaa 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,6 +21,8 @@
>  #include <linux/uaccess.h>
>  #include <asm/pgtable.h>
>  
> +int sysctl_mincore_privileged;
> +
>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>  			unsigned long end, struct mm_walk *walk)
>  {
> @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
>  	unsigned long pages;
>  	unsigned char *tmp;
>  
> +	if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
>  	/* Check the start address: needs to be page-aligned.. */
>  	if (start & ~PAGE_MASK)
>  		return -EINVAL;
>
Jiri Kosina Jan. 5, 2019, 7:24 p.m. UTC | #2
On Sat, 5 Jan 2019, Vlastimil Babka wrote:

> > There are possibilities [1] how mincore() could be used as a converyor of 
> > a sidechannel information about pagecache metadata.
> > 
> > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() 
> > start returning -EPERM in case it's invoked by a process lacking 
> > CAP_SYS_ADMIN.
> 
> Haven't checked the details yet, but wouldn't it be safe if anonymous private
> mincore() kept working, and restrictions were applied only to page cache?

I was considering that, but then I decided not to do so, as that'd make 
the interface even more confusing and semantics non-obvious in the 
'privileged' case.

> > The default behavior stays "mincore() can be used by anybody" in order to 
> > be conservative with respect to userspace behavior.
> 
> What if we lied instead of returned -EPERM, to not break userspace so 
> obviously? I guess false positive would be the safer lie?

So your proposal basically would be

if (privileged && !CAP_SYS_ADMIN)
	if (pagecache)
		return false;
	else
		return do_mincore()

right ?

I think userspace would hate us for that semantics, but on the other hand 
I can sort of understand the 'mincore() is racy anyway, so what' argument, 
if that's what you are suggesting.

But then, I have no idea what userspace is using mincore() for. 
https://codesearch.debian.net/search?q=mincore might provide some insight 
I guess (thanks Matthew).
Vlastimil Babka Jan. 5, 2019, 7:38 p.m. UTC | #3
On 5.1.2019 20:24, Jiri Kosina wrote:
> On Sat, 5 Jan 2019, Vlastimil Babka wrote:
> 
>>> There are possibilities [1] how mincore() could be used as a converyor of 
>>> a sidechannel information about pagecache metadata.
>>>
>>> Provide vm.mincore_privileged sysctl, which makes it possible to mincore() 
>>> start returning -EPERM in case it's invoked by a process lacking 
>>> CAP_SYS_ADMIN.
>>
>> Haven't checked the details yet, but wouldn't it be safe if anonymous private
>> mincore() kept working, and restrictions were applied only to page cache?
> 
> I was considering that, but then I decided not to do so, as that'd make 
> the interface even more confusing and semantics non-obvious in the 
> 'privileged' case.
> 
>>> The default behavior stays "mincore() can be used by anybody" in order to 
>>> be conservative with respect to userspace behavior.
>>
>> What if we lied instead of returned -EPERM, to not break userspace so 
>> obviously? I guess false positive would be the safer lie?
> 
> So your proposal basically would be
> 
> if (privileged && !CAP_SYS_ADMIN)
> 	if (pagecache)
> 		return false;

I was thinking about "return true" here, assuming that userspace generally wants
to ensure itself there won't be page faults when it starts doing something
critical, and if it sees a "false" it will try to do some kind of prefaulting,
possibly in a loop. There might be somebody trying to make sure something is out
of pagecache (it wants to see "false"), but can't think of anything except
benchmarks?

> 	else
> 		return do_mincore()
> 
> right ?
> 
> I think userspace would hate us for that semantics, but on the other hand 
> I can sort of understand the 'mincore() is racy anyway, so what' argument, 
> if that's what you are suggesting.
> 
> But then, I have no idea what userspace is using mincore() for. 
> https://codesearch.debian.net/search?q=mincore might provide some insight 
> I guess (thanks Matthew).
>
kernel test robot Jan. 5, 2019, 7:44 p.m. UTC | #4
Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 8.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.1.0 make.cross ARCH=c6x 

All errors (new ones prefixed by >>):

>> kernel/sysctl.o:(.fardata+0x6a0): undefined reference to `sysctl_mincore_privileged'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Linus Torvalds Jan. 5, 2019, 7:46 p.m. UTC | #5
On Sat, Jan 5, 2019 at 9:27 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> From: Jiri Kosina <jkosina@suse.cz>
>
> There are possibilities [1] how mincore() could be used as a converyor of
> a sidechannel information about pagecache metadata.

Can we please just limit it to vma's that are either anonymous, or map
a file that the user actually owns?

Then the capability check could be for "override the file owner check"
instead, which makes tons of sense.

No new sysctl's for something like this, please.

             Linus
kernel test robot Jan. 5, 2019, 7:56 p.m. UTC | #6
Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: microblaze-nommu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 8.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.1.0 make.cross ARCH=microblaze 

All errors (new ones prefixed by >>):

>> kernel/sysctl.o:(.data+0x67c): undefined reference to `sysctl_mincore_privileged'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Jiri Kosina Jan. 5, 2019, 8:12 p.m. UTC | #7
On Sat, 5 Jan 2019, Linus Torvalds wrote:

> > There are possibilities [1] how mincore() could be used as a converyor of
> > a sidechannel information about pagecache metadata.
> 
> Can we please just limit it to vma's that are either anonymous, or map
> a file that the user actually owns?
> 
> Then the capability check could be for "override the file owner check"
> instead, which makes tons of sense.

Makes sense. 

I am still not completely sure what to return in such cases though; we can 
either blatantly lie and always pretend that the pages are resident (to 
avoid calling process entering some prefaulting mode), or return -ENOMEM 
for mappings of files that don't belong to the user (in case it's not 
CAP_SYS_ADMIN one).
Linus Torvalds Jan. 5, 2019, 8:13 p.m. UTC | #8
On Sat, Jan 5, 2019 at 11:46 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Can we please just limit it to vma's that are either anonymous, or map
> a file that the user actually owns?

.. or slightly simpler: a file that the user opened for writing.

IOW, some (TOTALLY UNTESTED!) patch like this?

               Linus
mm/mincore.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..61e38895fb02 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return 0;
 }
 
+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+	return vma_is_anonymous(vma)
+		|| (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
+		|| capable(CAP_SYS_ADMIN);
+}
+
 /*
  * Do a chunk of "sys_mincore()". We've already checked
  * all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 	vma = find_vma(current->mm, addr);
 	if (!vma || addr < vma->vm_start)
 		return -ENOMEM;
-	mincore_walk.mm = vma->vm_mm;
 	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+	if (!can_do_mincore(vma)) {
+		unsigned long pages = (end - addr) >> PAGE_SHIFT;
+		memset(vec, 1, pages);
+		return pages;
+	}
+	mincore_walk.mm = vma->vm_mm;
 	err = walk_page_range(addr, end, &mincore_walk);
 	if (err < 0)
 		return err;
Linus Torvalds Jan. 5, 2019, 8:17 p.m. UTC | #9
[ Crossed emails ]

On Sat, Jan 5, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote:
>
> I am still not completely sure what to return in such cases though; we can
> either blatantly lie and always pretend that the pages are resident

That's what my untested patch did. Or maybe just claim they are all
not present?

And again, that patch was entirely untested, so it may be garbage and
have some fundamental problem. I also don't know exactly what rule
might make most sense, but "you can write to the file" certainly to me
implies that you also could know what parts of it are in-core.

Who actually _uses_ mincore()? That's probably the best guide to what
we should do. Maybe they open the file read-only even if they are the
owner, and we really should look at file ownership instead.

I tried to make that "can_do_mincore()" function easy to understand
and easy to just modify to some sane state.

Again, my patch is meant as a "perhaps something like this?" rather
than some "this is _exactly_ how it must be done". Take the patch as a
quick suggestion, not some final answer.

              Linus
Jiri Kosina Jan. 5, 2019, 8:43 p.m. UTC | #10
On Sat, 5 Jan 2019, Linus Torvalds wrote:

> > I am still not completely sure what to return in such cases though; we can
> > either blatantly lie and always pretend that the pages are resident
> 
> That's what my untested patch did. Or maybe just claim they are all
> not present?

Thinking about it a little bit more, I believe Vlastimil has a good point 
with 'non present' potentially causing more bogus activity in userspace in 
response (in an effort to actually make them present, and failing 
indefinitely).

IOW, I think it's a reasonable expectation that the common scenario is 
"check if it's present, and if not, try to fault it in" instead of "check 
if it's present, and if it is, try to evict it".

> And again, that patch was entirely untested, so it may be garbage and 
> have some fundamental problem. 

I will be travelling for next ~24 hours, but I have just asked our QA guys 
to run it through some basic battery of testing (which will probably 
happen on monday anyway).

> I also don't know exactly what rule might make most sense, but "you can 
> write to the file" certainly to me implies that you also could know what 
> parts of it are in-core.

I think it's reasonable; I can't really imagine any sidechannel to a 
global state be possibly mounted on valid R/W mappings. I'd guess that 
probably the most interesting here are the code segments of shared 
libraries, allowing to trace victim's execution.

> Who actually _uses_ mincore()? That's probably the best guide to what
> we should do. Maybe they open the file read-only even if they are the
> owner, and we really should look at file ownership instead.

Yeah, well

	https://codesearch.debian.net/search?q=mincore

is a bit too much mess to get some idea quickly I am afraid.
Linus Torvalds Jan. 5, 2019, 9:54 p.m. UTC | #11
On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote:
>
> > Who actually _uses_ mincore()? That's probably the best guide to what
> > we should do. Maybe they open the file read-only even if they are the
> > owner, and we really should look at file ownership instead.
>
> Yeah, well
>
>         https://codesearch.debian.net/search?q=mincore
>
> is a bit too much mess to get some idea quickly I am afraid.

Yeah, heh.

And the first hit is 'fincore', which probably nobody cares about
anyway, but it does

    fd = open (name, O_RDONLY)
    ..
    mmap(window, len, PROT_NONE, MAP_PRIVATE, ..

so if we want to keep that working, we'd really need to actually check
file ownership rather than just looking at f_mode.

But I don't know if anybody *uses* and cares about fincore, and it's
particularly questionable for non-root users.

And the Android go runtime code seems to oddly use mincore to figure
out page size:

  // try using mincore to detect the physical page size.
  // mincore should return EINVAL when address is not a multiple of
system page size.

which is all kinds of odd, but whatever.. Why mincore, rather than
something sane and obvious like mmap? Don't ask me...

Anyway, the Debian code search just results in mostly non-present
stuff. It's sad that google code search is no more. It was great for
exactly these kinds of questions.

The mono runtime seems to have some mono_pages_not_faulted() function,
but I don't know if people use it for file mappings, and I couldn't
find any interesting users of it.

I didn't find anything that seems to really care, but I gave up after
a few pages of really boring stuff.

                    Linus
Jann Horn Jan. 5, 2019, 10:54 p.m. UTC | #12
On Sat, Jan 5, 2019 at 6:27 PM Jiri Kosina <jikos@kernel.org> wrote:
> There are possibilities [1] how mincore() could be used as a converyor of
> a sidechannel information about pagecache metadata.
>
> Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> start returning -EPERM in case it's invoked by a process lacking
> CAP_SYS_ADMIN.
>
> The default behavior stays "mincore() can be used by anybody" in order to
> be conservative with respect to userspace behavior.
>
> [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/

Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
handler) is less problematic because it only returns data about the
state of page tables, and doesn't query the address_space? In other
words, it permits monitoring evictions, but non-intrusively detecting
that something has been loaded into memory by another process is
harder?
Linus Torvalds Jan. 5, 2019, 11:05 p.m. UTC | #13
On Sat, Jan 5, 2019 at 2:55 PM Jann Horn <jannh@google.com> wrote:
>
> Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
> handler) is less problematic because it only returns data about the
> state of page tables, and doesn't query the address_space? In other
> words, it permits monitoring evictions, but non-intrusively detecting
> that something has been loaded into memory by another process is
> harder?

Yes. I think it was brought up during the original report, but to use
the pagemap for this, you'd basically need to first populate all the
pages, and then wait for pageout.

So pagemap *does* leak very similar data, but it's much harder to use
as an attack vector.

That said, I think "mincore()" is just the simple one. You *can* (and
this was also discussed on the security list) do things like

 - fault in a single page

 - the kernel will opportunistically fault in pages that it already
has available _around_ that page.

 - then use pagemap (or just _timing_ - no real kernel support needed)
to see if those pages are now mapped in your address space

so honestly, mincore is just the "big hammer" and easy way to get at
some of this data. But it's probably worth closing exactly because
it's easy. There are other ways to get at the "are these pages mapped"
information, but they are a lot more combersome to use.

Side note: maybe we could just remove the "__mincore_unmapped_range()"
thing entirely, and basically make mincore() do what pagemap does,
which is to say "are the pages mapped in this VM".

That would be nicer than my patch, simply because removing code is
always nice. And arguably it's a better semantic anyway.

                Linus
Jiri Kosina Jan. 5, 2019, 11:09 p.m. UTC | #14
On Sat, 5 Jan 2019, Jann Horn wrote:

> > Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> > start returning -EPERM in case it's invoked by a process lacking
> > CAP_SYS_ADMIN.
> >
> > The default behavior stays "mincore() can be used by anybody" in order to
> > be conservative with respect to userspace behavior.
> >
> > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/
> 
> Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
> handler) is less problematic because it only returns data about the
> state of page tables, and doesn't query the address_space? In other
> words, it permits monitoring evictions, but non-intrusively detecting
> that something has been loaded into memory by another process is
> harder?

So I was just about to immediately reply that we don't expose pagemap 
anymore due to rowhammer, but apparently that's not true any more 
(this behavioud was originally introduced by ab676b7d6fbf, but then 
changed via 1c90308e7a77 (and further adjusted for swap entries in 
ab6ecf247a, but I guess that's not all that interesting).

Hmm.

But unless it has been mapped with MAP_POPULATE (whcih is outside the 
attacker's control), there is no guarantee that the mappings would 
actually be there, right?
Linus Torvalds Jan. 5, 2019, 11:16 p.m. UTC | #15
On Sat, Jan 5, 2019 at 3:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That would be nicer than my patch, simply because removing code is
> always nice. And arguably it's a better semantic anyway.

Yeah, I wonder why we did that thing where mincore() walks the page
tables, but if they are empty it looks in the page cache.

[... goes and looks in history ..]

It goes back to forever, it looks like. I can't find a reason.

Anyway, a removal patch would look something like the attached, I
think. That makes mincore() actually say how many pages are in _this_
mapping, not how many pages could be paged in without doing IO.

Hmm. Maybe we should try this first. Simplicity is always good.

Again, obviously untested.

                   Linus
mm/mincore.c | 74 +++++-------------------------------------------------------
 1 file changed, 6 insertions(+), 68 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..317eb64ea4ef 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -42,64 +42,12 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	return 0;
 }
 
-/*
- * Later we can get more picky about what "in core" means precisely.
- * For now, simply check to see if the page is in the page cache,
- * and is up to date; i.e. that no page-in operation would be required
- * at this time if an application were to map and access this page.
- */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
-{
-	unsigned char present = 0;
-	struct page *page;
-
-	/*
-	 * When tmpfs swaps out a page from a file, any process mapping that
-	 * file will not get a swp_entry_t in its pte, but rather it is like
-	 * any other file mapping (ie. marked !present and faulted in with
-	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
-	 */
-#ifdef CONFIG_SWAP
-	if (shmem_mapping(mapping)) {
-		page = find_get_entry(mapping, pgoff);
-		/*
-		 * shmem/tmpfs may return swap: account for swapcache
-		 * page too.
-		 */
-		if (xa_is_value(page)) {
-			swp_entry_t swp = radix_to_swp_entry(page);
-			page = find_get_page(swap_address_space(swp),
-					     swp_offset(swp));
-		}
-	} else
-		page = find_get_page(mapping, pgoff);
-#else
-	page = find_get_page(mapping, pgoff);
-#endif
-	if (page) {
-		present = PageUptodate(page);
-		put_page(page);
-	}
-
-	return present;
-}
-
 static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 				struct vm_area_struct *vma, unsigned char *vec)
 {
 	unsigned long nr = (end - addr) >> PAGE_SHIFT;
-	int i;
 
-	if (vma->vm_file) {
-		pgoff_t pgoff;
-
-		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
-	} else {
-		for (i = 0; i < nr; i++)
-			vec[i] = 0;
-	}
+	memset(vec, 0, nr);
 	return nr;
 }
 
@@ -144,21 +92,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
-			if (non_swap_entry(entry)) {
-				/*
-				 * migration or hwpoison entries are always
-				 * uptodate
-				 */
-				*vec = 1;
-			} else {
-#ifdef CONFIG_SWAP
-				*vec = mincore_page(swap_address_space(entry),
-						    swp_offset(entry));
-#else
-				WARN_ON(1);
-				*vec = 1;
-#endif
-			}
+			/*
+			 * migration or hwpoison entries are always
+			 * uptodate
+			 */
+			*vec = !!non_swap_entry(entry);
 		}
 		vec++;
 	}
Linus Torvalds Jan. 5, 2019, 11:28 p.m. UTC | #16
On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It goes back to forever, it looks like. I can't find a reason.

Our man-pages talk abouit the "without doing IO" part. That may be the
result of our code, though, not the reason for it.

The BSD man-page has other flags, but doesn't describe what "in core"
really means:

     MINCORE_INCORE        Page is in core (resident).

     MINCORE_REFERENCED        Page has been referenced by us.

     MINCORE_MODIFIED        Page has been modified by us.

     MINCORE_REFERENCED_OTHER  Page has been referenced.

     MINCORE_MODIFIED_OTHER    Page has been modified.

     MINCORE_SUPER        Page is part of a large (``super'') page.

but the fact that it has MINCORE_MODIFIED_OTHER does obviously imply
that yes, historically it really did look up the pages elsewhere, not
just in the page tables.

Still, maybe we can get away with just making it be about our own page
tables. That would be lovely.

                 Linus
Linus Torvalds Jan. 5, 2019, 11:39 p.m. UTC | #17
On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It goes back to forever, it looks like. I can't find a reason.

mincore() was originally added in 2.3.52pre3, it looks like. Around
2000 or so. But sadly before the BK history.

And that comment about

  "Later we can get more picky about what "in core" means precisely."

that still exists above mincore_page() goes back to the original patch.

           Linus
Matthew Wilcox (Oracle) Jan. 6, 2019, 12:11 a.m. UTC | #18
On Sat, Jan 05, 2019 at 03:39:10PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It goes back to forever, it looks like. I can't find a reason.
> 
> mincore() was originally added in 2.3.52pre3, it looks like. Around
> 2000 or so. But sadly before the BK history.
> 
> And that comment about
> 
>   "Later we can get more picky about what "in core" means precisely."
> 
> that still exists above mincore_page() goes back to the original patch.

FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!)

https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html

DESCRIPTION
       mincore()  returns  the primary memory residency	status of pages	in the
       address space covered by	mappings in the	range [addr, addr + len).  The
       status is returned as a char-per-page in	the character array referenced
       by *vec (which the system assumes to be large enough to	encompass  all
       the  pages  in  the  address range).  The least significant bit of each
       character is set	to 1 to	indicate that the referenced page is  in  pri-
       mary  memory, 0 if it is	not.  The settings of other bits in each char-
       acter is	undefined and may contain other	information in the future.
Linus Torvalds Jan. 6, 2019, 12:22 a.m. UTC | #19
On Sat, Jan 5, 2019 at 4:11 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!)
>
> https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html
>
> DESCRIPTION
>        mincore()  returns  the primary memory residency status of pages in the
>        address space covered by mappings in the range [addr, addr + len).

It's still not clear that "primary memory residency status" actually means.

Does it mean "mapped", or does it mean "exists in caches and doesn't need IO".

I don't even know what kind of caches SunOS 4.1.3 had. The Linux
implementation depends on the page cache, and wouldn't work (at least
not very well) in a system that has a traditional disk buffer cache.

Anyway, I guess it's mostly moot. From a "does this cause regressions"
standpoint, the only thing that matters is really whatever Linux
programs that have used this since it was introduced 18+ years ago.

But I think my patch to just rip out all that page lookup, and just
base it on the page table state has the fundamental advantage that it
gets rid of code. Maybe I should jst commit it, and see if anything
breaks? We do have options in case things break, and then we'd at
least know who cares (and perhaps a lot more information of _why_ they
care).

                     Linus
Linus Torvalds Jan. 6, 2019, 1:50 a.m. UTC | #20
On Sat, Jan 5, 2019 at 4:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I think my patch to just rip out all that page lookup, and just
> base it on the page table state has the fundamental advantage that it
> gets rid of code. Maybe I should jst commit it, and see if anything
> breaks? We do have options in case things break, and then we'd at
> least know who cares (and perhaps a lot more information of _why_ they
> care).

Slightly updated patch in case somebody wants to try things out.

Also, any comments about the pmd_trans_unstable() case?

                    Linus
mm/mincore.c | 94 +++++++++---------------------------------------------------
 1 file changed, 13 insertions(+), 81 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..f0f91461a9f4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -42,72 +42,14 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	return 0;
 }
 
-/*
- * Later we can get more picky about what "in core" means precisely.
- * For now, simply check to see if the page is in the page cache,
- * and is up to date; i.e. that no page-in operation would be required
- * at this time if an application were to map and access this page.
- */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
-{
-	unsigned char present = 0;
-	struct page *page;
-
-	/*
-	 * When tmpfs swaps out a page from a file, any process mapping that
-	 * file will not get a swp_entry_t in its pte, but rather it is like
-	 * any other file mapping (ie. marked !present and faulted in with
-	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
-	 */
-#ifdef CONFIG_SWAP
-	if (shmem_mapping(mapping)) {
-		page = find_get_entry(mapping, pgoff);
-		/*
-		 * shmem/tmpfs may return swap: account for swapcache
-		 * page too.
-		 */
-		if (xa_is_value(page)) {
-			swp_entry_t swp = radix_to_swp_entry(page);
-			page = find_get_page(swap_address_space(swp),
-					     swp_offset(swp));
-		}
-	} else
-		page = find_get_page(mapping, pgoff);
-#else
-	page = find_get_page(mapping, pgoff);
-#endif
-	if (page) {
-		present = PageUptodate(page);
-		put_page(page);
-	}
-
-	return present;
-}
-
-static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
-				struct vm_area_struct *vma, unsigned char *vec)
-{
-	unsigned long nr = (end - addr) >> PAGE_SHIFT;
-	int i;
-
-	if (vma->vm_file) {
-		pgoff_t pgoff;
-
-		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
-	} else {
-		for (i = 0; i < nr; i++)
-			vec[i] = 0;
-	}
-	return nr;
-}
-
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 				   struct mm_walk *walk)
 {
-	walk->private += __mincore_unmapped_range(addr, end,
-						  walk->vma, walk->private);
+	unsigned char *vec = walk->private;
+	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+
+	memset(vec, 0, nr);
+	walk->private += nr;
 	return 0;
 }
 
@@ -127,8 +69,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
+	/* We'll consider a THP page under construction to be there */
 	if (pmd_trans_unstable(pmd)) {
-		__mincore_unmapped_range(addr, end, vma, vec);
+		memset(vec, 1, nr);
 		goto out;
 	}
 
@@ -137,28 +80,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		pte_t pte = *ptep;
 
 		if (pte_none(pte))
-			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
-						 vma, vec);
+			*vec = 0;
 		else if (pte_present(pte))
 			*vec = 1;
 		else { /* pte is a swap entry */
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
-			if (non_swap_entry(entry)) {
-				/*
-				 * migration or hwpoison entries are always
-				 * uptodate
-				 */
-				*vec = 1;
-			} else {
-#ifdef CONFIG_SWAP
-				*vec = mincore_page(swap_address_space(entry),
-						    swp_offset(entry));
-#else
-				WARN_ON(1);
-				*vec = 1;
-#endif
-			}
+			/*
+			 * migration or hwpoison entries are always
+			 * uptodate
+			 */
+			*vec = !!non_swap_entry(entry);
 		}
 		vec++;
 	}
Kevin Easton Jan. 6, 2019, 11:33 a.m. UTC | #21
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> >         https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.

> Anyway, the Debian code search just results in mostly non-present
> stuff. It's sad that google code search is no more. It was great for
> exactly these kinds of questions.

If you select the "Group search results by Debian source package"
option on the search results page it makes it a lot easier to skim
through.

It looks to me like Firefox is expecting mincore() not to fail on
libraries that it has mapped:

https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98

    - Kevin
> 
> The mono runtime seems to have some mono_pages_not_faulted() function,
> but I don't know if people use it for file mappings, and I couldn't
> find any interesting users of it.
> 
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.
> 
>                     Linus
> 
>
Linus Torvalds Jan. 6, 2019, 9:46 p.m. UTC | #22
On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Slightly updated patch in case somebody wants to try things out.

I decided to just apply that patch. It is *not* marked for stable,
very intentionally, because I expect that we will need to wait and see
if there are issues with it, and whether we might have to do something
entirely different (more like the traditional behavior with some extra
"only for owner" logic).

But doing a test patch during the merge window (which is about to
close) sounds like the right thing to do.

                 Linus
Dominique Martinet Jan. 7, 2019, 4:32 a.m. UTC | #23
Linus Torvalds wrote on Sat, Jan 05, 2019:
> But I think my patch to just rip out all that page lookup, and just
> base it on the page table state has the fundamental advantage that it
> gets rid of code. Maybe I should jst commit it, and see if anything
> breaks? We do have options in case things break, and then we'd at
> least know who cares (and perhaps a lot more information of _why_ they
> care).

There actually are many tools like fincore which depend on mincore to
try to tell whether a file is "loaded in cache" or not (I personally use
vmtouch[1], but I know of at least nocache[2] uses it as well to only
try to evict used pages)

[1] https://hoytech.com/vmtouch/
[2] https://github.com/Feh/nocache


I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or
prefetch/lock whole files so my "production" use-cases don't actually
rely on the mincore part of them; but when playing with these actions
it's actually fairly useful to be able to visualize which part of a file
ended in cache or monitor how a file's content evolve in cache...

There are various non-obvious behaviours where being able to poke around
is enlightening (e.g. fadvise dontneed is actually a hint, so even if
nothing uses the file linux sometimes keep the data around if it thinks
that would be useful and nocache has a mode to call fadvise multiple
times and things like that...)


Anyway, I agree the use of mincore for this is rather ugly, and
frankly some "cache management API" might be better in the long run if
only for performance reason (don't try these tools on a hundred TB
sparse file...), but until that pipe dream comes true I think mincore as
it was is useful for system admins.



Linus Torvalds wrote on Sun, Jan 06, 2019:
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

FWIW I personally don't care much about "only for owner" or depending on
mmap options; I don't understand much of the security implications
honestly so I'm not sure how these limitations actually help.
On the other hand, a simple CAP_SYS_ADMIN check making the call take
either behaviour should be safe and would cover what I described above.


(by the way, while we are discussing permissions, a regular user can use
fadvise dontneed on files it doesn't own as well as long as it can open
them for reading; I'm not sure if that would need restricting as well in
the context of the security issue. Frankly even with mincore someone
could likely tell the difference through timing, if they just do it a
few times. Do magic, probe, flush out, repeat until satisfied.)


Thanks,
Michael Ellerman Jan. 7, 2019, 10:10 a.m. UTC | #24
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> It goes back to forever, it looks like. I can't find a reason.
>
> mincore() was originally added in 2.3.52pre3, it looks like. Around
> 2000 or so. But sadly before the BK history.

Yeah, it's here in the commit titled "Import 2.3.52pre3" (takes a second
or two to load):

  https://github.com/mpe/linux-fullhistory/commit/a1bcda3256956318c95c8da8bee09f79190bb034#diff-fd2d793b8b4760b4887c8c7bbb3451d7R1730

But no further detail.

(Instructions for using that tree https://github.com/mpe/linux-fullhistory/wiki)

cheers
Vlastimil Babka Jan. 7, 2019, 10:33 a.m. UTC | #25
On 1/7/19 5:32 AM, Dominique Martinet wrote:
> Linus Torvalds wrote on Sat, Jan 05, 2019:
>> But I think my patch to just rip out all that page lookup, and just
>> base it on the page table state has the fundamental advantage that it
>> gets rid of code. Maybe I should jst commit it, and see if anything
>> breaks? We do have options in case things break, and then we'd at
>> least know who cares (and perhaps a lot more information of _why_ they
>> care).
> 
> There actually are many tools like fincore which depend on mincore to
> try to tell whether a file is "loaded in cache" or not (I personally use
> vmtouch[1], but I know of at least nocache[2] uses it as well to only
> try to evict used pages)

nocache could probably do fine without mincore. IIUC the point is to not
evict anything that was already resident prior to running some command
wrapped in nocache. Without the mincore checks,
posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that
others have mapped. That means without mincore() it will drop data
that's in cache but not currently in use by anybody, which shouldn't
cause large performance regressions?

> [1] https://hoytech.com/vmtouch/
> [2] https://github.com/Feh/nocache
> 
> 
> I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or
> prefetch/lock whole files so my "production" use-cases don't actually
> rely on the mincore part of them;

Ah so you seem to confirm my above point.

...

> FWIW I personally don't care much about "only for owner" or depending on
> mmap options; I don't understand much of the security implications
> honestly so I'm not sure how these limitations actually help.
> On the other hand, a simple CAP_SYS_ADMIN check making the call take
> either behaviour should be safe and would cover what I described above.

So without CAP_SYS_ADMIN, mincore() would return mapping status, and
with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
:( Maybe if we introduced mincore2() with flags similar to BSD mentioned
earlier in the thread, and the cache residency flag would require
CAP_SYS_ADMIN or something similar.
> (by the way, while we are discussing permissions, a regular user can use
> fadvise dontneed on files it doesn't own as well as long as it can open
> them for reading; I'm not sure if that would need restricting as well in
> the context of the security issue.

Probably not, as I've mentioned it won't evict what's mapped by somebody
else. And eviction is also possible via controlling LRU, which is what
the paper [1] does anyway (and also mentions that DONTNEED doesn't
work). Being able to evict somebody's page is AFAIU not sufficient for
attack, the side channel is about knowing that somebody brought that
page back to RAM by touching it.

> Frankly even with mincore someone
> could likely tell the difference through timing, if they just do it a
> few times. Do magic, probe, flush out, repeat until satisfied.)

That's my bigger concern here. In [1] there's described a remote attack
(on webserver) using the page fault timing differences for present/not
present page cache pages. Noisy but works, and I expect locally it to be
much less noisy. Yet the countermeasures section only mentions
restricting mincore() as if it was sufficient (and also how to make
evictions harder, but that's secondary IMHO).

[1] https://arxiv.org/abs/1901.01161

> 
> Thanks,
>
Dominique Martinet Jan. 7, 2019, 11:08 a.m. UTC | #26
Vlastimil Babka wrote on Mon, Jan 07, 2019:
> On 1/7/19 5:32 AM, Dominique Martinet wrote:
> > Linus Torvalds wrote on Sat, Jan 05, 2019:
> >> But I think my patch to just rip out all that page lookup, and just
> >> base it on the page table state has the fundamental advantage that it
> >> gets rid of code. Maybe I should jst commit it, and see if anything
> >> breaks? We do have options in case things break, and then we'd at
> >> least know who cares (and perhaps a lot more information of _why_ they
> >> care).
> > 
> > There actually are many tools like fincore which depend on mincore to
> > try to tell whether a file is "loaded in cache" or not (I personally use
> > vmtouch[1], but I know of at least nocache[2] uses it as well to only
> > try to evict used pages)
> 
> nocache could probably do fine without mincore. IIUC the point is to not
> evict anything that was already resident prior to running some command
> wrapped in nocache. Without the mincore checks,
> posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that
> others have mapped. That means without mincore() it will drop data
> that's in cache but not currently in use by anybody, which shouldn't
> cause large performance regressions?

Yes, I sent them a patch to allow this behaviour (which got merged ages
ago); but the default nocache usage will try to preserve pages that were
mapped before even if the user mapped it themselves so it's a little bit
more robust that just trusting linux to do the right thing. I did
various tests with fadvise dontneed and honestly the way it works is far
from obvious when operating as a single user at least; I didn't test
multi-users as that didn't really concern me.

With the current mincore change, it will think everything was "in core"
and not flush anything unless my option to just fadvise dontneed
everything is passed though ; so even if we can make it work it is a
change of behaviour that is breaking an existing application, and it has
no way of telling it didn't work.


Honestly though, as I said, mincore() is much more useful for debugging
for me ; the application can be changed if required. I just pointed it
out as it'll need changing, and it has no obvious way of testing at
runtime if the syscall works (except dumb kernel version check, but that
won't work with stable backports); so it's not that obvious.

> > FWIW I personally don't care much about "only for owner" or depending on
> > mmap options; I don't understand much of the security implications
> > honestly so I'm not sure how these limitations actually help.
> > On the other hand, a simple CAP_SYS_ADMIN check making the call take
> > either behaviour should be safe and would cover what I described above.
> 
> So without CAP_SYS_ADMIN, mincore() would return mapping status, and
> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned
> earlier in the thread, and the cache residency flag would require
> CAP_SYS_ADMIN or something similar.

I agree, that's rather clumsy... Or rather might lead to some unexpected
behaviours. I'm open to other ideas.
I'm not sure how the BSD flags help though?

> > (by the way, while we are discussing permissions, a regular user can use
> > fadvise dontneed on files it doesn't own as well as long as it can open
> > them for reading; I'm not sure if that would need restricting as well in
> > the context of the security issue.
> 
> Probably not, as I've mentioned it won't evict what's mapped by somebody
> else. And eviction is also possible via controlling LRU, which is what
> the paper [1] does anyway (and also mentions that DONTNEED doesn't
> work). Being able to evict somebody's page is AFAIU not sufficient for
> attack, the side channel is about knowing that somebody brought that
> page back to RAM by touching it.

Thanks for the link to the paper, I hadn't taken the time to extract it
from the news article but it's much more interesting indeed.
Some of what's been said here makes more sense to me now (checking
about ownership and similar might indeed be good enough).

For evicting page I agree most targets would be different users so I
guess that makes sense as well; there seem to be other ways of
efficiently evicting a page from cache anyway.

(BTW, this was brought up earlier, the paper also mentions
/proc/self/pagemap as not being useful.)


> > Frankly even with mincore someone
> > could likely tell the difference through timing, if they just do it a
> > few times. Do magic, probe, flush out, repeat until satisfied.)

(I meant "without mincore", obviously, as you understood properly)

> That's my bigger concern here. In [1] there's described a remote attack
> (on webserver) using the page fault timing differences for present/not
> present page cache pages. Noisy but works, and I expect locally it to be
> much less noisy. Yet the countermeasures section only mentions
> restricting mincore() as if it was sufficient (and also how to make
> evictions harder, but that's secondary IMHO).

I'd suggest making clock rougher for non-root users but javascript tried
that and it wasn't enough... :)
Honestly won't be of much help there, good luck?

Thanks,
Vlastimil Babka Jan. 7, 2019, 11:59 a.m. UTC | #27
On 1/7/19 12:08 PM, Dominique Martinet wrote:
>
> With the current mincore change, it will think everything was "in core"
> and not flush anything unless my option to just fadvise dontneed
> everything is passed though ; so even if we can make it work it is a
> change of behaviour that is breaking an existing application, and it has
> no way of telling it didn't work.

IIUC the current change is commit 574823bfab82 ("Change mincore() to count
"mapped" pages rather than "cached" pages") which will not pretend
everything
is "in core", but only pages that the calling process has populated page
table mapping for (which implies in core, but the opposite doesn't
hold). "nocache" most certainly doesn't populate the mappings before
calling mincore(), as that would bring pages to page cache and defeat
the purpose of determining if they were already there prior the nocache
execution. Instead it will think that nothing was "in core", and thus
later call fadvise dontneed or everything, but as I've said earlier that
shouldn't matter much.

> Honestly though, as I said, mincore() is much more useful for debugging
> for me ; the application can be changed if required. I just pointed it
> out as it'll need changing, and it has no obvious way of testing at
> runtime if the syscall works (except dumb kernel version check, but that
> won't work with stable backports); so it's not that obvious.

Agree.

>>> FWIW I personally don't care much about "only for owner" or depending on
>>> mmap options; I don't understand much of the security implications
>>> honestly so I'm not sure how these limitations actually help.
>>> On the other hand, a simple CAP_SYS_ADMIN check making the call take
>>> either behaviour should be safe and would cover what I described above.
>>
>> So without CAP_SYS_ADMIN, mincore() would return mapping status, and
>> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
>> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned
>> earlier in the thread, and the cache residency flag would require
>> CAP_SYS_ADMIN or something similar.
> 
> I agree, that's rather clumsy... Or rather might lead to some unexpected
> behaviours. I'm open to other ideas.
> I'm not sure how the BSD flags help though?

Definitely it would be a long-term solution, introducing new API,
waiting for userspace to use it... and meanwhile we would have to keep
the status quo or some kind of the clumsy/subtle approach.

>>> (by the way, while we are discussing permissions, a regular user can use
>>> fadvise dontneed on files it doesn't own as well as long as it can open
>>> them for reading; I'm not sure if that would need restricting as well in
>>> the context of the security issue.
>>
>> Probably not, as I've mentioned it won't evict what's mapped by somebody
>> else. And eviction is also possible via controlling LRU, which is what
>> the paper [1] does anyway (and also mentions that DONTNEED doesn't
>> work). Being able to evict somebody's page is AFAIU not sufficient for
>> attack, the side channel is about knowing that somebody brought that
>> page back to RAM by touching it.
> 
> Thanks for the link to the paper, I hadn't taken the time to extract it
> from the news article but it's much more interesting indeed.

It went public only this morning, the article was older :)
Daniel Gruss Jan. 7, 2019, 1:29 p.m. UTC | #28
On 1/7/19 12:08 PM, Dominique Martinet wrote:
>> That's my bigger concern here. In [1] there's described a remote attack
>> (on webserver) using the page fault timing differences for present/not
>> present page cache pages. Noisy but works, and I expect locally it to be
>> much less noisy. Yet the countermeasures section only mentions
>> restricting mincore() as if it was sufficient (and also how to make
>> evictions harder, but that's secondary IMHO).
> 
> I'd suggest making clock rougher for non-root users but javascript tried
> that and it wasn't enough... :)
> Honestly won't be of much help there, good luck?

Restricting mincore() is sufficient to fix the hardware-agnostic part.
If the attack is not hardware-agnostic anymore, an attacker could also
just use a hardware cache attack, which has a higher temporal and
spatial resolution, so there's no reason why the attacker would use page
cache attacks instead then.


Cheers,
Daniel
Dave Chinner Jan. 8, 2019, 4:43 a.m. UTC | #29
On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Slightly updated patch in case somebody wants to try things out.
> 
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

So, I read the paper and before I was half way through it I figured
there are a bunch of other similar page cache invalidation attacks
we can perform without needing mincore. i.e. Focussing on mmap() and
mincore() misses the wider issues we have with global shared caches.

My first thought:

	fd = open(some_file, O_RDONLY);
	iov[0].iov_base = buf;
	iov[0].iov_len = 1;
	ret = preadv2(fd, iov, 1, off, RWF_NOWAIT);
	switch (ret) {
	case -EAGAIN:
		/* page not resident in page cache */
		break;
	case 1:
		/* page resident in page cache */
		break;
	default:
		/* beyond EOF or some other error */
		break;
	}

This is "as documented" in the man page for preadv2:

RWF_NOWAIT (since Linux 4.14)
      Do  not  wait  for  data which is not immediately available.
      If this flag is specified, the preadv2() system call will
      return instantly if it would have to read data from the
      backing storage or wait for a lock.  If some data was
      successfully read, it will return the number of bytes read.
      If no bytes were read, it will return  -1 and set errno to
      EAGAIN.  Currently, this flag is meaningful only for
      preadv2().

IOWs, we've explicitly designed interfaces to communicate whether
data is "immediately accessible" or not to the application so they
can make sensible decisions about IO scheduling. i.e. IO won't
block the main application processing loop and so can be scheduled in
the background by the app and the data processed when that IO
returns.  That just so happens to be exactly the same information
about the page cache that mincore is making available to userspace.

If we "remove" this information from the interfaces like it has been
done for mincore(), it doesn't mean userspace can't get it in other
ways. e.g. it now just has to time the read(2) syscall duration and
infer whether the data came from the page cache or disk from the
timing information.

IMO, there's nothing new or novel about this page cache information
leak - it was just a matter of time before some researcher put 2 and
2 together and realised that sharing the page cache across a
security boundary is little different from sharing deduplicated
pages across those same security boundaries.  i.e. As long as we
shared caches across security boundaries and userspace can control
both cache invalidation and instantiation, we cannot prevent
userspace from constructing these invalidate+read information
exfiltration scenarios.

And then there is overlayfs. Overlay is really just a way to
efficiently share the page cache of the same underlying read-only
directory tree across all containers on a host. i.e.  we have been
specifically designing our container hosting systems to share the
underlying read-only page cache across all security boundaries on
the host. If overlay is vulnerable to these shared page cache
attacks (seems extremely likely) then we've got much bigger problems
than mincore to think about....

> But doing a test patch during the merge window (which is about to
> close) sounds like the right thing to do.

IMO it seems like the wrong thing to do. It's just a hacky band-aid
over a specific extraction method and does nothing to reduce the
actual scope of the information leak. Focussing on the band-aid
means you've missed all the other avenues that the same information
is exposed and all the infrastructure we've build on the core
concept of sharing kernel side pages across security boundaries.

And that's even without considering whether the change breaks
userspace. Which it does. e.g. vmtouch is fairly widely used to
manage page cache instantiation for rapid bring-up and migration of
guest VMs and containers. They save the hot page cache information
from a running container and then using that to instantiate the page
cache in new instances running the same workload so they run at full
speed right from the start. This use case calls mincore() to pull
the page cache information from the running container.

If anyone else proposed merging a syscall implementation change that
was extremely likely to break userspace you'd be shouting at them
that "we don't break userspace"....

Cheers,

Dave.
Kevin Easton Jan. 8, 2019, 8:50 a.m. UTC | #30
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> >         https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
> 
> Yeah, heh.
> 
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
> 
>     fd = open (name, O_RDONLY)
>     ..
>     mmap(window, len, PROT_NONE, MAP_PRIVATE, ..
> 
> so if we want to keep that working, we'd really need to actually check
> file ownership rather than just looking at f_mode.
> 
> But I don't know if anybody *uses* and cares about fincore, and it's
> particularly questionable for non-root users.
> 
...
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.

I've gone through everything in the Debian code search, and this is the
stuff that seems like it would be affected at all by the current patch:

util-linux
    Contains 'fincore' as already noted above.

e2fsprogs
    e4defrag tries to drop pages that it caused to be loaded into the
    page cache, but it's not clear that this ever worked as designed 
    anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..)
    but then after the sync_file_range it drops the pages that *were*
    in the page cache at the time of mincore()).

pgfincore
    postgresql extension used to try to dump/restore page cache status
    of database backing files across reboots.  It uses a fresh mapping
    with mincore() to try to determine the current page cache status of
    a file.

nocache
    LD_PRELOAD library that tries to drop any pages that the victim
    program has caused to be loaded into the page cache, uses mincore
    on a fresh mapping to see what was resident beforehand.  Also 
    includes 'cachestats' command that's basically another 'fincore'.

xfsprogs
    xfs_io has a 'mincore' sub-command that is roughly equivalent to
    'fincore'.

vmtouch
    vmtouch is "Portable file system cache diagnostics and control",
    among other things it implements 'fincore' type functionality, and
    one of its touted use-cases is "Preserving virtual memory profile
    when failing over servers".

qemu
    qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to
    implement the "x-check-cache-dropped" option.  
    ( https://patchwork.kernel.org/patch/10395865/ )

(Everything else I could see was either looking at anonymous VMAs, its
own existing mapping that it's been using for actual IO, or was just
using mincore() to see if an address was part of any mapping at all).

    - Kevin
Bernd Petrovitsch Jan. 8, 2019, 9:14 a.m. UTC | #31
On 05/01/2019 20:38, Vlastimil Babka wrote:
[...]
> I was thinking about "return true" here, assuming that userspace generally wants
> to ensure itself there won't be page faults when it starts doing something
> critical, and if it sees a "false" it will try to do some kind of prefaulting,
> possibly in a loop. There might be somebody trying to make sure something is out

Isn't that racy by design as the pages may get flushed out after the check?
Shouldn't the application use e.g. mlock()/.... to guarantee no page
faults in the first place?

MfG,
	Bernd
Jiri Kosina Jan. 8, 2019, 11:37 a.m. UTC | #32
On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:

> Shouldn't the application use e.g. mlock()/.... to guarantee no page 
> faults in the first place?

Calling mincore() on pages you've just mlock()ed is sort of pointless 
though.
Bernd Petrovitsch Jan. 8, 2019, 1:53 p.m. UTC | #33
On 08/01/2019 12:37, Jiri Kosina wrote:
> On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:
> 
>> Shouldn't the application use e.g. mlock()/.... to guarantee no page 
>> faults in the first place?
> 
> Calling mincore() on pages you've just mlock()ed is sort of pointless 
> though.

Obviously;-)

Sorry for being unclear above: If I want my application to
avoid suffering from page faults, I use simply mlock()
(and/or friends) to nail the relevant pages into physical
RAM and not "look if they are out, if yes, get them in" which
has also the risk that these important pages are too soon
evicted again.

But perhaps I'm missing some very special use cases ....

MfG,
	Brend
Kirill A. Shutemov Jan. 8, 2019, 2:08 p.m. UTC | #34
On Tue, Jan 08, 2019 at 02:53:00PM +0100, Bernd Petrovitsch wrote:
> On 08/01/2019 12:37, Jiri Kosina wrote:
> > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:
> > 
> >> Shouldn't the application use e.g. mlock()/.... to guarantee no page 
> >> faults in the first place?
> > 
> > Calling mincore() on pages you've just mlock()ed is sort of pointless 
> > though.
> 
> Obviously;-)
> 
> Sorry for being unclear above: If I want my application to
> avoid suffering from page faults, I use simply mlock()
> (and/or friends) to nail the relevant pages into physical
> RAM and not "look if they are out, if yes, get them in" which
> has also the risk that these important pages are too soon
> evicted again.

Note, that mlock() doesn't prevent minor page faults. Mlocked memory is
still subject to mechanisms that makes the page temporary unmapped. For
instance migration (including NUMA balancing), compaction, khugepaged...
Linus Torvalds Jan. 8, 2019, 5:57 p.m. UTC | #35
On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote:
>
> So, I read the paper and before I was half way through it I figured
> there are a bunch of other similar page cache invalidation attacks
> we can perform without needing mincore. i.e. Focussing on mmap() and
> mincore() misses the wider issues we have with global shared caches.

Oh, agreed, and that was discussed in the original report too.

The thing is, you can also depend on our pre-faulting of pages in the
page fault handler, and use that to get the cached status of nearby
pages. So do something like "fault one page, then do mincore() to see
how many pages near it were mapped". See our "do_fault_around()"
logic.

But mincore is certainly the easiest interface, and the one that
doesn't require much effort or setup. It's also the one where our old
behavior was actually arguably simply stupid and actively wrong (ie
"in caches" isn't even strictly speaking a valid question, since the
caches in question may be invalid). So let's try to see if giving
mincore() slightly more well-defined semantics actually causes any
pain.

I do think that the RWF_NOWAIT case might also be interesting to look at.

                 Linus
Dave Chinner Jan. 9, 2019, 2:24 a.m. UTC | #36
On Tue, Jan 08, 2019 at 09:57:49AM -0800, Linus Torvalds wrote:
> On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > So, I read the paper and before I was half way through it I figured
> > there are a bunch of other similar page cache invalidation attacks
> > we can perform without needing mincore. i.e. Focussing on mmap() and
> > mincore() misses the wider issues we have with global shared caches.
> 
> Oh, agreed, and that was discussed in the original report too.
> 
> The thing is, you can also depend on our pre-faulting of pages in the
> page fault handler, and use that to get the cached status of nearby
> pages. So do something like "fault one page, then do mincore() to see
> how many pages near it were mapped". See our "do_fault_around()"
> logic.

Observing fault-around could help you detect what code an application is
running, but it's not necessary (and can be turned off). Also, such
an it observation is not dependent on using mincore. neither
fault-around nor mincore are required functionality to exploit the
information leaks.

And, FWIW, fault-around actually destroys the information in the
exfiltration channel described in the paper because it perturbs the
carefully constructed page cache residency pattern that encodes the
message.

> But mincore is certainly the easiest interface, and the one that
> doesn't require much effort or setup.

Off the top of my head, here's a few vectors for reading the page
cache residency state without perturbing the page cache residency
pattern:
	- mincore
	- preadv2(RWF_NOWAIT)
	- fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
	- madvise(MADV_RANDOM); timed read of first byte in each page

i.e. mincore is a messenger, but it's not the only trivial
observation technique available. The only difference between mincore
and the others will be the observation latency and hence channel
bandwidth.

IOWs, the question we need to focus on now is not "does breaking
mincore affect anyone", it is "how the hell do we mitigate and
isolate an information leak exposed by fundamental OS functionality
that *everything* depends on for performance"?

> It's also the one where our old
> behavior was actually arguably simply stupid and actively wrong (ie
> "in caches" isn't even strictly speaking a valid question, since the
> caches in question may be invalid).

This is irrelevant to the problem reported. Sure, mincore may be
an awful interface, but it's semantics are not the cause of the
information leak. You're just shooting the messenger...

> I do think that the RWF_NOWAIT case might also be interesting to look at.

As are all the other mechanisms you can use to observer page cache
residency without perturbing it's state.

Keep in mind that the researchers documented a remote observation
technique that leaked the information across the network to a remote
host, so this leak has much, much wider scope than changing mincore
can address...

Cheers,

Dave.
Jiri Kosina Jan. 9, 2019, 2:31 a.m. UTC | #37
On Wed, 9 Jan 2019, Dave Chinner wrote:

> > But mincore is certainly the easiest interface, and the one that
> > doesn't require much effort or setup.
> 
> Off the top of my head, here's a few vectors for reading the page
> cache residency state without perturbing the page cache residency
> pattern:
> 	- mincore
> 	- preadv2(RWF_NOWAIT)
> 	- fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
> 	- madvise(MADV_RANDOM); timed read of first byte in each page

While I obviously agree that all those are creating pagecache sidechannel 
in principle, I think we really should mostly focus on the first two (with 
mincore() already having been covered).

Rationale has been provided by Daniel Gruss in this thread -- if the 
attacker is left with cache timing as the only available vector, he's 
going to be much more successful with mounting hardware cache timing 
attack anyway.

Thanks,
Dave Chinner Jan. 9, 2019, 4:39 a.m. UTC | #38
On Wed, Jan 09, 2019 at 03:31:35AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
> 
> > > But mincore is certainly the easiest interface, and the one that
> > > doesn't require much effort or setup.
> > 
> > Off the top of my head, here's a few vectors for reading the page
> > cache residency state without perturbing the page cache residency
> > pattern:
> > 	- mincore
> > 	- preadv2(RWF_NOWAIT)
> > 	- fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
> > 	- madvise(MADV_RANDOM); timed read of first byte in each page
> 
> While I obviously agree that all those are creating pagecache sidechannel 
> in principle, I think we really should mostly focus on the first two (with 
> mincore() already having been covered).

FWIW, I just realised that the easiest, most reliable way to
invalidate the page cache over a file range is simply to do a
O_DIRECT read on it. IOWs, all three requirements of this
information leak - highly specific, reliable cache invalidation
control, controlled cache instantiation and 3rd-party detection of
cache residency can all be performed with just the read(2)
syscall...

> Rationale has been provided by Daniel Gruss in this thread -- if the 
> attacker is left with cache timing as the only available vector, he's 
> going to be much more successful with mounting hardware cache timing 
> attack anyway.

No, he said:

"Restricting mincore() is sufficient to fix the hardware-agnostic
part."

That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic
and provides exactly the same information about the page cache as
mincore.  Timed read/mmap access loops for cache observation are
also hardware agnostic, and on fast SSD based storage will only be
marginally slower bandwidth than preadv2(RWF_NOWAIT).

Attackers will pick whatever leak vector we don't fix, so we either
fix them all (which I think is probably impossible without removing
caching altogether) or we start thinking about how we need to
isolate the page cache so that information isn't shared across
important security boundaries (e.g. page cache contents are
per-mount namespace).

Cheers,

Dave.
Jiri Kosina Jan. 9, 2019, 10:08 a.m. UTC | #39
On Wed, 9 Jan 2019, Dave Chinner wrote:

> FWIW, I just realised that the easiest, most reliable way to invalidate 
> the page cache over a file range is simply to do a O_DIRECT read on it. 

Neat, good catch indeed. Still, it's only the invalidation part, but the 
residency check is the crucial one.

> > Rationale has been provided by Daniel Gruss in this thread -- if the 
> > attacker is left with cache timing as the only available vector, he's 
> > going to be much more successful with mounting hardware cache timing 
> > attack anyway.
> 
> No, he said:
> 
> "Restricting mincore() is sufficient to fix the hardware-agnostic
> part."
> 
> That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and 
> provides exactly the same information about the page cache as mincore.  

Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has 
"just" been overlooked. I can't speak for Daniel, but I believe he might 
be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT 
is sufficient ...".

> Timed read/mmap access loops for cache observation are also hardware 
> agnostic, and on fast SSD based storage will only be marginally slower 
> bandwidth than preadv2(RWF_NOWAIT).
> 
> Attackers will pick whatever leak vector we don't fix, so we either fix 
> them all (which I think is probably impossible without removing caching 
> altogether) 

We can't really fix the fact that it's possible to do the timing on the HW 
caches though.

> or we start thinking about how we need to isolate the page cache so that 
> information isn't shared across important security boundaries (e.g. page 
> cache contents are per-mount namespace).

Umm, sorry for being dense, but how would that help that particular attack 
scenario on a system that doesn't really employ any namespacing? (which I 
still believe is a majority of the systems out there, but I might have 
just missed the containers train long time ago :) ).
Linus Torvalds Jan. 9, 2019, 6:25 p.m. UTC | #40
On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote:
>
> FWIW, I just realised that the easiest, most reliable way to
> invalidate the page cache over a file range is simply to do a
> O_DIRECT read on it.

If that's the case, that's actually an O_DIRECT bug.

It should only invalidate the caches on write.

On reads, it wants to either _flush_ any direct caches before the
read, or just take the data from the caches. At no point is
"invalidate" a valid model.

Of course, I'm not in the least bit shocked if O_DIRECT is buggy like
this. But looking at least at the ext4 routine, the read just does

        ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,

and I don't see any invalidation.

Having read access to a file absolutely should *not* mean that you can
flush caches on it. That's a write op.

Any filesystem that invalidates the caches on read is utterly buggy.

Can you actually point to such a thing? Let's get that fixed, because
it's completely wrong regardless of this whole mincore issue.

               Linus
Dave Chinner Jan. 10, 2019, 12:44 a.m. UTC | #41
On Wed, Jan 09, 2019 at 10:25:43AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > FWIW, I just realised that the easiest, most reliable way to
> > invalidate the page cache over a file range is simply to do a
> > O_DIRECT read on it.
> 
> If that's the case, that's actually an O_DIRECT bug.
>
> It should only invalidate the caches on write.

Sounds nice from a theoretical POV, but reality has taught us
very different lessons.

FWIW, a quick check of XFS's history so you understand how long this
behaviour has been around. It was introduced in the linux port in
2001 as direct IO support was being added:

commit e837eac23662afae603aaaef7c94bc839c1b8f67
Author: Steve Lord <lord@sgi.com>
Date:   Mon Mar 5 16:47:52 2001 +0000

    Add bounds checking for direct I/O, do the cache invalidation for
    data coherency on direct I/O.

This was basically a direct port of the flush+invalidation code in
the Irix direct IO path, which was introduced in 1995:

    > revision 1.149
    > date: 1995/08/11 20:09:44;  author: ajs;  state: Exp;  lines: +70 -2
    > 280514 Adding page cache flusing calls to make direct
    > I/O coherent with buffered I/O.

IOWs, history tells us that invalidation for direct IO reads has
been done on XFS for almost 25 years.  I know for certain that there
have been applications out there that depend on this
invalidation-on-read behaviour (another of those "reality bites"
lessons) so we can't just remove it because you *think* it is a bug.

i.e. we *could* remove the invalidation on read, but this we have a
major behavioural change to the XFS direct IO path. This means we
need to determine if we've just awoken sleeping data corruption
krakens as well as determine if there are any performance
regressions that result from the behavioural change.

Which brings me to validation.  If the recent
clone/dedupe/copy_file_range() debacle has taught me anything, it's
that validating a "simple" IO path mechanism is going to take months
worth of machine time before we have any confidence that the change
is not going to expose users to new data corruption problems.

That's the difficulty here - it only takes 5 minutes to change
the code, but there's months of machine time needed to determine if
it's really safe to make that code change. Testing has a nasty habit
of finding invalid assumptions; when those are assumptions about
data coherency and integrity we can't test them on our users.

And, really, this would be just another band-aid over a symptom of
the information leak - it doesn't prevent users from being able to
control page cache invalidation. It just removes one method, just
like hacking mincore only removes one method of observing the page
cache.  And, like mincore(), there's every chance it impacts on
userspace in a negative manner and so we need to be very careful
here.

> On reads, it wants to either _flush_ any direct caches before the
> read, or just take the data from the caches. At no point is
> "invalidate" a valid model.
> 
> Of course, I'm not in the least bit shocked if O_DIRECT is buggy like
> this. But looking at least at the ext4 routine, the read just does
> 
>         ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
> 
> and I don't see any invalidation.

I wouldn't look at ext4 as an example of a reliable, problem free
direct IO implementation because, historically speaking, it's been a
series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
been far worse than XFS from data integrity, performance and
reliability perspectives.

IMO, "because ext4" has been a poor reason for justifying anything
for a long time, not the least when talking about features that
didn't even originate in extN....

> Can you actually point to such a thing? Let's get that fixed, because
> it's completely wrong regardless of this whole mincore issue.

The biggest problem that remains today is that we have no mechanism
for serialising page faults against DIO. If we leave pages cached in
memory while we have a AIO+DIO read (or write!) in progress, we can
dirty the page and run a buffered read before the AIO+DIO read
returns. This now leaves us in the state where where the AIO+DIO
read returns different (stale) data to a buffered read that has
already completed because it hit the dirty page cache.  i.e. we
still have nasty page cache vs direct IO coherency problems, and
they are largely unsolvable because of the limitations of the core
kernel infrastructure architecture.

Yes, you can argue that userspace is doing an insane thing, but
every so often we come across coherency issues like this that are
out of a user's control (e.g. backup scan vs app accesses) and we do
our best to ensure that they don't cause problems given the
constraints we have.  Invalidating the page cache on dio reads
mostly mitigates these coherency race conditions and that's why it's
still there in the XFS code paths...

Cheers,

Dave.
Dave Chinner Jan. 10, 2019, 1:15 a.m. UTC | #42
On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
> 
> > FWIW, I just realised that the easiest, most reliable way to invalidate 
> > the page cache over a file range is simply to do a O_DIRECT read on it. 
> 
> Neat, good catch indeed. Still, it's only the invalidation part, but the 
> residency check is the crucial one.
> 
> > > Rationale has been provided by Daniel Gruss in this thread -- if the 
> > > attacker is left with cache timing as the only available vector, he's 
> > > going to be much more successful with mounting hardware cache timing 
> > > attack anyway.
> > 
> > No, he said:
> > 
> > "Restricting mincore() is sufficient to fix the hardware-agnostic
> > part."
> > 
> > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and 
> > provides exactly the same information about the page cache as mincore.  
> 
> Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has 
> "just" been overlooked. I can't speak for Daniel, but I believe he might 
> be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT 
> is sufficient ...".

Good luck with restricting RWF_NOWAIT. I eagerly await all the
fstests that exercise both the existing and new behaviours to
demonstrate they work correctly.

> > Timed read/mmap access loops for cache observation are also hardware 
> > agnostic, and on fast SSD based storage will only be marginally slower 
> > bandwidth than preadv2(RWF_NOWAIT).
> > 
> > Attackers will pick whatever leak vector we don't fix, so we either fix 
> > them all (which I think is probably impossible without removing caching 
> > altogether) 
> 
> We can't really fix the fact that it's possible to do the timing on the HW 
> caches though.

We can't really fix the fact that it's possible to do the timing on
the page cache, either.

> > or we start thinking about how we need to isolate the page cache so that 
> > information isn't shared across important security boundaries (e.g. page 
> > cache contents are per-mount namespace).
> 
> Umm, sorry for being dense, but how would that help that particular attack 
> scenario on a system that doesn't really employ any namespacing?

What's your security boundary?

The "detect what code an app is running" exploit is based on
invalidating and then observing how shared, non-user-owned files
mapped with execute privileges change cache residency.

If the security boundary is within the local container, should users
inside that container be allowed to invalidate the cache of
executable files and libraries they don't own? In this case, we
can't stop observation, because that only require read permissions
and high precision timing, hence the only thing that can be done
here is prevent non-owners from invalidating the page cache.

If the security boundary is a namespace or guest VM, then permission
checks don't work - the user may own the file within that container.
This problem now is that the page cache is observable and
controllable from both sides of the fence. Hence the only way to
prevent observation of the code being run in a different namespace
is to prevent the page being shared across both containers.

The exfiltration exploit requires the page cache to be observable
and controllable on both sides of the security boundary. Should
users be able to observe and control the cached pages accessed by a
different container? KSM page deduplication lessons say no. This is
an even harder problem, because page cache residency can be observed
from remote machines....

What scares me is that new features being proposed could make our
exposure a whole lot worse. e.g. the recent virtio-pmem ("fake-dax")
proposal will directly share host page cache pages into guest VMs w/
DAX capability. i.e. the guest directly accesses the host page
cache.  This opens up the potential for host page cache timing
attacks from the guest VMs, and potential guest to guest
observation/exploitation is possible if the same files are mapped
into multiple guests....

IOws the two questions here are simply: "What's your security
boundary?" and "Is the page cache visible and controllable on both
sides?".

Cheers,

Dave.
Linus Torvalds Jan. 10, 2019, 1:18 a.m. UTC | #43
On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote:
>
> I wouldn't look at ext4 as an example of a reliable, problem free
> direct IO implementation because, historically speaking, it's been a
> series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> been far worse than XFS from data integrity, performance and
> reliability perspectives.

That's some big words from somebody who just admitted to much worse hacks.

Seriously. XFS is buggy in this regard, ext4 apparently isn't.

Thinking that it's better to just invalidate the cache  for direct IO
reads is all kinds of odd.

                Linus
Andy Lutomirski Jan. 10, 2019, 5:26 a.m. UTC | #44
On Wed, Jan 9, 2019 at 5:18 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > I wouldn't look at ext4 as an example of a reliable, problem free
> > direct IO implementation because, historically speaking, it's been a
> > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> > been far worse than XFS from data integrity, performance and
> > reliability perspectives.
>
> That's some big words from somebody who just admitted to much worse hacks.
>
> Seriously. XFS is buggy in this regard, ext4 apparently isn't.
>
> Thinking that it's better to just invalidate the cache  for direct IO
> reads is all kinds of odd.
>

This whole discussion seems to have gone a little bit off the rails...

Linus, I think I agree with Dave's overall sentiment, though, and I
think you should consider reverting your patch.  Here's why.  The
basic idea behind the attack is that the authors found efficient ways
to do two things: evict a page from page cache and detect, *without
filling the cache*, whether a page is cached.  The combination lets an
attacker efficiently tell when another process reads a page.  We need
to keep in mind that this attack is a sophisticated attack, and anyone
using it won't have any problem using a nontrivial way to detect
whether a page is in page cache.

So, unless we're going to try for real to make it hard to tell whether
a page is cached without causing that page to become cached, it's not
worth playing whack-a-mole.  And, right now, mincore is whacking a
mole.  RWF_NOWAIT appears to do essentially the same thing at very
little cost.  I haven't really dug in, but I assume that various
prefaulting tricks combined with various pagetable probing tricks can
do similar things, but that's at least a *lot* more complicated.

So unless we're going to lock down RWF_NOWAIT as well, I see no reason
to lock down mincore().  Direct IO is a red herring -- O_DIRECT is
destructive enough that it seems likely to make the attacks a lot less
efficient.


--- begin digression ---

Since direct IO has been brought up, I have a question.  I've wondered
for years why direct IO works the way it does.  If I were implementing
it from scratch, my first inclination would be to use the page cache
instead of fighting it.  To do a single-page direct read, I would look
that page up in the page cache (i.e. i_pages these days).  If the page
is there, I would do a normal buffered read.  If the page is not
there, I would insert a record into i_pages indicating that direct IO
is in progress and then I would do the IO into the destination page.
If any other read, direct or otherwise, sees a record saying "under
direct IO", it would wait.

To do a single-page direct write, I would look it up in i_pages.  If
it's there, I would do a buffered write followed by a sync (because
applications expect a sync).  If it's not there, I would again add a
record saying "under direct IO" and do the IO.

The idea is that this works as much like buffered IO as possible,
except that the pages backing the IO aren't normal sharable page cache
pages.

The multi-page case would be just an optimization on top of the
single-page case.  The idea would be to somehow mark i_pages with
entire extents under direct IO.  It's a radix tree -- this can, at
least in theory, be done efficiently.  As long as all direct IO
operations run in increasing order of offset, there shouldn't be lock
ordering problems.

Other than history and possibly performance, is there any reason that
direct IO doesn't work this way?

P.S. What, if anything, prevents direct writes from causing trouble
when the underlying FS or backing store needs stable pages?
Similarly, what, if anything, prevents direct reads from temporarily
exposing unintended data to user code if the fs or underlying device
transforms the data during the read process (e.g. by decrypting
something)?

--- end digression ---
Dave Chinner Jan. 10, 2019, 7:03 a.m. UTC | #45
On Wed, Jan 09, 2019 at 05:18:21PM -0800, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > I wouldn't look at ext4 as an example of a reliable, problem free
> > direct IO implementation because, historically speaking, it's been a
> > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> > been far worse than XFS from data integrity, performance and
> > reliability perspectives.
> 
> That's some big words from somebody who just admitted to much worse hacks.

Sorry, what hacks did I just admit to making? This O_DIRECT
behaviour long predates me - I'm just the messenger and you are
shooting from the hip.

Linus, the point I was making is that there are many, many ways to
control page cache invalidation and measure page cache residency,
and that trying to address them one-by-one is just a game of
whack-a-mole.

In future, can you please try not to go off the rails when someone
mentions O_DIRECT? You have a terrible habit of going off on
misdirected rants about O_DIRECT and/or XFS at any opportunity you
can get, and all it does is derail whatever useful conversation was
taking place.

> Seriously. XFS is buggy in this regard, ext4 apparently isn't.

So you keep asserting despite being presented with evidence that it
mitigates other longstanding bugs that are really hard to solve.
Ignoring all the evidence you've been presented with and
re-asserting your original statement doesn't make it correct.

Did you not think to ask "what are those problems, and what can do
to solve them so we can remove the invalidation mitigations that XFS
uses?". That would be a useful contribution, whereas shouting about
how O_DIRECT is broken just pisses off the people working their
asses off to fix the problems you just heard about and are ranting
about.

> Thinking that it's better to just invalidate the cache  for direct IO
> reads is all kinds of odd.

No, it's practicality. If we can't fix the problem, we have to
mitigate it. When we fix the underlying problem we can remove the
mitigation code. having you assert that it's broken and demand that
it be removed doesn't change the fact that we haven't fixed the
underlying problems. It's being worked on, but we're not there yet.

-Dave.
Jiri Kosina Jan. 10, 2019, 7:54 a.m. UTC | #46
On Thu, 10 Jan 2019, Dave Chinner wrote:

> > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has 
> > "just" been overlooked. I can't speak for Daniel, but I believe he might 
> > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT 
> > is sufficient ...".
> 
> Good luck with restricting RWF_NOWAIT. I eagerly await all the
> fstests that exercise both the existing and new behaviours to
> demonstrate they work correctly.

Well, we can still resurrect my original aproach of doing this opt-in 
based on a sysctl setting, and letting the admin choose his poison.

If 'secure' mode is selected, RWF_NOWAIT will then probably just always 
fail wit EAGAIN.
Linus Torvalds Jan. 10, 2019, 11:47 a.m. UTC | #47
On Wed, Jan 9, 2019 at 11:04 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Sorry, what hacks did I just admit to making? This O_DIRECT
> behaviour long predates me - I'm just the messenger and you are
> shooting from the hip.

Sure, sorry. I find this whole thing annoying.

> Linus, the point I was making is that there are many, many ways to
> control page cache invalidation and measure page cache residency,
> and that trying to address them one-by-one is just a game of
> whack-a-mole.

.. and I agree. But let's a step back.Because there are different issues.

First off, the whole page cache attack is not necessarily something
many people will care about. As has been pointed out, it's often a
matter of convenience and (relative) portability.

And no, we're *never* going to stop all side channel leaks. Some parts
of caching (notably the timing effects of it) are pretty fundamental.

So at no point is this going to be some kind of absolute line in the
sand _anyway_. There is no black-and-white "you're protected", there's
only levels of convenience.

A remote attacker is hopefully going to be limited by the interfaces
to just timing attacks, although who knows what something like JS
might expose. Presumably neither mincore() nor arbitrary O_DIRECT or
pread2() flags.

Anyway, the reason I was trying to plug mincore() is largely that that
code didn't make much sense to begin with, and simply this:

 mm/mincore.c | 94 +++++++++---------------------------------------------------
 1 file changed, 13 insertions(+), 81 deletions(-)

if we can make people happier by removing lines of code and making the
semantics more clear anyway, it's worth trying.

No?

Is that everything? No. As mentioned, you'll never get to that "ok, we
plugged everything" point anyway. But removing a fairly easy way to
probe the cache that has no real upsides should be fairly
non-controversial.

But I do have to say that in many ways the page cache is *not* a great
attack vector because there's often lots of it, and it's fairly hard
to control. Once something is in the page cache for whatever reason,
it tends to be pretty sticky, and flushing it tends to be fairly hard
to predict.

And a cheap and residency (whether a simple probe like mincore of or a
NOWAIT flag) check is actually important just to try to control the
flushing part. Brute-forcing the flushing is generally very expensive,
but if you can't even see if you flushed it, it's way more so.

If there's a way to control the cache residency directly, that's
actually a much bigger hole than any residency check ever were.

Because once you can flush caches by reading, at that point you can
just flush a particular page and look at the IO stats for the root
partition or something. No residency check even needed.

So I do think that yes, as long as you can do a directed cache flush,
mincore is *entirely* immaterial.

Still, giving mincore clearer semantics and simpler code? Win-win.

(Except, of course, if somebody actually notices outside of tests.
Which may well happen and just force us to revert that commit. But
that's a separate issue entirely).

But I do think that we should strive to *never* invalidate caches on
read accesses. I don't actually see where you are doing that,
honestly: at least dio_complete() only does it for writes.

So I'm actually hoping that you are mis-remembering this and it turns
out that O_DIRECT reads don't invalidate caches.

                Linus
Dominique Martinet Jan. 10, 2019, 12:24 p.m. UTC | #48
Linus Torvalds wrote on Thu, Jan 10, 2019:
> (Except, of course, if somebody actually notices outside of tests.
> Which may well happen and just force us to revert that commit. But
> that's a separate issue entirely).

Both Dave and I pointed at a couple of utilities that break with
this. nocache can arguably work with the new behaviour but will behave
differently; vmtouch on the other hand is no longer able to display
what's in cache or not - people use that for example to "warm up" a
container in page cache based on how it appears after it had been
running for a while is a pretty valid usecase to me.

From the list Kevin harvested out of the debian code search, the
postgresql use case is pretty similar - probe what pages of the database
were in cache at shutdown so when you restart it you can preload these
and reach "cruse speed" faster.

Sure that's probably not billions of users but this all looks fairly
valid to me...
Matthew Wilcox (Oracle) Jan. 10, 2019, 2:47 p.m. UTC | #49
On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> Since direct IO has been brought up, I have a question.  I've wondered
> for years why direct IO works the way it does.  If I were implementing
> it from scratch, my first inclination would be to use the page cache
> instead of fighting it.  To do a single-page direct read, I would look
> that page up in the page cache (i.e. i_pages these days).  If the page
> is there, I would do a normal buffered read.  If the page is not
> there, I would insert a record into i_pages indicating that direct IO
> is in progress and then I would do the IO into the destination page.
> If any other read, direct or otherwise, sees a record saying "under
> direct IO", it would wait.

OK, you're in the same ballpark I am ;-)  Kent Overstreet pointed out
that what you want to do here is great for the mixed case, but it's
pretty inefficient for IOs to files which are wholly uncached.

So what I'm currently thinking about is an rwsem which works like this:

O_DIRECT task:
if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO,
drop rwsem.
if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index).

buffered IO:
if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem.
if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on
waitqueue. otherwise proceed as now.
Matthew Wilcox (Oracle) Jan. 10, 2019, 2:50 p.m. UTC | #50
On Thu, Jan 10, 2019 at 11:44:24AM +1100, Dave Chinner wrote:
> And, really, this would be just another band-aid over a symptom of
> the information leak - it doesn't prevent users from being able to
> control page cache invalidation. It just removes one method, just
> like hacking mincore only removes one method of observing the page
> cache.  And, like mincore(), there's every chance it impacts on
> userspace in a negative manner and so we need to be very careful
> here.

Putting the mincore() / cache timing information leak aside though,
the current behaviour of XFS means that an attacker can screw up the
performance of random applications just by repeatedly doing O_DIRECT
reads of libc.so.

Maybe O_DIRECT reads should be forbidden from files on XFS unless you
also have write access to them?  (eg owner).
Dave Chinner Jan. 10, 2019, 9:44 p.m. UTC | #51
On Thu, Jan 10, 2019 at 06:47:11AM -0800, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> > Since direct IO has been brought up, I have a question.  I've wondered
> > for years why direct IO works the way it does.  If I were implementing
> > it from scratch, my first inclination would be to use the page cache
> > instead of fighting it.  To do a single-page direct read, I would look
> > that page up in the page cache (i.e. i_pages these days).  If the page
> > is there, I would do a normal buffered read.  If the page is not
> > there, I would insert a record into i_pages indicating that direct IO
> > is in progress and then I would do the IO into the destination page.
> > If any other read, direct or otherwise, sees a record saying "under
> > direct IO", it would wait.
> 
> OK, you're in the same ballpark I am ;-)  Kent Overstreet pointed out
> that what you want to do here is great for the mixed case, but it's
> pretty inefficient for IOs to files which are wholly uncached.
> 
> So what I'm currently thinking about is an rwsem which works like this:
> 
> O_DIRECT task:
> if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO,
> drop rwsem.

GUP does page fault on user buffer which is a mmapped region of same
file. page fault sets up for buffered IO, tries to take rwsem for
write, deadlocks.

Most of the schemes we come up with fall down at this point - you
can't hold a lock over gup that is also used in the buffered IO
path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK
for truncation serialisation - we can't lock out both read()/write()
and mmap IO paths with the same lock...

> if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index).

I assume you really mean add a tag to the entry?

But this means there is no record ofthe direct IO being in flight
except for the rwsem being held across the IO. Even if we did insert
a flag to say "DIO in progress" and not rely on the lock....

> buffered IO:
> if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem.
> if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on
> waitqueue. otherwise proceed as now.

... we'll sleep on that flags in the page fault and deadlock anyway.

I'm pretty sure we explored this "record DIO state in the radix
tree" 2 or 3 years ago and came to the conclusion that it didn't
work for reasons like the above. i.e. it doesn't solve the problems
we currently have with locking and serialisation between DIO and
mmap...

Cheers,

Dave.
Linus Torvalds Jan. 10, 2019, 9:59 p.m. UTC | #52
On Thu, Jan 10, 2019 at 1:44 PM Dave Chinner <david@fromorbit.com> wrote:
>
> GUP does page fault on user buffer which is a mmapped region of same
> file. page fault sets up for buffered IO, tries to take rwsem for
> write, deadlocks.
>
> Most of the schemes we come up with fall down at this point - you
> can't hold a lock over gup that is also used in the buffered IO
> path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK
> for truncation serialisation - we can't lock out both read()/write()
> and mmap IO paths with the same lock...

Side note: a somewhat similar version of is true even in the absence
of GUP and dio, for the case of doing a mmap of a file, and then
reading or writing from the mapped region into the file itself.

There are "interesting" locking scenarios wrt just holding the page
locked, and trying to then fill that page with information with just a
regular "copy_from_user()".

Page fault -> try to read the file -> oops, the page we're trying to
read from is locked because we're trying to write to it.

So we have that odd dance in generic_perform_write() which does

 - touch the first user byte without holding any lock

 - do write_begin() (which gets the page lock)

 - copy from user space using the "atomic" copy (which just gives up on fault)

 - if nothing got copied, go back and try again with a smaller copy
that can't cross a page. We might have raced with pageout.

It might be possible to do something similar for direct IO, although
simpler: just do the GUP entirely atomically (and in the fault case
just fall back to non-direct IO).

            Linus
Linus Torvalds Jan. 10, 2019, 10:11 p.m. UTC | #53
On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Linus Torvalds wrote on Thu, Jan 10, 2019:
> > (Except, of course, if somebody actually notices outside of tests.
> > Which may well happen and just force us to revert that commit. But
> > that's a separate issue entirely).
>
> Both Dave and I pointed at a couple of utilities that break with
> this. nocache can arguably work with the new behaviour but will behave
> differently; vmtouch on the other hand is no longer able to display
> what's in cache or not - people use that for example to "warm up" a
> container in page cache based on how it appears after it had been
> running for a while is a pretty valid usecase to me.

So honestly, the main reason I'm loath to revert is that yes, we know
of theoretical differences, but they seem to all be
performance-related.

It would be really good to hear numbers. Is the warm-up optimization
something that changes things from 3ms to 3.5ms? Or does it change
things from 3ms to half a second?

Because Dave is absolutely correct that mincore() isn't really even
all that interesting an information leak if you can do the same with
RWF_NOWAIT. But the other side of that same coin is that if we're not
able to block mincore() sanely, then there's no point at looking at
RWF_NOWAIT either.

And we *can* do sane things about RWF_NOWAIT. For example, we could
start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
page cache" to "probe and fill", and be much harder to use as an
attack vector..

Do we want to do that? Maybe, maybe not. But if mincore() can't be
fixed, there's no point in even trying.

Now, if the mincore() change results in a big performance hit for
people who use it as a heuristic for filling caches etc, then
reverting the trial balloon is obviously something we must do, but at
that point I'd also like to know which load it was that cared so much,
and just what it did. Because we did have an alternate patch that just
said "was the file writably opened, then we can do the page cache
probing". But at least one user (fincore) didn't do even that.

So right now, I consider the mincore change to be a "try to probe the
state of mincore users", and we haven't really gotten a lot of
information back yet.

              Linus
Dave Chinner Jan. 11, 2019, 1:47 a.m. UTC | #54
On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> Since direct IO has been brought up, I have a question.  I've wondered
> for years why direct IO works the way it does.  If I were implementing
> it from scratch, my first inclination would be to use the page cache
> instead of fighting it.  To do a single-page direct read, I would look
> that page up in the page cache (i.e. i_pages these days).  If the page
> is there, I would do a normal buffered read.  If the page is not

Therein lies the problem. Copying data is prohibitively expensive,
and that's the primary reason for O_DIRECT existing.  i.e. O_DIRECT
is a low-overhead, zero-copy data movement interface.

The moment we switch from using CPU to dispatch IO to copying data,
performance goes down because we will be unable to keep storage
pipelines full.  IOWs, any rework of O_DIRECT that involves copying
data is a non-starter.

But let's bring this back to the issue at hand - observability of
page cache residency of file pages. If th epage is caceh resident,
then it will have a latency of copying that data out of the page
(i.e. very low latency). If the page is not resident, then it will
do IO and take much, much longer to complete. i.e. we have clear
timing differences between cachce hit and cache miss IO.  This is
exactly the timing information needed for observing page cache
residency.

We need to work out how to make page cache residency less
observable, not add new, near perfect observation mechanisms that
third parties can easily exploit...

Cheers,

Dave.
Dave Chinner Jan. 11, 2019, 2:03 a.m. UTC | #55
On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> And we *can* do sane things about RWF_NOWAIT. For example, we could
> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> page cache" to "probe and fill", and be much harder to use as an
> attack vector..

We can only do that if the application submits the read via AIO and
has an async IO completion reporting mechanism.  Otherwise we have
to wait for the IO to complete to copy the data into the user's
buffer. And given that the app is using RWF_NOWAIT to explicitly
avoid blocking on the IO....

Also, keep in mind that RWF_NOWAIT also prevents blocking on
filesystem locks and full request queues. One of the prime drivers
of RWF_NOWAIT was to prevent AIO submission from blocking on
filesystem locks - it allows userspace to submit other IO while it
can't get all the access it requires to a single file or a single
block device is congested.

Hence I don't think there's a such a simple answer here - blocking
for IO breaks RWF_NOWAIT.

Cheers,

Dave.
Linus Torvalds Jan. 11, 2019, 2:18 a.m. UTC | #56
On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> > And we *can* do sane things about RWF_NOWAIT. For example, we could
> > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> > page cache" to "probe and fill", and be much harder to use as an
> > attack vector..
>
> We can only do that if the application submits the read via AIO and
> has an async IO completion reporting mechanism.

Oh, no, you misunderstand.

RWF_NOWAIT has a lot of situations where it will potentially return
early (the DAX and direct IO ones have their own), but I was thinking
of the one in generic_file_buffered_read(), which triggers when you
don't find a page mapping. That looks like the obvious "probe page
cache" case.

But we could literally move that test down just a few lines. Let it
start read-ahead.

.. and then it will actually trigger on the *second* case instead, where we have

                if (!PageUptodate(page)) {
                        if (iocb->ki_flags & IOCB_NOWAIT) {
                                put_page(page);
                                goto would_block;
                        }

and that's where RWF_MNOWAIT would act.

It would still return EAGAIN.

But it would have started filling the page cache. So now the act of
probing would fill the page cache, and the attacker would be left high
and dry - the fact that the page cache now exists is because of the
attack, not because of whatever it was trying to measure.

See?

But obviously this kind of change only matters if we also have
mincore() not returning the probe data. mincore() obviously can't do
the same kind of read-ahead to defeat things.

              Linus
Dave Chinner Jan. 11, 2019, 4:04 a.m. UTC | #57
On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> > > And we *can* do sane things about RWF_NOWAIT. For example, we could
> > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> > > page cache" to "probe and fill", and be much harder to use as an
> > > attack vector..
> >
> > We can only do that if the application submits the read via AIO and
> > has an async IO completion reporting mechanism.
> 
> Oh, no, you misunderstand.
> 
> RWF_NOWAIT has a lot of situations where it will potentially return
> early (the DAX and direct IO ones have their own), but I was thinking
> of the one in generic_file_buffered_read(), which triggers when you
> don't find a page mapping. That looks like the obvious "probe page
> cache" case.
> 
> But we could literally move that test down just a few lines. Let it
> start read-ahead.
> 
> .. and then it will actually trigger on the *second* case instead, where we have
> 
>                 if (!PageUptodate(page)) {
>                         if (iocb->ki_flags & IOCB_NOWAIT) {
>                                 put_page(page);
>                                 goto would_block;
>                         }
> 
> and that's where RWF_MNOWAIT would act.
> 
> It would still return EAGAIN.
> 
> But it would have started filling the page cache. So now the act of
> probing would fill the page cache, and the attacker would be left high
> and dry - the fact that the page cache now exists is because of the
> attack, not because of whatever it was trying to measure.
> 
> See?

Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in
page_cache_sync_readahead():

        /* be dumb */
        if (filp && (filp->f_mode & FMODE_RANDOM)) {
                force_page_cache_readahead(mapping, filp, offset, req_size);
                return;
        }

So it will only read the single page we tried to access and won't
perturb the rest of the message encoded into subsequent pages in
file.

Cheers,

Dave.
Andy Lutomirski Jan. 11, 2019, 4:08 a.m. UTC | #58
> On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
>> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote:
>>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
>>>> And we *can* do sane things about RWF_NOWAIT. For example, we could
>>>> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
>>>> page cache" to "probe and fill", and be much harder to use as an
>>>> attack vector..
>>> 
>>> We can only do that if the application submits the read via AIO and
>>> has an async IO completion reporting mechanism.
>> 
>> Oh, no, you misunderstand.
>> 
>> RWF_NOWAIT has a lot of situations where it will potentially return
>> early (the DAX and direct IO ones have their own), but I was thinking
>> of the one in generic_file_buffered_read(), which triggers when you
>> don't find a page mapping. That looks like the obvious "probe page
>> cache" case.
>> 
>> But we could literally move that test down just a few lines. Let it
>> start read-ahead.
>> 
>> .. and then it will actually trigger on the *second* case instead, where we have
>> 
>>                if (!PageUptodate(page)) {
>>                        if (iocb->ki_flags & IOCB_NOWAIT) {
>>                                put_page(page);
>>                                goto would_block;
>>                        }
>> 
>> and that's where RWF_MNOWAIT would act.
>> 
>> It would still return EAGAIN.
>> 
>> But it would have started filling the page cache. So now the act of
>> probing would fill the page cache, and the attacker would be left high
>> and dry - the fact that the page cache now exists is because of the
>> attack, not because of whatever it was trying to measure.
>> 
>> See?
> 
> Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in
> page_cache_sync_readahead():
> 
>        /* be dumb */
>        if (filp && (filp->f_mode & FMODE_RANDOM)) {
>                force_page_cache_readahead(mapping, filp, offset, req_size);
>                return;
>        }
> 
> So it will only read the single page we tried to access and won't
> perturb the rest of the message encoded into subsequent pages in
> file.
> 

There are two types of attacks.  One is an intentional side channel where two cooperating processes communicate. This is, under some circumstances, a problem, but it’s not one we’re about to solve in general. The other is an attacker monitoring an unwilling process. I think we care a lot more about that, and Linus’ idea will help.
Dominique Martinet Jan. 11, 2019, 4:57 a.m. UTC | #59
Linus Torvalds wrote on Thu, Jan 10, 2019:
> On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet
> <asmadeus@codewreck.org> wrote:
> > Linus Torvalds wrote on Thu, Jan 10, 2019:
> > > (Except, of course, if somebody actually notices outside of tests.
> > > Which may well happen and just force us to revert that commit. But
> > > that's a separate issue entirely).
> >
> > Both Dave and I pointed at a couple of utilities that break with
> > this. nocache can arguably work with the new behaviour but will behave
> > differently; vmtouch on the other hand is no longer able to display
> > what's in cache or not - people use that for example to "warm up" a
> > container in page cache based on how it appears after it had been
> > running for a while is a pretty valid usecase to me.
> 
> So honestly, the main reason I'm loath to revert is that yes, we know
> of theoretical differences, but they seem to all be
> performance-related.

I don't see what other use mincore could have, yes - even the
"debugging" use I gave is performance investigations and not hard
problems (and I probably would go straight to perf nowadays, you'd get
the info that the program doesn't use cache from the call graphs)

> It would be really good to hear numbers. Is the warm-up optimization
> something that changes things from 3ms to 3.5ms? Or does it change
> things from 3ms to half a second?

This is heavily workload and storage hardware dependant, so hard to give
some absolute value.

Trying with some big server, fast SSD, mysql and doing:
 # echo 3 > /proc/sys/vm/drop_caches
 # (optional) prefetch table and innodb files
 # systemctl restart mariadb
 # time mysql -q db "select * from mytable where id in $ENTRIES" > /dev/null
 # time mysql -q db "select * from mytable where id in $ENTRIES2" > /dev/null
 # time mysql -q db "select * from mytable where id in $ENTRIES3" > /dev/null
(where ENTRIES* are lists of 1000 id, and id is indexed; the table is 8GB
for 62590661 entries so 1000 entries is approx 128KB of data out of that
file)

I get on average over a few queries approximately a real time of 350ms,
230ms and 220ms immediately after drop cache and service restart, and
150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I
didn't have the patience to do proper testing).
(In both cases, user/sys are less than 10ms; I don't see much difference
there)

If I restart the service without dropping caches and redo the query I
get 60ms from the first query onwards so I must not be preloading
everything properly, some real script that would look all over a
container to properly restore the page cache would do better than me
blindly preloading a few files.

Either way, we're talking about a factor of 2-3 until the application has
been looking at most of the entries, and I didn't try to see how that
would look like on spinning disks or the kind of slow storage one would
get on VPS somewhere in the cloud - I'm sure someone with time to waste
could get much more impressive figures, but this already look pretty
worthwhile to me.
Linus Torvalds Jan. 11, 2019, 7:08 a.m. UTC | #60
On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote:
>
> So it will only read the single page we tried to access and won't
> perturb the rest of the message encoded into subsequent pages in
> file.

Dave, you're being intentionally obtuse, aren't you?

It's only that single page that *matters*. That's the page that the
probe reveals the status of - but it's also the page that the probe
then *changes* the status of.

See?

Think of it as "the act of measurement changes that which is
measured". And that makes the measurement pointless.

              Linus
Linus Torvalds Jan. 11, 2019, 7:11 a.m. UTC | #61
On Thu, Jan 10, 2019 at 8:58 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> I get on average over a few queries approximately a real time of 350ms,
> 230ms and 220ms immediately after drop cache and service restart, and
> 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I
> didn't have the patience to do proper testing).
> (In both cases, user/sys are less than 10ms; I don't see much difference
> there)

But those numbers aren't about the mincore() change. That's just from
dropping caches.

Now, what's the difference with the mincore change, and without? Is it
actually measurable?

Because that's all that matters: is the mincore change something you
can even notice? Is it a big regression?

The fact that things are slower when they are cold in the cache isn't
the issue. The issue is whether the change to mincore semantics makes
any difference to real loads.

                Linus
Dave Chinner Jan. 11, 2019, 7:20 a.m. UTC | #62
On Thu, Jan 10, 2019 at 08:08:37PM -0800, Andy Lutomirski wrote:
> > On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com>
> > wrote:
> > 
> >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds
> >> wrote:
> >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner
> >>> <david@fromorbit.com> wrote:
> >>> 
> >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds
> >>>> wrote: And we *can* do sane things about RWF_NOWAIT. For
> >>>> example, we could start async IO on RWF_NOWAIT, and suddenly
> >>>> it would go from "probe the page cache" to "probe and fill",
> >>>> and be much harder to use as an attack vector..
> >>> 
> >>> We can only do that if the application submits the read via
> >>> AIO and has an async IO completion reporting mechanism.
> >> 
> >> Oh, no, you misunderstand.
> >> 
> >> RWF_NOWAIT has a lot of situations where it will potentially
> >> return early (the DAX and direct IO ones have their own), but I
> >> was thinking of the one in generic_file_buffered_read(), which
> >> triggers when you don't find a page mapping. That looks like
> >> the obvious "probe page cache" case.
> >> 
> >> But we could literally move that test down just a few lines.
> >> Let it start read-ahead.
> >> 
> >> .. and then it will actually trigger on the *second* case
> >> instead, where we have
> >> 
> >>                if (!PageUptodate(page)) { if (iocb->ki_flags &
> >>                IOCB_NOWAIT) { put_page(page); goto would_block;
> >>                }
> >> 
> >> and that's where RWF_MNOWAIT would act.
> >> 
> >> It would still return EAGAIN.
> >> 
> >> But it would have started filling the page cache. So now the
> >> act of probing would fill the page cache, and the attacker
> >> would be left high and dry - the fact that the page cache now
> >> exists is because of the attack, not because of whatever it was
> >> trying to measure.
> >> 
> >> See?
> > 
> > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code
> > in page_cache_sync_readahead():
> > 
> >        /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM))
> >        { force_page_cache_readahead(mapping, filp, offset,
> >        req_size); return; }
> > 
> > So it will only read the single page we tried to access and
> > won't perturb the rest of the message encoded into subsequent
> > pages in file.
> 
> There are two types of attacks.  One is an intentional side
> channel where two cooperating processes communicate.  This is,
> under some circumstances, a problem,

Yes, that's the covert communication channel that can cross container
and machine boundaries without any required privileges.

> but it’s not one
> we’re about to solve in general. The other is an attacker
> monitoring an unwilling process.

Which uses exactly the same mechanisms as the first case. i.e.
controlled invalidation and page cache residency monitoring.If we
aren't going to solve the first problem case, the we aren't going to
solve the second because they are one and the same problem...

However, I suspect you have misunderstood the monitoring mechanism
here - dispatch IO for this page doesn't prevent the information
leak about that page. It's when we return EAGAIN that we leak
information about page cache residency.

What Linus is attempting to do is perturb the nearby state of the
page cache by triggering async readahead in the EAGAIN case.  Async
readahead will fill all the holes in readahead window and hence
destroy the information about where the page fault landed and
instantiated the page cache. That would prevent the attacker from
determining what code the executable is running as they would only
be able to check a single page in an executable at a time and that
makes the attack highly impractical.

But if the attacker uses FADV_RANDOM, readahead is only triggered
for the page the attacker is trying to read. Hence it does not
disturb the nearby page cache residency pattern the executable's
page faults left behind and so doesn't destroy the information that
they are trying to extract from the unwilling process.

Sure, Linus's change makes monitoring the executable file after
FADV_RANDOM a "read-once" mechanism. However, detection of what code
is executing is a repeated invalidate-and-sweep exercise to begin
with, so it basically doesn't change the information or the rate at
which the monitoring process can extract information from the file.

/me hasn't thought about this sort of stuff since he was running
page cache invalidation attacks on Irix system libraries way back in
2002 when he found a libc bug that killed the init process and
paniced the kernel.

This isn't my first rodeo - it's been well known for a long, long
time that the system page cache can be exploited to monitor
executing code...

Cheers,

Dave.
Dominique Martinet Jan. 11, 2019, 7:32 a.m. UTC | #63
Linus Torvalds wrote on Thu, Jan 10, 2019:
> But those numbers aren't about the mincore() change. That's just from
> dropping caches.
> 
> Now, what's the difference with the mincore change, and without? Is it
> actually measurable?
> 
> Because that's all that matters: is the mincore change something you
> can even notice? Is it a big regression?
> 
> The fact that things are slower when they are cold in the cache isn't
> the issue. The issue is whether the change to mincore semantics makes
> any difference to real loads.

mincore itself isn't used to reload the data, but is necessary to know
*what* you need to reload.
If you don't know what pages are hot, how can you preload them?

For small enough a database and with enough memory you can act stupid
and load the whole tables in cache, that's what I did because I was lazy
and only had a big mysql data set around. But the container warming up
automaton Dave mentioned and postgresql db preloading with pgfincore
explicitely depend on being able to tell what they need to preload.


pgfincore documentation states:
----
set of PostgreSQL functions to manage blocks in memory

Those functions let you know which and how many disk block from a
relation are in the page cache of the operating system, and eventually
write the result to a file. Then using this file, it is possible to
restore the page cache state for each block of the relation.
----
If you cannot dump an arbitrary "hot state" x, you cannot restore it.


This is all basically a repeat of the other subthread; sure precaching
itself doesn't need mincore and if you're all-knowing the syscall isn't
needed, but mere mortals need it.


If it's about the commit itself, vmtouch tells me 0 page in these
database files are in cache when I reboot to a 5.0-rc1 kernel and run
some queries, so the difference after a fresh boot is exactly what I
stated. I'm probably missing something but I'm not quite sure in what
situation the "new mincore" has any use right now.
Dave Chinner Jan. 11, 2019, 7:36 a.m. UTC | #64
On Thu, Jan 10, 2019 at 11:08:07PM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > So it will only read the single page we tried to access and won't
> > perturb the rest of the message encoded into subsequent pages in
> > file.
> 
> Dave, you're being intentionally obtuse, aren't you?
> 
> It's only that single page that *matters*. That's the page that the
> probe reveals the status of - but it's also the page that the probe
> then *changes* the status of.

It changes the state of it /after/ we've already got the information
we need from it. It's not up to date, it has to come from disk, we
return EAGAIN, which means it was not in the cache.

i.e. if we return EAGAIN, we've leaked the inforation the attacker
wants regardless of how the act of initiating readahead on the page
change the state of the page.  Yes, it raises the complexity bar a
bit, and lowers the monitoring frequency somewhat, but that's about
it.

Cheers,

Dave.
Jiri Kosina Jan. 11, 2019, 7:36 a.m. UTC | #65
On Thu, 10 Jan 2019, Dave Chinner wrote:

> Sounds nice from a theoretical POV, but reality has taught us very 
> different lessons.
> 
> FWIW, a quick check of XFS's history so you understand how long this 
> behaviour has been around. It was introduced in the linux port in 2001 
> as direct IO support was being added:
> 
> commit e837eac23662afae603aaaef7c94bc839c1b8f67
> Author: Steve Lord <lord@sgi.com>
> Date:   Mon Mar 5 16:47:52 2001 +0000
> 
>     Add bounds checking for direct I/O, do the cache invalidation for
>     data coherency on direct I/O.

Out of curiosity, which repository is this from please? Even google 
doesn't seem to know about this SHA.

Thanks,
Linus Torvalds Jan. 11, 2019, 4:26 p.m. UTC | #66
On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote:
>
> > It's only that single page that *matters*. That's the page that the
> > probe reveals the status of - but it's also the page that the probe
> > then *changes* the status of.
>
> It changes the state of it /after/ we've already got the information
> we need from it. It's not up to date, it has to come from disk, we
> return EAGAIN, which means it was not in the cache.

Oh, I see the confusion.

Yes, you get the information about whether something was in the cache
or not, so the side channel does exist to some degree.

But it's actually hugely reduced for a rather important reason: the
_primary_ reason for needing to know whether some page is in the cache
or not is not actually to see if it was ever accessed - it's to see
that the cache has been scrubbed (and to _guide_ the scrubbing), and
*when* it was accessed.

Think of it this way: the buffer cache residency is actually a
horribly bad signal on its own mainly because you generally have a
very high hit-rate. In most normal non-streaming situations with
sufficient amounts of memory you have pretty much everything cached.

So in order to use it as a signal, first you have to first scrub the
cache (because if the page was already there, there's no signal at
all), and then for the signal to be as useful as possible, you're also
going to want to try to get out more than one bit of information: you
are going to try to see the patterns and the timings of how it gets
filled.

And that's actually quite painful. You don't know the initial cache
state, and you're not (in general) controlling the machine entirely,
because there's also that actual other entity that you're trying to
attack and see what it does.

So what you want to do is basically to first make sure the cache is
scrubbed (only for the pages you're interested in!), then trigger
whatever behavior you are looking for, and then look how that affected
the cache.

In other words,  you want *multiple* residency status check - first to
see what the cache state is (because you're going to want that for
scrubbing), then to see that "yes, it's gone" when doing the
scrubbing, and then to see the *pattern* and timings of how things are
brought in.

And then you're likely to want to do this over and over again, so that
you can get real data out of the signal.

This is why something that doesn't perturb what you measure is really
important. If the act of measurement brings the page in, then you
can't use it for that "did I successfully scrub it" phase at all, and
you can't use it for measurement but once, so your view into patterns
and timings is going to be *much* worse.

And notice that this is true even if the act of measurement only
affects the *one* page you're measuring. Sure, any additional noise
around it would likely be annoying too, but it's not really necessary
to make the attack much harder to carry out. In fact, it's almost
irrelevant, since the signal you're trying to *see* is going to be
affected by prefetching etc too, so the patterns and timings you need
to look at are in bigger chunks than the readahead thing.

So yes, you as an attacker can remove the prefetching from *your*
load, but you can't remove it from the target load anyway, so you'll
just have to live with it.

Can you brute-force scrubbing? Yes. For something like an L1 cache,
that's easy (well, QoS domains make it harder). For something like a
disk cache, it's much harder, and makes any attempt to read out state
a lot slower. The paper that started this all uses mincore() not just
to see "is the page now scrubbed", but also to guide the scrubbing
itself (working set estimation etc).

And note that in many ways, the *scrubbing* is really the harder part.
Populating the cache is really easy: just read the data you want to
populate.

So if you are looking for a particular signal, say "did this error
case trigger so that it faulted in *that* piece of information", you'd
want to scrub the target, populate everything else, and then try to
measure at "did I trigger that target". Except you wouldn't want to do
it one page at a time but see as much pattern of "they were touched in
this order" as you can, and you'd like to get timing information of
how the pages you are interested were populated too.

And you'd generally do this over and over and over again because
you're trying to read out some signal.

Notice what the expensive operation was? It's the scrubbing.The "did
the target do IO" you might actually even see other ways for the
trivial cases, like even just look at iostat: just pre-populate
everything but the part you care about, then try to trigger whatever
you're searching for, and see if it caused IO or not.

So it's a bit like a chalkboard: in order to read out the result, you
need to erase it first, and doing that blindly is nasty. And you want
to look at timings, which is also really nasty if every time you look,
you smudge the very place you looked at. It makes it hard to see what
somebody else is writing on the board if you're always overwriting
what you just looked at. Did you get some new information? If not, now
you have to go back and do that scrubbing again, and you'll likely be
missing what *else* the person wrote.

Ans as always: there is no "black and white". There is no "absolute
security", and similarly, there is no "absolute leak proof". It's all
about making it inconvenient enough that it's not really practical.

                 Linus
Dave Chinner Jan. 15, 2019, 11:45 p.m. UTC | #67
On Fri, Jan 11, 2019 at 08:26:14AM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > > It's only that single page that *matters*. That's the page that the
> > > probe reveals the status of - but it's also the page that the probe
> > > then *changes* the status of.
> >
> > It changes the state of it /after/ we've already got the information
> > we need from it. It's not up to date, it has to come from disk, we
> > return EAGAIN, which means it was not in the cache.
> 
> Oh, I see the confusion.
> 
> Yes, you get the information about whether something was in the cache
> or not, so the side channel does exist to some degree.
> 
> But it's actually hugely reduced for a rather important reason: the
> _primary_ reason for needing to know whether some page is in the cache
> or not is not actually to see if it was ever accessed - it's to see
> that the cache has been scrubbed (and to _guide_ the scrubbing), and
> *when* it was accessed.

Oh, you're assuming that you need to probe the page cache to
determine if brute force invalidation has progressed far enough
to invalidate the page in question.

I'm assuming that you can invalidate the page cache reliably by a
means that does not repeated require probing to detect invalidation
has occurred. I've mentioned one method in this discussion
already...

IOWs, just because the specific brute force attack documented in the
paper required repeated probing it doesn't mean all future
invalidation attacks will require repeated probing. Hence a robust
defense mechanism should not rely on the attacker requiring multiple
observations of the page cache to extract the information they are
seeking...

Cheers,

Dave.
Josh Snyder Jan. 16, 2019, 12:42 a.m. UTC | #68
Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.

<raises hand>

For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months.  We
rely on cross-process mincore to migrate the contents of a page cache from
machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache dumper/loader
tool. It is quite similar in architecture to pgfincore, except that it is
agnostic to workload. The gist of happycache's operation is "produce a dump
of residence status for each page, do some operation, then reload exactly
the same pages which were present before." happycache is entirely dependent
on accurate reporting of the in-core status of file-backed pages, as
accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses. Because
our workloads never experience a cold page cache, we are able to provision
hardware for a peak utilization level that is far lower than the
hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem), with
no period of in-service degradation. By contrast, putting a database in
service without a page cache entails a potentially unbounded period of
degradation (at Netflix, the time to populate a single node's cache via
natural cache misses varies by workload from hours to weeks). If a single
node upgrade were to take weeks, then upgrading an entire cluster would
take months. Since we want to apply security upgrades (and other things) on
a somewhat tighter schedule, we would have to develop more complex
solutions to provide the same functionality already provided by mincore.

At the bottom line, happycache is designed to benignly exploit the same
information leak documented in the paper [2]. I think it makes perfect
sense to remove cross-process mincore functionality from unprivileged
users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161
Linus Torvalds Jan. 16, 2019, 4:54 a.m. UTC | #69
On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote:
>
> I'm assuming that you can invalidate the page cache reliably by a
> means that does not repeated require probing to detect invalidation
> has occurred. I've mentioned one method in this discussion
> already...

Yes. And it was made clear to you that it was a bug in xfs dio and
what the right thing to do was.

And you ignored that, and claimed it was a feature.

Why do you then bother arguing this thing? We absolutely agree that
xfs has an information leak. If you don't care, just _say_ so. Don't
try to argue against other people who are trying to fix things.

We can easily just say "ok, xfs people don't care", and ignore the xfs
invalidation issue. That's fine.

But don't try to make it a big deal for other filesystems that _don't_
have the bug. I even pointed out how ext4 does the page cache flushing
correcrly. You pooh-poohed it.

You can't have it both ways.

Either you care or you don't. If you don't care (and so far everything
you said seems to imply you don't), then why are you even discussing
this? Just admit you don't care, and we're done.

                  Linus
Linus Torvalds Jan. 16, 2019, 5 a.m. UTC | #70
On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote:
>
> For Netflix, losing accurate information from the mincore syscall would
> lengthen database cluster maintenance operations from days to months.  We
> rely on cross-process mincore to migrate the contents of a page cache from
> machine to machine, and across reboots.

Ok, this is the kind of feedback we need, and means I guess we can't
just use the mapping existence for mincore.

The two other ways that we considered were:

 (a) owner of the file gets to know cache information for that file.

 (b) having the fd opened *writably* gets you cache residency information.

Sadly, taking a look at happycache, you open the file read-only, so
(b) doesn't work.

Judging just from the source code, I can't tell how the user ownership
works. Any input on that?

And if you're not the owner of the file, do you have another
suggestion for that "Yes, I have the right to see what's in-core for
this file". Because the problem is literally that if it's some random
read-only system file, the kernel shouldn't leak access patterns to
it..

                     Linus
Andy Lutomirski Jan. 16, 2019, 5:25 a.m. UTC | #71
> On Jan 15, 2019, at 9:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote:
>> 
>> For Netflix, losing accurate information from the mincore syscall would
>> lengthen database cluster maintenance operations from days to months.  We
>> rely on cross-process mincore to migrate the contents of a page cache from
>> machine to machine, and across reboots.
> 
> Ok, this is the kind of feedback we need, and means I guess we can't
> just use the mapping existence for mincore.
> 
> The two other ways that we considered were:
> 
> (a) owner of the file gets to know cache information for that file.
> 
> (b) having the fd opened *writably* gets you cache residency information.
> 
> Sadly, taking a look at happycache, you open the file read-only, so
> (b) doesn't work.
> 
> Judging just from the source code, I can't tell how the user ownership
> works. Any input on that?
> 
> And if you're not the owner of the file, do you have another
> suggestion for that "Yes, I have the right to see what's in-core for
> this file". Because the problem is literally that if it's some random
> read-only system file, the kernel shouldn't leak access patterns to
> it..
> 
> 


Something like CAP_DAC_READ_SEARCH might not be crazy.
Linus Torvalds Jan. 16, 2019, 5:34 a.m. UTC | #72
On Wed, Jan 16, 2019 at 5:25 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Something like CAP_DAC_READ_SEARCH might not be crazy.

I agree that it would work. In fact' it's what Jiri's patch basically
did. Except Jiri used CAP_SYS_ADMIN instead.

But that then basically limits it to root (or root-like with
capability masks), which is quite likely to not work in practice all
that well. That's why I wanted to find alternatives.

*Very* few people want to run their databases as root.

Jiri's original patch kind of acknowledged that by making the new test
be conditional, and off by default. So then it's a "only do this for
lockdown mode, because normal people won't find it acceptable".

And I'm not a huge fan of that approach. If you don't protect normal
people, then what's the point, really?

              Linus
Dominique Martinet Jan. 16, 2019, 5:46 a.m. UTC | #73
Linus Torvalds wrote on Wed, Jan 16, 2019:
> *Very* few people want to run their databases as root.

In the case of happycache, this isn't the database doing the
dump/restore, but a separate process that could have the cap - it's
better if we can do without though, and from his readme he runs as user
cassandra in the /var/lib/cassandra directory for example so that'd
match the file owner.

For pgfincore, it's a postgres extension so the main process does it -
but it does have files open as write as well as being the owner.

> Jiri's original patch kind of acknowledged that by making the new test
> be conditional, and off by default. So then it's a "only do this for
> lockdown mode, because normal people won't find it acceptable".
> 
> And I'm not a huge fan of that approach. If you don't protect normal
> people, then what's the point, really?

I agree with that. 
"Being owner or has cap" (whichever cap) is probably OK.
On the other hand, writeability check makes more sense in general -
could we somehow check if the user has write access to the file instead
of checking if it currently is opened read-write?
Linus Torvalds Jan. 16, 2019, 5:49 a.m. UTC | #74
On Wed, Jan 16, 2019 at 4:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > I'm assuming that you can invalidate the page cache reliably by a
> > means that does not repeated require probing to detect invalidation
> > has occurred. I've mentioned one method in this discussion
> > already...
>
> Yes. And it was made clear to you that it was a bug in xfs dio and
> what the right thing to do was.

Side note: I actually think we *do* the right thing. Even for xfs. I
couldn't find the alleged place that invalidates the page cache on dio
reads.

The *generic* dio code only does it for writes (which is correct and
fine). And maybe xfs has some extra invalidation, but I don't see it.

So I actually hope your "you can use direct-io read to do directed
invalidating of the page cache" isn't true. I admittedly did *not* try
to delve very deeply into it, but the invalidates I found looked
correct. The generic code does it for writes, and at least ext4 does
the "writeback and wait" for reads.

There *does* seem to be a 'invalidate_inode_pages2_range()' call in
iomap_dio_rw(). That has a *comment* that says it only is for writes,
but it looks to me like it would trigger for reads too.

Just a plain bug/oversight? Or me misreading things.

So yes, maybe xfs does that "invalidate on read", but it really seems
to be just a bug. If the xfs people insist on keeping the bug, fine
(looks like gfs2 and xfs are the only users), but it seems kind of
sad.

             Linus
Linus Torvalds Jan. 16, 2019, 5:58 a.m. UTC | #75
On Wed, Jan 16, 2019 at 5:46 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> "Being owner or has cap" (whichever cap) is probably OK.
> On the other hand, writeability check makes more sense in general -
> could we somehow check if the user has write access to the file instead
> of checking if it currently is opened read-write?

That's likely the best option. We could say "is it open for write, or
_could_ we open it for writing?"

It's a slightly annoying special case, and I'd have preferred to avoid
it, but it doesn't sound *compilcated*.

I'm on the road, but I did send out this:

    https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/

originally. The "let's try to only do the mmap residency" was the
optimistic "maybe we can just get rid of this complexity entirely"
version..

Anybody willing to test the above patch instead? And replace the

   || capable(CAP_SYS_ADMIN)

check with something like

   || inode_permission(inode, MAY_WRITE) == 0

instead?

(This is obviously after you've reverted the "only check mmap
residency" patch..)

            Linus
Dominique Martinet Jan. 16, 2019, 6:34 a.m. UTC | #76
Linus Torvalds wrote on Wed, Jan 16, 2019:
> Anybody willing to test the above patch instead? And replace the
> 
>    || capable(CAP_SYS_ADMIN)
> 
> check with something like
> 
>    || inode_permission(inode, MAY_WRITE) == 0
> 
> instead?
> 
> (This is obviously after you've reverted the "only check mmap
> residency" patch..)

That seems to work on an x86_64 vm.

I've tested with the attached patch:
 - root can lookup pages on any file I tried;
 - user can lookup page on file it owns, assuming it can write to it
(e.g. it won't work on a 0400 file you own)
 - user cannot lookup pages on e.g. /lib64/libc-2.28.so

There is a difference with your previous patch though, that used to list
no page in core when it didn't know; this patch lists pages as in core
when it refuses to tell. I don't think that's very important, though.

If anything, the 0400 user-owner file might be a problem in some edge
case (e.g. if you're preloading git directories, many objects are 0444);
should we *also* check ownership?...
Josh Snyder Jan. 16, 2019, 7:52 a.m. UTC | #77
On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org>
wrote:
>
> There is a difference with your previous patch though, that used to list no
> page in core when it didn't know; this patch lists pages as in core when it
> refuses to tell. I don't think that's very important, though.

Is there a reason not to return -EPERM in this case?

>
> If anything, the 0400 user-owner file might be a problem in some edge
> case (e.g. if you're preloading git directories, many objects are 0444);
> should we *also* check ownership?...

Yes, this seems valuable. Some databases with immutable files (e.g. git, as
you've mentioned) conceivably operate this way.

Josh
Kevin Easton Jan. 16, 2019, 12:18 p.m. UTC | #78
On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org>
> wrote:
> >
> > There is a difference with your previous patch though, that used to list no
> > page in core when it didn't know; this patch lists pages as in core when it
> > refuses to tell. I don't think that's very important, though.
> 
> Is there a reason not to return -EPERM in this case?

When I was looking through the Debian Code Search results, quite a few
of the hits were for code that uses mincore() as a way to check if
_anything_ is mapped at an address or not.  This code doesn't care about
the in core / not in core result of mincore(), just whether it returned
an error or not.

I think a new error return would break most of the instances of that
code I saw.

    - Kevin
Matthew Wilcox (Oracle) Jan. 16, 2019, 12:36 p.m. UTC | #79
On Wed, Jan 16, 2019 at 05:00:25PM +1200, Linus Torvalds wrote:
> And if you're not the owner of the file, do you have another
> suggestion for that "Yes, I have the right to see what's in-core for
> this file". Because the problem is literally that if it's some random
> read-only system file, the kernel shouldn't leak access patterns to
> it..

This probably isn't a good heuristic, but thought I'd mention it
anyway ...  if the file is executable and you're not the owner, mincore
always/never says its pages are resident.  That'd fix all library leaks,
but then there's probably a smart way of figuring out something from
access patterns to a data file of some kind (/etc/passwd?)
Jiri Kosina Jan. 16, 2019, 4:12 p.m. UTC | #80
On Wed, 16 Jan 2019, Linus Torvalds wrote:

> > "Being owner or has cap" (whichever cap) is probably OK. On the other 
> > hand, writeability check makes more sense in general - could we 
> > somehow check if the user has write access to the file instead of 
> > checking if it currently is opened read-write?
> 
> That's likely the best option. We could say "is it open for write, or
> _could_ we open it for writing?"
> 
> It's a slightly annoying special case, and I'd have preferred to avoid
> it, but it doesn't sound *compilcated*.
> 
> I'm on the road, but I did send out this:
> 
>     https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/
> 
> originally. The "let's try to only do the mmap residency" was the
> optimistic "maybe we can just get rid of this complexity entirely"
> version..
> 
> Anybody willing to test the above patch instead? And replace the
> 
>    || capable(CAP_SYS_ADMIN)
> 
> check with something like
> 
>    || inode_permission(inode, MAY_WRITE) == 0
> 
> instead?
> 
> (This is obviously after you've reverted the "only check mmap
> residency" patch..)

So that seems to deal with mincore() in a reasonable way indeed.

It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it 
provide any good answer what to do about it, does it?

Thanks,
Linus Torvalds Jan. 16, 2019, 5:48 p.m. UTC | #81
On Thu, Jan 17, 2019 at 4:12 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> So that seems to deal with mincore() in a reasonable way indeed.
>
> It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it
> provide any good answer what to do about it, does it?

As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
to just move the test down to after readahead.

We could/should be smarter than this,but it *might* be as simple as just

diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..7bcdd36e629d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(

                page = find_get_page(mapping, index);
                if (!page) {
-                       if (iocb->ki_flags & IOCB_NOWAIT)
-                               goto would_block;
                        page_cache_sync_readahead(mapping,
                                        ra, filp,
                                        index, last_index - index);

which starts readahead even if IOCB_NOWAIT is set and will then test
IOCB_NOWAIT _later_ and not actually wait for it.

            Linus
Jiri Kosina Jan. 16, 2019, 8:23 p.m. UTC | #82
On Thu, 17 Jan 2019, Linus Torvalds wrote:

> > So that seems to deal with mincore() in a reasonable way indeed.
> >
> > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it
> > provide any good answer what to do about it, does it?
> 
> As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> to just move the test down to after readahead.

So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the 
kernel with the three topmost patches from

	https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel

applied (also attaching to this mail), and no obvious breakage popped up.

So if noone sees any principal problem there, I'll happily submit it with 
proper attribution etc.

Thanks,
Matthew Wilcox (Oracle) Jan. 16, 2019, 9:37 p.m. UTC | #83
On Wed, Jan 16, 2019 at 09:23:04PM +0100, Jiri Kosina wrote:
> On Thu, 17 Jan 2019, Linus Torvalds wrote:
> > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> > to just move the test down to after readahead.

Your patch 3/3 just removes the test.  Am I right in thinking that it
doesn't need to be *moved* because the existing test after !PageUptodate
catches it?

Of course, there aren't any tests for RWF_NOWAIT in xfstests.  Are there
any in LTP?

Some typos in the commit messages:

> Another aproach (checking file access permissions in order to decide
"approach"

> Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative
> 
> The semantics of what mincore() considers to be resident is not completely
> clearar, but Linux has always (since 2.3.52, which is when mincore() was
"clear"

> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
"belong"
Jiri Kosina Jan. 16, 2019, 9:41 p.m. UTC | #84
On Wed, 16 Jan 2019, Matthew Wilcox wrote:

> > On Thu, 17 Jan 2019, Linus Torvalds wrote:
> > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> > > to just move the test down to after readahead.
> 
> Your patch 3/3 just removes the test.  Am I right in thinking that it
> doesn't need to be *moved* because the existing test after !PageUptodate
> catches it?

Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and 
if it's actually set, it'll be handled by the !PageUpdtodate case.

> Of course, there aren't any tests for RWF_NOWAIT in xfstests.  Are there 
> any in LTP?

Not in the released version AFAIK. I've asked the LTP maintainer (in our 
internal bugzilla) to take care of this thread a few days ago, but not 
sure what came out of it. Adding him (Cyril) to CC.

> Some typos in the commit messages:
> 
> > Another aproach (checking file access permissions in order to decide
> "approach"
> 
> > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative
> > 
> > The semantics of what mincore() considers to be resident is not completely
> > clearar, but Linux has always (since 2.3.52, which is when mincore() was
> "clear"
> 
> > initially done) treated it as "page is available in page cache".
> > 
> > That's potentially a problem, as that [in]directly exposes meta-information
> > about pagecache / memory mapping state even about memory not strictly belonging
> > to the process executing the syscall, opening possibilities for sidechannel
> > attacks.
> > 
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> "belong"

Thanks.
Dave Chinner Jan. 17, 2019, 1:26 a.m. UTC | #85
On Wed, Jan 16, 2019 at 04:54:49PM +1200, Linus Torvalds wrote:
> On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > I'm assuming that you can invalidate the page cache reliably by a
> > means that does not repeated require probing to detect invalidation
> > has occurred. I've mentioned one method in this discussion
> > already...
> 
> Yes. And it was made clear to you that it was a bug in xfs dio and
> what the right thing to do was.
> 
> And you ignored that, and claimed it was a feature.

Linus, either you aren't listening or you're being intentionally
provocative.

So, for the *third* time this thread: we can probably remove this
code but first we need to be sure it doesn't cause unexpected
regressions before we commit such a change. We are not cowboys who
test userspace behavioural changes on users without review or
discussion.

Indeed, I wrote a patch to remove the invalidation /several days
ago/ and put it into my test trees, and it's been there since. Just
because you don't see immediate changes doesn't mean it isn't
happening.

> Either you care or you don't. If you don't care (and so far everything
> you said seems to imply you don't),

Linus, this is just a personal attack and IMO a violation of the
CoC.  It's straight out wrong, insulting, totally unprofessional and
completely uncalled for.

This is most definitely not a useful technical response to the
issues I raised. i.e you cut out all the context of my response
about whether "no probing necessary" page cache invalidation attacks
are something we need to care about in the future. We don't need you
to shout about existing "no probing necessary" page cache
invalidation attacks that are already being addressed, we need to
determine if it's going to be a recurring problem in future because
that directly affects the mitigation strategies we can implement.

-Dave.
Dominique Martinet Jan. 17, 2019, 1:49 a.m. UTC | #86
Jiri Kosina wrote on Wed, Jan 16, 2019:
> So if noone sees any principal problem there, I'll happily submit it with 
> proper attribution etc.

I'm not convinced just the write permission check is enough for
mincore(), as Josh also seems to share the concern I raised (e.g. map a
git directory "hot" pages)
We probably need to add an inode_owner_or_capable() or similar, the open
question is do we still need the write access check after that - I don't
really know how expensive these calls are.

Thanks,
Dave Chinner Jan. 17, 2019, 2:22 a.m. UTC | #87
On Fri, Jan 11, 2019 at 08:36:55AM +0100, Jiri Kosina wrote:
> On Thu, 10 Jan 2019, Dave Chinner wrote:
> 
> > Sounds nice from a theoretical POV, but reality has taught us very 
> > different lessons.
> > 
> > FWIW, a quick check of XFS's history so you understand how long this 
> > behaviour has been around. It was introduced in the linux port in 2001 
> > as direct IO support was being added:
> > 
> > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > Author: Steve Lord <lord@sgi.com>
> > Date:   Mon Mar 5 16:47:52 2001 +0000
> > 
> >     Add bounds checking for direct I/O, do the cache invalidation for
> >     data coherency on direct I/O.
> 
> Out of curiosity, which repository is this from please? Even google 
> doesn't seem to know about this SHA.

because oss.sgi.com is no longer with us, it's fallen out of all the
search engines.  It was from the "archive/xfs-import.git" tree on
oss.sgi.com:

https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi

but archive.org doesn't have a copy of the git tree. It contained
the XFS history right back to the first Irix commit in 1993. Some of
us still have copies of it sitting around....

Cheers,

Dave.
Linus Torvalds Jan. 17, 2019, 4:51 a.m. UTC | #88
On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Your patch 3/3 just removes the test.  Am I right in thinking that it
> doesn't need to be *moved* because the existing test after !PageUptodate
> catches it?

That's the _hope_.

That's the simplest patch I can come up with as a potential solution.
But it's possible that there's some nasty performance regression
because somebody really relies on not even triggering read-ahead, and
we might need to do some totally different thing.

So it may be that somebody has a case that really wants something
else, and we'd need to move the RWF_NOWAIT test elsewhere and do
something slightly more complicated. As with the mincore() change,
maybe reality doesn't like the simplest fix...

> Of course, there aren't any tests for RWF_NOWAIT in xfstests.  Are there
> any in LTP?

RWF_NOWAIT is actually _fairly_ new.  It was introduced "only" about
18 months ago and made it into 4.13.

Which makes me hopeful there aren't a lot of people who care deeply.

And starting readahead *may* actually be what a RWF_NOWAIT read user
generally wants, so for all we know it might even improve performance
and/or allow new uses. With the "start readahead but don't wait for
it" semantics, you can have a model where you try to handle all the
requests that can be handled out of cache first (using RWF_NOWAIT) and
then when you've run out of cached cases you clear the RWF_NOWAIT
flag, but now the IO has been started early (and could overlap with
the cached request handling), so then when you actually do a blocking
version, you get much better performance.

So there is an argument that removing that one RWF_NOWAIT case might
actually be a good thing in general, outside of  the "don't allow
probing the cache without changing the state of it" issue.

But that's handwavy and optimistic. Reality is often not as accomodating ;)

                   Linus
Jiri Kosina Jan. 17, 2019, 8:18 a.m. UTC | #89
On Thu, 17 Jan 2019, Dave Chinner wrote:

> > > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > > Author: Steve Lord <lord@sgi.com>
> > > Date:   Mon Mar 5 16:47:52 2001 +0000
> > > 
> > >     Add bounds checking for direct I/O, do the cache invalidation for
> > >     data coherency on direct I/O.
> > 
> > Out of curiosity, which repository is this from please? Even google 
> > doesn't seem to know about this SHA.
> 
> because oss.sgi.com is no longer with us, it's fallen out of all the
> search engines.  It was from the "archive/xfs-import.git" tree on
> oss.sgi.com:
> 
> https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi
> 
> but archive.org doesn't have a copy of the git tree. It contained
> the XFS history right back to the first Irix commit in 1993. Some of
> us still have copies of it sitting around....

For cases like this, would it be worth pushing it to git.kernel.org as an 
frozen historical reference archive?

Thanks,
Cyril Hrubis Jan. 17, 2019, 9:52 a.m. UTC | #90
Hi!
> > Your patch 3/3 just removes the test.  Am I right in thinking that it
> > doesn't need to be *moved* because the existing test after !PageUptodate
> > catches it?
> 
> Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and 
> if it's actually set, it'll be handled by the !PageUpdtodate case.
> 
> > Of course, there aren't any tests for RWF_NOWAIT in xfstests.  Are there 
> > any in LTP?
> 
> Not in the released version AFAIK. I've asked the LTP maintainer (in our 
> internal bugzilla) to take care of this thread a few days ago, but not 
> sure what came out of it. Adding him (Cyril) to CC.

So far not much, I've looked over our mincore() tests and noted down how
to improve them here:

https://github.com/linux-test-project/ltp/issues/461

We do plan to test the final mincore() fix:

https://github.com/linux-test-project/ltp/issues/460

And we do have RWF_NOWAIT tests on our TODO for some time as well:

https://github.com/linux-test-project/ltp/issues/286

I guess I can raise priority for that one so that we have basic
functional tests in a week or so. Also if anyone has some RWF_NOWAIT
tests already it would be nice if these could be shared with us.


[A bit off topic rant]

I've been telling kernel developers for years that if they have a test
code they used when developing a kernel feature that they should share
it with us (LTP community) and we will turn these into automated tests
and maintain them for free. LTP is also used in many QA departements
around the word so such tests will end up executed in different
environments also for free. Sadly this does not happen much and there
are only few exceptions so far. But maybe I wasn't shouting loudly
enough.
Dave Chinner Jan. 17, 2019, 9:06 p.m. UTC | #91
On Thu, Jan 17, 2019 at 09:18:41AM +0100, Jiri Kosina wrote:
> On Thu, 17 Jan 2019, Dave Chinner wrote:
> 
> > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > > > Author: Steve Lord <lord@sgi.com>
> > > > Date:   Mon Mar 5 16:47:52 2001 +0000
> > > > 
> > > >     Add bounds checking for direct I/O, do the cache invalidation for
> > > >     data coherency on direct I/O.
> > > 
> > > Out of curiosity, which repository is this from please? Even google 
> > > doesn't seem to know about this SHA.
> > 
> > because oss.sgi.com is no longer with us, it's fallen out of all the
> > search engines.  It was from the "archive/xfs-import.git" tree on
> > oss.sgi.com:
> > 
> > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi
> > 
> > but archive.org doesn't have a copy of the git tree. It contained
> > the XFS history right back to the first Irix commit in 1993. Some of
> > us still have copies of it sitting around....
> 
> For cases like this, would it be worth pushing it to git.kernel.org as an 
> frozen historical reference archive?

I'm not sure we should be putting code from Irix on kernel.org. I
uploaded a copy to github a few months ago so XFS devs could easily
reference relevant commits in email. The reasons we decided not to
upload it to kernel.org should be clear from the readme...

https://github.com/dchinner/xfs-history

Cheers,

Dave.
Vlastimil Babka Jan. 17, 2019, 9:45 p.m. UTC | #92
On 1/16/2019 8:52 AM, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org>
> wrote:
>>
>> There is a difference with your previous patch though, that used to list no
>> page in core when it didn't know; this patch lists pages as in core when it
>> refuses to tell. I don't think that's very important, though.

I've argued previously that reporting false positives (as your patch does)
should be better, otherwise there might be somebody trying to fault in their
pages in a loop until mincore reports positive, which would become an endless
loop. So agreed with your change.

Or maybe we could resort to the 5.0-rc1 page table check (that is now being
reverted) but only in cases when we are not allowed the page cache residency
check? Or would that be needlessly complicated? And it would be able to leak if
a page was evicted from the page cache...

> Is there a reason not to return -EPERM in this case?

That would definitely break somebody.

>>
>> If anything, the 0400 user-owner file might be a problem in some edge
>> case (e.g. if you're preloading git directories, many objects are 0444);
>> should we *also* check ownership?...
> 
> Yes, this seems valuable. Some databases with immutable files (e.g. git, as
> you've mentioned) conceivably operate this way.
> 
> Josh
>
Linus Torvalds Jan. 18, 2019, 4:49 a.m. UTC | #93
On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Or maybe we could resort to the 5.0-rc1 page table check (that is now being
> reverted) but only in cases when we are not allowed the page cache residency
> check? Or would that be needlessly complicated?

I think it would  be good fallback semantics, but I'm not sure it's
worth it. Have you tried writing a patch for it? I don't think you'd
want to do the check *when* you find a hole, so you'd have to do it
upfront and then pass the cached data down with the private pointer
(or have a separate "struct mm_walk" structure, perhaps?

So I suspect we're better off with the patch we have. But if somebody
*wants* to try to do that fancier patch, and it doesn't look
horrendous, I think it might be the "quality" solution.

              Linus
Linus Torvalds Jan. 18, 2019, 4:54 a.m. UTC | #94
On Thu, Jan 17, 2019 at 4:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Your patch 3/3 just removes the test.  Am I right in thinking that it
> > doesn't need to be *moved* because the existing test after !PageUptodate
> > catches it?
>
> That's the _hope_.
>
> That's the simplest patch I can come up with as a potential solution.
> But it's possible that there's some nasty performance regression
> because somebody really relies on not even triggering read-ahead, and
> we might need to do some totally different thing.

Oh, and somebody should probably check that there isn't some simple
way to just avoid that readahead code entirely.

In particular, right now we skip readahead for at least these cases:

        /* no read-ahead */
        if (!ra->ra_pages)
                return;

        if (blk_cgroup_congested())
                return;

and I don't think we need to worry about the cgroup congestion case -
if the attack has to also congest its cgroup with IO, I think they
have bigger problems.

And I think 'ra_pages' can be zero only in the presence of IO errors,
but I might be wrong. It would be good if somebody double-checks that.

               Linus
Tejun Heo Jan. 18, 2019, 2:23 p.m. UTC | #95
Hello, Linus.

On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
> 
>     fd = open (name, O_RDONLY)
>     ..
>     mmap(window, len, PROT_NONE, MAP_PRIVATE, ..

So, folks here have been using fincore(1) for diagnostic purposes and
are also looking to expand on it to investigate per-cgroup cache
usages (mmap -> mincore -> /proc/self/pagemap -> /proc/kpagecgroup ->
cgroup path).

These are all root-only usages to find out what's going on with the
whole page cache.  We aren't attached to doing things this particular
way but it'd suck if there's no way.

Thanks.
Vlastimil Babka Jan. 18, 2019, 6:58 p.m. UTC | #96
On 1/18/19 5:49 AM, Linus Torvalds wrote:
> On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Or maybe we could resort to the 5.0-rc1 page table check (that is now being
>> reverted) but only in cases when we are not allowed the page cache residency
>> check? Or would that be needlessly complicated?
> 
> I think it would  be good fallback semantics, but I'm not sure it's
> worth it. Have you tried writing a patch for it? I don't think you'd
> want to do the check *when* you find a hole, so you'd have to do it
> upfront and then pass the cached data down with the private pointer
> (or have a separate "struct mm_walk" structure, perhaps?
> 
> So I suspect we're better off with the patch we have. But if somebody
> *wants* to try to do that fancier patch, and it doesn't look
> horrendous, I think it might be the "quality" solution.

I thought to drop the idea because of leaking that page has been
evicted, but then I realized there are other ways to check for that
anyway in /proc. So I'll try, but probably not until after next week. If
somebody else wants to, they are welcome. As you say, the current
solution should be ok, so that would be a patch on top anyway, for
bisectability etc.

Vlastimil

>               Linus
>
Linus Torvalds Jan. 23, 2019, 8:27 p.m. UTC | #97
On Thu, Jan 17, 2019 at 9:23 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the
> kernel with the three topmost patches from
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel
>
> applied (also attaching to this mail), and no obvious breakage popped up.
>
> So if noone sees any principal problem there, I'll happily submit it with
> proper attribution etc.

So this seems to have died down, and a week later we seem to not have
a lot of noise here any more. I think it means people either weren't
testing it, or just didn't find any real problems.

I've reverted the 'let's try to just remove the code' part in my tree.
But I didn't apply the two other patches yet. Any final comments
before that should happen?

                     Linus
Linus Torvalds Jan. 23, 2019, 8:35 p.m. UTC | #98
On Thu, Jan 24, 2019 at 9:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've reverted the 'let's try to just remove the code' part in my tree.
> But I didn't apply the two other patches yet. Any final comments
> before that should happen?

Side note: the inode_permission() addition to can_do_mincore() in that
patch 0002, seems to be questionable. We do

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+       return vma_is_anonymous(vma)
+               || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
+               || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}

note how it tests whether vma->vm_file is NULL for the FMODE_WRITE
test, but not for the inode_permission() test.

So either we test unnecessarily in the second line, or we don't
properly test it in the third one.

I think the "test vm_file" thing may be unnecessary, because a
non-anonymous mapping should always have a file pointer and an inode.
But I could  imagine some odd case (vdso mapping, anyone?) that
doesn't have a vm_file, but also isn't anonymous.

Anybody?

Anyway, it's one reason why I didn't actually apply those other two
patches yet. This may be a 5.1 issue..

                   Linus
Jiri Kosina Jan. 23, 2019, 11:12 p.m. UTC | #99
On Thu, 24 Jan 2019, Linus Torvalds wrote:

> Side note: the inode_permission() addition to can_do_mincore() in that
> patch 0002, seems to be questionable. We do
> 
> +static inline bool can_do_mincore(struct vm_area_struct *vma)
> +{
> +       return vma_is_anonymous(vma)
> +               || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
> +               || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
> +}
> 
> note how it tests whether vma->vm_file is NULL for the FMODE_WRITE
> test, but not for the inode_permission() test.
> 
> So either we test unnecessarily in the second line, or we don't
> properly test it in the third one.
> 
> I think the "test vm_file" thing may be unnecessary, because a
> non-anonymous mapping should always have a file pointer and an inode.
> But I could  imagine some odd case (vdso mapping, anyone?) that
> doesn't have a vm_file, but also isn't anonymous.

Hmm, good point.

So dropping the 'vma->vm_file' test and checking whether given vma is 
special mapping should hopefully provide the desired semantics, shouldn't 
it?
Linus Torvalds Jan. 24, 2019, 12:20 a.m. UTC | #100
On Thu, Jan 24, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote:
>
> >
> > I think the "test vm_file" thing may be unnecessary, because a
> > non-anonymous mapping should always have a file pointer and an inode.
> > But I could  imagine some odd case (vdso mapping, anyone?) that
> > doesn't have a vm_file, but also isn't anonymous.
>
> Hmm, good point.
>
> So dropping the 'vma->vm_file' test and checking whether given vma is
> special mapping should hopefully provide the desired semantics, shouldn't
> it?

Maybe. But on the whole I think it would  be simpler and more
straightforward to just instead add a vm_file test for the
inode_permission() case. That way you at least know that you aren't
following a NULL pointer.

If the file then turns out to be some special thing, it doesn't really
_matter_, I think. It won't have anything in the page cache etc, but
the code should "work".

             Linus
Jiri Kosina Jan. 24, 2019, 2:25 p.m. UTC | #101
On Thu, 24 Jan 2019, Dominique Martinet wrote:

> Jiri, you've offered resubmitting the last two patches properly, can you 
> incorporate this change or should I just send this directly? (I'd take 
> most of your commit message and add your name somewhere)

I've been running some basic smoke testing with the kernel from

	https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2

(attaching the respective two patches to apply on top of latest Linus' 
tree to this mail as well), and everything looks good so far.

Thanks,
Jiri Kosina Jan. 27, 2019, 10:35 p.m. UTC | #102
On Thu, 24 Jan 2019, Jiri Kosina wrote:

> > Jiri, you've offered resubmitting the last two patches properly, can you 
> > incorporate this change or should I just send this directly? (I'd take 
> > most of your commit message and add your name somewhere)
> 
> I've been running some basic smoke testing with the kernel from
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2
> 
> (attaching the respective two patches to apply on top of latest Linus' 
> tree to this mail as well), and everything looks good so far.

So, any objections to aproaching it this way?

I've not been able to spot any obvious breakage so far with it.

Thanks,
Dominique Martinet Jan. 28, 2019, 12:05 a.m. UTC | #103
Jiri Kosina wrote on Sun, Jan 27, 2019:
> So, any objections to aproaching it this way?

I'm not sure why I'm the main recipient of that mail but answering
because I am -- let's get these patches in through the regular -mm tree
though
Cyril Hrubis Jan. 28, 2019, 1:49 p.m. UTC | #104
Hi!
> > Of course, there aren't any tests for RWF_NOWAIT in xfstests.  Are there 
> > any in LTP?

Just FYI I've send a patch with basic RWF_NOWAIT test for review to LTP
ML and also CCed mailing lists from this thread.

https://lkml.org/lkml/2019/1/28/416
Jiri Kosina Jan. 29, 2019, 11:52 p.m. UTC | #105
On Mon, 28 Jan 2019, Dominique Martinet wrote:

> > So, any objections to aproaching it this way?
> 
> I'm not sure why I'm the main recipient of that mail but answering
> because I am -- let's get these patches in through the regular -mm tree
> though

*prod to mm maintainers* (at least for an opinion)
Michal Hocko Jan. 30, 2019, 9:09 a.m. UTC | #106
On Wed 30-01-19 00:52:02, Jiri Kosina wrote:
> On Mon, 28 Jan 2019, Dominique Martinet wrote:
> 
> > > So, any objections to aproaching it this way?
> > 
> > I'm not sure why I'm the main recipient of that mail but answering
> > because I am -- let's get these patches in through the regular -mm tree
> > though
> 
> *prod to mm maintainers* (at least for an opinion)

Could you repost those patches please? The thread is long and it is not
really clear what is the most up-to-date state of patches (at least to
me).
Jiri Kosina Jan. 30, 2019, 12:29 p.m. UTC | #107
On Wed, 30 Jan 2019, Michal Hocko wrote:

> > > I'm not sure why I'm the main recipient of that mail but answering 
> > > because I am -- let's get these patches in through the regular -mm 
> > > tree though
> > 
> > *prod to mm maintainers* (at least for an opinion)
> 
> Could you repost those patches please? The thread is long and it is not
> really clear what is the most up-to-date state of patches (at least to
> me).

Vlastimil seems to have one extra patch to go on top, so we agreed that 
he'll be sending that as a complete self-contained series (either as a 
followup to the very first e-mail in this monsterthread, or completely 
separately) shortly.

Thanks,
Nicolai Stange Feb. 20, 2019, 3:49 p.m. UTC | #108
Linus Torvalds <torvalds@linux-foundation.org> writes:

<snip>

> So in order to use it as a signal, first you have to first scrub the
> cache (because if the page was already there, there's no signal at
> all), and then for the signal to be as useful as possible, you're also
> going to want to try to get out more than one bit of information: you
> are going to try to see the patterns and the timings of how it gets
> filled.
>
> And that's actually quite painful. You don't know the initial cache
> state, and you're not (in general) controlling the machine entirely,
> because there's also that actual other entity that you're trying to
> attack and see what it does.
>
> So what you want to do is basically to first make sure the cache is
> scrubbed (only for the pages you're interested in!), then trigger
> whatever behavior you are looking for, and then look how that affected
> the cache.
>
> In other words,  you want *multiple* residency status check - first to
> see what the cache state is (because you're going to want that for
> scrubbing), then to see that "yes, it's gone" when doing the
> scrubbing, and then to see the *pattern* and timings of how things are
> brought in.

<snap>

In an attempt to gain a better understanding of the guided eviction part
resp. the relevance of mincore() & friends to that, I worked on
reproducing the results from [1], section 6.1 ("Efficient Page Cache
Eviction on Linux").

In case anybody wants to run their own experiments: the sources can
be found at [2].

Disclaimer: I don't have access to the sources used by the [1]-paper's
authors nor do I know anything about their experimental setup. So it
might very well be the case, that my implementation is completely
different and/or inefficient.

Anyways, quoting from [1], section 6.1:

  "Eviction Set 1. These are pages already in the page cache,
   used by other processes. To keep them in the page cache,
   a thread continuously accesses these pages while also keep-
   ing the system load low by using sched yield and sleep.
   Consequently, they are among the most recently accessed
   pages of the system and eviction of these pages becomes
   highly unlikely."

I had two questions:
1.) Do the actual contents of "Eviction set 1" matter for the guided
    eviction's performance or can they as well be arbitrary but fixed?
    Because if the set's actual contents weren't of any importance,
    then mincore() would not be needed to initialize it with "pages
    already in the page cache".
2.) How does keeping some fixed set resident + doing some IO compare to
    simply streaming a huge random file through the page cache?

(To make it explicit: I didn't look into the probe part of the attack
 or the checking of the victim page's residency status as a
 termination condition for the eviction run.)

Obviously, there are two primary factors affecting the victim page
eviction performance: the file page cache size and disk read
bandwidth.

Naively speaking, I would suppose that keeping a certain set resident is
a cheap and stable way to constrain IO to the remaining portion of the
page cache and thus, reduce the amount of data required to be read from
disk until the victim page gets evicted.


Results summary (details can be found at the end of this mail):
- The baseline benchmark of simply streaming random data
  through the page cache behaves as expected:

    avg of "Inactive(file)" / avg of "victim page eviction time"

  yields ~480MB/s, which approx. matches my disk's read bandwidth
  (note: the victim page was mapped as !PROT_EXEC).

- I didn't do any sophisticated fine-tuning wrt. to parameters, but
  for the configuration yielding the best result, the average victim
  page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the
  "random but fixed resident set method". That's an improvement by a
  factor of 2.6 over the baseline "streaming random data method" (with
  the same amount of anonymous memory, i.e. "Eviction set 3",
  allocated: 7GB out of a total of 8GB).

- In principle, question 1.) can't be answered by experiment without
  controlling the initial, legitimate system workload. I did some lax
  tests on my desktop running firefox, libreoffice etc. though and of
  course, overall responsiveness got a lot better if the "Eviction set
  1" had been populated with pages already resident at the time the
  "attack" started. But the victim page eviction times didn't seem to
  improve -- at least not by factors such that my biased mind would
  have been able to recognize any change.

In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident
improved the victim page eviction performance by some factor over the
"streaming" baseline, where "Eviction set 1" was populated from a
single, attacker-controlled file and mincore() was not needed for
determining its initial contents.

To my surprise though, I needed to rely on mincore() at some other
place, namely *my* implementation of keeping the resident set
resident. My first naive approach was to have a single thread repeatedly
iterating over the pages and reading the first few bytes from each
through a mmapped area. That did not work out, because, even for smaller
resident sets of 1GB and if run with realtime priority, the accessing
thread would at some point in time encounter a reclaimed page and have
to wait for the page fault to get served. While waiting for that, even
more of the resident pages are likely to get reclaimed, causing
additional delays later on. Eventually, the resident page accessor
looses the game and will encounter page faults for almost the whole
resident set (which isn't resident anymore). I worked around this by
making the accessor thread check page residency via mincore(), touch
only the resident ones and queue the others to some refault ("rewarm")
thread. From briefly looking at iostat, this rewarmer thread actually
seemed to saturate the disk and thus, there was no need for additional
IO to put pressure on the page cache. For completeness, the amount of
pages from the resident set actually found resident made up for ~97% of
all file page cache pages (Inactive+Active(file)).

Note that the way mincore() is used here is different than for the
probe part of the attack: for probing, we'd like to know when a victim
page has been faulted in again, while the residency keeper needs to
check that some page has not been reclaimed before accessing
it. Furthermore, mincore() is run on pages from an attacker-controlled
and -owned file here. AFAICS, the patches currently under review
(c.f. [3]) don't mitigate against this latter abuse of mincore() & Co.

I personally doubt that doing something about it would be worth it
though: first of all, until proven otherwise, I'm assuming that the
improvement of the "resident set" based eviction method over the
"streaming" baseline is not by orders of magnitude and that the victim
page eviction times interesting to attackers (let's say better than
500ms) are achievable only under certain conditions: no swap and the
ability to block a large fraction of total memory with anonymous
mappings.

What's more, I can imagine that there are other ways to keep the
resident set resident without relying on mincore() at all: I haven't
tried it, but simply spawning a larger number of accessor threads, each
at a different position within "Eviction set 1", and hoping that most of
the time at least one of them finds a resident page to touch, might
work, too.


Thanks,

Nicolai


Experiment setup + numbers
==========================
My laptop has 8GB of RAM, a SSD and runs OpenSUSE 42.3 kernel package
version 4.4.165-81. I stopped almost all services, but didn't setup any
CPU isolation.

I ran 5 x 2 experiments where I allocated (and filled, of course)
3,4,5,6,7GB of anonymous memory each and compared the baseline
"read-in-a-huge-file" (mapped PROT_EXEC) results with the resident set
based eviction for each of these. While doing so, I continuously
measured eviction times of a !PROT_EXEC victim page and reported the
'Active(file)' + 'Inactive(file)' statistics from /proc/meminfo.


Baseline "streaming" benchmark results:

anon mem (GB):		7        6         5         4         3
Inactive(file) (MB):	189.3389  709.7063 1221.0910 1735.7510 2247.3340
eviction time (ms):	386.7712 1453.3975 2533.6084 3622.4995 4694.4668
quotient (MB/s):	489.5372  488.3085  481.9573  479.1584  478.7198

and, for comparison with the results from the resident set based eviction
below:
Inactive+Active(file) (MB):	358      1390      2415      3445      4469


Resident set based eviction:

For the results below, I chose to draw the resident set from a single
file filled with random data and made it total mem - allocated anon
mem in size. The disk bandwidth was saturated all the time.

anon mem (GB):			7        6        5         4         3
eviction time (ms):		146.6296 428.9594 620.2084  863.1423  977.8963
improvement over baseline:	2.6      3.4      4.1       4.2       4.8
Inactive+Active(file) (MB):	429      1449     2471      3494      4515
resident from res. set (MB):	417      1428     2426      3424      4429
fraction:			97.3%    98.6%    98.2%     98.0%     98.1%



[1] https://arxiv.org/abs/1901.01161 ("Page Cache Attacks")
[2] https://github.com/nicstange/pgc
[3] https://lkml.kernel.org/r/20190130124420.1834-1-vbabka@suse.cz
(*) The distribution of eviction times is not Gaussian.
diff mbox series

Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..afb8635e925e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -41,6 +41,7 @@  Currently, these files are in /proc/sys/vm:
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
+- mincore_privileged
 - mmap_min_addr
 - mmap_rnd_bits
 - mmap_rnd_compat_bits
@@ -485,6 +486,14 @@  files and similar are considered.
 The default is 1 percent.
 
 ==============================================================
+mincore_privileged:
+
+mincore() could be potentially used to mount a side-channel attack against
+pagecache metadata. This sysctl provides system administrators means to
+make it available only to processess that own CAP_SYS_ADMIN capability.
+
+The default is 0, which means mincore() can be used without restrictions.
+==============================================================
 
 mmap_min_addr
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f712e73b..f03cb07c8dd4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,7 @@  extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+extern int sysctl_mincore_privileged;
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1684,6 +1685,13 @@  static struct ctl_table vm_table[] = {
 		.extra2		= (void *)&mmap_rnd_compat_bits_max,
 	},
 #endif
+	{
+		.procname	= "mincore_privileged",
+		.data		= &sysctl_mincore_privileged,
+		.maxlen		= sizeof(sysctl_mincore_privileged),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..77d4928cdfaa 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,6 +21,8 @@ 
 #include <linux/uaccess.h>
 #include <asm/pgtable.h>
 
+int sysctl_mincore_privileged;
+
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
@@ -228,6 +230,9 @@  SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	unsigned long pages;
 	unsigned char *tmp;
 
+	if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* Check the start address: needs to be page-aligned.. */
 	if (start & ~PAGE_MASK)
 		return -EINVAL;