diff mbox series

[1/3] mm/mincore: make mincore() more conservative

Message ID 20190130124420.1834-2-vbabka@suse.cz (mailing list archive)
State New, archived
Headers show
Series mincore() and IOCB_NOWAIT adjustments | expand

Commit Message

Vlastimil Babka Jan. 30, 2019, 12:44 p.m. UTC
From: Jiri Kosina <jkosina@suse.cz>

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

Originally-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally-by: Dominique Martinet <asmadeus@codewreck.org>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daniel Gruss <daniel@gruss.cc>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mincore.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Michal Hocko Jan. 31, 2019, 9:43 a.m. UTC | #1
On Wed 30-01-19 13:44:18, Vlastimil Babka wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> 
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

I agree that this is a better way than the original 574823bfab82
("Change mincore() to count "mapped" pages rather than "cached" pages").
One thing is still not clear to me though. Is the new owner/writeable
check OK for the Netflix-like usecases? I mean does happycache have
appropriate access to the cache data? I have tried to re-read the
original thread but couldn't find any confirmation.

I nit below

> Originally-by: Linus Torvalds <torvalds@linux-foundation.org>
> Originally-by: Dominique Martinet <asmadeus@codewreck.org>
> Cc: Dominique Martinet <asmadeus@codewreck.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Kevin Easton <kevin@guarana.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Cyril Hrubis <chrubis@suse.cz>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Daniel Gruss <daniel@gruss.cc>
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

other than that looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

If this still doesn't help happycache kind of workloads then we should
add a capability check IMO but this looks like a decent foundation to
me.

> ---
>  mm/mincore.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 218099b5ed31..747a4907a3ac 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	return 0;
>  }
>  
> +static inline bool can_do_mincore(struct vm_area_struct *vma)
> +{
> +	return vma_is_anonymous(vma) ||
> +		(vma->vm_file &&
> +			(inode_owner_or_capable(file_inode(vma->vm_file))
> +			 || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));
> +}

This is hard to read. Can we do
	if (vma_is_anonymous(vma))
		return true;
	if (!vma->vm_file)
		return false;
	return inode_owner_or_capable(file_inode(vma->vm_file)) ||
		inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
Dominique Martinet Jan. 31, 2019, 9:51 a.m. UTC | #2
Michal Hocko wrote on Thu, Jan 31, 2019:
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> > (if it tried to) successfully open for writing.
> 
> I agree that this is a better way than the original 574823bfab82
> ("Change mincore() to count "mapped" pages rather than "cached" pages").
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

It's enough for my use cases and Josh didn't seem to oppose, but since
he's not in Cc I don't think he would answer -- added him now :)

FWIW happycache writes in the current directory by default so I assume
in the way they use it it would usually have access one way or another.

> If this still doesn't help happycache kind of workloads then we should
> add a capability check IMO but this looks like a decent foundation to
> me.

the inode_owner_or_capable/inode_permission helpers already do allow
quite a few capabilities there
Josh Snyder Jan. 31, 2019, 5:46 p.m. UTC | #3
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko <mhocko@kernel.org> wrote:
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

The owner/writable check will suit every database that I've ever used
happycache with, including cassandra, postgres and git.

Acked-by: Josh Snyder <joshs@netflix.com>
Vlastimil Babka Feb. 1, 2019, 8:56 a.m. UTC | #4
Here's updated version with Michal's suggestion, and acks:

I think this patch is fine to go, less sure about 2/3 and 3/3.

----8<----
From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001
From: Jiri Kosina <jkosina@suse.cz>
Date: Wed, 16 Jan 2019 20:53:17 +0100
Subject: [PATCH v2] mm/mincore: make mincore() more conservative

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

[mhocko@suse.com: restructure can_do_mincore() conditions]
Originally-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally-by: Dominique Martinet <asmadeus@codewreck.org>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daniel Gruss <daniel@gruss.cc>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Josh Snyder <joshs@netflix.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/mincore.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..b8842b849604 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return 0;
 }
 
+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+	if (vma_is_anonymous(vma))
+		return true;
+	if (!vma->vm_file)
+		return false;
+	return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+		inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
 /*
  * Do a chunk of "sys_mincore()". We've already checked
  * all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 	vma = find_vma(current->mm, addr);
 	if (!vma || addr < vma->vm_start)
 		return -ENOMEM;
-	mincore_walk.mm = vma->vm_mm;
 	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+	if (!can_do_mincore(vma)) {
+		unsigned long pages = (end - addr) >> PAGE_SHIFT;
+		memset(vec, 1, pages);
+		return pages;
+	}
+	mincore_walk.mm = vma->vm_mm;
 	err = walk_page_range(addr, end, &mincore_walk);
 	if (err < 0)
 		return err;
Andrew Morton March 6, 2019, 11:13 p.m. UTC | #5
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> From: Jiri Kosina <jkosina@suse.cz>
> 
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

"for writing" comes as a bit of a surprise.  Why not for reading?

Could we please explain the reasoning in the changelog and in the
(presently absent) comments which describe can_do_mincore()?

> @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
>  	vma = find_vma(current->mm, addr);
>  	if (!vma || addr < vma->vm_start)
>  		return -ENOMEM;
> -	mincore_walk.mm = vma->vm_mm;
>  	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> +	if (!can_do_mincore(vma)) {
> +		unsigned long pages = (end - addr) >> PAGE_SHIFT;

I'm not sure this is correct in all cases.   If

	addr = 4095
	vma->vm_end = 4096
	pages = 1000

then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
should have been 1.

Please check?

A mincore test suite in tools/testing/selftests would be useful,
methinks.  To exercise such corner cases, check for future breakage,
etc.

> +		memset(vec, 1, pages);
> +		return pages;
> +	}
> +	mincore_walk.mm = vma->vm_mm;
>  	err = walk_page_range(addr, end, &mincore_walk);
>  	if (err < 0)
>  		return err;
Jiri Kosina March 7, 2019, 12:01 a.m. UTC | #6
On Wed, 6 Mar 2019, Andrew Morton wrote:

> > The semantics of what mincore() considers to be resident is not completely
> > clear, but Linux has always (since 2.3.52, which is when mincore() was
> > initially done) treated it as "page is available in page cache".
> > 
> > That's potentially a problem, as that [in]directly exposes meta-information
> > about pagecache / memory mapping state even about memory not strictly belonging
> > to the process executing the syscall, opening possibilities for sidechannel
> > attacks.
> > 
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> > (if it tried to) successfully open for writing.
> 
> "for writing" comes as a bit of a surprise.  Why not for reading?

I guess this is a rhetorical question from you :) but fair enough, good 
point, I'll explain this a bit more in the changelog and in the code 
comments.

> > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
> >  	vma = find_vma(current->mm, addr);
> >  	if (!vma || addr < vma->vm_start)
> >  		return -ENOMEM;
> > -	mincore_walk.mm = vma->vm_mm;
> >  	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> > +	if (!can_do_mincore(vma)) {
> > +		unsigned long pages = (end - addr) >> PAGE_SHIFT;
> 
> I'm not sure this is correct in all cases.   If
> 
> 	addr = 4095
> 	vma->vm_end = 4096
> 	pages = 1000
> 
> then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> should have been 1.

Good catch! It should rather be something like

	unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

I'll fix that up and resend tomorrow.

Thanks,
Dominique Martinet March 7, 2019, 12:40 a.m. UTC | #7
Jiri Kosina wrote on Thu, Mar 07, 2019:
> > I'm not sure this is correct in all cases.   If
> > 
> > 	addr = 4095
> > 	vma->vm_end = 4096
> > 	pages = 1000
> > 
> > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > should have been 1.
> 
> Good catch! It should rather be something like
> 
> 	unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
still want to count that as one page.
I'm not too familiar with this area of the code, but I think there's a
handy macro we can use for this, perhaps
  DIV_ROUND_UP(end - addr, PAGE_SIZE) ?

kernel/kexec_core.c has defined PAGE_COUNT() which seems more
appropriate but I do not see a global equivalent
#define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
Jiri Kosina March 7, 2019, 5:46 a.m. UTC | #8
On Thu, 7 Mar 2019, Dominique Martinet wrote:

> > > 
> > > 	addr = 4095
> > > 	vma->vm_end = 4096
> > > 	pages = 1000
> > > 
> > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > > should have been 1.
> > 
> > Good catch! It should rather be something like
> > 
> > 	unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);
> 
> That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
> still want to count that as one page.

Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise. 
Both the variants are off.

> I'm not too familiar with this area of the code, but I think there's a
> handy macro we can use for this, perhaps
>   DIV_ROUND_UP(end - addr, PAGE_SIZE) ?
> 
> kernel/kexec_core.c has defined PAGE_COUNT() which seems more
> appropriate but I do not see a global equivalent
> #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

I'll fix that up when doing the other changes requested by Andrew.

Thanks,
diff mbox series

Patch

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..747a4907a3ac 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,14 @@  static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return 0;
 }
 
+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+	return vma_is_anonymous(vma) ||
+		(vma->vm_file &&
+			(inode_owner_or_capable(file_inode(vma->vm_file))
+			 || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));
+}
+
 /*
  * Do a chunk of "sys_mincore()". We've already checked
  * all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +197,13 @@  static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 	vma = find_vma(current->mm, addr);
 	if (!vma || addr < vma->vm_start)
 		return -ENOMEM;
-	mincore_walk.mm = vma->vm_mm;
 	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+	if (!can_do_mincore(vma)) {
+		unsigned long pages = (end - addr) >> PAGE_SHIFT;
+		memset(vec, 1, pages);
+		return pages;
+	}
+	mincore_walk.mm = vma->vm_mm;
 	err = walk_page_range(addr, end, &mincore_walk);
 	if (err < 0)
 		return err;