Message ID | 20190130124420.1834-2-vbabka@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mincore() and IOCB_NOWAIT adjustments | expand |
On Wed 30-01-19 13:44:18, Vlastimil Babka wrote: > From: Jiri Kosina <jkosina@suse.cz> > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. I agree that this is a better way than the original 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages"). One thing is still not clear to me though. Is the new owner/writeable check OK for the Netflix-like usecases? I mean does happycache have appropriate access to the cache data? I have tried to re-read the original thread but couldn't find any confirmation. I nit below > Originally-by: Linus Torvalds <torvalds@linux-foundation.org> > Originally-by: Dominique Martinet <asmadeus@codewreck.org> > Cc: Dominique Martinet <asmadeus@codewreck.org> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Kevin Easton <kevin@guarana.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Cyril Hrubis <chrubis@suse.cz> > Cc: Tejun Heo <tj@kernel.org> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Cc: Daniel Gruss <daniel@gruss.cc> > Signed-off-by: Jiri Kosina <jkosina@suse.cz> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> other than that looks good to me. Acked-by: Michal Hocko <mhocko@suse.com> If this still doesn't help happycache kind of workloads then we should add a capability check IMO but this looks like a decent foundation to me. > --- > mm/mincore.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/mm/mincore.c b/mm/mincore.c > index 218099b5ed31..747a4907a3ac 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > return 0; > } > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > +{ > + return vma_is_anonymous(vma) || > + (vma->vm_file && > + (inode_owner_or_capable(file_inode(vma->vm_file)) > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); > +} This is hard to read. Can we do if (vma_is_anonymous(vma)) return true; if (!vma->vm_file) return false; return inode_owner_or_capable(file_inode(vma->vm_file)) || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
Michal Hocko wrote on Thu, Jan 31, 2019: > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > > (if it tried to) successfully open for writing. > > I agree that this is a better way than the original 574823bfab82 > ("Change mincore() to count "mapped" pages rather than "cached" pages"). > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. It's enough for my use cases and Josh didn't seem to oppose, but since he's not in Cc I don't think he would answer -- added him now :) FWIW happycache writes in the current directory by default so I assume in the way they use it it would usually have access one way or another. > If this still doesn't help happycache kind of workloads then we should > add a capability check IMO but this looks like a decent foundation to > me. the inode_owner_or_capable/inode_permission helpers already do allow quite a few capabilities there
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko <mhocko@kernel.org> wrote: > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. The owner/writable check will suit every database that I've ever used happycache with, including cassandra, postgres and git. Acked-by: Josh Snyder <joshs@netflix.com>
Here's updated version with Michal's suggestion, and acks: I think this patch is fine to go, less sure about 2/3 and 3/3. ----8<---- From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH v2] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. [mhocko@suse.com: restructure can_do_mincore() conditions] Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Josh Snyder <joshs@netflix.com> Acked-by: Michal Hocko <mhocko@suse.com> --- mm/mincore.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..b8842b849604 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return true; + if (!vma->vm_file) + return false; + return inode_owner_or_capable(file_inode(vma->vm_file)) || + inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err;
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka <vbabka@suse.cz> wrote: > From: Jiri Kosina <jkosina@suse.cz> > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. "for writing" comes as a bit of a surprise. Why not for reading? Could we please explain the reasoning in the changelog and in the (presently absent) comments which describe can_do_mincore()? > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v > vma = find_vma(current->mm, addr); > if (!vma || addr < vma->vm_start) > return -ENOMEM; > - mincore_walk.mm = vma->vm_mm; > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > + if (!can_do_mincore(vma)) { > + unsigned long pages = (end - addr) >> PAGE_SHIFT; I'm not sure this is correct in all cases. If addr = 4095 vma->vm_end = 4096 pages = 1000 then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it should have been 1. Please check? A mincore test suite in tools/testing/selftests would be useful, methinks. To exercise such corner cases, check for future breakage, etc. > + memset(vec, 1, pages); > + return pages; > + } > + mincore_walk.mm = vma->vm_mm; > err = walk_page_range(addr, end, &mincore_walk); > if (err < 0) > return err;
On Wed, 6 Mar 2019, Andrew Morton wrote: > > The semantics of what mincore() considers to be resident is not completely > > clear, but Linux has always (since 2.3.52, which is when mincore() was > > initially done) treated it as "page is available in page cache". > > > > That's potentially a problem, as that [in]directly exposes meta-information > > about pagecache / memory mapping state even about memory not strictly belonging > > to the process executing the syscall, opening possibilities for sidechannel > > attacks. > > > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > > (if it tried to) successfully open for writing. > > "for writing" comes as a bit of a surprise. Why not for reading? I guess this is a rhetorical question from you :) but fair enough, good point, I'll explain this a bit more in the changelog and in the code comments. > > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v > > vma = find_vma(current->mm, addr); > > if (!vma || addr < vma->vm_start) > > return -ENOMEM; > > - mincore_walk.mm = vma->vm_mm; > > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > > + if (!can_do_mincore(vma)) { > > + unsigned long pages = (end - addr) >> PAGE_SHIFT; > > I'm not sure this is correct in all cases. If > > addr = 4095 > vma->vm_end = 4096 > pages = 1000 > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > should have been 1. Good catch! It should rather be something like unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); I'll fix that up and resend tomorrow. Thanks,
Jiri Kosina wrote on Thu, Mar 07, 2019: > > I'm not sure this is correct in all cases. If > > > > addr = 4095 > > vma->vm_end = 4096 > > pages = 1000 > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > should have been 1. > > Good catch! It should rather be something like > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would still want to count that as one page. I'm not too familiar with this area of the code, but I think there's a handy macro we can use for this, perhaps DIV_ROUND_UP(end - addr, PAGE_SIZE) ? kernel/kexec_core.c has defined PAGE_COUNT() which seems more appropriate but I do not see a global equivalent #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
On Thu, 7 Mar 2019, Dominique Martinet wrote: > > > > > > addr = 4095 > > > vma->vm_end = 4096 > > > pages = 1000 > > > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > > should have been 1. > > > > Good catch! It should rather be something like > > > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); > > That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would > still want to count that as one page. Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise. Both the variants are off. > I'm not too familiar with this area of the code, but I think there's a > handy macro we can use for this, perhaps > DIV_ROUND_UP(end - addr, PAGE_SIZE) ? > > kernel/kexec_core.c has defined PAGE_COUNT() which seems more > appropriate but I do not see a global equivalent > #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT) I'll fix that up when doing the other changes requested by Andrew. Thanks,
diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..747a4907a3ac 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) || + (vma->vm_file && + (inode_owner_or_capable(file_inode(vma->vm_file)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err;