diff mbox

Don't mlock guardpage if the stack is growing up

Message ID alpine.DEB.2.00.1105091246550.32299@artax.karlin.mff.cuni.cz (mailing list archive)
State Awaiting Upstream

Commit Message

Mikulas Patocka May 9, 2011, 11:01 a.m. UTC
On Sun, 8 May 2011, Linus Torvalds wrote:

> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
> <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> > up-growing stack). lvm2 calculates the number of used pages when locking
> > and when unlocking and reports an internal error if the numbers mismatch.
> 
> This patch won't apply on current kernels (including stable) because
> of commit a1fde08c74e9 that changed the test of "pages" to instead
> just test "flags & FOLL_MLOCK".
> 
> That should be trivial to fix up.
> 
> However, I really don't much like this complex test:
> 
> >  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
> >  {
> > -       return (vma->vm_flags & VM_GROWSDOWN) &&
> > +       return ((vma->vm_flags & VM_GROWSDOWN) &&
> >                (vma->vm_start == addr) &&
> > -               !vma_stack_continue(vma->vm_prev, addr);
> > +               !vma_stack_continue(vma->vm_prev, addr)) ||
> > +              ((vma->vm_flags & VM_GROWSUP) &&
> > +               (vma->vm_end == addr + PAGE_SIZE) &&
> > +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
> >  }
> 
> in that format. It gets really hard to read, and I think you'd be
> better off writing it as two helper functions (or macros) for the two
> cases, and then have
> 
>   static inline int stack_guard_page(struct vm_area_struct *vma,
> unsigned long addr)
>   {
>     return stack_guard_page_growsdown(vma, addr) ||
>       stack_guard_page_growsup(vma, addr);
>   }
> 
> I'd also like to verify that it doesn't actually generate any extra
> code for the common case (iirc VM_GROWSUP is 0 for the architectures
> that don't need it, and so the compiler shouldn't generate any extra
> code, but I'd like that mentioned and verified explicitly).
> 
> Hmm?
> 
> Other than that it looks ok to me.
> 
> That said, could we please fix LVM to not do that crazy sh*t in the
> first place? The STACK_GROWSUP case is never going to have a lot of
> testing, this is just sad.

LVM reads process maps from /proc/self/maps and locks them with mlock.

Why doesn't it use mlockall()? Because glibc maps all locales into the 
process. Glibc packs all locales into a 100MB file and maps that file into 
every process. Even if the process uses just one locale, glibc maps them all.

So, when LVM used mlockall(), it consumed >100MB of memory, which caused 
out-of-memory problems in system installers.

So an alternate way of locking was added to LVM --- read all the maps and 
lock them, except for the glibc locale file.

The real fix would be to fix glibc not to map 100MB to every process.

>                    Linus
> 

This is the updated patch. I tested it on x86-64, and it doesn't change 
the generated code.

Mikulas

---

Don't lock guardpage if the stack is growing up

The Linux kernel excludes the guard page when performing mlock on a VMA
with a down-growing stack. However, some architectures have an up-growing
stack, and locking the guard page should be excluded in that case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with an
up-growing stack). lvm2 calculates the number of used pages when locking
and when unlocking, and reports an internal error if the numbers mismatch.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

---
 include/linux/mm.h |   10 +++++++++-
 mm/memory.c        |   21 ++++++++++++++++++---
 2 files changed, 27 insertions(+), 4 deletions(-)

Comments

Zdenek Kabelac May 9, 2011, 11:43 a.m. UTC | #1
On 9.5.2011 13:01, Mikulas Patocka wrote:
> 
> 
> On Sun, 8 May 2011, Linus Torvalds wrote:
> 
>> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
>> <mikulas@artax.karlin.mff.cuni.cz> wrote:
>>>
>>> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
>>> up-growing stack). lvm2 calculates the number of used pages when locking
>>> and when unlocking and reports an internal error if the numbers mismatch.
>>
>> This patch won't apply on current kernels (including stable) because
>> of commit a1fde08c74e9 that changed the test of "pages" to instead
>> just test "flags & FOLL_MLOCK".
>>
>> That should be trivial to fix up.
>>
>> However, I really don't much like this complex test:
>>
>>>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>> -       return (vma->vm_flags & VM_GROWSDOWN) &&
>>> +       return ((vma->vm_flags & VM_GROWSDOWN) &&
>>>                (vma->vm_start == addr) &&
>>> -               !vma_stack_continue(vma->vm_prev, addr);
>>> +               !vma_stack_continue(vma->vm_prev, addr)) ||
>>> +              ((vma->vm_flags & VM_GROWSUP) &&
>>> +               (vma->vm_end == addr + PAGE_SIZE) &&
>>> +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>>>  }
>>
>> in that format. It gets really hard to read, and I think you'd be
>> better off writing it as two helper functions (or macros) for the two
>> cases, and then have
>>
>>   static inline int stack_guard_page(struct vm_area_struct *vma,
>> unsigned long addr)
>>   {
>>     return stack_guard_page_growsdown(vma, addr) ||
>>       stack_guard_page_growsup(vma, addr);
>>   }
>>
>> I'd also like to verify that it doesn't actually generate any extra
>> code for the common case (iirc VM_GROWSUP is 0 for the architectures
>> that don't need it, and so the compiler shouldn't generate any extra
>> code, but I'd like that mentioned and verified explicitly).
>>
>> Hmm?
>>
>> Other than that it looks ok to me.
>>
>> That said, could we please fix LVM to not do that crazy sh*t in the
>> first place? The STACK_GROWSUP case is never going to have a lot of
>> testing, this is just sad.
> 
> LVM reads process maps from /proc/self/maps and locks them with mlock.
> 
> Why it doesn't use mlockall()? Because glibc maps all locales to the 
> process. Glibc packs all locales to a 100MB file and maps that file to 
> every process. Even if the process uses just one locale, glibc maps all.
> 
> So, when LVM used mlockall, it consumed >100MB memory and it caused 
> out-of-memory problems in system installers.
> 
> So, alternate way of locking was added to LVM --- read all maps and lock 
> them, except for the glibc locale file.
> 
> The real fix would be to fix glibc not to map 100MB to every process.
> 

I should probably add a few words here.

Glibc knows a few more ways around this - it can work with one locale file
per language, or even without mmap, allocating the locales in memory. It
usually depends on the distribution - Fedora decided to combine all locales
into one huge file (>100MB), while Ubuntu/Debian mmaps each locale
individually (usually on the order of megabytes).

LVM supports both ways - the user may select in lvm.conf to always use
mlockall(), or may switch to mlock()-based locking of individual memory
areas, where the memory parts that cannot be executed during the suspend
state and cannot cause a memory deadlock are not locked into memory. As a
'bonus', it is internally used for tracking algorithmic bugs.

Zdenek
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alasdair G Kergon May 9, 2011, 9:08 p.m. UTC | #2
On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> On 9.5.2011 13:01, Mikulas Patocka wrote:
> > So, when LVM used mlockall, it consumed >100MB memory and it caused 
> > out-of-memory problems in system installers.

The way glibc and some distros have implemented locales makes mlockall()
quite useless for many purposes now in practice IMHO.  We discussed the
options at the time and wrote the extra code to implement the only
solution we were offered.

(1) There was no possibility of some distros not using a single locale
database (reasons: optimisation and packaging convenience);
(2) there was no possibility of glibc providing a mechanism to unmap the
locale database (if you reset your locale to the default glibc will not
unmap the file and indeed doesn't rule out doing something similar with
other files in future - these are glibc internals and it's none of our
business).

Alasdair


Matthew Wilcox May 9, 2011, 10:45 p.m. UTC | #3
On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> > Why it doesn't use mlockall()? Because glibc maps all locales to the 
> > process. Glibc packs all locales to a 100MB file and maps that file to 
> > every process. Even if the process uses just one locale, glibc maps all.
> > 
> > So, when LVM used mlockall, it consumed >100MB memory and it caused 
> > out-of-memory problems in system installers.
> > 
> > So, alternate way of locking was added to LVM --- read all maps and lock 
> > them, except for the glibc locale file.
> > 
> > The real fix would be to fix glibc not to map 100MB to every process.
> 
> I should add here probably few words.
> 
> Glibc knows few more ways around - so it could work only with one locale file
> per language, or even without using mmap and allocating them in memory.
> Depends on the distribution usually - Fedora decided to combine all locales
> into one huge file (>100MB) - Ubuntu/Debian mmaps each locales individually
> (usually ~MB)

Sounds to me like glibc should introduce an mlockmost() call that does all
the work for you ...
Linus Torvalds May 9, 2011, 10:56 p.m. UTC | #4
On Mon, May 9, 2011 at 3:45 PM, Matthew Wilcox <matthew@wil.cx> wrote:
>
> Sounds to me like glibc should introduce an mlockmost() call that does all
> the work for you ...

That sounds like the worst of all worlds. Nobody will use it, now
there's two places to break, and how do you describe what you don't
want mlocked?

I dunno. There are so few applications that use mlock at *all* that I
wonder if it's even worth worrying about. And this one case we've
hopefully now fixed anyway.

But if people really want to fix mlockall(), I'd suggest some way of
just marking certain mappings as "sparse mappings", and then we can
just teach mlockall to not lock them. And then glibc could just mark
its locale archive that way - and maybe others would mark other
mappings.

We could still make such a "sparse mapping" act as locked for the
actual pages that are mapped into it - so it would have some kind of
real semantics. You could do a "mlockall(..)" on the process, and then
touch the sparse pages you know you want, and they'd be guaranteed to
be mapped after that.

We might even have a "mlockall(MCL_SPARSE)" flag that does that for
everything - basically guarantee that "whatever is mapped in now will
remain mapped in, but I won't bother paging it in just because you ask
me to do a mlockall".

So there are sane semantics for the concept, and it would be easy to
do in the kernel. Whether it's worth doing, I dunno.

                    Linus
Alasdair G Kergon May 10, 2011, 10:57 p.m. UTC | #5
On Mon, May 09, 2011 at 03:56:11PM -0700, Linus Torvalds wrote:
> So there are sane semantics for the concept, and it would be easy to
> do in the kernel. Whether it's worth doing, I dunno.
 
At this point we have a workaround that seems to work for us.  I find it
ugly and it has needed some tweaking already but we can cope.

If others find similar problems to ours and start replicating the logic
we're using, then that would be the time to give serious thought to a
clean 'sparse' extension.  (Marking individual mappings as MAP_SPARSE and
adding an MCL_NO_SPARSE option to ignore them sounds most promising to me.)

(What other software packages make use of mlockall() and under what
circumstances?)

Alasdair

Milan Broz May 11, 2011, 8:42 a.m. UTC | #6
On 05/11/2011 12:57 AM, Alasdair G Kergon wrote:
> (What other software packages make use of mlockall() and under what
> circumstances?)

Another one is cryptsetup, for commands that manipulate keys in memory.
(Users of the libcryptsetup library are not forced to lock memory; it is an
optional call. But cryptsetup itself, as a libcryptsetup user, always locks
memory.)

And I am not happy with mlockall() either, but the lvm2 workaround
is quite complicated.

Basically it wants to lock the memory holding explicitly allocated keys
(this could be rewritten to use a specific locked page, though), but it
also needs to lock various libdevmapper buffers when issuing the dmcrypt
configuration ioctl (the mapping table and messages contain the key). So
that's why mlockall(MCL_CURRENT|MCL_FUTURE) was the simplest way (and no
problems have been reported yet).
(Not that it is perfect, but it is better than nothing... Of course, more
important is to wipe the memory holding keys after use.)

Milan
Linus Torvalds May 12, 2011, 2:12 a.m. UTC | #7
On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>
> Another one is cryptsetup [..]

Quite frankly, all security-related uses should always be happy about
a "MCL_SPARSE" model, since there is no point in ever bringing in
pages that haven't been used. The whole (and only) point of
mlock[all]() for them is the "avoid to push to disk" issue.

I do wonder if we really should ever do the page-in at all. We might
simply be better off always just saying "we'll lock pages you've
touched, that's it".

                       Linus
Zdenek Kabelac May 12, 2011, 9:06 a.m. UTC | #8
On 12.5.2011 04:12, Linus Torvalds wrote:
> On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>>
>> Another one is cryptsetup [..]
> 
> Quite frankly, all security-related uses should always be happy about
> a "MCL_SPARSE" model, since there is no point in ever bringing in
> pages that haven't been used. The whole (and only) point of
> mlock[all]() for them is the "avoid to push to disk" issue.
> 
> I do wonder if we really should ever do the page-in at all. We might
> simply be better off always just saying "we'll lock pages you've
> touched, that's it".
> 


For LVM we need to ensure that any code which might be executed during the
disk suspend state is paged and locked in - thus we would need MCL_SPARSE
only for several selected 'unneeded' libraries, as we are obviously not
really able to select which parts of glibc might be needed across all code
paths (though I guess we may find some limits).  But if we are sure that
some libraries and locale files will never be used during the suspend
state, we do not care about those pages at all.

So it's not that we would need only MCL_SPARSE all the time - we would
probably need some control to switch e.g. glibc into MCL_ALL.

Zdenek

Patch

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2011-05-09 12:33:50.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2011-05-09 12:42:05.000000000 +0200
@@ -1011,11 +1011,19 @@  int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_stack_growsdown_continue(struct vm_area_struct *vma,
+					       unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-05-09 12:33:51.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-05-09 12:41:38.000000000 +0200
@@ -1410,11 +1410,21 @@  no_page_table:
 	return page;
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+static inline int stack_guard_page_growsdown(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return (vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_growsdown_continue(vma->vm_prev, addr);
+}
+
+
+static inline int stack_guard_page_growsup(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE);
 }
 
 /**
@@ -1554,13 +1564,18 @@  int __get_user_pages(struct task_struct 
 		/*
 		 * For mlock, just skip the stack guard page.
 		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) &&
+		    stack_guard_page_growsdown(vma, start))
 			goto next_page;
 
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
+			if ((gup_flags & FOLL_MLOCK) &&
+			    stack_guard_page_growsup(vma, start))
+				goto next_page;
+
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.