diff mbox

Don't mlock guardpage if the stack is growing up

Message ID alpine.DEB.2.00.1105082045250.15552@artax.karlin.mff.cuni.cz (mailing list archive)
State Awaiting Upstream
Headers show

Commit Message

Mikulas Patocka May 8, 2011, 6:55 p.m. UTC
Don't mlock guardpage if the stack is growing up

Linux kernel excludes guard page when performing mlock on a VMA with 
down-growing stack. However, some architectures have up-growing stack and 
locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
up-growing stack). lvm2 calculates the number of used pages when locking 
and when unlocking and reports an internal error if the numbers mismatch. 
On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
page, this causes allocation of one more page and internal error in lvm2.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

---
 include/linux/mm.h |    6 ++++++
 mm/memory.c        |   21 ++++++++++++---------
 2 files changed, 18 insertions(+), 9 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Hugh Dickins May 8, 2011, 8:12 p.m. UTC | #1
On Sun, 8 May 2011, Mikulas Patocka wrote:

> Don't mlock guardpage if the stack is growing up
> 
> Linux kernel excludes guard page when performing mlock on a VMA with 
> down-growing stack. However, some architectures have up-growing stack and 
> locking the guard page should be excluded in this case too.
> 
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
> up-growing stack). lvm2 calculates the number of used pages when locking 
> and when unlocking and reports an internal error if the numbers mismatch. 
> On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
> page, this causes allocation of one more page and internal error in lvm2.
> 
> Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

Interesting, I'd convinced myself that the growsup case was safe,
because of how we always approach the vma from its bottom end.

I've added Michel to the Cc, he's the one with the best grasp here.

Could you point us to where lvm2 is making these calculations?
I don't understand quite what it's doing.

Thanks,
Hugh

> 
> ---
>  include/linux/mm.h |    6 ++++++
>  mm/memory.c        |   21 ++++++++++++---------
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6.39-rc6-fast/include/linux/mm.h
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/include/linux/mm.h	2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/include/linux/mm.h	2011-05-07 05:59:52.000000000 +0200
> @@ -1016,6 +1016,12 @@ static inline int vma_stack_continue(str
>  	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
>  }
>  
> +/* Is the vma a continuation of the stack vma below it? */
> +static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
> +}
> +
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len);
> Index: linux-2.6.39-rc6-fast/mm/memory.c
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/mm/memory.c	2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/mm/memory.c	2011-05-07 05:59:52.000000000 +0200
> @@ -1412,9 +1412,12 @@ no_page_table:
>  
>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>  {
> -	return (vma->vm_flags & VM_GROWSDOWN) &&
> +	return ((vma->vm_flags & VM_GROWSDOWN) &&
>  		(vma->vm_start == addr) &&
> -		!vma_stack_continue(vma->vm_prev, addr);
> +		!vma_stack_continue(vma->vm_prev, addr)) ||
> +	       ((vma->vm_flags & VM_GROWSUP) &&
> +		(vma->vm_end == addr + PAGE_SIZE) &&
> +		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>  }
>  
>  /**
> @@ -1551,18 +1554,18 @@ int __get_user_pages(struct task_struct 
>  			continue;
>  		}
>  
> -		/*
> -		 * If we don't actually want the page itself,
> -		 * and it's the stack guard page, just skip it.
> -		 */
> -		if (!pages && stack_guard_page(vma, start))
> -			goto next_page;
> -
>  		do {
>  			struct page *page;
>  			unsigned int foll_flags = gup_flags;
>  
>  			/*
> +			 * If we don't actually want the page itself,
> +			 * and it's the stack guard page, just skip it.
> +			 */
> +			if (!pages && stack_guard_page(vma, start))
> +				goto next_page;
> +
> +			/*
>  			 * If we have a pending SIGKILL, don't keep faulting
>  			 * pages and potentially allocating memory.
>  			 */
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 8, 2011, 9:47 p.m. UTC | #2
On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
<mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> up-growing stack). lvm2 calculates the number of used pages when locking
> and when unlocking and reports an internal error if the numbers mismatch.

This patch won't apply on current kernels (including stable) because
of commit a1fde08c74e9 that changed the test of "pages" to instead
just test "flags & FOLL_MLOCK".

That should be trivial to fix up.

However, I really don't much like this complex test:

>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>  {
> -       return (vma->vm_flags & VM_GROWSDOWN) &&
> +       return ((vma->vm_flags & VM_GROWSDOWN) &&
>                (vma->vm_start == addr) &&
> -               !vma_stack_continue(vma->vm_prev, addr);
> +               !vma_stack_continue(vma->vm_prev, addr)) ||
> +              ((vma->vm_flags & VM_GROWSUP) &&
> +               (vma->vm_end == addr + PAGE_SIZE) &&
> +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>  }

in that format. It gets really hard to read, and I think you'd be
better off writing it as two helper functions (or macros) for the two
cases, and then have

  static inline int stack_guard_page(struct vm_area_struct *vma,
unsigned long addr)
  {
    return stack_guard_page_growsdown(vma, addr) ||
      stack_guard_page_growsup(vma, addr);
  }

I'd also like to verify that it doesn't actually generate any extra
code for the common case (iirc VM_GROWSUP is 0 for the architectures
that don't need it, and so the compiler shouldn't generate any extra
code, but I'd like that mentioned and verified explicitly).

Hmm?

Other than that it looks ok to me.

That said, could we please fix LVM to not do that crazy sh*t in the
first place? The STACK_GROWSUP case is never going to have a lot of
testing, this is just sad.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka May 9, 2011, 11:12 a.m. UTC | #3
On Sun, 8 May 2011, Hugh Dickins wrote:

> On Sun, 8 May 2011, Mikulas Patocka wrote:
> 
> > Don't mlock guardpage if the stack is growing up
> > 
> > Linux kernel excludes guard page when performing mlock on a VMA with 
> > down-growing stack. However, some architectures have up-growing stack and 
> > locking the guard page should be excluded in this case too.
> > 
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
> > up-growing stack). lvm2 calculates the number of used pages when locking 
> > and when unlocking and reports an internal error if the numbers mismatch. 
> > On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
> > page, this causes allocation of one more page and internal error in lvm2.
> > 
> > Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
> 
> Interesting, I'd convinced myself that the growsup case was safe,
> because of how we always approach the vma from its bottom end.
> 
> I've added Michel to the Cc, he's the one with the best grasp here.
> 
> Could you point us to where lvm2 is making these calculations?
> I don't understand quite what it's doing.
> 
> Thanks,
> Hugh

See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file 
and locks each map with mlock, except for glibc locale file.

It calculates how much memory it took when locking and unlocking and 
prints an internal error if the numbers differ. The internal error 
normally means that there was some memory allocated while it was locked 
(that is wrong).

However, on up-growing stack, the internal error is always triggered, 
because mlock() of the stack touches the guard page and allocates one more 
page.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 9, 2011, 3:57 p.m. UTC | #4
On Mon, May 9, 2011 at 4:12 AM, Mikulas Patocka
<mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file
> and locks each map with mlock, except for glibc locale file.

Hmm. One thing that strikes me is this problem also implies that the
/proc/self/maps file is wrong for the GROWSUP case, isn't it?

So I think we should not just apply your lock fix, but then *also*
apply something like this:

    diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
    index 2e7addfd9803..080980880c7f 100644
    --- a/fs/proc/task_mmu.c
    +++ b/fs/proc/task_mmu.c
    @@ -214,7 +214,7 @@ static void show_map_vma(struct seq_file *m,
struct vm_area_struct *vma)
     	int flags = vma->vm_flags;
     	unsigned long ino = 0;
     	unsigned long long pgoff = 0;
    -	unsigned long start;
    +	unsigned long start, end;
     	dev_t dev = 0;
     	int len;

    @@ -227,13 +227,16 @@ static void show_map_vma(struct seq_file *m,
struct vm_area_struct *vma)

     	/* We don't show the stack guard page in /proc/maps */
     	start = vma->vm_start;
    -	if (vma->vm_flags & VM_GROWSDOWN)
    -		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
    -			start += PAGE_SIZE;
    +	if (stack_guard_page_growsdown(vma, start))
    +		start += PAGE_SIZE;
    +	end = vma->vm_end;
    +	if (stack_guard_page_growsup(vma, end))
    +		end -= PAGE_SIZE;
    +

     	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
     			start,
    -			vma->vm_end,
    +			end,
     			flags & VM_READ ? 'r' : '-',
     			flags & VM_WRITE ? 'w' : '-',
     			flags & VM_EXEC ? 'x' : '-',

NOTE! The above actually assumes that your
"stack_guard_page_growsup()" has been changed to actually take the
"next page" value, which I think makes more sense (ie it's the "end of
stack", the same way "stack_guard_page_growsdown()" address is).

Hmm? I don't have any growsup machine to test with, can you do that?

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6.39-rc6-fast/include/linux/mm.h
===================================================================
--- linux-2.6.39-rc6-fast.orig/include/linux/mm.h	2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/include/linux/mm.h	2011-05-07 05:59:52.000000000 +0200
@@ -1016,6 +1016,12 @@  static inline int vma_stack_continue(str
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6.39-rc6-fast/mm/memory.c
===================================================================
--- linux-2.6.39-rc6-fast.orig/mm/memory.c	2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/mm/memory.c	2011-05-07 05:59:52.000000000 +0200
@@ -1412,9 +1412,12 @@  no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
+	return ((vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_continue(vma->vm_prev, addr)) ||
+	       ((vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
 }
 
 /**
@@ -1551,18 +1554,18 @@  int __get_user_pages(struct task_struct 
 			continue;
 		}
 
-		/*
-		 * If we don't actually want the page itself,
-		 * and it's the stack guard page, just skip it.
-		 */
-		if (!pages && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
 			/*
+			 * If we don't actually want the page itself,
+			 * and it's the stack guard page, just skip it.
+			 */
+			if (!pages && stack_guard_page(vma, start))
+				goto next_page;
+
+			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
 			 */