diff mbox series

mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

Message ID 20190213204157.12570-1-jannh@google.com (mailing list archive)
State New, archived
Headers show
Series mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs | expand

Commit Message

Jann Horn Feb. 13, 2019, 8:41 p.m. UTC
The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Andrew Morton Feb. 13, 2019, 8:59 p.m. UTC | #1
On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
> 
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
> 
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		/* Even if we own the page, we do not use atomic_set().
>  		 * This would break get_page_unless_zero() users.
>  		 */
> -		page_ref_add(page, size - 1);
> +		page_ref_add(page, size);
>  
>  		/* reset page count bias and offset to start of new frag */
>  		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		nc->offset = size;
>  	}
>  
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		size = nc->size;
>  #endif
>  		/* OK, page count is 0, we can safely set it */
> -		set_page_count(page, size);
> +		set_page_count(page, size + 1);
>  
>  		/* reset page count bias and offset to start of new frag */
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		offset = size - fragsz;
>  	}

This is probably more a davem patch than a -mm one.
Jann Horn Feb. 13, 2019, 9:11 p.m. UTC | #2
On Wed, Feb 13, 2019 at 9:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:
>
> > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > number of references that we might need to create in the fastpath later,
> > the bump-allocation fastpath only has to modify the non-atomic bias value
> > that tracks the number of extra references we hold instead of the atomic
> > refcount. The maximum number of allocations we can serve (under the
> > assumption that no allocation is made with size 0) is nc->size, so that's
> > the bias used.
> >
> > However, even when all memory in the allocation has been given away, a
> > reference to the page is still held; and in the `offset < 0` slowpath, the
> > page may be reused if everyone else has dropped their references.
> > This means that the necessary number of references is actually
> > `nc->size+1`.
> >
> > Luckily, from a quick grep, it looks like the only path that can call
> > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > used for kernel testing and fuzzing.
>
> For the net-naive, what is TAP?  It doesn't appear to mean
> drivers/net/tap.c.

It's implemented in drivers/net/tun.c; the combined functionality
implemented in there is called TUN/TAP. TUN refers to providing raw IP
packets to the kernel, TAP refers to providing raw ethernet packets.
It's documented in Documentation/networking/tuntap.txt. The code
that's interesting here is tun_get_user(), which calls into
tun_napi_alloc_frags() if tun_napi_frags_enabled(tfile) is true, which
in turn calls into netdev_alloc_frag(), which ends up in
page_frag_alloc(). This is how you can use it (except that if you were
using it legitimately, you'd be writing an ethernet header, a layer 3
header, and application data instead of writing "aaaaaaaaaaaaaaa" like
me):

================
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags =
IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);

  while (1) {
    struct iovec iov[15];
    for (int i=0; i<sizeof(iov)/sizeof(iov[0]); i++) {
      iov[i].iov_base = "a";
      iov[i].iov_len = 1;
    }
    writev(tun_fd, iov, sizeof(iov)/sizeof(iov[0]));
  }
}
================

> > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > with a vector consisting of 15 elements containing 1 byte each.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >               /* Even if we own the page, we do not use atomic_set().
> >                * This would break get_page_unless_zero() users.
> >                */
> > -             page_ref_add(page, size - 1);
> > +             page_ref_add(page, size);
> >
> >               /* reset page count bias and offset to start of new frag */
> >               nc->pfmemalloc = page_is_pfmemalloc(page);
> > -             nc->pagecnt_bias = size;
> > +             nc->pagecnt_bias = size + 1;
> >               nc->offset = size;
> >       }
> >
> > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >               size = nc->size;
> >  #endif
> >               /* OK, page count is 0, we can safely set it */
> > -             set_page_count(page, size);
> > +             set_page_count(page, size + 1);
> >
> >               /* reset page count bias and offset to start of new frag */
> > -             nc->pagecnt_bias = size;
> > +             nc->pagecnt_bias = size + 1;
> >               offset = size - fragsz;
> >       }
>
> This is probably more a davem patch than a -mm one.

Ah, sorry. I assumed that I just should go by which directory the
patched code is in.

You did just add it to the -mm tree though, right? So I shouldn't
resend it to davem?
Andrew Morton Feb. 13, 2019, 9:40 p.m. UTC | #3
On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn <jannh@google.com> wrote:

> > This is probably more a davem patch than a -mm one.
> 
> Ah, sorry. I assumed that I just should go by which directory the
> patched code is in.
> 
> You did just add it to the -mm tree though, right? So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.
Alexander Duyck Feb. 13, 2019, 10:42 p.m. UTC | #4
On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
>
> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
>
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
>
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

Actually that has me somewhat concerned. I wouldn't be surprised if
most drivers expect the netdev_alloc_frags call to at least output an
SKB_DATA_ALIGN sized value.

We probably should update __netdev_alloc_frag and __napi_alloc_frag so
that they will pass fragsz through SKB_DATA_ALIGN.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  mm/page_alloc.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 35fdde041f5c..46285d28e43b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>                 /* Even if we own the page, we do not use atomic_set().
>                  * This would break get_page_unless_zero() users.
>                  */
> -               page_ref_add(page, size - 1);
> +               page_ref_add(page, size);
>
>                 /* reset page count bias and offset to start of new frag */
>                 nc->pfmemalloc = page_is_pfmemalloc(page);
> -               nc->pagecnt_bias = size;
> +               nc->pagecnt_bias = size + 1;
>                 nc->offset = size;
>         }
>
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>                 size = nc->size;
>  #endif
>                 /* OK, page count is 0, we can safely set it */
> -               set_page_count(page, size);
> +               set_page_count(page, size + 1);
>
>                 /* reset page count bias and offset to start of new frag */
> -               nc->pagecnt_bias = size;
> +               nc->pagecnt_bias = size + 1;
>                 offset = size - fragsz;
>         }

If we already have to add a constant it might be better to just use
PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
to use "size + 1" instead of "size". That way we can avoid having to
add a constant to a register value and then program that value.
instead we can just assign the constant value right from the start.
Jann Horn Feb. 14, 2019, 3:13 p.m. UTC | #5
On Wed, Feb 13, 2019 at 11:42 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > number of references that we might need to create in the fastpath later,
> > the bump-allocation fastpath only has to modify the non-atomic bias value
> > that tracks the number of extra references we hold instead of the atomic
> > refcount. The maximum number of allocations we can serve (under the
> > assumption that no allocation is made with size 0) is nc->size, so that's
> > the bias used.
> >
> > However, even when all memory in the allocation has been given away, a
> > reference to the page is still held; and in the `offset < 0` slowpath, the
> > page may be reused if everyone else has dropped their references.
> > This means that the necessary number of references is actually
> > `nc->size+1`.
> >
> > Luckily, from a quick grep, it looks like the only path that can call
> > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > used for kernel testing and fuzzing.
>
> Actually that has me somewhat concerned. I wouldn't be surprised if
> most drivers expect the netdev_alloc_frags call to at least output an
> SKB_DATA_ALIGN sized value.
>
> We probably should update __netdev_alloc_frag and __napi_alloc_frag so
> that they will pass fragsz through SKB_DATA_ALIGN.

Do you want to do a separate patch for that? I'd like to not mix
logically separate changes in a single patch, and I also don't have a
good understanding of the alignment concerns here.

> > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > with a vector consisting of 15 elements containing 1 byte each.
> >
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> >  mm/page_alloc.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 35fdde041f5c..46285d28e43b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >                 /* Even if we own the page, we do not use atomic_set().
> >                  * This would break get_page_unless_zero() users.
> >                  */
> > -               page_ref_add(page, size - 1);
> > +               page_ref_add(page, size);
> >
> >                 /* reset page count bias and offset to start of new frag */
> >                 nc->pfmemalloc = page_is_pfmemalloc(page);
> > -               nc->pagecnt_bias = size;
> > +               nc->pagecnt_bias = size + 1;
> >                 nc->offset = size;
> >         }
> >
> > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> >                 size = nc->size;
> >  #endif
> >                 /* OK, page count is 0, we can safely set it */
> > -               set_page_count(page, size);
> > +               set_page_count(page, size + 1);
> >
> >                 /* reset page count bias and offset to start of new frag */
> > -               nc->pagecnt_bias = size;
> > +               nc->pagecnt_bias = size + 1;
> >                 offset = size - fragsz;
> >         }
>
> If we already have to add a constant it might be better to just use
> PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
> to use "size + 1" instead of "size". That way we can avoid having to
> add a constant to a register value and then program that value.
> instead we can just assign the constant value right from the start.

I doubt that these few instructions make a difference, but sure, I can
send a v2 with that changed.
Alexander Duyck Feb. 14, 2019, 3:37 p.m. UTC | #6
On Thu, Feb 14, 2019 at 7:13 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Feb 13, 2019 at 11:42 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > > number of references that we might need to create in the fastpath later,
> > > the bump-allocation fastpath only has to modify the non-atomic bias value
> > > that tracks the number of extra references we hold instead of the atomic
> > > refcount. The maximum number of allocations we can serve (under the
> > > assumption that no allocation is made with size 0) is nc->size, so that's
> > > the bias used.
> > >
> > > However, even when all memory in the allocation has been given away, a
> > > reference to the page is still held; and in the `offset < 0` slowpath, the
> > > page may be reused if everyone else has dropped their references.
> > > This means that the necessary number of references is actually
> > > `nc->size+1`.
> > >
> > > Luckily, from a quick grep, it looks like the only path that can call
> > > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > > used for kernel testing and fuzzing.
> >
> > Actually that has me somewhat concerned. I wouldn't be surprised if
> > most drivers expect the netdev_alloc_frags call to at least output an
> > SKB_DATA_ALIGN sized value.
> >
> > We probably should update __netdev_alloc_frag and __napi_alloc_frag so
> > that they will pass fragsz through SKB_DATA_ALIGN.
>
> Do you want to do a separate patch for that? I'd like to not mix
> logically separate changes in a single patch, and I also don't have a
> good understanding of the alignment concerns here.

You could just include it as a separate patch with your work.
Otherwise I will get to it when I have time.

The point is the issue you pointed out will actually cause other
issues if the behavior is maintained since you shouldn't be getting
unaligned blocks out of the frags API anyway.

> > > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > > with a vector consisting of 15 elements containing 1 byte each.
> > >
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Jann Horn <jannh@google.com>
> > > ---
> > >  mm/page_alloc.c | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 35fdde041f5c..46285d28e43b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > >                 /* Even if we own the page, we do not use atomic_set().
> > >                  * This would break get_page_unless_zero() users.
> > >                  */
> > > -               page_ref_add(page, size - 1);
> > > +               page_ref_add(page, size);
> > >
> > >                 /* reset page count bias and offset to start of new frag */
> > >                 nc->pfmemalloc = page_is_pfmemalloc(page);
> > > -               nc->pagecnt_bias = size;
> > > +               nc->pagecnt_bias = size + 1;
> > >                 nc->offset = size;
> > >         }
> > >
> > > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > >                 size = nc->size;
> > >  #endif
> > >                 /* OK, page count is 0, we can safely set it */
> > > -               set_page_count(page, size);
> > > +               set_page_count(page, size + 1);
> > >
> > >                 /* reset page count bias and offset to start of new frag */
> > > -               nc->pagecnt_bias = size;
> > > +               nc->pagecnt_bias = size + 1;
> > >                 offset = size - fragsz;
> > >         }
> >
> > If we already have to add a constant it might be better to just use
> > PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
> > to use "size + 1" instead of "size". That way we can avoid having to
> > add a constant to a register value and then program that value.
> > instead we can just assign the constant value right from the start.
>
> I doubt that these few instructions make a difference, but sure, I can
> send a v2 with that changed.

You would be surprised. They all end up adding up over time.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..46285d28e43b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4675,11 +4675,11 @@  void *page_frag_alloc(struct page_frag_cache *nc,
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, size - 1);
+		page_ref_add(page, size);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		nc->offset = size;
 	}
 
@@ -4695,10 +4695,10 @@  void *page_frag_alloc(struct page_frag_cache *nc,
 		size = nc->size;
 #endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, size);
+		set_page_count(page, size + 1);
 
 		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		offset = size - fragsz;
 	}