Message ID | 20190213204157.12570-1-jannh@google.com (mailing list archive)
---|---
State | New, archived
Series | mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
>
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
>
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> 		/* Even if we own the page, we do not use atomic_set().
> 		 * This would break get_page_unless_zero() users.
> 		 */
> -		page_ref_add(page, size - 1);
> +		page_ref_add(page, size);
>
> 		/* reset page count bias and offset to start of new frag */
> 		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
> 		nc->offset = size;
> 	}
>
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> 		size = nc->size;
> #endif
> 		/* OK, page count is 0, we can safely set it */
> -		set_page_count(page, size);
> +		set_page_count(page, size + 1);
>
> 		/* reset page count bias and offset to start of new frag */
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
> 		offset = size - fragsz;
> 	}

This is probably more a davem patch than a -mm one.
On Wed, Feb 13, 2019 at 9:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn <jannh@google.com> wrote:
>
> > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > number of references that we might need to create in the fastpath later,
> > the bump-allocation fastpath only has to modify the non-atomic bias value
> > that tracks the number of extra references we hold instead of the atomic
> > refcount. The maximum number of allocations we can serve (under the
> > assumption that no allocation is made with size 0) is nc->size, so that's
> > the bias used.
> >
> > However, even when all memory in the allocation has been given away, a
> > reference to the page is still held; and in the `offset < 0` slowpath, the
> > page may be reused if everyone else has dropped their references.
> > This means that the necessary number of references is actually
> > `nc->size+1`.
> >
> > Luckily, from a quick grep, it looks like the only path that can call
> > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > used for kernel testing and fuzzing.
>
> For the net-naive, what is TAP?  It doesn't appear to mean
> drivers/net/tap.c.

It's implemented in drivers/net/tun.c; the combined functionality
implemented in there is called TUN/TAP. TUN refers to providing raw IP
packets to the kernel, TAP refers to providing raw ethernet packets.
It's documented in Documentation/networking/tuntap.txt.
The code that's interesting here is tun_get_user(), which calls into
tun_napi_alloc_frags() if tun_napi_frags_enabled(tfile) is true, which
in turn calls into netdev_alloc_frag(), which ends up in
page_frag_alloc().

This is how you can use it (except that if you were using it
legitimately, you'd be writing an ethernet header, a layer 3 header,
and application data instead of writing "aaaaaaaaaaaaaaa" like me):

================
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/uio.h> /* for struct iovec and writev() */

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;
int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = {
    .ifr_flags = IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI
  };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  while (1) {
    struct iovec iov[15];
    for (int i=0; i<sizeof(iov)/sizeof(iov[0]); i++) {
      iov[i].iov_base = "a";
      iov[i].iov_len = 1;
    }
    writev(tun_fd, iov, sizeof(iov)/sizeof(iov[0]));
  }
}
================

> > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > with a vector consisting of 15 elements containing 1 byte each.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > 		/* Even if we own the page, we do not use atomic_set().
> > 		 * This would break get_page_unless_zero() users.
> > 		 */
> > -		page_ref_add(page, size - 1);
> > +		page_ref_add(page, size);
> >
> > 		/* reset page count bias and offset to start of new frag */
> > 		nc->pfmemalloc = page_is_pfmemalloc(page);
> > -		nc->pagecnt_bias = size;
> > +		nc->pagecnt_bias = size + 1;
> > 		nc->offset = size;
> > 	}
> >
> > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > 		size = nc->size;
> > #endif
> > 		/* OK, page count is 0, we can safely set it */
> > -		set_page_count(page, size);
> > +		set_page_count(page, size + 1);
> >
> > 		/* reset page count bias and offset to start of new frag */
> > -		nc->pagecnt_bias = size;
> > +		nc->pagecnt_bias = size + 1;
> > 		offset = size - fragsz;
> > 	}
>
> This is probably more a davem patch than a -mm one.

Ah, sorry. I assumed that I just should go by which directory the
patched code is in.

You did just add it to the -mm tree though, right? So I shouldn't
resend it to davem?
On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn <jannh@google.com> wrote:

> > This is probably more a davem patch than a -mm one.
>
> Ah, sorry. I assumed that I just should go by which directory the
> patched code is in.
>
> You did just add it to the -mm tree though, right? So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.
On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
>
> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
>
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
>
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

Actually that has me somewhat concerned. I wouldn't be surprised if
most drivers expect the netdev_alloc_frags call to at least output an
SKB_DATA_ALIGN sized value.

We probably should update __netdev_alloc_frag and __napi_alloc_frag so
that they will pass fragsz through SKB_DATA_ALIGN.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  mm/page_alloc.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 35fdde041f5c..46285d28e43b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> 		/* Even if we own the page, we do not use atomic_set().
> 		 * This would break get_page_unless_zero() users.
> 		 */
> -		page_ref_add(page, size - 1);
> +		page_ref_add(page, size);
>
> 		/* reset page count bias and offset to start of new frag */
> 		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
> 		nc->offset = size;
> 	}
>
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> 		size = nc->size;
> #endif
> 		/* OK, page count is 0, we can safely set it */
> -		set_page_count(page, size);
> +		set_page_count(page, size + 1);
>
> 		/* reset page count bias and offset to start of new frag */
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
> 		offset = size - fragsz;
> 	}

If we already have to add a constant it might be better to just use
PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
to use "size + 1" instead of "size". That way we can avoid having to
add a constant to a register value and then program that value.
Instead we can just assign the constant value right from the start.
On Wed, Feb 13, 2019 at 11:42 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > number of references that we might need to create in the fastpath later,
> > the bump-allocation fastpath only has to modify the non-atomic bias value
> > that tracks the number of extra references we hold instead of the atomic
> > refcount. The maximum number of allocations we can serve (under the
> > assumption that no allocation is made with size 0) is nc->size, so that's
> > the bias used.
> >
> > However, even when all memory in the allocation has been given away, a
> > reference to the page is still held; and in the `offset < 0` slowpath, the
> > page may be reused if everyone else has dropped their references.
> > This means that the necessary number of references is actually
> > `nc->size+1`.
> >
> > Luckily, from a quick grep, it looks like the only path that can call
> > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > used for kernel testing and fuzzing.
>
> Actually that has me somewhat concerned. I wouldn't be surprised if
> most drivers expect the netdev_alloc_frags call to at least output an
> SKB_DATA_ALIGN sized value.
>
> We probably should update __netdev_alloc_frag and __napi_alloc_frag so
> that they will pass fragsz through SKB_DATA_ALIGN.

Do you want to do a separate patch for that? I'd like to not mix
logically separate changes in a single patch, and I also don't have a
good understanding of the alignment concerns here.

> > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > with a vector consisting of 15 elements containing 1 byte each.
> >
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> >  mm/page_alloc.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 35fdde041f5c..46285d28e43b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > 		/* Even if we own the page, we do not use atomic_set().
> > 		 * This would break get_page_unless_zero() users.
> > 		 */
> > -		page_ref_add(page, size - 1);
> > +		page_ref_add(page, size);
> >
> > 		/* reset page count bias and offset to start of new frag */
> > 		nc->pfmemalloc = page_is_pfmemalloc(page);
> > -		nc->pagecnt_bias = size;
> > +		nc->pagecnt_bias = size + 1;
> > 		nc->offset = size;
> > 	}
> >
> > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > 		size = nc->size;
> > #endif
> > 		/* OK, page count is 0, we can safely set it */
> > -		set_page_count(page, size);
> > +		set_page_count(page, size + 1);
> >
> > 		/* reset page count bias and offset to start of new frag */
> > -		nc->pagecnt_bias = size;
> > +		nc->pagecnt_bias = size + 1;
> > 		offset = size - fragsz;
> > 	}
>
> If we already have to add a constant it might be better to just use
> PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
> to use "size + 1" instead of "size". That way we can avoid having to
> add a constant to a register value and then program that value.
> Instead we can just assign the constant value right from the start.

I doubt that these few instructions make a difference, but sure, I can
send a v2 with that changed.
On Thu, Feb 14, 2019 at 7:13 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Feb 13, 2019 at 11:42 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@google.com> wrote:
> > > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > > number of references that we might need to create in the fastpath later,
> > > the bump-allocation fastpath only has to modify the non-atomic bias value
> > > that tracks the number of extra references we hold instead of the atomic
> > > refcount. The maximum number of allocations we can serve (under the
> > > assumption that no allocation is made with size 0) is nc->size, so that's
> > > the bias used.
> > >
> > > However, even when all memory in the allocation has been given away, a
> > > reference to the page is still held; and in the `offset < 0` slowpath, the
> > > page may be reused if everyone else has dropped their references.
> > > This means that the necessary number of references is actually
> > > `nc->size+1`.
> > >
> > > Luckily, from a quick grep, it looks like the only path that can call
> > > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > > used for kernel testing and fuzzing.
> >
> > Actually that has me somewhat concerned. I wouldn't be surprised if
> > most drivers expect the netdev_alloc_frags call to at least output an
> > SKB_DATA_ALIGN sized value.
> >
> > We probably should update __netdev_alloc_frag and __napi_alloc_frag so
> > that they will pass fragsz through SKB_DATA_ALIGN.
>
> Do you want to do a separate patch for that? I'd like to not mix
> logically separate changes in a single patch, and I also don't have a
> good understanding of the alignment concerns here.

You could just include it as a separate patch with your work.
Otherwise I will get to it when I have time. The point is the issue
you pointed out will actually cause other issues if the behavior is
maintained since you shouldn't be getting unaligned blocks out of the
frags API anyway.

> > > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > > with a vector consisting of 15 elements containing 1 byte each.
> > >
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Jann Horn <jannh@google.com>
> > > ---
> > >  mm/page_alloc.c | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 35fdde041f5c..46285d28e43b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > > 		/* Even if we own the page, we do not use atomic_set().
> > > 		 * This would break get_page_unless_zero() users.
> > > 		 */
> > > -		page_ref_add(page, size - 1);
> > > +		page_ref_add(page, size);
> > >
> > > 		/* reset page count bias and offset to start of new frag */
> > > 		nc->pfmemalloc = page_is_pfmemalloc(page);
> > > -		nc->pagecnt_bias = size;
> > > +		nc->pagecnt_bias = size + 1;
> > > 		nc->offset = size;
> > > 	}
> > >
> > > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > > 		size = nc->size;
> > > #endif
> > > 		/* OK, page count is 0, we can safely set it */
> > > -		set_page_count(page, size);
> > > +		set_page_count(page, size + 1);
> > >
> > > 		/* reset page count bias and offset to start of new frag */
> > > -		nc->pagecnt_bias = size;
> > > +		nc->pagecnt_bias = size + 1;
> > > 		offset = size - fragsz;
> > > 	}
> >
> > If we already have to add a constant it might be better to just use
> > PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
> > to use "size + 1" instead of "size". That way we can avoid having to
> > add a constant to a register value and then program that value.
> > Instead we can just assign the constant value right from the start.
>
> I doubt that these few instructions make a difference, but sure, I can
> send a v2 with that changed.

You would be surprised. They all end up adding up over time.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..46285d28e43b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, size - 1);
+		page_ref_add(page, size);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		nc->offset = size;
 	}
 
@@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		size = nc->size;
 #endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, size);
+		set_page_count(page, size + 1);
 
 		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = size;
+		nc->pagecnt_bias = size + 1;
 		offset = size - fragsz;
 	}
The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)