[v1] mm/vmalloc: fix exact allocations with an alignment > 1

Message ID 20210908132727.16165-1-david@redhat.com (mailing list archive)
State New

Commit Message

David Hildenbrand Sept. 8, 2021, 1:27 p.m. UTC
find_vmap_lowest_match() is imprecise such that it won't always
find "the first free block ... that will accomplish the request" if
an alignment > 1 was specified: especially also when the alignment is
PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
designed, propagating the max size without alignment information through
the tree, it's hard to make it precise again when an alignment > 1 is
specified.

The issue is that in order to be able to eventually align later,
find_vmap_lowest_match() has to search for a slightly bigger area and
might skip some applicable subtrees just by looking at the result of
get_subtree_max_size(). While this usually doesn't matter, it matters for
exact allocations as performed by KASAN when onlining a memory block,
when the free block exactly matches the request.
(mm/kasan/shadow.c:kasan_mem_notifier()).

In case we online memory blocks out of order (not lowest to highest
address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
an exact allocation if it corresponds exactly to a free block. (there are
some corner cases where it would still work, if we're lucky and hit the
first is_within_this_va() inside the while loop)

[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
[  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
[  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
[  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  223.863580] Call Trace:
[  223.863946]  dump_stack+0x8e/0xd0
[  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
[  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
[  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
[  223.866264]  ? __get_vm_area_node+0x240/0x2c0
[  223.866858]  ? kfree+0xdd/0x570
[  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
[  223.868028]  ? notifier_call_chain+0x90/0x160
[  223.868625]  __vmalloc_node_range+0x465/0x840
[  223.869230]  ? mark_held_locks+0xb7/0x120

While we could fix this in kasan_mem_notifier() by passing an alignment
of "1", this is actually an implementation detail of vmalloc and should be
handled internally.

So use an alignment of 1 when calling find_vmap_lowest_match() for exact
allocations that are expected to succeed -- if the given range can
satisfy the alignment requirements.

Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Reported-by: Ping Fang <pifang@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/vmalloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)


base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f

Comments

Uladzislau Rezki Sept. 13, 2021, 8:29 a.m. UTC | #1
Hello.

Let me have a look at it.

Vlad Rezki


Wed, 8 Sept 2021, 15:27 David Hildenbrand <david@redhat.com>:

> find_vmap_lowest_match() is imprecise such that it won't always
> find "the first free block ... that will accomplish the request" if
> an alignment > 1 was specified: especially also when the alignment is
> PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
> designed, propagating the max size without alignment information through
> the tree, it's hard to make it precise again when an alignment > 1 is
> specified.
>
> The issue is that in order to be able to eventually align later,
> find_vmap_lowest_match() has to search for a slightly bigger area and
> might skip some applicable subtrees just by looking at the result of
> get_subtree_max_size(). While this usually doesn't matter, it matters for
> exact allocations as performed by KASAN when onlining a memory block,
> when the free block exactly matches the request.
> (mm/kasan/shadow.c:kasan_mem_notifier()).
>
> In case we online memory blocks out of order (not lowest to highest
> address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
> an exact allocation if it corresponds exactly to a free block. (there are
> some corner cases where it would still work, if we're lucky and hit the
> first is_within_this_va() inside the while loop)
>
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
> [  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
> [  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> [  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
> [  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [  223.863580] Call Trace:
> [  223.863946]  dump_stack+0x8e/0xd0
> [  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
> [  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
> [  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
> [  223.866264]  ? __get_vm_area_node+0x240/0x2c0
> [  223.866858]  ? kfree+0xdd/0x570
> [  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
> [  223.868028]  ? notifier_call_chain+0x90/0x160
> [  223.868625]  __vmalloc_node_range+0x465/0x840
> [  223.869230]  ? mark_held_locks+0xb7/0x120
>
> While we could fix this in kasan_mem_notifier() by passing an alignment
> of "1", this is actually an implementation detail of vmalloc and should be
> handled internally.
>
> So use an alignment of 1 when calling find_vmap_lowest_match() for exact
> allocations that are expected to succeed -- if the given range can
> satisfy the alignment requirements.
>
> Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
> Reported-by: Ping Fang <pifang@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: linux-mm@kvack.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/vmalloc.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d5cd52805149..c6071f5f8de3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1153,7 +1153,8 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
>  /*
>   * Find the first free block(lowest start address) in the tree,
>   * that will accomplish the request corresponding to passing
> - * parameters.
> + * parameters. Note that with an alignment > 1, this function
> + * can be imprecise and skip applicable free blocks.
>   */
>  static __always_inline struct vmap_area *
>  find_vmap_lowest_match(unsigned long size,
> @@ -1396,7 +1397,15 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
>         enum fit_type type;
>         int ret;
>
> -       va = find_vmap_lowest_match(size, align, vstart);
> +       /*
> +        * For exact allocations, ignore the alignment, such that
> +        * find_vmap_lowest_match() won't search for a bigger free area just
> +        * able to align later and consequently fail the search.
> +        */
> +       if (vend - vstart == size && IS_ALIGNED(vstart, align))
> +               va = find_vmap_lowest_match(size, 1, vstart);
> +       else
> +               va = find_vmap_lowest_match(size, align, vstart);
>         if (unlikely(!va))
>                 return vend;
>
>
> base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
> --
> 2.31.1
>
>
Uladzislau Rezki Sept. 13, 2021, 8:44 a.m. UTC | #2
Hello.

Let me have a look at it.

Vlad Rezki

On Wed, Sep 8, 2021 at 3:27 PM David Hildenbrand <david@redhat.com> wrote:
>
> find_vmap_lowest_match() is imprecise such that it won't always
> find "the first free block ... that will accomplish the request" if
> an alignment > 1 was specified: especially also when the alignment is
> PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
> designed, propagating the max size without alignment information through
> the tree, it's hard to make it precise again when an alignment > 1 is
> specified.
>
> The issue is that in order to be able to eventually align later,
> find_vmap_lowest_match() has to search for a slightly bigger area and
> might skip some applicable subtrees just by looking at the result of
> get_subtree_max_size(). While this usually doesn't matter, it matters for
> exact allocations as performed by KASAN when onlining a memory block,
> when the free block exactly matches the request.
> (mm/kasan/shadow.c:kasan_mem_notifier()).
>
> In case we online memory blocks out of order (not lowest to highest
> address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
> an exact allocation if it corresponds exactly to a free block. (there are
> some corner cases where it would still work, if we're lucky and hit the
> first is_within_this_va() inside the while loop)
>
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
> [  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
> [  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> [  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
> [  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [  223.863580] Call Trace:
> [  223.863946]  dump_stack+0x8e/0xd0
> [  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
> [  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
> [  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
> [  223.866264]  ? __get_vm_area_node+0x240/0x2c0
> [  223.866858]  ? kfree+0xdd/0x570
> [  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
> [  223.868028]  ? notifier_call_chain+0x90/0x160
> [  223.868625]  __vmalloc_node_range+0x465/0x840
> [  223.869230]  ? mark_held_locks+0xb7/0x120
>
> While we could fix this in kasan_mem_notifier() by passing an alignment
> of "1", this is actually an implementation detail of vmalloc and should be
> handled internally.
>
> So use an alignment of 1 when calling find_vmap_lowest_match() for exact
> allocations that are expected to succeed -- if the given range can
> satisfy the alignment requirements.
>
> Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
> Reported-by: Ping Fang <pifang@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: linux-mm@kvack.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/vmalloc.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d5cd52805149..c6071f5f8de3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1153,7 +1153,8 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
>  /*
>   * Find the first free block(lowest start address) in the tree,
>   * that will accomplish the request corresponding to passing
> - * parameters.
> + * parameters. Note that with an alignment > 1, this function
> + * can be imprecise and skip applicable free blocks.
>   */
>  static __always_inline struct vmap_area *
>  find_vmap_lowest_match(unsigned long size,
> @@ -1396,7 +1397,15 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
>         enum fit_type type;
>         int ret;
>
> -       va = find_vmap_lowest_match(size, align, vstart);
> +       /*
> +        * For exact allocations, ignore the alignment, such that
> +        * find_vmap_lowest_match() won't search for a bigger free area just
> +        * able to align later and consequently fail the search.
> +        */
> +       if (vend - vstart == size && IS_ALIGNED(vstart, align))
> +               va = find_vmap_lowest_match(size, 1, vstart);
> +       else
> +               va = find_vmap_lowest_match(size, align, vstart);
>         if (unlikely(!va))
>                 return vend;
>
>
> base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
> --
> 2.31.1
>
Uladzislau Rezki Sept. 14, 2021, 7:15 p.m. UTC | #3
> Let me have a look at it.
>
> Vlad Rezki
>
> On Wed, Sep 8, 2021 at 3:27 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > find_vmap_lowest_match() is imprecise such that it won't always
> > find "the first free block ... that will accomplish the request" if
> > an alignment > 1 was specified: especially also when the alignment is
> > PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
> > designed, propagating the max size without alignment information through
> > the tree, it's hard to make it precise again when an alignment > 1 is
> > specified.
> >
> > The issue is that in order to be able to eventually align later,
> > find_vmap_lowest_match() has to search for a slightly bigger area and
> > might skip some applicable subtrees just by looking at the result of
> > get_subtree_max_size(). While this usually doesn't matter, it matters for
> > exact allocations as performed by KASAN when onlining a memory block,
> > when the free block exactly matches the request.
> > (mm/kasan/shadow.c:kasan_mem_notifier()).
> >
> > In case we online memory blocks out of order (not lowest to highest
> > address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
> > an exact allocation if it corresponds exactly to a free block. (there are
> > some corner cases where it would still work, if we're lucky and hit the
> > first is_within_this_va() inside the while loop)
> >
> > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
> > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
> > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
> > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
> > [  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
> > [  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> > [  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
> > [  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [  223.863580] Call Trace:
> > [  223.863946]  dump_stack+0x8e/0xd0
> > [  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
> > [  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
> > [  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
> > [  223.866264]  ? __get_vm_area_node+0x240/0x2c0
> > [  223.866858]  ? kfree+0xdd/0x570
> > [  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
> > [  223.868028]  ? notifier_call_chain+0x90/0x160
> > [  223.868625]  __vmalloc_node_range+0x465/0x840
> > [  223.869230]  ? mark_held_locks+0xb7/0x120
> >
> > While we could fix this in kasan_mem_notifier() by passing an alignment
> > of "1", this is actually an implementation detail of vmalloc and should be
> > handled internally.
> >
> > So use an alignment of 1 when calling find_vmap_lowest_match() for exact
> > allocations that are expected to succeed -- if the given range can
> > satisfy the alignment requirements.
> >
> > Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
> > Reported-by: Ping Fang <pifang@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: linux-mm@kvack.org
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/vmalloc.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d5cd52805149..c6071f5f8de3 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1153,7 +1153,8 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
> >  /*
> >   * Find the first free block(lowest start address) in the tree,
> >   * that will accomplish the request corresponding to passing
> > - * parameters.
> > + * parameters. Note that with an alignment > 1, this function
> > + * can be imprecise and skip applicable free blocks.
> >   */
> >  static __always_inline struct vmap_area *
> >  find_vmap_lowest_match(unsigned long size,
> > @@ -1396,7 +1397,15 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
> >         enum fit_type type;
> >         int ret;
> >
> > -       va = find_vmap_lowest_match(size, align, vstart);
> > +       /*
> > +        * For exact allocations, ignore the alignment, such that
> > +        * find_vmap_lowest_match() won't search for a bigger free area just
> > +        * able to align later and consequently fail the search.
> > +        */
> > +       if (vend - vstart == size && IS_ALIGNED(vstart, align))
> > +               va = find_vmap_lowest_match(size, 1, vstart);
> > +       else
> > +               va = find_vmap_lowest_match(size, align, vstart);
> >         if (unlikely(!va))
> >                 return vend;
> >
> >
> > base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
> > --
> > 2.31.1
> >
This patch looks KASAN-specific to me. So I would like to avoid such
changes to the vmalloc internals in order to simplify further
maintenance and keep it generic instead.

Currently find_vmap_lowest_match() adjusts the search size criteria
for any alignment value, even for PAGE_SIZE alignment. That is not
optimal, because one page, i.e. PAGE_SIZE, is the minimum granularity
we operate on for vmalloc allocations; thus all free blocks are page
aligned.

All that means we should adjust the search length only if an
alignment request is bigger than one page. For one-page alignment,
which corresponds to PAGE_SIZE, there is no reason for such an extra
adjustment, because every free->va_start has a page boundary anyway.

Could you please test below patch that is a generic improvement?

<snip>
urezki@pc638:~/data/raid0/coding/linux-next.git$ git diff
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 24a3955d5a36..9d219ab5ae57 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1153,7 +1153,9 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
 /*
  * Find the first free block(lowest start address) in the tree,
  * that will accomplish the request corresponding to passing
- * parameters.
+ * parameters. Note that with an alignment > PAGE_SIZE, this
+ * function can skip applicable free blocks with lower start
+ * addresses which may fit for an alignment request.
  */
 static __always_inline struct vmap_area *
 find_vmap_lowest_match(unsigned long size,
@@ -1167,7 +1169,7 @@ find_vmap_lowest_match(unsigned long size,
        node = free_vmap_area_root.rb_node;

        /* Adjust the search size for alignment overhead. */
-       length = size + align - 1;
+       length = (align > PAGE_SIZE) ? size + align:size;

        while (node) {
                va = rb_entry(node, struct vmap_area, rb_node);
@@ -2251,7 +2253,7 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)

 static void vmap_init_free_space(void)
 {
-       unsigned long vmap_start = 1;
+       unsigned long vmap_start = PAGE_ALIGN(1);
        const unsigned long vmap_end = ULONG_MAX;
        struct vmap_area *busy, *free;

urezki@pc638:~/data/raid0/coding/linux-next.git$
<snip>

Thanks!

--
Vlad Rezki
David Hildenbrand Sept. 15, 2021, 9:02 a.m. UTC | #4
On 14.09.21 21:15, Uladzislau Rezki wrote:
>> Let me have a look at it.
>>
>> Vlad Rezki
>>
>> On Wed, Sep 8, 2021 at 3:27 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> find_vmap_lowest_match() is imprecise such that it won't always
>>> find "the first free block ... that will accomplish the request" if
>>> an alignment > 1 was specified: especially also when the alignment is
>>> PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
>>> designed, propagating the max size without alignment information through
>>> the tree, it's hard to make it precise again when an alignment > 1 is
>>> specified.
>>>
>>> The issue is that in order to be able to eventually align later,
>>> find_vmap_lowest_match() has to search for a slightly bigger area and
>>> might skip some applicable subtrees just by looking at the result of
>>> get_subtree_max_size(). While this usually doesn't matter, it matters for
>>> exact allocations as performed by KASAN when onlining a memory block,
>>> when the free block exactly matches the request.
>>> (mm/kasan/shadow.c:kasan_mem_notifier()).
>>>
>>> In case we online memory blocks out of order (not lowest to highest
>>> address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
>>> an exact allocation if it corresponds exactly to a free block. (there are
>>> some corner cases where it would still work, if we're lucky and hit the
>>> first is_within_this_va() inside the while loop)
>>>
>>> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
>>> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
>>> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
>>> [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
>>> [  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
>>> [  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
>>> [  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
>>> [  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>> [  223.863580] Call Trace:
>>> [  223.863946]  dump_stack+0x8e/0xd0
>>> [  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
>>> [  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
>>> [  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
>>> [  223.866264]  ? __get_vm_area_node+0x240/0x2c0
>>> [  223.866858]  ? kfree+0xdd/0x570
>>> [  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
>>> [  223.868028]  ? notifier_call_chain+0x90/0x160
>>> [  223.868625]  __vmalloc_node_range+0x465/0x840
>>> [  223.869230]  ? mark_held_locks+0xb7/0x120
>>>
>>> While we could fix this in kasan_mem_notifier() by passing an alignment
>>> of "1", this is actually an implementation detail of vmalloc and should be
>>> handled internally.
>>>
>>> So use an alignment of 1 when calling find_vmap_lowest_match() for exact
>>> allocations that are expected to succeed -- if the given range can
>>> satisfy the alignment requirements.
>>>
>>> Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
>>> Reported-by: Ping Fang <pifang@redhat.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>> Cc: Roman Gushchin <guro@fb.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: linux-mm@kvack.org
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/vmalloc.c | 13 +++++++++++--
>>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index d5cd52805149..c6071f5f8de3 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -1153,7 +1153,8 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
>>>   /*
>>>    * Find the first free block(lowest start address) in the tree,
>>>    * that will accomplish the request corresponding to passing
>>> - * parameters.
>>> + * parameters. Note that with an alignment > 1, this function
>>> + * can be imprecise and skip applicable free blocks.
>>>    */
>>>   static __always_inline struct vmap_area *
>>>   find_vmap_lowest_match(unsigned long size,
>>> @@ -1396,7 +1397,15 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
>>>          enum fit_type type;
>>>          int ret;
>>>
>>> -       va = find_vmap_lowest_match(size, align, vstart);
>>> +       /*
>>> +        * For exact allocations, ignore the alignment, such that
>>> +        * find_vmap_lowest_match() won't search for a bigger free area just
>>> +        * able to align later and consequently fail the search.
>>> +        */
>>> +       if (vend - vstart == size && IS_ALIGNED(vstart, align))
>>> +               va = find_vmap_lowest_match(size, 1, vstart);
>>> +       else
>>> +               va = find_vmap_lowest_match(size, align, vstart);
>>>          if (unlikely(!va))
>>>                  return vend;
>>>
>>>
>>> base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
>>> --
>>> 2.31.1
>>>
> This patch looks KASAN-specific to me. So I would like to avoid such
> changes to the vmalloc internals in order to simplify further
> maintenance and keep it generic instead.

There is nothing KASAN specific in there :) It's specific to exact
allocations -- and KASAN may be one of the only users for now.

> 
> Currently find_vmap_lowest_match() adjusts the search size criteria
> for any alignment value, even for PAGE_SIZE alignment. That is not
> optimal, because one page, i.e. PAGE_SIZE, is the minimum granularity
> we operate on for vmalloc allocations; thus all free blocks are page
> aligned.
> 
> All that means we should adjust the search length only if an
> alignment request is bigger than one page. For one-page alignment,
> which corresponds to PAGE_SIZE, there is no reason for such an extra
> adjustment, because every free->va_start has a page boundary anyway.
> 
> Could you please test below patch that is a generic improvement?

I played with the exact approach below (almost exactly the same code in 
find_vmap_lowest_match()), and it might make sense as a general 
improvement -- if we can guarantee that start+end of ranges are always 
PAGE-aligned; I was not able to come to that conclusion, and your change 
to vmap_init_free_space() shows me that our existing alignment 
code/checks might be quite fragile.

But I mainly decided to go with my patch instead because it will also 
work for exact allocations with align > PAGE_SIZE: most notably, when we 
try population of hugepages instead of base pages in 
__vmalloc_node_range(), by increasing the alignment. If the exact range 
allows for using huge pages, placing huge pages will work just fine with 
my modifications, while it won't with your approach.

Long story short: my approach handles exact allocations also for bigger 
alignment; your optimization makes sense as a general improvement for 
any vmalloc allocations.

Thanks for having a look!

> 
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-next.git$ git diff
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 24a3955d5a36..9d219ab5ae57 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1153,7 +1153,9 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
>   /*
>    * Find the first free block(lowest start address) in the tree,
>    * that will accomplish the request corresponding to passing
> - * parameters.
> + * parameters. Note that with an alignment > PAGE_SIZE, this
> + * function can skip applicable free blocks with lower start
> + * addresses which may fit for an alignment request.
>    */
>   static __always_inline struct vmap_area *
>   find_vmap_lowest_match(unsigned long size,
> @@ -1167,7 +1169,7 @@ find_vmap_lowest_match(unsigned long size,
>          node = free_vmap_area_root.rb_node;
> 
>          /* Adjust the search size for alignment overhead. */
> -       length = size + align - 1;
> +       length = (align > PAGE_SIZE) ? size + align:size;
> 
>          while (node) {
>                  va = rb_entry(node, struct vmap_area, rb_node);
> @@ -2251,7 +2253,7 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
> 
>   static void vmap_init_free_space(void)
>   {
> -       unsigned long vmap_start = 1;
> +       unsigned long vmap_start = PAGE_ALIGN(1);
>          const unsigned long vmap_end = ULONG_MAX;
>          struct vmap_area *busy, *free;
> 
> urezki@pc638:~/data/raid0/coding/linux-next.git$
> <snip>
> 
> Thanks!
> 
> --
> Vlad Rezki
>
Uladzislau Rezki Sept. 16, 2021, 7:34 p.m. UTC | #5
> On 14.09.21 21:15, Uladzislau Rezki wrote:
> > > Let me have a look at it.
> > > 
> > > Vlad Rezki
> > > 
> > > On Wed, Sep 8, 2021 at 3:27 PM David Hildenbrand <david@redhat.com> wrote:
> > > > 
> > > > find_vmap_lowest_match() is imprecise such that it won't always
> > > > find "the first free block ... that will accomplish the request" if
> > > > an alignment > 1 was specified: especially also when the alignment is
> > > > PAGE_SIZE. Unfortunately, the way the vmalloc data structures were
> > > > designed, propagating the max size without alignment information through
> > > > the tree, it's hard to make it precise again when an alignment > 1 is
> > > > specified.
> > > > 
> > > > The issue is that in order to be able to eventually align later,
> > > > find_vmap_lowest_match() has to search for a slightly bigger area and
> > > > might skip some applicable subtrees just by looking at the result of
> > > > get_subtree_max_size(). While this usually doesn't matter, it matters for
> > > > exact allocations as performed by KASAN when onlining a memory block,
> > > > when the free block exactly matches the request.
> > > > (mm/kasan/shadow.c:kasan_mem_notifier()).
> > > > 
> > > > In case we online memory blocks out of order (not lowest to highest
> > > > address), find_vmap_lowest_match() with PAGE_SIZE alignment will reject
> > > > an exact allocation if it corresponds exactly to a free block. (there are
> > > > some corner cases where it would still work, if we're lucky and hit the
> > > > first is_within_this_va() inside the while loop)
> > > > 
> > > > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
> > > > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
> > > > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
> > > > [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
> > > > [  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
> > > > [  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> > > > [  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
> > > > [  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > > [  223.863580] Call Trace:
> > > > [  223.863946]  dump_stack+0x8e/0xd0
> > > > [  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
> > > > [  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
> > > > [  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
> > > > [  223.866264]  ? __get_vm_area_node+0x240/0x2c0
> > > > [  223.866858]  ? kfree+0xdd/0x570
> > > > [  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
> > > > [  223.868028]  ? notifier_call_chain+0x90/0x160
> > > > [  223.868625]  __vmalloc_node_range+0x465/0x840
> > > > [  223.869230]  ? mark_held_locks+0xb7/0x120
> > > > 
> > > > While we could fix this in kasan_mem_notifier() by passing an alignment
> > > > of "1", this is actually an implementation detail of vmalloc and to be
> > > > handled internally.
> > > > 
> > > > So use an alignment of 1 when calling find_vmap_lowest_match() for exact
> > > > allocations that are expected to succeed -- if the given range can
> > > > satisfy the alignment requirements.
> > > > 
> > > > Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
> > > > Reported-by: Ping Fang <pifang@redhat.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Cc: Roman Gushchin <guro@fb.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > Cc: linux-mm@kvack.org
> > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > > ---
> > > >   mm/vmalloc.c | 13 +++++++++++--
> > > >   1 file changed, 11 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index d5cd52805149..c6071f5f8de3 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -1153,7 +1153,8 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
> > > >   /*
> > > >    * Find the first free block(lowest start address) in the tree,
> > > >    * that will accomplish the request corresponding to passing
> > > > - * parameters.
> > > > + * parameters. Note that with an alignment > 1, this function
> > > > + * can be imprecise and skip applicable free blocks.
> > > >    */
> > > >   static __always_inline struct vmap_area *
> > > >   find_vmap_lowest_match(unsigned long size,
> > > > @@ -1396,7 +1397,15 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
> > > >          enum fit_type type;
> > > >          int ret;
> > > > 
> > > > -       va = find_vmap_lowest_match(size, align, vstart);
> > > > +       /*
> > > > +        * For exact allocations, ignore the alignment, such that
> > > > +        * find_vmap_lowest_match() won't search for a bigger free area just
> > > > +        * able to align later and consequently fail the search.
> > > > +        */
> > > > +       if (vend - vstart == size && IS_ALIGNED(vstart, align))
> > > > +               va = find_vmap_lowest_match(size, 1, vstart);
> > > > +       else
> > > > +               va = find_vmap_lowest_match(size, align, vstart);
> > > >          if (unlikely(!va))
> > > >                  return vend;
> > > > 
> > > > 
> > > > base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
> > > > --
> > > > 2.31.1
> > > > 
> > This patch looks like a KASAN-specific change to me. So I would like to
> > avoid such changes to the vmalloc internals in order to simplify further
> > maintenance and keep them generic instead.
> 
> There is nothing KASAN specific in there :) It's specific to exact
> allocations -- and KASAN may be one of the only users for now.
> 
Well, you can name it either way :) So it should not be specific by
design, otherwise, as I mentioned, the allocator would be like a piece of
code that handles corner cases, which is actually not acceptable.

> > 
> > Currently find_vmap_lowest_match() adjusts the search size criteria
> > for any alignment value, even for PAGE_SIZE alignment. That is not
> > optimal, because one page, i.e. PAGE_SIZE, is the minimum granularity
> > we operate on for vmalloc allocations, thus all free blocks are page
> > aligned.
> > 
> > All that means we should adjust the search length only if an alignment
> > request is bigger than one page; in the one-page case, which
> > corresponds to PAGE_SIZE, there is no reason for such an extra
> > adjustment, because every free->va_start sits on a page boundary
> > anyway.
> > 
> > Could you please test below patch that is a generic improvement?
> 
> I played with the exact approach below (almost exactly the same code in
> find_vmap_lowest_match()), and it might make sense as a general improvement
> -- if we can guarantee that start+end of ranges are always PAGE-aligned; I
> was not able to come to that conclusion...
All free blocks are PAGE aligned; that is how it has to be. A vstart also
should be aligned, otherwise __vunmap() will complain about freeing
a bad address:

<snip>
    if (WARN(!PAGE_ALIGNED(addr), "Trying to vfree() bad address (%p)\n",
            addr))
        return;
<snip>

BTW, we should add an extra check to the alloc_vmap_area(), something like
below:

<snip>
    if (!PAGE_ALIGNED(ALIGN(vstart, align))) {
        WARN_ONCE(1, "vmalloc: vstart or align are not page aligned (0x%lx, 0x%lx)\n",
            vstart, align);
        return ERR_PTR(-EBUSY);
	}
<snip>

to check that passed pair of vstart and align are correct.
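With sample values, the proposed sanity check behaves as sketched below (a minimal user-space restatement, not kernel code; `ALIGN` and `PAGE_ALIGNED` are re-derived here, and a 4 KiB page plus power-of-two alignment are assumed):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* User-space restatements of the kernel helpers; both assume a
 * power-of-two alignment, as the kernel macros do. */
#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))
#define PAGE_ALIGNED(x) (((x) & (PAGE_SIZE - 1)) == 0)

/* Mirrors the proposed alloc_vmap_area() sanity check: the (vstart,
 * align) pair is valid only if aligning vstart up still lands on a
 * page boundary. */
static int vstart_align_valid(unsigned long vstart, unsigned long align)
{
	return PAGE_ALIGNED(ALIGN(vstart, align));
}
```

For instance, vstart = 0x1008 with align = 0x10 aligns up to 0x1010, which is not page aligned, so that pair would trigger the warning; an unaligned vstart with a PAGE_SIZE alignment still passes, since aligning it up lands on a page boundary.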

>
> vmap_init_free_space() shows me that our existing alignment code/checks
> might be quite fragile.
>
It is not important to page align the first address. As I mentioned before,
vstart and align have to be correct, and we should add such a check.

> 
> But I mainly decided to go with my patch instead because it will also work
> for exact allocations with align > PAGE_SIZE: most notably, when we try
> population of hugepages instead of base pages in __vmalloc_node_range(), by
> increasing the alignment. If the exact range allows for using huge pages,
> placing huge pages will work just fine with my modifications, while it won't
> with your approach.
> 
> Long story short: my approach handles exact allocations also for bigger
> alignments. Your optimization makes sense as a general improvement for any
> vmalloc allocation.
> 
> Thanks for having a look!
>
The problem is that there are no users (seemingly only KASAN) who set the range
that corresponds to the exact size. If you add an alignment overhead on top of
it, the search size should include it, because we may end up with an exact
free block that will not fit after applying the alignment. This is how
allocators handle alignment.
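The worst-case arithmetic described above can be made concrete with a small user-space sketch (illustrative only; `worst_case_length` and `fits_after_align` are made-up names, and a power-of-two alignment is assumed):

```c
#include <assert.h>

/* Search length an allocator must ask for so that ANY free block of at
 * least that size can still hold `size` bytes after aligning its start
 * up to `align`; this matches "length = size + align - 1" in
 * find_vmap_lowest_match(). */
static unsigned long worst_case_length(unsigned long size, unsigned long align)
{
	return size + align - 1;
}

/* Does a free block of `block_size` bytes at `start` still fit `size`
 * bytes once `start` is rounded up to `align` (power of two)? */
static int fits_after_align(unsigned long start, unsigned long block_size,
			    unsigned long size, unsigned long align)
{
	unsigned long aligned = (start + align - 1) & ~(align - 1);

	return aligned + size <= start + block_size;
}
```

An exact, already page-aligned free block (say 0x1000 bytes at 0x1000 with align = PAGE_SIZE) does fit, yet its size is smaller than the worst-case length of 0x1fff, so a search keyed on the adjusted length skips it; that is exactly the imprecision discussed in this thread.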

Another approach is(you mentioned it in your commit message):

urezki@pc638:~/data/raid0/coding/linux-next.git$ git diff mm/kasan/shadow.c
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 082ee5b6d9a1..01d3bd76c851 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -200,7 +200,7 @@ static int __meminit kasan_mem_notifier(struct notifier_block *nb,
                if (shadow_mapped(shadow_start))
                        return NOTIFY_OK;
 
-               ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start,
+               ret = __vmalloc_node_range(shadow_size, 1, shadow_start,
                                        shadow_end, GFP_KERNEL,
                                        PAGE_KERNEL, VM_NO_GUARD,
                                        pfn_to_nid(mem_data->start_pfn),
urezki@pc638:~/data/raid0/coding/linux-next.git$

why not? Also you can increase the range in KASAN code.

--
Vlad Rezki
David Hildenbrand Sept. 17, 2021, 8:47 a.m. UTC | #6
>>> This patch looks like a KASAN-specific change to me. So I would like to
>>> avoid such changes to the vmalloc internals in order to simplify further
>>> maintenance and keep them generic instead.
>>
>> There is nothing KASAN specific in there :) It's specific to exact
>> allocations -- and KASAN may be one of the only users for now.
>>
> Well, you can name it either way :) So it should not be specific by
> design, otherwise, as I mentioned, the allocator would be like a piece of
> code that handles corner cases, which is actually not acceptable.

Let's not overstress the situation of adding essentially 3 LOC in order 
to fix a sane use case of the vmalloc allocator that was not considered 
properly by internal changes due to 68ad4a330433 ("mm/vmalloc.c: keep 
track of free blocks for vmap allocation").

> 
>>>
>>> Currently find_vmap_lowest_match() adjusts the search size criteria
>>> for any alignment value, even for PAGE_SIZE alignment. That is not
>>> optimal, because one page, i.e. PAGE_SIZE, is the minimum granularity
>>> we operate on for vmalloc allocations, thus all free blocks are page
>>> aligned.
>>>
>>> All that means we should adjust the search length only if an alignment
>>> request is bigger than one page; in the one-page case, which
>>> corresponds to PAGE_SIZE, there is no reason for such an extra
>>> adjustment, because every free->va_start sits on a page boundary
>>> anyway.
>>>
>>> Could you please test below patch that is a generic improvement?
>>
>> I played with the exact approach below (almost exactly the same code in
>> find_vmap_lowest_match()), and it might make sense as a general improvement
>> -- if we can guarantee that start+end of ranges are always PAGE-aligned; I
>> was not able to come to that conclusion...
> All free blocks are PAGE aligned; that is how it has to be. A vstart also
> should be aligned, otherwise __vunmap() will complain about freeing
> a bad address:
> 
> <snip>
>      if (WARN(!PAGE_ALIGNED(addr), "Trying to vfree() bad address (%p)\n",
>              addr))
>          return;
> <snip>
> 
> BTW, we should add an extra check to the alloc_vmap_area(), something like
> below:
> 
> <snip>
>      if (!PAGE_ALIGNED(ALIGN(vstart, align))) {
>          WARN_ONCE(1, "vmalloc: vstart or align are not page aligned (0x%lx, 0x%lx)\n",
>              vstart, align);
>          return ERR_PTR(-EBUSY);
> 	}
> <snip>
> 
> to check that passed pair of vstart and align are correct.
> 

Yes we better should.

>>
>> vmap_init_free_space() shows me that our existing alignment code/checks
>> might be quite fragile.
>>
> It is not important to page align the first address. As I mentioned before,
> vstart and align have to be correct, and we should add such a check.
> 
>>
>> But I mainly decided to go with my patch instead because it will also work
>> for exact allocations with align > PAGE_SIZE: most notably, when we try
>> population of hugepages instead of base pages in __vmalloc_node_range(), by
>> increasing the alignment. If the exact range allows for using huge pages,
>> placing huge pages will work just fine with my modifications, while it won't
>> with your approach.
>>
>> Long story short: my approach handles exact allocations also for bigger
>> alignments. Your optimization makes sense as a general improvement for any
>> vmalloc allocation.
>>
>> Thanks for having a look!
>>
> The problem is that there are no users (seemingly only KASAN) who set the range
> that corresponds to the exact size. If you add an alignment overhead on top of

So there is one user and it was broken by 68ad4a330433 ("mm/vmalloc.c: 
keep track of free blocks for vmap allocation").

> it, the search size should include it, because we may end up with an exact
> free block that will not fit after applying the alignment. This is how
> allocators handle alignment.

This is an implementation detail of the vmalloc allocator ever since 
68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap 
allocation").

If callers pass an exact range, and the alignment they specify applies,
why should we reject such an allocation? That leaks into callers an
implementation detail that can easily be handled internally.
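For reference, the special case the patch adds boils down to a single predicate (a user-space restatement; `IS_ALIGNED` is re-derived here and a power-of-two alignment is assumed):

```c
#include <assert.h>

#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* An allocation is "exact" when the permitted [vstart, vend) range has
 * no slack at all. If vstart already satisfies the requested alignment,
 * the search can safely ignore the alignment (i.e. use align = 1), so
 * no alignment overhead is added to the search length. */
static int can_drop_align(unsigned long vstart, unsigned long vend,
			  unsigned long size, unsigned long align)
{
	return (vend - vstart == size) && IS_ALIGNED(vstart, align);
}
```

With a 16 MiB exact range starting on a 16 MiB-compatible boundary, the predicate holds and the search degenerates to a plain first-fit lookup; any slack in the range, and the normal aligned search is used instead.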

> 
> Another approach is(you mentioned it in your commit message):
> 
> urezki@pc638:~/data/raid0/coding/linux-next.git$ git diff mm/kasan/shadow.c
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 082ee5b6d9a1..01d3bd76c851 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -200,7 +200,7 @@ static int __meminit kasan_mem_notifier(struct notifier_block *nb,
>                  if (shadow_mapped(shadow_start))
>                          return NOTIFY_OK;
>   
> -               ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start,
> +               ret = __vmalloc_node_range(shadow_size, 1, shadow_start,
>                                          shadow_end, GFP_KERNEL,
>                                          PAGE_KERNEL, VM_NO_GUARD,
>                                          pfn_to_nid(mem_data->start_pfn),
> urezki@pc638:~/data/raid0/coding/linux-next.git$
> 
> why not? Also you can increase the range in KASAN code.

No, that's leaking implementation details to the caller. And no, 
increasing the range and eventually allocating something bigger (e.g., 
placing a huge page where it might not have been possible) is not 
acceptable for KASAN.

If you're terribly unhappy with this patch, please suggest something 
reasonable to fix exact allocations:
a) Fixes the KASAN use case.
b) Allows for automatic placement of huge pages for exact allocations.
c) Doesn't leak implementation details into the caller.

Thanks
Uladzislau Rezki Sept. 21, 2021, 10:13 p.m. UTC | #7
On Fri, Sep 17, 2021 at 10:47:54AM +0200, David Hildenbrand wrote:
> > > > This patch looks like a KASAN-specific change to me. So I would like to
> > > > avoid such changes to the vmalloc internals in order to simplify further
> > > > maintenance and keep them generic instead.
> > > 
> > > There is nothing KASAN specific in there :) It's specific to exact
> > > allocations -- and KASAN may be one of the only users for now.
> > > 
> > Well, you can name it either way :) So it should not be specific by
> > design, otherwise, as I mentioned, the allocator would be like a piece of
> > code that handles corner cases, which is actually not acceptable.
> 
> Let's not overstress the situation of adding essentially 3 LOC in order to
> fix a sane use case of the vmalloc allocator that was not considered
> properly by internal changes due to 68ad4a330433 ("mm/vmalloc.c: keep track
> of free blocks for vmap allocation").
> 
> > 
> > > > 
> > > > Currently find_vmap_lowest_match() adjusts the search size criteria
> > > > for any alignment value, even for PAGE_SIZE alignment. That is not
> > > > optimal, because one page, i.e. PAGE_SIZE, is the minimum granularity
> > > > we operate on for vmalloc allocations, thus all free blocks are page
> > > > aligned.
> > > > 
> > > > All that means we should adjust the search length only if an alignment
> > > > request is bigger than one page; in the one-page case, which
> > > > corresponds to PAGE_SIZE, there is no reason for such an extra
> > > > adjustment, because every free->va_start sits on a page boundary
> > > > anyway.
> > > > 
> > > > Could you please test below patch that is a generic improvement?
> > > 
> > > I played with the exact approach below (almost exactly the same code in
> > > find_vmap_lowest_match()), and it might make sense as a general improvement
> > > -- if we can guarantee that start+end of ranges are always PAGE-aligned; I
> > > was not able to come to that conclusion...
> > All free blocks are PAGE aligned; that is how it has to be. A vstart also
> > should be aligned, otherwise __vunmap() will complain about freeing
> > a bad address:
> > 
> > <snip>
> >      if (WARN(!PAGE_ALIGNED(addr), "Trying to vfree() bad address (%p)\n",
> >              addr))
> >          return;
> > <snip>
> > 
> > BTW, we should add an extra check to the alloc_vmap_area(), something like
> > below:
> > 
> > <snip>
> >      if (!PAGE_ALIGNED(ALIGN(vstart, align))) {
> >          WARN_ONCE(1, "vmalloc: vstart or align are not page aligned (0x%lx, 0x%lx)\n",
> >              vstart, align);
> >          return ERR_PTR(-EBUSY);
> > 	}
> > <snip>
> > 
> > to check that passed pair of vstart and align are correct.
> > 
> 
> Yes we better should.
> 
> > > 
> > > vmap_init_free_space() shows me that our existing alignment code/checks
> > > might be quite fragile.
> > > 
> > It is not important to page align the first address. As I mentioned before,
> > vstart and align have to be correct, and we should add such a check.
> > 
> > > 
> > > But I mainly decided to go with my patch instead because it will also work
> > > for exact allocations with align > PAGE_SIZE: most notably, when we try
> > > population of hugepages instead of base pages in __vmalloc_node_range(), by
> > > increasing the alignment. If the exact range allows for using huge pages,
> > > placing huge pages will work just fine with my modifications, while it won't
> > > with your approach.
> > > 
> > > Long story short: my approach handles exact allocations also for bigger
> > > alignments. Your optimization makes sense as a general improvement for any
> > > vmalloc allocation.
> > > 
> > > Thanks for having a look!
> > > 
> > The problem is that there are no users (seemingly only KASAN) who set the range
> > that corresponds to the exact size. If you add an alignment overhead on top of
> 
> So there is one user and it was broken by 68ad4a330433 ("mm/vmalloc.c: keep
> track of free blocks for vmap allocation").
> 
> > it, the search size should include it, because we may end up with an exact
> > free block that will not fit after applying the alignment. This is how
> > allocators handle alignment.
> 
> This is an implementation detail of the vmalloc allocator ever since
> 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap
> allocation").
> 
> If callers pass an exact range, and the alignment they specify applies, why
> should we reject such an allocation? That leaks into callers an implementation
> detail that can easily be handled internally.
> 
> > 
> > Another approach is(you mentioned it in your commit message):
> > 
> > urezki@pc638:~/data/raid0/coding/linux-next.git$ git diff mm/kasan/shadow.c
> > diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> > index 082ee5b6d9a1..01d3bd76c851 100644
> > --- a/mm/kasan/shadow.c
> > +++ b/mm/kasan/shadow.c
> > @@ -200,7 +200,7 @@ static int __meminit kasan_mem_notifier(struct notifier_block *nb,
> >                  if (shadow_mapped(shadow_start))
> >                          return NOTIFY_OK;
> > -               ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start,
> > +               ret = __vmalloc_node_range(shadow_size, 1, shadow_start,
> >                                          shadow_end, GFP_KERNEL,
> >                                          PAGE_KERNEL, VM_NO_GUARD,
> >                                          pfn_to_nid(mem_data->start_pfn),
> > urezki@pc638:~/data/raid0/coding/linux-next.git$
> > 
> > why not? Also you can increase the range in KASAN code.
> 
> No, that's leaking implementation details to the caller. And no, increasing
> the range and eventually allocating something bigger (e.g., placing a huge
> page where it might not have been possible) is not acceptable for KASAN.
> 
> If you're terribly unhappy with this patch,
Sorry to say, but it simply does not make sense.

>
> please suggest something reasonable to fix exact allocations:
> a) Fixes the KASAN use case.
> b) Allows for automatic placement of huge pages for exact allocations.
> c) Doesn't leak implementation details into the caller.
>
I am looking at it.

--
Vlad Rezki
David Hildenbrand Sept. 22, 2021, 8:34 a.m. UTC | #8
>> No, that's leaking implementation details to the caller. And no, increasing
>> the range and eventually allocating something bigger (e.g., placing a huge
>> page where it might not have been possible) is not acceptable for KASAN.
>>
>> If you're terribly unhappy with this patch,
> Sorry to say, but it simply does not make sense.
> 

Let's agree to disagree.

find_vmap_lowest_match() is imprecise now and that's an issue for exact 
allocations. We can either make it fully precise again (eventually 
degrading allocation performance) or just special-case exact allocations 
to fix the regression.

I decided to go the easy path and do the latter; I do agree that making 
find_vmap_lowest_match() fully precise again might be preferred -- we 
could have other allocations failing right now although there are still 
suitable holes.

I briefly thought about performing the search in 
find_vmap_lowest_match() twice. First, start the search without an 
extended range, and fall back to the extended range if that search fails. 
Unfortunately, I think that still won't make the function completely 
precise due to the way we might miss searching some suitable subtrees.

>>
>> please suggest something reasonable to fix exact allocations:
>> a) Fixes the KASAN use case.
>> b) Allows for automatic placement of huge pages for exact allocations.
>> c) Doesn't leak implementation details into the caller.
>>
> I am looking at it.

Thanks!
Uladzislau Rezki Sept. 22, 2021, 10:41 a.m. UTC | #9
On Wed, Sep 22, 2021 at 10:34:55AM +0200, David Hildenbrand wrote:
> > > No, that's leaking implementation details to the caller. And no, increasing
> > > the range and eventually allocating something bigger (e.g., placing a huge
> > > page where it might not have been possible) is not acceptable for KASAN.
> > > 
> > > If you're terribly unhappy with this patch,
> > Sorry to say, but it simply does not make sense.
> > 
> 
> Let's agree to disagree.
> 
> find_vmap_lowest_match() is imprecise now and that's an issue for exact
> allocations. We can either make it fully precise again (eventually degrading
> allocation performance) or just special-case exact allocations to fix the
> regression.
> 
> I decided to go the easy path and do the latter; I do agree that making
> find_vmap_lowest_match() fully precise again might be preferred -- we could
> have other allocations failing right now although there are still suitable
> holes.
> 
> I briefly thought about performing the search in find_vmap_lowest_match()
> twice. First, start the search without an extended range, and fall back to
> the extended range if that search fails. Unfortunately, I think that still
> won't make the function completely precise due to the way we might miss
> searching some suitable subtrees.
> 
> > > 
> > > please suggest something reasonable to fix exact allocations:
> > > a) Fixes the KASAN use case.
> > > b) Allows for automatic placement of huge pages for exact allocations.
> > > c) Doesn't leak implementation details into the caller.
> > > 
> > I am looking at it.
> 
I am testing this:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dcf23d16a308..cdf3bda6313d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1161,18 +1161,14 @@ find_vmap_lowest_match(unsigned long size,
 {
 	struct vmap_area *va;
 	struct rb_node *node;
-	unsigned long length;
 
 	/* Start from the root. */
 	node = free_vmap_area_root.rb_node;
 
-	/* Adjust the search size for alignment overhead. */
-	length = size + align - 1;
-
 	while (node) {
 		va = rb_entry(node, struct vmap_area, rb_node);
 
-		if (get_subtree_max_size(node->rb_left) >= length &&
+		if (get_subtree_max_size(node->rb_left) >= size &&
 				vstart < va->va_start) {
 			node = node->rb_left;
 		} else {
@@ -1182,9 +1178,9 @@ find_vmap_lowest_match(unsigned long size,
 			/*
 			 * Does not make sense to go deeper towards the right
 			 * sub-tree if it does not have a free block that is
-			 * equal or bigger to the requested search length.
+			 * equal or bigger to the requested search size.
 			 */
-			if (get_subtree_max_size(node->rb_right) >= length) {
+			if (get_subtree_max_size(node->rb_right) >= size) {
 				node = node->rb_right;
 				continue;
 			}
@@ -1192,16 +1188,30 @@ find_vmap_lowest_match(unsigned long size,
 			/*
 			 * OK. We roll back and find the first right sub-tree,
 			 * that will satisfy the search criteria. It can happen
-			 * only once due to "vstart" restriction.
+			 * due to "vstart" restriction or an alignment overhead.
 			 */
 			while ((node = rb_parent(node))) {
 				va = rb_entry(node, struct vmap_area, rb_node);
 				if (is_within_this_va(va, size, align, vstart))
 					return va;
 
-				if (get_subtree_max_size(node->rb_right) >= length &&
+				if (get_subtree_max_size(node->rb_right) >= size &&
 						vstart <= va->va_start) {
+					/*
+					 * Shift the vstart forward, so we do not loop over the same
+					 * sub-tree back and forth. The aim is to continue tree
+					 * scanning toward higher addresses cutting off previous
+					 * ones.
+					 *
+					 * Please note we update vstart with parent's start address
+					 * adding "1" because we do not want to enter same sub-tree
+					 * one more time after it has already been inspected and no
+					 * suitable free block found there.
+					 */
+					vstart = va->va_start + 1;
 					node = node->rb_right;
+
+					/* Scan a sub-tree rooted at "node". */
 					break;
 				}
 			}
<snip>

so it handles any alignment and is accurate when it comes to searching for the
lowest free block when the user wants to allocate with a special alignment value.
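To sanity-check that accuracy claim, the expected result can be stated as a brute-force oracle over free blocks sorted by start address (a hypothetical user-space model, not the rb-tree code; `lowest_match` is an illustrative name, and a power-of-two alignment is assumed):

```c
#include <assert.h>
#include <stddef.h>

struct block {
	unsigned long start, end; /* free area [start, end) */
};

/* Brute force over blocks sorted by start address: return the index of
 * the lowest block that can hold `size` bytes at alignment `align` at
 * or above `vstart`, or -1. An accurate find_vmap_lowest_match() must
 * agree with this oracle. */
static int lowest_match(const struct block *b, size_t n, unsigned long size,
			unsigned long align, unsigned long vstart)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long start = b[i].start > vstart ? b[i].start : vstart;

		start = (start + align - 1) & ~(align - 1); /* align up */
		if (start + size <= b[i].end)
			return (int)i;
	}
	return -1;
}
```

For the KASAN case (an exact, already-aligned block at the lowest position), this oracle returns the first block, whereas a search keyed on the adjusted length would skip it, which is the discrepancy being fixed.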

Could you please help and test the KASAN use case?

Thanks!

--
Vlad Rezki
David Hildenbrand Sept. 23, 2021, 5:42 p.m. UTC | #10
On 22.09.21 12:41, Uladzislau Rezki wrote:
> On Wed, Sep 22, 2021 at 10:34:55AM +0200, David Hildenbrand wrote:
>>>> No, that's leaking implementation details to the caller. And no, increasing
>>>> the range and eventually allocating something bigger (e.g., placing a huge
>>>> page where it might not have been possible) is not acceptable for KASAN.
>>>>
>>>> If you're terribly unhappy with this patch,
>>> Sorry to say, but it simply does not make sense.
>>>
>>
>> Let's agree to disagree.
>>
>> find_vmap_lowest_match() is imprecise now and that's an issue for exact
>> allocations. We can either make it fully precise again (eventually degrading
>> allocation performance) or just special-case exact allocations to fix the
>> regression.
>>
>> I decided to go the easy path and do the latter; I do agree that making
>> find_vmap_lowest_match() fully precise again might be preferred -- we could
>> have other allocations failing right now although there are still suitable
>> holes.
>>
>> I briefly thought about performing the search in find_vmap_lowest_match()
>> twice. First, start the search without an extended range, and fall back to
>> the extended range if that search fails. Unfortunately, I think that still
>> won't make the function completely precise due to the way we might miss
>> searching some suitable subtrees.
>>
>>>>
>>>> please suggest something reasonable to fix exact allocations:
>>>> a) Fixes the KASAN use case.
>>>> b) Allows for automatic placement of huge pages for exact allocations.
>>>> c) Doesn't leak implementation details into the caller.
>>>>
>>> I am looking at it.
>>
> I am testing this:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index dcf23d16a308..cdf3bda6313d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1161,18 +1161,14 @@ find_vmap_lowest_match(unsigned long size,
>   {
>   	struct vmap_area *va;
>   	struct rb_node *node;
> -	unsigned long length;
>   
>   	/* Start from the root. */
>   	node = free_vmap_area_root.rb_node;
>   
> -	/* Adjust the search size for alignment overhead. */
> -	length = size + align - 1;
> -
>   	while (node) {
>   		va = rb_entry(node, struct vmap_area, rb_node);
>   
> -		if (get_subtree_max_size(node->rb_left) >= length &&
> +		if (get_subtree_max_size(node->rb_left) >= size &&
>   				vstart < va->va_start) {
>   			node = node->rb_left;
>   		} else {
> @@ -1182,9 +1178,9 @@ find_vmap_lowest_match(unsigned long size,
>   			/*
>   			 * Does not make sense to go deeper towards the right
>   			 * sub-tree if it does not have a free block that is
> -			 * equal or bigger to the requested search length.
> +			 * equal or bigger to the requested search size.
>   			 */
> -			if (get_subtree_max_size(node->rb_right) >= length) {
> +			if (get_subtree_max_size(node->rb_right) >= size) {
>   				node = node->rb_right;
>   				continue;
>   			}
> @@ -1192,16 +1188,30 @@ find_vmap_lowest_match(unsigned long size,
>   			/*
>   			 * OK. We roll back and find the first right sub-tree,
>   			 * that will satisfy the search criteria. It can happen
> -			 * only once due to "vstart" restriction.
> +			 * due to "vstart" restriction or an alignment overhead.
>   			 */
>   			while ((node = rb_parent(node))) {
>   				va = rb_entry(node, struct vmap_area, rb_node);
>   				if (is_within_this_va(va, size, align, vstart))
>   					return va;
>   
> -				if (get_subtree_max_size(node->rb_right) >= length &&
> +				if (get_subtree_max_size(node->rb_right) >= size &&
>   						vstart <= va->va_start) {
> +					/*
> +					 * Shift the vstart forward, so we do not loop over the same
> +					 * sub-tree back and forth. The aim is to continue tree
> +					 * scanning toward higher addresses cutting off previous
> +					 * ones.
> +					 *
> +					 * Please note we update vstart with parent's start address
> +					 * adding "1" because we do not want to enter same sub-tree
> +					 * one more time after it has already been inspected and no
> +					 * suitable free block found there.
> +					 */
> +					vstart = va->va_start + 1;
>   					node = node->rb_right;
> +
> +					/* Scan a sub-tree rooted at "node". */
>   					break;
>   				}
>   			}
> <snip>
> 
> so it handles any alignment and is accurate when it comes to searching the most
> lowest free block when user wants to allocate with a special alignment value.
> 
> Could you please help and test the KASAN use case?


Sure, I'll give it a spin tomorrow! Thanks!
David Hildenbrand Sept. 24, 2021, 12:42 p.m. UTC | #11
On 22.09.21 12:41, Uladzislau Rezki wrote:
> On Wed, Sep 22, 2021 at 10:34:55AM +0200, David Hildenbrand wrote:
>>>> No, that's leaking implementation details to the caller. And no, increasing
>>>> the range and eventually allocating something bigger (e.g., placing a huge
>>>> page where it might not have been possible) is not acceptable for KASAN.
>>>>
>>>> If you're terribly unhappy with this patch,
>>> Sorry to say, but it simply does not make sense.
>>>
>>
>> Let's agree to disagree.
>>
>> find_vmap_lowest_match() is imprecise now and that's an issue for exact
>> allocations. We can either make it fully precise again (eventually degrading
>> allocation performance) or just special-case exact allocations to fix the
>> regression.
>>
>> I decided to go the easy path and do the latter; I do agree that making
>> find_vmap_lowest_match() fully precise again might be preferred -- we could
>> have other allocations failing right now although there are still suitable
>> holes.
>>
>> I briefly thought about performing the search in find_vmap_lowest_match()
>> twice. First, start the search without an extended range, and fall back to
>> the extended range if that search fails. Unfortunately, I think that still
>> won't make the function completely precise due to the way we might miss
>> searching some suitable subtrees.
>>
>>>>
>>>> please suggest something reasonable to fix exact allocations:
>>>> a) Fixes the KASAN use case.
>>>> b) Allows for automatic placement of huge pages for exact allocations.
>>>> c) Doesn't leak implementation details into the caller.
>>>>
>>> I am looking at it.
>>
> I am testing this:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index dcf23d16a308..cdf3bda6313d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1161,18 +1161,14 @@ find_vmap_lowest_match(unsigned long size,
>   {
>   	struct vmap_area *va;
>   	struct rb_node *node;
> -	unsigned long length;
>   
>   	/* Start from the root. */
>   	node = free_vmap_area_root.rb_node;
>   
> -	/* Adjust the search size for alignment overhead. */
> -	length = size + align - 1;
> -
>   	while (node) {
>   		va = rb_entry(node, struct vmap_area, rb_node);
>   
> -		if (get_subtree_max_size(node->rb_left) >= length &&
> +		if (get_subtree_max_size(node->rb_left) >= size &&
>   				vstart < va->va_start) {
>   			node = node->rb_left;
>   		} else {
> @@ -1182,9 +1178,9 @@ find_vmap_lowest_match(unsigned long size,
>   			/*
>   			 * Does not make sense to go deeper towards the right
>   			 * sub-tree if it does not have a free block that is
> -			 * equal or bigger to the requested search length.
> +			 * equal or bigger to the requested search size.
>   			 */
> -			if (get_subtree_max_size(node->rb_right) >= length) {
> +			if (get_subtree_max_size(node->rb_right) >= size) {
>   				node = node->rb_right;
>   				continue;
>   			}
> @@ -1192,16 +1188,30 @@ find_vmap_lowest_match(unsigned long size,
>   			/*
>   			 * OK. We roll back and find the first right sub-tree,
>   			 * that will satisfy the search criteria. It can happen
> -			 * only once due to "vstart" restriction.
> +			 * due to "vstart" restriction or an alignment overhead.
>   			 */
>   			while ((node = rb_parent(node))) {
>   				va = rb_entry(node, struct vmap_area, rb_node);
>   				if (is_within_this_va(va, size, align, vstart))
>   					return va;
>   
> -				if (get_subtree_max_size(node->rb_right) >= length &&
> +				if (get_subtree_max_size(node->rb_right) >= size &&
>   						vstart <= va->va_start) {
> +					/*
> +					 * Shift the vstart forward, so we do not loop over the same
> +					 * sub-tree back and forth. The aim is to continue tree
> +					 * scanning toward higher addresses cutting off previous
> +					 * ones.
> +					 *
> +					 * Please note we update vstart with parent's start address
> +					 * adding "1" because we do not want to enter same sub-tree
> +					 * one more time after it has already been inspected and no
> +					 * suitable free block found there.
> +					 */
> +					vstart = va->va_start + 1;
>   					node = node->rb_right;
> +
> +					/* Scan a sub-tree rooted at "node". */

So the idea is that once we run into a dead end because we took a left 
subtree, we roll back to the next possible right subtree and try again. 
If we run into another dead end, we repeat ... thus, this can now happen 
more than once.

I assume the only implication is that this can now be slower in some 
corner cases with larger alignment, because it might take longer to find 
something suitable. Fair enough.

>   					break;
>   				}
>   			}
> <snip>
> 
> so it handles any alignment and is accurate when it comes to searching for the
> lowest free block when the user wants to allocate with a special alignment value.
> 
> Could you please help and test the KASAN use case?

Just tried it, works just fine with KASAN and makes sense in general, 
thanks!
Uladzislau Rezki Sept. 29, 2021, 2:30 p.m. UTC | #12
>
> So the idea is that once we run into a dead end because we took a left
> subtree, we roll back to the next possible right subtree and try again.
> If we run into another dead end, we repeat ... thus, this can now happen
> more than once.
>
> I assume the only implication is that this can now be slower in some
> corner cases with larger alignment, because it might take longer to find
> something suitable. Fair enough.
>
Yep, your understanding of the tree traversal is correct. If no suitable
block is found in the left sub-tree, we roll back and check the right one.
So the scanning can happen more than once.

I did some performance analysis using the vmalloc test suite to figure
out the performance loss for allocations with a specific alignment. On
that synthetic test I see approx. 30% degradation:

2.225 microseconds vs 1.496 microseconds. That time includes both
vmalloc() and vfree() calls. I do not consider it a big degradation,
but on the other hand we can still adjust the search length for
alignments > one page:

# add it on top of the previous proposal and search for length instead of size
length = align > PAGE_SIZE ? size + align : size;

in that case we solve the KASAN issue and do not introduce a degradation.
For PAGE_SIZE alignment all free blocks are aligned to it anyway. Users
that pass a fixed range the same as the requested size and at the same
time want a special alignment are not considered a common case; we also
do not have such users.

Thoughts?

> >
> > Could you please help and test the KASAN use case?
>
> Just tried it, works just fine with KASAN and makes sense in general,
> thanks!
>
Good!

Sorry for the delay.

--
Uladzislau Rezki
David Hildenbrand Sept. 29, 2021, 2:40 p.m. UTC | #13
On 29.09.21 16:30, Uladzislau Rezki wrote:
>>
>> So the idea is that once we run into a dead end because we took a left
>> subtree, we roll back to the next possible right subtree and try again.
>> If we run into another dead end, we repeat ... thus, this can now happen
>> more than once.
>>
>> I assume the only implication is that this can now be slower in some
>> corner cases with larger alignment, because it might take longer to find
>> something suitable. Fair enough.
>>
> Yep, your understanding of the tree traversal is correct. If no suitable
> block is found in the left sub-tree, we roll back and check the right one.
> So the scanning can happen more than once.
> 
> I did some performance analysis using the vmalloc test suite to figure
> out the performance loss for allocations with a specific alignment. On
> that synthetic test I see approx. 30% degradation:

How realistic is that test case? I assume most alignments we're dealing 
with are:
* 1/PAGE_SIZE
* huge page size (for automatic huge page placing)

> 
> 2.225 microseconds vs 1.496 microseconds. That time includes both
> vmalloc() and vfree() calls. I do not consider it a big degradation,
> but on the other hand we can still adjust the search length for
> alignments > one page:
> 
> # add it on top of the previous proposal and search for length instead of size
> length = align > PAGE_SIZE ? size + align : size;

That will not allow placing huge pages in the case of KASAN. And I 
consider that more important than optimizing a synthetic test :) My 2 cents.
Uladzislau Rezki Sept. 29, 2021, 2:49 p.m. UTC | #14
On Wed, Sep 29, 2021 at 4:40 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.09.21 16:30, Uladzislau Rezki wrote:
> >>
> >> So the idea is that once we run into a dead end because we took a left
> >> subtree, we roll back to the next possible right subtree and try again.
> >> If we run into another dead end, we repeat ... thus, this can now happen
> >> more than once.
> >>
> >> I assume the only implication is that this can now be slower in some
> >> corner cases with larger alignment, because it might take longer to find
> >> something suitable. Fair enough.
> >>
> > Yep, your understanding of the tree traversal is correct. If no suitable
> > block is found in the left sub-tree, we roll back and check the right one.
> > So the scanning can happen more than once.
> >
> > I did some performance analysis using the vmalloc test suite to figure
> > out the performance loss for allocations with a specific alignment. On
> > that synthetic test I see approx. 30% degradation:
>
> How realistic is that test case? I assume most alignments we're dealing
> with are:
> * 1/PAGE_SIZE
> * huge page size (for automatic huge page placing)
>
Well, that is a synthetic test. Most of the alignments are 1 or PAGE_SIZE.
There are users which use the internal API where you can specify the
alignment you want, but those are mainly KASAN, module alloc, etc.

> >
> > 2.225 microseconds vs 1.496 microseconds. That time includes both
> > vmalloc() and vfree() calls. I do not consider it a big degradation,
> > but on the other hand we can still adjust the search length for
> > alignments > one page:
> >
> > # add it on top of the previous proposal and search for length instead of size
> > length = align > PAGE_SIZE ? size + align : size;
>
> That will not allow placing huge pages in the case of KASAN. And I
> consider that more important than optimizing a synthetic test :) My 2 cents.
>
Could you please be more specific? I mean, how is this connected with huge
page mappings? Huge pages are those with order > 0. Or do you mean that
special alignments are needed for mapping huge pages?
David Hildenbrand Sept. 29, 2021, 3:05 p.m. UTC | #15
On 29.09.21 16:49, Uladzislau Rezki wrote:
> On Wed, Sep 29, 2021 at 4:40 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 29.09.21 16:30, Uladzislau Rezki wrote:
>>>>
>>>> So the idea is that once we run into a dead end because we took a left
>>>> subtree, we roll back to the next possible right subtree and try again.
>>>> If we run into another dead end, we repeat ... thus, this can now happen
>>>> more than once.
>>>>
>>>> I assume the only implication is that this can now be slower in some
>>>> corner cases with larger alignment, because it might take longer to find
>>>> something suitable. Fair enough.
>>>>
>>> Yep, your understanding of the tree traversal is correct. If no suitable
>>> block is found in the left sub-tree, we roll back and check the right one.
>>> So the scanning can happen more than once.
>>>
>>> I did some performance analysis using the vmalloc test suite to figure
>>> out the performance loss for allocations with a specific alignment. On
>>> that synthetic test I see approx. 30% degradation:
>>
>> How realistic is that test case? I assume most alignments we're dealing
>> with are:
>> * 1/PAGE_SIZE
>> * huge page size (for automatic huge page placing)
>>
> Well, that is a synthetic test. Most of the alignments are 1 or PAGE_SIZE.
> There are users which use the internal API where you can specify the
> alignment you want, but those are mainly KASAN, module alloc, etc.
> 
>>>
>>> 2.225 microseconds vs 1.496 microseconds. That time includes both
>>> vmalloc() and vfree() calls. I do not consider it a big degradation,
>>> but on the other hand we can still adjust the search length for
>>> alignments > one page:
>>>
>>> # add it on top of the previous proposal and search for length instead of size
>>> length = align > PAGE_SIZE ? size + align : size;
>>
>> That will not allow placing huge pages in the case of KASAN. And I
>> consider that more important than optimizing a synthetic test :) My 2 cents.
>>
> Could you please be more specific? I mean, how is this connected with huge
> page mappings? Huge pages are those with order > 0. Or do you mean that
> special alignments are needed for mapping huge pages?

Let me try to clarify:


KASAN does an exact allocation when onlining a memory block, 
__vmalloc_node_range() will try placing huge pages first, increasing the 
alignment to e.g., "1 << PMD_SHIFT".

If we increase the search length in find_vmap_lowest_match(), that 
search will fail if the exact allocation is surrounded by other 
allocations. In that case, we won't place a huge page although we could 
-- because find_vmap_lowest_match() would be imprecise for alignments > 
PAGE_SIZE.


Memory blocks we online/offline on x86 are at least 128MB. The KASAN 
"overhead" we have to allocate is 1/8 of that -- 16 MB, so essentially 8 
huge pages.

__vmalloc_node_range() will increase the alignment to 2MB to try placing 
huge pages first. find_vmap_lowest_match() will search within the given 
exact 16MB range for an 18MB area (size + align), which won't work. So 
__vmalloc_node_range() will fall back to the original PAGE_SIZE alignment 
and shift=PAGE_SHIFT.

__vmalloc_area_node() will set the set_vm_area_page_order effectively to 
0 --  small pages.

Does that make sense or am I missing something?
Uladzislau Rezki Sept. 29, 2021, 4:08 p.m. UTC | #16
> > Could you please be more specific? I mean, how is this connected with huge
> > page mappings? Huge pages are those with order > 0. Or do you mean that
> > special alignments are needed for mapping huge pages?
>
> Let me try to clarify:
>
>
> KASAN does an exact allocation when onlining a memory block,
> __vmalloc_node_range() will try placing huge pages first, increasing the
> alignment to e.g., "1 << PMD_SHIFT".
>
> If we increase the search length in find_vmap_lowest_match(), that
> search will fail if the exact allocation is surrounded by other
> allocations. In that case, we won't place a huge page although we could
> -- because find_vmap_lowest_match() would be imprecise for alignments >
> PAGE_SIZE.
>
>
> Memory blocks we online/offline on x86 are at least 128MB. The KASAN
> "overhead" we have to allocate is 1/8 of that -- 16 MB, so essentially 8
> huge pages.
>
> __vmalloc_node_range() will increase the alignment to 2MB to try placing
> huge pages first. find_vmap_lowest_match() will search within the given
> exact 16MB range for an 18MB area (size + align), which won't work. So
> __vmalloc_node_range() will fall back to the original PAGE_SIZE alignment
> and shift=PAGE_SHIFT.
>
> __vmalloc_area_node() will set the set_vm_area_page_order effectively to
> 0 --  small pages.
>
> Does that make sense or am I missing something?
>
Thank you for the clarification. OK, so we come back anyway to the "problem"
of a fixed range and an exact allocation plus a special alignment > PAGE_SIZE.
Thus KASAN will not make use of huge page mappings and will go with regular
ones instead as a fallback path. But we would like to utilize huge mappings
for KASAN.

I will send the patch you tested and add your "Tested-by" tag. Does that
sound good?

--
Uladzislau Rezki
David Hildenbrand Sept. 29, 2021, 4:10 p.m. UTC | #17
On 29.09.21 18:08, Uladzislau Rezki wrote:
>>> Could you please be more specific? I mean, how is this connected with huge
>>> page mappings? Huge pages are those with order > 0. Or do you mean that
>>> special alignments are needed for mapping huge pages?
>>
>> Let me try to clarify:
>>
>>
>> KASAN does an exact allocation when onlining a memory block,
>> __vmalloc_node_range() will try placing huge pages first, increasing the
>> alignment to e.g., "1 << PMD_SHIFT".
>>
>> If we increase the search length in find_vmap_lowest_match(), that
>> search will fail if the exact allocation is surrounded by other
>> allocations. In that case, we won't place a huge page although we could
>> -- because find_vmap_lowest_match() would be imprecise for alignments >
>> PAGE_SIZE.
>>
>>
>> Memory blocks we online/offline on x86 are at least 128MB. The KASAN
>> "overhead" we have to allocate is 1/8 of that -- 16 MB, so essentially 8
>> huge pages.
>>
>> __vmalloc_node_range() will increase the alignment to 2MB to try placing
>> huge pages first. find_vmap_lowest_match() will search within the given
>> exact 16MB range for an 18MB area (size + align), which won't work. So
>> __vmalloc_node_range() will fall back to the original PAGE_SIZE alignment
>> and shift=PAGE_SHIFT.
>>
>> __vmalloc_area_node() will set the set_vm_area_page_order effectively to
>> 0 --  small pages.
>>
>> Does that make sense or am I missing something?
>>
> Thank you for the clarification. OK, so we come back anyway to the "problem"
> of a fixed range and an exact allocation plus a special alignment > PAGE_SIZE.
> Thus KASAN will not make use of huge page mappings and will go with regular
> ones instead as a fallback path. But we would like to utilize huge mappings
> for KASAN.
> 
> I will send the patch you tested and add your "Tested-by" tag. Does that
> sound good?

Feel free to add

Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d5cd52805149..c6071f5f8de3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1153,7 +1153,8 @@  is_within_this_va(struct vmap_area *va, unsigned long size,
 /*
  * Find the first free block(lowest start address) in the tree,
  * that will accomplish the request corresponding to passing
- * parameters.
+ * parameters. Note that with an alignment > 1, this function
+ * can be imprecise and skip applicable free blocks.
  */
 static __always_inline struct vmap_area *
 find_vmap_lowest_match(unsigned long size,
@@ -1396,7 +1397,15 @@  __alloc_vmap_area(unsigned long size, unsigned long align,
 	enum fit_type type;
 	int ret;
 
-	va = find_vmap_lowest_match(size, align, vstart);
+	/*
+	 * For exact allocations, ignore the alignment, such that
+	 * find_vmap_lowest_match() won't search for a bigger free area just
+	 * able to align later and consequently fail the search.
+	 */
+	if (vend - vstart == size && IS_ALIGNED(vstart, align))
+		va = find_vmap_lowest_match(size, 1, vstart);
+	else
+		va = find_vmap_lowest_match(size, align, vstart);
 	if (unlikely(!va))
 		return vend;