
[RFC] mm/page_reporting: Adjust threshold according to MAX_ORDER

Message ID 20210601033319.100737-1-gshan@redhat.com (mailing list archive)
State New, archived
Series: [RFC] mm/page_reporting: Adjust threshold according to MAX_ORDER

Commit Message

Gavin Shan June 1, 2021, 3:33 a.m. UTC
The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, which is taken
as the minimum order (threshold) to trigger page reporting. With the
following configuration and settings on aarch64, page reporting is never
triggered: in this scenario, it can't fire until a free area of the
largest order (2 ^ (MAX_ORDER - 1)) is formed by page freeing. That
condition is very hard, or even impossible, to meet.

  CONFIG_ARM64_PAGE_SHIFT:              16
  CONFIG_HUGETLB_PAGE:                  Y
  CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
  pageblock_order:                      13
  CONFIG_FORCE_MAX_ZONEORDER:           14
  MAX_ORDER:                            14

The issue can be reproduced in a VM running a kernel with the above
configuration and settings. 'memhog' is used inside the VM to access a
512MB anonymous area, yet QEMU's RSS doesn't drop after 'memhog' exits.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
  -accel kvm -machine virt,gic-version=host                        \
  -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=2048M                    \
  -object memory-backend-ram,id=mem1,size=2048M                    \
  -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
  -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
    :                                                              \
  -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes

Fix the issue by adjusting the threshold to the smaller of
@pageblock_order and (MAX_ORDER / 2). With this applied, QEMU's RSS
drops after 'memhog' exits.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/page_reporting.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

David Hildenbrand June 1, 2021, 8:01 a.m. UTC | #1
On 01.06.21 05:33, Gavin Shan wrote:
> [...]
> 
> This tries to fix the issue by adjusting the threshold to the smaller value
> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
> drops after 'memhog' exits.

IIRC, we use pageblock_order to

a) Reduce the free page reporting overhead. Reporting on small chunks 
can make us report constantly with little system activity.

b) Avoid splitting THP in the hypervisor, avoiding downgraded VM 
performance.

c) Avoid affecting creation of pageblock_order pages while hinting is 
active. I think there are cases where "temporarily pulling sub-pageblock 
pages" can negatively affect creation of pageblock_order pages. 
Concurrent compaction would be one of these cases.

The monstrosity called aarch64 64k is really special in that sense, 
because a) does not apply because pageblocks are just very big, b) 
sometimes does not apply because either our VM isn't backed by (rare) 
512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish 
VMs because we don't really happen to create 512MB THP either way.


For example, going on x86-64 from reporting 2MB to something like 32KB 
is absolutely undesired.

I think if we want to go down that path (and I am not 100% sure yet if 
we want to), we really want to treat only the special case in a special 
way. Note that even when doing it only for aarch64 with 64k, you will 
still end up splitting THP in a hypervisor if it uses 64k base pages 
(b)) and can affect creation of THP, for example, when compacting (c), 
so there is a negative side to that.
Andrew Morton June 2, 2021, 12:03 a.m. UTC | #2
On Tue,  1 Jun 2021 11:33:19 +0800 Gavin Shan <gshan@redhat.com> wrote:

> [...]
> 
> This tries to fix the issue by adjusting the threshold to the smaller value
> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
> drops after 'memhog' exits.
>

Sounds promising, but...

> --- a/mm/page_reporting.h
> +++ b/mm/page_reporting.h
> @@ -10,9 +10,10 @@
>  #include <linux/pgtable.h>
>  #include <linux/scatterlist.h>
>  
> -#define PAGE_REPORTING_MIN_ORDER	pageblock_order
> -
>  #ifdef CONFIG_PAGE_REPORTING
> +#define PAGE_REPORTING_MIN_ORDER	\
> +	min_t(unsigned int, pageblock_order, (MAX_ORDER / 2))
> +
>  DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
>  void __page_reporting_notify(void);

Could you please redo this as a regular old static function in
page_reporting.c?  Bonus points for commenting its design ;)
David Hildenbrand June 14, 2021, 11:03 a.m. UTC | #3
On 11.06.21 09:44, Gavin Shan wrote:
> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>> On 01.06.21 05:33, Gavin Shan wrote:
>>> [...]
>>
>> IIRC, we use pageblock_order to
>>
>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>
>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>
>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>
>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>
>>
>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>
>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>
> 
> [Remove Alexander from the cc list as his mail isn't reachable]
> 

[adding his gmail address which should be the right one]

> David, thanks for your time to review and sorry for the delay and late response.
> I spent some time to get myself familiar with the code, but there are still some
> questions to me, explained as below.
> 
> Yes, @pageblock_order is currently taken as page reporting threshold. It will
> incur more overhead if the threshold is decreased as you said in (a).

Right. Alex did quite some performance/overhead evaluation when 
introducing this feature. Changing the reporting granularity on most 
setups (esp., x86-64) is not desired IMHO.

> 
> This patch tries to decrease the free page reporting threshold. The @pageblock_order
> isn't touched. I don't understand how the code changes affecting THP splitting
> and the creation of page blocks mentioned in (b) and (c). David, could you please
> provide more details?

Think of it like this: while reporting to the hypervisor, we temporarily 
turn free/"movable" pieces part of a pageblock "unmovable" -- see 
__isolate_free_page()->del_page_from_free_list(). While reporting them 
to the hypervisor, these pages are not available and not even marked as 
PageBuddy() anymore.

There are at least two scenarios where this could affect creation of 
free pageblocks I can see:

a. Compaction. While compacting, we might identify completely 
movable/free pageblocks, however, actual compaction on that pageblock 
can fail because some part is temporarily unmovable.

b. Free/alloc sequences. Assume a pageblock is mostly free, except for 
two pages (x and y). Assume the following sequence:

1. free(x)
2. free(y)
3. alloc

Before your change, after 1. and 2. we'll have a free pageblock, and 3. 
won't allocate from that pageblock.

With your change, free page reporting might run after 1. After 2. we 
won't have a free pageblock (until free page reporting finishes), and 3. 
might just reallocate what we freed in 2., preventing a free pageblock.


No idea how relevant both points are in practice, however, the 
fundamental difference to current handling is that we would turn parts 
of pageblocks temporarily unmovable, instead of complete pageblocks.
Alexander Duyck June 15, 2021, 2:26 a.m. UTC | #4
On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.06.21 09:44, Gavin Shan wrote:
> > On 6/1/21 6:01 PM, David Hildenbrand wrote:
> >> On 01.06.21 05:33, Gavin Shan wrote:
> >>> [...]
> >>
> >
> > [Remove Alexander from the cc list as his mail isn't reachable]
> >
>
> [adding his gmail address which should be the right one]
>
> > David, thanks for your time to review and sorry for the delay and late response.
> > I spent some time to get myself familiar with the code, but there are still some
> > questions to me, explained as below.
> >
> > Yes, @pageblock_order is currently taken as page reporting threshold. It will
> > incur more overhead if the threshold is decreased as you said in (a).
>
> Right. Alex did quite some performance/overhead evaluation when
> introducing this feature. Changing the reporting granularity on most
> setups (esp., x86-64) is not desired IMHO.

Yes, generally reporting pages comes at a fairly high cost so it is
important to find the right trade-off between the size of the page and
the size of the batch of pages being reported. If the size of the
pages is reduced, it may be important to increase the batch size in
order to avoid paying too much in the way of overhead.

The other main reason for holding to pageblock_order on x86 is to
avoid THP splitting. Anything smaller than pageblock_order will
trigger THP splitting which will significantly hurt the performance of
the VM in general as it forces it down to order 0 pages.
Gavin Shan June 16, 2021, 1:53 a.m. UTC | #5
On 6/14/21 9:03 PM, David Hildenbrand wrote:
> On 11.06.21 09:44, Gavin Shan wrote:
>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>> [...]
>>>
>>
>> [Remove Alexander from the cc list as his mail isn't reachable]
>>
> 
> [adding his gmail address which should be the right one]
> 
>> David, thanks for your time to review and sorry for the delay and late response.
>> I spent some time to get myself familiar with the code, but there are still some
>> questions to me, explained as below.
>>
>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>> incur more overhead if the threshold is decreased as you said in (a).
> 
> Right. Alex did quite some performance/overhead evaluation when introducing this feature. Changing the reporting granularity on most setups (esp., x86-64) is not desired IMHO.
> 

Thanks for adding Alex's correct mail address, David.

>>
>> This patch tries to decrease the free page reporting threshold. The @pageblock_order
>> isn't touched. I don't understand how the code changes affecting THP splitting
>> and the creation of page blocks mentioned in (b) and (c). David, could you please
>> provide more details?
> 
> Think of it like this: while reporting to the hypervisor, we temporarily turn free/"movable" pieces part of a pageblock "unmovable" -- see __isolate_free_page()->del_page_from_free_list(). While reporting them to the hypervisor, these pages are not available and not even marked as PageBuddy() anymore.
> 
> There are at least two scenarios where this could affect creation of free pageblocks I can see:
> 
> a. Compaction. While compacting, we might identify completely movable/free pageblocks, however, actual compaction on that pageblock can fail because some part is temporarily unmovable.
> 
> b. Free/alloc sequences. Assume a pageblocks is mostly free, except two pages (x and y). Assume the following sequence:
> 
> 1. free(x)
> 2. free(y)
> 3. alloc
> 
> Before your change, after 1. and 2. we'll have a free pageblock. 3 won't allocate from that pageblock.
> 
> With your change, free page reporting might run after 1. After 2, we'll not have a free pageblock (until free page reporting finished), and 3. might just reallocate what we freed in 2 and prevent having a free pageblock.
> 
> 
> No idea how relevant both points are in practice, however, the fundamental difference to current handling is that we would turn parts of pageblocks temporarily unmovable, instead of complete pageblocks.
> 

Thank you for the details. Without my changes, where the page reporting
threshold is @pageblock_order, the whole page block can become 'movable'
from 'unmovable'. I don't think that's what we want, but I need Alex to
confirm.

If we needn't change the page block's migration type in page reporting,
I guess I need an additional parameter for __isolate_free_page() so that
the migration type won't be changed by page reporting.

For (a), the 'movable' and 'unmovable' types are maintained at page
block granularity, so it seems the pages in one page block can't have
different migration types, unless I've missed something.

For (b), the scenario is possible. It means the changed page reporting
threshold could affect the page allocator's behaviour, which could
introduce more fragmentation. However, it really depends on how the
memory is allocated.

Thanks,
Gavin
David Hildenbrand June 16, 2021, 7:59 a.m. UTC | #6
On 16.06.21 03:53, Gavin Shan wrote:
> On 6/14/21 9:03 PM, David Hildenbrand wrote:
>> On 11.06.21 09:44, Gavin Shan wrote:
>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>> [...]
> 
> Thank you for the details. Without my changes and the page reporting threshold
> is @pageblock_order, the whole page block can become 'movable' from 'unmovable'.
> I don't think it's what we want, but I need Alex's confirm.

__isolate_free_page() will set the pageblock MIGRATE_MOVABLE in that 
case. It's only temporarily unmovable, while we're hinting.

Note that MOVABLE vs. UNMOVABLE is just grouping for free pages, and 
even setting it to the wrong migratetype isn't "wrong" as in 
"correctness". It doesn't make a difference if there are no free pages 
because the whole block is isolated.
David Hildenbrand June 16, 2021, 8:03 a.m. UTC | #7
On 16.06.21 11:10, Gavin Shan wrote:
> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>> [...]
>>>>>
>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>
>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>
>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>
>>>>>
>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>
>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>
>>>>
>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>
>>>
>>> [adding his gmail address which should be the right one]
>>>
>>>> David, thanks for your time reviewing, and sorry for the delayed response.
>>>> I spent some time getting familiar with the code, but I still have some
>>>> questions, explained below.
>>>>
>>>> Yes, @pageblock_order is currently taken as the page reporting threshold.
>>>> Decreasing the threshold will incur more overhead, as you said in (a).
>>>
>>> Right. Alex did quite some performance/overhead evaluation when
>>> introducing this feature. Changing the reporting granularity on most
>>> setups (esp., x86-64) is not desired IMHO.
>>
>> Yes, generally reporting pages comes at a fairly high cost so it is
>> important to find the right trade-off between the size of the page and
>> the size of the batch of pages being reported. If the size of the
>> pages is reduced, it may be important to increase the batch size in
>> order to avoid paying too much in the way of overhead.
>>
>> The other main reason for holding to pageblock_order on x86 is to
>> avoid THP splitting. Anything smaller than pageblock_order will
>> trigger THP splitting which will significantly hurt the performance of
>> the VM in general as it forces it down to order 0 pages.
>>
> 
> Alex, thanks for your reply, and sorry for taking up your time with this
> discussion.
> 
> Could you please confirm whether you mean PAGE_REPORTING_CAPACITY or the
> budget used in page_reporting_cycle() when you talk about the "batch"?
> 
> I don't understand how the THP splitting is triggered. As I understand it,
> free page reporting works like this: the free pages are reported back to
> QEMU through virtio-balloon, and madvise(DONTNEED) is called on them. For
> a THP-backed VMA, the memory is guaranteed to be physically 2MB contiguous,
> and the 2MB chunk is freed at once as I understand it. I don't think the
> 2MB memory can be freed partially, apart from THP splitting for other
> reasons, one of which is memory reclaim and swapping. However, THP
> splitting caused by memory reclaim is irrelevant to what you were saying.


Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you 
madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page 
back to the buddy. That virtual memory region is no longer backed by a 
physically contiguous 2 MB page and there is no huge mapping in the page 
tables. Instead, there is now an "ordinary" 4k mapping with a hole. Bad 
for performance.
Gavin Shan June 16, 2021, 9:10 a.m. UTC | #8
On 6/15/21 12:26 PM, Alexander Duyck wrote:
> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>>
>> Right. Alex did quite some performance/overhead evaluation when
>> introducing this feature. Changing the reporting granularity on most
>> setups (esp., x86-64) is not desired IMHO.
> 
> Yes, generally reporting pages comes at a fairly high cost so it is
> important to find the right trade-off between the size of the page and
> the size of the batch of pages being reported. If the size of the
> pages is reduced, it may be important to increase the batch size in
> order to avoid paying too much in the way of overhead.
> 
> The other main reason for holding to pageblock_order on x86 is to
> avoid THP splitting. Anything smaller than pageblock_order will
> trigger THP splitting which will significantly hurt the performance of
> the VM in general as it forces it down to order 0 pages.
> 

Alex, thanks for your reply, and sorry for taking up your time with this
discussion.

Could you please confirm whether you mean PAGE_REPORTING_CAPACITY or the
budget used in page_reporting_cycle() when you talk about the "batch"?

I don't understand how the THP splitting is triggered. As I understand it,
free page reporting works like this: the free pages are reported back to
QEMU through virtio-balloon, and madvise(DONTNEED) is called on them. For a
THP-backed VMA, the memory is guaranteed to be physically 2MB contiguous,
and the 2MB chunk is freed at once as I understand it. I don't think the
2MB memory can be freed partially, apart from THP splitting for other
reasons, one of which is memory reclaim and swapping. However, THP splitting
caused by memory reclaim is irrelevant to what you were saying.

Thanks,
Gavin
David Hildenbrand June 16, 2021, 11:15 a.m. UTC | #9
On 16.06.21 14:59, Gavin Shan wrote:
> On 6/16/21 5:59 PM, David Hildenbrand wrote:
>> [...]
>>
>> __isolate_free_page() will set the pageblock MIGRATE_MOVABLE in that case. It's only temporarily unmovable, while we're hinting.
>>
>> Note that MOVABLE vs. UNMOVABLE is just grouping for free pages, and even setting it to the wrong migratetype isn't "wrong" as in "correctness". It doesn't make a difference if there are no free pages because the whole block is isolated.
>>
> 
> Yes, it doesn't matter since these pages have been isolated. The migration type is changed
> to MIGRATE_MOVABLE in __isolate_free_page(). My questions are actually:
> 
> (1) Is it possible for the migration type to be changed from MIGRATE_UNMOVABLE to
>       MIGRATE_MOVABLE in __isolate_free_page()?

Yes, if the isolated page covers at least half the pageblock: either we 
isolate the complete pageblock (as it's free, there is nothing unmovable) 
or at least half of it. The latter seems to be a heuristic that says if 
it's half-free, make it MIGRATE_MOVABLE -- maybe because that increases 
the chances that we get a completely movable pageblock later (I would 
have to look into the details).

> (2) After the free page reporting is completed, is the migration type restored to MIGRATE_UNMOVABLE?

No, I don't think so. And it also doesn't make too much sense if we 
decided when isolating that we're better off using MIGRATE_MOVABLE. 
After all, we're just putting back a free page we previously isolated 
from the free lists.
David Hildenbrand June 16, 2021, 11:20 a.m. UTC | #10
On 16.06.21 15:16, Gavin Shan wrote:
> On 6/16/21 6:03 PM, David Hildenbrand wrote:
>> [...]
>>
>> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page back to the buddy. That virtual memory region is no longer backed by a physically contiguous 2 MB page and there is no huge mapping in the page tables. Instead, there is now an "ordinary" 4k mapping with a hole. Bad for performance.
>>
> 
> Ok, now I see, and thanks for your explanation. In this regard, we need to
> guarantee that the page reporting threshold is larger than or equal to the
> THP size: 2MB with a 4KB base page size, or 512MB with 64KB.
> 
> That makes the issue hard to fix, as we have a 512MB THP size with a 64KB
> base page size on arm64. The following configurations are used in this case.
> 
>      CONFIG_FORCE_MAX_ZONEORDER          14
>      MAX_ORDER                           14
>      pageblock_order                     13
> 
> Free page reporting won't start until page freeing produces a 512MB free
> area. On a system with limited memory (e.g. 4GB), a 512MB free area is hard
> to come by due to memory fragmentation.
> 

FWIW, in an ideal world the hypervisor would tell us (guest) which 
granularity it prefers. Could be that the hypervisor is using a 
different page size / thp size ... but that's a different story :)

From a guest POV, it usually makes sense to report only whole 
pageblocks. But as identified, this is an issue with abnormally large 
pageblocks.

512MB pageblocks / THP is simply far from ideal for VMs, kindly phrased. :)
David Hildenbrand June 16, 2021, 12:07 p.m. UTC | #11
> Indeed. 512MB pageblocks are rare, especially on systems which have been
> up and running for a long time.
> 
> Free page reporting starts from the guest. Taking an extreme case: the guest
> has 512MB of memory and it's backed by one THP on the host. Free page
> reporting won't work at all.
> 
> Besides, it seems free page reporting isn't guaranteed to work all the time.
> For example, on a system with a 4KB base page size, freeing individual 4KB
> pages may not produce a free 2MB pageblock due to fragmentation. In that
> case, the freed page won't be reported immediately, but might be reported
> later after swapping or compaction under memory pressure.

Exactly, it's a pure optimization that won't always work, especially when guest 
memory is heavily fragmented. There has to be a balance between 
reclaiming free memory in the hypervisor, degrading VM performance, and 
overhead of the feature.

Further, there are no guarantees when a VM will reuse the memory again. 
In the worst case, all VMs that reported free pages reuse memory at the 
same time. In that case, one definitely needs sufficient backend memory 
in the hypervisor (-> swap) to not run out of memory, and performance 
will be degraded.

As MST once phrased it, if the feature has a higher overhead than 
swapping in the hypervisor, it's of little use.

> 
> David, how about taking your suggestion to have a different threshold only
> for arm64 (64KB base page size)? The threshold will be smaller than
> pageblock_order for sure. There are two ways to do so; please let me know
> which is the preferred way to go if you (and Alex) agree to do it.
> 
> (a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual archs to choose the
>       value. The threshold falls back to pageblock_order if it isn't configured.
> (b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. Archs can decide
>       its value. If it's not provided by the arch, it falls back to pageblock_order.
> 

I wonder if we could further define it as a (module/cmdline) parameter 
and make it configurable when booting. The default could then be set 
based on CONFIG_PAGE_REPORTING_ORDER, which would default to 
pageblock_order (if easily possible) and could be special-cased for 
arm64 with 64k.

> By the way, I recently did some performance testing with different page sizes.
> We get much more performance gain from a 64KB (vs 4KB) page size in the guest
> than from 512MB (vs 2MB) THP on the host. It means the performance won't be
> affected too much even if the 512MB THP is split on the arm64 host.

Yes, if one is even able to get 512MB THP populated in the hypervisor -- 
because once again, 512MB THP are just a bad fit for many workloads.
Gavin Shan June 16, 2021, 12:59 p.m. UTC | #12
On 6/16/21 5:59 PM, David Hildenbrand wrote:
> On 16.06.21 03:53, Gavin Shan wrote:
>> On 6/14/21 9:03 PM, David Hildenbrand wrote:
>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>> minimal order (threshold) to trigger page reporting. The page reporting
>>>>>> is never triggered with the following configurations and settings on
>>>>>> aarch64. In the particular scenario, the page reporting won't be triggered
>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
>>>>>> page freeing. The condition is very hard, or even impossible to be met.
>>>>>>
>>>>>>      CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>      CONFIG_HUGETLB_PAGE:                  Y
>>>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>      pageblock_order:                      13
>>>>>>      CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>      MAX_ORDER:                            14
>>>>>>
>>>>>> The issue can be reproduced in VM, running kernel with above configurations
>>>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
>>>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>
>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>      -accel kvm -machine virt,gic-version=host                        \
>>>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>      -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>      -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>        :                                                              \
>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>
>>>>>> This tries to fix the issue by adjusting the threshold to the smaller value
>>>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
>>>>>> drops after 'memhog' exits.
>>>>>
>>>>> IIRC, we use pageblock_order to
>>>>>
>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>>>>
>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>
>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>
>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>
>>>>>
>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>
>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>
>>>>
>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>
>>>
>>> [adding his gmail address which should be the right one]
>>>
>>>> David, thanks for your time to review and sorry for the delay and late response.
>>>> I spent some time to get myself familiar with the code, but there are still some
>>>> questions to me, explained as below.
>>>>
>>>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>>>> incur more overhead if the threshold is decreased as you said in (a).
>>>
>>> Right. Alex did quite some performance/overhead evaluation when introducing this feature. Changing the reporting granularity on most setups (esp., x86-64) is not desired IMHO.
>>>
>>
>> Thanks for adding Alex's correct mail address, David.
>>
>>>>
>>>> This patch tries to decrease the free page reporting threshold. The @pageblock_order
>>>> isn't touched. I don't understand how the code changes affecting THP splitting
>>>> and the creation of page blocks mentioned in (b) and (c). David, could you please
>>>> provide more details?
>>>
>>> Think of it like this: while reporting to the hypervisor, we temporarily turn free/"movable" pieces part of a pageblock "unmovable" -- see __isolate_free_page()->del_page_from_free_list(). While reporting them to the hypervisor, these pages are not available and not even marked as PageBuddy() anymore.
>>>
>>> There are at least two scenarios where this could affect creation of free pageblocks I can see:
>>>
>>> a. Compaction. While compacting, we might identify completely movable/free pageblocks, however, actual compaction on that pageblock can fail because some part is temporarily unmovable.
>>>
>>> b. Free/alloc sequences. Assume a pageblocks is mostly free, except two pages (x and y). Assume the following sequence:
>>>
>>> 1. free(x)
>>> 2. free(y)
>>> 3. alloc
>>>
>>> Before your change, after 1. and 2. we'll have a free pageblock. 3 won't allocate from that pageblock.
>>>
>>> With your change, free page reporting might run after 1. After 2, we'll not have a free pageblock (until free page reporting finished), and 3. might just reallocate what we freed in 2 and prevent having a free pageblock.
>>>
>>>
>>> No idea how relevant both points are in practice, however, the fundamental difference to current handling is that we would turn parts of pageblocks temporarily unmovable, instead of complete pageblocks.
>>>
>>
>> Thank you for the details. Without my changes and the page reporting threshold
>> is @pageblock_order, the whole page block can become 'movable' from 'unmovable'.
>> I don't think it's what we want, but I need Alex's confirm.
> 
> __isolate_free_page() will set the pageblock MIGRATE_MOVABLE in that case. It's only temporarily unmovable, while we're hinting.
> 
> Note that MOVABLE vs. UNMOVABLE is just grouping for free pages, and even setting it to the wrong migratetype isn't "wrong" as in "correctness". It doesn't make a difference if there are no free pages because the whole block is isolated.
> 

Yes, it doesn't matter since these pages have been isolated. The migration type is changed to MIGRATE_MOVABLE
in __isolate_free_page(). My questions are actually:

(1) Is it possible for the migration type to be changed from MIGRATE_UNMOVABLE to MIGRATE_MOVABLE
     in __isolate_free_page()?
(2) After free page reporting completes, is the migration type restored to MIGRATE_UNMOVABLE?

Thanks,
Gavin
Gavin Shan June 16, 2021, 1:16 p.m. UTC | #13
On 6/16/21 6:03 PM, David Hildenbrand wrote:
> On 16.06.21 11:10, Gavin Shan wrote:
>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>> minimal order (threshold) to trigger page reporting. The page reporting
>>>>>>> is never triggered with the following configurations and settings on
>>>>>>> aarch64. In the particular scenario, the page reporting won't be triggered
>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
>>>>>>> page freeing. The condition is very hard, or even impossible to be met.
>>>>>>>
>>>>>>>       CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>       CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>       CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>       pageblock_order:                      13
>>>>>>>       CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>       MAX_ORDER:                            14
>>>>>>>
>>>>>>> The issue can be reproduced in VM, running kernel with above configurations
>>>>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
>>>>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>
>>>>>>>       /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>       -accel kvm -machine virt,gic-version=host                        \
>>>>>>>       -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>       -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>       -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>       -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>       -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>         :                                                              \
>>>>>>>       -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>
>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller value
>>>>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
>>>>>>> drops after 'memhog' exits.
>>>>>>
>>>>>> IIRC, we use pageblock_order to
>>>>>>
>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>>>>>
>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>>
>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>>
>>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>>
>>>>>>
>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>>
>>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>>
>>>>>
>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>
>>>>
>>>> [adding his gmail address which should be the right one]
>>>>
>>>>> David, thanks for your time to review and sorry for the delay and late response.
>>>>> I spent some time to get myself familiar with the code, but there are still some
>>>>> questions to me, explained as below.
>>>>>
>>>>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>>>>> incur more overhead if the threshold is decreased as you said in (a).
>>>>
>>>> Right. Alex did quite some performance/overhead evaluation when
>>>> introducing this feature. Changing the reporting granularity on most
>>>> setups (esp., x86-64) is not desired IMHO.
>>>
>>> Yes, generally reporting pages comes at a fairly high cost so it is
>>> important to find the right trade-off between the size of the page and
>>> the size of the batch of pages being reported. If the size of the
>>> pages is reduced it maybe important to increase the batch size in
>>> order to avoid paying too much in the way of overhead.
>>>
>>> The other main reason for holding to pageblock_order on x86 is to
>>> avoid THP splitting. Anything smaller than pageblock_order will
>>> trigger THP splitting which will significantly hurt the performance of
>>> the VM in general as it forces it down to order 0 pages.
>>>
>>
>> Alex, Thanks for your reply and sorry for taking your time to this
>> discussion.
>>
>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>> used in page_reporting_cycle() when you're talking about "batch"?
>>
>> I don't understand how the THP splitting is triggered. As I understood,
>> the free page reporting works like this: the free pages are reported
>> back to QEMU through virtio-balloon, madvise(DONTNEED) is called on the
>> free pages. For THP related VMA, its memory is guranteed to be physically
>> 2MB contiguous. The 2MB memory is free'd at once as I understand. I
>> don't think the 2MB memory can be free'd partially apart from the THP
>> splitting caused by some reasons. One of the reason is memory reclaim
>> and swapping. However, I think the THP splitting is caused by memory
>> reclaim is irrelevant to what you were saying.
> 
> 
> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page back to the buddy. That virtual memory region is no longer backed by a physically contiguous 2 MB page and there is no huge mapping in the page tables. Instead, there is now an "ordinary" 4k mapping with a hole. Bad for performance.
> 

Ok, now I see, and thanks for your explanation. In this regard, we need to guarantee
that the page reporting threshold is larger than or equal to the THP size. The THP size
is 2MB with a 4KB base page size, or 512MB with a 64KB base page size.

That makes the issue hard to fix, as we have a 512MB THP size with a 64KB base page
size on arm64. The following configuration is used in this case.

    CONFIG_FORCE_MAX_ZONEORDER          14
    MAX_ORDER                           14
    pageblock_order                     13

Free page reporting won't start until page freeing produces a 512MB free area. On a
system with limited memory (e.g. 4GB), a 512MB free area is hard to come by due to
memory fragmentation.

Thanks,
Gavin
Gavin Shan June 16, 2021, 1:58 p.m. UTC | #14
On 6/16/21 9:20 PM, David Hildenbrand wrote:
> On 16.06.21 15:16, Gavin Shan wrote:
>> On 6/16/21 6:03 PM, David Hildenbrand wrote:
>>> On 16.06.21 11:10, Gavin Shan wrote:
>>>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>>>> minimal order (threshold) to trigger page reporting. The page reporting
>>>>>>>>> is never triggered with the following configurations and settings on
>>>>>>>>> aarch64. In the particular scenario, the page reporting won't be triggered
>>>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
>>>>>>>>> page freeing. The condition is very hard, or even impossible to be met.
>>>>>>>>>
>>>>>>>>>        CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>>>        CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>>>        CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>>>        pageblock_order:                      13
>>>>>>>>>        CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>>>        MAX_ORDER:                            14
>>>>>>>>>
>>>>>>>>> The issue can be reproduced in VM, running kernel with above configurations
>>>>>>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
>>>>>>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>>>
>>>>>>>>>        /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>>>        -accel kvm -machine virt,gic-version=host                        \
>>>>>>>>>        -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>>>        -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>>>        -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>>>        -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>>>        -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>>>          :                                                              \
>>>>>>>>>        -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>>>
>>>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller value
>>>>>>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
>>>>>>>>> drops after 'memhog' exits.
>>>>>>>>
>>>>>>>> IIRC, we use pageblock_order to
>>>>>>>>
>>>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>>>>>>>
>>>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>>>>
>>>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>>>>
>>>>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>>>>
>>>>>>>>
>>>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>>>>
>>>>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>>>>
>>>>>>>
>>>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>>>
>>>>>>
>>>>>> [adding his gmail address which should be the right one]
>>>>>>
>>>>>>> David, thanks for your time to review and sorry for the delay and late response.
>>>>>>> I spent some time to get myself familiar with the code, but there are still some
>>>>>>> questions to me, explained as below.
>>>>>>>
>>>>>>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>>>>>>> incur more overhead if the threshold is decreased as you said in (a).
>>>>>>
>>>>>> Right. Alex did quite some performance/overhead evaluation when
>>>>>> introducing this feature. Changing the reporting granularity on most
>>>>>> setups (esp., x86-64) is not desired IMHO.
>>>>>
>>>>> Yes, generally reporting pages comes at a fairly high cost so it is
>>>>> important to find the right trade-off between the size of the page and
>>>>> the size of the batch of pages being reported. If the size of the
>>>>> pages is reduced it maybe important to increase the batch size in
>>>>> order to avoid paying too much in the way of overhead.
>>>>>
>>>>> The other main reason for holding to pageblock_order on x86 is to
>>>>> avoid THP splitting. Anything smaller than pageblock_order will
>>>>> trigger THP splitting which will significantly hurt the performance of
>>>>> the VM in general as it forces it down to order 0 pages.
>>>>>
>>>>
>>>> Alex, Thanks for your reply and sorry for taking your time to this
>>>> discussion.
>>>>
>>>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>>>> used in page_reporting_cycle() when you're talking about "batch"?
>>>>
>>>> I don't understand how the THP splitting is triggered. As I understood,
>>>> the free page reporting works like this: the free pages are reported
>>>> back to QEMU through virtio-balloon, madvise(DONTNEED) is called on the
>>>> free pages. For THP related VMA, its memory is guranteed to be physically
>>>> 2MB contiguous. The 2MB memory is free'd at once as I understand. I
>>>> don't think the 2MB memory can be free'd partially apart from the THP
>>>> splitting caused by some reasons. One of the reason is memory reclaim
>>>> and swapping. However, I think the THP splitting is caused by memory
>>>> reclaim is irrelevant to what you were saying.
>>>
>>>
>>> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page back to the buddy. That virtual memory region is no longer backed by a physically contiguous 2 MB page and there is no huge mapping in the page tables. Instead, there is now an "ordinary" 4k mapping with a hole. Bad for performance.
>>>
>>
>> Ok, now I see and thanks for your explanation. In this regard, we need gurantee
>> the page reporting threshold is larger or equal to THP size. The THP size is 2MB
>> or 512MB if base page size is 4KB or 64KB.
>>
>> It makes the issue hard to be fixed as we have 512MB THP size with 64KB base page
>> size on arm64. The following configurations are used in this case.
>>
>>      CONFIG_FORCE_MAX_ZONEORDER          14
>>      MAX_ORDER                           14
>>      pageblock_order                     13
>>
>> The free page reporting won't be started until the page freeing comes up with 512MB
>> free area. On system, which has limited memory (e.g. 4GB), 512MB free area is hard
>> to have due to memory fragmentation.
>>
> 
> FWIW, in an ideal world the hypervisor would tell us (guest) which granularity it prefers. Could be that the hypervisor is using a different page size / thp size ... but that's a different story :)
> 
>  From a guest POV, it usually makes sense to report only whole pageblocks. But as identified, this is an issue with abnormally large pageblocks.
> 
> 512MB pageblocks / THP is simply far from ideal for VMs, kindly phrased. :)
> 

Indeed. Free 512MB pageblocks are rare, especially on systems that have been
up and running for a long time.

Free page reporting starts from the guest. Taking an extreme case: the guest has
512MB of memory and it's backed by one THP on the host. Free page reporting won't
work at all.

Besides, it seems free page reporting isn't guaranteed to work all the time.
For example, on a system with a 4KB base page size, freeing individual 4KB pages
may never produce a free 2MB pageblock due to fragmentation. In that case, the
freed pages won't be reported immediately; they might only be reported after
swapping or compaction triggered by memory pressure.

David, how about taking your suggestion to have a different threshold only for
arm64 with a 64KB base page size? The threshold will certainly be smaller than
pageblock_order. There are two ways to do so, and please let me know which is the
preferred way to go if you (and Alex) agree to do it.

(a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual archs to choose the
     value. The threshold falls back to pageblock_order if it isn't configured.
(b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. Archs can decide
     its value. If it's not provided by the arch, it falls back to pageblock_order.

By the way, I recently did some performance testing with different page sizes.
We get much more performance gain from a 64KB (vs 4KB) page size in the guest than
from 512MB (vs 2MB) THP on the host. It means performance won't be affected too
much even if the 512MB THP is split on the arm64 host.

Thanks,
Gavin
Alexander Duyck June 16, 2021, 2:15 p.m. UTC | #15
On Wed, Jun 16, 2021 at 12:10 AM Gavin Shan <gshan@redhat.com> wrote:
>
> On 6/15/21 12:26 PM, Alexander Duyck wrote:
> > On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
> >> On 11.06.21 09:44, Gavin Shan wrote:
> >>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
> >>>> On 01.06.21 05:33, Gavin Shan wrote:
> >>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
> >>>>> minimal order (threshold) to trigger page reporting. The page reporting
> >>>>> is never triggered with the following configurations and settings on
> >>>>> aarch64. In the particular scenario, the page reporting won't be triggered
> >>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
> >>>>> page freeing. The condition is very hard, or even impossible to be met.
> >>>>>
> >>>>>      CONFIG_ARM64_PAGE_SHIFT:              16
> >>>>>      CONFIG_HUGETLB_PAGE:                  Y
> >>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
> >>>>>      pageblock_order:                      13
> >>>>>      CONFIG_FORCE_MAX_ZONEORDER:           14
> >>>>>      MAX_ORDER:                            14
> >>>>>
> >>>>> The issue can be reproduced in VM, running kernel with above configurations
> >>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
> >>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
> >>>>>
> >>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> >>>>>      -accel kvm -machine virt,gic-version=host                        \
> >>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
> >>>>>      -object memory-backend-ram,id=mem0,size=2048M                    \
> >>>>>      -object memory-backend-ram,id=mem1,size=2048M                    \
> >>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
> >>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
> >>>>>        :                                                              \
> >>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> >>>>>
> >>>>> This tries to fix the issue by adjusting the threshold to the smaller value
> >>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
> >>>>> drops after 'memhog' exits.
> >>>>
> >>>> IIRC, we use pageblock_order to
> >>>>
> >>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
> >>>>
> >>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
> >>>>
> >>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
> >>>>
> >>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
> >>>>
> >>>>
> >>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
> >>>>
> >>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
> >>>>
> >>>
> >>> [Remove Alexander from the cc list as his mail isn't reachable]
> >>>
> >>
> >> [adding his gmail address which should be the right one]
> >>
> >>> David, thanks for your time to review and sorry for the delay and late response.
> >>> I spent some time to get myself familiar with the code, but there are still some
> >>> questions to me, explained as below.
> >>>
> >>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
> >>> incur more overhead if the threshold is decreased as you said in (a).
> >>
> >> Right. Alex did quite some performance/overhead evaluation when
> >> introducing this feature. Changing the reporting granularity on most
> >> setups (esp., x86-64) is not desired IMHO.
> >
> > Yes, generally reporting pages comes at a fairly high cost so it is
> > important to find the right trade-off between the size of the page and
> > the size of the batch of pages being reported. If the size of the
> > pages is reduced it maybe important to increase the batch size in
> > order to avoid paying too much in the way of overhead.
> >
> > The other main reason for holding to pageblock_order on x86 is to
> > avoid THP splitting. Anything smaller than pageblock_order will
> > trigger THP splitting which will significantly hurt the performance of
> > the VM in general as it forces it down to order 0 pages.
> >
>
> Alex, Thanks for your reply and sorry for taking your time to this
> discussion.
>
> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
> used in page_reporting_cycle() when you're talking about "batch"?

Yes, when I refer to batch it is how many pages we are processing in a
single call. That is limited by PAGE_REPORTING_CAPACITY.
Gavin Shan June 21, 2021, 5:16 a.m. UTC | #16
On 6/16/21 10:07 PM, David Hildenbrand wrote:
>> Indeed. 512MB pageblocks are rare, especially on systems which have been
>> up and running for long time.
>>
>> The free page reporting starts from guest. Taking an extreme case: guest has
>> 512MB memory and it's backed by one THP on host. The free page reporting won't
>> work at all.
>>
>> Besides, it seems free page reporting isn't guranteed to work all the time.
>> For example, on system where we have 4KB base page size. Freeing individual
>> 4KB pages can't come up with a free 2MB pageblock due to fragmentation.
>> In this case, the free'd page won't be reported immediately, but might be
>> reported after swapping or compaction due to memory pressure. The free page
>> isn't reported immediately at least.
> 
> Exactly, it's a pure optimization that won't work, especially when guest memory is heavily fragmented. There has to be a balance between reclaiming free memory in the hypervisor, degrading VM performance, and overhead of the feature.
> 
> Further, there are no guarantees when a VM will reuse the memory again. In the worst case, all VMs that reported free pages reuse memory at the same time. In that case, one definitely needs sufficient backend memory in the hypervisor (-> swap) to not run out of memory, and performance will be degraded.
> 
> As MST once phrased it, if the feature has a higher overhead than swapping in the hypervisor, it's of little use.
> 

Thanks for the explanation, and sorry again for the late response, David. I took
last week off as holiday and didn't work much.

However, it's nice to have unused pages returned to the host. These pages
can be used by other VMs or applications running on the host.

>>
>> David, how about taking your suggestion to have different threshold size only
>> for arm64 (64KB base page size). The threshold will be smaller than pageblock_order
>> for sure. There are two ways to do so and please let me know which is the preferred
>> way to go if you (and Alex) agree to do it.
>>
>> (a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual archs to choose the
>>       value. The threshold falls back to pageblock_order if isn't configurated.
>> (b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. archs can decide
>>       its value. If it's not provided by arch, it falls back to pageblock_order.
>>
> 
> I wonder if we could further define it as a (module/cmdline) parameter and make it configurable when booting. The default could then be set based on CONFIG_PAGE_REPORTING_ORDER. CONFIG_PAGE_REPORTING_ORDER would default to pageblock_order (if easily possible) and could be special-cases to arm64 with 64k.
> 

The formal patches have been posted for review. I used the macro PAGE_REPORTING_ORDER
instead of CONFIG_PAGE_REPORTING_ORDER. The page reporting order (threshold)
is also exposed as a module parameter, as you suggested.
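
Roughly along these lines (a kernel-side sketch of the idea, not the posted patch
verbatim — the parameter name and default are assumptions):

```c
/* Expose the reporting threshold as a module parameter with a
 * compile-time default; writable at runtime via sysfs (0644). */
static unsigned int page_reporting_order = PAGE_REPORTING_ORDER;
module_param(page_reporting_order, uint, 0644);
MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
```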

>> By the way, I recently had some performance testing on different page sizes.
>> We get much more performance gain from 64KB (vs 4KB) page size in guest than
>> 512MB (vs 2MB) THP on host. It means the performance won't be affected too
>> much even the 512MB THP is splitted on arm64 host.
> 
> Yes, if one is even able to get 512MB THP populated in the hypervisor -- because once again, 512MB THP are just a bad fit for many workloads.
> 

Yeah, indeed :)

Thanks,
Gavin
Gavin Shan June 21, 2021, 7:03 a.m. UTC | #17
On 6/17/21 12:15 AM, Alexander Duyck wrote:
> On Wed, Jun 16, 2021 at 12:10 AM Gavin Shan <gshan@redhat.com> wrote:
>>
>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>> minimal order (threshold) to trigger page reporting. The page reporting
>>>>>>> is never triggered with the following configurations and settings on
>>>>>>> aarch64. In the particular scenario, the page reporting won't be triggered
>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
>>>>>>> page freeing. The condition is very hard, or even impossible to be met.
>>>>>>>
>>>>>>>       CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>       CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>       CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>       pageblock_order:                      13
>>>>>>>       CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>       MAX_ORDER:                            14
>>>>>>>
>>>>>>> The issue can be reproduced in VM, running kernel with above configurations
>>>>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
>>>>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>
>>>>>>>       /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>       -accel kvm -machine virt,gic-version=host                        \
>>>>>>>       -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>       -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>       -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>       -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>       -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>         :                                                              \
>>>>>>>       -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>
>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller value
>>>>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
>>>>>>> drops after 'memhog' exits.
>>>>>>
>>>>>> IIRC, we use pageblock_order to
>>>>>>
>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>>>>>
>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>>
>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>>
>>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>>
>>>>>>
>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>>
>>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>>
>>>>>
>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>
>>>>
>>>> [adding his gmail address which should be the right one]
>>>>
>>>>> David, thanks for your time to review and sorry for the delay and late response.
>>>>> I spent some time to get myself familiar with the code, but there are still some
>>>>> questions to me, explained as below.
>>>>>
>>>>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>>>>> incur more overhead if the threshold is decreased as you said in (a).
>>>>
>>>> Right. Alex did quite some performance/overhead evaluation when
>>>> introducing this feature. Changing the reporting granularity on most
>>>> setups (esp., x86-64) is not desired IMHO.
>>>
>>> Yes, generally reporting pages comes at a fairly high cost so it is
>>> important to find the right trade-off between the size of the page and
>>> the size of the batch of pages being reported. If the size of the
>>> pages is reduced it maybe important to increase the batch size in
>>> order to avoid paying too much in the way of overhead.
>>>
>>> The other main reason for holding to pageblock_order on x86 is to
>>> avoid THP splitting. Anything smaller than pageblock_order will
>>> trigger THP splitting which will significantly hurt the performance of
>>> the VM in general as it forces it down to order 0 pages.
>>>
>>
>> Alex, Thanks for your reply and sorry for taking your time to this
>> discussion.
>>
>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>> used in page_reporting_cycle() when you're talking about "batch"?
> 
> Yes, when I refer to batch it is how many pages we are processing in a
> single call. That is limited by PAGE_REPORTING_CAPACITY.
> 

Thanks for your confirmation. I will investigate how to improve it.
Let's resolve the issue at hand first: page reporting doesn't work
on arm64 with a 64KB base page size.

Thanks,
Gavin
Gavin Shan June 21, 2021, 7:52 a.m. UTC | #18
On 6/17/21 12:15 AM, Alexander Duyck wrote:
> On Wed, Jun 16, 2021 at 12:10 AM Gavin Shan <gshan@redhat.com> wrote:
>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>> On 01.06.21 05:33, Gavin Shan wrote:

[...]

>>>
>>> Yes, generally reporting pages comes at a fairly high cost so it is
>>> important to find the right trade-off between the size of the page and
>>> the size of the batch of pages being reported. If the size of the
>>> pages is reduced it maybe important to increase the batch size in
>>> order to avoid paying too much in the way of overhead.
>>>
>>> The other main reason for holding to pageblock_order on x86 is to
>>> avoid THP splitting. Anything smaller than pageblock_order will
>>> trigger THP splitting which will significantly hurt the performance of
>>> the VM in general as it forces it down to order 0 pages.
>>>
>>
>> Alex, Thanks for your reply and sorry for taking your time to this
>> discussion.
>>
>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>> used in page_reporting_cycle() when you're talking about "batch"?
> 
> Yes, when I refer to batch it is how many pages we are processing in a
> single call. That is limited by PAGE_REPORTING_CAPACITY.
> 

Alex, it seems the batch mechanism is meant to avoid heavy contention
on the zone's lock, if I understand correctly? The current design reports
all pages on the corresponding free list within 17 calls to
page_reporting_cycle(). Could you please explain why 17 was chosen? :)

    budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);

It's related to the magic number ("16"). When the threshold is decreased,
for example from 512MB to 2MB on arm64 with a 64KB base page size, more
page reporting activity will be introduced. In that regard, it seems
reasonable to increase the magic number as well, spreading the work over
more calls to page_reporting_cycle() to avoid contention on the zone's
lock.

If you agree, I will come up with something similar to what we do for
the threshold. However, I'm not sure whether 64 is a reasonable number
of cycles for this particular case.

    in arch/arm64/include/asm/page.h
       #ifdef CONFIG_ARM64_64K_PAGES
       #define PAGE_REPORTING_ORDER    5
       #define PAGE_REPORTING_CYCLES   64
       #endif
    in mm/page_reporting.h
       #ifndef PAGE_REPORTING_CYCLES
       #define PAGE_REPORTING_CYCLES   16
       #endif
    in mm/page_reporting.c::page_reporting_cycle()
       budget = DIV_ROUND_UP(area->nr_free,
                             PAGE_REPORTING_CAPACITY * PAGE_REPORTING_CYCLES);

Thanks,
Gavin
Alexander Duyck June 21, 2021, 1:43 p.m. UTC | #19
On Sun, Jun 20, 2021 at 10:51 PM Gavin Shan <gshan@redhat.com> wrote:
>
> On 6/17/21 12:15 AM, Alexander Duyck wrote:
> > On Wed, Jun 16, 2021 at 12:10 AM Gavin Shan <gshan@redhat.com> wrote:
> >> On 6/15/21 12:26 PM, Alexander Duyck wrote:
> >>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
> >>>> On 11.06.21 09:44, Gavin Shan wrote:
> >>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
> >>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>
> [...]
>
> >>>
> >>> Yes, generally reporting pages comes at a fairly high cost so it is
> >>> important to find the right trade-off between the size of the page and
> >>> the size of the batch of pages being reported. If the size of the
> >>> pages is reduced it maybe important to increase the batch size in
> >>> order to avoid paying too much in the way of overhead.
> >>>
> >>> The other main reason for holding to pageblock_order on x86 is to
> >>> avoid THP splitting. Anything smaller than pageblock_order will
> >>> trigger THP splitting which will significantly hurt the performance of
> >>> the VM in general as it forces it down to order 0 pages.
> >>>
> >>
> >> Alex, Thanks for your reply and sorry for taking your time to this
> >> discussion.
> >>
> >> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
> >> used in page_reporting_cycle() when you're talking about "batch"?
> >
> > Yes, when I refer to batch it is how many pages we are processing in a
> > single call. That is limited by PAGE_REPORTING_CAPACITY.
> >
>
> Alex, It seems the batch mechanism is to avoid heavy contention on
> zone's lock if I'm correct? The current design is to report all pages
> in the corresponding free list within 17 calls to page_reporting_cycle().
> Could you please explain why 17 was chosen? :)
>
>     budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);

It isn't that 17 was chosen. The idea was to only process 1/16th of
the free list at a time. The general idea is that by doing that and
limiting the page reporting to an interval of once every 2 seconds we
should have the entire guest reported out after about 30 seconds
assuming it is idle. If it isn't idle then the overhead for reporting
only 1/16th of the guest memory should be fairly low.

> It's related to the magic number ("16"). With the threshold is decreased,
> for example from 512MB to 2MB on arm64 with 64KB base page size, more
> page reporting activities will be introduced. From this regard, it's
> reasonable to increase the magic number as well, so that more calls
> to page_reporting_cycle() to avoid the contention to zone's lock.
>
> If you agree, I will come up with something, similar to what we do for
> the threshold. However, I'm not sure if 64 is reasonable cycles to have
> for this particular case.
>
>     in arch/arm64/include/asm/page.h
>        #ifdef CONFIG_ARM64_64K_PAGES
>        #define PAGE_REPORTING_ORDER    5
>        #define PAGE_REPORTING_CYCLES   64
>        #endif

You mentioned going from 512MB to 2MB pages. What is the MAX_ORDER for
the arm64 configuration you are working with? One concern I have is that
order 5 pages may not be a high enough order to keep the page reporting
from interfering with guest memory allocations, since you are having to
cover so many free areas.

Ideally with page reporting we were aiming for MAX_ORDER and MAX_ORDER
- 1 as being the main targets for page reporting. The advantage there
is that on x86 that also allowed us to avoid splitting THP pages. The
other advantage is that when combined with the 16 and the fact that we
were rounding up the budget it should come out to about one minute to
fully flush out all the memory on an idle guest.

If anything we would want to take the (MAX_ORDER -
PAGE_REPORTING_ORDER)/2 and use that as a multiple for the 16 value as
that would give us the upper limit on how long it should take to
report all of the pages in a given block. It gets us to the same
value, but does a better job of explaining why.

>     in mm/page_reporting.h
>        #ifndef PAGE_REPORTING_CYCLES
>        #define PAGE_REPORTING_CYCLES   16
>        #endif
>     in mm/page_reporting.c::page_reporting_cycle()
>        budget = DIV_ROUND_UP(area->nr_free,
>                              PAGE_REPORTING_CAPACITY * PAGE_REPORTING_CYCLES);
>
> Thanks,
> Gavin

The 16 isn't about cycles, it is about how fast we want to leak the
memory out of the guest. You don't want this to go too fast otherwise
you are going to be fighting with anything that is trying to allocate
memory. In theory you should only be reporting pages in the top tiers
of the memory hierarchy so that it is not very active.

One way to think about page reporting is as a leaky bucket approach.
After a minute or so you want the bucket to drain assuming the memory
allocations/frees for a guest have become inactive. However, if the VM
is active you want it to do very little in the way of page reporting so
that you are not having to fault back in a ton of memory.

Patch

diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index 2c385dd4ddbd..5dae3d171004 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -10,9 +10,10 @@ 
 #include <linux/pgtable.h>
 #include <linux/scatterlist.h>
 
-#define PAGE_REPORTING_MIN_ORDER	pageblock_order
-
 #ifdef CONFIG_PAGE_REPORTING
+#define PAGE_REPORTING_MIN_ORDER	\
+	min_t(unsigned int, pageblock_order, (MAX_ORDER / 2))
+
 DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
 void __page_reporting_notify(void);