Message ID: 20250211003028.213461-1-npache@redhat.com (mailing list archive)
Series: khugepaged: mTHP support
On 11/02/25 6:00 am, Nico Pache wrote: > The following series provides khugepaged and madvise collapse with the > capability to collapse regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > using a bitmap. After the PMD scan is done, we do binary recursion on the > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > on max_ptes_none is removed during the scan, to make sure we account for > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > order to determine how full a THP must be to be eligible. If an mTHP collapse > is attempted, but contains swapped-out or shared pages, we don't perform the > collapse. > > With the default max_ptes_none=511, the code should keep most of its > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > constantly promote mTHPs to the next available size. > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > Patch 2: Refactor/rename hpage_collapse > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-9: The mTHP patches > > --------- > Testing > --------- > - Built for x86_64, aarch64, ppc64le, and s390x > - selftests mm > - I created a test script that I used to push khugepaged to its limits while > monitoring a number of stats and tracepoints. The code is available > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > The summary from my testing was that there was no significant regression > noticed through this test. In some cases my changes had better collapse > latencies, and were able to scan more pages in the same amount of time/work, > but for the most part the results were consistent. > - redis testing. I tested these changes along with my defer changes > (see the follow-up post for more details). > - some basic testing on 64k page size. > - lots of general use. These changes have been running in my VM for some time. > > Changes since V1 [2]: > - Minor bug fixes discovered during review and testing > - removed dynamic allocations for bitmaps, and made them stack-based > - Adjusted bitmap offset from u8 to u16 to support 64k page size. > - Updated trace events to include collapsing order info. > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > - No longer require a chunk to be fully utilized before setting the bit. Use > the same max_ptes_none scaling principle to achieve this. > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > some of the "creep" that was discovered in v1. 
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > Nico Pache (9): > introduce khugepaged_collapse_single_pmd to unify khugepaged and > madvise_collapse > khugepaged: rename hpage_collapse_* to khugepaged_* > khugepaged: generalize hugepage_vma_revalidate for mTHP support > khugepaged: generalize alloc_charge_folio for mTHP support > khugepaged: generalize __collapse_huge_page_* for mTHP support > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > khugepaged: add mTHP support > khugepaged: improve tracepoints for mTHP orders > khugepaged: skip collapsing mTHP to smaller orders > > include/linux/khugepaged.h | 4 + > include/trace/events/huge_memory.h | 34 ++- > mm/khugepaged.c | 422 +++++++++++++++++++---------- > 3 files changed, 306 insertions(+), 154 deletions(-) > Does this patchset suffer from the problem described here: https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
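To make the cover letter's two central mechanisms concrete, here is a small standalone C sketch of (a) scaling the max_ptes_none cutoff to the attempted collapse order and (b) binary recursion over the chunk bitmap. The chunk size, constants, and helper names are assumptions chosen for illustration, not identifiers from the series.

```c
/*
 * Standalone sketch (not the patch itself) of two ideas from the cover
 * letter: (a) scaling a max_ptes_none-style threshold down to the
 * attempted collapse order, and (b) binary recursion over a bitmap of
 * utilized chunks to choose mTHP collapse sizes within one PMD range.
 * All constants and names below are illustrative assumptions.
 */
#include <stdio.h>
#include <stdbool.h>

#define PMD_ORDER      9                    /* 512 PTEs per PMD on 4K base pages */
#define NR_PTES        (1u << PMD_ORDER)
#define CHUNK_ORDER    2                    /* example: one bitmap bit covers 4 PTEs */
#define NR_CHUNKS      (NR_PTES >> CHUNK_ORDER)
#define MAX_PTES_NONE  255                  /* example tunable; <= 255 exercises mTHP collapse */

/* Allowed number of empty PTEs for an order-N attempt, scaled from the PMD-order tunable. */
static unsigned int scaled_none(unsigned int order)
{
	return MAX_PTES_NONE >> (PMD_ORDER - order);
}

static unsigned int present_ptes(const bool *chunk, unsigned int first, unsigned int nr)
{
	unsigned int i, present = 0;

	for (i = first; i < first + nr; i++)
		if (chunk[i])
			present += 1u << CHUNK_ORDER;   /* count a set bit as a full chunk */
	return present;
}

/* Try the largest order first; if the range is not full enough, split it in half and recurse. */
static void collapse_range(const bool *chunk, unsigned int first, unsigned int order)
{
	unsigned int nr_chunks = 1u << (order - CHUNK_ORDER);
	unsigned int nr_ptes = 1u << order;
	unsigned int present = present_ptes(chunk, first, nr_chunks);

	if (nr_ptes - present <= scaled_none(order)) {
		printf("collapse order %u at chunk %u (%u/%u PTEs present)\n",
		       order, first, present, nr_ptes);
		return;
	}
	if (order <= CHUNK_ORDER + 1)           /* nothing smaller is worth attempting */
		return;
	collapse_range(chunk, first, order - 1);
	collapse_range(chunk, first + nr_chunks / 2, order - 1);
}

int main(void)
{
	bool chunk[NR_CHUNKS] = { false };
	unsigned int i;

	for (i = 0; i < NR_CHUNKS / 2; i++)     /* only the first half of the PMD range is populated */
		chunk[i] = true;

	collapse_range(chunk, 0, PMD_ORDER);
	return 0;
}
```

With the example population (the first half of the PMD range present) and a cutoff of 255, the recursion settles on a single order-8 collapse; with the default 511 the whole PMD range would qualify, which is the threshold behaviour the cover letter describes.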
On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: > > > > On 11/02/25 6:00 am, Nico Pache wrote: > > The following series provides khugepaged and madvise collapse with the > > capability to collapse regions to mTHPs. > > > > To achieve this we generalize the khugepaged functions to no longer depend > > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > > using a bitmap. After the PMD scan is done, we do binary recursion on the > > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > > on max_ptes_none is removed during the scan, to make sure we account for > > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > > order to determine how full a THP must be to be eligible. If a mTHP collapse > > is attempted, but contains swapped out, or shared pages, we dont perform the > > collapse. > > > > With the default max_ptes_none=511, the code should keep its most of its > > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > > constantly promote mTHPs to the next available size. > > > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > > Patch 2: Refactor/rename hpage_collapse > > Patch 3-5: Generalize khugepaged functions for arbitrary orders > > Patch 6-9: The mTHP patches > > > > --------- > > Testing > > --------- > > - Built for x86_64, aarch64, ppc64le, and s390x > > - selftests mm > > - I created a test script that I used to push khugepaged to its limits while > > monitoring a number of stats and tracepoints. The code is available > > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > > The summary from my testings was that there was no significant regression > > noticed through this test. In some cases my changes had better collapse > > latencies, and was able to scan more pages in the same amount of time/work, > > but for the most part the results were consistant. > > - redis testing. I tested these changes along with my defer changes > > (see followup post for more details). > > - some basic testing on 64k page size. > > - lots of general use. These changes have been running in my VM for some time. > > > > Changes since V1 [2]: > > - Minor bug fixes discovered during review and testing > > - removed dynamic allocations for bitmaps, and made them stack based > > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > > - Updated trace events to include collapsing order info. > > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > > - No longer require a chunk to be fully utilized before setting the bit. Use > > the same max_ptes_none scaling principle to achieve this. > > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > > some of the "creep" that was discovered in v1. 
> > > > [1] - https://gitlab.com/npache/khugepaged_mthp_test > > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > > > Nico Pache (9): > > introduce khugepaged_collapse_single_pmd to unify khugepaged and > > madvise_collapse > > khugepaged: rename hpage_collapse_* to khugepaged_* > > khugepaged: generalize hugepage_vma_revalidate for mTHP support > > khugepaged: generalize alloc_charge_folio for mTHP support > > khugepaged: generalize __collapse_huge_page_* for mTHP support > > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > > khugepaged: add mTHP support > > khugepaged: improve tracepoints for mTHP orders > > khugepaged: skip collapsing mTHP to smaller orders > > > > include/linux/khugepaged.h | 4 + > > include/trace/events/huge_memory.h | 34 ++- > > mm/khugepaged.c | 422 +++++++++++++++++++---------- > > 3 files changed, 306 insertions(+), 154 deletions(-) > > > > Does this patchset suffer from the problem described here: > https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ Hi Dev, Sorry, I meant to get back to you about that. I understand your concern, but like I've mentioned before, the scan with the read lock was done so we don't have to do the more expensive locking, and could still gain insight into the state. You are right that this info could become stale if the state changes dramatically, but the collapse_isolate function will verify it and not collapse. From my testing I found this to rarely happen. Also, khugepaged, my changes, and your changes are all victims of this. Once we drop the read lock (to either allocate the folio, or right before acquiring the write_lock), the state can change. In your case, yes, you are gathering more up-to-date information, but is it really that important/worth it to retake locks and rescan for each instance if we are about to reverify with the write lock taken? So in my eyes, this is not a "problem". Cheers, -- Nico >
On 12/02/25 10:19 pm, Nico Pache wrote: > On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: >> >> >> >> On 11/02/25 6:00 am, Nico Pache wrote: >>> The following series provides khugepaged and madvise collapse with the >>> capability to collapse regions to mTHPs. >>> >>> To achieve this we generalize the khugepaged functions to no longer depend >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>> on max_ptes_none is removed during the scan, to make sure we account for >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse >>> order to determine how full a THP must be to be eligible. If a mTHP collapse >>> is attempted, but contains swapped out, or shared pages, we dont perform the >>> collapse. >>> >>> With the default max_ptes_none=511, the code should keep its most of its >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and >>> constantly promote mTHPs to the next available size. >>> >>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged >>> Patch 2: Refactor/rename hpage_collapse >>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>> Patch 6-9: The mTHP patches >>> >>> --------- >>> Testing >>> --------- >>> - Built for x86_64, aarch64, ppc64le, and s390x >>> - selftests mm >>> - I created a test script that I used to push khugepaged to its limits while >>> monitoring a number of stats and tracepoints. The code is available >>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) >>> The summary from my testings was that there was no significant regression >>> noticed through this test. In some cases my changes had better collapse >>> latencies, and was able to scan more pages in the same amount of time/work, >>> but for the most part the results were consistant. >>> - redis testing. I tested these changes along with my defer changes >>> (see followup post for more details). >>> - some basic testing on 64k page size. >>> - lots of general use. These changes have been running in my VM for some time. >>> >>> Changes since V1 [2]: >>> - Minor bug fixes discovered during review and testing >>> - removed dynamic allocations for bitmaps, and made them stack based >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>> - Updated trace events to include collapsing order info. >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. >>> - No longer require a chunk to be fully utilized before setting the bit. Use >>> the same max_ptes_none scaling principle to achieve this. >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent >>> some of the "creep" that was discovered in v1. 
>>> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ >>> >>> Nico Pache (9): >>> introduce khugepaged_collapse_single_pmd to unify khugepaged and >>> madvise_collapse >>> khugepaged: rename hpage_collapse_* to khugepaged_* >>> khugepaged: generalize hugepage_vma_revalidate for mTHP support >>> khugepaged: generalize alloc_charge_folio for mTHP support >>> khugepaged: generalize __collapse_huge_page_* for mTHP support >>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>> khugepaged: add mTHP support >>> khugepaged: improve tracepoints for mTHP orders >>> khugepaged: skip collapsing mTHP to smaller orders >>> >>> include/linux/khugepaged.h | 4 + >>> include/trace/events/huge_memory.h | 34 ++- >>> mm/khugepaged.c | 422 +++++++++++++++++++---------- >>> 3 files changed, 306 insertions(+), 154 deletions(-) >>> >> >> Does this patchset suffer from the problem described here: >> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ > Hi Dev, > > Sorry I meant to get back to you about that. > > I understand your concern, but like I've mentioned before, the scan > with the read lock was done so we dont have to do the more expensive > locking, and could still gain insight into the state. You are right > that this info could become stale if the state changes dramatically, > but the collapse_isolate function will verify it and not collapse. If the state changes dramatically, the _isolate function will verify it, and fallback. And this fallback happens after following this costly path: retrieve a large folio from the buddy allocator -> swapin pages from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush on all CPUs -> fallback in _isolate(). If you do fail in _isolate(), doesn't it make sense to get the updated state for the next fallback order immediately, because we have prior information that we failed because of PTE state? What your algorithm will do is *still* follow the costly path described above, and again fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like mine would. The verification of the PTE state by the _isolate() function is the "no turning back" point of the algorithm. The verification by hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth it, before we do costly operations" point of the algorithm. > From my testing I found this to rarely happen. Unfortunately, I am not very familiar with performance testing/load testing, I am fairly new to kernel programming, so I am getting there. But it really depends on the type of test you are running, what actually runs on memory-intensive systems, etc etc. In fact, on loaded systems I would expect the PTE state to dramatically change. But still, no opinion here. > > Also, khugepaged, my changes, and your changes are all a victim of > this. Once we drop the read lock (to either allocate the folio, or > right before acquiring the write_lock), the state can change. In your > case, yes, you are gathering more up to date information, but is it > really that important/worth it to retake locks and rescan for each > instance if we are about to reverify with the write lock taken? You said "reverify": You are removing the verification, so this step won't be reverification, it will be verification. We do not want to verify *after* we have already done 95% of latency-heavy stuff, only to know that we are going to fail. 
Algorithms in the kernel, in general, are of the following form: 1) Verify if a condition is true, resulting in taking a control path -> 2) do a lot of stuff -> "no turning back" step, wherein before committing (by taking locks, say), reverify if this is the control path we should be in. You are eliminating step 1). Therefore, I will have to say that I disagree with your approach. On top of this, in the subjective analysis in [1], point number 7 (along with point number 1) remains. And, point number 4 remains. [1] https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ > > So in my eyes, this is not a "problem" Looks like the kernel scheduled us for a high-priority debate, I hope there's no deadlock :) > > Cheers, > -- Nico > > >> >
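Dev's three-step description can be reduced to a control-flow skeleton. This is only an illustration with stub helpers (none of these names exist in the kernel); the point is where a failure is detected relative to the expensive middle step.

```c
/*
 * Illustration of the control flow under debate; the helpers are stubs
 * invented for this sketch, not kernel functions.  Step (1) is the cheap
 * early scan, step (2) the expensive work, step (3) the recheck in the
 * isolate path before committing.  Dropping step (1) means step (2) always
 * runs, even when stale PTE state would have failed the cheap check.
 */
#include <stdbool.h>
#include <stdio.h>

static bool cheap_scan_ok(void)   { return false; }  /* (1) scan under the read lock    */
static void expensive_prep(void)  { puts("alloc folio, swap in, write lock, flush TLB"); } /* (2) */
static bool reverify_locked(void) { return false; }  /* (3) isolate-time recheck        */

static int try_collapse(void)
{
	if (!cheap_scan_ok())
		return -1;                       /* fail early, cheaply              */
	expensive_prep();
	if (!reverify_locked())
		return -1;                       /* fail late, after the heavy work  */
	return 0;                                /* commit */
}

int main(void)
{
	return try_collapse() == 0 ? 0 : 1;
}
```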
On 13/02/25 1:56 pm, Dev Jain wrote: > > > On 12/02/25 10:19 pm, Nico Pache wrote: >> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: >>> >>> >>> >>> On 11/02/25 6:00 am, Nico Pache wrote: >>>> The following series provides khugepaged and madvise collapse with the >>>> capability to collapse regions to mTHPs. >>>> >>>> To achieve this we generalize the khugepaged functions to no longer >>>> depend >>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of >>>> pages >>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>>> using a bitmap. After the PMD scan is done, we do binary recursion >>>> on the >>>> bitmap to find the optimal mTHP sizes for the PMD range. The >>>> restriction >>>> on max_ptes_none is removed during the scan, to make sure we account >>>> for >>>> the whole PMD range. max_ptes_none will be scaled by the attempted >>>> collapse >>>> order to determine how full a THP must be to be eligible. If a mTHP >>>> collapse >>>> is attempted, but contains swapped out, or shared pages, we dont >>>> perform the >>>> collapse. >>>> >>>> With the default max_ptes_none=511, the code should keep its most of >>>> its >>>> original behavior. To exercise mTHP collapse we need to set >>>> max_ptes_none<=255. >>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse >>>> "creep" and >>>> constantly promote mTHPs to the next available size. >>>> >>>> Patch 1: Some refactoring to combine madvise_collapse and >>>> khugepaged >>>> Patch 2: Refactor/rename hpage_collapse >>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>>> Patch 6-9: The mTHP patches >>>> >>>> --------- >>>> Testing >>>> --------- >>>> - Built for x86_64, aarch64, ppc64le, and s390x >>>> - selftests mm >>>> - I created a test script that I used to push khugepaged to its >>>> limits while >>>> monitoring a number of stats and tracepoints. The code is >>>> available >>>> here[1] (Run in legacy mode for these changes and set mthp >>>> sizes to inherit) >>>> The summary from my testings was that there was no significant >>>> regression >>>> noticed through this test. In some cases my changes had better >>>> collapse >>>> latencies, and was able to scan more pages in the same amount >>>> of time/work, >>>> but for the most part the results were consistant. >>>> - redis testing. I tested these changes along with my defer changes >>>> (see followup post for more details). >>>> - some basic testing on 64k page size. >>>> - lots of general use. These changes have been running in my VM for >>>> some time. >>>> >>>> Changes since V1 [2]: >>>> - Minor bug fixes discovered during review and testing >>>> - removed dynamic allocations for bitmaps, and made them stack based >>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>>> - Updated trace events to include collapsing order info. >>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. >>>> - No longer require a chunk to be fully utilized before setting the >>>> bit. Use >>>> the same max_ptes_none scaling principle to achieve this. >>>> - Skip mTHP collapse that requires swapin or shared handling. This >>>> helps prevent >>>> some of the "creep" that was discovered in v1. 
>>>> >>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test >>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- >>>> npache@redhat.com/ >>>> >>>> Nico Pache (9): >>>> introduce khugepaged_collapse_single_pmd to unify khugepaged and >>>> madvise_collapse >>>> khugepaged: rename hpage_collapse_* to khugepaged_* >>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support >>>> khugepaged: generalize alloc_charge_folio for mTHP support >>>> khugepaged: generalize __collapse_huge_page_* for mTHP support >>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>>> khugepaged: add mTHP support >>>> khugepaged: improve tracepoints for mTHP orders >>>> khugepaged: skip collapsing mTHP to smaller orders >>>> >>>> include/linux/khugepaged.h | 4 + >>>> include/trace/events/huge_memory.h | 34 ++- >>>> mm/khugepaged.c | 422 ++++++++++++++++++ >>>> +---------- >>>> 3 files changed, 306 insertions(+), 154 deletions(-) >>>> >>> >>> Does this patchset suffer from the problem described here: >>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- >>> c2d48d4963b6@arm.com/ >> Hi Dev, >> >> Sorry I meant to get back to you about that. >> >> I understand your concern, but like I've mentioned before, the scan >> with the read lock was done so we dont have to do the more expensive >> locking, and could still gain insight into the state. You are right >> that this info could become stale if the state changes dramatically, >> but the collapse_isolate function will verify it and not collapse. > > If the state changes dramatically, the _isolate function will verify it, > and fallback. And this fallback happens after following this costly > path: retrieve a large folio from the buddy allocator -> swapin pages > from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush > on all CPUs -> fallback in _isolate(). > If you do fail in _isolate(), doesn't it make sense to get the updated > state for the next fallback order immediately, because we have prior > information that we failed because of PTE state? What your algorithm > will do is *still* follow the costly path described above, and again > fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like > mine would. > > The verification of the PTE state by the _isolate() function is the "no > turning back" point of the algorithm. The verification by > hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth > it, before we do costly operations" point of the algorithm. > >> From my testing I found this to rarely happen. > > Unfortunately, I am not very familiar with performance testing/load > testing, I am fairly new to kernel programming, so I am getting there. > But it really depends on the type of test you are running, what actually > runs on memory-intensive systems, etc etc. In fact, on loaded systems I > would expect the PTE state to dramatically change. But still, no opinion > here. > >> >> Also, khugepaged, my changes, and your changes are all a victim of >> this. Once we drop the read lock (to either allocate the folio, or >> right before acquiring the write_lock), the state can change. In your >> case, yes, you are gathering more up to date information, but is it >> really that important/worth it to retake locks and rescan for each >> instance if we are about to reverify with the write lock taken? > > You said "reverify": You are removing the verification, so this step > won't be reverification, it will be verification. 
We do not want to > verify *after* we have already done 95% of latency-heavy stuff, only to > know that we are going to fail. > > Algorithms in the kernel, in general, are of the following form: 1) > Verify if a condition is true, resulting in taking a control path -> 2) > do a lot of stuff -> "no turning back" step, wherein before committing > (by taking locks, say), reverify if this is the control path we should > be in. You are eliminating step 1). > > Therefore, I will have to say that I disagree with your approach. > > On top of this, in the subjective analysis in [1], point number 7 (along > with point number 1) remains. And, point number 4 remains. > > [1] https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b- > aba4b1a441b4@arm.com/ > >> >> So in my eyes, this is not a "problem" > > Looks like the kernel scheduled us for a high-priority debate, I hope > there's no deadlock :) > >> >> Cheers, >> -- Nico >> >> In any case, As Andrew notes, the first step is to justify that this functionality is beneficial to get in. The second step would be to compare the implementations. It would be great if someone from the community can jump in to test these things out : ) but in the meantime I will begin working on the first step. >>> >> > >
On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: > > > > On 12/02/25 10:19 pm, Nico Pache wrote: > > On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: > >> > >> > >> > >> On 11/02/25 6:00 am, Nico Pache wrote: > >>> The following series provides khugepaged and madvise collapse with the > >>> capability to collapse regions to mTHPs. > >>> > >>> To achieve this we generalize the khugepaged functions to no longer depend > >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > >>> using a bitmap. After the PMD scan is done, we do binary recursion on the > >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction > >>> on max_ptes_none is removed during the scan, to make sure we account for > >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse > >>> order to determine how full a THP must be to be eligible. If a mTHP collapse > >>> is attempted, but contains swapped out, or shared pages, we dont perform the > >>> collapse. > >>> > >>> With the default max_ptes_none=511, the code should keep its most of its > >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > >>> constantly promote mTHPs to the next available size. > >>> > >>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged > >>> Patch 2: Refactor/rename hpage_collapse > >>> Patch 3-5: Generalize khugepaged functions for arbitrary orders > >>> Patch 6-9: The mTHP patches > >>> > >>> --------- > >>> Testing > >>> --------- > >>> - Built for x86_64, aarch64, ppc64le, and s390x > >>> - selftests mm > >>> - I created a test script that I used to push khugepaged to its limits while > >>> monitoring a number of stats and tracepoints. The code is available > >>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > >>> The summary from my testings was that there was no significant regression > >>> noticed through this test. In some cases my changes had better collapse > >>> latencies, and was able to scan more pages in the same amount of time/work, > >>> but for the most part the results were consistant. > >>> - redis testing. I tested these changes along with my defer changes > >>> (see followup post for more details). > >>> - some basic testing on 64k page size. > >>> - lots of general use. These changes have been running in my VM for some time. > >>> > >>> Changes since V1 [2]: > >>> - Minor bug fixes discovered during review and testing > >>> - removed dynamic allocations for bitmaps, and made them stack based > >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > >>> - Updated trace events to include collapsing order info. > >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > >>> - No longer require a chunk to be fully utilized before setting the bit. Use > >>> the same max_ptes_none scaling principle to achieve this. > >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > >>> some of the "creep" that was discovered in v1. 
> >>> > >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test > >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > >>> > >>> Nico Pache (9): > >>> introduce khugepaged_collapse_single_pmd to unify khugepaged and > >>> madvise_collapse > >>> khugepaged: rename hpage_collapse_* to khugepaged_* > >>> khugepaged: generalize hugepage_vma_revalidate for mTHP support > >>> khugepaged: generalize alloc_charge_folio for mTHP support > >>> khugepaged: generalize __collapse_huge_page_* for mTHP support > >>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support > >>> khugepaged: add mTHP support > >>> khugepaged: improve tracepoints for mTHP orders > >>> khugepaged: skip collapsing mTHP to smaller orders > >>> > >>> include/linux/khugepaged.h | 4 + > >>> include/trace/events/huge_memory.h | 34 ++- > >>> mm/khugepaged.c | 422 +++++++++++++++++++---------- > >>> 3 files changed, 306 insertions(+), 154 deletions(-) > >>> > >> > >> Does this patchset suffer from the problem described here: > >> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ > > Hi Dev, > > > > Sorry I meant to get back to you about that. > > > > I understand your concern, but like I've mentioned before, the scan > > with the read lock was done so we dont have to do the more expensive > > locking, and could still gain insight into the state. You are right > > that this info could become stale if the state changes dramatically, > > but the collapse_isolate function will verify it and not collapse. > > If the state changes dramatically, the _isolate function will verify it, > and fallback. And this fallback happens after following this costly > path: retrieve a large folio from the buddy allocator -> swapin pages > from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush > on all CPUs -> fallback in _isolate(). > If you do fail in _isolate(), doesn't it make sense to get the updated > state for the next fallback order immediately, because we have prior > information that we failed because of PTE state? What your algorithm > will do is *still* follow the costly path described above, and again > fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like > mine would. You do raise a valid point here, I can optimize my solution by detecting certain collapse failure types and jump to the next scan. I'll add that to my solution, thanks! As for the disagreement around the bitmap, we'll leave that up to the community to decide since we have differing opinions/solutions. > > The verification of the PTE state by the _isolate() function is the "no > turning back" point of the algorithm. The verification by > hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth > it, before we do costly operations" point of the algorithm. > > > From my testing I found this to rarely happen. > > Unfortunately, I am not very familiar with performance testing/load > testing, I am fairly new to kernel programming, so I am getting there. > But it really depends on the type of test you are running, what actually > runs on memory-intensive systems, etc etc. In fact, on loaded systems I > would expect the PTE state to dramatically change. But still, no opinion > here. Yeah there are probably some cases where it happens more often. Probably in cases of short lived allocations, but khugepaged doesn't run that frequently so those won't be that big of an issue. 
Our performance team is currently testing my implementation so I should have more real workload test results soon. The redis testing had some gains and didn't show any signs of obvious regressions. As for the testing, check out https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads this does the tracing for my testing script. It can help you get started. There are 3 different traces being applied there: the bpftrace for collapse latencies, the perf record for the flamegraph (not actually that useful, but may be useful to visualize any weird/long paths that you may not have noticed), and the trace-cmd which records the tracepoint of the scan and the collapse functions then processes the data using the awk script-- the output being the scan rate, the pages collapsed, and their result status (grouped by order). You can also look into https://github.com/gormanm/mmtests for testing/comparing kernels. I was running the config-memdb-redis-benchmark-medium workload. > > > > > Also, khugepaged, my changes, and your changes are all a victim of > > this. Once we drop the read lock (to either allocate the folio, or > > right before acquiring the write_lock), the state can change. In your > > case, yes, you are gathering more up to date information, but is it > > really that important/worth it to retake locks and rescan for each > > instance if we are about to reverify with the write lock taken? > > You said "reverify": You are removing the verification, so this step > won't be reverification, it will be verification. We do not want to > verify *after* we have already done 95% of latency-heavy stuff, only to > know that we are going to fail. > > Algorithms in the kernel, in general, are of the following form: 1) > Verify if a condition is true, resulting in taking a control path -> 2) > do a lot of stuff -> "no turning back" step, wherein before committing > (by taking locks, say), reverify if this is the control path we should > be in. You are eliminating step 1). > > Therefore, I will have to say that I disagree with your approach. > > On top of this, in the subjective analysis in [1], point number 7 (along > with point number 1) remains. And, point number 4 remains. for 1) your worst case of 1024 is not the worst case. There are 8 possible orders in your implementation, if all are enabled, that is 4096 iterations in the worst case. This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case. > > [1] > https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ > > > > > So in my eyes, this is not a "problem" > > Looks like the kernel scheduled us for a high-priority debate, I hope > there's no deadlock :) > > > > > Cheers, > > -- Nico > > > > > >> > > >
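A hypothetical sketch of the tweak Nico agrees to above: fall back to a smaller order only after refreshing the scan when the previous attempt failed for a PTE-state reason. The enum values and helpers are invented for this example; the series defines its own SCAN_* result codes.

```c
/*
 * Hypothetical shape of the optimisation discussed above: if an order-N
 * attempt fails in the isolate step because the PTE state no longer
 * qualifies, refresh the scan before trying order N-1 instead of reusing
 * the stale bitmap.  All names are invented for this sketch.
 */
#include <stdio.h>
#include <stdbool.h>

enum scan_result {
	SCAN_SUCCEED,
	SCAN_EXCEED_NONE,     /* too many empty PTEs for this order */
	SCAN_EXCEED_SWAP,     /* range contains swapped-out PTEs    */
	SCAN_EXCEED_SHARED,   /* range contains shared pages        */
	SCAN_OTHER,
};

static bool pte_state_failure(enum scan_result r)
{
	return r == SCAN_EXCEED_NONE || r == SCAN_EXCEED_SWAP || r == SCAN_EXCEED_SHARED;
}

/* Stubs standing in for the real scan/collapse steps. */
static enum scan_result try_collapse_order(unsigned int order) { (void)order; return SCAN_EXCEED_NONE; }
static void rescan_range(void) { puts("rescan range, rebuild bitmap"); }

static void collapse_with_fallback(unsigned int max_order, unsigned int min_order)
{
	for (unsigned int order = max_order; order >= min_order; order--) {
		enum scan_result res = try_collapse_order(order);

		if (res == SCAN_SUCCEED)
			return;
		if (pte_state_failure(res))
			rescan_range();   /* refresh state before trying a smaller order */
	}
}

int main(void)
{
	collapse_with_fallback(9, 2);
	return 0;
}
```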
On 14/02/25 1:09 am, Nico Pache wrote: > On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: >> >> >> >> On 12/02/25 10:19 pm, Nico Pache wrote: >>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: >>>> >>>> >>>> >>>> On 11/02/25 6:00 am, Nico Pache wrote: >>>>> The following series provides khugepaged and madvise collapse with the >>>>> capability to collapse regions to mTHPs. >>>>> >>>>> To achieve this we generalize the khugepaged functions to no longer depend >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>>>> on max_ptes_none is removed during the scan, to make sure we account for >>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse >>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse >>>>> is attempted, but contains swapped out, or shared pages, we dont perform the >>>>> collapse. >>>>> >>>>> With the default max_ptes_none=511, the code should keep its most of its >>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. >>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and >>>>> constantly promote mTHPs to the next available size. >>>>> >>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged >>>>> Patch 2: Refactor/rename hpage_collapse >>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>>>> Patch 6-9: The mTHP patches >>>>> >>>>> --------- >>>>> Testing >>>>> --------- >>>>> - Built for x86_64, aarch64, ppc64le, and s390x >>>>> - selftests mm >>>>> - I created a test script that I used to push khugepaged to its limits while >>>>> monitoring a number of stats and tracepoints. The code is available >>>>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) >>>>> The summary from my testings was that there was no significant regression >>>>> noticed through this test. In some cases my changes had better collapse >>>>> latencies, and was able to scan more pages in the same amount of time/work, >>>>> but for the most part the results were consistant. >>>>> - redis testing. I tested these changes along with my defer changes >>>>> (see followup post for more details). >>>>> - some basic testing on 64k page size. >>>>> - lots of general use. These changes have been running in my VM for some time. >>>>> >>>>> Changes since V1 [2]: >>>>> - Minor bug fixes discovered during review and testing >>>>> - removed dynamic allocations for bitmaps, and made them stack based >>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>>>> - Updated trace events to include collapsing order info. >>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. >>>>> - No longer require a chunk to be fully utilized before setting the bit. Use >>>>> the same max_ptes_none scaling principle to achieve this. >>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent >>>>> some of the "creep" that was discovered in v1. 
>>>>> >>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test >>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ >>>>> >>>>> Nico Pache (9): >>>>> introduce khugepaged_collapse_single_pmd to unify khugepaged and >>>>> madvise_collapse >>>>> khugepaged: rename hpage_collapse_* to khugepaged_* >>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support >>>>> khugepaged: generalize alloc_charge_folio for mTHP support >>>>> khugepaged: generalize __collapse_huge_page_* for mTHP support >>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>>>> khugepaged: add mTHP support >>>>> khugepaged: improve tracepoints for mTHP orders >>>>> khugepaged: skip collapsing mTHP to smaller orders >>>>> >>>>> include/linux/khugepaged.h | 4 + >>>>> include/trace/events/huge_memory.h | 34 ++- >>>>> mm/khugepaged.c | 422 +++++++++++++++++++---------- >>>>> 3 files changed, 306 insertions(+), 154 deletions(-) >>>>> >>>> >>>> Does this patchset suffer from the problem described here: >>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ >>> Hi Dev, >>> >>> Sorry I meant to get back to you about that. >>> >>> I understand your concern, but like I've mentioned before, the scan >>> with the read lock was done so we dont have to do the more expensive >>> locking, and could still gain insight into the state. You are right >>> that this info could become stale if the state changes dramatically, >>> but the collapse_isolate function will verify it and not collapse. >> >> If the state changes dramatically, the _isolate function will verify it, >> and fallback. And this fallback happens after following this costly >> path: retrieve a large folio from the buddy allocator -> swapin pages >> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush >> on all CPUs -> fallback in _isolate(). >> If you do fail in _isolate(), doesn't it make sense to get the updated >> state for the next fallback order immediately, because we have prior >> information that we failed because of PTE state? What your algorithm >> will do is *still* follow the costly path described above, and again >> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like >> mine would. > > You do raise a valid point here, I can optimize my solution by > detecting certain collapse failure types and jump to the next scan. > I'll add that to my solution, thanks! > > As for the disagreement around the bitmap, we'll leave that up to the > community to decide since we have differing opinions/solutions. > >> >> The verification of the PTE state by the _isolate() function is the "no >> turning back" point of the algorithm. The verification by >> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth >> it, before we do costly operations" point of the algorithm. >> >>> From my testing I found this to rarely happen. >> >> Unfortunately, I am not very familiar with performance testing/load >> testing, I am fairly new to kernel programming, so I am getting there. >> But it really depends on the type of test you are running, what actually >> runs on memory-intensive systems, etc etc. In fact, on loaded systems I >> would expect the PTE state to dramatically change. But still, no opinion >> here. > > Yeah there are probably some cases where it happens more often. > Probably in cases of short lived allocations, but khugepaged doesn't > run that frequently so those won't be that big of an issue. 
> > Our performance team is currently testing my implementation so I > should have more real workload test results soon. The redis testing > had some gains and didn't show any signs of obvious regressions. > > As for the testing, check out > https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads > this does the tracing for my testing script. It can help you get > started. There are 3 different traces being applied there: the > bpftrace for collapse latencies, the perf record for the flamegraph > (not actually that useful, but may be useful to visualize any > weird/long paths that you may not have noticed), and the trace-cmd > which records the tracepoint of the scan and the collapse functions > then processes the data using the awk script-- the output being the > scan rate, the pages collapsed, and their result status (grouped by > order). > > You can also look into https://github.com/gormanm/mmtests for > testing/comparing kernels. I was running the > config-memdb-redis-benchmark-medium workload. Thanks. I'll take a look. > >> >>> >>> Also, khugepaged, my changes, and your changes are all a victim of >>> this. Once we drop the read lock (to either allocate the folio, or >>> right before acquiring the write_lock), the state can change. In your >>> case, yes, you are gathering more up to date information, but is it >>> really that important/worth it to retake locks and rescan for each >>> instance if we are about to reverify with the write lock taken? >> >> You said "reverify": You are removing the verification, so this step >> won't be reverification, it will be verification. We do not want to >> verify *after* we have already done 95% of latency-heavy stuff, only to >> know that we are going to fail. >> >> Algorithms in the kernel, in general, are of the following form: 1) >> Verify if a condition is true, resulting in taking a control path -> 2) >> do a lot of stuff -> "no turning back" step, wherein before committing >> (by taking locks, say), reverify if this is the control path we should >> be in. You are eliminating step 1). >> >> Therefore, I will have to say that I disagree with your approach. >> >> On top of this, in the subjective analysis in [1], point number 7 (along >> with point number 1) remains. And, point number 4 remains. > > for 1) your worst case of 1024 is not the worst case. There are 8 > possible orders in your implementation, if all are enabled, that is > 4096 iterations in the worst case. Yes, that is exactly what I wrote in 1). I am still not convinced that the overhead you produce + 512 iterations is going to beat 4096 iterations. Anyways, that is hand-waving and we should test this. > This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case. Sorry, I am missing something here; how does the number of iterations change with page size? Am I not scanning the PTE table, which is invariant to the page size? >> >> [1] >> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ >> >>> >>> So in my eyes, this is not a "problem" >> >> Looks like the kernel scheduled us for a high-priority debate, I hope >> there's no deadlock :) >> >>> >>> Cheers, >>> -- Nico >>> >>> >>>> >>> >> >
On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote: > > > > On 14/02/25 1:09 am, Nico Pache wrote: > > On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: > >> > >> > >> > >> On 12/02/25 10:19 pm, Nico Pache wrote: > >>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: > >>>> > >>>> > >>>> > >>>> On 11/02/25 6:00 am, Nico Pache wrote: > >>>>> The following series provides khugepaged and madvise collapse with the > >>>>> capability to collapse regions to mTHPs. > >>>>> > >>>>> To achieve this we generalize the khugepaged functions to no longer depend > >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > >>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the > >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction > >>>>> on max_ptes_none is removed during the scan, to make sure we account for > >>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse > >>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse > >>>>> is attempted, but contains swapped out, or shared pages, we dont perform the > >>>>> collapse. > >>>>> > >>>>> With the default max_ptes_none=511, the code should keep its most of its > >>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > >>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > >>>>> constantly promote mTHPs to the next available size. > >>>>> > >>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged > >>>>> Patch 2: Refactor/rename hpage_collapse > >>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders > >>>>> Patch 6-9: The mTHP patches > >>>>> > >>>>> --------- > >>>>> Testing > >>>>> --------- > >>>>> - Built for x86_64, aarch64, ppc64le, and s390x > >>>>> - selftests mm > >>>>> - I created a test script that I used to push khugepaged to its limits while > >>>>> monitoring a number of stats and tracepoints. The code is available > >>>>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > >>>>> The summary from my testings was that there was no significant regression > >>>>> noticed through this test. In some cases my changes had better collapse > >>>>> latencies, and was able to scan more pages in the same amount of time/work, > >>>>> but for the most part the results were consistant. > >>>>> - redis testing. I tested these changes along with my defer changes > >>>>> (see followup post for more details). > >>>>> - some basic testing on 64k page size. > >>>>> - lots of general use. These changes have been running in my VM for some time. > >>>>> > >>>>> Changes since V1 [2]: > >>>>> - Minor bug fixes discovered during review and testing > >>>>> - removed dynamic allocations for bitmaps, and made them stack based > >>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > >>>>> - Updated trace events to include collapsing order info. > >>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > >>>>> - No longer require a chunk to be fully utilized before setting the bit. Use > >>>>> the same max_ptes_none scaling principle to achieve this. > >>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > >>>>> some of the "creep" that was discovered in v1. 
> >>>>> > >>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test > >>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > >>>>> > >>>>> Nico Pache (9): > >>>>> introduce khugepaged_collapse_single_pmd to unify khugepaged and > >>>>> madvise_collapse > >>>>> khugepaged: rename hpage_collapse_* to khugepaged_* > >>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support > >>>>> khugepaged: generalize alloc_charge_folio for mTHP support > >>>>> khugepaged: generalize __collapse_huge_page_* for mTHP support > >>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support > >>>>> khugepaged: add mTHP support > >>>>> khugepaged: improve tracepoints for mTHP orders > >>>>> khugepaged: skip collapsing mTHP to smaller orders > >>>>> > >>>>> include/linux/khugepaged.h | 4 + > >>>>> include/trace/events/huge_memory.h | 34 ++- > >>>>> mm/khugepaged.c | 422 +++++++++++++++++++---------- > >>>>> 3 files changed, 306 insertions(+), 154 deletions(-) > >>>>> > >>>> > >>>> Does this patchset suffer from the problem described here: > >>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ > >>> Hi Dev, > >>> > >>> Sorry I meant to get back to you about that. > >>> > >>> I understand your concern, but like I've mentioned before, the scan > >>> with the read lock was done so we dont have to do the more expensive > >>> locking, and could still gain insight into the state. You are right > >>> that this info could become stale if the state changes dramatically, > >>> but the collapse_isolate function will verify it and not collapse. > >> > >> If the state changes dramatically, the _isolate function will verify it, > >> and fallback. And this fallback happens after following this costly > >> path: retrieve a large folio from the buddy allocator -> swapin pages > >> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush > >> on all CPUs -> fallback in _isolate(). > >> If you do fail in _isolate(), doesn't it make sense to get the updated > >> state for the next fallback order immediately, because we have prior > >> information that we failed because of PTE state? What your algorithm > >> will do is *still* follow the costly path described above, and again > >> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like > >> mine would. > > > > You do raise a valid point here, I can optimize my solution by > > detecting certain collapse failure types and jump to the next scan. > > I'll add that to my solution, thanks! > > > > As for the disagreement around the bitmap, we'll leave that up to the > > community to decide since we have differing opinions/solutions. > > > >> > >> The verification of the PTE state by the _isolate() function is the "no > >> turning back" point of the algorithm. The verification by > >> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth > >> it, before we do costly operations" point of the algorithm. > >> > >>> From my testing I found this to rarely happen. > >> > >> Unfortunately, I am not very familiar with performance testing/load > >> testing, I am fairly new to kernel programming, so I am getting there. > >> But it really depends on the type of test you are running, what actually > >> runs on memory-intensive systems, etc etc. In fact, on loaded systems I > >> would expect the PTE state to dramatically change. But still, no opinion > >> here. > > > > Yeah there are probably some cases where it happens more often. 
> > Probably in cases of short lived allocations, but khugepaged doesn't > > run that frequently so those won't be that big of an issue. > > > > Our performance team is currently testing my implementation so I > > should have more real workload test results soon. The redis testing > > had some gains and didn't show any signs of obvious regressions. > > > > As for the testing, check out > > https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads > > this does the tracing for my testing script. It can help you get > > started. There are 3 different traces being applied there: the > > bpftrace for collapse latencies, the perf record for the flamegraph > > (not actually that useful, but may be useful to visualize any > > weird/long paths that you may not have noticed), and the trace-cmd > > which records the tracepoint of the scan and the collapse functions > > then processes the data using the awk script-- the output being the > > scan rate, the pages collapsed, and their result status (grouped by > > order). > > > > You can also look into https://github.com/gormanm/mmtests for > > testing/comparing kernels. I was running the > > config-memdb-redis-benchmark-medium workload. > > Thanks. I'll take a look. > > > > >> > >>> > >>> Also, khugepaged, my changes, and your changes are all a victim of > >>> this. Once we drop the read lock (to either allocate the folio, or > >>> right before acquiring the write_lock), the state can change. In your > >>> case, yes, you are gathering more up to date information, but is it > >>> really that important/worth it to retake locks and rescan for each > >>> instance if we are about to reverify with the write lock taken? > >> > >> You said "reverify": You are removing the verification, so this step > >> won't be reverification, it will be verification. We do not want to > >> verify *after* we have already done 95% of latency-heavy stuff, only to > >> know that we are going to fail. > >> > >> Algorithms in the kernel, in general, are of the following form: 1) > >> Verify if a condition is true, resulting in taking a control path -> 2) > >> do a lot of stuff -> "no turning back" step, wherein before committing > >> (by taking locks, say), reverify if this is the control path we should > >> be in. You are eliminating step 1). > >> > >> Therefore, I will have to say that I disagree with your approach. > >> > >> On top of this, in the subjective analysis in [1], point number 7 (along > >> with point number 1) remains. And, point number 4 remains. > > > > for 1) your worst case of 1024 is not the worst case. There are 8 > > possible orders in your implementation, if all are enabled, that is > > 4096 iterations in the worst case. > > Yes, that is exactly what I wrote in 1). I am still not convinced that > the overhead you produce + 512 iterations is going to beat 4096 > iterations. Anyways, that is hand-waving and we should test this. > > > This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case. > > Sorry, I am missing something here; how does the number of iterations > change with page size? Am I not scanning the PTE table, which is > invariant to the page size? I got the calculation wrong the first time and it's actually worst. 
Let's hope I got it right this time. On an ARM64 64k kernel: PMD size = 512M, PTE size = 64k, so PTEs per PMD = 8192; log2(8192) = 13, and 13 - 2 = 11 possible (m)THP sizes including PMD (the first and second orders are skipped). Assuming I understand your algorithm correctly, in the worst case you are scanning the whole PMD for each order, so you scan 8192 PTEs 11 times: 8192 * 11 = 90112. Please let me know if I'm missing something here. > > >> > >> [1] > >> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ > >> > >>> > >>> So in my eyes, this is not a "problem" > >> > >> Looks like the kernel scheduled us for a high-priority debate, I hope > >> there's no deadlock :) > >> > >>> > >>> Cheers, > >>> -- Nico > >>> > >>> > >>>> > >>> > >> > >
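For reference, a tiny program reproducing the worst-case arithmetic compared in the thread (the per-order rescan figures Nico attributes to the other approach); the order counts of 8 and 11 follow the thread's own assumptions.

```c
/*
 * Reproduces the worst-case scan counts quoted in the thread: PTEs per PMD
 * multiplied by the number of candidate orders, for 4K and 64K base pages.
 * The order counts (8 and 11) follow the thread's own assumptions.
 */
#include <stdio.h>

static void worst_case(const char *label, unsigned int ptes_per_pmd, unsigned int nr_orders)
{
	printf("%s: %u PTEs/PMD x %u orders = %u iterations\n",
	       label, ptes_per_pmd, nr_orders, ptes_per_pmd * nr_orders);
}

int main(void)
{
	worst_case("4K base pages ", 512, 8);    /* ->  4096 */
	worst_case("64K base pages", 8192, 11);  /* -> 90112 */
	return 0;
}
```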
On 15/02/25 6:22 am, Nico Pache wrote: > On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote: >> >> >> >> On 14/02/25 1:09 am, Nico Pache wrote: >>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: >>>> >>>> >>>> >>>> On 12/02/25 10:19 pm, Nico Pache wrote: >>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 11/02/25 6:00 am, Nico Pache wrote: >>>>>>> The following series provides khugepaged and madvise collapse with the >>>>>>> capability to collapse regions to mTHPs. >>>>>>> >>>>>>> To achieve this we generalize the khugepaged functions to no longer depend >>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>>>>>> on max_ptes_none is removed during the scan, to make sure we account for >>>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse >>>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse >>>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the >>>>>>> collapse. >>>>>>> >>>>>>> With the default max_ptes_none=511, the code should keep its most of its >>>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. >>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and >>>>>>> constantly promote mTHPs to the next available size. >>>>>>> >>>>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged >>>>>>> Patch 2: Refactor/rename hpage_collapse >>>>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>>>>>> Patch 6-9: The mTHP patches >>>>>>> >>>>>>> --------- >>>>>>> Testing >>>>>>> --------- >>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x >>>>>>> - selftests mm >>>>>>> - I created a test script that I used to push khugepaged to its limits while >>>>>>> monitoring a number of stats and tracepoints. The code is available >>>>>>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) >>>>>>> The summary from my testings was that there was no significant regression >>>>>>> noticed through this test. In some cases my changes had better collapse >>>>>>> latencies, and was able to scan more pages in the same amount of time/work, >>>>>>> but for the most part the results were consistant. >>>>>>> - redis testing. I tested these changes along with my defer changes >>>>>>> (see followup post for more details). >>>>>>> - some basic testing on 64k page size. >>>>>>> - lots of general use. These changes have been running in my VM for some time. >>>>>>> >>>>>>> Changes since V1 [2]: >>>>>>> - Minor bug fixes discovered during review and testing >>>>>>> - removed dynamic allocations for bitmaps, and made them stack based >>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>>>>>> - Updated trace events to include collapsing order info. >>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. >>>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use >>>>>>> the same max_ptes_none scaling principle to achieve this. >>>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent >>>>>>> some of the "creep" that was discovered in v1. 
>>>>>>> >>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test >>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ >>>>>>> >>>>>>> Nico Pache (9): >>>>>>> introduce khugepaged_collapse_single_pmd to unify khugepaged and >>>>>>> madvise_collapse >>>>>>> khugepaged: rename hpage_collapse_* to khugepaged_* >>>>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support >>>>>>> khugepaged: generalize alloc_charge_folio for mTHP support >>>>>>> khugepaged: generalize __collapse_huge_page_* for mTHP support >>>>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>>>>>> khugepaged: add mTHP support >>>>>>> khugepaged: improve tracepoints for mTHP orders >>>>>>> khugepaged: skip collapsing mTHP to smaller orders >>>>>>> >>>>>>> include/linux/khugepaged.h | 4 + >>>>>>> include/trace/events/huge_memory.h | 34 ++- >>>>>>> mm/khugepaged.c | 422 +++++++++++++++++++---------- >>>>>>> 3 files changed, 306 insertions(+), 154 deletions(-) >>>>>>> >>>>>> >>>>>> Does this patchset suffer from the problem described here: >>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/ >>>>> Hi Dev, >>>>> >>>>> Sorry I meant to get back to you about that. >>>>> >>>>> I understand your concern, but like I've mentioned before, the scan >>>>> with the read lock was done so we dont have to do the more expensive >>>>> locking, and could still gain insight into the state. You are right >>>>> that this info could become stale if the state changes dramatically, >>>>> but the collapse_isolate function will verify it and not collapse. >>>> >>>> If the state changes dramatically, the _isolate function will verify it, >>>> and fallback. And this fallback happens after following this costly >>>> path: retrieve a large folio from the buddy allocator -> swapin pages >>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush >>>> on all CPUs -> fallback in _isolate(). >>>> If you do fail in _isolate(), doesn't it make sense to get the updated >>>> state for the next fallback order immediately, because we have prior >>>> information that we failed because of PTE state? What your algorithm >>>> will do is *still* follow the costly path described above, and again >>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like >>>> mine would. >>> >>> You do raise a valid point here, I can optimize my solution by >>> detecting certain collapse failure types and jump to the next scan. >>> I'll add that to my solution, thanks! >>> >>> As for the disagreement around the bitmap, we'll leave that up to the >>> community to decide since we have differing opinions/solutions. >>> >>>> >>>> The verification of the PTE state by the _isolate() function is the "no >>>> turning back" point of the algorithm. The verification by >>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth >>>> it, before we do costly operations" point of the algorithm. >>>> >>>>> From my testing I found this to rarely happen. >>>> >>>> Unfortunately, I am not very familiar with performance testing/load >>>> testing, I am fairly new to kernel programming, so I am getting there. >>>> But it really depends on the type of test you are running, what actually >>>> runs on memory-intensive systems, etc etc. In fact, on loaded systems I >>>> would expect the PTE state to dramatically change. But still, no opinion >>>> here. >>> >>> Yeah there are probably some cases where it happens more often. 
>>> Probably in cases of short lived allocations, but khugepaged doesn't >>> run that frequently so those won't be that big of an issue. >>> >>> Our performance team is currently testing my implementation so I >>> should have more real workload test results soon. The redis testing >>> had some gains and didn't show any signs of obvious regressions. >>> >>> As for the testing, check out >>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads >>> this does the tracing for my testing script. It can help you get >>> started. There are 3 different traces being applied there: the >>> bpftrace for collapse latencies, the perf record for the flamegraph >>> (not actually that useful, but may be useful to visualize any >>> weird/long paths that you may not have noticed), and the trace-cmd >>> which records the tracepoint of the scan and the collapse functions >>> then processes the data using the awk script-- the output being the >>> scan rate, the pages collapsed, and their result status (grouped by >>> order). >>> >>> You can also look into https://github.com/gormanm/mmtests for >>> testing/comparing kernels. I was running the >>> config-memdb-redis-benchmark-medium workload. >> >> Thanks. I'll take a look. >> >>> >>>> >>>>> >>>>> Also, khugepaged, my changes, and your changes are all a victim of >>>>> this. Once we drop the read lock (to either allocate the folio, or >>>>> right before acquiring the write_lock), the state can change. In your >>>>> case, yes, you are gathering more up to date information, but is it >>>>> really that important/worth it to retake locks and rescan for each >>>>> instance if we are about to reverify with the write lock taken? >>>> >>>> You said "reverify": You are removing the verification, so this step >>>> won't be reverification, it will be verification. We do not want to >>>> verify *after* we have already done 95% of latency-heavy stuff, only to >>>> know that we are going to fail. >>>> >>>> Algorithms in the kernel, in general, are of the following form: 1) >>>> Verify if a condition is true, resulting in taking a control path -> 2) >>>> do a lot of stuff -> "no turning back" step, wherein before committing >>>> (by taking locks, say), reverify if this is the control path we should >>>> be in. You are eliminating step 1). >>>> >>>> Therefore, I will have to say that I disagree with your approach. >>>> >>>> On top of this, in the subjective analysis in [1], point number 7 (along >>>> with point number 1) remains. And, point number 4 remains. >>> >>> for 1) your worst case of 1024 is not the worst case. There are 8 >>> possible orders in your implementation, if all are enabled, that is >>> 4096 iterations in the worst case. >> >> Yes, that is exactly what I wrote in 1). I am still not convinced that >> the overhead you produce + 512 iterations is going to beat 4096 >> iterations. Anyways, that is hand-waving and we should test this. >> >>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case. >> >> Sorry, I am missing something here; how does the number of iterations >> change with page size? Am I not scanning the PTE table, which is >> invariant to the page size? > > I got the calculation wrong the first time and it's actually worst. > Lets hope I got this right this time > on ARM64 64k kernel: > PMD size = 512M > PTE= 64k > PTEs per PMD = 8192 *facepalm* my bad, thanks. 
I got thrown off thinking HPAGE_PMD_NR won't depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) = PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1 << 13. > log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the > first and second order are skipped) > > Assuming I understand your algorithm correctly, in the worst case you > are scanning the whole PMD for each order. > > So you scan 8192 PTEs 11 times. 8192 * 11 = 90112. Yup. Now it seems that the bitmap overhead may just be worth it; for the worst case the bitmap will give us an 11x saving...for the average case, it will give us 2x, but still, 8192 is a large number. I'll think of ways to test this out. Btw, I was made aware that an LWN article just got posted on our work! https://lwn.net/Articles/1009039/ > > Please let me know if I'm missing something here. >> >>>> >>>> [1] >>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ >>>> >>>>> >>>>> So in my eyes, this is not a "problem" >>>> >>>> Looks like the kernel scheduled us for a high-priority debate, I hope >>>> there's no deadlock :) >>>> >>>>> >>>>> Cheers, >>>>> -- Nico >>>>> >>>>> >>>>>> >>>>> >>>> >>> >> >
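For reference, a minimal standalone sketch of the worst-case arithmetic being discussed above. It assumes every fallback order rescans the full PTE table; the PTE counts and order counts (8 orders on 4K pages, 11 on 64K pages) are taken straight from the thread rather than derived here, and the helper name is illustrative only.

#include <stdio.h>

/*
 * Model of the worst case discussed above: every fallback order rescans the
 * full PTE table, so total PTE visits = (PTEs per PMD) * (number of orders
 * attempted).  The order counts are the ones quoted in this thread.
 */
static unsigned long worst_case_pte_visits(unsigned long ptes_per_pmd,
                                           unsigned int nr_orders)
{
        return ptes_per_pmd * nr_orders;
}

int main(void)
{
        printf("4K  pages: %lu PTE visits\n", worst_case_pte_visits(512, 8));
        printf("64K pages: %lu PTE visits\n", worst_case_pte_visits(8192, 11));
        return 0;
}

Running this just reproduces the 4096 and 90112 figures quoted above, which is the gap the bitmap pass is meant to close.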
On 11/02/25 6:00 am, Nico Pache wrote: > The following series provides khugepaged and madvise collapse with the > capability to collapse regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > using a bitmap. After the PMD scan is done, we do binary recursion on the > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > on max_ptes_none is removed during the scan, to make sure we account for > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > order to determine how full a THP must be to be eligible. If a mTHP collapse > is attempted, but contains swapped out, or shared pages, we dont perform the > collapse. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > constantly promote mTHPs to the next available size. How does creep stop when max_ptes_none <= 255? > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > Patch 2: Refactor/rename hpage_collapse > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-9: The mTHP patches > > --------- > Testing > --------- > - Built for x86_64, aarch64, ppc64le, and s390x > - selftests mm > - I created a test script that I used to push khugepaged to its limits while > monitoring a number of stats and tracepoints. The code is available > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > The summary from my testings was that there was no significant regression > noticed through this test. In some cases my changes had better collapse > latencies, and was able to scan more pages in the same amount of time/work, > but for the most part the results were consistant. > - redis testing. I tested these changes along with my defer changes > (see followup post for more details). > - some basic testing on 64k page size. > - lots of general use. These changes have been running in my VM for some time. > > Changes since V1 [2]: > - Minor bug fixes discovered during review and testing > - removed dynamic allocations for bitmaps, and made them stack based > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > - Updated trace events to include collapsing order info. > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > - No longer require a chunk to be fully utilized before setting the bit. Use > the same max_ptes_none scaling principle to achieve this. > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > some of the "creep" that was discovered in v1. 
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > Nico Pache (9): > introduce khugepaged_collapse_single_pmd to unify khugepaged and > madvise_collapse > khugepaged: rename hpage_collapse_* to khugepaged_* > khugepaged: generalize hugepage_vma_revalidate for mTHP support > khugepaged: generalize alloc_charge_folio for mTHP support > khugepaged: generalize __collapse_huge_page_* for mTHP support > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > khugepaged: add mTHP support > khugepaged: improve tracepoints for mTHP orders > khugepaged: skip collapsing mTHP to smaller orders > > include/linux/khugepaged.h | 4 + > include/trace/events/huge_memory.h | 34 ++- > mm/khugepaged.c | 422 +++++++++++++++++++---------- > 3 files changed, 306 insertions(+), 154 deletions(-) >
On 15/02/25 12:08 pm, Dev Jain wrote: > > > On 15/02/25 6:22 am, Nico Pache wrote: >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote: >>> >>> >>> >>> On 14/02/25 1:09 am, Nico Pache wrote: >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: >>>>> >>>>> >>>>> >>>>> On 12/02/25 10:19 pm, Nico Pache wrote: >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote: >>>>>>>> The following series provides khugepaged and madvise collapse >>>>>>>> with the >>>>>>>> capability to collapse regions to mTHPs. >>>>>>>> >>>>>>>> To achieve this we generalize the khugepaged functions to no >>>>>>>> longer depend >>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks >>>>>>>> of pages >>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>>>>>>> using a bitmap. After the PMD scan is done, we do binary >>>>>>>> recursion on the >>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The >>>>>>>> restriction >>>>>>>> on max_ptes_none is removed during the scan, to make sure we >>>>>>>> account for >>>>>>>> the whole PMD range. max_ptes_none will be scaled by the >>>>>>>> attempted collapse >>>>>>>> order to determine how full a THP must be to be eligible. If a >>>>>>>> mTHP collapse >>>>>>>> is attempted, but contains swapped out, or shared pages, we dont >>>>>>>> perform the >>>>>>>> collapse. >>>>>>>> >>>>>>>> With the default max_ptes_none=511, the code should keep its >>>>>>>> most of its >>>>>>>> original behavior. To exercise mTHP collapse we need to set >>>>>>>> max_ptes_none<=255. >>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse >>>>>>>> "creep" and >>>>>>>> constantly promote mTHPs to the next available size. >>>>>>>> >>>>>>>> Patch 1: Some refactoring to combine madvise_collapse and >>>>>>>> khugepaged >>>>>>>> Patch 2: Refactor/rename hpage_collapse >>>>>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>>>>>>> Patch 6-9: The mTHP patches >>>>>>>> >>>>>>>> --------- >>>>>>>> Testing >>>>>>>> --------- >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x >>>>>>>> - selftests mm >>>>>>>> - I created a test script that I used to push khugepaged to its >>>>>>>> limits while >>>>>>>> monitoring a number of stats and tracepoints. The code is >>>>>>>> available >>>>>>>> here[1] (Run in legacy mode for these changes and set >>>>>>>> mthp sizes to inherit) >>>>>>>> The summary from my testings was that there was no >>>>>>>> significant regression >>>>>>>> noticed through this test. In some cases my changes had >>>>>>>> better collapse >>>>>>>> latencies, and was able to scan more pages in the same >>>>>>>> amount of time/work, >>>>>>>> but for the most part the results were consistant. >>>>>>>> - redis testing. I tested these changes along with my defer changes >>>>>>>> (see followup post for more details). >>>>>>>> - some basic testing on 64k page size. >>>>>>>> - lots of general use. These changes have been running in my VM >>>>>>>> for some time. >>>>>>>> >>>>>>>> Changes since V1 [2]: >>>>>>>> - Minor bug fixes discovered during review and testing >>>>>>>> - removed dynamic allocations for bitmaps, and made them stack >>>>>>>> based >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>>>>>>> - Updated trace events to include collapsing order info. >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 >>>>>>>> scale. 
>>>>>>>> - No longer require a chunk to be fully utilized before setting >>>>>>>> the bit. Use >>>>>>>> the same max_ptes_none scaling principle to achieve this. >>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. >>>>>>>> This helps prevent >>>>>>>> some of the "creep" that was discovered in v1. >>>>>>>> >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- >>>>>>>> npache@redhat.com/ >>>>>>>> >>>>>>>> Nico Pache (9): >>>>>>>> introduce khugepaged_collapse_single_pmd to unify >>>>>>>> khugepaged and >>>>>>>> madvise_collapse >>>>>>>> khugepaged: rename hpage_collapse_* to khugepaged_* >>>>>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP >>>>>>>> support >>>>>>>> khugepaged: generalize alloc_charge_folio for mTHP support >>>>>>>> khugepaged: generalize __collapse_huge_page_* for mTHP >>>>>>>> support >>>>>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>>>>>>> khugepaged: add mTHP support >>>>>>>> khugepaged: improve tracepoints for mTHP orders >>>>>>>> khugepaged: skip collapsing mTHP to smaller orders >>>>>>>> >>>>>>>> include/linux/khugepaged.h | 4 + >>>>>>>> include/trace/events/huge_memory.h | 34 ++- >>>>>>>> mm/khugepaged.c | 422 ++++++++++++++++++ >>>>>>>> +---------- >>>>>>>> 3 files changed, 306 insertions(+), 154 deletions(-) >>>>>>>> >>>>>>> >>>>>>> Does this patchset suffer from the problem described here: >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- >>>>>>> c2d48d4963b6@arm.com/ >>>>>> Hi Dev, >>>>>> >>>>>> Sorry I meant to get back to you about that. >>>>>> >>>>>> I understand your concern, but like I've mentioned before, the scan >>>>>> with the read lock was done so we dont have to do the more expensive >>>>>> locking, and could still gain insight into the state. You are right >>>>>> that this info could become stale if the state changes dramatically, >>>>>> but the collapse_isolate function will verify it and not collapse. >>>>> >>>>> If the state changes dramatically, the _isolate function will >>>>> verify it, >>>>> and fallback. And this fallback happens after following this costly >>>>> path: retrieve a large folio from the buddy allocator -> swapin pages >>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB >>>>> flush >>>>> on all CPUs -> fallback in _isolate(). >>>>> If you do fail in _isolate(), doesn't it make sense to get the updated >>>>> state for the next fallback order immediately, because we have prior >>>>> information that we failed because of PTE state? What your algorithm >>>>> will do is *still* follow the costly path described above, and again >>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() >>>>> like >>>>> mine would. >>>> >>>> You do raise a valid point here, I can optimize my solution by >>>> detecting certain collapse failure types and jump to the next scan. >>>> I'll add that to my solution, thanks! >>>> >>>> As for the disagreement around the bitmap, we'll leave that up to the >>>> community to decide since we have differing opinions/solutions. >>>> >>>>> >>>>> The verification of the PTE state by the _isolate() function is the >>>>> "no >>>>> turning back" point of the algorithm. The verification by >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even >>>>> worth >>>>> it, before we do costly operations" point of the algorithm. >>>>> >>>>>> From my testing I found this to rarely happen. 
>>>>> >>>>> Unfortunately, I am not very familiar with performance testing/load >>>>> testing, I am fairly new to kernel programming, so I am getting there. >>>>> But it really depends on the type of test you are running, what >>>>> actually >>>>> runs on memory-intensive systems, etc etc. In fact, on loaded >>>>> systems I >>>>> would expect the PTE state to dramatically change. But still, no >>>>> opinion >>>>> here. >>>> >>>> Yeah there are probably some cases where it happens more often. >>>> Probably in cases of short lived allocations, but khugepaged doesn't >>>> run that frequently so those won't be that big of an issue. >>>> >>>> Our performance team is currently testing my implementation so I >>>> should have more real workload test results soon. The redis testing >>>> had some gains and didn't show any signs of obvious regressions. >>>> >>>> As for the testing, check out >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record- >>>> khuge-performance.sh?ref_type=heads >>>> this does the tracing for my testing script. It can help you get >>>> started. There are 3 different traces being applied there: the >>>> bpftrace for collapse latencies, the perf record for the flamegraph >>>> (not actually that useful, but may be useful to visualize any >>>> weird/long paths that you may not have noticed), and the trace-cmd >>>> which records the tracepoint of the scan and the collapse functions >>>> then processes the data using the awk script-- the output being the >>>> scan rate, the pages collapsed, and their result status (grouped by >>>> order). >>>> >>>> You can also look into https://github.com/gormanm/mmtests for >>>> testing/comparing kernels. I was running the >>>> config-memdb-redis-benchmark-medium workload. >>> >>> Thanks. I'll take a look. >>> >>>> >>>>> >>>>>> >>>>>> Also, khugepaged, my changes, and your changes are all a victim of >>>>>> this. Once we drop the read lock (to either allocate the folio, or >>>>>> right before acquiring the write_lock), the state can change. In your >>>>>> case, yes, you are gathering more up to date information, but is it >>>>>> really that important/worth it to retake locks and rescan for each >>>>>> instance if we are about to reverify with the write lock taken? >>>>> >>>>> You said "reverify": You are removing the verification, so this step >>>>> won't be reverification, it will be verification. We do not want to >>>>> verify *after* we have already done 95% of latency-heavy stuff, >>>>> only to >>>>> know that we are going to fail. >>>>> >>>>> Algorithms in the kernel, in general, are of the following form: 1) >>>>> Verify if a condition is true, resulting in taking a control path - >>>>> > 2) >>>>> do a lot of stuff -> "no turning back" step, wherein before committing >>>>> (by taking locks, say), reverify if this is the control path we should >>>>> be in. You are eliminating step 1). >>>>> >>>>> Therefore, I will have to say that I disagree with your approach. >>>>> >>>>> On top of this, in the subjective analysis in [1], point number 7 >>>>> (along >>>>> with point number 1) remains. And, point number 4 remains. >>>> >>>> for 1) your worst case of 1024 is not the worst case. There are 8 >>>> possible orders in your implementation, if all are enabled, that is >>>> 4096 iterations in the worst case. >>> >>> Yes, that is exactly what I wrote in 1). I am still not convinced that >>> the overhead you produce + 512 iterations is going to beat 4096 >>> iterations. Anyways, that is hand-waving and we should test this. 
>>> >>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 >>>> in my case. >>> >>> Sorry, I am missing something here; how does the number of iterations >>> change with page size? Am I not scanning the PTE table, which is >>> invariant to the page size? >> >> I got the calculation wrong the first time and it's actually worst. >> Lets hope I got this right this time >> on ARM64 64k kernel: >> PMD size = 512M >> PTE= 64k >> PTEs per PMD = 8192 > > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) = > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1 > << 13. > >> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the >> first and second order are skipped) >> >> Assuming I understand your algorithm correctly, in the worst case you >> are scanning the whole PMD for each order. >> >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112. > > Yup. Now it seems that the bitmap overhead may just be worth it; for the > worst case the bitmap will give us an 11x saving...for the average case, > it will give us 2x, but still, 8192 is a large number. I'll think of Clearing on this: the saving is w.r.t the initial scan. That is, if time taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and y is the bitmap overhead, then time taken by DJ is 2x + collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in _isolate(). Anyhow, we differ in opinion as to where the max_ptes_* check should be placed; I recalled the following: https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/ https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/ One thing you can do to relieve one of my criticisms (not completely) is apply the following patch (this can be done in both methods): diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b589f889bb5a..dc5cb602eaad 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, } vmf.orig_pte = ptep_get_lockless(pte); - if (!is_swap_pte(vmf.orig_pte)) + if (!is_swap_pte(vmf.orig_pte)) { continue; + } else { + if (order != HPAGE_PMD_ORDER) { + result = SCAN_EXCEED_SWAP_PTE; + goto out; + } + } vmf.pte = pte; vmf.ptl = ptl;
On Sun, Feb 16, 2025 at 11:39 PM Dev Jain <dev.jain@arm.com> wrote: > > > > On 11/02/25 6:00 am, Nico Pache wrote: > > The following series provides khugepaged and madvise collapse with the > > capability to collapse regions to mTHPs. > > > > To achieve this we generalize the khugepaged functions to no longer depend > > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > > using a bitmap. After the PMD scan is done, we do binary recursion on the > > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > > on max_ptes_none is removed during the scan, to make sure we account for > > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > > order to determine how full a THP must be to be eligible. If a mTHP collapse > > is attempted, but contains swapped out, or shared pages, we dont perform the > > collapse. > > > > With the default max_ptes_none=511, the code should keep its most of its > > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > > constantly promote mTHPs to the next available size. > > How does creep stop when max_ptes_none <= 255? think of a 512kB mTHP region that is half utilized (but not collapsed), and max_ptes_none = 255. threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) >> (HPAGE_PMD_ORDER - state.order); threshold_bits = 512 - 1 - 255 >> 9 - 4 = 256 >> 5 = 8 So >8 bits must be set, where the bitmap length for this section is a max of 16 bits. This means more than half of the mTHP has to be utilized before collapse. If we collapse and introduce half of an empty mTHP then the mTHP will be available next round for another promotion. When max_ptes_none is 256, the threshold_bits = 7 and collapsing will make enough non-none pages to collapse again. > > > > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > > Patch 2: Refactor/rename hpage_collapse > > Patch 3-5: Generalize khugepaged functions for arbitrary orders > > Patch 6-9: The mTHP patches > > > > --------- > > Testing > > --------- > > - Built for x86_64, aarch64, ppc64le, and s390x > > - selftests mm > > - I created a test script that I used to push khugepaged to its limits while > > monitoring a number of stats and tracepoints. The code is available > > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > > The summary from my testings was that there was no significant regression > > noticed through this test. In some cases my changes had better collapse > > latencies, and was able to scan more pages in the same amount of time/work, > > but for the most part the results were consistant. > > - redis testing. I tested these changes along with my defer changes > > (see followup post for more details). > > - some basic testing on 64k page size. > > - lots of general use. These changes have been running in my VM for some time. > > > > Changes since V1 [2]: > > - Minor bug fixes discovered during review and testing > > - removed dynamic allocations for bitmaps, and made them stack based > > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > > - Updated trace events to include collapsing order info. > > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > > - No longer require a chunk to be fully utilized before setting the bit. 
Use > > the same max_ptes_none scaling principle to achieve this. > > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > > some of the "creep" that was discovered in v1. > > > > [1] - https://gitlab.com/npache/khugepaged_mthp_test > > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > > > Nico Pache (9): > > introduce khugepaged_collapse_single_pmd to unify khugepaged and > > madvise_collapse > > khugepaged: rename hpage_collapse_* to khugepaged_* > > khugepaged: generalize hugepage_vma_revalidate for mTHP support > > khugepaged: generalize alloc_charge_folio for mTHP support > > khugepaged: generalize __collapse_huge_page_* for mTHP support > > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > > khugepaged: add mTHP support > > khugepaged: improve tracepoints for mTHP orders > > khugepaged: skip collapsing mTHP to smaller orders > > > > include/linux/khugepaged.h | 4 + > > include/trace/events/huge_memory.h | 34 ++- > > mm/khugepaged.c | 422 +++++++++++++++++++---------- > > 3 files changed, 306 insertions(+), 154 deletions(-) > > >
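A small sketch mirroring the threshold_bits computation quoted in the message above. It assumes the order used in the shift counts eight-PTE bitmap chunks (so the 512K example corresponds to 16 bits and chunk order 4); the function name, constants, and that interpretation are assumptions for illustration, not the series' actual code.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9                       /* 4K base pages */
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)  /* 512 PTEs per PMD */

/*
 * Mirror of the threshold computation quoted above; chunk_order is assumed
 * to count eight-PTE bitmap chunks, so the 512K example is 16 bits with
 * chunk_order 4.
 */
static int threshold_bits(int max_ptes_none, int chunk_order)
{
        return (HPAGE_PMD_NR - max_ptes_none - 1) >>
               (HPAGE_PMD_ORDER - chunk_order);
}

int main(void)
{
        int nr_bits = 16;       /* bitmap bits covering the 512K region */

        /* max_ptes_none = 255: more than half of the bits must be set */
        printf("max_ptes_none=255: need > %d of %d bits set\n",
               threshold_bits(255, 4), nr_bits);
        /* max_ptes_none = 256: the threshold drops to 7, so a half-full
         * region collapses and immediately qualifies for the next size */
        printf("max_ptes_none=256: need > %d of %d bits set\n",
               threshold_bits(256, 4), nr_bits);
        return 0;
}

With max_ptes_none = 255 this prints a threshold of 8 out of 16 bits, and with 256 it prints 7, matching the two cases worked through in the message above.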
On Mon, Feb 17, 2025 at 1:06 AM Dev Jain <dev.jain@arm.com> wrote: > > > > On 15/02/25 12:08 pm, Dev Jain wrote: > > > > > > On 15/02/25 6:22 am, Nico Pache wrote: > >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote: > >>> > >>> > >>> > >>> On 14/02/25 1:09 am, Nico Pache wrote: > >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 12/02/25 10:19 pm, Nico Pache wrote: > >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote: > >>>>>>>> The following series provides khugepaged and madvise collapse > >>>>>>>> with the > >>>>>>>> capability to collapse regions to mTHPs. > >>>>>>>> > >>>>>>>> To achieve this we generalize the khugepaged functions to no > >>>>>>>> longer depend > >>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks > >>>>>>>> of pages > >>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > >>>>>>>> using a bitmap. After the PMD scan is done, we do binary > >>>>>>>> recursion on the > >>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The > >>>>>>>> restriction > >>>>>>>> on max_ptes_none is removed during the scan, to make sure we > >>>>>>>> account for > >>>>>>>> the whole PMD range. max_ptes_none will be scaled by the > >>>>>>>> attempted collapse > >>>>>>>> order to determine how full a THP must be to be eligible. If a > >>>>>>>> mTHP collapse > >>>>>>>> is attempted, but contains swapped out, or shared pages, we dont > >>>>>>>> perform the > >>>>>>>> collapse. > >>>>>>>> > >>>>>>>> With the default max_ptes_none=511, the code should keep its > >>>>>>>> most of its > >>>>>>>> original behavior. To exercise mTHP collapse we need to set > >>>>>>>> max_ptes_none<=255. > >>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse > >>>>>>>> "creep" and > >>>>>>>> constantly promote mTHPs to the next available size. > >>>>>>>> > >>>>>>>> Patch 1: Some refactoring to combine madvise_collapse and > >>>>>>>> khugepaged > >>>>>>>> Patch 2: Refactor/rename hpage_collapse > >>>>>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders > >>>>>>>> Patch 6-9: The mTHP patches > >>>>>>>> > >>>>>>>> --------- > >>>>>>>> Testing > >>>>>>>> --------- > >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x > >>>>>>>> - selftests mm > >>>>>>>> - I created a test script that I used to push khugepaged to its > >>>>>>>> limits while > >>>>>>>> monitoring a number of stats and tracepoints. The code is > >>>>>>>> available > >>>>>>>> here[1] (Run in legacy mode for these changes and set > >>>>>>>> mthp sizes to inherit) > >>>>>>>> The summary from my testings was that there was no > >>>>>>>> significant regression > >>>>>>>> noticed through this test. In some cases my changes had > >>>>>>>> better collapse > >>>>>>>> latencies, and was able to scan more pages in the same > >>>>>>>> amount of time/work, > >>>>>>>> but for the most part the results were consistant. > >>>>>>>> - redis testing. I tested these changes along with my defer changes > >>>>>>>> (see followup post for more details). > >>>>>>>> - some basic testing on 64k page size. > >>>>>>>> - lots of general use. These changes have been running in my VM > >>>>>>>> for some time. 
> >>>>>>>> > >>>>>>>> Changes since V1 [2]: > >>>>>>>> - Minor bug fixes discovered during review and testing > >>>>>>>> - removed dynamic allocations for bitmaps, and made them stack > >>>>>>>> based > >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > >>>>>>>> - Updated trace events to include collapsing order info. > >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 > >>>>>>>> scale. > >>>>>>>> - No longer require a chunk to be fully utilized before setting > >>>>>>>> the bit. Use > >>>>>>>> the same max_ptes_none scaling principle to achieve this. > >>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. > >>>>>>>> This helps prevent > >>>>>>>> some of the "creep" that was discovered in v1. > >>>>>>>> > >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test > >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- > >>>>>>>> npache@redhat.com/ > >>>>>>>> > >>>>>>>> Nico Pache (9): > >>>>>>>> introduce khugepaged_collapse_single_pmd to unify > >>>>>>>> khugepaged and > >>>>>>>> madvise_collapse > >>>>>>>> khugepaged: rename hpage_collapse_* to khugepaged_* > >>>>>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP > >>>>>>>> support > >>>>>>>> khugepaged: generalize alloc_charge_folio for mTHP support > >>>>>>>> khugepaged: generalize __collapse_huge_page_* for mTHP > >>>>>>>> support > >>>>>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support > >>>>>>>> khugepaged: add mTHP support > >>>>>>>> khugepaged: improve tracepoints for mTHP orders > >>>>>>>> khugepaged: skip collapsing mTHP to smaller orders > >>>>>>>> > >>>>>>>> include/linux/khugepaged.h | 4 + > >>>>>>>> include/trace/events/huge_memory.h | 34 ++- > >>>>>>>> mm/khugepaged.c | 422 ++++++++++++++++++ > >>>>>>>> +---------- > >>>>>>>> 3 files changed, 306 insertions(+), 154 deletions(-) > >>>>>>>> > >>>>>>> > >>>>>>> Does this patchset suffer from the problem described here: > >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- > >>>>>>> c2d48d4963b6@arm.com/ > >>>>>> Hi Dev, > >>>>>> > >>>>>> Sorry I meant to get back to you about that. > >>>>>> > >>>>>> I understand your concern, but like I've mentioned before, the scan > >>>>>> with the read lock was done so we dont have to do the more expensive > >>>>>> locking, and could still gain insight into the state. You are right > >>>>>> that this info could become stale if the state changes dramatically, > >>>>>> but the collapse_isolate function will verify it and not collapse. > >>>>> > >>>>> If the state changes dramatically, the _isolate function will > >>>>> verify it, > >>>>> and fallback. And this fallback happens after following this costly > >>>>> path: retrieve a large folio from the buddy allocator -> swapin pages > >>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB > >>>>> flush > >>>>> on all CPUs -> fallback in _isolate(). > >>>>> If you do fail in _isolate(), doesn't it make sense to get the updated > >>>>> state for the next fallback order immediately, because we have prior > >>>>> information that we failed because of PTE state? What your algorithm > >>>>> will do is *still* follow the costly path described above, and again > >>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() > >>>>> like > >>>>> mine would. > >>>> > >>>> You do raise a valid point here, I can optimize my solution by > >>>> detecting certain collapse failure types and jump to the next scan. > >>>> I'll add that to my solution, thanks! 
> >>>> > >>>> As for the disagreement around the bitmap, we'll leave that up to the > >>>> community to decide since we have differing opinions/solutions. > >>>> > >>>>> > >>>>> The verification of the PTE state by the _isolate() function is the > >>>>> "no > >>>>> turning back" point of the algorithm. The verification by > >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even > >>>>> worth > >>>>> it, before we do costly operations" point of the algorithm. > >>>>> > >>>>>> From my testing I found this to rarely happen. > >>>>> > >>>>> Unfortunately, I am not very familiar with performance testing/load > >>>>> testing, I am fairly new to kernel programming, so I am getting there. > >>>>> But it really depends on the type of test you are running, what > >>>>> actually > >>>>> runs on memory-intensive systems, etc etc. In fact, on loaded > >>>>> systems I > >>>>> would expect the PTE state to dramatically change. But still, no > >>>>> opinion > >>>>> here. > >>>> > >>>> Yeah there are probably some cases where it happens more often. > >>>> Probably in cases of short lived allocations, but khugepaged doesn't > >>>> run that frequently so those won't be that big of an issue. > >>>> > >>>> Our performance team is currently testing my implementation so I > >>>> should have more real workload test results soon. The redis testing > >>>> had some gains and didn't show any signs of obvious regressions. > >>>> > >>>> As for the testing, check out > >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record- > >>>> khuge-performance.sh?ref_type=heads > >>>> this does the tracing for my testing script. It can help you get > >>>> started. There are 3 different traces being applied there: the > >>>> bpftrace for collapse latencies, the perf record for the flamegraph > >>>> (not actually that useful, but may be useful to visualize any > >>>> weird/long paths that you may not have noticed), and the trace-cmd > >>>> which records the tracepoint of the scan and the collapse functions > >>>> then processes the data using the awk script-- the output being the > >>>> scan rate, the pages collapsed, and their result status (grouped by > >>>> order). > >>>> > >>>> You can also look into https://github.com/gormanm/mmtests for > >>>> testing/comparing kernels. I was running the > >>>> config-memdb-redis-benchmark-medium workload. > >>> > >>> Thanks. I'll take a look. > >>> > >>>> > >>>>> > >>>>>> > >>>>>> Also, khugepaged, my changes, and your changes are all a victim of > >>>>>> this. Once we drop the read lock (to either allocate the folio, or > >>>>>> right before acquiring the write_lock), the state can change. In your > >>>>>> case, yes, you are gathering more up to date information, but is it > >>>>>> really that important/worth it to retake locks and rescan for each > >>>>>> instance if we are about to reverify with the write lock taken? > >>>>> > >>>>> You said "reverify": You are removing the verification, so this step > >>>>> won't be reverification, it will be verification. We do not want to > >>>>> verify *after* we have already done 95% of latency-heavy stuff, > >>>>> only to > >>>>> know that we are going to fail. > >>>>> > >>>>> Algorithms in the kernel, in general, are of the following form: 1) > >>>>> Verify if a condition is true, resulting in taking a control path - > >>>>> > 2) > >>>>> do a lot of stuff -> "no turning back" step, wherein before committing > >>>>> (by taking locks, say), reverify if this is the control path we should > >>>>> be in. 
You are eliminating step 1). > >>>>> > >>>>> Therefore, I will have to say that I disagree with your approach. > >>>>> > >>>>> On top of this, in the subjective analysis in [1], point number 7 > >>>>> (along > >>>>> with point number 1) remains. And, point number 4 remains. > >>>> > >>>> for 1) your worst case of 1024 is not the worst case. There are 8 > >>>> possible orders in your implementation, if all are enabled, that is > >>>> 4096 iterations in the worst case. > >>> > >>> Yes, that is exactly what I wrote in 1). I am still not convinced that > >>> the overhead you produce + 512 iterations is going to beat 4096 > >>> iterations. Anyways, that is hand-waving and we should test this. > >>> > >>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 > >>>> in my case. > >>> > >>> Sorry, I am missing something here; how does the number of iterations > >>> change with page size? Am I not scanning the PTE table, which is > >>> invariant to the page size? > >> > >> I got the calculation wrong the first time and it's actually worst. > >> Lets hope I got this right this time > >> on ARM64 64k kernel: > >> PMD size = 512M > >> PTE= 64k > >> PTEs per PMD = 8192 > > > > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't > > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) = > > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1 > > << 13. > > > >> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the > >> first and second order are skipped) > >> > >> Assuming I understand your algorithm correctly, in the worst case you > >> are scanning the whole PMD for each order. > >> > >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112. > > > > Yup. Now it seems that the bitmap overhead may just be worth it; for the > > worst case the bitmap will give us an 11x saving...for the average case, > > it will give us 2x, but still, 8192 is a large number. I'll think of > > Clearing on this: the saving is w.r.t the initial scan. That is, if time > taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and > y is the bitmap overhead, then time taken by DJ is 2x + > collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in > _isolate(). Anyhow, we differ in opinion as to where the max_ptes_* > check should be placed; I recalled the following: > > https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/ > https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/ > > One thing you can do to relieve one of my criticisms (not completely) is > apply the following patch (this can be done in both methods): > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b589f889bb5a..dc5cb602eaad 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct > mm_struct *mm, > } > > vmf.orig_pte = ptep_get_lockless(pte); > - if (!is_swap_pte(vmf.orig_pte)) > + if (!is_swap_pte(vmf.orig_pte)) { > continue; > + } else { > + if (order != HPAGE_PMD_ORDER) { > + result = SCAN_EXCEED_SWAP_PTE; > + goto out; > + } > + } > > vmf.pte = pte; > vmf.ptl = ptl; > -- > > But this really is the same thing being done in the links above :) Yes my code already does this, @@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, if (!is_swap_pte(vmf.orig_pte)) continue; + if (order != HPAGE_PMD_ORDER) { + result = SCAN_EXCEED_SWAP_PTE; + goto out; + } + No need for the if/else nesting. 
> > > > ways to test this out. > > > > Btw, I was made aware that an LWN article just got posted on our work! > > https://lwn.net/Articles/1009039/ > > > >> > >> Please let me know if I'm missing something here. > >>> > >>>>> > >>>>> [1] > >>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b- > >>>>> aba4b1a441b4@arm.com/ > >>>>> > >>>>>> > >>>>>> So in my eyes, this is not a "problem" > >>>>> > >>>>> Looks like the kernel scheduled us for a high-priority debate, I hope > >>>>> there's no deadlock :) > >>>>> > >>>>>> > >>>>>> Cheers, > >>>>>> -- Nico > >>>>>> > >>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > > >
On 11/02/2025 00:30, Nico Pache wrote: > The following series provides khugepaged and madvise collapse with the > capability to collapse regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > using a bitmap. After the PMD scan is done, we do binary recursion on the > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > on max_ptes_none is removed during the scan, to make sure we account for > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > order to determine how full a THP must be to be eligible. If a mTHP collapse > is attempted, but contains swapped out, or shared pages, we dont perform the > collapse. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)? This is making my head hurt, but I *think* I agree with you that if max_ptes_none is less than half of the number of ptes in a pmd, then creep doesn't happen. To make sure I've understood; - to collapse to 16K, you would need >=3 out of 4 PTEs to be present - to collapse to 32K, you would need >=5 out of 8 PTEs to be present - to collapse to 64K, you would need >=9 out of 16 PTEs to be present - ... So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now have 4 PTEs in a 32K area which is insufficient to collapse to 32K. Sounds good to me! > constantly promote mTHPs to the next available size. > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > Patch 2: Refactor/rename hpage_collapse > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-9: The mTHP patches > > --------- > Testing > --------- > - Built for x86_64, aarch64, ppc64le, and s390x > - selftests mm > - I created a test script that I used to push khugepaged to its limits while > monitoring a number of stats and tracepoints. The code is available > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > The summary from my testings was that there was no significant regression > noticed through this test. In some cases my changes had better collapse > latencies, and was able to scan more pages in the same amount of time/work, > but for the most part the results were consistant. > - redis testing. I tested these changes along with my defer changes > (see followup post for more details). > - some basic testing on 64k page size. > - lots of general use. These changes have been running in my VM for some time. > > Changes since V1 [2]: > - Minor bug fixes discovered during review and testing > - removed dynamic allocations for bitmaps, and made them stack based > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > - Updated trace events to include collapsing order info. > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > - No longer require a chunk to be fully utilized before setting the bit. Use > the same max_ptes_none scaling principle to achieve this. > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > some of the "creep" that was discovered in v1. 
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > Nico Pache (9): > introduce khugepaged_collapse_single_pmd to unify khugepaged and > madvise_collapse > khugepaged: rename hpage_collapse_* to khugepaged_* > khugepaged: generalize hugepage_vma_revalidate for mTHP support > khugepaged: generalize alloc_charge_folio for mTHP support > khugepaged: generalize __collapse_huge_page_* for mTHP support > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > khugepaged: add mTHP support > khugepaged: improve tracepoints for mTHP orders > khugepaged: skip collapsing mTHP to smaller orders > > include/linux/khugepaged.h | 4 + > include/trace/events/huge_memory.h | 34 ++- > mm/khugepaged.c | 422 +++++++++++++++++++---------- > 3 files changed, 306 insertions(+), 154 deletions(-) >
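A hedged sketch of the per-order scaling Ryan spells out above. It assumes max_ptes_none is scaled by a right shift of (HPAGE_PMD_ORDER - order), which is what the figures in this exchange imply; the loop bounds and names are illustrative only.

#include <stdio.h>

#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

int main(void)
{
        int max_ptes_none = 255;        /* just below HPAGE_PMD_NR / 2 */
        int order;

        /*
         * Scale max_ptes_none down to each collapse order (assumed here to
         * be a right shift) and print how many PTEs must already be present
         * for that order to be eligible.
         */
        for (order = 2; order <= 4; order++) {
                int nr_ptes = 1 << order;
                int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - order);
                int min_present = nr_ptes - scaled_none;

                printf("order %d (%3dK): need >= %d of %d PTEs present\n",
                       order, nr_ptes * 4, min_present, nr_ptes);
        }
        return 0;
}

Running it prints the 3-of-4, 5-of-8 and 9-of-16 thresholds listed above; since a freshly collapsed order-N region contributes only half of the PTEs of the next size up, the creep stops whenever max_ptes_none stays below HPAGE_PMD_NR/2.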
On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 11/02/2025 00:30, Nico Pache wrote: > > The following series provides khugepaged and madvise collapse with the > > capability to collapse regions to mTHPs. > > > > To achieve this we generalize the khugepaged functions to no longer depend > > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > > using a bitmap. After the PMD scan is done, we do binary recursion on the > > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > > on max_ptes_none is removed during the scan, to make sure we account for > > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > > order to determine how full a THP must be to be eligible. If a mTHP collapse > > is attempted, but contains swapped out, or shared pages, we dont perform the > > collapse. > > > > With the default max_ptes_none=511, the code should keep its most of its > > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > > nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)? > This is making my head hurt, but I *think* I agree with you that if > max_ptes_none is less than half of the number of ptes in a pmd, then creep > doesn't happen. Haha yea the compressed bitmap does not make the math super easy to follow, but i'm glad we arrived at the same conclusion :) > > To make sure I've understood; > > - to collapse to 16K, you would need >=3 out of 4 PTEs to be present > - to collapse to 32K, you would need >=5 out of 8 PTEs to be present > - to collapse to 64K, you would need >=9 out of 16 PTEs to be present > - ... > > So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now > have 4 PTEs in a 32K area which is insufficient to collapse to 32K. > > Sounds good to me! Great! Another easy way to think about it is, with max_ptes_none = HPAGE_PMD_NR/2, a collapse will double the size, and we only need half for it to collapse again. Each size is 2x the last, so if we hit one collapse, it will be eligible again next round. > > > constantly promote mTHPs to the next available size. > > > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > > Patch 2: Refactor/rename hpage_collapse > > Patch 3-5: Generalize khugepaged functions for arbitrary orders > > Patch 6-9: The mTHP patches > > > > --------- > > Testing > > --------- > > - Built for x86_64, aarch64, ppc64le, and s390x > > - selftests mm > > - I created a test script that I used to push khugepaged to its limits while > > monitoring a number of stats and tracepoints. The code is available > > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > > The summary from my testings was that there was no significant regression > > noticed through this test. In some cases my changes had better collapse > > latencies, and was able to scan more pages in the same amount of time/work, > > but for the most part the results were consistant. > > - redis testing. I tested these changes along with my defer changes > > (see followup post for more details). > > - some basic testing on 64k page size. > > - lots of general use. These changes have been running in my VM for some time. 
> > > > Changes since V1 [2]: > > - Minor bug fixes discovered during review and testing > > - removed dynamic allocations for bitmaps, and made them stack based > > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > > - Updated trace events to include collapsing order info. > > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > > - No longer require a chunk to be fully utilized before setting the bit. Use > > the same max_ptes_none scaling principle to achieve this. > > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > > some of the "creep" that was discovered in v1. > > > > [1] - https://gitlab.com/npache/khugepaged_mthp_test > > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/ > > > > Nico Pache (9): > > introduce khugepaged_collapse_single_pmd to unify khugepaged and > > madvise_collapse > > khugepaged: rename hpage_collapse_* to khugepaged_* > > khugepaged: generalize hugepage_vma_revalidate for mTHP support > > khugepaged: generalize alloc_charge_folio for mTHP support > > khugepaged: generalize __collapse_huge_page_* for mTHP support > > khugepaged: introduce khugepaged_scan_bitmap for mTHP support > > khugepaged: add mTHP support > > khugepaged: improve tracepoints for mTHP orders > > khugepaged: skip collapsing mTHP to smaller orders > > > > include/linux/khugepaged.h | 4 + > > include/trace/events/huge_memory.h | 34 ++- > > mm/khugepaged.c | 422 +++++++++++++++++++---------- > > 3 files changed, 306 insertions(+), 154 deletions(-) > > >
On 19/02/25 4:00 am, Nico Pache wrote: > On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 11/02/2025 00:30, Nico Pache wrote: >>> The following series provides khugepaged and madvise collapse with the >>> capability to collapse regions to mTHPs. >>> >>> To achieve this we generalize the khugepaged functions to no longer depend >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>> on max_ptes_none is removed during the scan, to make sure we account for >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse >>> order to determine how full a THP must be to be eligible. If a mTHP collapse >>> is attempted, but contains swapped out, or shared pages, we dont perform the >>> collapse. >>> >>> With the default max_ptes_none=511, the code should keep its most of its >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and >> >> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)? >> This is making my head hurt, but I *think* I agree with you that if >> max_ptes_none is less than half of the number of ptes in a pmd, then creep >> doesn't happen. > Haha yea the compressed bitmap does not make the math super easy to > follow, but i'm glad we arrived at the same conclusion :) >> >> To make sure I've understood; >> >> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present >> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present >> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present >> - ... >> >> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now >> have 4 PTEs in a 32K area which is insufficient to collapse to 32K. >> >> Sounds good to me! > Great! Another easy way to think about it is, with max_ptes_none = > HPAGE_PMD_NR/2, a collapse will double the size, and we only need half > for it to collapse again. Each size is 2x the last, so if we hit one > collapse, it will be eligible again next round. Please someone correct me if I am wrong. Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64 => #chunks = 64 / 8 = 8. Let the PTE state within the chunks be as follows: Chunk 0: < 5 filled Chunk 1: 5 filled Chunk 2: 5 filled Chunk 3: 5 filled Chunk 4: 5 filled Chunk 5: < 5 filled Chunk 6: < 5 filled Chunk 7: < 5 filled Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204 < HPAGE_PMD_NR/2). => To collapse we need at least 60% of the PTEs filled. Your algorithm marks chunks in the bitmap if 60% of the chunk is filled. Then, if the number of chunks set is greater than 60%, then we will collapse. Chunk 0 will be marked zero because less than 5 PTEs are filled => percentage filled <= 50% Right now the state is 0111 1000 where the indices are the chunk numbers. Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse won't happen. For the first 4 chunks, the percent filled is 75%. So the state becomes 1111 1000 after 128K collapse, and now 256K collapse will happen. Either I got this correct, or I do not understand the utility of maintaining chunks :) What you are doing is what I am doing except that my chunk size = 1. 
On 11/02/2025 00:30, Nico Pache wrote: > The following series provides khugepaged and madvise collapse with the > capability to collapse regions to mTHPs. > > To achieve this we generalize the khugepaged functions to no longer depend > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > (defined by MTHP_MIN_ORDER) that are utilized. Having been through the code I don't think you can handle the case of a partially covered PMD? That was never important before because by definition we only cared about PMD-sized chunks, but it would be good to be able to collapse to appropriately sized mTHP, memory within a VMA that doesn't cover an entire PMD. Do you have plans to add that? Is the approach you've taken amenable to extension in this way? > This info is tracked > using a bitmap. After the PMD scan is done, we do binary recursion on the > bitmap to find the optimal mTHP sizes for the PMD range. The restriction > on max_ptes_none is removed during the scan, to make sure we account for > the whole PMD range. max_ptes_none will be scaled by the attempted collapse > order to determine how full a THP must be to be eligible. If a mTHP collapse > is attempted, but contains swapped out, or shared pages, we dont perform the > collapse. > > With the default max_ptes_none=511, the code should keep its most of its > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > constantly promote mTHPs to the next available size. > > Patch 1: Some refactoring to combine madvise_collapse and khugepaged > Patch 2: Refactor/rename hpage_collapse > Patch 3-5: Generalize khugepaged functions for arbitrary orders > Patch 6-9: The mTHP patches > > --------- > Testing > --------- > - Built for x86_64, aarch64, ppc64le, and s390x > - selftests mm > - I created a test script that I used to push khugepaged to its limits while > monitoring a number of stats and tracepoints. The code is available > here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > The summary from my testings was that there was no significant regression > noticed through this test. In some cases my changes had better collapse > latencies, and was able to scan more pages in the same amount of time/work, > but for the most part the results were consistant. > - redis testing. I tested these changes along with my defer changes > (see followup post for more details). > - some basic testing on 64k page size. > - lots of general use. These changes have been running in my VM for some time. > > Changes since V1 [2]: > - Minor bug fixes discovered during review and testing > - removed dynamic allocations for bitmaps, and made them stack based > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > - Updated trace events to include collapsing order info. > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > - No longer require a chunk to be fully utilized before setting the bit. Use > the same max_ptes_none scaling principle to achieve this. > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > some of the "creep" that was discovered in v1. 
On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote: > > > > On 19/02/25 4:00 am, Nico Pache wrote: > > On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 11/02/2025 00:30, Nico Pache wrote: > >>> The following series provides khugepaged and madvise collapse with the > >>> capability to collapse regions to mTHPs. > >>> > >>> To achieve this we generalize the khugepaged functions to no longer depend > >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked > >>> using a bitmap. After the PMD scan is done, we do binary recursion on the > >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction > >>> on max_ptes_none is removed during the scan, to make sure we account for > >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse > >>> order to determine how full a THP must be to be eligible. If a mTHP collapse > >>> is attempted, but contains swapped out, or shared pages, we dont perform the > >>> collapse. > >>> > >>> With the default max_ptes_none=511, the code should keep its most of its > >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. > >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and > >> > >> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)? > >> This is making my head hurt, but I *think* I agree with you that if > >> max_ptes_none is less than half of the number of ptes in a pmd, then creep > >> doesn't happen. > > Haha yea the compressed bitmap does not make the math super easy to > > follow, but i'm glad we arrived at the same conclusion :) > >> > >> To make sure I've understood; > >> > >> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present > >> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present > >> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present > >> - ... > >> > >> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now > >> have 4 PTEs in a 32K area which is insufficient to collapse to 32K. > >> > >> Sounds good to me! > > Great! Another easy way to think about it is, with max_ptes_none = > > HPAGE_PMD_NR/2, a collapse will double the size, and we only need half > > for it to collapse again. Each size is 2x the last, so if we hit one > > collapse, it will be eligible again next round. > > Please someone correct me if I am wrong. > max_ptes_none = 204 scaled_none = (204 >> 9 - 3) = ~3.1 so 4 pages need to be available in each chunk for the bit to be set, not 5. at 204 the bitmap check is 512 - 1 - 204 = 307 (PMD) 307 >> 3 = 38 (1024k) 307 >> 4 = 19 (512k) 307 >> 5 = 9 (256k) 307 >> 6 = 4 > Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64 > => #chunks = 64 / 8 = 8. > > Let the PTE state within the chunks be as follows: > > Chunk 0: < 5 filled Chunk 1: 5 filled Chunk 2: 5 filled Chunk 3: 5 > filled > > Chunk 4: 5 filled Chunk 5: < 5 filled Chunk 6: < 5 filled Chunk 7: > < 5 filled > > Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204 > < HPAGE_PMD_NR/2). > => To collapse we need at least 60% of the PTEs filled. > > Your algorithm marks chunks in the bitmap if 60% of the chunk is filled. > Then, if the number of chunks set is greater than 60%, then we will > collapse. 
> > Chunk 0 will be marked zero because less than 5 PTEs are filled => > percentage filled <= 50% > > Right now the state is > 0111 1000 > where the indices are the chunk numbers. > Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse > won't happen. > > For the first 4 chunks, the percent filled is 75%. So the state becomes > 1111 1000 > after 128K collapse, and now 256K collapse will happen. > > Either I got this correct, or I do not understand the utility of > maintaining chunks :) What you are doing is what I am doing except that > my chunk size = 1. Ignoring all the math, and just going off the 0111 1000 We do "creep", but its not the same type of "creep" we've been describing. The collapse in the first half will allow the collapse in order++ to collapse but it stops there and doesnt keep getting promoted to a PMD size. That is unless the adjacent 256k also has some bits set, then it can collapse to 512k. So I guess we still can creep, but its way less aggressive and only when there is actual memory being utilized in the adjacent chunk, so it's not like we are creating a huge waste. > > >> > >>> constantly promote mTHPs to the next available size. > >>> > >>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged > >>> Patch 2: Refactor/rename hpage_collapse > >>> Patch 3-5: Generalize khugepaged functions for arbitrary orders > >>> Patch 6-9: The mTHP patches > >>> > >>> --------- > >>> Testing > >>> --------- > >>> - Built for x86_64, aarch64, ppc64le, and s390x > >>> - selftests mm > >>> - I created a test script that I used to push khugepaged to its limits while > >>> monitoring a number of stats and tracepoints. The code is available > >>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) > >>> The summary from my testings was that there was no significant regression > >>> noticed through this test. In some cases my changes had better collapse > >>> latencies, and was able to scan more pages in the same amount of time/work, > >>> but for the most part the results were consistant. > >>> - redis testing. I tested these changes along with my defer changes > >>> (see followup post for more details). > >>> - some basic testing on 64k page size. > >>> - lots of general use. These changes have been running in my VM for some time. > >>> > >>> Changes since V1 [2]: > >>> - Minor bug fixes discovered during review and testing > >>> - removed dynamic allocations for bitmaps, and made them stack based > >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. > >>> - Updated trace events to include collapsing order info. > >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. > >>> - No longer require a chunk to be fully utilized before setting the bit. Use > >>> the same max_ptes_none scaling principle to achieve this. > >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent > >>> some of the "creep" that was discovered in v1. 
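As a sanity check on the arithmetic spelled out earlier in this mail, the following userspace sketch reproduces the numbers for max_ptes_none = 204. The macro names and the loop are only for the illustration, not the series' code; only the resulting numbers (scaled_none = 3, thresholds 38/19/9/4) come from the thread.

	#include <stdio.h>

	#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD with 4K base pages */
	#define CHUNK_ORDER	3	/* 8 PTEs per bitmap chunk */

	int main(void)
	{
		int max_ptes_none = 204;
		int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - CHUNK_ORDER);
		int bits_needed = (1 << HPAGE_PMD_ORDER) - 1 - max_ptes_none;	/* 307 */
		int order;

		/* 204 >> 6 = 3, so a chunk bit is set once 8 - 1 - 3 = 4 PTEs are present. */
		printf("scaled_none = %d -> %d present PTEs set a chunk bit\n",
		       scaled_none, (1 << CHUNK_ORDER) - 1 - scaled_none);

		/* Chunk bits needed per collapse order: 307 >> shift, as in the mail. */
		for (order = HPAGE_PMD_ORDER; order >= 6; order--)
			printf("order %d (%4dK): need %d of %d chunk bits\n",
			       order, (1 << order) * 4,
			       bits_needed >> (CHUNK_ORDER + HPAGE_PMD_ORDER - order),
			       1 << (order - CHUNK_ORDER));
		return 0;
	}

This prints the same table as above: 38 of 64 bits for the PMD, 19 of 32 for 1024k, 9 of 16 for 512k, 4 of 8 for 256k.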
On 21/02/25 12:42 am, Nico Pache wrote: > On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote: >> >> >> >> On 19/02/25 4:00 am, Nico Pache wrote: >>> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 11/02/2025 00:30, Nico Pache wrote: >>>>> The following series provides khugepaged and madvise collapse with the >>>>> capability to collapse regions to mTHPs. >>>>> >>>>> To achieve this we generalize the khugepaged functions to no longer depend >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>>>> on max_ptes_none is removed during the scan, to make sure we account for >>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse >>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse >>>>> is attempted, but contains swapped out, or shared pages, we dont perform the >>>>> collapse. >>>>> >>>>> With the default max_ptes_none=511, the code should keep its most of its >>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255. >>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and >>>> >>>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)? >>>> This is making my head hurt, but I *think* I agree with you that if >>>> max_ptes_none is less than half of the number of ptes in a pmd, then creep >>>> doesn't happen. >>> Haha yea the compressed bitmap does not make the math super easy to >>> follow, but i'm glad we arrived at the same conclusion :) >>>> >>>> To make sure I've understood; >>>> >>>> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present >>>> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present >>>> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present >>>> - ... >>>> >>>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now >>>> have 4 PTEs in a 32K area which is insufficient to collapse to 32K. >>>> >>>> Sounds good to me! >>> Great! Another easy way to think about it is, with max_ptes_none = >>> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half >>> for it to collapse again. Each size is 2x the last, so if we hit one >>> collapse, it will be eligible again next round. >> >> Please someone correct me if I am wrong. >> > > max_ptes_none = 204 > scaled_none = (204 >> 9 - 3) = ~3.1 > so 4 pages need to be available in each chunk for the bit to be set, not 5. > > at 204 the bitmap check is > 512 - 1 - 204 = 307 > (PMD) 307 >> 3 = 38 > (1024k) 307 >> 4 = 19 > (512k) 307 >> 5 = 9 > (256k) 307 >> 6 = 4 > >> Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64 >> => #chunks = 64 / 8 = 8. >> >> Let the PTE state within the chunks be as follows: >> >> Chunk 0: < 5 filled Chunk 1: 5 filled Chunk 2: 5 filled Chunk 3: 5 >> filled >> >> Chunk 4: 5 filled Chunk 5: < 5 filled Chunk 6: < 5 filled Chunk 7: >> < 5 filled >> >> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204 >> < HPAGE_PMD_NR/2). >> => To collapse we need at least 60% of the PTEs filled. >> >> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled. >> Then, if the number of chunks set is greater than 60%, then we will >> collapse. 
>> >> Chunk 0 will be marked zero because less than 5 PTEs are filled => >> percentage filled <= 50% >> >> Right now the state is >> 0111 1000 >> where the indices are the chunk numbers. >> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse >> won't happen. >> >> For the first 4 chunks, the percent filled is 75%. So the state becomes >> 1111 1000 >> after 128K collapse, and now 256K collapse will happen. >> >> Either I got this correct, or I do not understand the utility of >> maintaining chunks :) What you are doing is what I am doing except that >> my chunk size = 1. > > Ignoring all the math, and just going off the 0111 1000 > We do "creep", but its not the same type of "creep" we've been > describing. The collapse in the first half will allow the collapse in > order++ to collapse but it stops there and doesnt keep getting > promoted to a PMD size. That is unless the adjacent 256k also has some > bits set, then it can collapse to 512k. So I guess we still can creep, > but its way less aggressive and only when there is actual memory being > utilized in the adjacent chunk, so it's not like we are creating a > huge waste. I get you. You will creep when the adjacent chunk has at least 1 bit set. I don't really have a strong opinion on this one. > > >> >>>> >>>>> constantly promote mTHPs to the next available size. >>>>> >>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged >>>>> Patch 2: Refactor/rename hpage_collapse >>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders >>>>> Patch 6-9: The mTHP patches >>>>> >>>>> --------- >>>>> Testing >>>>> --------- >>>>> - Built for x86_64, aarch64, ppc64le, and s390x >>>>> - selftests mm >>>>> - I created a test script that I used to push khugepaged to its limits while >>>>> monitoring a number of stats and tracepoints. The code is available >>>>> here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) >>>>> The summary from my testings was that there was no significant regression >>>>> noticed through this test. In some cases my changes had better collapse >>>>> latencies, and was able to scan more pages in the same amount of time/work, >>>>> but for the most part the results were consistant. >>>>> - redis testing. I tested these changes along with my defer changes >>>>> (see followup post for more details). >>>>> - some basic testing on 64k page size. >>>>> - lots of general use. These changes have been running in my VM for some time. >>>>> >>>>> Changes since V1 [2]: >>>>> - Minor bug fixes discovered during review and testing >>>>> - removed dynamic allocations for bitmaps, and made them stack based >>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize. >>>>> - Updated trace events to include collapsing order info. >>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale. >>>>> - No longer require a chunk to be fully utilized before setting the bit. Use >>>>> the same max_ptes_none scaling principle to achieve this. >>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent >>>>> some of the "creep" that was discovered in v1. 
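And on the creep boundary itself, here is a tiny userspace sketch under a deliberately simplified model: a freshly collapsed order-o region sitting in an otherwise empty order-(o+1) region, with max_ptes_none simply right-shifted per order. The helper name and the exact rounding are assumptions for the illustration, not the series' logic; it only shows that, in this model, the larger size becomes eligible exactly once max_ptes_none reaches HPAGE_PMD_NR/2.

	#include <stdio.h>
	#include <stdbool.h>

	#define HPAGE_PMD_ORDER	9

	/* Eligible when the empty PTE count fits under the per-order scaled max_ptes_none. */
	static bool creeps(int max_ptes_none, int order)
	{
		int none = 1 << order;	/* the other half of the order+1 region is empty */
		int scaled = max_ptes_none >> (HPAGE_PMD_ORDER - (order + 1));

		return none <= scaled;
	}

	int main(void)
	{
		int max;

		for (max = 254; max <= 257; max++)
			printf("max_ptes_none=%d: 64K->128K %s\n", max,
			       creeps(max, 4) ? "creeps" : "does not creep");
		return 0;
	}

In this model the flip happens between 255 and 256, which lines up with the "no creep below HPAGE_PMD_NR/2" conclusion reached earlier in the thread.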