[v3,0/6] add mTHP support for anonymous shmem

Message ID cover.1717033868.git.baolin.wang@linux.alibaba.com (mailing list archive)

Message

Baolin Wang May 30, 2024, 2:04 a.m. UTC
Multi-size THP (mTHP) allocation for anonymous pages has been supported
since commit 19eaf44954df, which allows THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
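
For example, on such a kernel, 64K mTHP for anonymous memory can be
enabled independently of the other sizes with (the 'hugepage-64kB'
directory name assumes a 4K base page size):

  $ echo always > /sys/kernel/mm/transparent_hugepage/hugepage-64kB/enabled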

However, anonymous shmem ignores the mTHP rules configured through the
sysfs interface and can only use PMD-mapped THP, which is not reasonable.
Many applications implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios, so users expect to
apply a unified mTHP strategy for anonymous pages, including anonymous
shared pages, in order to enjoy the benefits of mTHP: for example, lower
latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and
contiguous PTEs on the ARM architecture to reduce TLB misses.

The primary strategy is similar to the one used to support anonymous mTHP.
Introduce a new interface
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which
can take all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default all sizes are set to "never" except the PMD
size, which is set to "inherit". This ensures backward compatibility with
the top-level anonymous shmem enablement, while also allowing independent
control of anonymous shmem enablement for each mTHP size.
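
As a usage sketch of the interface described above (again assuming a 4K
base page size, where the PMD size is 2048kB), opting anonymous shmem in
to 64K mTHP while keeping the PMD size following the top-level setting
(its default) would look like:

  $ echo inherit > /sys/kernel/mm/transparent_hugepage/hugepage-2048kB/shmem_enabled
  $ echo always > /sys/kernel/mm/transparent_hugepage/hugepage-64kB/shmem_enabled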

I used a page fault latency tool to measure the performance of faulting in
1G of anonymous shmem with 32 threads on my machine (ARM64 architecture,
32 cores, 125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, the patchset has a minimal impact when mTHP is not
enabled (some fluctuations were observed during testing). When 64K mTHP is
enabled, there is a significant improvement in page fault latency.

Changes from v2:
 - Rebased to mm/mm-unstable.
 - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.

Changes from v1:
 - Drop the patch that re-arranges the position of highest_order() and
   next_order(), per Ryan.
 - Modify the finish_fault() to fix VA alignment issue, per Ryan and
   David.
 - Fix some building issues, reported by Lance and kernel test robot.
 - Update some commit messages.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' naming consistent.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add reviewed tag from Barry.
 - Update the commit message.

Baolin Wang (6):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  29 ++
 include/linux/huge_mm.h                    |  23 ++
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  58 +++-
 mm/shmem.c                                 | 340 ++++++++++++++++++---
 5 files changed, 406 insertions(+), 61 deletions(-)

Comments

David Hildenbrand May 31, 2024, 9:35 a.m. UTC | #1
On 30.05.24 04:04, Baolin Wang wrote:
> Anonymous pages have already been supported for multi-size (mTHP) allocation
> through commit 19eaf44954df, that can allow THP to be configured through the
> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shmem will ignore the anonymous mTHP rule configured
> through the sysfs interface, and can only use the PMD-mapped THP, that is not
> reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> to apply a unified mTHP strategy for anonymous pages, also including the
> anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> contiguous PTEs on ARM architecture to reduce TLB miss etc.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have all the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option. By default all sizes will be set to "never"
> except PMD size, which is set to "inherit". This ensures backward compatibility
> with the anonymous shmem enabled of the top level, meanwhile also allows
> independent control of anonymous shmem enabled for each mTHP.
> 
> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> 125G memory:
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
>  From the data above, it is observed that the patchset has a minimal impact when
> mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> mTHP, there is a significant improvement of the page fault latency.

Let me summarize the takeaway from the bi-weekly MM meeting as I 
understood it, which includes Hugh's feedback on per-block tracking vs. mTHP:

(1) Per-block tracking

Per-block tracking is currently considered unwarranted complexity in 
shmem.c. We should try to get it done without that. For any test cases 
that fail, we should consider if they are actually valid for shmem.

To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
is not possible at fallocate() time, detecting zeropages later and
retrying to split+free might be an option, without per-block tracking.

(2) mTHP controls

As a default, we should not be using large folios / mTHP for any shmem, 
just like we did with THP via shmem_enabled. This is what this series 
currently does, and is part of the whole mTHP user-space interface design.

Further, the mTHP controls should control all of shmem, not only 
"anonymous shmem".

Also, we should properly fallback within the configured sizes, and not 
jump "over" configured sizes. Unless there is a good reason.

(3) khugepaged

khugepaged needs to handle larger folios properly as well. Until fixed, 
using smaller THP sizes as fallback might prohibit collapsing a 
PMD-sized THP later. But really, khugepaged needs to be fixed to handle 
that.

(4) force/disable

These settings are rather testing artifacts from the old ages. We should 
not add them to the per-size toggles. We might "inherit" it from the 
global one, though.

"within_size" might have value, and especially for consistency, we 
should have them per size.



So, this series only tackles anonymous shmem, which is a good starting 
point. Ideally, we'd get support for other shmem (especially during 
fault time) soon afterwards, because we won't be adding separate toggles 
for that from the interface POV, and having inconsistent behavior 
between kernel versions would be a bit unfortunate.


@Baolin, this series likely does not consider (4) yet. And I suggest we 
have to take a lot of the "anonymous thp" terminology out of this 
series, especially when it comes to documentation.

@Daniel, Pankaj, what are your plans regarding that? It would be great 
if we could get an understanding on the next steps on !anon shmem.
Baolin Wang May 31, 2024, 10:13 a.m. UTC | #2
On 2024/5/31 17:35, David Hildenbrand wrote:
> On 30.05.24 04:04, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) 
>> allocation
>> through commit 19eaf44954df, that can allow THP to be configured 
>> through the
>> sysfs interface located at 
>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shmem will ignore the anonymous mTHP rule 
>> configured
>> through the sysfs interface, and can only use the PMD-mapped THP, that 
>> is not
>> reasonable. Many implement anonymous page sharing through 
>> mmap(MAP_SHARED |
>> MAP_ANONYMOUS), especially in database usage scenarios, therefore, 
>> users expect
>> to apply a unified mTHP strategy for anonymous pages, also including the
>> anonymous shared pages, in order to enjoy the benefits of mTHP. For 
>> example,
>> lower latency than PMD-mapped THP, smaller memory bloat than 
>> PMD-mapped THP,
>> contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have all the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option. By default all sizes will be set to "never"
>> except PMD size, which is set to "inherit". This ensures backward 
>> compatibility
>> with the anonymous shmem enabled of the top level, meanwhile also allows
>> independent control of anonymous shmem enabled for each mTHP.
>>
>> Use the page fault latency tool to measure the performance of 1G 
>> anonymous shmem
>> with 32 threads on my machine environment with: ARM64 Architecture, 32 
>> cores,
>> 125G memory:
>> base: mm-unstable
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.04s        3.10s         83516.416                  2669684.890
>>
>> mm-unstable + patchset, anon shmem mTHP disabled
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.02s        3.14s         82936.359                  2630746.027
>>
>> mm-unstable + patchset, anon shmem 64K mTHP enabled
>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>> 0.08s        0.31s         678630.231                 17082522.495
>>
>>  From the data above, it is observed that the patchset has a minimal 
>> impact when
>> mTHP is not enabled (some fluctuations observed during testing). When 
>> enabling 64K
>> mTHP, there is a significant improvement of the page fault latency.
> 
> Let me summarize the takeaway from the bi-weekly MM meeting as I 
> understood it, that includes Hugh's feedback on per-block tracking vs. 

Thanks David for the summarization.

> mTHP:
> 
> (1) Per-block tracking
> 
> Per-block tracking is currently considered unwarranted complexity in 
> shmem.c. We should try to get it done without that. For any test cases 
> that fail, we should consider if they are actually valid for shmem.
> 
> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> is not possible at fallocate() time, detecting zeropages later and
> retrying to split+free might be an option, without per-block tracking.
> 
> (2) mTHP controls
> 
> As a default, we should not be using large folios / mTHP for any shmem, 
> just like we did with THP via shmem_enabled. This is what this series 
> currently does, and is part of the whole mTHP user-space interface design.
> 
> Further, the mTHP controls should control all of shmem, not only 
> "anonymous shmem".

Yes, that's what I thought and in my TODO list.

> 
> Also, we should properly fallback within the configured sizes, and not 
> jump "over" configured sizes. Unless there is a good reason.
> 
> (3) khugepaged
> 
> khugepaged needs to handle larger folios properly as well. Until fixed, 
> using smaller THP sizes as fallback might prohibit collapsing a 
> PMD-sized THP later. But really, khugepaged needs to be fixed to handle 
> that.
> 
> (4) force/disable
> 
> These settings are rather testing artifacts from the old ages. We should 
> not add them to the per-size toggles. We might "inherit" it from the 
> global one, though.

Sorry, I missed this. So I should remove the 'force' and 'deny' options 
for each mTHP size, right?

> 
> "within_size" might have value, and especially for consistency, we 
> should have them per size.
> 
> 
> 
> So, this series only tackles anonymous shmem, which is a good starting 
> point. Ideally, we'd get support for other shmem (especially during 
> fault time) soon afterwards, because we won't be adding separate toggles 
> for that from the interface POV, and having inconsistent behavior 
> between kernel versions would be a bit unfortunate.
> 
> 
> @Baolin, this series likely does not consider (4) yet. And I suggest we 
> have to take a lot of the "anonymous thp" terminology out of this 
> series, especially when it comes to documentation.

Sure. I will remove the "anonymous thp" terminology from the 
documentation, but I still want to keep it in the commit messages, because 
I want to start from anonymous shmem.

> 
> @Daniel, Pankaj, what are your plans regarding that? It would be great 
> if we could get an understanding on the next steps on !anon shmem.
>
David Hildenbrand May 31, 2024, 11:13 a.m. UTC | #3
>>
>> As a default, we should not be using large folios / mTHP for any shmem,
>> just like we did with THP via shmem_enabled. This is what this series
>> currently does, and is part of the whole mTHP user-space interface design.
>>
>> Further, the mTHP controls should control all of shmem, not only
>> "anonymous shmem".
> 
> Yes, that's what I thought and in my TODO list.

Good, it would be helpful to coordinate with Daniel and Pankaj.

> 
>>
>> Also, we should properly fallback within the configured sizes, and not
>> jump "over" configured sizes. Unless there is a good reason.
>>
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a
>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>> that.
>> 
>> (4) force/disable
>>
>> These settings are rather testing artifacts from the old ages. We should
>> not add them to the per-size toggles. We might "inherit" it from the
>> global one, though.
> 
> Sorry, I missed this. So I should remove the 'force' and 'deny' option
> for each mTHP, right?

Yes, that's my understanding. But we have to keep them on the top level 
for any possible user out there.

> 
>>
>> "within_size" might have value, and especially for consistency, we
>> should have them per size.
>>
>>
>>
>> So, this series only tackles anonymous shmem, which is a good starting
>> point. Ideally, we'd get support for other shmem (especially during
>> fault time) soon afterwards, because we won't be adding separate toggles
>> for that from the interface POV, and having inconsistent behavior
>> between kernel versions would be a bit unfortunate.
>>
>>
>> @Baolin, this series likely does not consider (4) yet. And I suggest we
>> have to take a lot of the "anonymous thp" terminology out of this
>> series, especially when it comes to documentation.
> 
> Sure. I will remove the "anonymous thp" terminology from the
> documentation, but want to still keep it in the commit message, cause I
> want to start from the anonymous shmem.

For commit messages and friends, that makes sense. The story should be: 
"controls all of shmem/tmpfs, but support will be added iteratively. The 
first step is anonymous shmem."
Daniel Gomez May 31, 2024, 1:19 p.m. UTC | #4
On Fri, May 31, 2024 at 11:35:30AM +0200, David Hildenbrand wrote:
> On 30.05.24 04:04, Baolin Wang wrote:
> > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > through commit 19eaf44954df, that can allow THP to be configured through the
> > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > 
> > However, the anonymous shmem will ignore the anonymous mTHP rule configured
> > through the sysfs interface, and can only use the PMD-mapped THP, that is not
> > reasonable. Many implement anonymous page sharing through mmap(MAP_SHARED |
> > MAP_ANONYMOUS), especially in database usage scenarios, therefore, users expect
> > to apply a unified mTHP strategy for anonymous pages, also including the
> > anonymous shared pages, in order to enjoy the benefits of mTHP. For example,
> > lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
> > contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > 
> > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > which can have all the same values as the top-level
> > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > additional "inherit" option. By default all sizes will be set to "never"
> > except PMD size, which is set to "inherit". This ensures backward compatibility
> > with the anonymous shmem enabled of the top level, meanwhile also allows
> > independent control of anonymous shmem enabled for each mTHP.
> > 
> > Use the page fault latency tool to measure the performance of 1G anonymous shmem
> > with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
> > 125G memory:
> > base: mm-unstable
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.04s        3.10s         83516.416                  2669684.890
> > 
> > mm-unstable + patchset, anon shmem mTHP disabled
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.02s        3.14s         82936.359                  2630746.027
> > 
> > mm-unstable + patchset, anon shmem 64K mTHP enabled
> > user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> > 0.08s        0.31s         678630.231                 17082522.495
> > 
> >  From the data above, it is observed that the patchset has a minimal impact when
> > mTHP is not enabled (some fluctuations observed during testing). When enabling 64K
> > mTHP, there is a significant improvement of the page fault latency.
> 
> Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> it, that includes Hugh's feedback on per-block tracking vs. mTHP:

Thanks David for the summary. Please find below some follow-up questions.

I want to understand whether zeropage scanning overhead is preferred over
per-block tracking complexity, or if we still need to verify this.

> 
> (1) Per-block tracking
> 
> Per-block tracking is currently considered unwarranted complexity in
> shmem.c. We should try to get it done without that. For any test cases that
> fail, we should consider if they are actually valid for shmem.

I agree it is unwarranted complexity, but only if this is just to fix
lseek(), as we can simply make the test pass by checking whether holes are
reported as data. That would be the minimum required for lseek() to be
compliant with the syscall.

How can we use per-block tracking for reclaiming memory and what changes would
be needed? Or is per-block really a non-viable option?

Clearly, if per-block is a viable option, the shmem_fault() bug would need
to be fixed first. Any ideas on how to make it reproducible?

The alternatives discussed were sub-page refcounting and zeropage scanning.
The first one is not possible (IIUC) because of a refactor years back that
simplified the code, and it would also require extra complexity. The second
option would add overhead, as it involves scanning.

> 
> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > is not possible at fallocate() time, detecting zeropages later and
> retrying to split+free might be an option, without per-block tracking.

> 
> (2) mTHP controls
> 
> As a default, we should not be using large folios / mTHP for any shmem, just
> like we did with THP via shmem_enabled. This is what this series currently
> > does, and is part of the whole mTHP user-space interface design.

That was clear to me too. But what is the reason we want to boot in 'safe
mode'? What are the implications of not respecting that?

> 
> Further, the mTHP controls should control all of shmem, not only "anonymous
> shmem".

As I understood from the call, mTHP with sysfs knobs is preferred over
optimistic falloc/write allocation? But it is still unclear to me why the
former is preferred.

Are large folios a non-viable option?

> 
> Also, we should properly fallback within the configured sizes, and not jump
> "over" configured sizes. Unless there is a good reason.
> 
> (3) khugepaged
> 
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> THP later. But really, khugepaged needs to be fixed to handle that.
> 
> (4) force/disable
> 
> These settings are rather testing artifacts from the old ages. We should not
> add them to the per-size toggles. We might "inherit" it from the global one,
> though.
> 
> "within_size" might have value, and especially for consistency, we should
> have them per size.
> 
> 
> 
> So, this series only tackles anonymous shmem, which is a good starting
> point. Ideally, we'd get support for other shmem (especially during fault
> time) soon afterwards, because we won't be adding separate toggles for that
> from the interface POV, and having inconsistent behavior between kernel
> versions would be a bit unfortunate.
> 
> 
> @Baolin, this series likely does not consider (4) yet. And I suggest we have
> to take a lot of the "anonymous thp" terminology out of this series,
> especially when it comes to documentation.
> 
> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> could get an understanding on the next steps on !anon shmem.

I realize I've raised so many questions, but it's essential for us to grasp the
mm concerns and usage scenarios. This understanding will provide clarity on the
direction regarding folios for !anon shmem.

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

Daniel
David Hildenbrand May 31, 2024, 2:43 p.m. UTC | #5
Hi Daniel,

>> Let me summarize the takeaway from the bi-weekly MM meeting as I understood
>> it, that includes Hugh's feedback on per-block tracking vs. mTHP:
> 
> Thanks David for the summary. Please, find below some follow up questions.
> 
> I want to understand if zeropage scanning overhead is preferred over per-block
> tracking complexity or if we still need to verify this.
> 
>>
>> (1) Per-block tracking
>>
>> Per-block tracking is currently considered unwarranted complexity in
>> shmem.c. We should try to get it done without that. For any test cases that
>> fail, we should consider if they are actually valid for shmem.
> 
> I agree it was unwarranted complexity but only if this is just to fix lseek() as
> we can simply make the test pass by checking if holes are reported as data. That
> would be the minimum required for lseek() to be compliant with the syscall.
> 
> How can we use per-block tracking for reclaiming memory and what changes would
> be needed? Or is per-block really a non-viable option?

The interesting thing is: with mTHP toggles it is opt-in -- like for 
PMD-sized THP with shmem_enabled -- and we don't have to be that 
concerned about this problem right now.

> 
> Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
> fixed first. Any ideas on how to make it reproducible?
> 
> The alternatives discussed were sub-page refcounting and zeropage scanning.

Yeah, I don't think sub-page refcounting is a feasible (and certainly 
not desired ;) ) option in the folio world.

> The first one is not possible (IIUC) because of a refactor years back that
> simplified the code and also requires extra complexity. The second option would
> require additional overhead as we would involve scanning.

We'll likely need something similar (scanning, tracking?) for anonymous 
memory as well. There was a proposal for a THP shrinker some time ago that 
would solve part of the problem.

For example, for shmem you could simply flag folios that failed 
splitting during fallocate() as reclaim candidates and try to reclaim 
memory later. So you don't have to scan arbitrary folios (which might 
also be desired, though).

> 
>>
>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>> is not possible at fallocate() time, detecting zeropages later and
>> retrying to split+free might be an option, without per-block tracking.
> 
>>
>> (2) mTHP controls
>>
>> As a default, we should not be using large folios / mTHP for any shmem, just
>> like we did with THP via shmem_enabled. This is what this series currently
>> does, and is part of the whole mTHP user-space interface design.
> 
> That was clear for me too. But what is the reason we want to boot in 'safe
> mode'? What are the implications of not respecting that?

[...]

> 
> As I understood from the call, mTHP with sysfs knobs is preferred over
> optimistic falloc/write allocation? But it is still unclear to me why the former
> is preferred.

I think Hugh's point was that this should be an opt-in feature, just 
like PMD-sized THP started out, and still is, an opt-in feature.

Problematic interaction with khugepaged (that will be fixed) is one 
thing, interaction with memory reclaim (without any kind of memory 
reclaim mechanisms in place) might be another one. Controlling and 
tuning for specific folio sizes might be another one Hugh raised. 
[summarizing what I recall from the discussion, there might be more].

> 
> Are large folios a non-viable option?

I think you mean "allocating large folios without user space control".

Because mTHP gives user space full control, to the point where you can 
enable all sizes and obtain the same result.

> 
>>
>> Also, we should properly fallback within the configured sizes, and not jump
>> "over" configured sizes. Unless there is a good reason.
>>
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
>> THP later. But really, khugepaged needs to be fixed to handle that.
>>
>> (4) force/disable
>>
>> These settings are rather testing artifacts from the old ages. We should not
>> add them to the per-size toggles. We might "inherit" it from the global one,
>> though.
>>
>> "within_size" might have value, and especially for consistency, we should
>> have them per size.
>>
>>
>>
>> So, this series only tackles anonymous shmem, which is a good starting
>> point. Ideally, we'd get support for other shmem (especially during fault
>> time) soon afterwards, because we won't be adding separate toggles for that
>> from the interface POV, and having inconsistent behavior between kernel
>> versions would be a bit unfortunate.
>>
>>
>> @Baolin, this series likely does not consider (4) yet. And I suggest we have
>> to take a lot of the "anonymous thp" terminology out of this series,
>> especially when it comes to documentation.
>>
>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
>> could get an understanding on the next steps on !anon shmem.
> 
> I realize I've raised so many questions, but it's essential for us to grasp the
> mm concerns and usage scenarios. This understanding will provide clarity on the
> direction regarding folios for !anon shmem.

If I understood correctly, Hugh had strong feelings against not 
respecting mTHP toggles for shmem. Without per-block tracking, I agree 
with that.
wang wei June 1, 2024, 3:54 a.m. UTC | #6
At 2024-05-31 18:13:03, "Baolin Wang" <baolin.wang@linux.alibaba.com> wrote:
>
>
>On 2024/5/31 17:35, David Hildenbrand wrote:
>> On 30.05.24 04:04, Baolin Wang wrote:
>>> Anonymous pages have already been supported for multi-size (mTHP) 
>>> allocation
>>> through commit 19eaf44954df, that can allow THP to be configured 
>>> through the
>>> sysfs interface located at 
>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>
>>> However, the anonymous shmem will ignore the anonymous mTHP rule 
>>> configured
>>> through the sysfs interface, and can only use the PMD-mapped THP, that 
>>> is not
>>> reasonable. Many implement anonymous page sharing through 
>>> mmap(MAP_SHARED |
>>> MAP_ANONYMOUS), especially in database usage scenarios, therefore, 
>>> users expect
>>> to apply a unified mTHP strategy for anonymous pages, also including the
>>> anonymous shared pages, in order to enjoy the benefits of mTHP. For 
>>> example,
>>> lower latency than PMD-mapped THP, smaller memory bloat than 
>>> PMD-mapped THP,
>>> contiguous PTEs on ARM architecture to reduce TLB miss etc.
>>>
>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>> which can have all the same values as the top-level
>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>> additional "inherit" option. By default all sizes will be set to "never"
>>> except PMD size, which is set to "inherit". This ensures backward 
>>> compatibility
>>> with the anonymous shmem enabled of the top level, meanwhile also allows
>>> independent control of anonymous shmem enabled for each mTHP.
>>>
>>> Use the page fault latency tool to measure the performance of 1G 
>>> anonymous shmem
>>> with 32 threads on my machine environment with: ARM64 Architecture, 32 
>>> cores,
>>> 125G memory:
>>> base: mm-unstable
>>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>>> 0.04s        3.10s         83516.416                  2669684.890
>>>
>>> mm-unstable + patchset, anon shmem mTHP disabled
>>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>>> 0.02s        3.14s         82936.359                  2630746.027
>>>
>>> mm-unstable + patchset, anon shmem 64K mTHP enabled
>>> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
>>> 0.08s        0.31s         678630.231                 17082522.495
>>>
>>>  From the data above, it is observed that the patchset has a minimal 
>>> impact when
>>> mTHP is not enabled (some fluctuations observed during testing). When 
>>> enabling 64K
>>> mTHP, there is a significant improvement of the page fault latency.
>> 
>> Let me summarize the takeaway from the bi-weekly MM meeting as I 
>> understood it, that includes Hugh's feedback on per-block tracking vs. 
>
>Thanks David for the summarization.
>
>> mTHP:
>> 
>> (1) Per-block tracking
>> 
>> Per-block tracking is currently considered unwarranted complexity in 
>> shmem.c. We should try to get it done without that. For any test cases 
>> that fail, we should consider if they are actually valid for shmem.
>> 
>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>> is not possible at fallocate() time, detecting zeropages later and
>> retrying to split+free might be an option, without per-block tracking.
>> 
>> (2) mTHP controls
>> 
>> As a default, we should not be using large folios / mTHP for any shmem, 
>> just like we did with THP via shmem_enabled. This is what this series 
>> currently does, and is part of the whole mTHP user-space interface design.
>> 
>> Further, the mTHP controls should control all of shmem, not only 
>> "anonymous shmem".
>
>Yes, that's what I thought and in my TODO list.
>
>> 
>> Also, we should properly fallback within the configured sizes, and not 
>> jump "over" configured sizes. Unless there is a good reason.
>> 
>> (3) khugepaged
>> 
>> khugepaged needs to handle larger folios properly as well. Until fixed, 
>> using smaller THP sizes as fallback might prohibit collapsing a 
>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle 
>> that.
>> 
>> (4) force/disable
>> 
>> These settings are rather testing artifacts from the old ages. We should 
>> not add them to the per-size toggles. We might "inherit" it from the 
>> global one, though.
>
>Sorry, I missed this. So I should remove the 'force' and 'deny' option 
>for each mTHP, right?
>

I prefer this. Perhaps the functionality of "force/deny" will differ from 
that of "always/never" once tmpfs is supported, and users would need to 
re-learn the usage of "force" and "deny".
Baolin Wang June 2, 2024, 4:15 a.m. UTC | #7
On 2024/5/31 19:13, David Hildenbrand wrote:
>>>
>>> As a default, we should not be using large folios / mTHP for any shmem,
>>> just like we did with THP via shmem_enabled. This is what this series
>>> currently does, and is part of the whole mTHP user-space interface 
>>> design.
>>>
>>> Further, the mTHP controls should control all of shmem, not only
>>> "anonymous shmem".
>>
>> Yes, that's what I thought and in my TODO list.
> 
> Good, it would be helpful to coordinate with Daniel and Pankaj.
> 
>>
>>>
>>> Also, we should properly fallback within the configured sizes, and not
>>> jump "over" configured sizes. Unless there is a good reason.
>>>
>>> (3) khugepaged
>>>
>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>> using smaller THP sizes as fallback might prohibit collapsing a
>>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>>> that.
>>> 
>>> (4) force/disable
>>>
>>> These settings are rather testing artifacts from the old ages. We should
>>> not add them to the per-size toggles. We might "inherit" it from the
>>> global one, though.
>>
>> Sorry, I missed this. So I should remove the 'force' and 'deny' option
>> for each mTHP, right?
> 
> Yes, that's my understanding. But we have to keep them on the top level 
> for any possible user out there.

OK.

>>>
>>> "within_size" might have value, and especially for consistency, we
>>> should have them per size.
>>>
>>>
>>>
>>> So, this series only tackles anonymous shmem, which is a good starting
>>> point. Ideally, we'd get support for other shmem (especially during
>>> fault time) soon afterwards, because we won't be adding separate toggles
>>> for that from the interface POV, and having inconsistent behavior
>>> between kernel versions would be a bit unfortunate.
>>>
>>>
>>> @Baolin, this series likely does not consider (4) yet. And I suggest we
>>> have to take a lot of the "anonymous thp" terminology out of this
>>> series, especially when it comes to documentation.
>>
>> Sure. I will remove the "anonymous thp" terminology from the
>> documentation, but want to still keep it in the commit message, cause I
>> want to start from the anonymous shmem.
> 
> For commit message and friends makes sense. The story should be 
> "controls all of shmem/tmpfs, but support will be added iteratively. The 
> first step is anonymous shmem."

Sure. Thanks.
Daniel Gomez June 4, 2024, 8:18 a.m. UTC | #8
On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > 
> > > As a default, we should not be using large folios / mTHP for any shmem,
> > > just like we did with THP via shmem_enabled. This is what this series
> > > currently does, and is part of the whole mTHP user-space interface design.
> > > 
> > > Further, the mTHP controls should control all of shmem, not only
> > > "anonymous shmem".
> > 
> > Yes, that's what I thought and in my TODO list.
> 
> Good, it would be helpful to coordinate with Daniel and Pankaj.

I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
v3 patches. You may find a version in my integration branch here [2]. I can
attach them here if it's preferred.

[1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
[2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp

The point here is to combine the large folios strategy I proposed with mTHP
user controls. Would it make sense to limit the orders to the mapping order
calculated based on the size and index?

@@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,

                order = highest_order(suitable_orders);
                while (suitable_orders) {
+                       if (order > mapping_order) {
+                               order = next_order(&suitable_orders, order);
+                               continue;
+                       }
                        pages = 1UL << order;
                        index = round_down(index, pages);
                        folio = shmem_alloc_folio(gfp, order, info, index);

Note: the branch still needs to be adapted to include !anon mm.
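
For illustration, here is a rough sketch of how such a 'mapping_order'
cap could be computed from the write length and index; the helper name,
the 'len' parameter, and the exact policy are my assumptions, not taken
from the patches:

/*
 * Hypothetical helper: cap the folio order by the range being written,
 * so that 'always' does not over-allocate far beyond the write.
 */
static unsigned int shmem_mapping_order(pgoff_t index, size_t len)
{
        /* Largest order whose folio still fits within the written length... */
        unsigned int order = len > PAGE_SIZE ? get_order(len) : 0;

        /* ...further limited by the natural alignment of 'index'. */
        if (order && index)
                order = min_t(unsigned int, order, __ffs(index));

        return order;
}

The loop above would then skip any order larger than
shmem_mapping_order(index, len).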

> 
> > 
> > > 
> > > Also, we should properly fallback within the configured sizes, and not
> > > jump "over" configured sizes. Unless there is a good reason.
> > > 
> > > (3) khugepaged
> > > 
> > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > using smaller THP sizes as fallback might prohibit collapsing a
> > > PMD-sized THP later. But really, khugepaged needs to be fixed to handle
> > > that.
> > > 
> > > (4) force/disable
> > > 
> > > These settings are rather testing artifacts from the old ages. We should
> > > not add them to the per-size toggles. We might "inherit" it from the
> > > global one, though.
> > 
> > Sorry, I missed this. So I should remove the 'force' and 'deny' option
> > for each mTHP, right?
> 
> Yes, that's my understanding. But we have to keep them on the top level for
> any possible user out there.
> 
> > 
> > > 
> > > "within_size" might have value, and especially for consistency, we
> > > should have them per size.
> > > 
> > > 
> > > 
> > > So, this series only tackles anonymous shmem, which is a good starting
> > > point. Ideally, we'd get support for other shmem (especially during
> > > fault time) soon afterwards, because we won't be adding separate toggles
> > > for that from the interface POV, and having inconsistent behavior
> > > between kernel versions would be a bit unfortunate.
> > > 
> > > 
> > > @Baolin, this series likely does not consider (4) yet. And I suggest we
> > > have to take a lot of the "anonymous thp" terminology out of this
> > > series, especially when it comes to documentation.
> > 
> > Sure. I will remove the "anonymous thp" terminology from the
> > documentation, but want to still keep it in the commit message, cause I
> > want to start from the anonymous shmem.
> 
> For commit message and friends makes sense. The story should be "controls
> all of shmem/tmpfs, but support will be added iteratively. The first step is
> anonymous shmem."
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Daniel Gomez June 4, 2024, 9:29 a.m. UTC | #9
On Fri, May 31, 2024 at 04:43:32PM +0200, David Hildenbrand wrote:
> Hi Daniel,
> 
> > > Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> > > it, that includes Hugh's feedback on per-block tracking vs. mTHP:
> > 
> > Thanks David for the summary. Please, find below some follow up questions.
> > 
> > I want to understand if zeropage scanning overhead is preferred over per-block
> > tracking complexity or if we still need to verify this.
> > 
> > > 
> > > (1) Per-block tracking
> > > 
> > > Per-block tracking is currently considered unwarranted complexity in
> > > shmem.c. We should try to get it done without that. For any test cases that
> > > fail, we should consider if they are actually valid for shmem.
> > 
> > I agree it was unwarranted complexity but only if this is just to fix lseek() as
> > we can simply make the test pass by checking if holes are reported as data. That
> > would be the minimum required for lseek() to be compliant with the syscall.
> > 
> > How can we use per-block tracking for reclaiming memory and what changes would
> > be needed? Or is per-block really a non-viable option?
> 
> The interesting thing is: with mTHP toggles it is opt-in -- like for
> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
> about this problem right now.

Without respecting the size when allocating large folios, mTHP toggles
would over-allocate. My proposal, added earlier in this thread, is to
combine the two to avoid that case. Otherwise, shouldn't we try to find a
solution for the reclaiming path?

> 
> > 
> > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
> > fixed first. Any ideas on how to make it reproducible?
> > 
> > The alternatives discussed were sub-page refcounting and zeropage scanning.
> 
> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
> desired ;) ) option in the folio world.
> 
> > The first one is not possible (IIUC) because of a refactor years back that
> > simplified the code and also requires extra complexity. The second option would
> > require additional overhead as we would involve scanning.
> 
> We'll likely need something similar (scanning, tracking?) for anonymous
> memory as well. There was a proposal for a THP shrinker some time ago, that
> would solve part of the problem.

It's good to know we have the same problem in different places. I'm more
inclined to keep the information rather than add extra overhead, unless
the complexity is really overwhelming. Considering the concerns here, I'm
not sure how much we should try to merge with iomap, as per Ritesh's
suggestion.

Regarding the THP shrinker, could you please confirm whether it is the following thread?

https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/

> 
> For example, for shmem you could simply flag folios that failed splitting
> during fallocate() as reclaim candidates and try to reclaim memory later. So
> you don't have to scan arbitrary folios (which might also be desired,
> though).

Thanks for the suggestion. I'll look into this.

> 
> > 
> > > 
> > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > > is not possible at fallocate() time, detecting zeropages later and
> > > retrying to split+free might be an option, without per-block tracking.
> > 
> > > 
> > > (2) mTHP controls
> > > 
> > > As a default, we should not be using large folios / mTHP for any shmem, just
> > > like we did with THP via shmem_enabled. This is what this series currently
> > > does, and is part of the whole mTHP user-space interface design.
> > 
> > That was clear for me too. But what is the reason we want to boot in 'safe
> > mode'? What are the implications of not respecting that?
> 
> [...]
> 
> > 
> > As I understood from the call, mTHP with sysfs knobs is preferred over
> > optimistic falloc/write allocation? But it is still unclear to me why the former
> > is preferred.
> 
> I think Hugh's point was that this should be an opt-in feature, just like
> PMD-sized THP started out, and still is, an opt-in feature.

I'd be keen to understand the use case for this, even for the current THP
controls we have in tmpfs. I guess these are just scenarios with no swap
involved? Are the use cases the same for both tmpfs and anonymous shmem?

> 
> Problematic interaction with khugepaged (that will be fixed) is one thing,
> interaction with memory reclaim (without any kind of memory reclaim
> mechanisms in place) might be another one. Controlling and tuning for
> specific folio sizes might be another one Hugh raised. [summarizing what I
> recall from the discussion, there might be more].
> 
> > 
> > Are large folios a non-viable option?
> 
> I think you mean "allocating large folios without user space control".

Yes.

> 
> Because mTHP gives user space full control, to the point where you can
> enable all sizes and obtain the same result.

Agree.

> 
> > 
> > > 
> > > Also, we should properly fallback within the configured sizes, and not jump
> > > "over" configured sizes. Unless there is a good reason.
> > > 
> > > (3) khugepaged
> > > 
> > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> > > THP later. But really, khugepaged needs to be fixed to handle that.
> > > 
> > > (4) force/disable
> > > 
> > > These settings are rather testing artifacts from the old ages. We should not
> > > add them to the per-size toggles. We might "inherit" it from the global one,
> > > though.
> > > 
> > > "within_size" might have value, and especially for consistency, we should
> > > have them per size.
> > > 
> > > 
> > > 
> > > So, this series only tackles anonymous shmem, which is a good starting
> > > point. Ideally, we'd get support for other shmem (especially during fault
> > > time) soon afterwards, because we won't be adding separate toggles for that
> > > from the interface POV, and having inconsistent behavior between kernel
> > > versions would be a bit unfortunate.
> > > 
> > > 
> > > @Baolin, this series likely does not consider (4) yet. And I suggest we have
> > > to take a lot of the "anonymous thp" terminology out of this series,
> > > especially when it comes to documentation.
> > > 
> > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> > > could get an understanding on the next steps on !anon shmem.
> > 
> > I realize I've raised so many questions, but it's essential for us to grasp the
> > mm concerns and usage scenarios. This understanding will provide clarity on the
> > direction regarding folios for !anon shmem.
> 
> If I understood correctly, Hugh had strong feelings against not respecting
> mTHP toggles for shmem. Without per-block tracking, I agree with that.

My understanding was the same. But I have this follow-up question: should
we respect mTHP toggles without considering mapping constraints (size and
index)? Or perhaps we should use within_size where we can fit this intermediate
approach, as long as mTHP granularity is respected?

Daniel

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Baolin Wang June 4, 2024, 9:45 a.m. UTC | #10
On 2024/6/4 16:18, Daniel Gomez wrote:
> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>
>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>> just like we did with THP via shmem_enabled. This is what this series
>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>
>>>> Further, the mTHP controls should control all of shmem, not only
>>>> "anonymous shmem".
>>>
>>> Yes, that's what I thought and in my TODO list.
>>
>> Good, it would be helpful to coordinate with Daniel and Pankaj.
> 
> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> v3 patches. You may find a version in my integration branch here [2]. I can
> attach them here if it's preferred.
> 
> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> 
> The point here is to combine the large folios strategy I proposed with mTHP
> user controls. Would it make sense to limit the orders to the mapping order
> calculated based on the size and index?

IMO, for !anon shmem, this change makes sense to me. We should respect 
the size, and mTHP should act as an order filter.

For anon shmem, we should ignore the length, which you always set to 
PAGE_SIZE in patch [1].

[1] 
https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569

> @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> 
>                  order = highest_order(suitable_orders);
>                  while (suitable_orders) {
> +                       if (order > mapping_order) {
> +                               order = next_order(&suitable_orders, order);
> +                               continue;
> +                       }
>                          pages = 1UL << order;
>                          index = round_down(index, pages);
>                          folio = shmem_alloc_folio(gfp, order, info, index);
> 
> Note: The branch still need to be adapted to include !anon mm.
David Hildenbrand June 4, 2024, 9:59 a.m. UTC | #11
[...]

>>> How can we use per-block tracking for reclaiming memory and what changes would
>>> be needed? Or is per-block really a non-viable option?
>>
>> The interesting thing is: with mTHP toggles it is opt-in -- like for
>> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
>> about this problem right now.
> 
> Without respecting the size when allocating large folios, mTHP toggles would
> over allocate. My proposal added earlier to this thread is to combine the 2
> to avoid that case. Otherwise, shouldn't we try to find a solution for the
> reclaiming path?

I think at some point we'll really have to do a better job at reclaiming 
(either memory over-allocation, PUNCH_HOLE that couldn't split, but maybe 
also pages that are now all-zero again and could be reclaimed again).

> 
>>
>>>
>>> Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
>>> fixed first. Any ideas on how to make it reproducible?
>>>
>>> The alternatives discussed were sub-page refcounting and zeropage scanning.
>>
>> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
>> desired ;) ) option in the folio world.
>>
>>> The first one is not possible (IIUC) because of a refactor years back that
>>> simplified the code and also requires extra complexity. The second option would
>>> require additional overhead as we would involve scanning.
>>
>> We'll likely need something similar (scanning, tracking?) for anonymous
>> memory as well. There was a proposal for a THP shrinker some time ago, that
>> would solve part of the problem.
> 
> It's good to know we have the same problem in different places. I'm more
> inclined to keep the information rather than adding an extra overhead. Unless
> the complexity is really overwhelming. Considering the concerns here, not sure
> how much should we try merging with iomap as per Ritesh's suggestion.

As raised in the meeting, I do see value in maintaining the information; 
but I also see why Hugh and Kirill think this might be unwarranted 
complexity in shmem.c. Meaning: we might be able to achieve something 
similar without it, and we don't have to solve the problem right now to 
benefit from large folios.

> 
> The THP shrinker, could you please confirm if it is this following thread?
> 
> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/

Yes, although there has been no follow-up so far. Possibly because, in the 
current khugepaged approach, there will be a constant back-and-forth 
between khugepaged collapsing memory (and wasting memory in the default 
setting) and the THP shrinker reclaiming memory; that doesn't sound quite 
effective :) It needs more thought IMHO.

[...]

>>>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
>>>> is not possible at fallocate() time, detecting zeropages later and
>>>> retrying to split+free might be an option, without per-block tracking.
>>>
>>>>
>>>> (2) mTHP controls
>>>>
>>>> As a default, we should not be using large folios / mTHP for any shmem, just
>>>> like we did with THP via shmem_enabled. This is what this series currently
>>>> does, and is part of the whole mTHP user-space interface design.
>>>
>>> That was clear for me too. But what is the reason we want to boot in 'safe
>>> mode'? What are the implications of not respecting that?
>>
>> [...]
>>
>>>
>>> As I understood from the call, mTHP with sysfs knobs is preferred over
>>> optimistic falloc/write allocation? But it is still unclear to me why the former
>>> is preferred.
>>
>> I think Hugh's point was that this should be an opt-in feature, just like
>> PMD-sized THP started out, and still is, an opt-in feature.
> 
> I'd be keen to understand the use case for this. Even the current THP controls
> we have in tmpfs. I guess these are just scenarios with no swap involved?
> Are these use cases the same for both tmpfs and shmem anon mm?

Systems without swap are one case, I think. The other reason for a toggle 
in the past was to be able to disable it to troubleshoot issues (Hugh 
mentioned in the meeting unlocking new code paths in shmem.c only with a 
toggle).

Staring at my Fedora system:

$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force

Maybe because it uses tmpfs to mount /tmp (there is an interesting article 
on lwn.net about that) and people are concerned about the side effects 
(which can currently waste memory, or result in more reclaim work being 
required when exceeding file sizes).

For VMs, I know that shmem+THP with memory ballooning is problematic and 
not really recommended.

[...]

>>>
>>>>
>>>> Also, we should properly fallback within the configured sizes, and not jump
>>>> "over" configured sizes. Unless there is a good reason.
>>>>
>>>> (3) khugepaged
>>>>
>>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
>>>> THP later. But really, khugepaged needs to be fixed to handle that.
>>>>
>>>> (4) force/disable
>>>>
>>>> These settings are rather testing artifacts from the old ages. We should not
>>>> add them to the per-size toggles. We might "inherit" it from the global one,
>>>> though.
>>>>
>>>> "within_size" might have value, and especially for consistency, we should
>>>> have them per size.
>>>>
>>>>
>>>>
>>>> So, this series only tackles anonymous shmem, which is a good starting
>>>> point. Ideally, we'd get support for other shmem (especially during fault
>>>> time) soon afterwards, because we won't be adding separate toggles for that
>>>> from the interface POV, and having inconsistent behavior between kernel
>>>> versions would be a bit unfortunate.
>>>>
>>>>
>>>> @Baolin, this series likely does not consider (4) yet. And I suggest we have
>>>> to take a lot of the "anonymous thp" terminology out of this series,
>>>> especially when it comes to documentation.
>>>>
>>>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we
>>>> could get an understanding on the next steps on !anon shmem.
>>>
>>> I realize I've raised so many questions, but it's essential for us to grasp the
>>> mm concerns and usage scenarios. This understanding will provide clarity on the
>>> direction regarding folios for !anon shmem.
>>
>> If I understood correctly, Hugh had strong feelings against not respecting
>> mTHP toggles for shmem. Without per-block tracking, I agree with that.
> 
> My understanding was the same. But I have this follow-up question: should
> we respect mTHP toggles without considering mapping constraints (size and
> index)? Or perhaps we should use within_size where we can fit this intermediate
> approach, as long as mTHP granularity is respected?

Likely we should do exactly the same as we would do with the existing 
PMD-sized THPs.

I recall in the meeting that we discussed that always vs. within_size 
seems to have some value, and that we should respect that setting like 
we did for PMD-sized THP.

Which other mapping constraints could we have?
Daniel Gomez June 4, 2024, 12:05 p.m. UTC | #12
On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
> 
> 
> On 2024/6/4 16:18, Daniel Gomez wrote:
> > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > > > 
> > > > > As a default, we should not be using large folios / mTHP for any shmem,
> > > > > just like we did with THP via shmem_enabled. This is what this series
> > > > > currently does, and is part of the whole mTHP user-space interface design.
> > > > > 
> > > > > Further, the mTHP controls should control all of shmem, not only
> > > > > "anonymous shmem".
> > > > 
> > > > Yes, that's what I thought and in my TODO list.
> > > 
> > > Good, it would be helpful to coordinate with Daniel and Pankaj.
> > 
> > I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> > v3 patches. You may find a version in my integration branch here [2]. I can
> > attach them here if it's preferred.
> > 
> > [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> > 
> > The point here is to combine the large folios strategy I proposed with mTHP
> > user controls. Would it make sense to limit the orders to the mapping order
> > calculated based on the size and index?
> 
> IMO, for !anon shmem, this change makes sense to me. We should respect the
> size and mTHP should act as an order filter.

What about respecting the size when the within_size flag is enabled? Then, 'always'
would allocate mTHP enabled folios, regardless of the size. And 'never'
would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
mentioned in the discussion.
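
A minimal userspace sketch of that proposal (my model, not code from this
series; the SHMEM_HUGE_* values mirror the mm/shmem.c definitions):

/*
 * Sketch of the proposed per-size semantics:
 *  - 'never' disables large folios entirely,
 *  - 'within_size' keeps only the orders that fit the mapping at the
 *    given index (mapping_order),
 *  - 'always' keeps every enabled order, even beyond i_size.
 */
#define SHMEM_HUGE_NEVER	0
#define SHMEM_HUGE_ALWAYS	1
#define SHMEM_HUGE_WITHIN_SIZE	2

static unsigned long filter_orders(unsigned long enabled_orders, int policy,
				   unsigned int mapping_order)
{
	switch (policy) {
	case SHMEM_HUGE_NEVER:
		return 0;			/* base pages only */
	case SHMEM_HUGE_WITHIN_SIZE:
		/* keep orders 0..mapping_order, drop the rest */
		return enabled_orders & ((1UL << (mapping_order + 1)) - 1);
	default:				/* 'always' */
		return enabled_orders;
	}
}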

> 
> For anon shmem, we should ignore the length, which you always set to
> PAGE_SIZE in patch [1].
> 
> [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569

Since we are ignoring the length, we should ignore any value being passed.

> 
> > @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> > 
> >                  order = highest_order(suitable_orders);
> >                  while (suitable_orders) {
> > +                       if (order > mapping_order) {
> > +                               order = next_order(&suitable_orders, order);
> > +                               continue;
> > +                       }
> >                          pages = 1UL << order;
> >                          index = round_down(index, pages);
> >                          folio = shmem_alloc_folio(gfp, order, info, index);
> > 
> > Note: The branch still needs to be adapted to include !anon mm.
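
To make the intended fallback concrete, here is a small self-contained
userspace model (my sketch, not the kernel code; highest_order() and
next_order() are modeled with plain bit operations):

#include <stdio.h>

/*
 * Walk the enabled orders from highest to lowest, skip any order larger
 * than mapping_order, and return the first one that fits; 0 means fall
 * back to base pages.
 */
static int pick_order(unsigned long enabled_orders, int mapping_order)
{
	while (enabled_orders) {
		/* like highest_order(): index of the highest set bit */
		int order = 63 - __builtin_clzl(enabled_orders);

		/* like next_order(): clear this order from the mask */
		enabled_orders &= ~(1UL << order);
		if (order > mapping_order)
			continue;	/* jump over too-large orders */
		return order;
	}
	return 0;
}

int main(void)
{
	/* orders 9 and 4 enabled, but the mapping only allows order 4 */
	printf("%d\n", pick_order((1UL << 9) | (1UL << 4), 4)); /* -> 4 */
	return 0;
}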
Daniel Gomez June 4, 2024, 12:30 p.m. UTC | #13
On Tue, Jun 04, 2024 at 11:59:51AM +0200, David Hildenbrand wrote:
> [...]
> 
> > > > How can we use per-block tracking for reclaiming memory and what changes would
> > > > be needed? Or is per-block really a non-viable option?
> > > 
> > > The interesting thing is: with mTHP toggles it is opt-in -- like for
> > > PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
> > > about this problem right now.
> > 
> > Without respecting the size when allocating large folios, mTHP toggles would
> > over-allocate. My proposal added earlier in this thread is to combine the two
> > to avoid that case. Otherwise, shouldn't we try to find a solution for the
> > reclaiming path?
> 
> I think at some point we'll really have to do a better job at reclaiming
> (either memory-overallocation, PUNCH_HOLE that couldn't split, but maybe also
> pages that are now all-zero again and could be reclaimed again).
> 
> > 
> > > 
> > > > 
> > > > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
> > > > fixed first. Any ideas on how to make it reproducible?
> > > > 
> > > > The alternatives discussed were sub-page refcounting and zeropage scanning.
> > > 
> > > Yeah, I don't think sub-page refcounting is a feasible (and certainly not
> > > desired ;) ) option in the folio world.
> > > 
> > > > The first one is not possible (IIUC) because of a refactor years back that
> > > > simplified the code, and it would also require extra complexity. The second option would
> > > > require additional overhead as we would involve scanning.
> > > 
> > > We'll likely need something similar (scanning, tracking?) for anonymous
> > > memory as well. There was a proposal for a THP shrinker some time ago, that
> > > would solve part of the problem.
> > 
> > It's good to know we have the same problem in different places. I'm more
> > inclined to keep the information rather than adding an extra overhead. Unless
> > the complexity is really overwhelming. Considering the concerns here, not sure
> > how much we should try merging with iomap, as per Ritesh's suggestion.
> 
> As raised in the meeting, I do see value in maintaining the information; but
> I also see why Hugh and Kirill think this might be unwarranted complexity in
> shmem.c. Meaning: we might be able to achieve something similar without it,
> and we don't have to solve the problem right now to benefit from large
> folios.
> 
> > 
> > The THP shrinker, could you please confirm if it is this following thread?
> > 
> > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/
> 
> Yes, although there was no follow-up so far. Possibly, because in the
> current khugepaged approach, there will be a constant back-and-forth between
> khugepaged collapsing memory (and wasting memory in the default setting),
> and the THP shrinker reclaiming memory; doesn't sound quite effective :)
> That needs more thought IMHO.
> 
> [...]
> 
> > > > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > > > > is not possible at fallocate() time, detecting zeropages later and
> > > > > retrying to split+free might be an option, without per-block tracking.
> > > > 
> > > > > 
> > > > > (2) mTHP controls
> > > > > 
> > > > > As a default, we should not be using large folios / mTHP for any shmem, just
> > > > > like we did with THP via shmem_enabled. This is what this series currently
> > > > > does, and is part of the whole mTHP user-space interface design.
> > > > 
> > > > That was clear for me too. But what is the reason we want to boot in 'safe
> > > > mode'? What are the implications of not respecting that?
> > > 
> > > [...]
> > > 
> > > > 
> > > > As I understood from the call, mTHP with sysctl knobs is preferred over
> > optimistic falloc/write allocation? But it is still unclear to me why the former
> > > > is preferred.
> > > 
> > > I think Hugh's point was that this should be an opt-in feature, just like
> > > PMD-sized THP started out, and still is, an opt-in feature.
> > 
> > I'd be keen to understand the use case for this. Even the current THP controls
> > we have in tmpfs. I guess these are just scenarios with no swap involved?
> > Are these use cases the same for both tmpfs and shmem anon mm?
> 
> Systems without swap are one case I think. The other reason for a toggle in
> the past was to be able to disable it to troubleshoot issues (Hugh mentioned
> in the meeting about unlocking new code paths in shmem.c only with a
> toggle).
> 
> Staring at my Fedora system:
> 
> $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
> always within_size advise [never] deny force
> 
> Maybe because it uses tmpfs to mount /tmp (interesting article on lwn.net
> about that) and people are concerned about the side-effects (that can
> currently waste memory, or result in more reclaim work being required when
> exceeding file sizes).
> 
> For VMs, I know that shmem+THP with memory ballooning is problematic and not
> really recommended.

Agree. We cannot PUNCH_HOLE when using THP unless the punch is PMD-sized.
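
For illustration, the problematic case in userspace terms (my sketch; any
sub-PMD punch needs a folio split before memory can actually be freed):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Punching a 64K hole inside a range backed by a 2M THP cannot free
 * memory unless the kernel can split the folio first; only a 2M-aligned,
 * 2M-sized punch releases the whole huge page directly.
 */
static int punch_64k(int fd)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 0, 64 * 1024);
}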

> 
> [...]
> 
> > > > 
> > > > > 
> > > > > Also, we should properly fallback within the configured sizes, and not jump
> > > > > "over" configured sizes. Unless there is a good reason.
> > > > > 
> > > > > (3) khugepaged
> > > > > 
> > > > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > > > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> > > > > THP later. But really, khugepaged needs to be fixed to handle that.
> > > > > 
> > > > > (4) force/disable
> > > > > 
> > > > > These settings are rather testing artifacts from the old ages. We should not
> > > > > add them to the per-size toggles. We might "inherit" it from the global one,
> > > > > though.
> > > > > 
> > > > > "within_size" might have value, and especially for consistency, we should
> > > > > have them per size.
> > > > > 
> > > > > 
> > > > > 
> > > > > So, this series only tackles anonymous shmem, which is a good starting
> > > > > point. Ideally, we'd get support for other shmem (especially during fault
> > > > > time) soon afterwards, because we won't be adding separate toggles for that
> > > > > from the interface POV, and having inconsistent behavior between kernel
> > > > > versions would be a bit unfortunate.
> > > > > 
> > > > > 
> > > > > @Baolin, this series likely does not consider (4) yet. And I suggest we have
> > > > > to take a lot of the "anonymous thp" terminology out of this series,
> > > > > especially when it comes to documentation.
> > > > > 
> > > > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> > > > > could get an understanding on the next steps on !anon shmem.
> > > > 
> > > > I realize I've raised so many questions, but it's essential for us to grasp the
> > > > mm concerns and usage scenarios. This understanding will provide clarity on the
> > > > direction regarding folios for !anon shmem.
> > > 
> > > If I understood correctly, Hugh had strong feelings against not respecting
> > > mTHP toggles for shmem. Without per-block tracking, I agree with that.
> > 
> > My understanding was the same. But I have this follow-up question: should
> > we respect mTHP toggles without considering mapping constraints (size and
> > index)? Or perhaps we should use within_size where we can fit this intermediate
> > approach, as long as mTHP granularity is respected?
> 
> Likely we should do exactly the same as we would do with the existing
> PMD-sized THPs.
> 
> I recall in the meeting that we discussed that always vs. within_size seems
> to have some value, and that we should respect that setting like we did for
> PMD-sized THP.
> 
> Which other mapping constraints could we have?

Patch 12 in my RFC for LSF [1] adds this shmem_mapping_size_order() (updated
to support get_order()) to get the max order 'based on the file size which the
mapping currently allows at the given index'. Same check is done here [2] in
filemap.c.

[1] https://lore.kernel.org/all/20240515055719.32577-13-da.gomez@samsung.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v6.10-rc2#n1940

@@ -1726,6 +1726,37 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
        return folio;
 }

+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns a high order for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Like __filemap_get_folio order calculation.
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
+                        size_t size)
+{
+       unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
+
+       if (!mapping_large_folio_support(mapping))
+               return 0;
+
+       /* If we're not aligned, allocate a smaller folio */
+       if (index & ((1UL << order) - 1))
+               order = __ffs(index);
+
+       return min_t(size_t, order, MAX_PAGECACHE_ORDER);
+}
+
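
For illustration, the values this returns on a 4K-page system (my numbers,
assuming mapping_large_folio_support() is true and MAX_PAGECACHE_ORDER >= 9):

	shmem_mapping_size_order(mapping, 0, SZ_2M); /* get_order(2M) == 9, index aligned -> 9 */
	shmem_mapping_size_order(mapping, 1, SZ_2M); /* index 1 unaligned: __ffs(1) == 0 -> 0 */
	shmem_mapping_size_order(mapping, 8, SZ_2M); /* index 8 unaligned: __ffs(8) == 3 -> 3 */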

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Baolin Wang June 6, 2024, 3:31 a.m. UTC | #14
On 2024/6/4 20:05, Daniel Gomez wrote:
> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>
>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>> "anonymous shmem".
>>>>>
>>>>> Yes, that's what I thought and in my TODO list.
>>>>
>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>
>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>> attach them here if it's preferred.
>>>
>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>
>>> The point here is to combine the large folios strategy I proposed with mTHP
>>> user controls. Would it make sense to limit the orders to the mapping order
>>> calculated based on the size and index?
>>
>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>> size and mTHP should act as an order filter.
> 
> What about respecting the size when the within_size flag is enabled? Then, 'always'
> would allocate mTHP enabled folios, regardless of the size. And 'never'
> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
> mentioned in the discussion.

Looks reasonable to me. What do you think, David?

And what about the 'advise' option? Similar to 'within_size'?

>> For anon shmem, we should ignore the length, which you always set to
>> PAGE_SIZE in patch [1].
>>
>> [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569
> 
> Since we are ignoring the length, we should ignore any value being passed.
> 
>>
>>> @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>>>
>>>                   order = highest_order(suitable_orders);
>>>                   while (suitable_orders) {
>>> +                       if (order > mapping_order) {
>>> +                               order = next_order(&suitable_orders, order);
>>> +                               continue;
>>> +                       }
>>>                           pages = 1UL << order;
>>>                           index = round_down(index, pages);
>>>                           folio = shmem_alloc_folio(gfp, order, info, index);
>>>
>>> Note: The branch still needs to be adapted to include !anon mm.
David Hildenbrand June 6, 2024, 8:38 a.m. UTC | #15
On 06.06.24 05:31, Baolin Wang wrote:
> 
> 
> On 2024/6/4 20:05, Daniel Gomez wrote:
>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>
>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>>
>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>> "anonymous shmem".
>>>>>>
>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>
>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>
>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>>> attach them here if it's preferred.
>>>>
>>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>
>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>> calculated based on the size and index?
>>>
>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>> size and mTHP should act as an order filter.
>>
>> What about respecting the size when the within_size flag is enabled? Then, 'always'
>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
>> mentioned in the discussion.
> 
> Looks reasonable to me. What do you think, David?
> 

That mimics existing PMD-THP behavior, right?

With "within_size", we must not exceed the size, with "always", we may 
exceed the size.

> And what about the 'advise' option? Similar to 'within_size'?

Good question. What's the behavior of PMD-THP? I would assume it behaves 
like "within_size", because in the common case we mmap (+advise) only 
within the size of the file, not exceeding it.

(the always option, as I learned during the meeting, is primarily 
helpful when writing to tmpfs files in an append-fashion. With 
mmap()+madvise() that doesn't quite happen)
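
For illustration, the append pattern in question (my sketch; the file
descriptor setup is omitted): with "always", the kernel may back the tail
with a huge folio before the file has grown to fill it, which
mmap()+madvise() users typically don't trigger.

#define _GNU_SOURCE
#include <unistd.h>

static void append_2m(int fd)
{
	char buf[4096] = { 0 };

	/* grow the tmpfs file 4K at a time, 2M in total */
	for (int i = 0; i < 512; i++)
		write(fd, buf, sizeof(buf));
}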
Baolin Wang June 6, 2024, 9:31 a.m. UTC | #16
On 2024/6/6 16:38, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
>>
>>
>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> As a default, we should not be using large folios / mTHP for any 
>>>>>>>> shmem,
>>>>>>>> just like we did with THP via shmem_enabled. This is what this 
>>>>>>>> series
>>>>>>>> currently does, and is part of the whole mTHP user-space
>>>>>>>> interface design.
>>>>>>>>
>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>> "anonymous shmem".
>>>>>>>
>>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>>
>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>
>>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on 
>>>>> top of Baolin's
>>>>> v3 patches. You may find a version in my integration branch here 
>>>>> [2]. I can
>>>>> attach them here if it's preferred.
>>>>>
>>>>> [1] 
>>>>> https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>>> [2] 
>>>>> https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>
>>>>> The point here is to combine the large folios strategy I proposed 
>>>>> with mTHP
>>>>> user controls. Would it make sense to limit the orders to the 
>>>>> mapping order
>>>>> calculated based on the size and index?
>>>>
>>>> IMO, for !anon shmem, this change makes sense to me. We should 
>>>> respect the
>>>> size and mTHP should act as an order filter.
>>>
>>> What about respecting the size when the within_size flag is enabled? Then,
>>> 'always'
>>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>>> would ignore mTHP and size. So, 'never' can be used for this 'safe' 
>>> boot case
>>> mentioned in the discussion.
>>
>> Looks reasonable to me. What do you think, David?
>>
> 
> That mimics existing PMD-THP behavior, right?
> 
> With "within_size", we must not exceed the size, with "always", we may 
> exceed the size.

Yes, I think so.

>> And what about the 'advise' option? Similar to 'within_size'?
> 
> Good question. What's the behavior of PMD-THP? I would assume it behaves 
> like "within_size", because in the common case we mmap (+advise) only 
> within the size of the file, not exceeding it.

Yes, that is also my understanding.

> (the always option, as I learned during the meeting, is primarily 
> helpful when writing to tmpfs files in an append-fashion. With 
> mmap()+madvise() that doesn't quite happen)
>
Daniel Gomez June 7, 2024, 9:05 a.m. UTC | #17
On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
> > 
> > 
> > On 2024/6/4 20:05, Daniel Gomez wrote:
> > > On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2024/6/4 16:18, Daniel Gomez wrote:
> > > > > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > > > > > > 
> > > > > > > > As a default, we should not be using large folios / mTHP for any shmem,
> > > > > > > > just like we did with THP via shmem_enabled. This is what this series
> > > > > > > > currently does, and is part of the whole mTHP user-space interface design.
> > > > > > > > 
> > > > > > > > Further, the mTHP controls should control all of shmem, not only
> > > > > > > > "anonymous shmem".
> > > > > > > 
> > > > > > > Yes, that's what I thought and in my TODO list.
> > > > > > 
> > > > > > Good, it would be helpful to coordinate with Daniel and Pankaj.
> > > > > 
> > > > > I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
> > > > > v3 patches. You may find a version in my integration branch here [2]. I can
> > > > > attach them here if it's preferred.
> > > > > 
> > > > > [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> > > > > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> > > > > 
> > > > > The point here is to combine the large folios strategy I proposed with mTHP
> > > > > user controls. Would it make sense to limit the orders to the mapping order
> > > > > calculated based on the size and index?
> > > > 
> > > > IMO, for !anon shmem, this change makes sense to me. We should respect the
> > > > size and mTHP should act as an order filter.
> > > 
> > > What about respecting the size when the within_size flag is enabled? Then, 'always'
> > > would allocate mTHP enabled folios, regardless of the size. And 'never'
> > > would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
> > > mentioned in the discussion.
> > 
> > Looks reasonable to me. What do you think, David?
> > 
> 
> That mimics existing PMD-THP behavior, right?
> 
> With "within_size", we must not exceed the size, with "always", we may
> exceed the size.

But right now we only check the inode size. With large folio support in
write_iter() we can have access to the length as well. I think this would solve
(partially?) the cases where we don't have access to the file size, or where we
perform writes in bigger chunks.

E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file

For 'within_size', the example above would allocate 512 pages instead of one
huge page. After patches [1] [2] we can get the size of the write to allocate
whatever mTHP/THP makes more sense for the length being passed.

[1] https://lore.kernel.org/all/20240527163616.1135968-2-hch@lst.de/
[2] https://lore.kernel.org/all/20240515055719.32577-12-da.gomez@samsung.com/

Here's a quick hack for THP:

@@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
        case SHMEM_HUGE_WITHIN_SIZE:
                index = round_up(index + 1, HPAGE_PMD_NR);
                i_size = round_up(i_size_read(inode), PAGE_SIZE);
-               if (i_size >> PAGE_SHIFT >= index)
+               if ((i_size >> PAGE_SHIFT >= index) ||
+                   (len >> PAGE_SHIFT >= index))
                        return true;
                fallthrough;
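
To make the arithmetic concrete (my numbers, assuming 4K pages and
HPAGE_PMD_NR == 512): a 2M write at offset 0 arrives with len == 2M while
i_size is still 0:

	index = round_up(0 + 1, 512);	/* -> 512 */
	/* old check: i_size >> PAGE_SHIFT == 0,    0 <  512 -> fails  */
	/* new check: len    >> PAGE_SHIFT == 512, 512 >= 512 -> passes */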


> 
> > And what about 'advise' option? Silimar to 'within_size'?
> 
> Good question. What's the behavior of PMD-THP? I would assume it behaves
> like "within_size", because in the common case we mmap (+advise) only within
> the size of the file, not exceeding it.

It allocates a huge page on request when MADV_HUGEPAGE is set (regardless of the size).

> 
> (the always option, as I learned during the meeting, is primarily helpful
> when writing to tmpfs files in an append-fashion. With mmap()+madvise() that
> doesn't quite happen)

Let's benefit from larger writes as well if the chunk matches any of the mTHP/
THP sizes.

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand June 7, 2024, 10:39 a.m. UTC | #18
On 07.06.24 11:05, Daniel Gomez wrote:
> On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
>> On 06.06.24 05:31, Baolin Wang wrote:
>>>
>>>
>>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>>>>
>>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>>> "anonymous shmem".
>>>>>>>>
>>>>>>>> Yes, that's what I thought and in my TODO list.
>>>>>>>
>>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>>
>>>>>> I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
>>>>>> v3 patches. You may find a version in my integration branch here [2]. I can
>>>>>> attach them here if it's preferred.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>>
>>>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>>>> calculated based on the size and index?
>>>>>
>>>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>>>> size and mTHP should act as an order filter.
>>>>
>>>> What about respecting the size when the within_size flag is enabled? Then, 'always'
>>>> would allocate mTHP enabled folios, regardless of the size. And 'never'
>>>> would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
>>>> mentioned in the discussion.
>>>
>>> Looks reasonable to me. What do you think, David?
>>>
>>
>> That mimics existing PMD-THP behavior, right?
>>
>> With "within_size", we must not exceed the size, with "always", we may
>> exceed the size.
> 
> But right now we only check the inode size. With large folio support in
> write_iter() we can have access to the length as well. I think this would solve
> (partially?) the cases where we don't have access to the file size, or where we
> perform writes in bigger chunks.
> 
> E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file
> 
> For 'within_size', the example above would allocate 512 pages instead of one
> huge page. After patches [1] [2] we can get the size of the write to allocate
> whatever mTHP/THP makes more sense for the length being passed.
> 

Yes, although this sounds like an optimization/change we should be doing 
separately I guess.

> [1] https://lore.kernel.org/all/20240527163616.1135968-2-hch@lst.de/
> [2] https://lore.kernel.org/all/20240515055719.32577-12-da.gomez@samsung.com/
> 
> Here a quick hack for THP:
> 
> @@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
>          case SHMEM_HUGE_WITHIN_SIZE:
>                  index = round_up(index + 1, HPAGE_PMD_NR);
>                  i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -               if (i_size >> PAGE_SHIFT >= index)
> +               if ((i_size >> PAGE_SHIFT >= index) ||
> +                   (len >> PAGE_SHIFT >= index))
>                          return true;
>                  fallthrough;
> 
> 
>>
>>> And what about 'advise' option? Silimar to 'within_size'?
>>
>> Good question. What's the behavior of PMD-THP? I would assume it behaves
>> like "within_size", because in the common case we mmap (+advise) only within
>> the size of the file, not exceeding it.
> 
> It allocates a huge page on request when MADV_HUGEPAGE is set (regardless of the size).
> 

Interesting, so we should do the same. Thanks!