Message ID | cover.1717033868.git.baolin.wang@linux.alibaba.com (mailing list archive) |
---|---|
Series | add mTHP support for anonymous shmem |
On 30.05.24 04:04, Baolin Wang wrote: > Anonymous pages already support multi-size (mTHP) allocation through commit 19eaf44954df, which allows THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'. > > However, anonymous shmem ignores the anonymous mTHP rules configured through the sysfs interface and can only use PMD-mapped THP, which is not reasonable. Many applications implement anonymous page sharing through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios; therefore, users expect to apply a unified mTHP strategy for anonymous pages, including anonymous shared pages, in order to enjoy the benefits of mTHP: for example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on the ARM architecture to reduce TLB misses, etc. > > The primary strategy is similar to supporting anonymous mTHP. Introduce a new interface '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can have all the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new additional "inherit" option. By default all sizes will be set to "never" except PMD size, which is set to "inherit". This ensures backward compatibility with the top-level anonymous shmem enablement, while also allowing independent control of anonymous shmem enablement for each mTHP size. > > Use the page fault latency tool to measure the performance of 1G anonymous shmem with 32 threads on my machine (ARM64 architecture, 32 cores, 125G memory): > base: mm-unstable > user-time sys_time faults_per_sec_per_cpu faults_per_sec > 0.04s 3.10s 83516.416 2669684.890 > > mm-unstable + patchset, anon shmem mTHP disabled > user-time sys_time faults_per_sec_per_cpu faults_per_sec > 0.02s 3.14s 82936.359 2630746.027 > > mm-unstable + patchset, anon shmem 64K mTHP enabled > user-time sys_time faults_per_sec_per_cpu faults_per_sec > 0.08s 0.31s 678630.231 17082522.495 > > From the data above, it is observed that the patchset has minimal impact when mTHP is not enabled (some fluctuations were observed during testing). When 64K mTHP is enabled, there is a significant improvement in page fault latency. Let me summarize the takeaway from the bi-weekly MM meeting as I understood it, which includes Hugh's feedback on per-block tracking vs. mTHP: (1) Per-block tracking Per-block tracking is currently considered unwarranted complexity in shmem.c. We should try to get it done without that. For any test cases that fail, we should consider whether they are actually valid for shmem. To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing is not possible at fallocate() time, detecting zeropages later and retrying to split+free might be an option, without per-block tracking. (2) mTHP controls As a default, we should not be using large folios / mTHP for any shmem, just like we did with THP via shmem_enabled. This is what this series currently does, and is part of the whole mTHP user-space interface design. Further, the mTHP controls should control all of shmem, not only "anonymous shmem". Also, we should properly fall back within the configured sizes, and not jump "over" configured sizes. Unless there is a good reason. (3) khugepaged khugepaged needs to handle larger folios properly as well. Until fixed, using smaller THP sizes as fallback might prohibit collapsing a PMD-sized THP later.
But really, khugepaged needs to be fixed to handle that. (4) force/disable These settings are rather testing artifacts from the old days. We should not add them to the per-size toggles. We might "inherit" it from the global one, though. "within_size" might have value, and especially for consistency, we should have them per size. So, this series only tackles anonymous shmem, which is a good starting point. Ideally, we'd get support for other shmem (especially during fault time) soon afterwards, because we won't be adding separate toggles for that from the interface POV, and having inconsistent behavior between kernel versions would be a bit unfortunate. @Baolin, this series likely does not consider (4) yet. And I suggest we have to take a lot of the "anonymous thp" terminology out of this series, especially when it comes to documentation. @Daniel, Pankaj, what are your plans regarding that? It would be great if we could get an understanding on the next steps on !anon shmem.
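To make the "inherit" semantics above concrete, here is a minimal sketch of how a per-size toggle could defer to the global shmem_enabled setting; the enum and helper names are hypothetical, not taken from the posted series:

enum shmem_policy {
	POLICY_NEVER,
	POLICY_ALWAYS,
	POLICY_WITHIN_SIZE,
	POLICY_ADVISE,
	POLICY_INHERIT,		/* per-size only: defer to the global toggle */
};

/* Resolve the effective policy for one hugepage-XXkb size. Since only the
 * PMD size defaults to "inherit" (all other sizes default to "never"),
 * existing global configurations keep working unchanged. */
static enum shmem_policy effective_shmem_policy(enum shmem_policy per_size,
						enum shmem_policy global)
{
	return (per_size == POLICY_INHERIT) ? global : per_size;
}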
On 2024/5/31 17:35, David Hildenbrand wrote: > On 30.05.24 04:04, Baolin Wang wrote: >> [...] > > Let me summarize the takeaway from the bi-weekly MM meeting as I > understood it, which includes Hugh's feedback on per-block tracking vs. Thanks David for the summary. > mTHP: > > (1) Per-block tracking > > Per-block tracking is currently considered unwarranted complexity in > shmem.c. We should try to get it done without that. For any test cases > that fail, we should consider whether they are actually valid for shmem. > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing > is not possible at fallocate() time, detecting zeropages later and > retrying to split+free might be an option, without per-block tracking. > > (2) mTHP controls > > As a default, we should not be using large folios / mTHP for any shmem, > just like we did with THP via shmem_enabled. This is what this series > currently does, and is part of the whole mTHP user-space interface design. > > Further, the mTHP controls should control all of shmem, not only > "anonymous shmem". Yes, that's what I thought and in my TODO list.
> > Also, we should properly fall back within the configured sizes, and not > jump "over" configured sizes. Unless there is a good reason. > > (3) khugepaged > > khugepaged needs to handle larger folios properly as well. Until fixed, > using smaller THP sizes as fallback might prohibit collapsing a > PMD-sized THP later. But really, khugepaged needs to be fixed to handle > that. > > (4) force/disable > > These settings are rather testing artifacts from the old days. We should > not add them to the per-size toggles. We might "inherit" it from the > global one, though. Sorry, I missed this. So I should remove the 'force' and 'deny' options for each mTHP, right? > > "within_size" might have value, and especially for consistency, we > should have them per size. > > > > So, this series only tackles anonymous shmem, which is a good starting > point. Ideally, we'd get support for other shmem (especially during > fault time) soon afterwards, because we won't be adding separate toggles > for that from the interface POV, and having inconsistent behavior > between kernel versions would be a bit unfortunate. > > > @Baolin, this series likely does not consider (4) yet. And I suggest we > have to take a lot of the "anonymous thp" terminology out of this > series, especially when it comes to documentation. Sure. I will remove the "anonymous thp" terminology from the documentation, but want to still keep it in the commit message, because I want to start from anonymous shmem. > > @Daniel, Pankaj, what are your plans regarding that? It would be great > if we could get an understanding on the next steps on !anon shmem. >
>> >> As a default, we should not be using large folios / mTHP for any shmem, >> just like we did with THP via shmem_enabled. This is what this series >> currently does, and is part of the whole mTHP user-space interface design. >> >> Further, the mTHP controls should control all of shmem, not only >> "anonymous shmem". > > Yes, that's what I thought and in my TODO list. Good, it would be helpful to coordinate with Daniel and Pankaj. > >> >> Also, we should properly fall back within the configured sizes, and not >> jump "over" configured sizes. Unless there is a good reason. >> >> (3) khugepaged >> >> khugepaged needs to handle larger folios properly as well. Until fixed, >> using smaller THP sizes as fallback might prohibit collapsing a >> PMD-sized THP later. But really, khugepaged needs to be fixed to handle >> that. > >> (4) force/disable >> >> These settings are rather testing artifacts from the old days. We should >> not add them to the per-size toggles. We might "inherit" it from the >> global one, though. > > Sorry, I missed this. So I should remove the 'force' and 'deny' options > for each mTHP, right? Yes, that's my understanding. But we have to keep them on the top level for any possible user out there. > >> >> "within_size" might have value, and especially for consistency, we >> should have them per size. >> >> >> >> So, this series only tackles anonymous shmem, which is a good starting >> point. Ideally, we'd get support for other shmem (especially during >> fault time) soon afterwards, because we won't be adding separate toggles >> for that from the interface POV, and having inconsistent behavior >> between kernel versions would be a bit unfortunate. >> >> >> @Baolin, this series likely does not consider (4) yet. And I suggest we >> have to take a lot of the "anonymous thp" terminology out of this >> series, especially when it comes to documentation. > > Sure. I will remove the "anonymous thp" terminology from the > documentation, but want to still keep it in the commit message, because I > want to start from anonymous shmem. For the commit message and friends, that makes sense. The story should be "controls all of shmem/tmpfs, but support will be added iteratively. The first step is anonymous shmem."
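David's "fall back within the configured sizes" rule can be sketched as follows. highest_order()/next_order() are the helpers visible in the diffs later in this thread; shmem_allowable_orders() is a hypothetical stand-in for however the enabled-size bitmask is obtained:

/* Sketch: try the largest enabled order first and walk down through every
 * smaller enabled order, never skipping a configured size.
 * shmem_allowable_orders() is hypothetical. */
unsigned long orders = shmem_allowable_orders(vma, index);
struct folio *folio = NULL;
int order;

for (order = highest_order(orders); orders; order = next_order(&orders, order)) {
	pgoff_t aligned = round_down(index, 1UL << order);

	folio = shmem_alloc_folio(gfp, order, info, aligned);
	if (folio)
		break;	/* allocated at this size */
	/* allocation failed: fall back to the next smaller enabled size */
}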
On Fri, May 31, 2024 at 11:35:30AM +0200, David Hildenbrand wrote: > On 30.05.24 04:04, Baolin Wang wrote: > > [...] > > Let me summarize the takeaway from the bi-weekly MM meeting as I understood > it, which includes Hugh's feedback on per-block tracking vs. mTHP: Thanks David for the summary. Please find below some follow-up questions. I want to understand whether zeropage scanning overhead is preferred over per-block tracking complexity, or if we still need to verify this. > > (1) Per-block tracking > > Per-block tracking is currently considered unwarranted complexity in > shmem.c. We should try to get it done without that. For any test cases that > fail, we should consider whether they are actually valid for shmem. I agree it is unwarranted complexity, but only if this is just to fix lseek(), as we can simply make the test pass by checking whether holes are reported as data. That would be the minimum required for lseek() to be compliant with the syscall. How can we use per-block tracking for reclaiming memory, and what changes would be needed? Or is per-block really a non-viable option? Clearly, if per-block is a viable option, the shmem_fault() bug would need to be fixed first.
Any ideas on how to make it reproducible? The alternatives discussed were sub-page refcounting and zeropage scanning. The first one is not possible (IIUC) because of a refactor years back that simplified the code, and it would also require extra complexity. The second option would require additional overhead, as we would involve scanning. > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing > is not possible at fallocate() time, detecting zeropages later and > retrying to split+free might be an option, without per-block tracking. > > (2) mTHP controls > > As a default, we should not be using large folios / mTHP for any shmem, just > like we did with THP via shmem_enabled. This is what this series currently > does, and is part of the whole mTHP user-space interface design. That was clear for me too. But what is the reason we want to boot in 'safe mode'? What are the implications of not respecting that? > > Further, the mTHP controls should control all of shmem, not only "anonymous > shmem". As I understood from the call, mTHP with sysctl knobs is preferred over optimistic falloc/write allocation? But it is still unclear to me why the former is preferred. Are large folios a non-viable option? > > Also, we should properly fall back within the configured sizes, and not jump > "over" configured sizes. Unless there is a good reason. > > (3) khugepaged > > khugepaged needs to handle larger folios properly as well. Until fixed, > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized > THP later. But really, khugepaged needs to be fixed to handle that. > > (4) force/disable > > These settings are rather testing artifacts from the old days. We should not > add them to the per-size toggles. We might "inherit" it from the global one, > though. > > "within_size" might have value, and especially for consistency, we should > have them per size. > > > > So, this series only tackles anonymous shmem, which is a good starting > point. Ideally, we'd get support for other shmem (especially during fault > time) soon afterwards, because we won't be adding separate toggles for that > from the interface POV, and having inconsistent behavior between kernel > versions would be a bit unfortunate. > > > @Baolin, this series likely does not consider (4) yet. And I suggest we have > to take a lot of the "anonymous thp" terminology out of this series, > especially when it comes to documentation. > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we > could get an understanding on the next steps on !anon shmem. I realize I've raised so many questions, but it's essential for us to grasp the mm concerns and usage scenarios. This understanding will provide clarity on the direction regarding folios for !anon shmem. > > -- > Cheers, > > David / dhildenb > Daniel
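For reference, the "zeropage scanning" alternative mentioned above could be built from existing kernel primitives; a minimal sketch of the detection step (a later pass would then split the folio and free the ranges found to be zero):

/* Detect an all-zero subpage of a large folio, so that a later pass could
 * split+free the zeroed range instead of tracking blocks at fallocate()
 * time. */
static bool subpage_is_zeroes(struct page *page)
{
	void *addr = kmap_local_page(page);
	bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

	kunmap_local(addr);
	return zero;
}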
Hi Daniel, >> Let me summarize the takeaway from the bi-weekly MM meeting as I understood >> it, which includes Hugh's feedback on per-block tracking vs. mTHP: > > Thanks David for the summary. Please find below some follow-up questions. > > I want to understand whether zeropage scanning overhead is preferred over per-block > tracking complexity, or if we still need to verify this. > >> >> (1) Per-block tracking >> >> Per-block tracking is currently considered unwarranted complexity in >> shmem.c. We should try to get it done without that. For any test cases that >> fail, we should consider whether they are actually valid for shmem. > > I agree it is unwarranted complexity, but only if this is just to fix lseek(), as > we can simply make the test pass by checking whether holes are reported as data. That > would be the minimum required for lseek() to be compliant with the syscall. > > How can we use per-block tracking for reclaiming memory, and what changes would > be needed? Or is per-block really a non-viable option? The interesting thing is: with mTHP toggles it is opt-in -- like for PMD-sized THP with shmem_enabled -- and we don't have to be that concerned about this problem right now. > > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be > fixed first. Any ideas on how to make it reproducible? > > The alternatives discussed were sub-page refcounting and zeropage scanning. Yeah, I don't think sub-page refcounting is a feasible (and certainly not desired ;) ) option in the folio world. > The first one is not possible (IIUC) because of a refactor years back that > simplified the code, and it would also require extra complexity. The second option would > require additional overhead, as we would involve scanning. We'll likely need something similar (scanning, tracking?) for anonymous memory as well. There was a proposal for a THP shrinker some time ago, that would solve part of the problem. For example, for shmem you could simply flag folios that failed splitting during fallocate() as reclaim candidates and try to reclaim memory later. So you don't have to scan arbitrary folios (which might also be desired, though). > >> >> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing >> is not possible at fallocate() time, detecting zeropages later and >> retrying to split+free might be an option, without per-block tracking. > >> >> (2) mTHP controls >> >> As a default, we should not be using large folios / mTHP for any shmem, just >> like we did with THP via shmem_enabled. This is what this series currently >> does, and is part of the whole mTHP user-space interface design. > > That was clear for me too. But what is the reason we want to boot in 'safe > mode'? What are the implications of not respecting that? [...] > > As I understood from the call, mTHP with sysctl knobs is preferred over > optimistic falloc/write allocation? But it is still unclear to me why the former > is preferred. I think Hugh's point was that this should be an opt-in feature, just like PMD-sized THP started out, and still is, an opt-in feature. Problematic interaction with khugepaged (that will be fixed) is one thing, interaction with memory reclaim (without any kind of memory reclaim mechanisms in place) might be another one. Controlling and tuning for specific folio sizes might be another one Hugh raised. [summarizing what I recall from the discussion, there might be more]. > > Are large folios a non-viable option? I think you mean "allocating large folios without user space control".
Because mTHP gives user space full control, to the point where you can enable all sizes and obtain the same result. > >> >> Also, we should properly fall back within the configured sizes, and not jump >> "over" configured sizes. Unless there is a good reason. >> >> (3) khugepaged >> >> khugepaged needs to handle larger folios properly as well. Until fixed, >> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized >> THP later. But really, khugepaged needs to be fixed to handle that. >> >> (4) force/disable >> >> These settings are rather testing artifacts from the old days. We should not >> add them to the per-size toggles. We might "inherit" it from the global one, >> though. >> >> "within_size" might have value, and especially for consistency, we should >> have them per size. >> >> >> >> So, this series only tackles anonymous shmem, which is a good starting >> point. Ideally, we'd get support for other shmem (especially during fault >> time) soon afterwards, because we won't be adding separate toggles for that >> from the interface POV, and having inconsistent behavior between kernel >> versions would be a bit unfortunate. >> >> >> @Baolin, this series likely does not consider (4) yet. And I suggest we have >> to take a lot of the "anonymous thp" terminology out of this series, >> especially when it comes to documentation. >> >> @Daniel, Pankaj, what are your plans regarding that? It would be great if we >> could get an understanding on the next steps on !anon shmem. > > I realize I've raised so many questions, but it's essential for us to grasp the > mm concerns and usage scenarios. This understanding will provide clarity on the > direction regarding folios for !anon shmem. If I understood correctly, Hugh had strong feelings against not respecting mTHP toggles for shmem. Without per-block tracking, I agree with that.
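David's flag-and-reclaim-later suggestion could look roughly like the sketch below at PUNCH_HOLE time; shmem_mark_reclaim_candidate() is a hypothetical helper, not posted code:

/* Sketch: when FALLOC_FL_PUNCH_HOLE cannot split a large folio (e.g.
 * because extra references are held), remember it as a reclaim candidate
 * instead of tracking per-block state; a later pass can re-detect zero
 * ranges and retry the split+free. */
folio_lock(folio);
if (split_folio(folio) < 0)
	shmem_mark_reclaim_candidate(folio);	/* hypothetical */
folio_unlock(folio);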
At 2024-05-31 18:13:03, "Baolin Wang" <baolin.wang@linux.alibaba.com> wrote: >[...]
>> >> Further, the mTHP controls should control all of shmem, not only >> "anonymous shmem". > >Yes, that's what I thought and in my TODO list. > >> >> Also, we should properly fall back within the configured sizes, and not >> jump "over" configured sizes. Unless there is a good reason. >> >> (3) khugepaged >> >> khugepaged needs to handle larger folios properly as well. Until fixed, >> using smaller THP sizes as fallback might prohibit collapsing a >> PMD-sized THP later. But really, khugepaged needs to be fixed to handle >> that. > >> (4) force/disable >> >> These settings are rather testing artifacts from the old days. We should >> not add them to the per-size toggles. We might "inherit" it from the >> global one, though. > >Sorry, I missed this. So I should remove the 'force' and 'deny' options >for each mTHP, right? > I prefer this. Perhaps the functionality of "force/deny" will differ from that of "always/never" once tmpfs is supported, and users would need to learn the usage of "force" and "deny" all over again.
On 2024/5/31 19:13, David Hildenbrand wrote: >>> >>> As a default, we should not be using large folios / mTHP for any shmem, >>> just like we did with THP via shmem_enabled. This is what this series >>> currently does, and is part of the whole mTHP user-space interface >>> design. >>> >>> Further, the mTHP controls should control all of shmem, not only >>> "anonymous shmem". >> >> Yes, that's what I thought and in my TODO list. > > Good, it would be helpful to coordinate with Daniel and Pankaj. > >> >>> >>> Also, we should properly fall back within the configured sizes, and not >>> jump "over" configured sizes. Unless there is a good reason. >>> >>> (3) khugepaged >>> >>> khugepaged needs to handle larger folios properly as well. Until fixed, >>> using smaller THP sizes as fallback might prohibit collapsing a >>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle >>> that. > >>> (4) force/disable >>> >>> These settings are rather testing artifacts from the old days. We should >>> not add them to the per-size toggles. We might "inherit" it from the >>> global one, though. >> >> Sorry, I missed this. So I should remove the 'force' and 'deny' options >> for each mTHP, right? > > Yes, that's my understanding. But we have to keep them on the top level > for any possible user out there. OK. >>> >>> "within_size" might have value, and especially for consistency, we >>> should have them per size. >>> >>> >>> >>> So, this series only tackles anonymous shmem, which is a good starting >>> point. Ideally, we'd get support for other shmem (especially during >>> fault time) soon afterwards, because we won't be adding separate toggles >>> for that from the interface POV, and having inconsistent behavior >>> between kernel versions would be a bit unfortunate. >>> >>> >>> @Baolin, this series likely does not consider (4) yet. And I suggest we >>> have to take a lot of the "anonymous thp" terminology out of this >>> series, especially when it comes to documentation. >> >> Sure. I will remove the "anonymous thp" terminology from the >> documentation, but want to still keep it in the commit message, because I >> want to start from anonymous shmem. > > For the commit message and friends, that makes sense. The story should be > "controls all of shmem/tmpfs, but support will be added iteratively. The > first step is anonymous shmem." Sure. Thanks.
On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote: > > > > > > As a default, we should not be using large folios / mTHP for any shmem, > > > just like we did with THP via shmem_enabled. This is what this series > > > currently does, and is part of the whole mTHP user-space interface design. > > > > > > Further, the mTHP controls should control all of shmem, not only > > > "anonymous shmem". > > > > Yes, that's what I thought and in my TODO list. > > Good, it would be helpful to coordinate with Daniel and Pankaj. I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of Baolin's v3 patches. You may find a version in my integration branch here [2]. I can attach them here if it's preferred. [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/ [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp The point here is to combine the large folios strategy I proposed with mTHP user controls. Would it make sense to limit the orders to the mapping order calculated based on the size and index? @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, order = highest_order(suitable_orders); while (suitable_orders) { + if (order > mapping_order) { + order = next_order(&suitable_orders, order); + continue; + } pages = 1UL << order; index = round_down(index, pages); folio = shmem_alloc_folio(gfp, order, info, index); Note: The branch still needs to be adapted to include !anon mm. > [...]
On Fri, May 31, 2024 at 04:43:32PM +0200, David Hildenbrand wrote: > Hi Daniel, > > > > Let me summarize the takeaway from the bi-weekly MM meeting as I understood > > > it, which includes Hugh's feedback on per-block tracking vs. mTHP: > > > > Thanks David for the summary. Please find below some follow-up questions. > > > > I want to understand whether zeropage scanning overhead is preferred over per-block > > tracking complexity, or if we still need to verify this. > > > > > > > > (1) Per-block tracking > > > > > > Per-block tracking is currently considered unwarranted complexity in > > > shmem.c. We should try to get it done without that. For any test cases that > > > fail, we should consider whether they are actually valid for shmem. > > > > I agree it is unwarranted complexity, but only if this is just to fix lseek(), as > > we can simply make the test pass by checking whether holes are reported as data. That > > would be the minimum required for lseek() to be compliant with the syscall. > > > > How can we use per-block tracking for reclaiming memory, and what changes would > > be needed? Or is per-block really a non-viable option? > > The interesting thing is: with mTHP toggles it is opt-in -- like for > PMD-sized THP with shmem_enabled -- and we don't have to be that concerned > about this problem right now. Without respecting the size when allocating large folios, mTHP toggles would over-allocate. My proposal added earlier to this thread is to combine the two to avoid that case. Otherwise, shouldn't we try to find a solution for the reclaiming path? > > > > > > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be > > fixed first. Any ideas on how to make it reproducible? > > > > The alternatives discussed were sub-page refcounting and zeropage scanning. > > Yeah, I don't think sub-page refcounting is a feasible (and certainly not > desired ;) ) option in the folio world. > > > The first one is not possible (IIUC) because of a refactor years back that > > simplified the code, and it would also require extra complexity. The second option would > > require additional overhead, as we would involve scanning. > > We'll likely need something similar (scanning, tracking?) for anonymous > memory as well. There was a proposal for a THP shrinker some time ago, that > would solve part of the problem. It's good to know we have the same problem in different places. I'm more inclined to keep the information rather than adding extra overhead, unless the complexity is really overwhelming. Considering the concerns here, I'm not sure how much we should try merging with iomap as per Ritesh's suggestion. The THP shrinker, could you please confirm whether it is the following thread? https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/ > > For example, for shmem you could simply flag folios that failed splitting > during fallocate() as reclaim candidates and try to reclaim memory later. So > you don't have to scan arbitrary folios (which might also be desired, > though). Thanks for the suggestion. I'll look into this. > > > > > > > > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing > > > > is not possible at fallocate() time, detecting zeropages later and > > > > retrying to split+free might be an option, without per-block tracking. > > > > > > > > (2) mTHP controls > > > > > > > > As a default, we should not be using large folios / mTHP for any shmem, just > > > > like we did with THP via shmem_enabled.
This is what this series currently > > > does, and is part of the whole mTHP user-space interface design. > > > > That was clear for me too. But what is the reason we want to boot in 'safe > > mode'? What are the implications of not respecting that? > > [...] > > > > > As I understood from the call, mTHP with sysctl knobs is preferred over > > optimistic falloc/write allocation? But it is still unclear to me why the former > > is preferred. > > I think Hugh's point was that this should be an opt-in feature, just like > PMD-sized THP started out, and still is, an opt-in feature. I'd be keen to understand the use case for this. Even the current THP controls we have in tmpfs. I guess these are just scenarios with no swap involved? Are these use cases the same for both tmpfs and shmem anon mm? > > Problematic interaction with khugepaged (that will be fixed) is one thing, > interaction with memory reclaim (without any kind of memory reclaim > mechanisms in place) might be another one. Controlling and tuning for > specific folio sizes might be another one Hugh raised. [summarizing what I > recall from the discussion, there might be more]. > > > > > Are large folios a non-viable option? > > I think you mean "allocating large folios without user space control". Yes. > > Because mTHP gives user space full control, to the point where you can > enable all sizes and obtain the same result. Agree. > > [...] > > > > I realize I've raised so many questions, but it's essential for us to grasp the > > > mm concerns and usage scenarios. This understanding will provide clarity on the > > > direction regarding folios for !anon shmem. > > If I understood correctly, Hugh had strong feelings against not respecting > mTHP toggles for shmem. Without per-block tracking, I agree with that. My understanding was the same. But I have this follow-up question: should we respect mTHP toggles without considering mapping constraints (size and index)?
Or perhaps we should use within_size where we can fit this intermediate approach, as long as mTHP granularity is respected? Daniel > > -- > Cheers, > > David / dhildenb >
On 2024/6/4 16:18, Daniel Gomez wrote: > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote: >>>> >>>> As a default, we should not be using large folios / mTHP for any shmem, >>>> just like we did with THP via shmem_enabled. This is what this series >>>> currently does, and is part of the whole mTHP user-space interface design. >>>> >>>> Further, the mTHP controls should control all of shmem, not only >>>> "anonymous shmem". >>> >>> Yes, that's what I thought and in my TODO list. >> >> Good, it would be helpful to coordinate with Daniel and Pankaj. > > I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of Baolin's > v3 patches. You may find a version in my integration branch here [2]. I can > attach them here if it's preferred. > > [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/ > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp > > The point here is to combine the large folios strategy I proposed with mTHP > user controls. Would it make sense to limit the orders to the mapping order > calculated based on the size and index? IMO, for !anon shmem, this change makes sense to me. We should respect the size, and mTHP should act as an order filter. For anon shmem, we should ignore the length, which you always set to PAGE_SIZE in patch [1]. [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569 > @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, > > order = highest_order(suitable_orders); > while (suitable_orders) { > + if (order > mapping_order) { > + order = next_order(&suitable_orders, order); > + continue; > + } > pages = 1UL << order; > index = round_down(index, pages); > folio = shmem_alloc_folio(gfp, order, info, index); > > Note: The branch still needs to be adapted to include !anon mm.
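Baolin's "order filter" framing suggests an equivalent way to write the loop in Daniel's diff: mask off every order above the size-derived cap up front instead of skipping inside the loop (a sketch; variable names follow the diff above):

/* Keep only the mTHP-enabled orders at or below mapping_order; equivalent
 * to the "continue" in the diff above. */
suitable_orders &= BIT(mapping_order + 1) - 1;

order = highest_order(suitable_orders);
while (suitable_orders) {
	pages = 1UL << order;
	index = round_down(index, pages);
	folio = shmem_alloc_folio(gfp, order, info, index);
	if (folio)
		break;
	order = next_order(&suitable_orders, order);
}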
[...] >>> How can we use per-block tracking for reclaiming memory, and what changes would >>> be needed? Or is per-block really a non-viable option? >> >> The interesting thing is: with mTHP toggles it is opt-in -- like for >> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned >> about this problem right now. > > Without respecting the size when allocating large folios, mTHP toggles would > over-allocate. My proposal added earlier to this thread is to combine the two > to avoid that case. Otherwise, shouldn't we try to find a solution for the > reclaiming path? I think at some point we'll really have to do a better job at reclaiming (either memory over-allocation, PUNCH_HOLE ranges that couldn't be split, but maybe also pages that are now all-zero again and could be reclaimed again). > >> >>> >>> Clearly, if per-block is a viable option, the shmem_fault() bug would need to be >>> fixed first. Any ideas on how to make it reproducible? >>> >>> The alternatives discussed were sub-page refcounting and zeropage scanning. >> >> Yeah, I don't think sub-page refcounting is a feasible (and certainly not >> desired ;) ) option in the folio world. >> >>> The first one is not possible (IIUC) because of a refactor years back that >>> simplified the code, and it would also require extra complexity. The second option would >>> require additional overhead, as we would involve scanning. >> >> We'll likely need something similar (scanning, tracking?) for anonymous >> memory as well. There was a proposal for a THP shrinker some time ago, that >> would solve part of the problem. > > It's good to know we have the same problem in different places. I'm more > inclined to keep the information rather than adding extra overhead, unless > the complexity is really overwhelming. Considering the concerns here, I'm not sure > how much we should try merging with iomap as per Ritesh's suggestion. As raised in the meeting, I do see value in maintaining the information; but I also see why Hugh and Kirill think this might be unwarranted complexity in shmem.c. Meaning: we might be able to achieve something similar without it, and we don't have to solve the problem right now to benefit from large folios. > > The THP shrinker, could you please confirm whether it is the following thread? > > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/ Yes, although there was no follow-up so far. Possibly because, in the current khugepaged approach, there will be a constant back-and-forth between khugepaged collapsing memory (and wasting memory in the default setting) and the THP shrinker reclaiming memory; that doesn't sound quite effective :) That needs more thought IMHO. [...] >>>> To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing >>>> is not possible at fallocate() time, detecting zeropages later and >>>> retrying to split+free might be an option, without per-block tracking. >>> >>>> >>>> (2) mTHP controls >>>> >>>> As a default, we should not be using large folios / mTHP for any shmem, just >>>> like we did with THP via shmem_enabled. This is what this series currently >>>> does, and is part of the whole mTHP user-space interface design. >>> >>> That was clear for me too. But what is the reason we want to boot in 'safe >>> mode'? What are the implications of not respecting that? >> >> [...] >> >>> >>> As I understood from the call, mTHP with sysctl knobs is preferred over >>> optimistic falloc/write allocation? But it is still unclear to me why the former >>> is preferred.
>> >> I think Hugh's point was that this should be an opt-in feature, just like >> PMD-sized THP started out, and still is, an opt-in feature. > > I'd be keen to understand the use case for this. Even the current THP controls > we have in tmpfs. I guess these are just scenarios with no swap involved? > Are these use cases the same for both tmpfs and shmem anon mm? Systems without swap are one case, I think. The other reason for a toggle in the past was to be able to disable it to troubleshoot issues (Hugh mentioned in the meeting about unlocking new code paths in shmem.c only with a toggle). Staring at my Fedora system: $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled always within_size advise [never] deny force Maybe because it uses tmpfs to mount /tmp (interesting article on lwn.net about that) and people are concerned about the side effects (which can currently waste memory, or result in more reclaim work being required when exceeding file sizes). For VMs, I know that shmem+THP with memory ballooning is problematic and not really recommended. [...] >>> >>>> >>>> Also, we should properly fall back within the configured sizes, and not jump >>>> "over" configured sizes. Unless there is a good reason. >>>> >>>> (3) khugepaged >>>> >>>> khugepaged needs to handle larger folios properly as well. Until fixed, >>>> using smaller THP sizes as fallback might prohibit collapsing a PMD-sized >>>> THP later. But really, khugepaged needs to be fixed to handle that. >>>> >>>> (4) force/disable >>>> >>>> These settings are rather testing artifacts from the old days. We should not >>>> add them to the per-size toggles. We might "inherit" it from the global one, >>>> though. >>>> >>>> "within_size" might have value, and especially for consistency, we should >>>> have them per size. >>>> >>>> >>>> >>>> So, this series only tackles anonymous shmem, which is a good starting >>>> point. Ideally, we'd get support for other shmem (especially during fault >>>> time) soon afterwards, because we won't be adding separate toggles for that >>>> from the interface POV, and having inconsistent behavior between kernel >>>> versions would be a bit unfortunate. >>>> >>>> >>>> @Baolin, this series likely does not consider (4) yet. And I suggest we have >>>> to take a lot of the "anonymous thp" terminology out of this series, >>>> especially when it comes to documentation. >>>> >>>> @Daniel, Pankaj, what are your plans regarding that? It would be great if we >>>> could get an understanding on the next steps on !anon shmem. >>> >>> I realize I've raised so many questions, but it's essential for us to grasp the >>> mm concerns and usage scenarios. This understanding will provide clarity on the >>> direction regarding folios for !anon shmem. >> >> If I understood correctly, Hugh had strong feelings against not respecting >> mTHP toggles for shmem. Without per-block tracking, I agree with that. > > My understanding was the same. But I have this follow-up question: should > we respect mTHP toggles without considering mapping constraints (size and > index)? Or perhaps we should use within_size where we can fit this intermediate > approach, as long as mTHP granularity is respected? Likely we should do exactly the same as we would do with the existing PMD-sized THPs. I recall in the meeting that we discussed that always vs. within_size seems to have some value, and that we should respect that setting like we did for PMD-sized THP. Which other mapping constraints could we have?
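For reference, the existing PMD-sized semantics David refers to live in shmem.c's shmem_is_huge(); paraphrased here from roughly this kernel version (details may differ), this is what per-size "always"/"within_size"/"advise" toggles would generalize:

switch (SHMEM_SB(inode->i_sb)->huge) {
case SHMEM_HUGE_ALWAYS:
	return true;
case SHMEM_HUGE_WITHIN_SIZE:
	/* Huge only if the PMD-sized block containing @index still fits
	 * within the (rounded-up) file size. */
	index = round_up(index + 1, HPAGE_PMD_NR);
	i_size = round_up(i_size_read(inode), PAGE_SIZE);
	if (i_size >> PAGE_SHIFT >= index)
		return true;
	fallthrough;
case SHMEM_HUGE_ADVISE:
	return mm && (vm_flags & VM_HUGEPAGE);
default:
	return false;
}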
On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote: > > > On 2024/6/4 16:18, Daniel Gomez wrote: > > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote: > > > > > > > > > > As a default, we should not be using large folios / mTHP for any shmem, > > > > > just like we did with THP via shmem_enabled. This is what this series > > > > > currently does, and is part of the whole mTHP user-space interface design. > > > > > > > > > > Further, the mTHP controls should control all of shmem, not only > > > > > "anonymous shmem". > > > > > > > > Yes, that's what I thought and in my TODO list. > > > > > > Good, it would be helpful to coordinate with Daniel and Pankaj. > > > > I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of Baolin's > > v3 patches. You may find a version in my integration branch here [2]. I can > > attach them here if it's preferred. > > > > [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/ > > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp > > > > The point here is to combine the large folios strategy I proposed with mTHP > > user controls. Would it make sense to limit the orders to the mapping order > > calculated based on the size and index? > > IMO, for !anon shmem, this change makes sense to me. We should respect the > size, and mTHP should act as an order filter. What about respecting the size when the within_size flag is enabled? Then 'always' would allocate mTHP-enabled folios regardless of the size, and 'never' would ignore mTHP and size. So 'never' can be used for the 'safe' boot case mentioned in the discussion. > > For anon shmem, we should ignore the length, which you always set to > PAGE_SIZE in patch [1]. > > [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569 Since we are ignoring the length, we should ignore any value being passed. > > > @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, > > > > order = highest_order(suitable_orders); > > while (suitable_orders) { > > + if (order > mapping_order) { > > + order = next_order(&suitable_orders, order); > > + continue; > > + } > > pages = 1UL << order; > > index = round_down(index, pages); > > folio = shmem_alloc_folio(gfp, order, info, index); > > > > Note: The branch still needs to be adapted to include !anon mm.
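Daniel's proposal maps naturally onto a per-size policy helper; a sketch reusing the hypothetical names from earlier in this thread ('never' ignores mTHP and size, 'always' ignores only the size, 'within_size' respects both):

/* Hypothetical: combine the mTHP-enabled orders with the size-derived cap
 * according to the per-size policy. */
static unsigned long shmem_suitable_orders(enum shmem_policy policy,
					   unsigned long enabled_orders,
					   unsigned int mapping_order)
{
	switch (policy) {
	case POLICY_NEVER:
		return 0;				/* no large folios at all */
	case POLICY_ALWAYS:
		return enabled_orders;			/* ignore the size */
	case POLICY_WITHIN_SIZE:
		/* respect the size: drop orders above the cap */
		return enabled_orders & (BIT(mapping_order + 1) - 1);
	default:
		return 0;
	}
}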
On Tue, Jun 04, 2024 at 11:59:51AM +0200, David Hildenbrand wrote: > [...] > > For VMs, I know that shmem+THP with memory ballooning is problematic and not > really recommended. Agree. We cannot punch holes when using THP unless the punch is PMD-sized. > > [...] > > > > I realize I've raised so many questions, but it's essential for us to grasp the > > > > mm concerns and usage scenarios. This understanding will provide clarity on the > > > > direction regarding folios for !anon shmem. > > > > > > If I understood correctly, Hugh had strong feelings against not respecting > > > mTHP toggles for shmem. Without per-block tracking, I agree with that. > > > > My understanding was the same.
But I have this follow-up question: should > > we respect mTHP toggles without considering mapping constraints (size and > > index)? Or perhaps we should use within_size where we can fit this intermediate > > approach, as long as mTHP granularity is respected? > > Likely we should do exactly the same as we would do with the existing > PMD-sized THPs. > > I recall in the meeting that we discussed that always vs. within_size seems > to have some value, and that we should respect that setting like we did for > PMD-sized THP. > > Which other mapping constraints could we have? Patch 12 in my RFC for LSF [1] adds this shmem_mapping_size_order() (updated to support get_order()) to get the max order 'based on the file size which the mapping currently allows at the given index'. Same check is done here [2] in filemap.c. [1] https://lore.kernel.org/all/20240515055719.32577-13-da.gomez@samsung.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v6.10-rc2#n1940 @@ -1726,6 +1726,37 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order, return folio; } +/** + * shmem_mapping_size_order - Get maximum folio order for the given file size. + * @mapping: Target address_space. + * @index: The page index. + * @size: The suggested size of the folio to create. + * + * This returns a high order for folios (when supported) based on the file size + * which the mapping currently allows at the given index. The index is relevant + * due to alignment considerations the mapping might have. The returned order + * may be less than the size passed. + * + * Like __filemap_get_folio order calculation. + * + * Return: The order. + */ +static inline unsigned int +shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, + size_t size) +{ + unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE)); + + if (!mapping_large_folio_support(mapping)) + return 0; + + /* If we're not aligned, allocate a smaller folio */ + if (index & ((1UL << order) - 1)) + order = __ffs(index); + + return min_t(size_t, order, MAX_PAGECACHE_ORDER); +} + > > -- > Cheers, > > David / dhildenb >
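To make the order calculation above concrete, here is a minimal userspace
sketch of the same arithmetic. It assumes 4K pages (PAGE_SHIFT 12) and
MAX_PAGECACHE_ORDER == 9 (the PMD order on x86-64), and reimplements
get_order()/__ffs(); it is not kernel code, just the math:

/*
 * Userspace sketch of the shmem_mapping_size_order() arithmetic above.
 * Assumptions: 4K pages and MAX_PAGECACHE_ORDER == 9. get_order() and
 * __ffs() are reimplemented here; in the kernel they come from
 * <linux/mm.h> and <linux/bitops.h>.
 */
#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER	9

static unsigned int get_order(size_t size)
{
	unsigned int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

static unsigned int size_order(unsigned long index, size_t size)
{
	unsigned int order = get_order(size > PAGE_SIZE ? size : PAGE_SIZE);

	/* If we're not aligned at 'index', allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __builtin_ctzl(index);		/* __ffs(index) */

	return order < MAX_PAGECACHE_ORDER ? order : MAX_PAGECACHE_ORDER;
}

int main(void)
{
	printf("%u\n", size_order(0, 2UL << 20));	/* aligned 2M write: 9 */
	printf("%u\n", size_order(3, 2UL << 20));	/* misaligned index: 0 */
	printf("%u\n", size_order(16, 64UL << 10));	/* aligned 64K write: 4 */
	return 0;
}

The index check is the interesting part: even a 2M write ends up capped at
order 0 if it lands at an unaligned index, mirroring the __filemap_get_folio()
behavior the kerneldoc refers to.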
On 2024/6/4 20:05, Daniel Gomez wrote:
> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>
>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>
>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>> "anonymous shmem".
>>>>>
>>>>> Yes, that's what I thought and it's in my TODO list.
>>>>
>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>
>>> I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of
>>> Baolin's v3 patches. You may find a version in my integration branch here
>>> [2]. I can attach them here if it's preferred.
>>>
>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>
>>> The point here is to combine the large folios strategy I proposed with mTHP
>>> user controls. Would it make sense to limit the orders to the mapping order
>>> calculated based on the size and index?
>>
>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>> size, and mTHP should act as an order filter.
>
> What about respecting the size when the within_size flag is enabled? Then,
> 'always' would allocate mTHP-enabled folios, regardless of the size. And
> 'never' would ignore mTHP and size. So, 'never' can be used for this 'safe'
> boot case mentioned in the discussion.

Looks reasonable to me. What do you think, David?

And what about the 'advise' option? Similar to 'within_size'?

>> For anon shmem, we should ignore the length, which you always set to
>> PAGE_SIZE in patch [1].
>>
>> [1] https://gitlab.com/dkruces/linux-next/-/commit/edf02311fd6d86b355d3aeb74e67c8da6de3c569
>
> Since we are ignoring the length, we should ignore any value being passed.
>
>>> @@ -1765,6 +1798,10 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>>>
>>>  	order = highest_order(suitable_orders);
>>>  	while (suitable_orders) {
>>> +		if (order > mapping_order) {
>>> +			order = next_order(&suitable_orders, order);
>>> +			continue;
>>> +		}
>>>  		pages = 1UL << order;
>>>  		index = round_down(index, pages);
>>>  		folio = shmem_alloc_folio(gfp, order, info, index);
>>>
>>> Note: The branch still needs to be adapted to include !anon mm.
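As a side note, the filter that the added branch implements can be sketched in
userspace as follows. highest_order() and next_order() are mirrored from the
helpers in include/linux/huge_mm.h; the enabled-orders bitmask and the
mapping_order value are assumptions made up for the example:

/*
 * Userspace sketch of the order-filter loop above. suitable_orders is a
 * bitmask of mTHP orders enabled via sysfs; mapping_order is the cap
 * computed from file size and index (e.g. by shmem_mapping_size_order()).
 */
#include <stdio.h>

static int highest_order(unsigned long orders)
{
	return orders ? 63 - __builtin_clzl(orders) : -1;
}

static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);
	return highest_order(*orders);
}

int main(void)
{
	/* Assume 2M (order 9) and 64K (order 4) mTHP are enabled ... */
	unsigned long suitable_orders = (1UL << 9) | (1UL << 4);
	/* ... but the file size/index only allows up to order 4. */
	int mapping_order = 4;
	int order = highest_order(suitable_orders);

	while (suitable_orders) {
		if (order > mapping_order) {
			/* Skip orders the mapping cannot accommodate */
			order = next_order(&suitable_orders, order);
			continue;
		}
		printf("would try to allocate a folio of order %d\n", order);
		break;	/* the allocation attempt would go here */
	}
	return 0;
}

With 2M and 64K mTHP enabled but the mapping order capped at 4, the loop skips
order 9 and first attempts the 64K allocation, which is exactly the 'order
filter' behavior described above.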
On 06.06.24 05:31, Baolin Wang wrote:
>
> On 2024/6/4 20:05, Daniel Gomez wrote:
>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>
>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>
>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>>
>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>> "anonymous shmem".
>>>>>>
>>>>>> Yes, that's what I thought and it's in my TODO list.
>>>>>
>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>
>>>> I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of
>>>> Baolin's v3 patches. You may find a version in my integration branch here
>>>> [2]. I can attach them here if it's preferred.
>>>>
>>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>
>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>> calculated based on the size and index?
>>>
>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>> size, and mTHP should act as an order filter.
>>
>> What about respecting the size when the within_size flag is enabled? Then,
>> 'always' would allocate mTHP-enabled folios, regardless of the size. And
>> 'never' would ignore mTHP and size. So, 'never' can be used for this 'safe'
>> boot case mentioned in the discussion.
>
> Looks reasonable to me. What do you think, David?

That mimics existing PMD-THP behavior, right?

With "within_size", we must not exceed the size, with "always", we may
exceed the size.

> And what about the 'advise' option? Similar to 'within_size'?

Good question. What's the behavior of PMD-THP? I would assume it behaves
like "within_size", because in the common case we mmap (+advise) only
within the size of the file, not exceeding it.

(the always option, as I learned during the meeting, is primarily
helpful when writing to tmpfs files in an append-fashion. With
mmap()+madvise() that doesn't quite happen)
On 2024/6/6 16:38, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
>>
>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>
>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>>>
>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>> "anonymous shmem".
>>>>>>>
>>>>>>> Yes, that's what I thought and it's in my TODO list.
>>>>>>
>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>
>>>>> I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of
>>>>> Baolin's v3 patches. You may find a version in my integration branch here
>>>>> [2]. I can attach them here if it's preferred.
>>>>>
>>>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>
>>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>>> calculated based on the size and index?
>>>>
>>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>>> size, and mTHP should act as an order filter.
>>>
>>> What about respecting the size when the within_size flag is enabled? Then,
>>> 'always' would allocate mTHP-enabled folios, regardless of the size. And
>>> 'never' would ignore mTHP and size. So, 'never' can be used for this 'safe'
>>> boot case mentioned in the discussion.
>>
>> Looks reasonable to me. What do you think, David?
>
> That mimics existing PMD-THP behavior, right?
>
> With "within_size", we must not exceed the size, with "always", we may
> exceed the size.

Yes, I think so.

>> And what about the 'advise' option? Similar to 'within_size'?
>
> Good question. What's the behavior of PMD-THP? I would assume it behaves
> like "within_size", because in the common case we mmap (+advise) only
> within the size of the file, not exceeding it.

Yes, that is also my understanding.

> (the always option, as I learned during the meeting, is primarily
> helpful when writing to tmpfs files in an append-fashion. With
> mmap()+madvise() that doesn't quite happen)
On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
> On 06.06.24 05:31, Baolin Wang wrote:
> >
> > On 2024/6/4 20:05, Daniel Gomez wrote:
> > > On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
> > > >
> > > > On 2024/6/4 16:18, Daniel Gomez wrote:
> > > > > On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > As a default, we should not be using large folios / mTHP for any
> > > > > > > > shmem, just like we did with THP via shmem_enabled. This is what
> > > > > > > > this series currently does, and is part of the whole mTHP
> > > > > > > > user-space interface design.
> > > > > > > >
> > > > > > > > Further, the mTHP controls should control all of shmem, not only
> > > > > > > > "anonymous shmem".
> > > > > > >
> > > > > > > Yes, that's what I thought and it's in my TODO list.
> > > > > >
> > > > > > Good, it would be helpful to coordinate with Daniel and Pankaj.
> > > > >
> > > > > I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of
> > > > > Baolin's v3 patches. You may find a version in my integration branch
> > > > > here [2]. I can attach them here if it's preferred.
> > > > >
> > > > > [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> > > > > [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
> > > > >
> > > > > The point here is to combine the large folios strategy I proposed with
> > > > > mTHP user controls. Would it make sense to limit the orders to the
> > > > > mapping order calculated based on the size and index?
> > > >
> > > > IMO, for !anon shmem, this change makes sense to me. We should respect
> > > > the size, and mTHP should act as an order filter.
> > >
> > > What about respecting the size when the within_size flag is enabled? Then,
> > > 'always' would allocate mTHP-enabled folios, regardless of the size. And
> > > 'never' would ignore mTHP and size. So, 'never' can be used for this 'safe'
> > > boot case mentioned in the discussion.
> >
> > Looks reasonable to me. What do you think, David?
>
> That mimics existing PMD-THP behavior, right?
>
> With "within_size", we must not exceed the size, with "always", we may
> exceed the size.

But right now we only check the inode size. With large folio support in
write_iter() we can have access to the length as well. I think this would
solve (partially?) the cases where we don't have access to the file size and
where we perform writes in bigger chunks.

E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file

For 'within_size', the example above would allocate 512 pages instead of one
huge page. After patches [1] [2] we can get the size of the write to allocate
whatever mTHP/THP makes more sense for the length being passed.

[1] https://lore.kernel.org/all/20240527163616.1135968-2-hch@lst.de/
[2] https://lore.kernel.org/all/20240515055719.32577-12-da.gomez@samsung.com/

Here's a quick hack for THP:

@@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
 	case SHMEM_HUGE_WITHIN_SIZE:
 		index = round_up(index + 1, HPAGE_PMD_NR);
 		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >> PAGE_SHIFT >= index)
+		if ((i_size >> PAGE_SHIFT >= index) ||
+		    (len >> PAGE_SHIFT >= index))
 			return true;
 		fallthrough;

> > And what about the 'advise' option? Similar to 'within_size'?
>
> Good question.
> What's the behavior of PMD-THP? I would assume it behaves like
> "within_size", because in the common case we mmap (+advise) only within
> the size of the file, not exceeding it.

It allocates a huge page on request when MADV_HUGEPAGE is set (regardless of
the size).

> (the always option, as I learned during the meeting, is primarily helpful
> when writing to tmpfs files in an append-fashion. With mmap()+madvise() that
> doesn't quite happen)

Let's benefit from larger writes as well if the chunk matches any of the
mTHP/THP sizes.

> --
> Cheers,
>
> David / dhildenb
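To spell out what the hack does for the xfs_io example (a 2M pwrite at offset
0 into an empty file), here is the arithmetic as a userspace sketch, assuming
4K pages and HPAGE_PMD_NR == 512; not kernel code:

/*
 * Worked example of the SHMEM_HUGE_WITHIN_SIZE check above for
 * "pwrite -b 2M 0 2M" into an empty file.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HPAGE_PMD_NR	512UL

static unsigned long round_up(unsigned long x, unsigned long to)
{
	return ((x + to - 1) / to) * to;
}

static bool within_size_huge(unsigned long index, unsigned long i_size,
			     unsigned long len)
{
	index = round_up(index + 1, HPAGE_PMD_NR);
	i_size = round_up(i_size, PAGE_SIZE);
	return (i_size >> PAGE_SHIFT >= index) || (len >> PAGE_SHIFT >= index);
}

int main(void)
{
	/* Empty file (i_size == 0), fault at index 0, 2M write length:
	 * index rounds up to 512; i_size contributes 0 pages, but
	 * len >> PAGE_SHIFT == 512 >= 512, so a huge page is allowed. */
	printf("%d\n", within_size_huge(0, 0, 2UL << 20));	/* 1 */

	/* Same fault without the len term: no huge page, which is the
	 * current behavior Daniel describes. */
	printf("%d\n", within_size_huge(0, 0, 0));		/* 0 */
	return 0;
}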
On 07.06.24 11:05, Daniel Gomez wrote:
> On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
>> On 06.06.24 05:31, Baolin Wang wrote:
>>>
>>> On 2024/6/4 20:05, Daniel Gomez wrote:
>>>> On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:
>>>>>
>>>>> On 2024/6/4 16:18, Daniel Gomez wrote:
>>>>>> On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> As a default, we should not be using large folios / mTHP for any shmem,
>>>>>>>>> just like we did with THP via shmem_enabled. This is what this series
>>>>>>>>> currently does, and is part of the whole mTHP user-space interface design.
>>>>>>>>>
>>>>>>>>> Further, the mTHP controls should control all of shmem, not only
>>>>>>>>> "anonymous shmem".
>>>>>>>>
>>>>>>>> Yes, that's what I thought and it's in my TODO list.
>>>>>>>
>>>>>>> Good, it would be helpful to coordinate with Daniel and Pankaj.
>>>>>>
>>>>>> I've integrated patches 11 and 12 from the LSF RFC thread [1] on top of
>>>>>> Baolin's v3 patches. You may find a version in my integration branch here
>>>>>> [2]. I can attach them here if it's preferred.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>>>>>> [2] https://gitlab.com/dkruces/linux-next/-/commits/next-20240604-shmem-mthp
>>>>>>
>>>>>> The point here is to combine the large folios strategy I proposed with mTHP
>>>>>> user controls. Would it make sense to limit the orders to the mapping order
>>>>>> calculated based on the size and index?
>>>>>
>>>>> IMO, for !anon shmem, this change makes sense to me. We should respect the
>>>>> size, and mTHP should act as an order filter.
>>>>
>>>> What about respecting the size when the within_size flag is enabled? Then,
>>>> 'always' would allocate mTHP-enabled folios, regardless of the size. And
>>>> 'never' would ignore mTHP and size. So, 'never' can be used for this 'safe'
>>>> boot case mentioned in the discussion.
>>>
>>> Looks reasonable to me. What do you think, David?
>>
>> That mimics existing PMD-THP behavior, right?
>>
>> With "within_size", we must not exceed the size, with "always", we may
>> exceed the size.
>
> But right now we only check the inode size. With large folio support in
> write_iter() we can have access to the length as well. I think this would
> solve (partially?) the cases where we don't have access to the file size and
> where we perform writes in bigger chunks.
>
> E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file
>
> For 'within_size', the example above would allocate 512 pages instead of one
> huge page. After patches [1] [2] we can get the size of the write to allocate
> whatever mTHP/THP makes more sense for the length being passed.

Yes, although this sounds like an optimization/change we should be doing
separately, I guess.
> [1] https://lore.kernel.org/all/20240527163616.1135968-2-hch@lst.de/
> [2] https://lore.kernel.org/all/20240515055719.32577-12-da.gomez@samsung.com/
>
> Here's a quick hack for THP:
>
> @@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
>  	case SHMEM_HUGE_WITHIN_SIZE:
>  		index = round_up(index + 1, HPAGE_PMD_NR);
>  		i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -		if (i_size >> PAGE_SHIFT >= index)
> +		if ((i_size >> PAGE_SHIFT >= index) ||
> +		    (len >> PAGE_SHIFT >= index))
>  			return true;
>  		fallthrough;
>
>>> And what about the 'advise' option? Similar to 'within_size'?
>>
>> Good question. What's the behavior of PMD-THP? I would assume it behaves
>> like "within_size", because in the common case we mmap (+advise) only within
>> the size of the file, not exceeding it.
>
> It allocates a huge page on request when MADV_HUGEPAGE is set (regardless of
> the size).

Interesting, so we should do the same. Thanks!
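For completeness, the MADV_HUGEPAGE case discussed above can be exercised from
userspace roughly like this; memfd_create() stands in for any shmem file, and
error handling is omitted for brevity:

/*
 * Minimal sketch of requesting THP on a shmem mapping via MADV_HUGEPAGE,
 * matching the 'advise' case discussed above: the huge page is allocated
 * on request, regardless of the file size.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one PMD-sized THP */
	int fd = memfd_create("thp-demo", 0);	/* shmem-backed file */

	ftruncate(fd, len);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Ask for huge pages on this range; with shmem_enabled=advise this
	 * is what makes the fault below eligible for a PMD-sized folio. */
	madvise(p, len, MADV_HUGEPAGE);

	memset(p, 0x58, len);			/* fault in the range */

	munmap(p, len);
	close(fd);
	return 0;
}

With shmem_enabled=advise, it is the madvise() call, not the file size, that
makes the mapping eligible for a PMD-sized folio at fault time.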