Message ID | 20250327160700.1147155-1-ryan.roberts@arm.com (mailing list archive)
---|---
State | New
Series | [v3] mm/filemap: Allow arch to request folio size for exec memory
On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential read amplification (due to
> reading too much data around the fault which won't be used), and the
> latter is independent of base page size. I've chosen 64K folio size for
> arm64 which benefits both the 4K and 16K base page size configs and
> shouldn't lead to any read amplification in practice since the old
> read-around path was (usually) reading blocks of 128K. I don't
> anticipate any write amplification because text is always RO.

Is there not also the potential for wasted memory due to ELF alignment?
Kalesh talked about it in the MM BOF at the same time that Ted and I
were discussing it in the FS BOF. Some coordination required (like
maybe Kalesh could have mentioned it to me rather than assuming I'd be
there?)

> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)

I don't think the "arch" really adds much value here.

#define exec_folio_order() get_order(SZ_64K)

> +#ifndef arch_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */
> +static inline int arch_exec_folio_order(void)
> +{
> +	return -1;
> +}

This feels a bit fragile. I often expect to be able to store an order
in an unsigned int. Why not return 0 instead?
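[Editor's note: a minimal userspace sketch, not kernel code, of the failure
mode being described here: the -1 sentinel only survives while every caller
keeps the order in a signed int.]

	#include <stdio.h>

	/*
	 * Sketch only: assigned into the unsigned int that folio orders are
	 * often stored in, the -1 sentinel silently becomes UINT_MAX and any
	 * downstream "order < 0" check can no longer fire.
	 */
	static int arch_exec_folio_order(void)
	{
		return -1; /* "no preference" sentinel from the patch */
	}

	int main(void)
	{
		int sorder = arch_exec_folio_order();
		unsigned int uorder = arch_exec_folio_order();

		printf("signed store:   %d (sentinel check works)\n", sorder);
		printf("unsigned store: %u (sentinel silently lost)\n", uorder);
		return 0;
	}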
+ Kalesh

On 27/03/2025 12:44, Matthew Wilcox wrote:
> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>> So let's special-case the read(ahead) logic for executable mappings. The
>> trade-off is performance improvement (due to more efficient storage of
>> the translations in iTLB) vs potential read amplification (due to
>> reading too much data around the fault which won't be used), and the
>> latter is independent of base page size. I've chosen 64K folio size for
>> arm64 which benefits both the 4K and 16K base page size configs and
>> shouldn't lead to any read amplification in practice since the old
>> read-around path was (usually) reading blocks of 128K. I don't
>> anticipate any write amplification because text is always RO.
>
> Is there not also the potential for wasted memory due to ELF alignment?

I think this is an orthogonal issue? My change isn't making that any worse.

> Kalesh talked about it in the MM BOF at the same time that Ted and I
> were discussing it in the FS BOF. Some coordination required (like
> maybe Kalesh could have mentioned it to me rather than assuming I'd be
> there?)

I was at Kalesh's talk. David H suggested that a potential solution might be
for readahead to ask the fs where the next hole is and then truncate readahead
to avoid reading the hole. Given it's padding, nothing should directly fault
it in so it never ends up in the page cache. Not sure if you discussed
anything like that if you were talking in parallel?

Anyway, I'm not sure if you're suggesting these changes need to be considered
as one somehow or if you're just mentioning it given it is loosely related? My
view is that this change is an improvement independently and could go in much
sooner.

>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>
> I don't think the "arch" really adds much value here.

I was following the pattern used by arch_wants_old_prefaulted_pte(),
arch_has_hw_pte_young(), etc. But I think you're right. I'll change as you
suggest.

> #define exec_folio_order() get_order(SZ_64K)

ooh... get_order()... nice.

>> +#ifndef arch_exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
>> + * preference and mm will not special-case executable memory in the pagecache.
>> + */
>> +static inline int arch_exec_folio_order(void)
>> +{
>> +	return -1;
>> +}
>
> This feels a bit fragile. I often expect to be able to store an order
> in an unsigned int. Why not return 0 instead?

Well 0 is a valid order, no? I think we have had the "is order signed or
unsigned" argument before. get_order() returns a signed int :)

Personally I'd prefer to keep it signed and use a negative value as the
sentinel. I don't think 0 is the right choice because it's a valid order. How
about returning unsigned int and use UINT_MAX as the sentinel?

#define EXEC_FOLIO_ORDER_NONE	UINT_MAX

static inline unsigned int arch_exec_folio_order(void)
{
	return EXEC_FOLIO_ORDER_NONE;
}

Thanks,
Ryan
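[Editor's note: for concreteness, a hypothetical sketch, not posted in the
thread, of how the VM_EXEC hunk from the v3 diff below might look with the
UINT_MAX sentinel proposed here.]

	/*
	 * Hypothetical adaptation of the v3 mm/filemap.c hunk to the proposed
	 * unsigned sentinel; a sketch, not a posted patch.
	 */
	#define EXEC_FOLIO_ORDER_NONE	UINT_MAX

	if (vm_flags & VM_EXEC) {
		unsigned int order = arch_exec_folio_order();

		if (order != EXEC_FOLIO_ORDER_NONE) {
			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
			ra->size = 1UL << order;
			ra->async_size = 0;
			ractl._index &= ~((1UL << order) - 1);
			page_cache_ra_order(&ractl, ra, order);
			return fpin;
		}
	}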
On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:

> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>> So let's special-case the read(ahead) logic for executable mappings. The
>> trade-off is performance improvement (due to more efficient storage of
>> the translations in iTLB) vs potential read amplification (due to
>> reading too much data around the fault which won't be used), and the
>> latter is independent of base page size. I've chosen 64K folio size for
>> arm64 which benefits both the 4K and 16K base page size configs and
>> shouldn't lead to any read amplification in practice since the old
>> read-around path was (usually) reading blocks of 128K. I don't
>> anticipate any write amplification because text is always RO.
>
> Is there not also the potential for wasted memory due to ELF alignment?
> Kalesh talked about it in the MM BOF at the same time that Ted and I
> were discussing it in the FS BOF. Some coordination required (like
> maybe Kalesh could have mentioned it to me rather than assuming I'd be
> there?)
>
>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>
> I don't think the "arch" really adds much value here.
>
> #define exec_folio_order() get_order(SZ_64K)

How about AMD’s PTE coalescing, which does PTE compression at the 16KB or
32KB level? 64KB covers four 16KB units or two 32KB units, so at least it
will not hurt AMD PTE coalescing. Starting with 64KB across all arches might
be simpler to see the performance impact. Just a comment, no objection. :)

Best Regards,
Yan, Zi
On 27/03/2025 20:07, Zi Yan wrote:
> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>
>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>> So let's special-case the read(ahead) logic for executable mappings. The
>>> trade-off is performance improvement (due to more efficient storage of
>>> the translations in iTLB) vs potential read amplification (due to
>>> reading too much data around the fault which won't be used), and the
>>> latter is independent of base page size. I've chosen 64K folio size for
>>> arm64 which benefits both the 4K and 16K base page size configs and
>>> shouldn't lead to any read amplification in practice since the old
>>> read-around path was (usually) reading blocks of 128K. I don't
>>> anticipate any write amplification because text is always RO.
>>
>> Is there not also the potential for wasted memory due to ELF alignment?
>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>> were discussing it in the FS BOF. Some coordination required (like
>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>> there?)
>>
>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>
>> I don't think the "arch" really adds much value here.
>>
>> #define exec_folio_order() get_order(SZ_64K)
>
> How about AMD’s PTE coalescing, which does PTE compression at the 16KB or
> 32KB level? 64KB covers four 16KB units or two 32KB units, so at least it
> will not hurt AMD PTE coalescing. Starting with 64KB across all arches might
> be simpler to see the performance impact. Just a comment, no objection. :)

exec_folio_order() is defined per-architecture and SZ_64K is the arm64
preferred size. At the moment x86 is not opted in, but they could choose to
opt in with 32K (or whatever else makes sense) if the HW supports coalescing.

I'm not sure if you thought this was global and are arguing against that, or
if you are arguing for it to be global because it will more easily show us
performance regressions earlier if x86 is doing this too?

> Best Regards,
> Yan, Zi
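[Editor's note: for illustration only, a hypothetical x86 opt-in — nothing
like this is in the series — would simply mirror the arm64 definition with
whatever size suits the coalescing hardware.]

	/*
	 * Hypothetical x86 opt-in, for illustration only; not part of this
	 * series. With 4K base pages this yields order-3 (32K) folios,
	 * matching the AMD PTE-coalescing window mentioned above.
	 */
	#define arch_exec_folio_order()	ilog2(SZ_32K >> PAGE_SHIFT)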
On 28 Mar 2025, at 9:09, Ryan Roberts wrote:

> On 27/03/2025 20:07, Zi Yan wrote:
>> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>>
>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>> trade-off is performance improvement (due to more efficient storage of
>>>> the translations in iTLB) vs potential read amplification (due to
>>>> reading too much data around the fault which won't be used), and the
>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>> shouldn't lead to any read amplification in practice since the old
>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>> anticipate any write amplification because text is always RO.
>>>
>>> Is there not also the potential for wasted memory due to ELF alignment?
>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>> were discussing it in the FS BOF. Some coordination required (like
>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>> there?)
>>>
>>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>>
>>> I don't think the "arch" really adds much value here.
>>>
>>> #define exec_folio_order() get_order(SZ_64K)
>>
>> How about AMD’s PTE coalescing, which does PTE compression at the 16KB or
>> 32KB level? 64KB covers four 16KB units or two 32KB units, so at least it
>> will not hurt AMD PTE coalescing. Starting with 64KB across all arches might
>> be simpler to see the performance impact. Just a comment, no objection. :)
>
> exec_folio_order() is defined per-architecture and SZ_64K is the arm64
> preferred size. At the moment x86 is not opted in, but they could choose to
> opt in with 32K (or whatever else makes sense) if the HW supports coalescing.

Oh, I missed that part. I thought, since arch_ is not there, it was the same
for all arches.

> I'm not sure if you thought this was global and are arguing against that, or
> if you are arguing for it to be global because it will more easily show us
> performance regressions earlier if x86 is doing this too?

I thought it was global. It might be OK to set it global and let different
arches optimize it as it rolls out. Opt-in might be "never" until someone
looks into it, but if it is global and it changes performance, people will
notice and look into it.

--
Best Regards,
Yan, Zi
On 28/03/2025 09:32, Zi Yan wrote:
> On 28 Mar 2025, at 9:09, Ryan Roberts wrote:
>
>> On 27/03/2025 20:07, Zi Yan wrote:
>>> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>>>
>>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>>> trade-off is performance improvement (due to more efficient storage of
>>>>> the translations in iTLB) vs potential read amplification (due to
>>>>> reading too much data around the fault which won't be used), and the
>>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>>> shouldn't lead to any read amplification in practice since the old
>>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>>> anticipate any write amplification because text is always RO.
>>>>
>>>> Is there not also the potential for wasted memory due to ELF alignment?
>>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>>> were discussing it in the FS BOF. Some coordination required (like
>>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>>> there?)
>>>>
>>>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>>>
>>>> I don't think the "arch" really adds much value here.
>>>>
>>>> #define exec_folio_order() get_order(SZ_64K)
>>>
>>> How about AMD’s PTE coalescing, which does PTE compression at the 16KB or
>>> 32KB level? 64KB covers four 16KB units or two 32KB units, so at least it
>>> will not hurt AMD PTE coalescing. Starting with 64KB across all arches might
>>> be simpler to see the performance impact. Just a comment, no objection. :)
>>
>> exec_folio_order() is defined per-architecture and SZ_64K is the arm64
>> preferred size. At the moment x86 is not opted in, but they could choose to
>> opt in with 32K (or whatever else makes sense) if the HW supports coalescing.
>
> Oh, I missed that part. I thought, since arch_ is not there, it was the same
> for all arches.
>
>> I'm not sure if you thought this was global and are arguing against that, or
>> if you are arguing for it to be global because it will more easily show us
>> performance regressions earlier if x86 is doing this too?
>
> I thought it was global. It might be OK to set it global and let different
> arches optimize it as it rolls out. Opt-in might be "never" until someone
> looks into it, but if it is global and it changes performance, people will
> notice and look into it.

Ahh, now that we are both clear, I'd prefer to stick with the policy as
implemented; exec_folio_order() defaults to "use the existing readahead
method" but can be overridden by arches (arm64) that want specific behaviour
(64K folios).

> --
> Best Regards,
> Yan, Zi
On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
> + Kalesh
>
> On 27/03/2025 12:44, Matthew Wilcox wrote:
> > On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
> >> So let's special-case the read(ahead) logic for executable mappings. The
> >> trade-off is performance improvement (due to more efficient storage of
> >> the translations in iTLB) vs potential read amplification (due to
> >> reading too much data around the fault which won't be used), and the
> >> latter is independent of base page size. I've chosen 64K folio size for
> >> arm64 which benefits both the 4K and 16K base page size configs and
> >> shouldn't lead to any read amplification in practice since the old
> >> read-around path was (usually) reading blocks of 128K. I don't
> >> anticipate any write amplification because text is always RO.
> >
> > Is there not also the potential for wasted memory due to ELF alignment?
>
> I think this is an orthogonal issue? My change isn't making that any worse.

To a certain extent, it is. If readahead was doing order-2 allocations
before and is now doing order-4, you're tying up 0-12 extra pages which
happen to be filled with zeroes due to being used to cache the contents
of a hole.

> > Kalesh talked about it in the MM BOF at the same time that Ted and I
> > were discussing it in the FS BOF. Some coordination required (like
> > maybe Kalesh could have mentioned it to me rather than assuming I'd be
> > there?)
>
> I was at Kalesh's talk. David H suggested that a potential solution might be
> for readahead to ask the fs where the next hole is and then truncate
> readahead to avoid reading the hole. Given it's padding, nothing should
> directly fault it in so it never ends up in the page cache. Not sure if you
> discussed anything like that if you were talking in parallel?

Ted said that he and Kalesh had talked about that solution. I have a
more bold solution in mind which lifts the ext4 extent cache to the
VFS inode so that the readahead code can interrogate it.

> Anyway, I'm not sure if you're suggesting these changes need to be considered
> as one somehow or if you're just mentioning it given it is loosely related?
> My view is that this change is an improvement independently and could go in
> much sooner.

This is not a reason to delay this patch. It's just a downside which
should be mentioned in the commit message.

> >> +static inline int arch_exec_folio_order(void)
> >> +{
> >> +	return -1;
> >> +}
> >
> > This feels a bit fragile. I often expect to be able to store an order
> > in an unsigned int. Why not return 0 instead?
>
> Well 0 is a valid order, no? I think we have had the "is order signed or
> unsigned" argument before. get_order() returns a signed int :)

But why not always return a valid order? I don't think we need a
sentinel. The default value can be 0 to do what we do today.
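[Editor's note: a small runnable sketch of the arithmetic behind "0-12 extra
pages" above, assuming 4K base pages.]

	#include <stdio.h>

	/*
	 * An order-4 folio spans 16 pages where an order-2 folio spans 4, so
	 * when only the first order-2 chunk holds real text and the remainder
	 * covers ELF-alignment padding, up to 12 pages of zeroes stay pinned
	 * by the larger folio.
	 */
	int main(void)
	{
		unsigned int old_pages = 1u << 2; /* order-2 readahead folio */
		unsigned int new_pages = 1u << 4; /* order-4 = 64K with 4K pages */

		printf("pages per folio: %u -> %u, worst-case extra pages: %u\n",
		       old_pages, new_pages, new_pages - old_pages);
		return 0;
	}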
On 28/03/2025 15:14, Matthew Wilcox wrote:
> On Thu, Mar 27, 2025 at 04:23:14PM -0400, Ryan Roberts wrote:
>> + Kalesh
>>
>> On 27/03/2025 12:44, Matthew Wilcox wrote:
>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>> trade-off is performance improvement (due to more efficient storage of
>>>> the translations in iTLB) vs potential read amplification (due to
>>>> reading too much data around the fault which won't be used), and the
>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>> shouldn't lead to any read amplification in practice since the old
>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>> anticipate any write amplification because text is always RO.
>>>
>>> Is there not also the potential for wasted memory due to ELF alignment?
>>
>> I think this is an orthogonal issue? My change isn't making that any worse.
>
> To a certain extent, it is. If readahead was doing order-2 allocations
> before and is now doing order-4, you're tying up 0-12 extra pages which
> happen to be filled with zeroes due to being used to cache the contents
> of a hole.

Well we would still have read them in before; nothing has changed there. But I
guess your point is more about reclaim? Because those pages are now contained
in a larger folio, if part of the folio is in use then all of it remains
active. Whereas before, if the folio was fully contained in the pad area and
never accessed, it would fall down the LRU quickly and get reclaimed.

>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>> were discussing it in the FS BOF. Some coordination required (like
>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>> there?)
>>
>> I was at Kalesh's talk. David H suggested that a potential solution might be
>> for readahead to ask the fs where the next hole is and then truncate
>> readahead to avoid reading the hole. Given it's padding, nothing should
>> directly fault it in so it never ends up in the page cache. Not sure if you
>> discussed anything like that if you were talking in parallel?
>
> Ted said that he and Kalesh had talked about that solution. I have a
> more bold solution in mind which lifts the ext4 extent cache to the
> VFS inode so that the readahead code can interrogate it.
>
>> Anyway, I'm not sure if you're suggesting these changes need to be considered
>> as one somehow or if you're just mentioning it given it is loosely related?
>> My view is that this change is an improvement independently and could go in
>> much sooner.
>
> This is not a reason to delay this patch. It's just a downside which
> should be mentioned in the commit message.

Fair point; I'll add a paragraph about the potential reclaim issue.

>>>> +static inline int arch_exec_folio_order(void)
>>>> +{
>>>> +	return -1;
>>>> +}
>>>
>>> This feels a bit fragile. I often expect to be able to store an order
>>> in an unsigned int. Why not return 0 instead?
>>
>> Well 0 is a valid order, no? I think we have had the "is order signed or
>> unsigned" argument before. get_order() returns a signed int :)
>
> But why not always return a valid order? I don't think we need a
> sentinel. The default value can be 0 to do what we do today.

But a single order-0 folio is not what we do today.
Note that my change as currently implemented requests to read a *single* folio
of the specified order. And note that we only get the order we request from
page_cache_ra_order() because the size is limited to a single folio. If the
size were bigger, that function would actually expand the requested order by 2
(although the parameter is called "new_order", it's actually interpreted as
"old_order"). The current behavior is effectively to read 128K in order-2
folios (with smaller folios for boundary alignment).

So I see a few options:

- Continue to allow non-opted-in arches to use the existing behaviour; in this
  case we need a sentinel. This could be -1, UINT_MAX or 0. But in the latter
  case you are preventing an opted-in arch from specifying that they want
  order-0 - its meaning is overridden.

- Force all arches to use the new approach with a default folio order (and
  readahead size) of order-0. (The default can be overridden per-arch.)
  Personally I'd be nervous about making this change.

- Decouple the read size from the folio order size; continue to use the 128K
  read size and only allow opting in to a specific folio order. The default
  order would be 2 (or 0). We would need to fix page_cache_async_ra() to call
  page_cache_ra_order() with "order + 2" (the new order) and fix
  page_cache_ra_order() to treat its order parameter as the *new* order.

Perhaps we should do those fixes anyway (and then actually start with a folio
order of 0 - which I think you said in the past was your original intention?).

Thanks,
Ryan
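[Editor's note: a numbers-only model, assuming 4K base pages, of the two
behaviours contrasted above; it mirrors the text, not any kernel code path.]

	#include <stdio.h>

	int main(void)
	{
		/* Old read-around: ~128K window filled with order-2 folios. */
		unsigned int old_window = 128 / 4;	/* 32 pages */
		unsigned int old_folio = 1u << 2;	/* 4 pages per folio */

		/* Patched exec path: one naturally aligned order-4 folio. */
		unsigned int new_window = 1u << 4;	/* 16 pages = 64K */

		printf("old: %u-page window, %u folios of %u pages\n",
		       old_window, old_window / old_folio, old_folio);
		printf("new: %u-page window, 1 folio of %u pages, async_size 0\n",
		       new_window, new_window);
		return 0;
	}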
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 15211f74b035..5f75e2ddef02 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1514,6 +1514,20 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
  */
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. The
+ * trade-off here is performance improvement due to storing translations more
+ * efficiently in the iTLB vs the potential for read amplification due to
+ * reading data from disk that won't be used (although this is not a real
+ * concern as readahead is almost always 128K by default so we are actually
+ * potentially reducing the read bandwidth). The latter is independent of base
+ * page size, so we set a page-size independent block size of 64K. This size can
+ * be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
+ * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
+ * use.
+ */
+#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
 static inline bool pud_sect_supported(void)
 {
 	return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 787c632ee2c9..944ff80e8f4f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,6 +456,18 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
+ * preference and mm will not special-case executable memory in the pagecache.
+ */
+static inline int arch_exec_folio_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef arch_check_zapped_pte
 static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
 					 pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index cc69f174f76b..22ff25a60598 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3223,6 +3223,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 }
 #endif
 
+	/*
+	 * Allow arch to request a preferred minimum folio order for executable
+	 * memory. This can often be beneficial to performance if (e.g.) arm64
+	 * can contpte-map the folio. Executable memory rarely benefits from
+	 * readahead anyway, due to its random access nature.
+	 */
+	if (vm_flags & VM_EXEC) {
+		int order = arch_exec_folio_order();
+
+		if (order >= 0) {
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			ra->size = 1UL << order;
+			ra->async_size = 0;
+			ractl._index &= ~((1UL << order) - 1);
+			page_cache_ra_order(&ractl, ra, order);
+			return fpin;
+		}
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;
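[Editor's note: a small runnable sketch of the natural-alignment step in the
mm/filemap.c hunk above; the example fault index is arbitrary.]

	#include <stdio.h>

	/*
	 * ractl._index &= ~((1UL << order) - 1) rounds the faulting page
	 * index down to a folio-size boundary, so the order-4 folio (64K
	 * with 4K pages) that gets read is always naturally aligned.
	 */
	int main(void)
	{
		unsigned long order = 4;
		unsigned long fault_index = 0x12347;
		unsigned long start = fault_index & ~((1UL << order) - 1);

		printf("fault at page index 0x%lx -> folio covers 0x%lx..0x%lx\n",
		       fault_index, start, start + (1UL << order) - 1);
		return 0;
	}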
Change the readahead config so that if it is being requested for an executable
mapping, do a synchronous read of an arch-specified size in a naturally
aligned manner into a folio of the same size (assuming an fs with large folio
support).

On arm64, if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization of the
TLB. When paired with the "multi-size THP" feature, this works well to reduce
dTLB pressure. However iTLB pressure is still high due to executable mappings
having a low likelihood of being in the required folio size and mapping
alignment, even when the filesystem supports readahead into large folios
(e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every time the
readahead mark is hit. But most executable memory tends to be accessed
randomly and so the readahead mark is rarely hit and most executable folios
remain order-2. To make things worse, readahead reduces the folio order to 0
at the readahead window boundaries if required for alignment to those
boundaries.

So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of the
translations in iTLB) vs potential read amplification (due to reading too much
data around the fault which won't be used), and the latter is independent of
base page size. I've chosen 64K folio size for arm64 which benefits both the
4K and 16K base page size configs and shouldn't lead to any read amplification
in practice since the old read-around path was (usually) reading blocks of
128K. I don't anticipate any write amplification because text is always RO.

Note that the text region of an ELF file could be populated into the page
cache for other reasons than taking a fault in a mmapped area. The most common
case is due to the loader read()ing the header, which can be shared with the
beginning of text. So some text will still remain in small folios, but this
simple, best-effort change provides good performance improvements as is.

Benchmarking
============

The below shows nginx and redis benchmarks on an Ampere Altra arm64 system.
First, confirmation that this patch causes more text to be contained in 64K
folios:

| File-backed folios     | system boot     | nginx           | redis           |
| by size as percentage  |-----------------|-----------------|-----------------|
| of all mapped text mem | before | after  | before | after  | before | after  |
|========================|========|========|========|========|========|========|
| base-page-4kB          |    26% |     9% |    27% |     6% |    21% |     5% |
| thp-aligned-8kB        |     4% |     2% |     3% |     0% |     4% |     1% |
| thp-aligned-16kB       |    57% |    21% |    57% |     6% |    54% |    10% |
| thp-aligned-32kB       |     4% |     1% |     4% |     1% |     3% |     1% |
| thp-aligned-64kB       |     7% |    65% |     8% |    85% |     9% |    72% |
| thp-aligned-2048kB     |     0% |     0% |     0% |     0% |     7% |     8% |
| thp-unaligned-16kB     |     1% |     1% |     1% |     1% |     1% |     1% |
| thp-unaligned-32kB     |     0% |     0% |     0% |     0% |     0% |     0% |
| thp-unaligned-64kB     |     0% |     0% |     0% |     1% |     0% |     1% |
| thp-partial            |     1% |     1% |     0% |     0% |     1% |     1% |
|------------------------|--------|--------|--------|--------|--------|--------|
| cont-aligned-64kB      |     7% |    65% |     8% |    85% |    16% |    80% |

The above shows that for both workloads (each isolated with cgroups), as well
as the general system state after boot, the amount of text backed by 4K and
16K folios reduces and the amount backed by 64K folios increases
significantly.
And the amount of text that is contpte-mapped significantly increases (see the
last row). And this is reflected in performance improvement:

| Benchmark                          | Improvement |
+====================================+=============+
| pts/nginx (200 connections)        |       8.96% |
| pts/nginx (1000 connections)       |       6.80% |
+------------------------------------+-------------+
| pts/redis (LPOP, 50 connections)   |       5.07% |
| pts/redis (LPUSH, 50 connections)  |       3.68% |

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Hi All,

This is a follow-up from LSF/MM where we discussed this and there was
consensus to take this simplest approach. I know Dave Chinner had reservations
when I originally posted it last year, but I think he was coming around in the
discussion at [3].

This applies on top of yesterday's mm-unstable (87f556baedc9).

Changes since v2 [2]
====================

- Rename arch_wants_exec_folio_order() to arch_exec_folio_order() (per Andrew)
- Fixed some typos (per Andrew)

Changes since v1 [1]
====================

- Remove "void" from arch_wants_exec_folio_order() macro args list

[1] https://lore.kernel.org/linux-mm/20240111154106.3692206-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/ce3b5402-79b8-415b-9c51-f712bb2b953b@arm.com/

Thanks,
Ryan

 arch/arm64/include/asm/pgtable.h | 14 ++++++++++++++
 include/linux/pgtable.h          | 12 ++++++++++++
 mm/filemap.c                     | 19 +++++++++++++++++++
 3 files changed, 45 insertions(+)

--
2.43.0