diff mbox series

[v2] mm/filemap: Allow arch to request folio size for exec memory

Message ID 20240215154059.2863126-1-ryan.roberts@arm.com (mailing list archive)
State New
Headers show
Series [v2] mm/filemap: Allow arch to request folio size for exec memory | expand

Commit Message

Ryan Roberts Feb. 15, 2024, 3:40 p.m. UTC
Change the readahead config so that if it is being requested for an
executable mapping, do a synchronous read of an arch-specified size in a
naturally aligned manner.

On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" changes, this works
well to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low liklihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).

The reason for the low liklihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every
time the readahead mark is hit. But most executable memory is faulted in
fairly randomly and so the readahead mark is rarely hit and most
executable folios remain order-2. This is observed impirically and
confirmed from discussion with a gnu linker expert; in general, the
linker does nothing to group temporally accessed text together
spacially. Additionally, with the current read-around approach there are
no alignment guarrantees between the file and folio. This is
insufficient for arm64's contpte mapping requirement (order-4 for 4K
base pages).

So it seems reasonable to special-case the read(ahead) logic for
executable mappings. The trade-off is performance improvement (due to
more efficient storage of the translations in iTLB) vs potential read
amplification (due to reading too much data around the fault which won't
be used), and the latter is independent of base page size. I've chosen
64K folio size for arm64 which benefits both the 4K and 16K base page
size configs and shouldn't lead to any further read-amplification since
the old read-around path was (usually) reading blocks of 128K (with the
last 32K being async).

Performance Benchmarking
------------------------

The below shows kernel compilation and speedometer javascript benchmarks
on Ampere Altra arm64 system. (The contpte patch series is applied in
the baseline).

First, confirmation that this patch causes more memory to be contained
in 64K folios (this is for all file-backed memory so includes
non-executable too):

| File-backed folios      |   Speedometer   |  Kernel Compile |
| by size as percentage   |-----------------|-----------------|
| of all mapped file mem  | before |  after | before |  after |
|=========================|========|========|========|========|
|file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
|file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
|file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
|file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
|file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
|file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
|file-thp-partial         |     0% |     0% |     0% |     0% |
|-------------------------|--------|--------|--------|--------|
|file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |

The above shows that for both use cases, the amount of file memory
backed by 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of memory that is contpte-mapped
significantly increases (last line).

And this is reflected in performance improvement:

Kernel Compilation (smaller is faster):
| kernel   |   real-time |   kern-time |   user-time |   peak memory |
|----------|-------------|-------------|-------------|---------------|
| before   |        0.0% |        0.0% |        0.0% |          0.0% |
| after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |

Speedometer (bigger is faster):
| kernel   |   runs_per_min |   peak memory |
|----------|----------------|---------------|
| before   |           0.0% |          0.0% |
| after    |           1.3% |          1.0% |

Both benchmarks show a ~1.5% improvement once the patch is applied.

Alternatives
------------

I considered (and rejected for now - but I anticipate this patch will
stimulate discussion around what the best approach is) alternative
approaches:

  - Expose a global user-controlled knob to set the preferred folio
    size; this would move policy to user space and allow (e.g.) setting
    it to PMD-size for even better iTLB utilizaiton. But this would add
    ABI, and I prefer to start with the simplest approach first. It also
    has the downside that a change wouldn't apply to memory already in
    the page cache that is in active use (e.g. libc) so we don't get the
    same level of utilization as for something that is fixed from boot.

  - Add a per-vma attribute to allow user space to specify preferred
    folio size for memory faulted from the range. (we've talked about
    such a control in the context of mTHP). The dynamic loader would
    then be responsible for adding the annotations. Again this feels
    like something that could be added later if value was demonstrated.

  - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
    This would still require dynamic linker involvement, but would
    additionally neccessitate a copy and all memory in the range would
    be synchronously faulted in, adding to application load time. It
    would work for filesystems that don't support large folios though.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Hi All,

This applies on top of mm-unstable (157d6f218d89). But it requires the series at
[2] to realize the performance improvements (I'm hoping [2] will be added to
mm-unstable imminently).

I didn't get a huge amount of feedback from the RFC, except for Matthew
kind-of-implying that he agrees with the principle. Barry initially raised some
concerns about the potential for performance regression with filesystems that
don't support large folios, but I think he ended up coming around to Matthew's
view that this should be ok in principle.

So I'm trying my luck with a v2 with just one minor change that fixes a benign
bug. As per above I'm seeing performance improvement, so I'm keen to get this
in, but I'm willing to run other benchmarks if others have suggestions for what
to run.

Changes since v1 [1]
====================

 - Remove "void" from arch_wants_exec_folio_order() macro args list

[1] https://lore.kernel.org/linux-mm/20240111154106.3692206-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.com/

Thanks,
Ryan


 arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
 include/linux/pgtable.h          | 12 ++++++++++++
 mm/filemap.c                     | 19 +++++++++++++++++++
 3 files changed, 43 insertions(+)

--
2.25.1

Comments

Andrew Morton Feb. 15, 2024, 10:48 p.m. UTC | #1
On Thu, 15 Feb 2024 15:40:59 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:

> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.

Some nits:

> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1115,6 +1115,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   */
>  #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
> 
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. The
> + * trade-off here is performance improvement due to storing translations more
> + * effciently in the iTLB vs the potential for read amplification due to reading

"efficiently"

> + * data from disk that won't be used. The latter is independent of base page
> + * size, so we set a page-size independent block size of 64K. This size can be
> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> + * use.
> + */
> +#define arch_wants_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
> +

To my eye, "arch_wants_foo" and "arch_want_foo" are booleans.  Either
this arch wants a particular treatment or it does not want it.

I suggest a better name would be "arch_exec_folio_order".

>  static inline bool pud_sect_supported(void)
>  {
>  	return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index aab227e12493..6cdd145cbbb9 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -407,6 +407,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
> 
> +#ifndef arch_wants_exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> + * preference and mm will not special-case executable memory in the pagecache.
> + */

I think this comment contains material which would be useful above the
other arch_wants_exec_folio_order() implementation - the "must be in
range" part.  So I suggest all this material be incorporated into a
single comment which describes arch_wants_exec_folio_order().  Then
this comment can be removed entirely.  Assume the reader knows to go
seek the other definition for the commentary.

> +static inline int arch_wants_exec_folio_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
>  #ifndef arch_check_zapped_pte
>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>  					 pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 142864338ca4..7954274de11c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3118,6 +3118,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  #endif
> 
> +	/*
> +	 * Allow arch to request a preferred minimum folio order for executable
> +	 * memory. This can often be beneficial to performance if (e.g.) arm64
> +	 * can contpte-map the folio. Executable memory rarely benefits from
> +	 * read-ahead anyway, due to its random access nature.

"readahead"

> +	 */
> +	if (vm_flags & VM_EXEC) {
> +		int order = arch_wants_exec_folio_order();
> +
> +		if (order >= 0) {
> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +			ra->size = 1UL << order;
> +			ra->async_size = 0;
> +			ractl._index &= ~((unsigned long)ra->size - 1);
> +			page_cache_ra_order(&ractl, ra, order);
> +			return fpin;
> +		}
> +	}
> +
>  	/* If we don't want any read-ahead, don't bother */
>  	if (vm_flags & VM_RAND_READ)
>  		return fpin;
Dave Chinner Feb. 16, 2024, 12:04 a.m. UTC | #2
On Thu, Feb 15, 2024 at 03:40:59PM +0000, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read of an arch-specified size in a
> naturally aligned manner.
> 
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" changes, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low liklihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
> 
> The reason for the low liklihood is that the current readahead algorithm
> starts with an order-2 folio and increases the folio order by 2 every
> time the readahead mark is hit. But most executable memory is faulted in
> fairly randomly and so the readahead mark is rarely hit and most
> executable folios remain order-2.

Yup, this is a bug in the readahead code, and really has nothing to
do with executable files, mmap or the architecture.  We don't want
some magic new VM_EXEC min folio size per architecture thingy to be
set - we just want readahead to do the right thing.

Indeed, we are already adding a mapping minimum folio order
directive to the address space to allow for filesystem block sizes
greater than PAGE_SIZE. That's the generic mechanism that this
functionality requires. See here:

https://lore.kernel.org/linux-xfs/20240213093713.1753368-5-kernel@pankajraghav.com/

(Probably worth reading some of the other readahead mods in that
series and the discussion because readahead needs to ensure that it
fill entire high order folios in a single IO to avoid partial folio
up-to-date states from partial reads.)

IOWs, it seems to me that we could use this proposed generic mapping
min order functionality when mmap() is run and VM_EXEC is set to set
the min order to, say, 64kB. Then the readahead code would simply do
the right thing, as would all other reads and writes to that
mapping.

We could trigger this in the ->mmap() method of the filesystem so
that filesysetms that can use large folios can turn it on, whilst
other filesystems remain blissfully unaware of the functionality.
Filesystems could also do smarter things here, too. eg. enable PMD
alignment for large mapped files....

-Dave.
Matthew Wilcox Feb. 16, 2024, 12:57 a.m. UTC | #3
On Fri, Feb 16, 2024 at 11:04:00AM +1100, Dave Chinner wrote:
> > The reason for the low liklihood is that the current readahead algorithm
> > starts with an order-2 folio and increases the folio order by 2 every
> > time the readahead mark is hit. But most executable memory is faulted in
> > fairly randomly and so the readahead mark is rarely hit and most
> > executable folios remain order-2.
> 
> Yup, this is a bug in the readahead code, and really has nothing to
> do with executable files, mmap or the architecture.  We don't want
> some magic new VM_EXEC min folio size per architecture thingy to be
> set - we just want readahead to do the right thing.
> 
> Indeed, we are already adding a mapping minimum folio order
> directive to the address space to allow for filesystem block sizes
> greater than PAGE_SIZE. That's the generic mechanism that this
> functionality requires. See here:
> 
> https://lore.kernel.org/linux-xfs/20240213093713.1753368-5-kernel@pankajraghav.com/
> 
> (Probably worth reading some of the other readahead mods in that
> series and the discussion because readahead needs to ensure that it
> fill entire high order folios in a single IO to avoid partial folio
> up-to-date states from partial reads.)
> 
> IOWs, it seems to me that we could use this proposed generic mapping
> min order functionality when mmap() is run and VM_EXEC is set to set
> the min order to, say, 64kB. Then the readahead code would simply do
> the right thing, as would all other reads and writes to that
> mapping.
> 
> We could trigger this in the ->mmap() method of the filesystem so
> that filesysetms that can use large folios can turn it on, whilst
> other filesystems remain blissfully unaware of the functionality.
> Filesystems could also do smarter things here, too. eg. enable PMD
> alignment for large mapped files....

We already enable PMD alignment for large mapped files (which caused
some shrieking from those who believe that ASLR continues to offer
worthwhile protection), commit efa7df3e3bb5.

My problem with your minimum order proposal is that it would be a
hard failure if we couldn't get an order-4 folio.  I think we'd do
better to set the ra_state parameters to 64KiB at mmap time (and I
think that's better done in the MM, not in each FS)
Ryan Roberts Feb. 16, 2024, 10:59 a.m. UTC | #4
On 15/02/2024 22:48, Andrew Morton wrote:
> On Thu, 15 Feb 2024 15:40:59 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read of an arch-specified size in a
>> naturally aligned manner.
> 
> Some nits:

Thanks for taking a look, Andrew!

> 
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1115,6 +1115,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>>   */
>>  #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
>>
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. The
>> + * trade-off here is performance improvement due to storing translations more
>> + * effciently in the iTLB vs the potential for read amplification due to reading
> 
> "efficiently"

ACK; will fix if there is a v3

> 
>> + * data from disk that won't be used. The latter is independent of base page
>> + * size, so we set a page-size independent block size of 64K. This size can be
>> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
>> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
>> + * use.
>> + */
>> +#define arch_wants_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>> +
> 
> To my eye, "arch_wants_foo" and "arch_want_foo" are booleans.  Either
> this arch wants a particular treatment or it does not want it.
> 
> I suggest a better name would be "arch_exec_folio_order".

ACK; will fix if there is a v3

> 
>>  static inline bool pud_sect_supported(void)
>>  {
>>  	return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index aab227e12493..6cdd145cbbb9 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -407,6 +407,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
>> + * preference and mm will not special-case executable memory in the pagecache.
>> + */
> 
> I think this comment contains material which would be useful above the
> other arch_wants_exec_folio_order() implementation - the "must be in
> range" part.  So I suggest all this material be incorporated into a
> single comment which describes arch_wants_exec_folio_order().  Then
> this comment can be removed entirely.  Assume the reader knows to go
> seek the other definition for the commentary.

Hmm... The approach I've been taking for other arch-overridable helpers is to
put the API spec against the default implementation (i.e. here) then put
comments about the specific implementation against the override. If anything I
would prefer to formalize this comment into proper doc header comment and leave
it here (see for example set_ptes(), and in recent patches now in mm-unstable;
get_and_clear_full_ptes(), wrprotect_ptes(), etc).

I'll move all of this to the arm64 code if you really think that's the right
approach, but that's not my personal preference.

Thanks,
Ryan

> 
>> +static inline int arch_wants_exec_folio_order(void)
>> +{
>> +	return -1;
>> +}
>> +#endif
>> +
>>  #ifndef arch_check_zapped_pte
>>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>>  					 pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 142864338ca4..7954274de11c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3118,6 +3118,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>  	}
>>  #endif
>>
>> +	/*
>> +	 * Allow arch to request a preferred minimum folio order for executable
>> +	 * memory. This can often be beneficial to performance if (e.g.) arm64
>> +	 * can contpte-map the folio. Executable memory rarely benefits from
>> +	 * read-ahead anyway, due to its random access nature.
> 
> "readahead"
> 
>> +	 */
>> +	if (vm_flags & VM_EXEC) {
>> +		int order = arch_wants_exec_folio_order();
>> +
>> +		if (order >= 0) {
>> +			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> +			ra->size = 1UL << order;
>> +			ra->async_size = 0;
>> +			ractl._index &= ~((unsigned long)ra->size - 1);
>> +			page_cache_ra_order(&ractl, ra, order);
>> +			return fpin;
>> +		}
>> +	}
>> +
>>  	/* If we don't want any read-ahead, don't bother */
>>  	if (vm_flags & VM_RAND_READ)
>>  		return fpin;
>
Ryan Roberts Feb. 16, 2024, 11:18 a.m. UTC | #5
Hi Dave,

Thanks for taking a look at this! Some comments below...

On 16/02/2024 00:04, Dave Chinner wrote:
> On Thu, Feb 15, 2024 at 03:40:59PM +0000, Ryan Roberts wrote:
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read of an arch-specified size in a
>> naturally aligned manner.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" changes, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low liklihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low liklihood is that the current readahead algorithm
>> starts with an order-2 folio and increases the folio order by 2 every
>> time the readahead mark is hit. But most executable memory is faulted in
>> fairly randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-2.
> 
> Yup, this is a bug in the readahead code, and really has nothing to
> do with executable files, mmap or the architecture.  We don't want
> some magic new VM_EXEC min folio size per architecture thingy to be
> set - we just want readahead to do the right thing.

It sounds like we agree that there is a bug but we don't agree on what the bug
is? My view is that executable segments are accessed in a ~random manner and
therefore readahead (as currently configured) is not very useful. But data may
well be accessed more sequentially and therefore readahead is useful. Given both
data and text can come from the same file, I don't think this can just be a
mapping setting? (my understanding is that there is one "mapping" for the whole
file?) So we need to look to VM_EXEC for that decision.

> 
> Indeed, we are already adding a mapping minimum folio order
> directive to the address space to allow for filesystem block sizes
> greater than PAGE_SIZE. That's the generic mechanism that this
> functionality requires. See here:
> 
> https://lore.kernel.org/linux-xfs/20240213093713.1753368-5-kernel@pankajraghav.com/

Great, I'm vaguely aware of this work, but haven't looked in detail. I'll go
read it. But from your brief description, IIUC, this applies to the whole file,
and is a constraint put in place by the filesystem? Applying to the whole file
may make sense - that means more opportunity for contpte mappings for data pages
too, although I guess this adds more scope for write amplificaiton because data
tends to be writable, and text isn't. But for my use case, its not a hard
constraint, its just a preference which can improve performance. And the
filesystem is the wrong place to make the decision; its the arch that knows
about the performacne opportunities with different block mapping sizes.

As a side note, concerns have been expressed about the possibility of physical
memory fragmentation becoming problematic, meaning we degrade back to small
folios over time with my mTHP work. The intuituon is that if the whole system is
using a few folio sizes in ~equal quantities then we might be ok, but I don't
have any data yet. Do you have any data on fragmentation? I guess this could be
more concerning for your use case?

> 
> (Probably worth reading some of the other readahead mods in that
> series and the discussion because readahead needs to ensure that it
> fill entire high order folios in a single IO to avoid partial folio
> up-to-date states from partial reads.)
> 
> IOWs, it seems to me that we could use this proposed generic mapping
> min order functionality when mmap() is run and VM_EXEC is set to set
> the min order to, say, 64kB. Then the readahead code would simply do
> the right thing, as would all other reads and writes to that
> mapping.

Ahh yes, hooking into your new logic to set a min order based on VM_EXEC sounds
perfect...

> 
> We could trigger this in the ->mmap() method of the filesystem so
> that filesysetms that can use large folios can turn it on, whilst
> other filesystems remain blissfully unaware of the functionality.
> Filesystems could also do smarter things here, too. eg. enable PMD
> alignment for large mapped files....

...but I don't think the filesystem is the right place. The size preference
should be driven by arch IMHO.

Thanks,
Ryan

> 
> -Dave.
Ryan Roberts July 4, 2024, 4:23 p.m. UTC | #6
Hi Dave,

I'm reviving this thread, hoping to make some progress on this. I'd appreciate
your thoughts...


On 16/02/2024 11:18, Ryan Roberts wrote:
> Hi Dave,
> 
> Thanks for taking a look at this! Some comments below...
> 
> On 16/02/2024 00:04, Dave Chinner wrote:
>> On Thu, Feb 15, 2024 at 03:40:59PM +0000, Ryan Roberts wrote:
>>> Change the readahead config so that if it is being requested for an
>>> executable mapping, do a synchronous read of an arch-specified size in a
>>> naturally aligned manner.
>>>
>>> On arm64 if memory is physically contiguous and naturally aligned to the
>>> "contpte" size, we can use contpte mappings, which improves utilization
>>> of the TLB. When paired with the "multi-size THP" changes, this works
>>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>>> executable mappings having a low liklihood of being in the required
>>> folio size and mapping alignment, even when the filesystem supports
>>> readahead into large folios (e.g. XFS).
>>>
>>> The reason for the low liklihood is that the current readahead algorithm
>>> starts with an order-2 folio and increases the folio order by 2 every
>>> time the readahead mark is hit. But most executable memory is faulted in
>>> fairly randomly and so the readahead mark is rarely hit and most
>>> executable folios remain order-2.
>>
>> Yup, this is a bug in the readahead code, and really has nothing to
>> do with executable files, mmap or the architecture.  We don't want
>> some magic new VM_EXEC min folio size per architecture thingy to be
>> set - we just want readahead to do the right thing.
> 
> It sounds like we agree that there is a bug but we don't agree on what the bug
> is? My view is that executable segments are accessed in a ~random manner and
> therefore readahead (as currently configured) is not very useful. But data may
> well be accessed more sequentially and therefore readahead is useful. Given both
> data and text can come from the same file, I don't think this can just be a
> mapping setting? (my understanding is that there is one "mapping" for the whole
> file?) So we need to look to VM_EXEC for that decision.

Additionally, what is "the right thing" in your view?

> 
>>
>> Indeed, we are already adding a mapping minimum folio order
>> directive to the address space to allow for filesystem block sizes
>> greater than PAGE_SIZE. That's the generic mechanism that this
>> functionality requires. See here:
>>
>> https://lore.kernel.org/linux-xfs/20240213093713.1753368-5-kernel@pankajraghav.com/
> 
> Great, I'm vaguely aware of this work, but haven't looked in detail. I'll go
> read it. But from your brief description, IIUC, this applies to the whole file,
> and is a constraint put in place by the filesystem? Applying to the whole file
> may make sense - that means more opportunity for contpte mappings for data pages
> too, although I guess this adds more scope for write amplificaiton because data
> tends to be writable, and text isn't. But for my use case, its not a hard
> constraint, its just a preference which can improve performance. And the
> filesystem is the wrong place to make the decision; its the arch that knows
> about the performacne opportunities with different block mapping sizes.

Having finally taken a proper look at this, I still have the same opinion. I
don't think this (hard) minimum folio order work is the right fit for what I'm
trying to achieve. I need a soft minimum that can still fall back to order-0 (or
the min mapping order), and ideally I want a different soft minimum to be
applied to different parts of the file (exec vs other).

I'm currently thinking about abandoning the arch hook and replacing with sysfs
ABI akin to the mTHP interface. The idea would be that for each size, you could
specify 'never', 'always', 'exec' or 'always+exec'. A maximum one size would be
allowed be marked as 'exec' at a time. The set of sizes marked 'always' would be
the ones considered in page_cache_ra_order(), with fallback to order-0 (or min
mapping order) still allowed. If a size is marked 'exec' then we would take
VM_EXEC path added by this patch and do sync read into folio of that size.

This obviously expands the scope somewhat, but I suspect having the ability to
control the folio orders that get allocated by the pagecache will also help
reduce large folio allocation failure due to fragmentation; if only a couple
folios sizes are in operation in a given system, you are more likely to be able
to reclaim the size that you need.

All just a thought experiment at the moment, and I'll obviously do some
prototyping and large folio allocation success rate measurements. I appreciate
that we don't want to add sysfs controls without good justification. But I
wonder if this could be a more pallatable solution to people, at least in principle?

Thanks,
Ryan

> 
> As a side note, concerns have been expressed about the possibility of physical
> memory fragmentation becoming problematic, meaning we degrade back to small
> folios over time with my mTHP work. The intuituon is that if the whole system is
> using a few folio sizes in ~equal quantities then we might be ok, but I don't
> have any data yet. Do you have any data on fragmentation? I guess this could be
> more concerning for your use case?
> 
>>
>> (Probably worth reading some of the other readahead mods in that
>> series and the discussion because readahead needs to ensure that it
>> fill entire high order folios in a single IO to avoid partial folio
>> up-to-date states from partial reads.)
>>
>> IOWs, it seems to me that we could use this proposed generic mapping
>> min order functionality when mmap() is run and VM_EXEC is set to set
>> the min order to, say, 64kB. Then the readahead code would simply do
>> the right thing, as would all other reads and writes to that
>> mapping.
> 
> Ahh yes, hooking into your new logic to set a min order based on VM_EXEC sounds
> perfect...
> 
>>
>> We could trigger this in the ->mmap() method of the filesystem so
>> that filesysetms that can use large folios can turn it on, whilst
>> other filesystems remain blissfully unaware of the functionality.
>> Filesystems could also do smarter things here, too. eg. enable PMD
>> alignment for large mapped files....
> 
> ...but I don't think the filesystem is the right place. The size preference
> should be driven by arch IMHO.
> 
> Thanks,
> Ryan
> 
>>
>> -Dave.
>
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52d0b0a763f1..fa6365ce61a7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1115,6 +1115,18 @@  static inline void update_mmu_cache_range(struct vm_fault *vmf,
  */
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af

+/*
+ * Request exec memory is read into pagecache in at least 64K folios. The
+ * trade-off here is performance improvement due to storing translations more
+ * effciently in the iTLB vs the potential for read amplification due to reading
+ * data from disk that won't be used. The latter is independent of base page
+ * size, so we set a page-size independent block size of 64K. This size can be
+ * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
+ * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
+ * use.
+ */
+#define arch_wants_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
 static inline bool pud_sect_supported(void)
 {
 	return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..6cdd145cbbb9 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -407,6 +407,18 @@  static inline bool arch_has_hw_pte_young(void)
 }
 #endif

+#ifndef arch_wants_exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
+ * preference and mm will not special-case executable memory in the pagecache.
+ */
+static inline int arch_wants_exec_folio_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef arch_check_zapped_pte
 static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
 					 pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 142864338ca4..7954274de11c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3118,6 +3118,25 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 #endif

+	/*
+	 * Allow arch to request a preferred minimum folio order for executable
+	 * memory. This can often be beneficial to performance if (e.g.) arm64
+	 * can contpte-map the folio. Executable memory rarely benefits from
+	 * read-ahead anyway, due to its random access nature.
+	 */
+	if (vm_flags & VM_EXEC) {
+		int order = arch_wants_exec_folio_order();
+
+		if (order >= 0) {
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			ra->size = 1UL << order;
+			ra->async_size = 0;
+			ractl._index &= ~((unsigned long)ra->size - 1);
+			page_cache_ra_order(&ractl, ra, order);
+			return fpin;
+		}
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;