
[v5,4/6] mm: shmem: add mTHP support for anonymous shmem

Message ID 65796c1e72e51e15f3410195b5c2d5b6c160d411.1718090413.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New
Series add mTHP support for anonymous shmem

Commit Message

Baolin Wang June 11, 2024, 10:11 a.m. UTC
Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, allowing
THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkb/enabled'.

However, anonymous shmem ignores the anonymous mTHP rule configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Users expect to apply the mTHP rule to all anonymous pages,
including anonymous shmem, in order to enjoy the benefits of mTHP: for
example, lower latency than PMD-mapped THP, smaller memory bloat than
PMD-mapped THP, and contiguous PTEs on ARM architectures to reduce TLB
misses. In addition, the mTHP interfaces can be extended to support all
shmem/tmpfs scenarios in the future, especially the shmem mmap() case.

The primary strategy is similar to supporting anonymous mTHP: introduce
a new interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
"inherit" option and dropping the testing options 'force' and 'deny'.
By default all sizes are set to "never" except the PMD size, which is
set to "inherit". This ensures backward compatibility with the top-level
anonymous shmem control, while also allowing anonymous shmem to be
controlled independently for each mTHP size.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  10 +++
 mm/shmem.c              | 187 +++++++++++++++++++++++++++++++++-------
 2 files changed, 167 insertions(+), 30 deletions(-)
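
As a quick illustration of the interface this series adds, here is a
hypothetical user-space sketch (not part of the patch) that flips one of the
new per-size controls; the hugepages-64kB path assumes a 4K base page size,
and the accepted values are the ones described in the commit message:

#include <stdio.h>

int main(void)
{
	/* Per-size control added by this series; the PMD size defaults to
	 * "inherit" while all other sizes default to "never". */
	const char *ctl =
		"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled";
	FILE *f = fopen(ctl, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* Accepted values here: always, inherit, within_size, advise, never. */
	fputs("always", f);
	return fclose(f) ? 1 : 0;
}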

Comments

Ryan Roberts July 3, 2024, 5:25 p.m. UTC | #1
On 11/06/2024 11:11, Baolin Wang wrote:
> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, allowing
> THP to be configured through the sysfs interface located at
> '/sys/kernel/mm/transparent_hugepage/hugepages-XXkb/enabled'.
>
> However, anonymous shmem ignores the anonymous mTHP rule configured
> through the sysfs interface and can only use PMD-mapped THP, which is not
> reasonable. Users expect to apply the mTHP rule to all anonymous pages,
> including anonymous shmem, in order to enjoy the benefits of mTHP: for
> example, lower latency than PMD-mapped THP, smaller memory bloat than
> PMD-mapped THP, and contiguous PTEs on ARM architectures to reduce TLB
> misses. In addition, the mTHP interfaces can be extended to support all
> shmem/tmpfs scenarios in the future, especially the shmem mmap() case.
>
> The primary strategy is similar to supporting anonymous mTHP: introduce
> a new interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
> "inherit" option and dropping the testing options 'force' and 'deny'.
> By default all sizes are set to "never" except the PMD size, which is
> set to "inherit". This ensures backward compatibility with the top-level
> anonymous shmem control, while also allowing anonymous shmem to be
> controlled independently for each mTHP size.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

[...]

Hi Baolin,

I'm currently trying to fix a bug where khugepaged is not started if only shmem
is enabled for THP. See discussion at [1]. It's been broken like this forever.

Assuming anon and shmem THP are initially both disabled:

# echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
<khugepaged is stopped here>

Then shmem THP is enabled:

# echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
<khugepaged is not started, this is a bug>

As part of investigating the fix, I stumbled upon this patch, which I remember
reviewing an early version of, but I've been out for a while and missed the
later versions. See below for comments and questions, the answers to which will
help me figure out how to fix the above...

[1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/


>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
> +				struct vm_area_struct *vma, pgoff_t index,
> +				bool global_huge)
> +{
> +	unsigned long mask = READ_ONCE(huge_shmem_orders_always);
> +	unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
> +	unsigned long vm_flags = vma->vm_flags;
> +	/*
> +	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
> +	 * are enabled for this vma.
> +	 */
> +	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
> +	loff_t i_size;
> +	int order;
> +
> +	if ((vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +		return 0;
> +
> +	/* If the hardware/firmware marked hugepage support disabled. */
> +	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> +		return 0;
> +
> +	/*
> +	 * Following the 'deny' semantics of the top level, force the huge
> +	 * option off from all mounts.
> +	 */
> +	if (shmem_huge == SHMEM_HUGE_DENY)
> +		return 0;

I don't quite get this; I don't think it's the desirable behaviour. This is
saying that if the top-level shmem_enabled control is set to 'deny', then all
mTHP sizes, regardless of their controls' settings, are disabled?

The anon controls don't work like that; you can set the top-level control to
anything you like, but it's only consumed if the per-size controls are inheriting it.

So for the deny case, wouldn't it be better to allow that as an option on all
the per-size controls (and implicitly let it be inherited from the top-level)?

> +
> +	/*
> +	 * Only allow inherit orders if the top-level value is 'force', which
> +	 * means non-PMD sized THP can not override 'huge' mount option now.
> +	 */
> +	if (shmem_huge == SHMEM_HUGE_FORCE)
> +		return READ_ONCE(huge_shmem_orders_inherit);

I vaguely recall that we originally discussed that trying to set 'force' on the
top level control while any per-size controls were set to 'inherit' would be an
error, and trying to set 'force' on any per-size control except the PMD-size
would be an error?

I don't really understand this logic. Shouldn't we just be looking at the
per-size control settings (or the top-level control as a proxy for every
per-size control that has 'inherit' set)?

Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
PMD-size control for decisions.

I'm also really struggling with the concept of shmem_is_huge() existing
alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
shmem_allowable_huge_orders()?

> +
> +	/* Allow mTHP that will be fully within i_size. */
> +	order = highest_order(within_size_orders);
> +	while (within_size_orders) {
> +		index = round_up(index + 1, order);
> +		i_size = round_up(i_size_read(inode), PAGE_SIZE);
> +		if (i_size >> PAGE_SHIFT >= index) {
> +			mask |= within_size_orders;
> +			break;
> +		}
> +
> +		order = next_order(&within_size_orders, order);
> +	}
> +
> +	if (vm_flags & VM_HUGEPAGE)
> +		mask |= READ_ONCE(huge_shmem_orders_madvise);
> +
> +	if (global_huge)

Perhaps I've misunderstood global_huge, but I think it's just the return value
from shmem_is_huge()? But you're also using shmem_huge directly in this
function. I find it all rather confusing.

Sorry if all this was discussed and agreed in my absence! But I think this needs
to be sorted in order to do the bug fix for khugepaged properly.

Thanks,
Ryan


> +		mask |= READ_ONCE(huge_shmem_orders_inherit);
> +
> +	return orders & mask;
> +}
> +
> +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> +					   struct address_space *mapping, pgoff_t index,
> +					   unsigned long orders)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long pages;
> +	int order;
> +
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +	if (!orders)
> +		return 0;
> +
> +	/* Find the highest order that can add into the page cache */
> +	order = highest_order(orders);
> +	while (orders) {
> +		pages = 1UL << order;
> +		index = round_down(index, pages);
> +		if (!xa_find(&mapping->i_pages, &index,
> +			     index + pages - 1, XA_PRESENT))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	return orders;
> +}
> +#else
> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
> +				struct vm_area_struct *vma, pgoff_t index,
> +				bool global_huge)
> +{
> +	return 0;
> +}
> +
> +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> +					   struct address_space *mapping, pgoff_t index,
> +					   unsigned long orders)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
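
(For reference while reading the two loops above: highest_order() and
next_order() walk a bitmask of enabled orders from largest to smallest. A
minimal user-space sketch of that walk, with the two helpers reimplemented
using GCC builtins instead of the kernel's fls_long()-based versions:)

#include <stdio.h>

static int highest_order(unsigned long orders)
{
	/* Index of the most significant set bit; orders must be non-zero. */
	return 8 * sizeof(orders) - 1 - __builtin_clzl(orders);
}

static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);	/* drop the order just tried */
	return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
	/* e.g. 2M, 64K and 16K mTHP sizes enabled on a 4K base page */
	unsigned long orders = (1UL << 9) | (1UL << 4) | (1UL << 2);
	int order = highest_order(orders);

	while (orders) {
		printf("order %d -> %lu pages\n", order, 1UL << order);
		order = next_order(&orders, order);
	}
	return 0;
}
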
> +
>  static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>  		struct shmem_inode_info *info, pgoff_t index)
>  {
> @@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>  	return folio;
>  }
>  
> -static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> -		struct inode *inode, pgoff_t index,
> -		struct mm_struct *fault_mm, bool huge)
> +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> +		gfp_t gfp, struct inode *inode, pgoff_t index,
> +		struct mm_struct *fault_mm, unsigned long orders)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> -	struct folio *folio;
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> +	unsigned long suitable_orders = 0;
> +	struct folio *folio = NULL;
>  	long pages;
> -	int error;
> +	int error, order;
>  
>  	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> -		huge = false;
> +		orders = 0;
>  
> -	if (huge) {
> -		pages = HPAGE_PMD_NR;
> -		index = round_down(index, HPAGE_PMD_NR);
> +	if (orders > 0) {
> +		if (vma && vma_is_anon_shmem(vma)) {
> +			suitable_orders = shmem_suitable_orders(inode, vmf,
> +							mapping, index, orders);
> +		} else if (orders & BIT(HPAGE_PMD_ORDER)) {
> +			pages = HPAGE_PMD_NR;
> +			suitable_orders = BIT(HPAGE_PMD_ORDER);
> +			index = round_down(index, HPAGE_PMD_NR);
>  
> -		/*
> -		 * Check for conflict before waiting on a huge allocation.
> -		 * Conflict might be that a huge page has just been allocated
> -		 * and added to page cache by a racing thread, or that there
> -		 * is already at least one small page in the huge extent.
> -		 * Be careful to retry when appropriate, but not forever!
> -		 * Elsewhere -EEXIST would be the right code, but not here.
> -		 */
> -		if (xa_find(&mapping->i_pages, &index,
> -				index + HPAGE_PMD_NR - 1, XA_PRESENT))
> -			return ERR_PTR(-E2BIG);
> +			/*
> +			 * Check for conflict before waiting on a huge allocation.
> +			 * Conflict might be that a huge page has just been allocated
> +			 * and added to page cache by a racing thread, or that there
> +			 * is already at least one small page in the huge extent.
> +			 * Be careful to retry when appropriate, but not forever!
> +			 * Elsewhere -EEXIST would be the right code, but not here.
> +			 */
> +			if (xa_find(&mapping->i_pages, &index,
> +				    index + HPAGE_PMD_NR - 1, XA_PRESENT))
> +				return ERR_PTR(-E2BIG);
> +		}
>  
> -		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
> -		if (!folio && pages == HPAGE_PMD_NR)
> -			count_vm_event(THP_FILE_FALLBACK);
> +		order = highest_order(suitable_orders);
> +		while (suitable_orders) {
> +			pages = 1UL << order;
> +			index = round_down(index, pages);
> +			folio = shmem_alloc_folio(gfp, order, info, index);
> +			if (folio)
> +				goto allocated;
> +
> +			if (pages == HPAGE_PMD_NR)
> +				count_vm_event(THP_FILE_FALLBACK);
> +			order = next_order(&suitable_orders, order);
> +		}
>  	} else {
>  		pages = 1;
>  		folio = shmem_alloc_folio(gfp, 0, info, index);
> @@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
>  	if (!folio)
>  		return ERR_PTR(-ENOMEM);
>  
> +allocated:
>  	__folio_set_locked(folio);
>  	__folio_set_swapbacked(folio);
>  
> @@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	struct mm_struct *fault_mm;
>  	struct folio *folio;
>  	int error;
> -	bool alloced;
> +	bool alloced, huge;
> +	unsigned long orders = 0;
>  
>  	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
>  		return -EINVAL;
> @@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  		return 0;
>  	}
>  
> -	if (shmem_is_huge(inode, index, false, fault_mm,
> -			  vma ? vma->vm_flags : 0)) {
> +	huge = shmem_is_huge(inode, index, false, fault_mm,
> +			     vma ? vma->vm_flags : 0);
> +	/* Find hugepage orders that are allowed for anonymous shmem. */
> +	if (vma && vma_is_anon_shmem(vma))
> +		orders = shmem_allowable_huge_orders(inode, vma, index, huge);
> +	else if (huge)
> +		orders = BIT(HPAGE_PMD_ORDER);
> +
> +	if (orders > 0) {
>  		gfp_t huge_gfp;
>  
>  		huge_gfp = vma_thp_gfp_mask(vma);
>  		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
> -		folio = shmem_alloc_and_add_folio(huge_gfp,
> -				inode, index, fault_mm, true);
> +		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
> +				inode, index, fault_mm, orders);
>  		if (!IS_ERR(folio)) {
>  			if (folio_test_pmd_mappable(folio))
>  				count_vm_event(THP_FILE_ALLOC);
> @@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  			goto repeat;
>  	}
>  
> -	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
> +	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
>  	if (IS_ERR(folio)) {
>  		error = PTR_ERR(folio);
>  		if (error == -EEXIST)
> @@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  
>  alloced:
>  	alloced = true;
> -	if (folio_test_pmd_mappable(folio) &&
> +	if (folio_test_large(folio) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  					folio_next_index(folio) - 1) {
>  		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
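
The reworked shmem_alloc_and_add_folio() above tries the largest suitable
order first and only falls back to the next enabled order when allocation
fails. A loose user-space sketch of that shape (malloc() standing in for
shmem_alloc_folio(), and the order-4 failure threshold invented purely to
show the fallback path):

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for shmem_alloc_folio(): pretend large orders fail, as they
 * might under memory fragmentation. */
static void *try_alloc(int order)
{
	if (order > 4)
		return NULL;
	return malloc((size_t)4096 << order);
}

int main(void)
{
	/* Suitable orders computed beforehand, e.g. 2M, 64K and 4K. */
	unsigned long suitable_orders = (1UL << 9) | (1UL << 4) | (1UL << 0);
	void *p = NULL;
	int order;

	/* Walk the enabled orders from largest to smallest, like the patch. */
	for (order = 9; order >= 0; order--) {
		if (!(suitable_orders & (1UL << order)))
			continue;
		p = try_alloc(order);
		if (p)
			break;
	}
	if (p) {
		printf("allocated order %d\n", order);
		free(p);
	}
	return 0;
}
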
Baolin Wang July 4, 2024, 11:15 a.m. UTC | #2
On 2024/7/4 01:25, Ryan Roberts wrote:
> On 11/06/2024 11:11, Baolin Wang wrote:
>> [...]
> 
> [...]
> 
> Hi Baolin,
> 
> I'm currently trying to fix a bug where khugepaged is not started if only shmem
> is enabled for THP. See discussion at [1]. It's been broken like this forever.
> 
> Assuming anon and shmem THP are initially both disabled:
> 
> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> <khugepaged is stopped here>
> 
> Then shmem THP is enabled:
> 
> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> <khugepaged is not started, this is a bug>

Thanks Ryan. Yes, this is a real problem.

> As part of investigating the fix, I stumbled upon this patch, which I remember
> reviewing an early version of, but I've been out for a while and missed the
> later versions. See below for comments and questions, the answers to which will
> help me figure out how to fix the above...
> 
> [1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
> 
> 
>>   
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>> +				struct vm_area_struct *vma, pgoff_t index,
>> +				bool global_huge)
>> +{
>> +	unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>> +	unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
>> +	unsigned long vm_flags = vma->vm_flags;
>> +	/*
>> +	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>> +	 * are enabled for this vma.
>> +	 */
>> +	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>> +	loff_t i_size;
>> +	int order;
>> +
>> +	if ((vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return 0;
>> +
>> +	/* If the hardware/firmware marked hugepage support disabled. */
>> +	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>> +		return 0;
>> +
>> +	/*
>> +	 * Following the 'deny' semantics of the top level, force the huge
>> +	 * option off from all mounts.
>> +	 */
>> +	if (shmem_huge == SHMEM_HUGE_DENY)
>> +		return 0;
> 
> I don't quite get this; I don't think it's the desirable behaviour. This is
> saying that if the top-level shmem_enabled control is set to 'deny', then all
> mTHP sizes, regardless of their controls' settings, are disabled?
> 
> The anon controls don't work like that; you can set the top-level control to
> anything you like, but it's only consumed if the per-size controls are inheriting it.

IMO, the 'deny' option is not similar to the 'never' option.

> 
> So for the deny case, wouldn't it be better to allow that as an option on all
> the per-size controls (and implicitly let it be inherited from the top-level)?

From 'deny' option's semantics:
"disables huge on shm_mnt and all mounts, for emergency use;"

It has usually been used, since the old days, for testing to shut down
all huge pages. Moreover, the mTHP interfaces will be used as a huge-order
filter to support tmpfs, so I think it will make life easier to disable
all huge orders for testing or emergency use, as well as maintaining the
original semantics.

>> +
>> +	/*
>> +	 * Only allow inherit orders if the top-level value is 'force', which
>> +	 * means non-PMD sized THP can not override 'huge' mount option now.
>> +	 */
>> +	if (shmem_huge == SHMEM_HUGE_FORCE)
>> +		return READ_ONCE(huge_shmem_orders_inherit);
> 
> I vaguely recall that we originally discussed that trying to set 'force' on the
> top level control while any per-size controls were set to 'inherit' would be an
> error, and trying to set 'force' on any per-size control except the PMD-size
> would be an error?

Right.

> I don't really understand this logic. Shouldn't we just be looking at the
> per-size control settings (or the top-level control as a proxy for every
> per-size control that has 'inherit' set)?

‘force’ will apply the huge orders for anon shmem and tmpfs, so now we 
only allow pmd-mapped THP to be forced. We should not look at per-size 
control settings for tmpfs now (mTHP for tmpfs will be discussed in future).

> 
> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
> PMD-size control for decisions.
> 
> I'm also really struggling with the concept of shmem_is_huge() existing
> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
> shmem_allowable_huge_orders()?

I understood. But now they serve different purposes: shmem_is_huge() 
will be used to check the huge orders for the top level, for *tmpfs* and 
anon shmem; whereas shmem_allowable_huge_orders() will only be used to 
check the per-size huge orders for anon shmem (excluding tmpfs now). 
However, as I plan to add mTHP support for tmpfs, I think we can perform 
some cleanups.

>> +	/* Allow mTHP that will be fully within i_size. */
>> +	order = highest_order(within_size_orders);
>> +	while (within_size_orders) {
>> +		index = round_up(index + 1, order);
>> +		i_size = round_up(i_size_read(inode), PAGE_SIZE);
>> +		if (i_size >> PAGE_SHIFT >= index) {
>> +			mask |= within_size_orders;
>> +			break;
>> +		}
>> +
>> +		order = next_order(&within_size_orders, order);
>> +	}
>> +
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		mask |= READ_ONCE(huge_shmem_orders_madvise);
>> +
>> +	if (global_huge)
> 
> Perhaps I've misunderstood global_huge, but I think it's just the return value
> from shmem_is_huge()? But you're also using shmem_huge directly in this

Yes.

> function. I find it all rather confusing.

I think I have explained above why these logics are needed. Since mTHP
support for shmem has just started (tmpfs is still in progress), I will
make it clearer in the following patches.

> Sorry if all this was discussed and agreed in my absence! But I think this needs
> to be sorted in order to do the bug fix for khugepaged properly.

No worries. Thanks for your input.
Ryan Roberts July 4, 2024, 1:58 p.m. UTC | #3
On 04/07/2024 12:15, Baolin Wang wrote:
> 
> 
> On 2024/7/4 01:25, Ryan Roberts wrote:
>> On 11/06/2024 11:11, Baolin Wang wrote:
>>> [...]
>>
>> [...]
>>
>> Hi Baolin,
>>
>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>
>> Assuming anon and shmem THP are initially both disabled:
>>
>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> <khugepaged is stopped here>
>>
>> Then shmem THP is enabled:
>>
>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>> <khugepaged is not started, this is a bug>
> 
> Thanks Ryan. Yes, this is a real problem.
> 
>> As part of investigating the fix, I stumbled upon this patch, which I remember
>> reviewing an early version of, but I've been out for a while and missed the
>> later versions. See below for comments and questions, the answers to which will
>> help me figure out how to fix the above...
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>
>>
>>>   +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>> +                struct vm_area_struct *vma, pgoff_t index,
>>> +                bool global_huge)
>>> +{
>>> +    unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>> +    unsigned long within_size_orders =
>>> READ_ONCE(huge_shmem_orders_within_size);
>>> +    unsigned long vm_flags = vma->vm_flags;
>>> +    /*
>>> +     * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>> +     * are enabled for this vma.
>>> +     */
>>> +    unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>> +    loff_t i_size;
>>> +    int order;
>>> +
>>> +    if ((vm_flags & VM_NOHUGEPAGE) ||
>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> +        return 0;
>>> +
>>> +    /* If the hardware/firmware marked hugepage support disabled. */
>>> +    if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * Following the 'deny' semantics of the top level, force the huge
>>> +     * option off from all mounts.
>>> +     */
>>> +    if (shmem_huge == SHMEM_HUGE_DENY)
>>> +        return 0;
>>
>> I don't quite get this; I don't think it's the desirable behaviour. This is
>> saying that if the top-level shmem_enabled control is set to 'deny', then all
>> mTHP sizes, regardless of their controls' settings, are disabled?
>>
>> The anon controls don't work like that; you can set the top-level control to
>> anything you like, but it's only consumed if the per-size controls are
>> inheriting it.
> 
> IMO, the 'deny' option is not similar to the 'never' option.
> 
>>
>> So for the deny case, wouldn't it be better to allow that as an option on all
>> the per-size controls (and implicitly let it be inherited from the top-level)?
> 
> From 'deny' option's semantics:
> "disables huge on shm_mnt and all mounts, for emergency use;"

Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
except the PMD-size control, 'deny' and 'never' would be the same practically
speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
disable that size for the tmpfs mounts.

I disagree with the current implementation, where a top-level 'deny'
overrides whatever is in the per-size controls, simply because it's
different to the model implemented for anon THP. That's my 2 cents. If
nobody else agrees then that's ok - I'll fix the above bug according to
the current model.

> 
> It has usually been used, since the old days, for testing to shut down all
> huge pages. Moreover, the mTHP interfaces will be used as a huge-order filter
> to support tmpfs, so I think it will make life easier to disable all huge
> orders for testing or emergency use, as well as maintaining the original
> semantics.
>
>>> +
>>> +    /*
>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>> +     */
>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>
>> I vaguely recall that we originally discussed that trying to set 'force' on the
>> top level control while any per-size controls were set to 'inherit' would be an
>> error, and trying to set 'force' on any per-size control except the PMD-size
>> would be an error?
> 
> Right.
> 
>> I don't really understand this logic. Shouldn't we just be looking at the
>> per-size control settings (or the top-level control as a proxy for every
>> per-size control that has 'inherit' set)?
> 
> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
> allow pmd-mapped THP to be forced. We should not look at per-size control
> settings for tmpfs now (mTHP for tmpfs will be discussed in future).

But why not just maintain per-size controls and refactor tmpfs to just look at
the PMD-size THP control for now. It can ignore the other sizes. That's much
simpler and cleaner IMHO.

> 
>>
>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>> PMD-size control for decisions.
>>
>> I'm also really struggling with the concept of shmem_is_huge() existing
>> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
>> shmem_allowable_huge_orders()?
> 
> I understood. But now they serve different purposes: shmem_is_huge() will be
> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
> whereas shmem_allowable_huge_orders() will only be used to check the per-size
> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
> support for tmpfs, I think we can perform some cleanups.
> 
>>> +    /* Allow mTHP that will be fully within i_size. */
>>> +    order = highest_order(within_size_orders);
>>> +    while (within_size_orders) {
>>> +        index = round_up(index + 1, order);
>>> +        i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>> +        if (i_size >> PAGE_SHIFT >= index) {
>>> +            mask |= within_size_orders;
>>> +            break;
>>> +        }
>>> +
>>> +        order = next_order(&within_size_orders, order);
>>> +    }
>>> +
>>> +    if (vm_flags & VM_HUGEPAGE)
>>> +        mask |= READ_ONCE(huge_shmem_orders_madvise);
>>> +
>>> +    if (global_huge)
>>
>> Perhaps I've misunderstood global_huge, but I think it's just the return value
>> from shmem_is_huge()? But you're also using shmem_huge directly in this
> 
> Yes.
> 
>> function. I find it all rather confusing.
> 
> I think I have explained above why these logics are needed. Since mTHP support
> for shmem has just started (tmpfs is still in progress), I will make it clearer
> in the following patches.

OK as long as you have a plan for the clean up, that's good enough for me.

> 
>> Sorry if all this was discussed and agreed in my absence! But I think this needs
>> to be sorted in order to do the bug fix for khugepaged properly.
> 
> No worries. Thanks for your input.
David Hildenbrand July 4, 2024, 2:01 p.m. UTC | #4
On 04.07.24 15:58, Ryan Roberts wrote:
> On 04/07/2024 12:15, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>> [...]
>>>
>>> [...]
>>>
>>> Hi Baolin,
>>>
>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>
>>> Assuming anon and shmem THP are initially both disabled:
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> <khugepaged is stopped here>
>>>
>>> Then shmem THP is enabled:
>>>
>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> <khugepaged is not started, this is a bug>
>>
>> Thanks Ryan. Yes, this is a real problem.
>>
>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>> reviewing an early version of, but I've been out for a while and missed the
>>> later versions. See below for comments and questions, the answers to which will
>>> help me figure out how to fix the above...
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>
>>>
>>>>    +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> +                struct vm_area_struct *vma, pgoff_t index,
>>>> +                bool global_huge)
>>>> +{
>>>> +    unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>> +    unsigned long within_size_orders =
>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>> +    unsigned long vm_flags = vma->vm_flags;
>>>> +    /*
>>>> +     * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>> +     * are enabled for this vma.
>>>> +     */
>>>> +    unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>> +    loff_t i_size;
>>>> +    int order;
>>>> +
>>>> +    if ((vm_flags & VM_NOHUGEPAGE) ||
>>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>> +        return 0;
>>>> +
>>>> +    /* If the hardware/firmware marked hugepage support disabled. */
>>>> +    if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>> +        return 0;
>>>> +
>>>> +    /*
>>>> +     * Following the 'deny' semantics of the top level, force the huge
>>>> +     * option off from all mounts.
>>>> +     */
>>>> +    if (shmem_huge == SHMEM_HUGE_DENY)
>>>> +        return 0;
>>>
>>> I don't quite get this; I don't think it's the desirable behaviour. This is
>>> saying that if the top-level shmem_enabled control is set to 'deny', then all
>>> mTHP sizes, regardless of their controls' settings, are disabled?
>>>
>>> The anon controls don't work like that; you can set the top-level control to
>>> anything you like, but it's only consumed if the per-size controls are
>>> inheriting it.
>>
>> IMO, the 'deny' option is not similar to the 'never' option.
>>
>>>
>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>> the per-size controls (and implicitly let it be inherited from the top-level)?
>>
>> From 'deny' option's semantics:
>> "disables huge on shm_mnt and all mounts, for emergency use;"
> 
> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
> except the PMD-size control, 'deny' and 'never' would be the same practically
> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
> disable that size for the tmpfs mounts.
> 
> I disagree with the current implementation, where a top-level 'deny'
> overrides whatever is in the per-size controls, simply because it's
> different to the model implemented for anon THP. That's my 2 cents. If
> nobody else agrees then that's ok - I'll fix the above bug according to
> the current model.

IIRC, Hugh said that deny+force are rather legacy artifacts that we 
don't want on sub-controls (and likely we cannot remove them). I agree 
they shouldn't overwrite other toggles. The models we used should better 
match.
Bang Li July 4, 2024, 2:46 p.m. UTC | #5
Hi Baolin,

On 2024/7/4 19:15, Baolin Wang wrote:
>
>>> +
>>> +    /*
>>> +     * Only allow inherit orders if the top-level value is 'force', 
>>> which
>>> +     * means non-PMD sized THP can not override 'huge' mount option 
>>> now.
>>> +     */
>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>
>> I vaguely recall that we originally discussed that trying to set 
>> 'force' on the
>> top level control while any per-size controls were set to 'inherit' 
>> would be an
>> error, and trying to set 'force' on any per-size control except the 
>> PMD-size
>> would be an error?
>
> Right.
>
>> I don't really understand this logic. Shouldn't we just be looking at 
>> the
>> per-size control settings (or the top-level control as a proxy for every
>> per-size control that has 'inherit' set)?
>
> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we 
> only allow pmd-mapped THP to be forced. We should not look at per-size 
> control settings for tmpfs now (mTHP for tmpfs will be discussed in 
> future).
>
>>
>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just 
>> always use the
>> PMD-size control for decisions.
>>
>> I'm also really struggling with the concept of shmem_is_huge()
>> existing alongside shmem_allowable_huge_orders(). Surely this all
>> needs to be refactored into shmem_allowable_huge_orders()?
>
> I understood. But now they serve different purposes: shmem_is_huge() 
> will be used to check the huge orders for the top level, for *tmpfs* 
> and anon shmem; whereas shmem_allowable_huge_orders() will only be 
> used to check the per-size huge orders for anon shmem (excluding tmpfs 
> now). However, as I plan to add mTHP support for tmpfs, I think we can 
> perform some cleanups. 

Please count me in, I'd be happy to contribute to the cleanup and
enhancement process if I can.

Thanks,
Bang
Bang Li July 4, 2024, 3:05 p.m. UTC | #6
Hey Ryan,

On 2024/7/4 21:58, Ryan Roberts wrote:
>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>> PMD-size control for decisions.
>>>
>>> I'm also really struggling with the concept of shmem_is_huge() existing
>>> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
>>> shmem_allowable_huge_orders()?
>> I understood. But now they serve different purposes: shmem_is_huge() will be
>> used to check the huge orders for the top level, for*tmpfs*  and anon shmem;
>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
>> support for tmpfs, I think we can perform some cleanups.
>>
>>>> +    /* Allow mTHP that will be fully within i_size. */
>>>> +    order = highest_order(within_size_orders);
>>>> +    while (within_size_orders) {
>>>> +        index = round_up(index + 1, order);
>>>> +        i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>>> +        if (i_size >> PAGE_SHIFT >= index) {
>>>> +            mask |= within_size_orders;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        order = next_order(&within_size_orders, order);
>>>> +    }
>>>> +
>>>> +    if (vm_flags & VM_HUGEPAGE)
>>>> +        mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>> +
>>>> +    if (global_huge)
>>> Perhaps I've misunderstood global_huge, but I think it's just the return value
>>> from shmem_is_huge()? But you're also using shmem_huge directly in this
>> Yes.
>>
>>> function. I find it all rather confusing.
>> I think I have explained above why these logics are needed. Since mTHP support
>> for shmem has just started (tmpfs is still in progress), I will make it clearer
>> in the following patches.
> OK as long as you have a plan for the clean up, that's good enough for me.

Can I continue to push the following patch [1]? When other types of shmem
mTHP are supported, we will perform cleanups uniformly.

[1] https://lore.kernel.org/linux-mm/20240702023401.41553-1-libang.li@antgroup.com/

Thanks,
Bang
Ryan Roberts July 4, 2024, 3:54 p.m. UTC | #7
On 04/07/2024 16:05, Bang Li wrote:
> Hey Ryan,
> 
> On 2024/7/4 21:58, Ryan Roberts wrote:
>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>> PMD-size control for decisions.
>>>>
>>>> I'm also really struggling with the concept of shmem_is_huge() existing
>>>> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
>>>> shmem_allowable_huge_orders()?
>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>> used to check the huge orders for the top level, for*tmpfs*  and anon shmem;
>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add mTHP
>>> support for tmpfs, I think we can perform some cleanups.
>>>
>>>>> +    /* Allow mTHP that will be fully within i_size. */
>>>>> +    order = highest_order(within_size_orders);
>>>>> +    while (within_size_orders) {
>>>>> +        index = round_up(index + 1, order);
>>>>> +        i_size = round_up(i_size_read(inode), PAGE_SIZE);
>>>>> +        if (i_size >> PAGE_SHIFT >= index) {
>>>>> +            mask |= within_size_orders;
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        order = next_order(&within_size_orders, order);
>>>>> +    }
>>>>> +
>>>>> +    if (vm_flags & VM_HUGEPAGE)
>>>>> +        mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>>> +
>>>>> +    if (global_huge)
>>>> Perhaps I've misunderstood global_huge, but I think it's just the return value
>>>> from shmem_is_huge()? But you're also using shmem_huge directly in this
>>> Yes.
>>>
>>>> function. I find it all rather confusing.
>>> I think I have explained above why these logics are needed. Since mTHP support
>>> for shmem has just started (tmpfs is still in progress), I will make it clearer
>>> in the following patches.
>> OK as long as you have a plan for the clean up, that's good enough for me.
> 
> Can I continue to push the following patch [1]? When other types of shmem mTHP
> are supported, we will perform cleanups uniformly.

I guess that makes sense.

> 
> [1] https://lore.kernel.org/linux-mm/20240702023401.41553-1-libang.li@antgroup.com/
> 
> Thanks,
> Bang
Baolin Wang July 5, 2024, 2:56 a.m. UTC | #8
On 2024/7/4 21:58, Ryan Roberts wrote:
> On 04/07/2024 12:15, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>> [...]
>>>
>>> [...]
>>>
>>> Hi Baolin,
>>>
>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>
>>> Assuming anon and shmem THP are initially both disabled:
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> <khugepaged is stopped here>
>>>
>>> Then shmem THP is enabled:
>>>
>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>> <khugepaged is not started, this is a bug>
>>
>> Thanks Ryan. Yes, this is a real problem.
>>
>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>> reviewing an early version of, but I've been out for a while and missed the
>>> later versions. See below for comments and questions, the answers to which will
>>> help me figure out how to fix the above...
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>
>>>
>>>>    +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> +                struct vm_area_struct *vma, pgoff_t index,
>>>> +                bool global_huge)
>>>> +{
>>>> +    unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>> +    unsigned long within_size_orders =
>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>> +    unsigned long vm_flags = vma->vm_flags;
>>>> +    /*
>>>> +     * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>> +     * are enabled for this vma.
>>>> +     */
>>>> +    unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>> +    loff_t i_size;
>>>> +    int order;
>>>> +
>>>> +    if ((vm_flags & VM_NOHUGEPAGE) ||
>>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>> +        return 0;
>>>> +
>>>> +    /* If the hardware/firmware marked hugepage support disabled. */
>>>> +    if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>> +        return 0;
>>>> +
>>>> +    /*
>>>> +     * Following the 'deny' semantics of the top level, force the huge
>>>> +     * option off from all mounts.
>>>> +     */
>>>> +    if (shmem_huge == SHMEM_HUGE_DENY)
>>>> +        return 0;
>>>
>>> I don't quite get this; I don't think it's the desirable behaviour. This is
>>> saying that if the top-level shmem_enabled control is set to 'deny', then all
>>> mTHP sizes, regardless of their controls' settings, are disabled?
>>>
>>> The anon controls don't work like that; you can set the top-level control to
>>> anything you like, but it's only consumed if the per-size controls are
>>> inheriting it.
>>
>> IMO, the 'deny' option is not similar to the 'never' option.
>>
>>>
>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>> the per-size controls (and implicitly let it be inherited from the top-level)?
>>
>> From 'deny' option's semantics:
>> "disables huge on shm_mnt and all mounts, for emergency use;"
> 
> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
> except the PMD-size control, 'deny' and 'never' would be the same practically
> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
> disable that size for the tmpfs mounts.

Not really. We will not add the 'deny' and 'force' testing options for each
per-size mTHP control, as suggested by Hugh in the previous MM meeting [1].

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u

> I disagree with the current implementation, where a top-level 'deny'
> overrides whatever is in the per-size controls, simply because it's
> different to the model implemented for anon THP. That's my 2 cents. If
> nobody else agrees then that's ok - I'll fix the above bug according to
> the current model.

I remember we have customers who use the 'deny' option to forcibly turn
off the huge page option for all mounts (including shm_mnt). But if
shmem/tmpfs uses mTHP and users set 'deny', they cannot force all huge
orders off; they must also do 'echo never >
/sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled' to
disable each huge order, which breaks previous user habits, IMHO.

In addition, I do not think this diverges from the anon mTHP model, as
anon mTHP does not have these shmem-special 'deny' and 'force' testing
options. In my view, the current 'deny' option fulfills the semantic
definition of 'For use in emergencies, to force the *huge* option off
from all mounts'.
>> It has usually been used, since the old days, for testing to shut down all
>> huge pages. Moreover, the mTHP interfaces will be used as a huge-order filter
>> to support tmpfs, so I think it will make life easier to disable all huge
>> orders for testing or emergency use, as well as maintaining the original
>> semantics.
>>
>>>> +
>>>> +    /*
>>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>>> +     */
>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>
>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>> top level control while any per-size controls were set to 'inherit' would be an
>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>> would be an error?
>>
>> Right.
>>
>>> I don't really understand this logic. Shouldn't we just be looking at the
>>> per-size control settings (or the top-level control as a proxy for every
>>> per-size control that has 'inherit' set)?
>>
>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>> allow pmd-mapped THP to be forced. We should not look at per-size control
>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
> 
> But why not just maintain per-size controls and refactor tmpfs to just look at
> the PMD-size THP control for now. It can ignore the other sizes. That's much
> simpler and cleaner IMHO.

As I said above, 'force' and 'deny' option will not be added for 
per-size controls.
Baolin Wang July 5, 2024, 3:01 a.m. UTC | #9
On 2024/7/4 22:46, Bang Li wrote:
> Hi Baolin,
> 
> On 2024/7/4 19:15, Baolin Wang wrote:
>>
>>>> +
>>>> +    /*
>>>> +     * Only allow inherit orders if the top-level value is 'force', 
>>>> which
>>>> +     * means non-PMD sized THP can not override 'huge' mount option 
>>>> now.
>>>> +     */
>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>
>>> I vaguely recall that we originally discussed that trying to set 
>>> 'force' on the
>>> top level control while any per-size controls were set to 'inherit' 
>>> would be an
>>> error, and trying to set 'force' on any per-size control except the 
>>> PMD-size
>>> would be an error?
>>
>> Right.
>>
>>> I don't really understand this logic. Shouldn't we just be looking at 
>>> the
>>> per-size control settings (or the top-level control as a proxy for every
>>> per-size control that has 'inherit' set)?
>>
>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we 
>> only allow pmd-mapped THP to be forced. We should not look at per-size 
>> control settings for tmpfs now (mTHP for tmpfs will be discussed in 
>> future).
>>
>>>
>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just 
>>> always use the
>>> PMD-size control for decisions.
>>>
>>> I'm also really struggling with the concept of shmem_is_huge()
>>> existing alongside shmem_allowable_huge_orders(). Surely this all
>>> needs to be refactored into shmem_allowable_huge_orders()?
>>
>> I understood. But now they serve different purposes: shmem_is_huge() 
>> will be used to check the huge orders for the top level, for *tmpfs* 
>> and anon shmem; whereas shmem_allowable_huge_orders() will only be 
>> used to check the per-size huge orders for anon shmem (excluding tmpfs 
>> now). However, as I plan to add mTHP support for tmpfs, I think we can 
>> perform some cleanups. 
> 
> Please count me in, I'd be happy to contribute to the cleanup and
> enhancement process if I can.

Good. If you have time, I think you can look at the shmem khugepaged 
issue from the previous discussion [1], which I don't have time to look 
at now.

"
(3) khugepaged

khugepaged needs to handle larger folios properly as well. Until fixed,
using smaller THP sizes as fallback might prohibit collapsing a
PMD-sized THP later. But really, khugepaged needs to be fixed to handle
that.
"

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
Ryan Roberts July 5, 2024, 8:42 a.m. UTC | #10
On 05/07/2024 04:01, Baolin Wang wrote:
> 
> 
> On 2024/7/4 22:46, Bang Li wrote:
>> Hi Baolin,
>>
>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>
>>>>> +
>>>>> +    /*
>>>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>>>> +     */
>>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>>
>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>> would be an error?
>>>
>>> Right.
>>>
>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>> per-size control settings (or the top-level control as a proxy for every
>>>> per-size control that has 'inherit' set)?
>>>
>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so now we only
>>> allow pmd-mapped THP to be forced. We should not look at per-size control
>>> settings for tmpfs now (mTHP for tmpfs will be discussed in future).
>>>
>>>>
>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>> PMD-size control for decisions.
>>>>
>>>> I'm also really struggling with the concept of shmem_is_huge() existing
>>>> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored into
>>>> shmem_allowable_huge_orders()?
>>>
>>> I understood. But now they serve different purposes: shmem_is_huge() will be
>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>> mTHP support for tmpfs, I think we can perform some cleanups. 
>>
>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>> process if I can.
> 
> Good. If you have time, I think you can look at the shmem khugepaged issue from
> the previous discussion [1], which I don't have time to look at now.
> 
> "
> (3) khugepaged
> 
> khugepaged needs to handle larger folios properly as well. Until fixed,
> using smaller THP sizes as fallback might prohibit collapsing a
> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
> that.
> "

khugepaged can already collapse "folios of any order less than PMD-size" to
PMD-size for anon memory. In fact, I modified the selftest to be able to test
that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
THP"). I'd be surprised if khugepaged can't already handle the same for shmem,
although the test will definitely want to be extended to cover it.

Thanks,
Ryan

> 
> [1]
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
Ryan Roberts July 5, 2024, 8:55 a.m. UTC | #11
On 05/07/2024 03:56, Baolin Wang wrote:
> 
> 
> On 2024/7/4 21:58, Ryan Roberts wrote:
>> On 04/07/2024 12:15, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/4 01:25, Ryan Roberts wrote:
>>>> On 11/06/2024 11:11, Baolin Wang wrote:
>>>>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>>>>> can allow THP to be configured through the sysfs interface located at
>>>>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>>>>
>>>>> However, the anonymous shmem will ignore the anonymous mTHP rule
>>>>> configured through the sysfs interface, and can only use the PMD-mapped
>>>>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>>>>> all anonymous pages, including the anonymous shmem, in order to enjoy
>>>>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>>>>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>>>>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>>>>> to support all shmem/tmpfs scenarios in the future, especially for the
>>>>> shmem mmap() case.
>>>>>
>>>>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>>>>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>>>>> which can have almost the same values as the top-level
>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>>>>> additional "inherit" option and dropping the testing options 'force' and
>>>>> 'deny'. By default all sizes will be set to "never" except PMD size,
>>>>> which is set to "inherit". This ensures backward compatibility with the
>>>>> anonymous shmem enabled of the top level, meanwhile also allows independent
>>>>> control of anonymous shmem enabled for each mTHP.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>
>>>> [...]
>>>>
>>>> Hi Baolin,
>>>>
>>>> I'm currently trying to fix a bug where khugepaged is not started if only shmem
>>>> is enabled for THP. See discussion at [1]. It's been broken like this forever.
>>>>
>>>> Assuming anon and shmem THP are initially both disabled:
>>>>
>>>> # echo never > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>>> <khugepaged is stopped here>
>>>>
>>>> Then shemem THP is enabled:
>>>>
>>>> # echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>>>> <khugepaged is not started, this is a bug>
>>>
>>> Thanks Ryan. Yes, this is a real problem.
>>>
>>>> As part of investigating the fix, I stumbled upon this patch, which I remember
>>>> reviewing an early version of, but I've been out for a while and missed the
>>>> later versions. See below for comments and questions; the answers to which
>>>> will help me figure out how to fix the above...
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.com/
>>>>
>>>>
>>>>>    +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +static unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>> +                struct vm_area_struct *vma, pgoff_t index,
>>>>> +                bool global_huge)
>>>>> +{
>>>>> +    unsigned long mask = READ_ONCE(huge_shmem_orders_always);
>>>>> +    unsigned long within_size_orders =
>>>>> READ_ONCE(huge_shmem_orders_within_size);
>>>>> +    unsigned long vm_flags = vma->vm_flags;
>>>>> +    /*
>>>>> +     * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>>>>> +     * are enabled for this vma.
>>>>> +     */
>>>>> +    unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>>>>> +    loff_t i_size;
>>>>> +    int order;
>>>>> +
>>>>> +    if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>> +        return 0;
>>>>> +
>>>>> +    /* If the hardware/firmware marked hugepage support disabled. */
>>>>> +    if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>> +        return 0;
>>>>> +
>>>>> +    /*
>>>>> +     * Following the 'deny' semantics of the top level, force the huge
>>>>> +     * option off from all mounts.
>>>>> +     */
>>>>> +    if (shmem_huge == SHMEM_HUGE_DENY)
>>>>> +        return 0;
>>>>
>>>> I don't quite get this; I don't think it's the desirable behaviour. This is
>>>> saying that if the top-level shmem_enabled control is set to 'deny', then all
>>>> mTHP sizes, regardless of their control's setting, are disabled?
>>>>
>>>> The anon controls don't work like that; you can set the top-level control to
>>>> anything you like, but it's only consumed if the per-size controls are
>>>> inheriting it.
>>>
>>> IMO, 'deny' option is not similar like 'never' option.
>>>
>>>>
>>>> So for the deny case, wouldn't it be better to allow that as an option on all
>>>> the per-size controls (and implicitly let it be inherited from the top-level)?
>>>
>>> From the 'deny' option's semantics:
>>> "disables huge on shm_mnt and all mounts, for emergency use;"
>>
>> Right. Today, tmpfs only supports PMD-sized THP. So for all per-size controls
>> except the PMD-size control, 'deny' and 'never' would be the same practically
>> speaking. For the PMD-size control, 'deny' would disable THP for both anon shmem
>> and all tmpfs mounts, whereas 'never' would only disable THP for anon shmem.
>> When tmpfs gains mTHP support, 'deny' in the other per-size controls would also
>> disable that size for the tmpfs mounts.
> 
> Not really. We will not add 'deny' and 'force' testing options for each per-size
> mTHP control, as suggested by Hugh in the previous MM meeting [1].
> 
> [1]
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/T/#u
> 
>> I disagree with the current implementation, where a top-level 'deny' overrides
>> whatever is in the per-size controls, simply because it's different to the
>> model implemented for anon THP. That's my 2 cents. If nobody else agrees then
>> that's ok - I'll fix the above bug according to the current model.
> 
> I remember we have customers who use the 'deny' option to forcibly turn off the
> huge page option for all mounts (including shm_mnt). But if shmem/tmpfs uses
> mTHP and users set 'deny', they cannot force all huge orders off; they must
> also do 'echo never >
> /sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled' to disable
> all huge orders, which breaks previous user habits, IMHO.

But if they have enabled mTHP for shmem in the first place, they must be mTHP
aware. So it's not unreasonable to expect them to call:

# echo deny > /sys/kernel/mm/transparent_hugepage/hugepages-XXXkB/shmem_enabled

That's the expectation for anon mTHP anyway. If the user enables mTHP then does:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

That *does not* disable all mTHP sizes. It only disables the sizes that have
been set to 'inherit'. The model is that a size is only influenced by the
top-level control if it has explicitly requested 'inherit'. But here you have
this special (and weird in my opinion) case where some values of the top-level
control apply unconditionally to the per-size control and some don't.
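
For concreteness, the anon model is roughly this (a simplified sketch along
the lines of __thp_vma_allowable_orders(), not the exact kernel code):

	/* Per-size anon controls are OR'd together... */
	unsigned long mask = READ_ONCE(huge_anon_orders_always);

	if (vm_flags & VM_HUGEPAGE)
		mask |= READ_ONCE(huge_anon_orders_madvise);
	/* ...and the top-level control only feeds sizes set to 'inherit' */
	if (hugepage_global_always() ||
	    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
		mask |= READ_ONCE(huge_anon_orders_inherit);
	orders &= mask;

So 'echo never > .../enabled' merely stops the 'inherit' sizes from being
OR'd in; per-size controls set to 'always' or 'madvise' are unaffected.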

> 
> In addition, I do not think this creates a difference with the anon mTHP model,
> as anon mTHP does not have shmem's special 'deny' and 'force' options for
> testing purposes. In my view, the current 'deny' option fulfills the semantic
> definition of 'For use in emergencies, to force the *huge* option off from all
> mounts'.

OK. My opinion is logged :) But I'm not hearing anyone joining in support of
that opinion. I have a better understanding of the intent of your model as a
result of this discussion so thanks for that. As long as the documentation is
clear on this behaviour, let's leave it as is.
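
(For the record, the behaviour being settled on here is the pair of checks
in this patch's shmem_allowable_huge_orders(), condensed from the hunk
below:

	/* Top-level 'deny' unconditionally turns every order off... */
	if (shmem_huge == SHMEM_HUGE_DENY)
		return 0;

	/* ...and top-level 'force' honours only the sizes set to 'inherit'. */
	if (shmem_huge == SHMEM_HUGE_FORCE)
		return READ_ONCE(huge_shmem_orders_inherit);

i.e. these two top-level values act before the per-size controls are
consulted, unlike the anon model.)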

>>> It has long been used in testing to shut down all huge pages. Moreover, mTHP
>>> interfaces will be used as a huge order filter to support tmpfs, so I think it
>>> will make life easier to disable all huge orders for testing or emergency use,
>>> as well as to maintain the original semantics.
>
>>>>> +
>>>>> +    /*
>>>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>>>> +     */
>>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>>
>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>> would be an error?
>>>
>>> Right.
>>>
>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>> per-size control settings (or the top-level control as a proxy for every
>>>> per-size control that has 'inherit' set)?
>>>
>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so for now we only
>>> allow PMD-mapped THP to be forced. We should not look at per-size control
>>> settings for tmpfs now (mTHP for tmpfs will be discussed in the future).
>>
>> But why not just maintain per-size controls and refactor tmpfs to just look at
>> the PMD-size THP control for now. It can ignore the other sizes. That's much
>> simpler and cleaner IMHO.
> 
> As I said above, 'force' and 'deny' option will not be added for per-size controls.

Understood.

Thanks,
Ryan
Baolin Wang July 5, 2024, 8:57 a.m. UTC | #12
On 2024/7/5 16:42, Ryan Roberts wrote:
> On 05/07/2024 04:01, Baolin Wang wrote:
>>
>>
>> On 2024/7/4 22:46, Bang Li wrote:
>>> Hi Baolin,
>>>
>>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>>
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>>>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>>>>> +     */
>>>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>>>
>>>>> I vaguely recall that we originally discussed that trying to set 'force' on the
>>>>> top level control while any per-size controls were set to 'inherit' would be an
>>>>> error, and trying to set 'force' on any per-size control except the PMD-size
>>>>> would be an error?
>>>>
>>>> Right.
>>>>
>>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>>> per-size control settings (or the top-level control as a proxy for every
>>>>> per-size control that has 'inherit' set)?
>>>>
>>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so for now we only
>>>> allow PMD-mapped THP to be forced. We should not look at per-size control
>>>> settings for tmpfs now (mTHP for tmpfs will be discussed in the future).
>>>>
>>>>>
>>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use the
>>>>> PMD-size control for decisions.
>>>>>
>>>>> I'm also really struggling with the concept of shmem_is_huge() existing
>>>>> alongside shmem_allowable_huge_orders(). Surely this all needs to be refactored
>>>>> into shmem_allowable_huge_orders()?
>>>>
>>>> Understood. But for now they serve different purposes: shmem_is_huge() will be
>>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>>> mTHP support for tmpfs, I think we can perform some cleanups.
>>>
>>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>>> process if I can.
>>
>> Good. If you have time, I think you can look at the shmem khugepaged issue from
>> the previous discussion [1], which I don't have time to look at now.
>>
>> "
>> (3) khugepaged
>>
>> khugepaged needs to handle larger folios properly as well. Until fixed,
>> using smaller THP sizes as fallback might prohibit collapsing a
>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>> that.
>> "
> 
> khugepaged can already collapse "folios of any order less than PMD-size" to
> PMD-size, for anon memory. In fact I modified the selftest to be able to test
> that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
> THP"). I'd be surprised if khugepaged can't already handle the same for shmem?

I did not test this, but from the comment in hpage_collapse_scan_file(), it
seems that compacting small mTHPs into a single PMD-mapped THP is not
supported yet.

		/*
		 * TODO: khugepaged should compact smaller compound pages
		 * into a PMD sized page
		 */
		if (folio_test_large(folio)) {
			result = folio_order(folio) == HPAGE_PMD_ORDER &&
					folio->index == start
					/* Maybe PMD-mapped */
					? SCAN_PTE_MAPPED_HUGEPAGE
					: SCAN_PAGE_COMPOUND;
			/*
			 * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
			 * by the caller won't touch the page cache, and so
			 * it's safe to skip LRU and refcount checks before
			 * returning.
			 */
			break;
		}

> Although the test will definitely want to be extended to test it.

Right.
Ryan Roberts July 5, 2024, 9:05 a.m. UTC | #13
On 05/07/2024 09:57, Baolin Wang wrote:
> 
> 
> On 2024/7/5 16:42, Ryan Roberts wrote:
>> On 05/07/2024 04:01, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/4 22:46, Bang Li wrote:
>>>> Hi Baolin,
>>>>
>>>> On 2024/7/4 19:15, Baolin Wang wrote:
>>>>>
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Only allow inherit orders if the top-level value is 'force', which
>>>>>>> +     * means non-PMD sized THP can not override 'huge' mount option now.
>>>>>>> +     */
>>>>>>> +    if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>>>> +        return READ_ONCE(huge_shmem_orders_inherit);
>>>>>>
>>>>>> I vaguely recall that we originally discussed that trying to set 'force' on
>>>>>> the top level control while any per-size controls were set to 'inherit'
>>>>>> would be an error, and trying to set 'force' on any per-size control except
>>>>>> the PMD-size would be an error?
>>>>>
>>>>> Right.
>>>>>
>>>>>> I don't really understand this logic. Shouldn't we just be looking at the
>>>>>> per-size control settings (or the top-level control as a proxy for every
>>>>>> per-size control that has 'inherit' set)?
>>>>>
>>>>> ‘force’ will apply the huge orders for anon shmem and tmpfs, so for now we only
>>>>> allow PMD-mapped THP to be forced. We should not look at per-size control
>>>>> settings for tmpfs now (mTHP for tmpfs will be discussed in the future).
>>>>>
>>>>>>
>>>>>> Then for tmpfs, which doesn't support non-PMD-sizes yet, we just always use
>>>>>> the PMD-size control for decisions.
>>>>>>
>>>>>> I'm also really struggling with the concept of shmem_is_huge() existing
>>>>>> alongside shmem_allowable_huge_orders(). Surely this all needs to be
>>>>>> refactored into shmem_allowable_huge_orders()?
>>>>>
>>>>> Understood. But for now they serve different purposes: shmem_is_huge() will be
>>>>> used to check the huge orders for the top level, for *tmpfs* and anon shmem;
>>>>> whereas shmem_allowable_huge_orders() will only be used to check the per-size
>>>>> huge orders for anon shmem (excluding tmpfs now). However, as I plan to add
>>>>> mTHP support for tmpfs, I think we can perform some cleanups.
>>>>
>>>> Please count me in, I'd be happy to contribute to the cleanup and enhancement
>>>> process if I can.
>>>
>>> Good. If you have time, I think you can look at the shmem khugepaged issue from
>>> the previous discussion [1], which I don't have time to look at now.
>>>
>>> "
>>> (3) khugepaged
>>>
>>> khugepaged needs to handle larger folios properly as well. Until fixed,
>>> using smaller THP sizes as fallback might prohibit collapsing a
>>> PMD-sized THP later. But really, khugepaged needs to be fixed to handle
>>> that.
>>> "
>>
>> khugepaged can already collapse "folios of any order less than PMD-size" to
>> PMD-size, for anon memory. In fact I modified the selftest to be able to test
>> that in commit 9f0704eae8a4 ("selftests/mm/khugepaged: enlighten for multi-size
>> THP"). I'd be surprised if khugepaged can't already handle the same for shmem?
> 
> I did not test this, but from the comment in hpage_collapse_scan_file(), it seems
> that compacting small mTHPs into a single PMD-mapped THP is not supported yet.
> 
>         /*
>          * TODO: khugepaged should compact smaller compound pages
>          * into a PMD sized page
>          */
>         if (folio_test_large(folio)) {
>             result = folio_order(folio) == HPAGE_PMD_ORDER &&
>                     folio->index == start
>                     /* Maybe PMD-mapped */
>                     ? SCAN_PTE_MAPPED_HUGEPAGE
>                     : SCAN_PAGE_COMPOUND;
>             /*
>              * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
>              * by the caller won't touch the page cache, and so
>              * it's safe to skip LRU and refcount checks before
>              * returning.
>              */
>             break;
>         }

OK, so the functionality is missing just for shmem/file-backed collapse. Got it.
Sorry for the noise.

> 
>> Although the test will definitely want to be extended to test it.
> 
> Right.

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fac21548c5de..909cfc67521d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -575,6 +575,16 @@  static inline bool thp_migration_supported(void)
 {
 	return false;
 }
+
+static inline int highest_order(unsigned long orders)
+{
+	return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index bb19ea2770e3..e849c88452b2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,107 @@  static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 	return result;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	unsigned long mask = READ_ONCE(huge_shmem_orders_always);
+	unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
+	unsigned long vm_flags = vma->vm_flags;
+	/*
+	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+	 * are enabled for this vma.
+	 */
+	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+	loff_t i_size;
+	int order;
+
+	if ((vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return 0;
+
+	/* If the hardware/firmware marked hugepage support disabled. */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+		return 0;
+
+	/*
+	 * Following the 'deny' semantics of the top level, force the huge
+	 * option off from all mounts.
+	 */
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return 0;
+
+	/*
+	 * Only allow inherit orders if the top-level value is 'force', which
+	 * means non-PMD sized THP can not override 'huge' mount option now.
+	 */
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return READ_ONCE(huge_shmem_orders_inherit);
+
+	/* Allow mTHP that will be fully within i_size. */
+	order = highest_order(within_size_orders);
+	while (within_size_orders) {
+		index = round_up(index + 1, 1 << order);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >> PAGE_SHIFT >= index) {
+			mask |= within_size_orders;
+			break;
+		}
+
+		order = next_order(&within_size_orders, order);
+	}
+
+	if (vm_flags & VM_HUGEPAGE)
+		mask |= READ_ONCE(huge_shmem_orders_madvise);
+
+	if (global_huge)
+		mask |= READ_ONCE(huge_shmem_orders_inherit);
+
+	return orders & mask;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					   struct address_space *mapping, pgoff_t index,
+					   unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pages;
+	int order;
+
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	if (!orders)
+		return 0;
+
+	/* Find the highest order that can add into the page cache */
+	order = highest_order(orders);
+	while (orders) {
+		pages = 1UL << order;
+		index = round_down(index, pages);
+		if (!xa_find(&mapping->i_pages, &index,
+			     index + pages - 1, XA_PRESENT))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static unsigned long shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	return 0;
+}
+
+static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					   struct address_space *mapping, pgoff_t index,
+					   unsigned long orders)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 		struct shmem_inode_info *info, pgoff_t index)
 {
@@ -1625,38 +1726,55 @@  static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 	return folio;
 }
 
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
-		struct inode *inode, pgoff_t index,
-		struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+		gfp_t gfp, struct inode *inode, pgoff_t index,
+		struct mm_struct *fault_mm, unsigned long orders)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct folio *folio;
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long suitable_orders = 0;
+	struct folio *folio = NULL;
 	long pages;
-	int error;
+	int error, order;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		huge = false;
+		orders = 0;
 
-	if (huge) {
-		pages = HPAGE_PMD_NR;
-		index = round_down(index, HPAGE_PMD_NR);
+	if (orders > 0) {
+		if (vma && vma_is_anon_shmem(vma)) {
+			suitable_orders = shmem_suitable_orders(inode, vmf,
+							mapping, index, orders);
+		} else if (orders & BIT(HPAGE_PMD_ORDER)) {
+			pages = HPAGE_PMD_NR;
+			suitable_orders = BIT(HPAGE_PMD_ORDER);
+			index = round_down(index, HPAGE_PMD_NR);
 
-		/*
-		 * Check for conflict before waiting on a huge allocation.
-		 * Conflict might be that a huge page has just been allocated
-		 * and added to page cache by a racing thread, or that there
-		 * is already at least one small page in the huge extent.
-		 * Be careful to retry when appropriate, but not forever!
-		 * Elsewhere -EEXIST would be the right code, but not here.
-		 */
-		if (xa_find(&mapping->i_pages, &index,
-				index + HPAGE_PMD_NR - 1, XA_PRESENT))
-			return ERR_PTR(-E2BIG);
+			/*
+			 * Check for conflict before waiting on a huge allocation.
+			 * Conflict might be that a huge page has just been allocated
+			 * and added to page cache by a racing thread, or that there
+			 * is already at least one small page in the huge extent.
+			 * Be careful to retry when appropriate, but not forever!
+			 * Elsewhere -EEXIST would be the right code, but not here.
+			 */
+			if (xa_find(&mapping->i_pages, &index,
+				    index + HPAGE_PMD_NR - 1, XA_PRESENT))
+				return ERR_PTR(-E2BIG);
+		}
 
-		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
-		if (!folio && pages == HPAGE_PMD_NR)
-			count_vm_event(THP_FILE_FALLBACK);
+		order = highest_order(suitable_orders);
+		while (suitable_orders) {
+			pages = 1UL << order;
+			index = round_down(index, pages);
+			folio = shmem_alloc_folio(gfp, order, info, index);
+			if (folio)
+				goto allocated;
+
+			if (pages == HPAGE_PMD_NR)
+				count_vm_event(THP_FILE_FALLBACK);
+			order = next_order(&suitable_orders, order);
+		}
 	} else {
 		pages = 1;
 		folio = shmem_alloc_folio(gfp, 0, info, index);
@@ -1664,6 +1782,7 @@  static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 	if (!folio)
 		return ERR_PTR(-ENOMEM);
 
+allocated:
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
@@ -1958,7 +2077,8 @@  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	struct mm_struct *fault_mm;
 	struct folio *folio;
 	int error;
-	bool alloced;
+	bool alloced, huge;
+	unsigned long orders = 0;
 
 	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
 		return -EINVAL;
@@ -2030,14 +2150,21 @@  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (shmem_is_huge(inode, index, false, fault_mm,
-			  vma ? vma->vm_flags : 0)) {
+	huge = shmem_is_huge(inode, index, false, fault_mm,
+			     vma ? vma->vm_flags : 0);
+	/* Find hugepage orders that are allowed for anonymous shmem. */
+	if (vma && vma_is_anon_shmem(vma))
+		orders = shmem_allowable_huge_orders(inode, vma, index, huge);
+	else if (huge)
+		orders = BIT(HPAGE_PMD_ORDER);
+
+	if (orders > 0) {
 		gfp_t huge_gfp;
 
 		huge_gfp = vma_thp_gfp_mask(vma);
 		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-		folio = shmem_alloc_and_add_folio(huge_gfp,
-				inode, index, fault_mm, true);
+		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+				inode, index, fault_mm, orders);
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
@@ -2047,7 +2174,7 @@  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 			goto repeat;
 	}
 
-	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
 	if (IS_ERR(folio)) {
 		error = PTR_ERR(folio);
 		if (error == -EEXIST)
@@ -2058,7 +2185,7 @@  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 
 alloced:
 	alloced = true;
-	if (folio_test_pmd_mappable(folio) &&
+	if (folio_test_large(folio) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 					folio_next_index(folio) - 1) {
 		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);