diff mbox series

[5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem

Message ID 6b4afed1ef26dbd08ae9ec58449b329564dcef3e.1714978902.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New
Headers show
Series add mTHP support for anonymous shmem | expand

Commit Message

Baolin Wang May 6, 2024, 8:46 a.m. UTC
To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
directory for each mTHP to control whether shmem is enabled for that mTHP,
with a value similar to the top level 'shmem_enabled', which can be set to:
"always", "inherit (to inherit the top level setting)", "within_size", "advise",
"never", "deny", "force". These values follow the same semantics as the top
level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
to 'always' to keep compatibility.

By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.

In addition, if top level value is 'force', then only PMD-sized hugepages
have enabled="inherit", otherwise configuration will be failed and vice versa.
That means now we will avoid using non-PMD sized THP to override the global
huge allocation.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
 include/linux/huge_mm.h                    | 10 +++
 mm/huge_memory.c                           | 11 +--
 mm/shmem.c                                 | 96 ++++++++++++++++++++++
 4 files changed, 138 insertions(+), 8 deletions(-)

Comments

Ryan Roberts May 7, 2024, 10:52 a.m. UTC | #1
On 06/05/2024 09:46, Baolin Wang wrote:
> To support the use of mTHP with anonymous shmem, add a new sysfs interface
> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
> directory for each mTHP to control whether shmem is enabled for that mTHP,
> with a value similar to the top level 'shmem_enabled', which can be set to:
> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
> "never", "deny", "force". These values follow the same semantics as the top
> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
> to 'always' to keep compatibility.

We decided at [1] to not allow 'force' for non-PMD-sizes.

[1]
https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/

However, thinking about this a bit more, I wonder if the decision we made to
allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
for legacy back compat after all, I don't think there is any use case where
changing multiple mTHP size controls atomically is actually useful). Applying
that pattern here, it means the top level can always take "force" without any
weird error checking. And we would allow "force" on the PMD-sized control but
not on the others - again this is easy to error check.

Does this pattern make more sense? If so, is it too late to change
hugepages-xxkB/enabled interface?

> 
> By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
> sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
> 
> In addition, if top level value is 'force', then only PMD-sized hugepages
> have enabled="inherit", otherwise configuration will be failed and vice versa.
> That means now we will avoid using non-PMD sized THP to override the global
> huge allocation.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 29 +++++++
>  include/linux/huge_mm.h                    | 10 +++
>  mm/huge_memory.c                           | 11 +--
>  mm/shmem.c                                 | 96 ++++++++++++++++++++++
>  4 files changed, 138 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 076443cc10a6..a28496e15bdb 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -332,6 +332,35 @@ deny
>  force
>      Force the huge option on for all - very useful for testing;
>  
> +Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
> +to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
> +Its value for each mTHP is essentially consistent with the global setting, except
> +for the addition of 'inherit' to ensure compatibility with the global settings.
> +always
> +    Attempt to allocate <size> huge pages every time we need a new page;
> +
> +inherit
> +    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
> +    have enabled="inherit" and all other hugepage sizes have enabled="never";
> +
> +never
> +    Do not allocate <size> huge pages;
> +
> +within_size
> +    Only allocate <size> huge page if it will be fully within i_size.
> +    Also respect fadvise()/madvise() hints;
> +
> +advise
> +    Only allocate <size> huge pages if requested with fadvise()/madvise();
> +
> +deny
> +    Has the same semantics as 'never', now mTHP allocation policy is only
> +    used for anonymous shmem and no not override tmpfs.
> +
> +force
> +    Has the same semantics as 'always', now mTHP allocation policy is only
> +    used for anonymous shmem and no not override tmpfs.
> +
>  Need of application restart
>  ===========================
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e49b56c40a11..dbd6b3f56210 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,7 @@
>  #include <linux/mm_types.h>
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
> +#include <linux/kobject.h>
>  
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
>  				  struct kobj_attribute *attr, char *buf,
>  				  enum transparent_hugepage_flag flag);
>  extern struct kobj_attribute shmem_enabled_attr;
> +extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  
>  /*
>   * Mask of all large folio orders supported for anonymous THP; all orders up to
> @@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>  }
>  
> +struct thpsize {
> +	struct kobject kobj;
> +	struct list_head node;
> +	int order;
> +};
> +
> +#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> +
>  enum mthp_stat_item {
>  	MTHP_STAT_ANON_FAULT_ALLOC,
>  	MTHP_STAT_ANON_FAULT_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9efb6fefc391..d3080a8843f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
>  static DEFINE_SPINLOCK(huge_anon_orders_lock);
>  static LIST_HEAD(thpsize_list);
>  
> -struct thpsize {
> -	struct kobject kobj;
> -	struct list_head node;
> -	int order;
> -};
> -
> -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> -
>  static ssize_t thpsize_enabled_show(struct kobject *kobj,
>  				    struct kobj_attribute *attr, char *buf)
>  {
> @@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
>  
>  static struct attribute *thpsize_attrs[] = {
>  	&thpsize_enabled_attr.attr,
> +#ifdef CONFIG_SHMEM
> +	&thpsize_shmem_enabled_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a383ea9a89a5..59cc26d44344 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -131,6 +131,14 @@ struct shmem_options {
>  #define SHMEM_SEEN_QUOTA 32
>  };
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long huge_anon_shmem_orders_always __read_mostly;
> +static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
> +static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
> +static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
> +static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
> +#endif
> +
>  #ifdef CONFIG_TMPFS
>  static unsigned long shmem_default_max_blocks(void)
>  {
> @@ -4687,6 +4695,12 @@ void __init shmem_init(void)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>  	else
>  		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> +
> +	/*
> +	 * Default to setting PMD-sized THP to inherit the global setting and
> +	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
> +	 */
> +	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>  #endif
>  	return;
>  
> @@ -4746,6 +4760,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
>  		return -EINVAL;
>  
> +	/* Do not override huge allocation policy with non-PMD sized mTHP */
> +	if (huge == SHMEM_HUGE_FORCE &&
> +	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
> +		return -EINVAL;
> +
>  	shmem_huge = huge;
>  	if (shmem_huge > SHMEM_HUGE_DENY)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> @@ -4753,6 +4772,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  }
>  
>  struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
> +
> +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
> +					  struct kobj_attribute *attr, char *buf)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	const char *output;
> +
> +	if (test_bit(order, &huge_anon_shmem_orders_always))
> +		output = "[always] inherit within_size advise never deny [force]";
> +	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
> +		output = "always [inherit] within_size advise never deny force";
> +	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
> +		output = "always inherit [within_size] advise never deny force";
> +	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
> +		output = "always inherit within_size [advise] never deny force";
> +	else
> +		output = "always inherit within_size advise [never] [deny] force";
> +
> +	return sysfs_emit(buf, "%s\n", output);
> +}
> +
> +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
> +					   struct kobj_attribute *attr,
> +					   const char *buf, size_t count)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	ssize_t ret = count;
> +
> +	if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_always);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "inherit")) {
> +		/* Do not override huge allocation policy with non-PMD sized mTHP */
> +		if (shmem_huge == SHMEM_HUGE_FORCE &&
> +		    order != HPAGE_PMD_ORDER)
> +			return -EINVAL;
> +
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_inherit);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "within_size")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		set_bit(order, &huge_anon_shmem_orders_within_size);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "madvise")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +struct kobj_attribute thpsize_shmem_enabled_attr =
> +	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
>  
>  #else /* !CONFIG_SHMEM */
Baolin Wang May 8, 2024, 4:45 a.m. UTC | #2
On 2024/5/7 18:52, Ryan Roberts wrote:
> On 06/05/2024 09:46, Baolin Wang wrote:
>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>> with a value similar to the top level 'shmem_enabled', which can be set to:
>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>> "never", "deny", "force". These values follow the same semantics as the top
>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>> to 'always' to keep compatibility.
> 
> We decided at [1] to not allow 'force' for non-PMD-sizes.
> 
> [1]
> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
> 
> However, thinking about this a bit more, I wonder if the decision we made to
> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
> for legacy back compat after all, I don't think there is any use case where
> changing multiple mTHP size controls atomically is actually useful). Applying

Agree. This is also our usage of 'inherit'.

> that pattern here, it means the top level can always take "force" without any
> weird error checking. And we would allow "force" on the PMD-sized control but
> not on the others - again this is easy to error check.
> 
> Does this pattern make more sense? If so, is it too late to change
> hugepages-xxkB/enabled interface?

IMO, this sounds reasonable to me. Let's see what others think, David?
David Hildenbrand May 8, 2024, 7:08 a.m. UTC | #3
On 08.05.24 06:45, Baolin Wang wrote:
> 
> 
> On 2024/5/7 18:52, Ryan Roberts wrote:
>> On 06/05/2024 09:46, Baolin Wang wrote:
>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>> "never", "deny", "force". These values follow the same semantics as the top
>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>> to 'always' to keep compatibility.
>>
>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>
>> However, thinking about this a bit more, I wonder if the decision we made to
>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>> for legacy back compat after all, I don't think there is any use case where
>> changing multiple mTHP size controls atomically is actually useful). Applying
> 
> Agree. This is also our usage of 'inherit'.
> 
>> that pattern here, it means the top level can always take "force" without any
>> weird error checking. And we would allow "force" on the PMD-sized control but
>> not on the others - again this is easy to error check.
>>
>> Does this pattern make more sense? If so, is it too late to change
>> hugepages-xxkB/enabled interface?
> 
> IMO, this sounds reasonable to me. Let's see what others think, David?

Likely too late and we should try not to diverge too much from "enabled" 
for "shmem_enabled".

Having that said, to me it's much cleaner to just have "inherit" for all 
sizes and disallow invalid configurations as discussed.

Error checking cannot be too hard unless I am missing something important :)
David Hildenbrand May 8, 2024, 7:12 a.m. UTC | #4
On 08.05.24 09:08, David Hildenbrand wrote:
> On 08.05.24 06:45, Baolin Wang wrote:
>>
>>
>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>> to 'always' to keep compatibility.
>>>
>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>
>>> However, thinking about this a bit more, I wonder if the decision we made to
>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>> for legacy back compat after all, I don't think there is any use case where
>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>
>> Agree. This is also our usage of 'inherit'.

Missed that one: there might be use cases in the future once we would 
start defaulting to "inherit" for all knobs (a distro might default to 
that) and default-enable THP in the global knob. Then, it would be easy 
to disable any THP by disabling the global knob. (I think that's the 
future we're heading to, where we'd have an "auto" mode that can be set 
on the global toggle).

But I am just making up use cases ;) I think it will be valuable and 
just doing it consistently now might be cleaner.
Ryan Roberts May 8, 2024, 9:02 a.m. UTC | #5
On 08/05/2024 08:12, David Hildenbrand wrote:
> On 08.05.24 09:08, David Hildenbrand wrote:
>> On 08.05.24 06:45, Baolin Wang wrote:
>>>
>>>
>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>> "advise",
>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>> to 'always' to keep compatibility.
>>>>
>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>
>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>> for legacy back compat after all, I don't think there is any use case where
>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>
>>> Agree. This is also our usage of 'inherit'.
> 
> Missed that one: there might be use cases in the future once we would start
> defaulting to "inherit" for all knobs (a distro might default to that) and
> default-enable THP in the global knob. Then, it would be easy to disable any THP
> by disabling the global knob. (I think that's the future we're heading to, where
> we'd have an "auto" mode that can be set on the global toggle).
> 
> But I am just making up use cases ;) I think it will be valuable and just doing
> it consistently now might be cleaner.

I agree that consistency between enabled and shmem_enabled is top priority. And
yes, I had forgotten about the glorious "auto" future. So probably continuing
all sizes to select "inherit" is best.

But for shmem_enabled, that means we need the following error checking:

 - It is an error to set "force" for any size except PMD-size

 - It is an error to set "force" for the global control if any size except PMD-
   size is set to "inherit"

 - It is an error to set "inherit" for any size except PMD-size if the global
   control is set to "force".

Certainly not too difficult to code and prove to be correct, but not the nicest
UX from the user's point of view when they start seeing errors.

I think we previously said this would likely be temporary, and if/when tmpfs
gets mTHP support, we could simplify and allow all sizes to be set to "force".
But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
more suited to the approach the page cache takes to transparently ramp up the
folio size as it faults more in. (Just saying there is a chance that this error
checking becomes permanent).
Baolin Wang May 8, 2024, 9:56 a.m. UTC | #6
On 2024/5/8 17:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
> 
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
> 
> But for shmem_enabled, that means we need the following error checking:
> 
>   - It is an error to set "force" for any size except PMD-size
> 
>   - It is an error to set "force" for the global control if any size except PMD-
>     size is set to "inherit"
> 
>   - It is an error to set "inherit" for any size except PMD-size if the global
>     control is set to "force".
> 
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
> 
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).

The strategy for tmpfs supporting mTHP will require more discussions and 
evaluations in the future. However, regardless of the strategy (explicit 
mTHP control or page cache control), I think it would be possible to use 
'force' to override previous strategies for some testing purposes. This 
appears to be permissible according to the explanation in the current 
documentation: "force the huge option on for all - very useful for 
testing". So it seems not permanent?
Ryan Roberts May 8, 2024, 10:48 a.m. UTC | #7
On 08/05/2024 10:56, Baolin Wang wrote:
> 
> 
> On 2024/5/8 17:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>> one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>> just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>>   - It is an error to set "force" for any size except PMD-size
>>
>>   - It is an error to set "force" for the global control if any size except PMD-
>>     size is set to "inherit"
>>
>>   - It is an error to set "inherit" for any size except PMD-size if the global
>>     control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
> 
> The strategy for tmpfs supporting mTHP will require more discussions and
> evaluations in the future. However, regardless of the strategy (explicit mTHP
> control or page cache control), I think it would be possible to use 'force' to
> override previous strategies for some testing purposes. This appears to be
> permissible according to the explanation in the current documentation: "force
> the huge option on for all - very useful for testing". So it seems not permanent?

Yeah ok, makes sense to me.
David Hildenbrand May 8, 2024, 12:02 p.m. UTC | #8
On 08.05.24 11:02, Ryan Roberts wrote:
> On 08/05/2024 08:12, David Hildenbrand wrote:
>> On 08.05.24 09:08, David Hildenbrand wrote:
>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>> "advise",
>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>> to 'always' to keep compatibility.
>>>>>
>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>
>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>
>>>> Agree. This is also our usage of 'inherit'.
>>
>> Missed that one: there might be use cases in the future once we would start
>> defaulting to "inherit" for all knobs (a distro might default to that) and
>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>> by disabling the global knob. (I think that's the future we're heading to, where
>> we'd have an "auto" mode that can be set on the global toggle).
>>
>> But I am just making up use cases ;) I think it will be valuable and just doing
>> it consistently now might be cleaner.
> 
> I agree that consistency between enabled and shmem_enabled is top priority. And
> yes, I had forgotten about the glorious "auto" future. So probably continuing
> all sizes to select "inherit" is best.
> 
> But for shmem_enabled, that means we need the following error checking:
> 
>   - It is an error to set "force" for any size except PMD-size
> 
>   - It is an error to set "force" for the global control if any size except PMD-
>     size is set to "inherit"
> 
>   - It is an error to set "inherit" for any size except PMD-size if the global
>     control is set to "force".
> 
> Certainly not too difficult to code and prove to be correct, but not the nicest
> UX from the user's point of view when they start seeing errors.
> 
> I think we previously said this would likely be temporary, and if/when tmpfs
> gets mTHP support, we could simplify and allow all sizes to be set to "force".
> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
> more suited to the approach the page cache takes to transparently ramp up the
> folio size as it faults more in. (Just saying there is a chance that this error
> checking becomes permanent).

Note that with shmem you're inherently facing the same memory waste 
issues etc as you would with anonymous memory. (sometimes even worse, if 
you're running shmem that's configured to be unswappable!).
David Hildenbrand May 8, 2024, 12:10 p.m. UTC | #9
On 08.05.24 14:02, David Hildenbrand wrote:
> On 08.05.24 11:02, Ryan Roberts wrote:
>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>> "advise",
>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is equivalent
>>>>>>> to 'always' to keep compatibility.
>>>>>>
>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>
>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong one.
>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is just
>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>> changing multiple mTHP size controls atomically is actually useful). Applying
>>>>>
>>>>> Agree. This is also our usage of 'inherit'.
>>>
>>> Missed that one: there might be use cases in the future once we would start
>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>> default-enable THP in the global knob. Then, it would be easy to disable any THP
>>> by disabling the global knob. (I think that's the future we're heading to, where
>>> we'd have an "auto" mode that can be set on the global toggle).
>>>
>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>> it consistently now might be cleaner.
>>
>> I agree that consistency between enabled and shmem_enabled is top priority. And
>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>> all sizes to select "inherit" is best.
>>
>> But for shmem_enabled, that means we need the following error checking:
>>
>>    - It is an error to set "force" for any size except PMD-size
>>
>>    - It is an error to set "force" for the global control if any size except PMD-
>>      size is set to "inherit"
>>
>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>      control is set to "force".
>>
>> Certainly not too difficult to code and prove to be correct, but not the nicest
>> UX from the user's point of view when they start seeing errors.
>>
>> I think we previously said this would likely be temporary, and if/when tmpfs
>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>> more suited to the approach the page cache takes to transparently ramp up the
>> folio size as it faults more in. (Just saying there is a chance that this error
>> checking becomes permanent).
> 
> Note that with shmem you're inherently facing the same memory waste
> issues etc as you would with anonymous memory. (sometimes even worse, if
> you're running shmem that's configured to be unswappable!).

Also noting that memory waste is not really a problem when a write to a 
shmem file allocates a large folio that stays within boundaries of that 
write; issues only pop up if you end up over-allocating, especially, 
during page faults where you have not that much clue about what to do 
(single address, no real range provided).

There is the other issue that wasting large chunks of contiguous memory 
on stuff that barely benefits from it. With memory that maybe never gets 
evicted, there is no automatic "handing back" of that memory to the 
system to be used by something else. With ordinary files, that's a bit 
different. But I did not look closer into that issue yet, it's one of 
the reasons MADV_HUGEPAGE was added IIRC.
Ryan Roberts May 8, 2024, 12:43 p.m. UTC | #10
On 08/05/2024 13:10, David Hildenbrand wrote:
> On 08.05.24 14:02, David Hildenbrand wrote:
>> On 08.05.24 11:02, Ryan Roberts wrote:
>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>> "advise",
>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>> equivalent
>>>>>>>> to 'always' to keep compatibility.
>>>>>>>
>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>
>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>> one.
>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>> just
>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>> Applying
>>>>>>
>>>>>> Agree. This is also our usage of 'inherit'.
>>>>
>>>> Missed that one: there might be use cases in the future once we would start
>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>> THP
>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>> where
>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>
>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>> it consistently now might be cleaner.
>>>
>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>> all sizes to select "inherit" is best.
>>>
>>> But for shmem_enabled, that means we need the following error checking:
>>>
>>>    - It is an error to set "force" for any size except PMD-size
>>>
>>>    - It is an error to set "force" for the global control if any size except
>>> PMD-
>>>      size is set to "inherit"
>>>
>>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>>      control is set to "force".
>>>
>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>> UX from the user's point of view when they start seeing errors.
>>>
>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>> more suited to the approach the page cache takes to transparently ramp up the
>>> folio size as it faults more in. (Just saying there is a chance that this error
>>> checking becomes permanent).
>>
>> Note that with shmem you're inherently facing the same memory waste
>> issues etc as you would with anonymous memory. (sometimes even worse, if
>> you're running shmem that's configured to be unswappable!).
> 
> Also noting that memory waste is not really a problem when a write to a shmem
> file allocates a large folio that stays within boundaries of that write; issues
> only pop up if you end up over-allocating, especially, during page faults where
> you have not that much clue about what to do (single address, no real range
> provided).
> 
> There is the other issue that wasting large chunks of contiguous memory on stuff
> that barely benefits from it. With memory that maybe never gets evicted, there
> is no automatic "handing back" of that memory to the system to be used by
> something else. With ordinary files, that's a bit different. But I did not look
> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.

OK understood. Although, with tmpfs you're not going to mmap it then randomly
extend the file through page faults - mmap doesn't permit that, I don't think?
So presumably the user must explicitly set the size of the file first? Are you
suggesting there are a lot of use cases where a large tmpfs file is created,
mmaped then only accessed sparsely?
Ryan Roberts May 8, 2024, 12:44 p.m. UTC | #11
On 08/05/2024 13:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>>    - It is an error to set "force" for any size except PMD-size
>>>>
>>>>    - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>>      size is set to "inherit"
>>>>
>>>>    - It is an error to set "inherit" for any size except PMD-size if the global
>>>>      control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within boundaries of that write; issues
>> only pop up if you end up over-allocating, especially, during page faults where
>> you have not that much clue about what to do (single address, no real range
>> provided).
>>
>> There is the other issue that wasting large chunks of contiguous memory on stuff
>> that barely benefits from it. With memory that maybe never gets evicted, there
>> is no automatic "handing back" of that memory to the system to be used by
>> something else. With ordinary files, that's a bit different. But I did not look
>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.
> 
> OK understood. Although, with tmpfs you're not going to mmap it then randomly
> extend the file through page faults - mmap doesn't permit that, I don't think?
> So presumably the user must explicitly set the size of the file first? Are you
> suggesting there are a lot of use cases where a large tmpfs file is created,
> mmaped then only accessed sparsely?

I know that's often the case for anon memory, but not sure if you would expect
the same pattern with an explicit file?
David Hildenbrand May 8, 2024, 12:45 p.m. UTC | #12
On 08.05.24 14:43, Ryan Roberts wrote:
> On 08/05/2024 13:10, David Hildenbrand wrote:
>> On 08.05.24 14:02, David Hildenbrand wrote:
>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>>>>>>>>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be set to:
>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>> "advise",
>>>>>>>>> "never", "deny", "force". These values follow the same semantics as the top
>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>> equivalent
>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>
>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>
>>>>>>>> However, thinking about this a bit more, I wonder if the decision we made to
>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>> one.
>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>> just
>>>>>>>> for legacy back compat after all, I don't think there is any use case where
>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>> Applying
>>>>>>>
>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>
>>>>> Missed that one: there might be use cases in the future once we would start
>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>> THP
>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>> where
>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>
>>>>> But I am just making up use cases ;) I think it will be valuable and just doing
>>>>> it consistently now might be cleaner.
>>>>
>>>> I agree that consistency between enabled and shmem_enabled is top priority. And
>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>> all sizes to select "inherit" is best.
>>>>
>>>> But for shmem_enabled, that means we need the following error checking:
>>>>
>>>>     - It is an error to set "force" for any size except PMD-size
>>>>
>>>>     - It is an error to set "force" for the global control if any size except
>>>> PMD-
>>>>       size is set to "inherit"
>>>>
>>>>     - It is an error to set "inherit" for any size except PMD-size if the global
>>>>       control is set to "force".
>>>>
>>>> Certainly not too difficult to code and prove to be correct, but not the nicest
>>>> UX from the user's point of view when they start seeing errors.
>>>>
>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>> folio size as it faults more in. (Just saying there is a chance that this error
>>>> checking becomes permanent).
>>>
>>> Note that with shmem you're inherently facing the same memory waste
>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>> you're running shmem that's configured to be unswappable!).
>>
>> Also noting that memory waste is not really a problem when a write to a shmem
>> file allocates a large folio that stays within boundaries of that write; issues
>> only pop up if you end up over-allocating, especially, during page faults where
>> you have not that much clue about what to do (single address, no real range
>> provided).
>>
>> There is the other issue that wasting large chunks of contiguous memory on stuff
>> that barely benefits from it. With memory that maybe never gets evicted, there
>> is no automatic "handing back" of that memory to the system to be used by
>> something else. With ordinary files, that's a bit different. But I did not look
>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added IIRC.
> 
> OK understood. Although, with tmpfs you're not going to mmap it then randomly
> extend the file through page faults - mmap doesn't permit that, I don't think?
> So presumably the user must explicitly set the size of the file first? Are you
> suggesting there are a lot of use cases where a large tmpfs file is created,
> mmaped then only accessed sparsely?

I don't know about "a lot of use cases", but for VMs that's certainly 
how it's used.
Ryan Roberts May 8, 2024, 12:54 p.m. UTC | #13
On 08/05/2024 13:45, David Hildenbrand wrote:
> On 08.05.24 14:43, Ryan Roberts wrote:
>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>> interface
>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>> mTHP,
>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>> set to:
>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>> "advise",
>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>> the top
>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>> equivalent
>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>
>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>
>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>> made to
>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>> one.
>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>> just
>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>> where
>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>> Applying
>>>>>>>>
>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>
>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>> THP
>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>> where
>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>
>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>> doing
>>>>>> it consistently now might be cleaner.
>>>>>
>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>> And
>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>> all sizes to select "inherit" is best.
>>>>>
>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>
>>>>>     - It is an error to set "force" for any size except PMD-size
>>>>>
>>>>>     - It is an error to set "force" for the global control if any size except
>>>>> PMD-
>>>>>       size is set to "inherit"
>>>>>
>>>>>     - It is an error to set "inherit" for any size except PMD-size if the
>>>>> global
>>>>>       control is set to "force".
>>>>>
>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>> nicest
>>>>> UX from the user's point of view when they start seeing errors.
>>>>>
>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>> error
>>>>> checking becomes permanent).
>>>>
>>>> Note that with shmem you're inherently facing the same memory waste
>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>> you're running shmem that's configured to be unswappable!).
>>>
>>> Also noting that memory waste is not really a problem when a write to a shmem
>>> file allocates a large folio that stays within boundaries of that write; issues
>>> only pop up if you end up over-allocating, especially, during page faults where
>>> you have not that much clue about what to do (single address, no real range
>>> provided).
>>>
>>> There is the other issue that wasting large chunks of contiguous memory on stuff
>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>> is no automatic "handing back" of that memory to the system to be used by
>>> something else. With ordinary files, that's a bit different. But I did not look
>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>> IIRC.
>>
>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>> extend the file through page faults - mmap doesn't permit that, I don't think?
>> So presumably the user must explicitly set the size of the file first? Are you
>> suggesting there are a lot of use cases where a large tmpfs file is created,
>> mmaped then only accessed sparsely?
> 
> I don't know about "a lot of use cases", but for VMs that's certainly how it's
> used.

Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
than private (or shared) anonymous memory for VMs?
David Hildenbrand May 8, 2024, 1:07 p.m. UTC | #14
On 08.05.24 14:54, Ryan Roberts wrote:
> On 08/05/2024 13:45, David Hildenbrand wrote:
>> On 08.05.24 14:43, Ryan Roberts wrote:
>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>> interface
>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>> mTHP,
>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>> set to:
>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>> "advise",
>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>> the top
>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>> equivalent
>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>
>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>>
>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>> made to
>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
>>>>>>>>>> one.
>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit (this is
>>>>>>>>>> just
>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>> where
>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>> Applying
>>>>>>>>>
>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>
>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable any
>>>>>>> THP
>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>> where
>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>
>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>> doing
>>>>>>> it consistently now might be cleaner.
>>>>>>
>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>> And
>>>>>> yes, I had forgotten about the glorious "auto" future. So probably continuing
>>>>>> all sizes to select "inherit" is best.
>>>>>>
>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>
>>>>>>      - It is an error to set "force" for any size except PMD-size
>>>>>>
>>>>>>      - It is an error to set "force" for the global control if any size except
>>>>>> PMD-
>>>>>>        size is set to "inherit"
>>>>>>
>>>>>>      - It is an error to set "inherit" for any size except PMD-size if the
>>>>>> global
>>>>>>        control is set to "force".
>>>>>>
>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>> nicest
>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>
>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to "force".
>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
>>>>>> more suited to the approach the page cache takes to transparently ramp up the
>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>> error
>>>>>> checking becomes permanent).
>>>>>
>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>> you're running shmem that's configured to be unswappable!).
>>>>
>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>> file allocates a large folio that stays within boundaries of that write; issues
>>>> only pop up if you end up over-allocating, especially, during page faults where
>>>> you have not that much clue about what to do (single address, no real range
>>>> provided).
>>>>
>>>> There is the other issue that wasting large chunks of contiguous memory on stuff
>>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>>> is no automatic "handing back" of that memory to the system to be used by
>>>> something else. With ordinary files, that's a bit different. But I did not look
>>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>>> IIRC.
>>>
>>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>>> extend the file through page faults - mmap doesn't permit that, I don't think?
>>> So presumably the user must explicitly set the size of the file first? Are you
>>> suggesting there are a lot of use cases where a large tmpfs file is created,
>>> mmaped then only accessed sparsely?
>>
>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>> used.
> 

There are more details around that and the sparsity (memory ballooning, 
virtio-mem, free page reporting), but it might distract here :) I'll 
note that shmem+THP is known to be problematic with memory ballooning.

> Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
> than private (or shared) anonymous memory for VMs?

The primary use case I know of is sharing VM memory with other processes 
(usually not child processes): DPDK/SPDK and other vhost-user variants 
(such as virtiofs) mmap() all guest memory to access it directly (some 
sort of multi-process hypervisors). They either use real-file-based 
shmem or memfd (essentially the same without a named file) for that.

Then, there is live-hypervisor upgrade, whereby you start a second 
hypervisor process that will take over. People use shmem for that, so 
you can minimize downtime by migrating guest memory simply by mmap'ing 
the shmem file into the new hypervisor.

Shared anonymous memory is basically never used (I only know of one 
corner case in QEMU).

I would assume that there are also DBs making use of rather sparse 
shmem? But no expert on that.
Ryan Roberts May 8, 2024, 1:44 p.m. UTC | #15
On 08/05/2024 14:07, David Hildenbrand wrote:
> On 08.05.24 14:54, Ryan Roberts wrote:
>> On 08/05/2024 13:45, David Hildenbrand wrote:
>>> On 08.05.24 14:43, Ryan Roberts wrote:
>>>> On 08/05/2024 13:10, David Hildenbrand wrote:
>>>>> On 08.05.24 14:02, David Hildenbrand wrote:
>>>>>> On 08.05.24 11:02, Ryan Roberts wrote:
>>>>>>> On 08/05/2024 08:12, David Hildenbrand wrote:
>>>>>>>> On 08.05.24 09:08, David Hildenbrand wrote:
>>>>>>>>> On 08.05.24 06:45, Baolin Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/5/7 18:52, Ryan Roberts wrote:
>>>>>>>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>>>>>>>> To support the use of mTHP with anonymous shmem, add a new sysfs
>>>>>>>>>>>> interface
>>>>>>>>>>>> 'shmem_enabled' in the
>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>>>>>>>>>>>> directory for each mTHP to control whether shmem is enabled for that
>>>>>>>>>>>> mTHP,
>>>>>>>>>>>> with a value similar to the top level 'shmem_enabled', which can be
>>>>>>>>>>>> set to:
>>>>>>>>>>>> "always", "inherit (to inherit the top level setting)", "within_size",
>>>>>>>>>>>> "advise",
>>>>>>>>>>>> "never", "deny", "force". These values follow the same semantics as
>>>>>>>>>>>> the top
>>>>>>>>>>>> level, except the 'deny' is equivalent to 'never', and 'force' is
>>>>>>>>>>>> equivalent
>>>>>>>>>>>> to 'always' to keep compatibility.
>>>>>>>>>>>
>>>>>>>>>>> We decided at [1] to not allow 'force' for non-PMD-sizes.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@redhat.com/
>>>>>>>>>>>
>>>>>>>>>>> However, thinking about this a bit more, I wonder if the decision we
>>>>>>>>>>> made to
>>>>>>>>>>> allow all hugepages-xxkB/enabled controls to take "inherit" was the
>>>>>>>>>>> wrong
>>>>>>>>>>> one.
>>>>>>>>>>> Perhaps we should have only allowed the PMD-sized enable=inherit
>>>>>>>>>>> (this is
>>>>>>>>>>> just
>>>>>>>>>>> for legacy back compat after all, I don't think there is any use case
>>>>>>>>>>> where
>>>>>>>>>>> changing multiple mTHP size controls atomically is actually useful).
>>>>>>>>>>> Applying
>>>>>>>>>>
>>>>>>>>>> Agree. This is also our usage of 'inherit'.
>>>>>>>>
>>>>>>>> Missed that one: there might be use cases in the future once we would start
>>>>>>>> defaulting to "inherit" for all knobs (a distro might default to that) and
>>>>>>>> default-enable THP in the global knob. Then, it would be easy to disable
>>>>>>>> any
>>>>>>>> THP
>>>>>>>> by disabling the global knob. (I think that's the future we're heading to,
>>>>>>>> where
>>>>>>>> we'd have an "auto" mode that can be set on the global toggle).
>>>>>>>>
>>>>>>>> But I am just making up use cases ;) I think it will be valuable and just
>>>>>>>> doing
>>>>>>>> it consistently now might be cleaner.
>>>>>>>
>>>>>>> I agree that consistency between enabled and shmem_enabled is top priority.
>>>>>>> And
>>>>>>> yes, I had forgotten about the glorious "auto" future. So probably
>>>>>>> continuing
>>>>>>> all sizes to select "inherit" is best.
>>>>>>>
>>>>>>> But for shmem_enabled, that means we need the following error checking:
>>>>>>>
>>>>>>>      - It is an error to set "force" for any size except PMD-size
>>>>>>>
>>>>>>>      - It is an error to set "force" for the global control if any size
>>>>>>> except
>>>>>>> PMD-
>>>>>>>        size is set to "inherit"
>>>>>>>
>>>>>>>      - It is an error to set "inherit" for any size except PMD-size if the
>>>>>>> global
>>>>>>>        control is set to "force".
>>>>>>>
>>>>>>> Certainly not too difficult to code and prove to be correct, but not the
>>>>>>> nicest
>>>>>>> UX from the user's point of view when they start seeing errors.
>>>>>>>
>>>>>>> I think we previously said this would likely be temporary, and if/when tmpfs
>>>>>>> gets mTHP support, we could simplify and allow all sizes to be set to
>>>>>>> "force".
>>>>>>> But I wonder if tmpfs would ever need explicit mTHP control? Maybe it
>>>>>>> would be
>>>>>>> more suited to the approach the page cache takes to transparently ramp up
>>>>>>> the
>>>>>>> folio size as it faults more in. (Just saying there is a chance that this
>>>>>>> error
>>>>>>> checking becomes permanent).
>>>>>>
>>>>>> Note that with shmem you're inherently facing the same memory waste
>>>>>> issues etc as you would with anonymous memory. (sometimes even worse, if
>>>>>> you're running shmem that's configured to be unswappable!).
>>>>>
>>>>> Also noting that memory waste is not really a problem when a write to a shmem
>>>>> file allocates a large folio that stays within boundaries of that write;
>>>>> issues
>>>>> only pop up if you end up over-allocating, especially, during page faults
>>>>> where
>>>>> you have not that much clue about what to do (single address, no real range
>>>>> provided).
>>>>>
>>>>> There is the other issue that wasting large chunks of contiguous memory on
>>>>> stuff
>>>>> that barely benefits from it. With memory that maybe never gets evicted, there
>>>>> is no automatic "handing back" of that memory to the system to be used by
>>>>> something else. With ordinary files, that's a bit different. But I did not
>>>>> look
>>>>> closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
>>>>> IIRC.
>>>>
>>>> OK understood. Although, with tmpfs you're not going to mmap it then randomly
>>>> extend the file through page faults - mmap doesn't permit that, I don't think?
>>>> So presumably the user must explicitly set the size of the file first? Are you
>>>> suggesting there are a lot of use cases where a large tmpfs file is created,
>>>> mmaped then only accessed sparsely?
>>>
>>> I don't know about "a lot of use cases", but for VMs that's certainly how it's
>>> used.
>>
> 
> There are more details around that and the sparsity (memory ballooning,
> virtio-mem, free page reporting), but it might distract here :) I'll note that
> shmem+THP is known to be problematic with memory ballooning.
> 
>> Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
>> than private (or shared) anonymous memory for VMs?
> 
> The primary use case I know of is sharing VM memory with other processes
> (usually not child processes): DPDK/SPDK and other vhost-user variants (such as
> virtiofs) mmap() all guest memory to access it directly (some sort of
> multi-process hypervisors). They either use real-file-based shmem or memfd
> (essentially the same without a named file) for that.
> 
> Then, there is live-hypervisor upgrade, whereby you start a second hypervisor
> process that will take over. People use shmem for that, so you can minimize
> downtime by migrating guest memory simply by mmap'ing the shmem file into the
> new hypervisor.
> 
> Shared anonymous memory is basically never used (I only know of one corner case
> in QEMU).
> 
> I would assume that there are also DBs making use of rather sparse shmem? But no
> expert on that.
> 

That all makes sense - thanks for the lesson!
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 076443cc10a6..a28496e15bdb 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,35 @@  deny
 force
     Force the huge option on for all - very useful for testing;
 
+Anonymous shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob
+to control mTHP allocation: /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled.
+Its value for each mTHP is essentially consistent with the global setting, except
+for the addition of 'inherit' to ensure compatibility with the global settings.
+always
+    Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+    have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+    Do not allocate <size> huge pages;
+
+within_size
+    Only allocate <size> huge page if it will be fully within i_size.
+    Also respect fadvise()/madvise() hints;
+
+advise
+    Only allocate <size> huge pages if requested with fadvise()/madvise();
+
+deny
+    Has the same semantics as 'never', now mTHP allocation policy is only
+    used for anonymous shmem and no not override tmpfs.
+
+force
+    Has the same semantics as 'always', now mTHP allocation policy is only
+    used for anonymous shmem and no not override tmpfs.
+
 Need of application restart
 ===========================
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e49b56c40a11..dbd6b3f56210 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@ 
 #include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@  ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf,
 				  enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 /*
  * Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@  unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
 
+struct thpsize {
+	struct kobject kobj;
+	struct list_head node;
+	int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
 enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9efb6fefc391..d3080a8843f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@  static void thpsize_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
 static LIST_HEAD(thpsize_list);
 
-struct thpsize {
-	struct kobject kobj;
-	struct list_head node;
-	int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
 static ssize_t thpsize_enabled_show(struct kobject *kobj,
 				    struct kobj_attribute *attr, char *buf)
 {
@@ -517,6 +509,9 @@  static struct kobj_attribute thpsize_enabled_attr =
 
 static struct attribute *thpsize_attrs[] = {
 	&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+	&thpsize_shmem_enabled_attr.attr,
+#endif
 	NULL,
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index a383ea9a89a5..59cc26d44344 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@  struct shmem_options {
 #define SHMEM_SEEN_QUOTA 32
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_anon_shmem_orders_always __read_mostly;
+static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
+static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
+static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
+#endif
+
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -4687,6 +4695,12 @@  void __init shmem_init(void)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+	/*
+	 * Default to setting PMD-sized THP to inherit the global setting and
+	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
+	 */
+	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
 #endif
 	return;
 
@@ -4746,6 +4760,11 @@  static ssize_t shmem_enabled_store(struct kobject *kobj,
 			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
 		return -EINVAL;
 
+	/* Do not override huge allocation policy with non-PMD sized mTHP */
+	if (huge == SHMEM_HUGE_FORCE &&
+	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+		return -EINVAL;
+
 	shmem_huge = huge;
 	if (shmem_huge > SHMEM_HUGE_DENY)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4753,6 +4772,83 @@  static ssize_t shmem_enabled_store(struct kobject *kobj,
 }
 
 struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_anon_shmem_orders_always))
+		output = "[always] inherit within_size advise never deny [force]";
+	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
+		output = "always [inherit] within_size advise never deny force";
+	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
+		output = "always inherit [within_size] advise never deny force";
+	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
+		output = "always inherit within_size [advise] never deny force";
+	else
+		output = "always inherit within_size advise [never] [deny] force";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always") || sysfs_streq(buf, "force")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_always);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "inherit")) {
+		/* Do not override huge allocation policy with non-PMD sized mTHP */
+		if (shmem_huge == SHMEM_HUGE_FORCE &&
+		    order != HPAGE_PMD_ORDER)
+			return -EINVAL;
+
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_inherit);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "within_size")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		set_bit(order, &huge_anon_shmem_orders_within_size);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "madvise")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "never") || sysfs_streq(buf, "deny")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else {
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
 
 #else /* !CONFIG_SHMEM */