diff mbox series

[v1,2/2] mm: mTHP stats for pagecache folio allocations

Message ID 20240711072929.3590000-3-ryan.roberts@arm.com (mailing list archive)
State New
Series mTHP allocation stats for file-backed memory

Commit Message

Ryan Roberts July 11, 2024, 7:29 a.m. UTC
Expose 3 new mTHP stats for file (pagecache) folio allocations:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge

This will provide some insight on the sizes of large folios being
allocated for file-backed memory, and how often allocation is failing.

All non-order-0 (and most order-0) folio allocations are currently done
through filemap_alloc_folio(), and folios are charged in a subsequent
call to filemap_add_folio(). So count file_fallback when allocation
fails in filemap_alloc_folio() and count file_alloc or
file_fallback_charge in filemap_add_folio(), based on whether charging
succeeded or not. There are some users of filemap_add_folio() that
allocate their own order-0 folio by other means, so we would not count
an allocation failure in this case, but we also don't care about order-0
allocations. This approach feels like it should be good enough and
doesn't require any (impractically large) refactoring.
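
For reference, the counting helper used above comes from patch 1/2 of this
series; its shape is roughly the following (a reconstruction, so treat the
details as an assumption rather than this series' exact code):

static inline void count_mthp_stat(int order, enum mthp_stat_item item)
{
	if (order <= 0 || order > PMD_ORDER)
		return;

	this_cpu_inc(mthp_stats.stats[order][item]);
}

Note the order <= 0 guard, which is also why counting the order-0 cases
mentioned above would be harmless.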

The existing mTHP stats interface is reused to provide consistency to
users. And because we are reusing the same interface, we can reuse the
same infrastructure on the kernel side. The one small wrinkle is that
the set of folio sizes supported by the pagecache are not identical to
those supported by anon and shmem; pagecache supports order-1, unlike
anon and shmem, and the max pagecache order may be less than PMD-size
(see arm64 with 64K base pages), again unlike anon and shmem. So we now
create a hugepages-*kB directory for the union of the sizes supported by
all 3 memory types and populate it with the relevant stats and controls.
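
To make the union concrete, here is a worked example for arm64 with 64K base
pages; PMD_ORDER = 13 and XA_CHUNK_SHIFT = 6 are assumptions about that
config, and THP_ORDERS_ALL_ANON is reconstructed from memory:

#define BIT(n)			(1UL << (n))
#define PMD_ORDER		13	/* 512M PMD size with 64K pages */
#define XA_CHUNK_SHIFT		6
#define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)	/* 11 */
#define MAX_PAGECACHE_ORDER	MAX_XAS_ORDER			/* capped below PMD_ORDER */

#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
#define PAGECACHE_LARGE_ORDERS	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))

Anon covers orders 2-13 and pagecache covers orders 1-11 here, so the union
creates a hugepages-*kB directory for every order from 1 to 13, with each
group of controls and stats only present where that order is supported.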

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  13 +++
 include/linux/huge_mm.h                    |   6 +-
 include/linux/pagemap.h                    |  17 ++-
 mm/filemap.c                               |   6 +-
 mm/huge_memory.c                           | 117 ++++++++++++++++-----
 5 files changed, 128 insertions(+), 31 deletions(-)

Comments

Baolin Wang July 12, 2024, 3 a.m. UTC | #1
On 2024/7/11 15:29, Ryan Roberts wrote:
> Expose 3 new mTHP stats for file (pagecache) folio allocations:
> 
>    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
> 
> This will provide some insight on the sizes of large folios being
> allocated for file-backed memory, and how often allocation is failing.
> 
> All non-order-0 (and most order-0) folio allocations are currently done
> through filemap_alloc_folio(), and folios are charged in a subsequent
> call to filemap_add_folio(). So count file_fallback when allocation
> fails in filemap_alloc_folio() and count file_alloc or
> file_fallback_charge in filemap_add_folio(), based on whether charging
> succeeded or not. There are some users of filemap_add_folio() that
> allocate their own order-0 folio by other means, so we would not count
> an allocation failure in this case, but we also don't care about order-0
> allocations. This approach feels like it should be good enough and
> doesn't require any (impractically large) refactoring.
> 
> The existing mTHP stats interface is reused to provide consistency to
> users. And because we are reusing the same interface, we can reuse the
> same infrastructure on the kernel side. The one small wrinkle is that
> the set of folio sizes supported by the pagecache are not identical to
> those supported by anon and shmem; pagecache supports order-1, unlike
> anon and shmem, and the max pagecache order may be less than PMD-size
> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
> create a hugepages-*kB directory for the union of the sizes supported by
> all 3 memory types and populate it with the relevant stats and controls.

Personally, I like this idea, as it can help analyze the allocation of
large folios for the page cache.

However, I have a slight concern about the consistency of the interface.

For 64K, the fields layout:
├── hugepages-64kB
│   ├── enabled
│   ├── shmem_enabled
│   └── stats
│       ├── anon_fault_alloc
│       ├── anon_fault_fallback
│       ├── anon_fault_fallback_charge
│       ├── file_alloc
│       ├── file_fallback
│       ├── file_fallback_charge
│       ├── shmem_alloc
│       ├── shmem_fallback
│       ├── shmem_fallback_charge
│       ├── split
│       ├── split_deferred
│       ├── split_failed
│       ├── swpout
│       └── swpout_fallback

But for 8K (for pagecache), you removed some fields (of course, I 
understand why they are not supported).

├── hugepages-8kB
│   └── stats
│       ├── file_alloc
│       ├── file_fallback
│       └── file_fallback_charge

This might not be user-friendly for some user-space parsing tools, since
directories following the same interface pattern now lack certain fields.
Of course, this might not be an issue if we have clear documentation
describing the differences here :)
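
A user-space reader can stay robust to either layout by treating a missing
stat file as "not supported". A minimal sketch of such a (hypothetical) tool:

#include <stdio.h>

/* Returns the counter value, or -1 if this size doesn't expose the stat. */
static long read_mthp_stat(const char *size_dir, const char *stat)
{
	char path[256];
	long val;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/transparent_hugepage/%s/stats/%s",
		 size_dir, stat);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* file absent: stat not supported here */
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	printf("%ld\n", read_mthp_stat("hugepages-8kB", "file_alloc"));
	return 0;
}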

Another possible approach is to maintain the same field layout to keep 
consistent, but prohibit writing to the fields that are not supported by 
the pagecache, and any stats read from them would be 0.


> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst |  13 +++
>   include/linux/huge_mm.h                    |   6 +-
>   include/linux/pagemap.h                    |  17 ++-
>   mm/filemap.c                               |   6 +-
>   mm/huge_memory.c                           | 117 ++++++++++++++++-----
>   5 files changed, 128 insertions(+), 31 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 058485daf186..d4857e457add 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -512,6 +512,19 @@ shmem_fallback_charge
>   	falls back to using small pages even though the allocation was
>   	successful.
>   
> +file_alloc
> +	is incremented every time a file huge page is successfully
> +	allocated.
> +
> +file_fallback
> +	is incremented if a file huge page is attempted to be allocated
> +	but fails and instead falls back to using small pages.
> +
> +file_fallback_charge
> +	is incremented if a file huge page cannot be charged and instead
> +	falls back to using small pages even though the allocation was
> +	successful.
> +
>   split
>   	is incremented every time a huge page is successfully split into
>   	smaller orders. This can happen for a variety of reasons but a
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index cb93b9009ce4..b4fba11976f2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -117,6 +117,9 @@ enum mthp_stat_item {
>   	MTHP_STAT_SHMEM_ALLOC,
>   	MTHP_STAT_SHMEM_FALLBACK,
>   	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
> +	MTHP_STAT_FILE_ALLOC,
> +	MTHP_STAT_FILE_FALLBACK,
> +	MTHP_STAT_FILE_FALLBACK_CHARGE,
>   	MTHP_STAT_SPLIT,
>   	MTHP_STAT_SPLIT_FAILED,
>   	MTHP_STAT_SPLIT_DEFERRED,
> @@ -292,11 +295,10 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>   
>   struct thpsize {
>   	struct kobject kobj;
> -	struct list_head node;
>   	int order;
>   };
>   
> -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> +#define to_thpsize(_kobj) container_of(_kobj, struct thpsize, kobj)
>   
>   #define transparent_hugepage_use_zero_page()				\
>   	(transparent_hugepage_flags &					\
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 6e2f72d03176..f45a1ba6d9b6 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -365,6 +365,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>    */
>   #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
>   #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> +#define PAGECACHE_LARGE_ORDERS	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
>   
>   /**
>    * mapping_set_large_folios() - Indicate the file supports large folios.
> @@ -562,14 +563,26 @@ static inline void *detach_page_private(struct page *page)
>   }
>   
>   #ifdef CONFIG_NUMA
> -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
> +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
>   #else
> -static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> +static inline struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>   {
>   	return folio_alloc_noprof(gfp, order);
>   }
>   #endif
>   
> +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> +{
> +	struct folio *folio;
> +
> +	folio = __filemap_alloc_folio_noprof(gfp, order);
> +
> +	if (!folio)
> +		count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> +
> +	return folio;
> +}
> +
>   #define filemap_alloc_folio(...)				\
>   	alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
>   
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 53d5d0410b51..131d514fca29 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -963,6 +963,8 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>   	int ret;
>   
>   	ret = mem_cgroup_charge(folio, NULL, gfp);
> +	count_mthp_stat(folio_order(folio),
> +		ret ? MTHP_STAT_FILE_FALLBACK_CHARGE : MTHP_STAT_FILE_ALLOC);
>   	if (ret)
>   		return ret;
>   
> @@ -990,7 +992,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>   EXPORT_SYMBOL_GPL(filemap_add_folio);
>   
>   #ifdef CONFIG_NUMA
> -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>   {
>   	int n;
>   	struct folio *folio;
> @@ -1007,7 +1009,7 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>   	}
>   	return folio_alloc_noprof(gfp, order);
>   }
> -EXPORT_SYMBOL(filemap_alloc_folio_noprof);
> +EXPORT_SYMBOL(__filemap_alloc_folio_noprof);
>   #endif
>   
>   /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f9696c94e211..559553e2a662 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -452,8 +452,9 @@ static const struct attribute_group hugepage_attr_group = {
>   
>   static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
>   static void thpsize_release(struct kobject *kobj);
> +static void thpsize_child_release(struct kobject *kobj);
>   static DEFINE_SPINLOCK(huge_anon_orders_lock);
> -static LIST_HEAD(thpsize_list);
> +static LIST_HEAD(thpsize_child_list);
>   
>   static ssize_t thpsize_enabled_show(struct kobject *kobj,
>   				    struct kobj_attribute *attr, char *buf)
> @@ -537,6 +538,18 @@ static const struct kobj_type thpsize_ktype = {
>   	.sysfs_ops = &kobj_sysfs_ops,
>   };
>   
> +static const struct kobj_type thpsize_child_ktype = {
> +	.release = &thpsize_child_release,
> +	.sysfs_ops = &kobj_sysfs_ops,
> +};
> +
> +struct thpsize_child {
> +	struct kobject kobj;
> +	struct list_head node;
> +};
> +
> +#define to_thpsize_child(_kobj) container_of(_kobj, struct thpsize_child, kobj)
> +
>   DEFINE_PER_CPU(struct mthp_stat, mthp_stats) = {{{0}}};
>   
>   static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
> @@ -557,7 +570,7 @@ static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
>   static ssize_t _name##_show(struct kobject *kobj,			\
>   			struct kobj_attribute *attr, char *buf)		\
>   {									\
> -	int order = to_thpsize(kobj)->order;				\
> +	int order = to_thpsize(kobj->parent)->order;			\
>   									\
>   	return sysfs_emit(buf, "%lu\n", sum_mthp_stat(order, _index));	\
>   }									\
> @@ -591,41 +604,93 @@ static struct attribute *stats_attrs[] = {
>   };
>   
>   static struct attribute_group stats_attr_group = {
> -	.name = "stats",
>   	.attrs = stats_attrs,
>   };
>   
> -static struct thpsize *thpsize_create(int order, struct kobject *parent)
> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
> +
> +static struct attribute *file_stats_attrs[] = {
> +	&file_alloc_attr.attr,
> +	&file_fallback_attr.attr,
> +	&file_fallback_charge_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group file_stats_attr_group = {
> +	.attrs = file_stats_attrs,
> +};
> +
> +static int thpsize_create(int order, struct kobject *parent)
>   {
>   	unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> +	struct thpsize_child *stats;
>   	struct thpsize *thpsize;
>   	int ret;
>   
> +	/*
> +	 * Each child object (currently only "stats" directory) holds a
> +	 * reference to the top-level thpsize object, so we can drop our ref to
> +	 * the top-level once stats is setup. Then we just need to drop a
> +	 * reference on any children to clean everything up. We can't just use
> +	 * the attr group name for the stats subdirectory because there may be
> +	 * multiple attribute groups to populate inside stats and overlaying
> +	 * using the name property isn't supported in that way; each attr group
> +	 * name, if provided, must be unique in the parent directory.
> +	 */
> +
>   	thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
> -	if (!thpsize)
> -		return ERR_PTR(-ENOMEM);
> +	if (!thpsize) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +	thpsize->order = order;
>   
>   	ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
>   				   "hugepages-%lukB", size);
>   	if (ret) {
>   		kfree(thpsize);
> -		return ERR_PTR(ret);
> +		goto err;
>   	}
>   
> -	ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
> -	if (ret) {
> +	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
> +	if (!stats) {
>   		kobject_put(&thpsize->kobj);
> -		return ERR_PTR(ret);
> +		ret = -ENOMEM;
> +		goto err;
>   	}
>   
> -	ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> +	ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
> +				   &thpsize->kobj, "stats");
> +	kobject_put(&thpsize->kobj);
>   	if (ret) {
> -		kobject_put(&thpsize->kobj);
> -		return ERR_PTR(ret);
> +		kfree(stats);
> +		goto err;
>   	}
>   
> -	thpsize->order = order;
> -	return thpsize;
> +	if (BIT(order) & THP_ORDERS_ALL_ANON) {
> +		ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
> +		if (ret)
> +			goto err_put;
> +
> +		ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
> +		if (ret)
> +			goto err_put;
> +	}
> +
> +	if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
> +		ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
> +		if (ret)
> +			goto err_put;
> +	}
> +
> +	list_add(&stats->node, &thpsize_child_list);
> +	return 0;
> +err_put:

IIUC, I think you should call 'sysfs_remove_group' to remove the group 
before putting the kobject.

> +	kobject_put(&stats->kobj);
> +err:
> +	return ret;
>   }
>   
>   static void thpsize_release(struct kobject *kobj)
> @@ -633,10 +698,14 @@ static void thpsize_release(struct kobject *kobj)
>   	kfree(to_thpsize(kobj));
>   }
>   
> +static void thpsize_child_release(struct kobject *kobj)
> +{
> +	kfree(to_thpsize_child(kobj));
> +}
> +
>   static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>   {
>   	int err;
> -	struct thpsize *thpsize;
>   	unsigned long orders;
>   	int order;
>   
> @@ -665,16 +734,14 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>   		goto remove_hp_group;
>   	}
>   
> -	orders = THP_ORDERS_ALL_ANON;
> +	orders = THP_ORDERS_ALL_ANON | PAGECACHE_LARGE_ORDERS;
>   	order = highest_order(orders);
>   	while (orders) {
> -		thpsize = thpsize_create(order, *hugepage_kobj);
> -		if (IS_ERR(thpsize)) {
> +		err = thpsize_create(order, *hugepage_kobj);
> +		if (err) {
>   			pr_err("failed to create thpsize for order %d\n", order);
> -			err = PTR_ERR(thpsize);
>   			goto remove_all;
>   		}
> -		list_add(&thpsize->node, &thpsize_list);
>   		order = next_order(&orders, order);
>   	}
>   
> @@ -692,11 +759,11 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>   
>   static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
>   {
> -	struct thpsize *thpsize, *tmp;
> +	struct thpsize_child *child, *tmp;
>   
> -	list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
> -		list_del(&thpsize->node);
> -		kobject_put(&thpsize->kobj);
> +	list_for_each_entry_safe(child, tmp, &thpsize_child_list, node) {
> +		list_del(&child->node);
> +		kobject_put(&child->kobj);
>   	}
>   
>   	sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
Lance Yang July 12, 2024, 12:22 p.m. UTC | #2
On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/7/11 15:29, Ryan Roberts wrote:
> > Expose 3 new mTHP stats for file (pagecache) folio allocations:
> >
> >    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
> >    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
> >    /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
> >
> > This will provide some insight on the sizes of large folios being
> > allocated for file-backed memory, and how often allocation is failing.
> >
> > All non-order-0 (and most order-0) folio allocations are currently done
> > through filemap_alloc_folio(), and folios are charged in a subsequent
> > call to filemap_add_folio(). So count file_fallback when allocation
> > fails in filemap_alloc_folio() and count file_alloc or
> > file_fallback_charge in filemap_add_folio(), based on whether charging
> > succeeded or not. There are some users of filemap_add_folio() that
> > allocate their own order-0 folio by other means, so we would not count
> > an allocation failure in this case, but we also don't care about order-0
> > allocations. This approach feels like it should be good enough and
> > doesn't require any (impractically large) refactoring.
> >
> > The existing mTHP stats interface is reused to provide consistency to
> > users. And because we are reusing the same interface, we can reuse the
> > same infrastructure on the kernel side. The one small wrinkle is that
> > the set of folio sizes supported by the pagecache are not identical to
> > those supported by anon and shmem; pagecache supports order-1, unlike
> > anon and shmem, and the max pagecache order may be less than PMD-size
> > (see arm64 with 64K base pages), again unlike anon and shmem. So we now
> > create a hugepages-*kB directory for the union of the sizes supported by
> > all 3 memory types and populate it with the relevant stats and controls.
>
> Personally, I like this idea, as it can help analyze the allocation of
> large folios for the page cache.
>
> However, I have a slight concern about the consistency of the interface.
>
> For 64K, the fields layout:
> ├── hugepages-64kB
> │   ├── enabled
> │   ├── shmem_enabled
> │   └── stats
> │       ├── anon_fault_alloc
> │       ├── anon_fault_fallback
> │       ├── anon_fault_fallback_charge
> │       ├── file_alloc
> │       ├── file_fallback
> │       ├── file_fallback_charge
> │       ├── shmem_alloc
> │       ├── shmem_fallback
> │       ├── shmem_fallback_charge
> │       ├── split
> │       ├── split_deferred
> │       ├── split_failed
> │       ├── swpout
> │       └── swpout_fallback
>
> But for 8K (for pagecache), you removed some fields (of course, I
> understand why they are not supported).
>
> ├── hugepages-8kB
> │   └── stats
> │       ├── file_alloc
> │       ├── file_fallback
> │       └── file_fallback_charge
>
> This might not be user-friendly for some user-space parsing tools, since
> directories following the same interface pattern now lack certain fields.
> Of course, this might not be an issue if we have clear documentation
> describing the differences here :)
>
> Another possible approach is to maintain the same field layout to keep
> consistent, but prohibit writing to the fields that are not supported by
> the pagecache, and any stats read from them would be 0.

I agree that maintaining a uniform field layout, especially at the stats
level, might be necessary ;)

Keeping a consistent interface could future-proof the design. It allows
for the possibility that features not currently supported for 8kB pages
might be enabled in the future.

Anyway, like David said, it's always tough messing with such stuff ;p

Thanks,
Lance

>
>
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst |  13 +++
> >   include/linux/huge_mm.h                    |   6 +-
> >   include/linux/pagemap.h                    |  17 ++-
> >   mm/filemap.c                               |   6 +-
> >   mm/huge_memory.c                           | 117 ++++++++++++++++-----
> >   5 files changed, 128 insertions(+), 31 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 058485daf186..d4857e457add 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -512,6 +512,19 @@ shmem_fallback_charge
> >       falls back to using small pages even though the allocation was
> >       successful.
> >
> > +file_alloc
> > +     is incremented every time a file huge page is successfully
> > +     allocated.
> > +
> > +file_fallback
> > +     is incremented if a file huge page is attempted to be allocated
> > +     but fails and instead falls back to using small pages.
> > +
> > +file_fallback_charge
> > +     is incremented if a file huge page cannot be charged and instead
> > +     falls back to using small pages even though the allocation was
> > +     successful.
> > +
> >   split
> >       is incremented every time a huge page is successfully split into
> >       smaller orders. This can happen for a variety of reasons but a
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index cb93b9009ce4..b4fba11976f2 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -117,6 +117,9 @@ enum mthp_stat_item {
> >       MTHP_STAT_SHMEM_ALLOC,
> >       MTHP_STAT_SHMEM_FALLBACK,
> >       MTHP_STAT_SHMEM_FALLBACK_CHARGE,
> > +     MTHP_STAT_FILE_ALLOC,
> > +     MTHP_STAT_FILE_FALLBACK,
> > +     MTHP_STAT_FILE_FALLBACK_CHARGE,
> >       MTHP_STAT_SPLIT,
> >       MTHP_STAT_SPLIT_FAILED,
> >       MTHP_STAT_SPLIT_DEFERRED,
> > @@ -292,11 +295,10 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >
> >   struct thpsize {
> >       struct kobject kobj;
> > -     struct list_head node;
> >       int order;
> >   };
> >
> > -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> > +#define to_thpsize(_kobj) container_of(_kobj, struct thpsize, kobj)
> >
> >   #define transparent_hugepage_use_zero_page()                                \
> >       (transparent_hugepage_flags &                                   \
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 6e2f72d03176..f45a1ba6d9b6 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -365,6 +365,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> >    */
> >   #define MAX_XAS_ORDER               (XA_CHUNK_SHIFT * 2 - 1)
> >   #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> > +#define PAGECACHE_LARGE_ORDERS       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
> >
> >   /**
> >    * mapping_set_large_folios() - Indicate the file supports large folios.
> > @@ -562,14 +563,26 @@ static inline void *detach_page_private(struct page *page)
> >   }
> >
> >   #ifdef CONFIG_NUMA
> > -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
> > +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
> >   #else
> > -static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> > +static inline struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> >   {
> >       return folio_alloc_noprof(gfp, order);
> >   }
> >   #endif
> >
> > +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> > +{
> > +     struct folio *folio;
> > +
> > +     folio = __filemap_alloc_folio_noprof(gfp, order);
> > +
> > +     if (!folio)
> > +             count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
> > +
> > +     return folio;
> > +}
> > +
> >   #define filemap_alloc_folio(...)                            \
> >       alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 53d5d0410b51..131d514fca29 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -963,6 +963,8 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> >       int ret;
> >
> >       ret = mem_cgroup_charge(folio, NULL, gfp);
> > +     count_mthp_stat(folio_order(folio),
> > +             ret ? MTHP_STAT_FILE_FALLBACK_CHARGE : MTHP_STAT_FILE_ALLOC);
> >       if (ret)
> >               return ret;
> >
> > @@ -990,7 +992,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> >   EXPORT_SYMBOL_GPL(filemap_add_folio);
> >
> >   #ifdef CONFIG_NUMA
> > -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> > +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> >   {
> >       int n;
> >       struct folio *folio;
> > @@ -1007,7 +1009,7 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> >       }
> >       return folio_alloc_noprof(gfp, order);
> >   }
> > -EXPORT_SYMBOL(filemap_alloc_folio_noprof);
> > +EXPORT_SYMBOL(__filemap_alloc_folio_noprof);
> >   #endif
> >
> >   /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index f9696c94e211..559553e2a662 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -452,8 +452,9 @@ static const struct attribute_group hugepage_attr_group = {
> >
> >   static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
> >   static void thpsize_release(struct kobject *kobj);
> > +static void thpsize_child_release(struct kobject *kobj);
> >   static DEFINE_SPINLOCK(huge_anon_orders_lock);
> > -static LIST_HEAD(thpsize_list);
> > +static LIST_HEAD(thpsize_child_list);
> >
> >   static ssize_t thpsize_enabled_show(struct kobject *kobj,
> >                                   struct kobj_attribute *attr, char *buf)
> > @@ -537,6 +538,18 @@ static const struct kobj_type thpsize_ktype = {
> >       .sysfs_ops = &kobj_sysfs_ops,
> >   };
> >
> > +static const struct kobj_type thpsize_child_ktype = {
> > +     .release = &thpsize_child_release,
> > +     .sysfs_ops = &kobj_sysfs_ops,
> > +};
> > +
> > +struct thpsize_child {
> > +     struct kobject kobj;
> > +     struct list_head node;
> > +};
> > +
> > +#define to_thpsize_child(_kobj) container_of(_kobj, struct thpsize_child, kobj)
> > +
> >   DEFINE_PER_CPU(struct mthp_stat, mthp_stats) = {{{0}}};
> >
> >   static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
> > @@ -557,7 +570,7 @@ static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
> >   static ssize_t _name##_show(struct kobject *kobj,                   \
> >                       struct kobj_attribute *attr, char *buf)         \
> >   {                                                                   \
> > -     int order = to_thpsize(kobj)->order;                            \
> > +     int order = to_thpsize(kobj->parent)->order;                    \
> >                                                                       \
> >       return sysfs_emit(buf, "%lu\n", sum_mthp_stat(order, _index));  \
> >   }                                                                   \
> > @@ -591,41 +604,93 @@ static struct attribute *stats_attrs[] = {
> >   };
> >
> >   static struct attribute_group stats_attr_group = {
> > -     .name = "stats",
> >       .attrs = stats_attrs,
> >   };
> >
> > -static struct thpsize *thpsize_create(int order, struct kobject *parent)
> > +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
> > +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
> > +
> > +static struct attribute *file_stats_attrs[] = {
> > +     &file_alloc_attr.attr,
> > +     &file_fallback_attr.attr,
> > +     &file_fallback_charge_attr.attr,
> > +     NULL,
> > +};
> > +
> > +static struct attribute_group file_stats_attr_group = {
> > +     .attrs = file_stats_attrs,
> > +};
> > +
> > +static int thpsize_create(int order, struct kobject *parent)
> >   {
> >       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> > +     struct thpsize_child *stats;
> >       struct thpsize *thpsize;
> >       int ret;
> >
> > +     /*
> > +      * Each child object (currently only "stats" directory) holds a
> > +      * reference to the top-level thpsize object, so we can drop our ref to
> > +      * the top-level once stats is setup. Then we just need to drop a
> > +      * reference on any children to clean everything up. We can't just use
> > +      * the attr group name for the stats subdirectory because there may be
> > +      * multiple attribute groups to populate inside stats and overlaying
> > +      * using the name property isn't supported in that way; each attr group
> > +      * name, if provided, must be unique in the parent directory.
> > +      */
> > +
> >       thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
> > -     if (!thpsize)
> > -             return ERR_PTR(-ENOMEM);
> > +     if (!thpsize) {
> > +             ret = -ENOMEM;
> > +             goto err;
> > +     }
> > +     thpsize->order = order;
> >
> >       ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
> >                                  "hugepages-%lukB", size);
> >       if (ret) {
> >               kfree(thpsize);
> > -             return ERR_PTR(ret);
> > +             goto err;
> >       }
> >
> > -     ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
> > -     if (ret) {
> > +     stats = kzalloc(sizeof(*stats), GFP_KERNEL);
> > +     if (!stats) {
> >               kobject_put(&thpsize->kobj);
> > -             return ERR_PTR(ret);
> > +             ret = -ENOMEM;
> > +             goto err;
> >       }
> >
> > -     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> > +     ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
> > +                                &thpsize->kobj, "stats");
> > +     kobject_put(&thpsize->kobj);
> >       if (ret) {
> > -             kobject_put(&thpsize->kobj);
> > -             return ERR_PTR(ret);
> > +             kfree(stats);
> > +             goto err;
> >       }
> >
> > -     thpsize->order = order;
> > -     return thpsize;
> > +     if (BIT(order) & THP_ORDERS_ALL_ANON) {
> > +             ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
> > +             if (ret)
> > +                     goto err_put;
> > +
> > +             ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
> > +             if (ret)
> > +                     goto err_put;
> > +     }
> > +
> > +     if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
> > +             ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
> > +             if (ret)
> > +                     goto err_put;
> > +     }
> > +
> > +     list_add(&stats->node, &thpsize_child_list);
> > +     return 0;
> > +err_put:
>
> IIUC, I think you should call 'sysfs_remove_group' to remove the group
> before putting the kobject.
>
> > +     kobject_put(&stats->kobj);
> > +err:
> > +     return ret;
> >   }
> >
> >   static void thpsize_release(struct kobject *kobj)
> > @@ -633,10 +698,14 @@ static void thpsize_release(struct kobject *kobj)
> >       kfree(to_thpsize(kobj));
> >   }
> >
> > +static void thpsize_child_release(struct kobject *kobj)
> > +{
> > +     kfree(to_thpsize_child(kobj));
> > +}
> > +
> >   static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> >   {
> >       int err;
> > -     struct thpsize *thpsize;
> >       unsigned long orders;
> >       int order;
> >
> > @@ -665,16 +734,14 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> >               goto remove_hp_group;
> >       }
> >
> > -     orders = THP_ORDERS_ALL_ANON;
> > +     orders = THP_ORDERS_ALL_ANON | PAGECACHE_LARGE_ORDERS;
> >       order = highest_order(orders);
> >       while (orders) {
> > -             thpsize = thpsize_create(order, *hugepage_kobj);
> > -             if (IS_ERR(thpsize)) {
> > +             err = thpsize_create(order, *hugepage_kobj);
> > +             if (err) {
> >                       pr_err("failed to create thpsize for order %d\n", order);
> > -                     err = PTR_ERR(thpsize);
> >                       goto remove_all;
> >               }
> > -             list_add(&thpsize->node, &thpsize_list);
> >               order = next_order(&orders, order);
> >       }
> >
> > @@ -692,11 +759,11 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> >
> >   static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
> >   {
> > -     struct thpsize *thpsize, *tmp;
> > +     struct thpsize_child *child, *tmp;
> >
> > -     list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
> > -             list_del(&thpsize->node);
> > -             kobject_put(&thpsize->kobj);
> > +     list_for_each_entry_safe(child, tmp, &thpsize_child_list, node) {
> > +             list_del(&child->node);
> > +             kobject_put(&child->kobj);
> >       }
> >
> >       sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
kernel test robot July 12, 2024, 10:44 p.m. UTC | #3
Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20240712]
[cannot apply to linus/master v6.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Cleanup-count_mthp_stat-definition/20240711-154455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240711072929.3590000-3-ryan.roberts%40arm.com
patch subject: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations
config: arm64-defconfig (https://download.01.org/0day-ci/archive/20240713/202407130620.DbyYULnj-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240713/202407130620.DbyYULnj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407130620.DbyYULnj-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/crypto/crct10dif-ce.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/crypto/aes-neon-bs.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/btrfs/btrfs.o
WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/af_alg.o
WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/ecc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/curve25519-generic.o
WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/xor.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libchacha.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libarc4.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libdes.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/pwm/pwm-imx27.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/pwm/pwm-mediatek.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/pwm/pwm-visconti.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/pci/controller/pcie-mediatek-gen3.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/clk/qcom/lpass-gfm-sm8250.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/soc/mediatek/mtk-cmdq-helper.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/soc/qcom/spm.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/xen/xen-pciback/xen-pciback.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/regulator/max20411-regulator.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/bridge/lontium-lt9611.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/bridge/lontium-lt9611uxc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/base/regmap/regmap-slimbus.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/misc/fastrpc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/spi/spi-omap2-mcspi.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/host/xhci-pci-renesas.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_acm.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/u_serial.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_serial.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_obex.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/u_ether.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_ncm.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_ecm.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_eem.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_ecm_subset.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_rndis.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_mass_storage.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/function/usb_f_fs.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/gadget/libcomposite.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/media/tuners/tda9887.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/media/rc/rc-core.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/media/common/uvc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/media/v4l2-core/v4l2-async.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/media/v4l2-core/v4l2-fwnode.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/edac/layerscape_edac_mod.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/ufs/host/ufs-qcom.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/firmware/google/coreboot_table.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/firmware/google/cbmem.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/sa2ul.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/mailbox/mtk-cmdq-mailbox.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/rpmsg/rpmsg_char.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/iio/buffer/kfifo_buf.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/perf/arm-ccn.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/perf/fsl_imx8_ddr_perf.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/perf/arm_cspmu/arm_cspmu_module.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/perf/arm_cspmu/nvidia_cspmu.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/interconnect/imx/imx8mm-interconnect.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/interconnect/imx/imx8mq-interconnect.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/interconnect/imx/imx8mn-interconnect.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/hte/hte-tegra194-test.o
>> ERROR: modpost: "mthp_stats" [fs/btrfs/btrfs.ko] undefined!
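
The error arises because the new inline filemap_alloc_folio_noprof() makes
modular callers of filemap_alloc_folio(), such as btrfs, reference the
per-CPU mthp_stats variable directly, while mm/huge_memory.c does not export
it. One possible fix, sketched here as an assumption rather than something
taken from this thread, is to export the per-CPU symbol alongside its
definition:

DEFINE_PER_CPU(struct mthp_stat, mthp_stats) = {{{0}}};
EXPORT_PER_CPU_SYMBOL_GPL(mthp_stats);	/* let modular callers bump the counters */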
David Hildenbrand July 13, 2024, 1:08 a.m. UTC | #4
On 12.07.24 14:22, Lance Yang wrote:
> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>
>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>
>>> This will provide some insight on the sizes of large folios being
>>> allocated for file-backed memory, and how often allocation is failing.
>>>
>>> All non-order-0 (and most order-0) folio allocations are currently done
>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>> call to filemap_add_folio(). So count file_fallback when allocation
>>> fails in filemap_alloc_folio() and count file_alloc or
>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>> succeeded or not. There are some users of filemap_add_folio() that
>>> allocate their own order-0 folio by other means, so we would not count
>>> an allocation failure in this case, but we also don't care about order-0
>>> allocations. This approach feels like it should be good enough and
>>> doesn't require any (impractically large) refactoring.
>>>
>>> The existing mTHP stats interface is reused to provide consistency to
>>> users. And because we are reusing the same interface, we can reuse the
>>> same infrastructure on the kernel side. The one small wrinkle is that
>>> the set of folio sizes supported by the pagecache are not identical to
>>> those supported by anon and shmem; pagecache supports order-1, unlike
>>> anon and shmem, and the max pagecache order may be less than PMD-size
>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
>>> create a hugepages-*kB directory for the union of the sizes supported by
>>> all 3 memory types and populate it with the relevant stats and controls.
>>
>> Personally, I like this idea, as it can help analyze the allocation of
>> large folios for the page cache.
>>
>> However, I have a slight concern about the consistency of the interface.
>>
>> For 64K, the fields layout:
>> ├── hugepages-64kB
>> │   ├── enabled
>> │   ├── shmem_enabled
>> │   └── stats
>> │       ├── anon_fault_alloc
>> │       ├── anon_fault_fallback
>> │       ├── anon_fault_fallback_charge
>> │       ├── file_alloc
>> │       ├── file_fallback
>> │       ├── file_fallback_charge
>> │       ├── shmem_alloc
>> │       ├── shmem_fallback
>> │       ├── shmem_fallback_charge
>> │       ├── split
>> │       ├── split_deferred
>> │       ├── split_failed
>> │       ├── swpout
>> │       └── swpout_fallback
>>
>> But for 8K (for pagecache), you removed some fields (of course, I
>> understand why they are not supported).
>>
>> ├── hugepages-8kB
>> │   └── stats
>> │       ├── file_alloc
>> │       ├── file_fallback
>> │       └── file_fallback_charge
>>
>> This might not be user-friendly for some user-space parsing tools, since
>> directories following the same interface pattern now lack certain fields.
>> Of course, this might not be an issue if we have clear documentation
>> describing the differences here :)
>>
>> Another possible approach is to maintain the same field layout to keep
>> consistent, but prohibit writing to the fields that are not supported by
>> the pagecache, and any stats read from them would be 0.
> 
> I agree that maintaining a uniform field layout, especially at the stats
> level, might be necessary ;)
> 
> Keeping a consistent interface could future-proof the design. It allows
> for the possibility that features not currently supported for 8kB pages
> might be enabled in the future.

I'll just note that, with shmem/file effectively being disabled for 
order > 11, we'll also have entries there that are effectively unused.

Good question how we want to deal with that (stats are easy, but what 
about when we enable something? Maybe we should document that "enabled" 
is only effective when supported).

Hmmmmm
Ryan Roberts July 13, 2024, 10:45 a.m. UTC | #5
On 13/07/2024 02:08, David Hildenbrand wrote:
> On 12.07.24 14:22, Lance Yang wrote:
>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>>
>>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>>    
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>>
>>>> This will provide some insight on the sizes of large folios being
>>>> allocated for file-backed memory, and how often allocation is failing.
>>>>
>>>> All non-order-0 (and most order-0) folio allocations are currently done
>>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>>> call to filemap_add_folio(). So count file_fallback when allocation
>>>> fails in filemap_alloc_folio() and count file_alloc or
>>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>>> succeeded or not. There are some users of filemap_add_folio() that
>>>> allocate their own order-0 folio by other means, so we would not count
>>>> an allocation failure in this case, but we also don't care about order-0
>>>> allocations. This approach feels like it should be good enough and
>>>> doesn't require any (impractically large) refactoring.
>>>>
>>>> The existing mTHP stats interface is reused to provide consistency to
>>>> users. And because we are reusing the same interface, we can reuse the
>>>> same infrastructure on the kernel side. The one small wrinkle is that
>>>> the set of folio sizes supported by the pagecache are not identical to
>>>> those supported by anon and shmem; pagecache supports order-1, unlike
>>>> anon and shmem, and the max pagecache order may be less than PMD-size
>>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
>>>> create a hugepages-*kB directory for the union of the sizes supported by
>>>> all 3 memory types and populate it with the relevant stats and controls.
>>>
>>> Personally, I like this idea, as it can help analyze the allocation of
>>> large folios for the page cache.
>>>
>>> However, I have a slight concern about the consistency of the interface.
>>>
>>> For 64K, the fields layout:
>>> ├── hugepages-64kB
>>> │   ├── enabled
>>> │   ├── shmem_enabled
>>> │   └── stats
>>> │       ├── anon_fault_alloc
>>> │       ├── anon_fault_fallback
>>> │       ├── anon_fault_fallback_charge
>>> │       ├── file_alloc
>>> │       ├── file_fallback
>>> │       ├── file_fallback_charge
>>> │       ├── shmem_alloc
>>> │       ├── shmem_fallback
>>> │       ├── shmem_fallback_charge
>>> │       ├── split
>>> │       ├── split_deferred
>>> │       ├── split_failed
>>> │       ├── swpout
>>> │       └── swpout_fallback
>>>
>>> But for 8K (for pagecache), you removed some fields (of course, I
>>> understand why they are not supported).
>>>
>>> ├── hugepages-8kB
>>> │   └── stats
>>> │       ├── file_alloc
>>> │       ├── file_fallback
>>> │       └── file_fallback_charge
>>>
>>> This might not be user-friendly for some user-space parsing tools, since
>>> directories following the same interface pattern now lack certain fields.
>>> Of course, this might not be an issue if we have clear documentation
>>> describing the differences here :)
>>>
>>> Another possible approach is to maintain the same field layout to keep
>>> consistent, but prohibit writing to the fields that are not supported by
>>> the pagecache, and any stats read from them would be 0.
>>
>> I agree that maintaining a uniform field layout, especially at the stats
>> level, might be necessary ;)
>>
>> Keeping a consistent interface could future-proof the design. It allows
>> for the possibility that features not currently supported for 8kB pages
>> might be enabled in the future.
> 
> I'll just note that, with shmem/file effectively being disabled for order > 11,
> we'll also have entries there that are effectively unused.

Indeed, I mentioned that in the commit log :)

> 
> Good question how we want to deal with that (stats are easy, but what about when
> we enable something? Maybe we should document that "enabled" is only effective
> when supported).

The documentation already says "If enabling multiple hugepage sizes, the kernel
will select the most appropriate enabled size for a given allocation." for anon
THP (and I've added similar wording for my as-yet-unposted patch to add controls
for page cache folio sizes). So I think we could easily add dummy *enabled
controls for all sizes, which can be written to and read back consistently but
which the kernel just ignores when deciding what size to use. It would also
simplify the code that populates the controls.
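
As a toy model of that idea (illustrative C, not kernel code): every size
accepts and echoes back a setting, but unsupported orders are masked out when
the effective policy is computed:

#include <stdbool.h>

#define NR_ORDERS	14
static bool requested[NR_ORDERS];	/* what was written to each "enabled" file */
static unsigned long supported_mask;	/* orders this size actually supports */

static void store_enabled(int order, bool enable)
{
	requested[order] = enable;	/* always accepted; read back verbatim */
}

static unsigned long effective_orders(void)
{
	unsigned long orders = 0;

	for (int i = 0; i < NR_ORDERS; i++)
		if (requested[i])
			orders |= 1UL << i;
	return orders & supported_mask;	/* dummy controls silently ignored */
}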

Personally though, I'm not convinced of the value of trying to make the controls
for every size look identical. What's the real value to the user of pretending
that they can select a size that they cannot? What happens when we inevitably
want to add some new control in the future which only applies to select sizes
and there is no good way to fake it for the other sizes? Why can't user space
just be expected to rely on the existence of the files rather than on the
existence of
the directories?

As always, I'll go with the majority, but just wanted to register my opinion.

Thanks,
Ryan

> 
> Hmmmmm
>
Ryan Roberts July 13, 2024, 11 a.m. UTC | #6
[...]

>> +static int thpsize_create(int order, struct kobject *parent)
>>   {
>>       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
>> +    struct thpsize_child *stats;
>>       struct thpsize *thpsize;
>>       int ret;
>>   +    /*
>> +     * Each child object (currently only "stats" directory) holds a
>> +     * reference to the top-level thpsize object, so we can drop our ref to
>> +     * the top-level once stats is setup. Then we just need to drop a
>> +     * reference on any children to clean everything up. We can't just use
>> +     * the attr group name for the stats subdirectory because there may be
>> +     * multiple attribute groups to populate inside stats and overlaying
>> +     * using the name property isn't supported in that way; each attr group
>> +     * name, if provided, must be unique in the parent directory.
>> +     */
>> +
>>       thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
>> -    if (!thpsize)
>> -        return ERR_PTR(-ENOMEM);
>> +    if (!thpsize) {
>> +        ret = -ENOMEM;
>> +        goto err;
>> +    }
>> +    thpsize->order = order;
>>         ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
>>                      "hugepages-%lukB", size);
>>       if (ret) {
>>           kfree(thpsize);
>> -        return ERR_PTR(ret);
>> +        goto err;
>>       }
>>   -    ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>> -    if (ret) {
>> +    stats = kzalloc(sizeof(*stats), GFP_KERNEL);
>> +    if (!stats) {
>>           kobject_put(&thpsize->kobj);
>> -        return ERR_PTR(ret);
>> +        ret = -ENOMEM;
>> +        goto err;
>>       }
>>   -    ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
>> +    ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
>> +                   &thpsize->kobj, "stats");
>> +    kobject_put(&thpsize->kobj);
>>       if (ret) {
>> -        kobject_put(&thpsize->kobj);
>> -        return ERR_PTR(ret);
>> +        kfree(stats);
>> +        goto err;
>>       }
>>   -    thpsize->order = order;
>> -    return thpsize;
>> +    if (BIT(order) & THP_ORDERS_ALL_ANON) {
>> +        ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>> +        if (ret)
>> +            goto err_put;
>> +
>> +        ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
>> +        if (ret)
>> +            goto err_put;
>> +    }
>> +
>> +    if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
>> +        ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
>> +        if (ret)
>> +            goto err_put;
>> +    }
>> +
>> +    list_add(&stats->node, &thpsize_child_list);
>> +    return 0;
>> +err_put:
> 
> IIUC, I think you should call 'sysfs_remove_group' to remove the group before
> putting the kobject.

Are you sure about that? As I understood it, sysfs_create_group() was
conceptually modifying the state of the kobj, so when the kobj gets destroyed,
all its state is tidied up. __kobject_del() (called on the last kobject_put())
calls sysfs_remove_groups() and tidies up the sysfs state as far as I can see?
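
In other words, the teardown being relied on is just a single put; a
hypothetical helper to illustrate (thpsize_teardown is not in the patch, and
the comment reflects my reading of kobject behaviour):

static void thpsize_teardown(struct thpsize_child *stats)
{
	/*
	 * The final put triggers __kobject_del(), which removes the kobject's
	 * sysfs directory recursively -- including groups added with
	 * sysfs_create_group() -- and then drops the reference held on the
	 * parent thpsize kobject, freeing both objects via their releases.
	 */
	kobject_put(&stats->kobj);
}
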

> 
>> +    kobject_put(&stats->kobj);
>> +err:
>> +    return ret;
>>   }
Baolin Wang July 13, 2024, 12:54 p.m. UTC | #7
On 2024/7/13 19:00, Ryan Roberts wrote:
> [...]
> 
>>> +static int thpsize_create(int order, struct kobject *parent)
>>>    {
>>>        unsigned long size = (PAGE_SIZE << order) / SZ_1K;
>>> +    struct thpsize_child *stats;
>>>        struct thpsize *thpsize;
>>>        int ret;
>>>    +    /*
>>> +     * Each child object (currently only "stats" directory) holds a
>>> +     * reference to the top-level thpsize object, so we can drop our ref to
>>> +     * the top-level once stats is setup. Then we just need to drop a
>>> +     * reference on any children to clean everything up. We can't just use
>>> +     * the attr group name for the stats subdirectory because there may be
>>> +     * multiple attribute groups to populate inside stats and overlaying
>>> +     * using the name property isn't supported in that way; each attr group
>>> +     * name, if provided, must be unique in the parent directory.
>>> +     */
>>> +
>>>        thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
>>> -    if (!thpsize)
>>> -        return ERR_PTR(-ENOMEM);
>>> +    if (!thpsize) {
>>> +        ret = -ENOMEM;
>>> +        goto err;
>>> +    }
>>> +    thpsize->order = order;
>>>          ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
>>>                       "hugepages-%lukB", size);
>>>        if (ret) {
>>>            kfree(thpsize);
>>> -        return ERR_PTR(ret);
>>> +        goto err;
>>>        }
>>>    -    ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>>> -    if (ret) {
>>> +    stats = kzalloc(sizeof(*stats), GFP_KERNEL);
>>> +    if (!stats) {
>>>            kobject_put(&thpsize->kobj);
>>> -        return ERR_PTR(ret);
>>> +        ret = -ENOMEM;
>>> +        goto err;
>>>        }
>>>    -    ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
>>> +    ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
>>> +                   &thpsize->kobj, "stats");
>>> +    kobject_put(&thpsize->kobj);
>>>        if (ret) {
>>> -        kobject_put(&thpsize->kobj);
>>> -        return ERR_PTR(ret);
>>> +        kfree(stats);
>>> +        goto err;
>>>        }
>>>    -    thpsize->order = order;
>>> -    return thpsize;
>>> +    if (BIT(order) & THP_ORDERS_ALL_ANON) {
>>> +        ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>>> +        if (ret)
>>> +            goto err_put;
>>> +
>>> +        ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
>>> +        if (ret)
>>> +            goto err_put;
>>> +    }
>>> +
>>> +    if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
>>> +        ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
>>> +        if (ret)
>>> +            goto err_put;
>>> +    }
>>> +
>>> +    list_add(&stats->node, &thpsize_child_list);
>>> +    return 0;
>>> +err_put:
>>
>> IIUC, I think you should call 'sysfs_remove_group' to remove the group before
>> putting the kobject.
> 
> Are you sure about that? As I understood it, sysfs_create_group() was
> conceptually modifying the state of the kobj, so when the kobj gets destroyed,
> all its state is tidied up. __kobject_del() (called on the last kobject_put())
> calls sysfs_remove_groups() and tidies up the sysfs state as far as I can see?

IIUC, __kobject_del() only removes the ktype default groups via
'sysfs_remove_groups(kobj, ktype->default_groups)', but the groups you
created are not added to ktype->default_groups. That means you should
manually remove them, or am I missing something?
Ryan Roberts July 14, 2024, 9:05 a.m. UTC | #8
On 13/07/2024 13:54, Baolin Wang wrote:
> 
> 
> On 2024/7/13 19:00, Ryan Roberts wrote:
>> [...]
>>
>>>> [...]
>>>> +err_put:
>>>
>>> IIUC, I think you should call 'sysfs_remove_group' to remove the group before
>>> putting the kobject.
>>
>> Are you sure about that? As I understood it, sysfs_create_group() was
>> conceptually modifying the state of the kobj, so when the kobj gets destroyed,
>> all its state is tidied up. __kobject_del() (called on the last kobject_put())
>> calls sysfs_remove_groups() and tidies up the sysfs state as far as I can see?
> 
> IIUC, __kobject_del() only removes the ktype default groups by
> 'sysfs_remove_groups(kobj, ktype->default_groups)', but your created groups are
> not added into the ktype->default_groups. That means you should manually remove
> them, or am I missing something?

That was also putting doubt in my mind. But the sample at
samples/kobject/kobject-example.c does not call sysfs_remove_group(). It just
calls sysfs_create_group() in example_init() and calls kobject_put() in
example_exit(). So I think that's the correct pattern.

Looking at the code more closely, sysfs_create_group() just creates files for
each of the attributes in the group. __kobject_del() calls sysfs_remove_dir(),
whose comment states "we remove any files in the directory before we remove the
directory" so I'm pretty sure sysfs_remove_group() is not required.

By the way, if we do choose to only populate stats if that size can be used by
anon/shmem/file, then I've found sysfs_merge_group() which will simplify adding
named groups without needing to manually create the stats directory as I am in
this version of the patch. I'll migrate to using that approach in v2. Of course
if we decide to take the approach of populating all stats for all sizes, that
problem goes away anyway.

Thanks,
Ryan
Ryan Roberts July 15, 2024, 1:55 p.m. UTC | #9
On 12/07/2024 23:44, kernel test robot wrote:
> Hi Ryan,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on next-20240712]
> [cannot apply to linus/master v6.10-rc7]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Cleanup-count_mthp_stat-definition/20240711-154455
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20240711072929.3590000-3-ryan.roberts%40arm.com
> patch subject: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations
> config: arm64-defconfig (https://download.01.org/0day-ci/archive/20240713/202407130620.DbyYULnj-lkp@intel.com/config)
> compiler: aarch64-linux-gcc (GCC) 14.1.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240713/202407130620.DbyYULnj-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202407130620.DbyYULnj-lkp@intel.com/
> 
> All errors (new ones prefixed by >>, old ones prefixed by <<):

[...]

>>> ERROR: modpost: "mthp_stats" [fs/btrfs/btrfs.ko] undefined!

I'm assuming this last line is the problem one? I can't reproduce this following
your instructions and I can't see any issue based on inspection. Is it possible
this is a false positive?

Thanks,
Ryan
Ryan Roberts July 16, 2024, 8:31 a.m. UTC | #10
On 13/07/2024 11:45, Ryan Roberts wrote:
> On 13/07/2024 02:08, David Hildenbrand wrote:
>> On 12.07.24 14:22, Lance Yang wrote:
>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>> [...]
>>>>
>>>> Personally, I like this idea; it can help analyze the allocation of
>>>> large folios for the page cache.
>>>>
>>>> However, I have a slight concern about the consistency of the interface.
>>>>
>>>> For 64K, the fields layout:
>>>> ├── hugepages-64kB
>>>> │   ├── enabled
>>>> │   ├── shmem_enabled
>>>> │   └── stats
>>>> │       ├── anon_fault_alloc
>>>> │       ├── anon_fault_fallback
>>>> │       ├── anon_fault_fallback_charge
>>>> │       ├── file_alloc
>>>> │       ├── file_fallback
>>>> │       ├── file_fallback_charge
>>>> │       ├── shmem_alloc
>>>> │       ├── shmem_fallback
>>>> │       ├── shmem_fallback_charge
>>>> │       ├── split
>>>> │       ├── split_deferred
>>>> │       ├── split_failed
>>>> │       ├── swpout
>>>> │       └── swpout_fallback
>>>>
>>>> But for 8K (for pagecache), you removed some fields (of course, I
>>>> understand why they are not supported).
>>>>
>>>> ├── hugepages-8kB
>>>> │   └── stats
>>>> │       ├── file_alloc
>>>> │       ├── file_fallback
>>>> │       └── file_fallback_charge
>>>>
>>>> This might not be user-friendly for some user-space parsing tools, as
>>>> certain fields would be missing from otherwise identically structured
>>>> interfaces. Of course, this might not be an issue if we have clear
>>>> documentation describing the differences here :)
>>>>
>>>> Another possible approach is to maintain the same field layout to keep
>>>> things consistent, but prohibit writing to the fields that are not supported by
>>>> the pagecache, and any stats read from them would be 0.
>>>
>>> I agree that maintaining a uniform field layout, especially at the stats
>>> level, might be necessary ;)
>>>
>>> Keeping a consistent interface could future-proof the design. It allows
>>> for the possibility that features not currently supported for 8kB pages
>>> might be enabled in the future.
>>
>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>> we'll also have entries there that are effectively unused.
> 
> Indeed, I mentioned that in the commit log :)
> 
>>
>> Good question how we want to deal with that (stats are easy, but what about when
>> we enable something? Maybe we should document that "enabled" is only effective
>> when supported).
> 
> The documentation already says "If enabling multiple hugepage sizes, the kernel
> will select the most appropriate enabled size for a given allocation." for anon
> THP (and I've added similar wording for my as-yet-unposted patch to add controls
> for page cache folio sizes). So I think we could easily add dummy *enabled
> controls for all sizes, that can be written to and read back consistently, but
> the kernel just ignores them when deciding what size to use. It would also
> simplify the code that populates the controls.
> 
> Personally though, I'm not convinced of the value of trying to make the controls
> for every size look identical. What's the real value to the user to pretend that
> they can select a size that they cannot? What happens when we inevitably want to
> add some new control in future which only applies to select sizes and there is
> no good way to fake it for the other sizes? Why can't user space just be
> expected to rely on the existence of the files rather than on the existence of
> the directories?
> 
> As always, I'll go with the majority, but just wanted to register my opinion.

Should I assume from the lack of reply on this that everyone else is in favour
of adding dummy controls so that all sizes have the same set of controls? If I
don't hear anything further, I'll post v2 with dummy controls today or tomorrow.
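
(On the files-vs-directories point above: a trivial userspace sketch of
what relying on file existence, rather than directory existence, could
look like; illustrative only:)

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* One of the stat files proposed by this patch. */
          const char *path = "/sys/kernel/mm/transparent_hugepage/"
                             "hugepages-64kB/stats/file_alloc";

          /* The file, not the directory, signals support. */
          if (access(path, F_OK) == 0)
                  printf("64kB pagecache stats available\n");
          else
                  printf("64kB pagecache stats not available\n");
          return 0;
  }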

> 
> Thanks,
> Ryan
> 
>>
>> Hmmmmm
>>
>
David Hildenbrand July 16, 2024, 10:19 a.m. UTC | #11
On 16.07.24 10:31, Ryan Roberts wrote:
> On 13/07/2024 11:45, Ryan Roberts wrote:
>> On 13/07/2024 02:08, David Hildenbrand wrote:
>>> On 12.07.24 14:22, Lance Yang wrote:
>>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>> [...]
>>>>
>>>> I agree that maintaining a uniform field layout, especially at the stats
>>>> level, might be necessary ;)
>>>>
>>>> Keeping a consistent interface could future-proof the design. It allows
>>>> for the possibility that features not currently supported for 8kB pages
>>>> might be enabled in the future.
>>>
>>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>>> we'll also have entries there that are effectively unused.
>>
>> Indeed, I mentioned that in the commit log :)

Well, I think it's more extreme than what you mentioned.

For example, shmem_enabled on arm64 with 64K is now effectively 
non-functional. Just like it will be for other orders in the anon-shmem 
case when the order exceeds MAX_PAGECACHE_ORDER.

>>
>>>
>>> Good question how we want to deal with that (stats are easy, but what about when
>>> we enable something? Maybe we should document that "enabled" is only effective
>>> when supported).
>>
>> The documentation already says "If enabling multiple hugepage sizes, the kernel
>> will select the most appropriate enabled size for a given allocation." for anon
>> THP (and I've added similar wording for my as-yet-unposted patch to add controls
>> for page cache folio sizes). So I think we could easily add dummy *enabled
>> controls for all sizes, that can be written to and read back consistently, but
>> the kernel just ignores them when deciding what size to use. It would also
>> simplify the code that populates the controls.
>>
>> Personally though, I'm not convinced of the value of trying to make the controls
>> for every size look identical. What's the real value to the user to pretend that
>> they can select a size that they cannot? What happens when we inevitably want to
>> add some new control in future which only applies to select sizes and there is
>> no good way to fake it for the other sizes? Why can't user space just be
>> expected to rely on the existence of the files rather than on the existence of
>> the directories?
>>
>> As always, I'll go with the majority, but just wanted to register my opinion.
> 
> Should I assume from the lack of reply on this that everyone else is in favour
> of adding dummy controls so that all sizes have the same set of controls? If I
>> don't hear anything further, I'll post v2 with dummy controls today or tomorrow.

Sorry, busy with other stuff.

Indicating only what really exists sounds cleaner. But I wonder how we 
would want to handle in general orders that are effectively non-existent?
Ryan Roberts July 16, 2024, 11:14 a.m. UTC | #12
On 16/07/2024 11:19, David Hildenbrand wrote:
> On 16.07.24 10:31, Ryan Roberts wrote:
>> On 13/07/2024 11:45, Ryan Roberts wrote:
>>> On 13/07/2024 02:08, David Hildenbrand wrote:
>>>> On 12.07.24 14:22, Lance Yang wrote:
>>>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>>> [...]
>>>>
>>>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>>>> we'll also have entries there that are effectively unused.
>>>
>>> Indeed, I mentioned that in the commit log :)
> 
> Well, I think it's more extreme than what you mentioned.
> 
> For example, shmem_enabled on arm64 with 64K is now effectively non-functional.
> Just like it will be for other orders in the anon-shmem case when the order
> exceeds MAX_PAGECACHE_ORDER.

Ahh I see what you are saying now; we already have precedent for non-functional
controls.

(Actually, looking at the code, it looks like the shmem stats will be
unconditionally exposed, but the shmem controls will only be exposed when
CONFIG_SHMEM is enabled. I guess that should be fixed - I'll post a patch).

> 
>>>
>>>>
>>>> Good question how we want to deal with that (stats are easy, but what about
>>>> when
>>>> we enable something? Maybe we should document that "enabled" is only effective
>>>> when supported).
>>>
>>> The documentation already says "If enabling multiple hugepage sizes, the kernel
>>> will select the most appropriate enabled size for a given allocation." for anon
>>> THP (and I've added similar wording for my as-yet-unposted patch to add controls
>>> for page cache folio sizes). So I think we could easily add dummy *enabled
>>> controls for all sizes, that can be written to and read back consistently, but
>>> the kernel just ignores them when deciding what size to use. It would also
>>> simplify the code that populates the controls.
>>>
>>> Personally though, I'm not convinced of the value of trying to make the controls
>>> for every size look identical. What's the real value to the user to pretend that
>>> they can select a size that they cannot? What happens when we inevitably want to
>>> add some new control in future which only applies to select sizes and there is
>>> no good way to fake it for the other sizes? Why can't user space just be
>>> expected to rely on the existence of the files rather than on the existence of
>>> the directories?
>>>
>>> As always, I'll go with the majority, but just wanted to register my opinion.
>>
>> Should I assume from the lack of reply on this that everyone else is in favour
>> of adding dummy controls so that all sizes have the same set of controls? If I
>>> don't hear anything further, I'll post v2 with dummy controls today or tomorrow.
> 
> Sorry, busy with other stuff.
> 
> Indicating only what really exists sounds cleaner. But I wonder how we would
> want to handle in general orders that are effectively non-existent?

I'm not following your distinction between orders that don't "really exist" and
orders that are "effectively non-existant".

I guess the real supported orders are:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 1
    max order: MAX_PAGECACHE_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
    max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
  file:
    min order: 1
    max order: MAX_PAGECACHE_ORDER
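
(Written as order bitmasks, in the style of the existing
THP_ORDERS_ALL_ANON, that first table might look something like the
sketch below; FILE_ORDERS_SUPPORTED and ALL_ORDERS_SUPPORTED are
made-up names and tmpfs is elided:)

  /* anon: orders 2..PMD_ORDER; this one already exists in huge_mm.h. */
  #define ANON_ORDERS_SUPPORTED   THP_ORDERS_ALL_ANON

  /* file: orders 1..MAX_PAGECACHE_ORDER. */
  #define FILE_ORDERS_SUPPORTED   ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))

  /* The union decides which hugepages-*kB directories get created. */
  #define ALL_ORDERS_SUPPORTED    (ANON_ORDERS_SUPPORTED | FILE_ORDERS_SUPPORTED)

  /* e.g. at init time: */
  for (order = 1; order <= PMD_ORDER; order++) {
          if (BIT(order) & ALL_ORDERS_SUPPORTED)
                  thpsize_create(order, parent);
  }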

But today, controls and stats are exposed for:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 2
    max order: PMD_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER
    max order: PMD_ORDER
  file:
    min order: Nothing yet (this patch proposes 1)
    max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)

So I think there is definitely a bug for shmem where the minimum order control
should be order-1 but it's currently order-2.

I also wonder about PUD-order for DAX? We don't currently have a stat/control.
If we wanted to add it in future, if we take the "expose all stats/controls for
all orders" approach, we would end up extending all the way to PUD-order and all
the orders between PMD and PUD would be dummy for all memory types. That really
starts to feel odd, so I still favour only populating what's really supported.

I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
continue with the approach of "indicating only what really exists" for v2.

Shout if you disagree.

Thanks,
Ryan
David Hildenbrand July 17, 2024, 8:02 a.m. UTC | #13
>> Sorry, busy with other stuff.
>>
>> Indicating only what really exists sounds cleaner. But I wonder how we would
>> want to handle in general orders that are effectively non-existent?
> 
> I'm not following your distinction between orders that don't "really exist" and
> orders that are "effectively non-existant".

I'm questioning whether there should be a distinction at all. We should 
just hide what is either non-existent (not implemented) or non-functional.

> 
> I guess the real supported orders are:
> 
>    anon:
>      min order: 2
>      max order: PMD_ORDER
>    anon-shmem:
>      min order: 1
>      max order: MAX_PAGECACHE_ORDER
>    tmpfs-shmem:
>      min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>      max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>    file:
>      min order: 1
>      max order: MAX_PAGECACHE_ORDER

That's my understanding. But not sure about anon-shmem really supporting 
order-1, maybe we do.

> 
> But today, controls and stats are exposed for:
> 
>    anon:
>      min order: 2
>      max order: PMD_ORDER
>    anon-shmem:
>      min order: 2
>      max order: PMD_ORDER
>    tmpfs-shmem:
>      min order: PMD_ORDER
>      max order: PMD_ORDER
>    file:
>      min order: Nothing yet (this patch proposes 1)
>      max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
> 
> So I think there is definitely a bug for shmem where the minimum order control
> should be order-1 but it's currently order-2.

Maybe, did not play with that yet. Likely order-1 will work. (although 
probably of questionable use :) )

> 
> I also wonder about PUD-order for DAX? We don't currently have a stat/control.
> If we wanted to add it in future, if we take the "expose all stats/controls for
> all orders" approach, we would end up extending all the way to PUD-order and all
> the orders between PMD and PUD would be dummy for all memory types. That really
> starts to feel odd, so I still favour only populating what's really supported.

I would go further and say that calling the fsdax thing a THP is 
borderline wrong and we should not expose any new toggles for it that way.

It really behaves much more like hugetlb folios that can be PTE-mapped 
... we cannot split these things, and they are not allocated from the 
buddy. So I wouldn't worry about fsdax for now.

fsdax support for compound pages (now large folios) probably never 
should have been glued to any THP toggle.

> 
> I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
> continue with the approach of "indicating only what really exists" for v2.
> 
> Shout if you disagree.

Makes sense.
Ryan Roberts July 17, 2024, 8:29 a.m. UTC | #14
On 17/07/2024 09:02, David Hildenbrand wrote:
>>> Sorry, busy with other stuff.
>>>
>>> Indicating only what really exists sounds cleaner. But I wonder how we would
>>> want to handle in general orders that are effectively non-existent?
>>
>> I'm not following your distinction between orders that don't "really exist" and
>> orders that are "effectively non-existant".
> 
> I'm questioning whether there should be a distinction at all. We should just
> hide what is either non-existent (not implemented) or non-functional.

Great, we are on the same page.

> 
>>
>> I guess the real supported orders are:
>>
>>    anon:
>>      min order: 2
>>      max order: PMD_ORDER
>>    anon-shmem:
>>      min order: 1
>>      max order: MAX_PAGECACHE_ORDER
>>    tmpfs-shmem:
>>      min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>      max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>    file:
>>      min order: 1
>>      max order: MAX_PAGECACHE_ORDER
> 
> That's my understanding. But not sure about anon-shmem really supporting
> order-1, maybe we do.

Oh, I thought we only had the restriction for anon folios now (due to deferred
split queue), so I assumed it would just work. With Gavin's
THP_ORDERS_ALL_FILE_DEFAULT change, that certainly implies that shmem must
support order-1. If it doesn't then we might want to tidy that further.

Baolin, perhaps you can confirm either way?

> 
>>
>> But today, controls and stats are exposed for:
>>
>>    anon:
>>      min order: 2
>>      max order: PMD_ORDER
>>    anon-shmem:
>>      min order: 2
>>      max order: PMD_ORDER
>>    tmpfs-shmem:
>>      min order: PMD_ORDER
>>      max order: PMD_ORDER
>>    file:
>>      min order: Nothing yet (this patch proposes 1)
>>      max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>
>> So I think there is definitely a bug for shmem where the minimum order control
>> should be order-1 but it's currently order-2.
> 
> Maybe, did not play with that yet. Likely order-1 will work. (although probably
> of questionable use :) )

You might have to expand on why it's of "questionable use". I'd assume it has the
same amount of value as using order-1 for regular page cache pages? i.e. half
the number of objects to manage for the same amount of memory.

> 
>>
>> I also wonder about PUD-order for DAX? We don't currently have a stat/control.
>> If we wanted to add it in future, if we take the "expose all stats/controls for
>> all orders" approach, we would end up extending all the way to PUD-order and all
>> the orders between PMD and PUD would be dummy for all memory types. That really
>> starts to feel odd, so I still favour only populating what's really supported.
> 
> I would go further and say that calling the fsdax thing a THP is borderline
> wrong and we should not expose any new toggles for it that way.
> 
> It really behaves much more like hugetlb folios that can be PTE-mapped ... we
> cannot split these things, and they are not allocated from the buddy. So I
> wouldn't worry about fsdax for now.
> 
> fsdax support for compound pages (now large folios) probably never should have
> been glued to any THP toggle.

Yeah fair enough. I wasn't really arguing for adding any dax controls; I was
just trying to think of examples as to why adding dummy controls might be a bad
idea.

> 
>>
>> I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
>> continue with the approach of "indicating only what really exists" for v2.
>>
>> Shout if you disagree.
> 
> Makes sense.

Excellent. I posted v2, which has these changes, yesterday afternoon. :)
David Hildenbrand July 17, 2024, 8:44 a.m. UTC | #15
>>> I guess the real supported orders are:
>>>
>>>     anon:
>>>       min order: 2
>>>       max order: PMD_ORDER
>>>     anon-shmem:
>>>       min order: 1
>>>       max order: MAX_PAGECACHE_ORDER
>>>     tmpfs-shmem:
>>>       min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>>       max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>>     file:
>>>       min order: 1
>>>       max order: MAX_PAGECACHE_ORDER
>>
>> That's my understanding. But not sure about anon-shmem really supporting
>> order-1, maybe we do.
> 
> Oh, I thought we only had the restriction for anon folios now (due to deferred
> split queue), so I assumed it would just work. With Gavin's
> THP_ORDERS_ALL_FILE_DEFAULT change, that certainly implies that shmem must
> support order-1. If it doesn't then we might want to tidy that further.
> 
> Baolin, perhaps you can confirm either way?

Currently there would not have been a way to enable it, right? (maybe 
I'm wrong)

> 
>>
>>>
>>> But today, controls and stats are exposed for:
>>>
>>>     anon:
>>>       min order: 2
>>>       max order: PMD_ORDER
>>>     anon-shmem:
>>>       min order: 2
>>>       max order: PMD_ORDER
>>>     tmpfs-shmem:
>>>       min order: PMD_ORDER
>>>       max order: PMD_ORDER
>>>     file:
>>>       min order: Nothing yet (this patch proposes 1)
>>>       max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>
>>> So I think there is definitely a bug for shmem where the minimum order control
>>> should be order-1 but it's currently order-2.
>>
>> Maybe, did not play with that yet. Likely order-1 will work. (although probably
>> of questionable use :) )
> 
> You might have to expand on why it's of "questionable use". I'd assume it has the
> same amount of value as using order-1 for regular page cache pages? i.e. half
> the number of objects to manage for the same amount of memory.

order-1 was recently added for the pagecache to get some device setups 
running (IIRC, where we cannot use order-0, because device blocksize > 
PAGE_SIZE).

You might be right about "half the number of objects", but likely just 
going for order-2, order-3, order-4 ... for shmem might be even better. 
And simply falling back to order-0 when you cannot get the larger orders.
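
(Sketching the kind of descending-order fallback I mean; illustrative
only, not the actual shmem allocation path:)

  static struct folio *alloc_with_fallback(gfp_t gfp, unsigned int order)
  {
          struct folio *folio;

          /* Don't try too hard for the larger orders... */
          for (; order > 0; order--) {
                  folio = folio_alloc(gfp | __GFP_NORETRY | __GFP_NOWARN,
                                      order);
                  if (folio)
                          return folio;
          }

          /* ... and finally fall back to order-0. */
          return folio_alloc(gfp, 0);
  }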

I could have sworn you mentioned something like that in your 
"configurable orders for pagecache" RFC that I only briefly skimmed so 
far :P

... only enabling "order-1" and none of the other orders for shmem 
sounds rather "interesting".

But yeah, maybe there is valid use for it, so I'm all for allowing it if 
it can be done.

> 
>>
>>>
>>> I also wonder about PUD-order for DAX? We don't currently have a stat/control.
>>> If we wanted to add it in future, if we take the "expose all stats/controls for
>>> all orders" approach, we would end up extending all the way to PUD-order and all
>>> the orders between PMD and PUD would be dummy for all memory types. That really
>>> starts to feel odd, so I still favour only populating what's really supported.
>>
>> I would go further and say that calling the fsdax thing a THP is borderline
>> wrong and we should not expose any new toggles for it that way.
>>
>> It really behaves much more like hugetlb folios that can be PTE-mapped ... we
>> cannot split these things, and they are not allocated from the buddy. So I
>> wouldn't worry about fsdax for now.
>>
>> fsdax support for compound pages (now large folios) probably never should have
>> been glued to any THP toggle.
> 
> Yeah fair enough. I wasn't really arguing for adding any dax controls; I was
> just trying to think of examples as to why adding dummy controls might be a bad
> idea.

Yes.

>>
>>>
>>> I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
>>> continue with the approach of "indicating only what really exists" for v2.
>>>
>>> Shout if you disagree.
>>
>> Makes sense.
> 
> Excellent. I posted v2, which has these changes, yesterday afternoon. :)

Yes, still digging through mails ... in-between having roughly 1000 
meetings a day :)
Ryan Roberts July 17, 2024, 9:50 a.m. UTC | #16
On 17/07/2024 09:44, David Hildenbrand wrote:
>>>> I guess the real supported orders are:
>>>>
>>>>     anon:
>>>>       min order: 2
>>>>       max order: PMD_ORDER
>>>>     anon-shmem:
>>>>       min order: 1
>>>>       max order: MAX_PAGECACHE_ORDER
>>>>     tmpfs-shmem:
>>>>       min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>>>       max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
>>>>     file:
>>>>       min order: 1
>>>>       max order: MAX_PAGECACHE_ORDER
>>>
>>> That's my understanding. But not sure about anon-shmem really supporting
>>> order-1, maybe we do.
>>
>> Oh, I thought we only had the restriction for anon folios now (due to deferred
>> split queue), so I assumed it would just work. With Gavin's
>> THP_ORDERS_ALL_FILE_DEFAULT change, that certainly implies that shmem must
>> support order-1. If it doesn't then we might want to tidy that further.
>>
>> Baolin, perhaps you can confirm either way?
> 
> Currently there would not have been a way to enable it, right? (maybe I'm wrong)

__thp_vma_allowable_orders() doesn't do anything special for shmem if TVA_IN_PF
is set, so I guess it could conceivably return order-1 in that path. Not sure if
it ever gets called that way for shmem though - I don't think so. But agree that
shmem_allowable_huge_orders() will not currently return order-1.

> 
>>
>>>
>>>>
>>>> But today, controls and stats are exposed for:
>>>>
>>>>     anon:
>>>>       min order: 2
>>>>       max order: PMD_ORDER
>>>>     anon-shmem:
>>>>       min order: 2
>>>>       max order: PMD_ORDER
>>>>     tmpfs-shmem:
>>>>       min order: PMD_ORDER
>>>>       max order: PMD_ORDER
>>>>     file:
>>>>       min order: Nothing yet (this patch proposes 1)
>>>>       max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>>
>>>> So I think there is definitely a bug for shmem where the minimum order control
>>>> should be order-1 but it's currently order-2.
>>>
>>> Maybe, did not play with that yet. Likely order-1 will work. (although probably
>>> of questionable use :) )
>>
>> You might have to expand on why it's of "questionable use". I'd assume it has the
>> same amount of value as using order-1 for regular page cache pages? i.e. half
>> the number of objects to manage for the same amount of memory.
> 
> order-1 was recently added for the pagecache to get some device setups running
> (IIRC, where we cannot use order-0, because device blocksize > PAGE_SIZE).
> 
> You might be right about "half the number of objects", but likely just going for
> order-2, order-3, order-4 ... for shmem might be even better. And simply falling
> back to order-0 when you cannot get the larger orders.

Sure, but then you're into the territory of baking in policy. Remember that
originally I was only interested in 64K but the consensus was to expose all the
sizes. Same argument applies to 8K; expose it and let others decide policy.

> 
> I could have sworn you mentioned something like that in your "configurable
> orders for pagecache" RFC that I only briefly skimmed so far :P

I'm exposing the 8K control for pagecache in that series.

> 
> ... only enabling "order-1" and none of the other orders for shmem sounds rather
> "interesting".
> 
> But yeah, maybe there is valid use for it, so I'm all for allowing it if it can
> be done.
> 
>>
>>>
>>>>
>>>> I also wonder about PUD-order for DAX? We don't currently have a stat/control.
>>>> If we wanted to add it in future, if we take the "expose all stats/controls for
>>>> all orders" approach, we would end up extending all the way to PUD-order and
>>>> all
>>>> the orders between PMD and PUD would be dummy for all memory types. That really
>>>> starts to feel odd, so I still favour only populating what's really supported.
>>>
>>> I would go further and say that calling the fsdax thing a THP is borderline
>>> wrong and we should not expose any new toggles for it that way.
>>>
>>> It really behaves much more like hugetlb folios that can be PTE-mapped ... we
>>> cannot split these things, and they are not allocated from the buddy. So I
>>> wouldn't worry about fsdax for now.
>>>
>>> fsdax support for compound pages (now large folios) probably never should have
>>> been glued to any THP toggle.
>>
>> Yeah fair enough. I wasn't really arguing for adding any dax controls; I was
>> just trying to think of examples as to why adding dummy controls might be a bad
>> idea.
> 
> Yes.
> 
>>>
>>>>
>>>> I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
>>>> continue with the approach of "indicating only what really exists" for v2.
>>>>
>>>> Shout if you disagree.
>>>
>>> Makes sense.
>>
>> Excellent. I posted v2, which has these changes, yesterday afternoon. :)
> 
> Yes, still digging through mails ... in-between having roughly 1000 meetings a
> day :)

No problem. You're in demand. I can wait. :)
David Hildenbrand July 17, 2024, 10:03 a.m. UTC | #17
>>>>>
>>>>> But today, controls and stats are exposed for:
>>>>>
>>>>>      anon:
>>>>>        min order: 2
>>>>>        max order: PMD_ORDER
>>>>>      anon-shmem:
>>>>>        min order: 2
>>>>>        max order: PMD_ORDER
>>>>>      tmpfs-shmem:
>>>>>        min order: PMD_ORDER
>>>>>        max order: PMD_ORDER
>>>>>      file:
>>>>>        min order: Nothing yet (this patch proposes 1)
>>>>>        max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>>>
>>>>> So I think there is definitely a bug for shmem where the minimum order control
>>>>> should be order-1 but it's currently order-2.
>>>>
>>>> Maybe, did not play with that yet. Likely order-1 will work. (although probably
>>>> of questionable use :) )
>>>
>>> You might have to expand on why it's of "questionable use". I'd assume it has the
>>> same amount of value as using order-1 for regular page cache pages? i.e. half
>>> the number of objects to manage for the same amount of memory.
>>
>> order-1 was recently added for the pagecache to get some device setups running
>> (IIRC, where we cannot use order-0, because device blocksize > PAGE_SIZE).
>>
>> You might be right about "half the number of objects", but likely just going for
>> order-2, order-3, order-4 ... for shmem might be even better. And simply falling
>> back to order-0 when you cannot get the larger orders.
> 
> Sure, but then you're into the territory of baking in policy. Remember that
> originally I was only interested in 64K but the consensus was to expose all the
> sizes. Same argument applies to 8K; expose it and let others decide policy.

I don't disagree. The point I'm trying to make is that so far there was 
no strong evidence that it is really required. Support for the 
pagecache had a different motivation for these special devices.

But again, I agree that we should just make it consistent and allow for 
it. :)
Ryan Roberts July 17, 2024, 10:18 a.m. UTC | #18
On 17/07/2024 11:03, David Hildenbrand wrote:
>>>>>>
>>>>>> But today, controls and stats are exposed for:
>>>>>>
>>>>>>      anon:
>>>>>>        min order: 2
>>>>>>        max order: PMD_ORDER
>>>>>>      anon-shmem:
>>>>>>        min order: 2
>>>>>>        max order: PMD_ORDER
>>>>>>      tmpfs-shmem:
>>>>>>        min order: PMD_ORDER
>>>>>>        max order: PMD_ORDER
>>>>>>      file:
>>>>>>        min order: Nothing yet (this patch proposes 1)
>>>>>>        max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>>>>
>>>>>> So I think there is definitely a bug for shmem where the minimum order
>>>>>> control
>>>>>> should be order-1 but its currently order-2.
>>>>>
>>>>> Maybe, did not play with that yet. Likely order-1 will work. (although
>>>>> probably
>>>>> of questionable use :) )
>>>>
>>>> You might have to expand on why it's of "questionable use". I'd assume it has
>>>> the
>>>> same amount of value as using order-1 for regular page cache pages? i.e. half
>>>> the number of objects to manage for the same amount of memory.
>>>
>>> order-1 was recently added for the pagecache to get some device setups running
>>> (IIRC, where we cannot use order-0, because device blocksize > PAGE_SIZE).
>>>
>>> You might be right about "half the number of objects", but likely just going for
>>> order-2, order-3, order-4 ... for shmem might be even better. And simply falling
>>> back to order-0 when you cannot get the larger orders.
>>
>> Sure, but then you're into the territory of baking in policy. Remember that
>> originally I was only interested in 64K but the consensus was to expose all the
>> sizes. Same argument applies to 8K; expose it and let others decide policy.
> 
> I don't disagree. The point I'm trying to make is that so far there was
> no strong evidence that it is really required. Support for the pagecache had
> a different motivation for these special devices.

Sure, but there was no clear need for anon mTHP orders other than order-2 and
order-4 (for arm64's HPA and contpte, respectively), but we still chose to
expose all the others.

> 
> But again, I agree that we should just make it consistent and allow for it. :)

Yes, we both agree, so I'll stop arguing now :)

Thanks, as always, for the discussion!
David Hildenbrand July 17, 2024, 10:25 a.m. UTC | #19
On 17.07.24 12:18, Ryan Roberts wrote:
> On 17/07/2024 11:03, David Hildenbrand wrote:
>>>>>>>
>>>>>>> But today, controls and stats are exposed for:
>>>>>>>
>>>>>>>       anon:
>>>>>>>         min order: 2
>>>>>>>         max order: PMD_ORDER
>>>>>>>       anon-shmem:
>>>>>>>         min order: 2
>>>>>>>         max order: PMD_ORDER
>>>>>>>       tmpfs-shmem:
>>>>>>>         min order: PMD_ORDER
>>>>>>>         max order: PMD_ORDER
>>>>>>>       file:
>>>>>>>         min order: Nothing yet (this patch proposes 1)
>>>>>>>         max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>>>>>
>>>>>>> So I think there is definitely a bug for shmem where the minimum order
>>>>>>> control
>>>>>>> should be order-1 but it's currently order-2.
>>>>>>
>>>>>> Maybe, did not play with that yet. Likely order-1 will work. (although
>>>>>> probably
>>>>>> of questionable use :) )
>>>>>
>>>>> You might have to expand on why it's of "questionable use". I'd assume it has
>>>>> the
>>>>> same amount of value as using order-1 for regular page cache pages? i.e. half
>>>>> the number of objects to manage for the same amount of memory.
>>>>
>>>> order-1 was recently added for the pagecache to get some device setups running
>>>> (IIRC, where we cannot use order-0, because device blocksize > PAGE_SIZE).
>>>>
>>>> You might be right about "half the number of objects", but likely just going for
>>>> order-2, order-3, order-4 ... for shmem might be even better. And simply falling
>>>> back to order-0 when you cannot get the larger orders.
>>>
>>> Sure, but then you're into the territory of baking in policy. Remember that
>>> originally I was only interested in 64K but the consensus was to expose all the
>>> sizes. Same argument applies to 8K; expose it and let others decide policy.
>>
>> I don't disagree. The point I'm trying to make is that so far there was
>> no strong evidence that it is really required. Support for the pagecache had
>> a different motivation for these special devices.
> 
> Sure, but there was no clear need for anon mTHP orders other than order-2 and
> order-4 (for arm64's HPA and contpte, respectively), yet we still chose to
> expose all the others.

order-2 and order-3 are valuable for AMD EPYC (depending on the 
generation 16 vs. 32 KiB coalescing).

But in general, at least for me, it's easier to argue why larger orders 
make more sense than very tiny ones.

For example, order-5 can be mapped using cont-pte as well and you get 
roughly half the memory allocation+page fault overhead compared to order-4.

order-1? No TLB optimization at least on any current HW I know of.

But I believe we're in violent agreement here :)
Ryan Roberts July 17, 2024, 10:48 a.m. UTC | #20
On 17/07/2024 11:25, David Hildenbrand wrote:
> On 17.07.24 12:18, Ryan Roberts wrote:
>> On 17/07/2024 11:03, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> But today, controls and stats are exposed for:
>>>>>>>>
>>>>>>>>       anon:
>>>>>>>>         min order: 2
>>>>>>>>         max order: PMD_ORDER
>>>>>>>>       anon-shmem:
>>>>>>>>         min order: 2
>>>>>>>>         max order: PMD_ORDER
>>>>>>>>       tmpfs-shmem:
>>>>>>>>         min order: PMD_ORDER
>>>>>>>>         max order: PMD_ORDER
>>>>>>>>       file:
>>>>>>>>         min order: Nothing yet (this patch proposes 1)
>>>>>>>>         max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)
>>>>>>>>
>>>>>>>> So I think there is definitely a bug for shmem where the minimum order
>>>>>>>> control
>>>>>>>> should be order-1 but it's currently order-2.
>>>>>>>
>>>>>>> Maybe, did not play with that yet. Likely order-1 will work. (although
>>>>>>> probably
>>>>>>> of questionable use :) )
>>>>>>
>>>>>> You might have to expand on why it's of "questionable use". I'd assume it has
>>>>>> the
>>>>>> same amount of value as using order-1 for regular page cache pages? i.e. half
>>>>>> the number of objects to manage for the same amount of memory.
>>>>>
>>>>> order-1 was recently added for the pagecache to get some device setups running
>>>>> (IIRC, where we cannot use order-0, because device blocksize > PAGE_SIZE).
>>>>>
>>>>> You might be right about "half the number of objects", but likely just
>>>>> going for
>>>>> order-2, order-3, order-4 ... for shmem might be even better. And simply
>>>>> falling
>>>>> back to order-0 when you cannot get the larger orders.
>>>>
>>>> Sure, but then you're into the territory of baking in policy. Remember that
>>>> originally I was only interested in 64K but the consensus was to expose all the
>>>> sizes. Same argument applies to 8K; expose it and let others decide policy.
>>>
>>> I don't disagree. The point I'm trying to make is that so far there was
>>> no strong evidence that it is really required. Support for the pagecache had
>>> a different motivation for these special devices.
>>
>> Sure, but there was no clear need for anon mTHP orders other than order-2 and
>> order-4 (for arm64's HPA and contpte, respectively), yet we still chose to
>> expose all the others.
> 
> order-2 and order-3 are valuable for AMD EPYC (depending on the generation 16
> vs. 32 KiB coalescing).
> 
> But in general, at least for me, it's easier to argue why larger orders make
> more sense than very tiny ones.
> 
> For example, order-5 can be mapped using cont-pte as well and you get roughly
> half the memory allocation+page fault overhead compared to order-4.
> 
> order-1? No TLB optimization at least on any current HW I know of.

I believe there are some variants of HPA that coalesce "up to" 4 pages, meaning
2 pages (or 3 or 4) could be coalesced into a single TLB entry. But I'm not 100%
sure on that.

> 
> But I believe we're in violent agreement here :)
>
Baolin Wang July 22, 2024, 3:52 a.m. UTC | #21
On 2024/7/14 17:05, Ryan Roberts wrote:
> On 13/07/2024 13:54, Baolin Wang wrote:
>>
>>
>> On 2024/7/13 19:00, Ryan Roberts wrote:
>>> [...]
>>>
>>>>> [...]
>>>>> +err_put:
>>>>
>>>> IIUC, I think you should call 'sysfs_remove_group' to remove the group before
>>>> putting the kobject.
>>>
>>> Are you sure about that? As I understood it, sysfs_create_group() was
>>> conceptually modifying the state of the kobj, so when the kobj gets destroyed,
>>> all its state is tidied up. __kobject_del() (called on the last kobject_put())
>>> calls sysfs_remove_groups() and tidies up the sysfs state as far as I can see?
>>
>> IIUC, __kobject_del() only removes the ktype default groups by
>> 'sysfs_remove_groups(kobj, ktype->default_groups)', but your created groups are
>> not added into the ktype->default_groups. That means you should manually remove
>> them, or am I missing something?
> 
> That was also putting doubt in my mind. But the sample at
> samples/kobject/kobject-example.c does not call sysfs_remove_group(). It just
> calls sysfs_create_group() in example_init() and calls kobject_put() in
> example_exit(). So I think that's the correct pattern.
> 
> Looking at the code more closely, sysfs_create_group() just creates files for
> each of the attributes in the group. __kobject_del() calls sysfs_remove_dir(),
> whose comment states "we remove any files in the directory before we remove the
> directory" so I'm pretty sure sysfs_remove_group() is not required.

Thanks for the explanation, and I think you are right after checking the 
code again. Sorry for the noise.
Ryan Roberts July 22, 2024, 7:36 a.m. UTC | #22
On 22/07/2024 04:52, Baolin Wang wrote:
> 
> 
> On 2024/7/14 17:05, Ryan Roberts wrote:
>> On 13/07/2024 13:54, Baolin Wang wrote:
>>>
>>>
>>> On 2024/7/13 19:00, Ryan Roberts wrote:
>>>> [...]
>>>>
>>>>>> +static int thpsize_create(int order, struct kobject *parent)
>>>>>>     {
>>>>>>         unsigned long size = (PAGE_SIZE << order) / SZ_1K;
>>>>>> +    struct thpsize_child *stats;
>>>>>>         struct thpsize *thpsize;
>>>>>>         int ret;
>>>>>>     +    /*
>>>>>> +     * Each child object (currently only "stats" directory) holds a
>>>>>> +     * reference to the top-level thpsize object, so we can drop our ref to
>>>>>> +     * the top-level once stats is setup. Then we just need to drop a
>>>>>> +     * reference on any children to clean everything up. We can't just use
>>>>>> +     * the attr group name for the stats subdirectory because there may be
>>>>>> +     * multiple attribute groups to populate inside stats and overlaying
>>>>>> +     * using the name property isn't supported in that way; each attr group
>>>>>> +     * name, if provided, must be unique in the parent directory.
>>>>>> +     */
>>>>>> +
>>>>>>         thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
>>>>>> -    if (!thpsize)
>>>>>> -        return ERR_PTR(-ENOMEM);
>>>>>> +    if (!thpsize) {
>>>>>> +        ret = -ENOMEM;
>>>>>> +        goto err;
>>>>>> +    }
>>>>>> +    thpsize->order = order;
>>>>>>           ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
>>>>>>                        "hugepages-%lukB", size);
>>>>>>         if (ret) {
>>>>>>             kfree(thpsize);
>>>>>> -        return ERR_PTR(ret);
>>>>>> +        goto err;
>>>>>>         }
>>>>>>     -    ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>>>>>> -    if (ret) {
>>>>>> +    stats = kzalloc(sizeof(*stats), GFP_KERNEL);
>>>>>> +    if (!stats) {
>>>>>>             kobject_put(&thpsize->kobj);
>>>>>> -        return ERR_PTR(ret);
>>>>>> +        ret = -ENOMEM;
>>>>>> +        goto err;
>>>>>>         }
>>>>>>     -    ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
>>>>>> +    ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
>>>>>> +                   &thpsize->kobj, "stats");
>>>>>> +    kobject_put(&thpsize->kobj);
>>>>>>         if (ret) {
>>>>>> -        kobject_put(&thpsize->kobj);
>>>>>> -        return ERR_PTR(ret);
>>>>>> +        kfree(stats);
>>>>>> +        goto err;
>>>>>>         }
>>>>>>     -    thpsize->order = order;
>>>>>> -    return thpsize;
>>>>>> +    if (BIT(order) & THP_ORDERS_ALL_ANON) {
>>>>>> +        ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
>>>>>> +        if (ret)
>>>>>> +            goto err_put;
>>>>>> +
>>>>>> +        ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
>>>>>> +        if (ret)
>>>>>> +            goto err_put;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
>>>>>> +        ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
>>>>>> +        if (ret)
>>>>>> +            goto err_put;
>>>>>> +    }
>>>>>> +
>>>>>> +    list_add(&stats->node, &thpsize_child_list);
>>>>>> +    return 0;
>>>>>> +err_put:
>>>>>
>>>>> IIUC, I think you should call 'sysfs_remove_group' to remove the group before
>>>>> putting the kobject.
>>>>
>>>> Are you sure about that? As I understood it, sysfs_create_group() was
>>>> conceptually modifying the state of the kobj, so when the kobj gets destroyed,
>>>> all its state is tidied up. __kobject_del() (called on the last kobject_put())
>>>> calls sysfs_remove_groups() and tidies up the sysfs state as far as I can see?
>>>
>>> IIUC, __kobject_del() only removes the ktype default groups by
>>> 'sysfs_remove_groups(kobj, ktype->default_groups)', but your created groups are
>>> not added to ktype->default_groups. That means you should manually remove
>>> them, or am I missing something?
>>
>> That was also putting doubt in my mind. But the sample at
>> samples/kobject/kobject-example.c does not call sysfs_remove_group(). It just
>> calls sysfs_create_group() in example_init() and calls kobject_put() in
>> example_exit(). So I think that's the correct pattern.
>>
>> Looking at the code more closely, sysfs_create_group() just creates files for
>> each of the attributes in the group. __kobject_del() calls sysfs_remove_dir(),
>> whose comment states "we remove any files in the directory before we remove the
>> directory", so I'm pretty sure sysfs_remove_group() is not required.
> 
> Thanks for the explanation, and I think you are right after checking the code
> again. Sorry for the noise.

No problem, thanks for raising it anyway; TBH, I wasn't completely sure when I
wrote it initially, so it's good to have a clear resolution.
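
For readers following along, here is a compressed sketch of the lifecycle the
thread converges on, modeled on samples/kobject/kobject-example.c and on the
thpsize_create() pattern in the patch below. All demo_* names are illustrative,
and this is only a sketch of the refcounting and teardown rules under those
assumptions, not code from the series:

#include <linux/err.h>
#include <linux/kobject.h>
#include <linux/slab.h>
#include <linux/sysfs.h>

struct demo_parent { struct kobject kobj; };
struct demo_child  { struct kobject kobj; };

static void demo_parent_release(struct kobject *kobj)
{
	kfree(container_of(kobj, struct demo_parent, kobj));
}

static void demo_child_release(struct kobject *kobj)
{
	kfree(container_of(kobj, struct demo_child, kobj));
}

static const struct kobj_type demo_parent_ktype = {
	.release = demo_parent_release,
	.sysfs_ops = &kobj_sysfs_ops,
};

static const struct kobj_type demo_child_ktype = {
	.release = demo_child_release,
	.sysfs_ops = &kobj_sysfs_ops,
};

static struct demo_child *demo_create(struct kobject *sysfs_parent,
				      const struct attribute_group *grp)
{
	struct demo_parent *p;
	struct demo_child *c;
	int ret;

	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return ERR_PTR(-ENOMEM);

	ret = kobject_init_and_add(&p->kobj, &demo_parent_ktype,
				   sysfs_parent, "parent");
	if (ret) {
		/*
		 * Per the kobject docs, a failed kobject_init_and_add()
		 * is cleaned up with kobject_put(), which invokes
		 * ->release.
		 */
		kobject_put(&p->kobj);
		return ERR_PTR(ret);
	}

	c = kzalloc(sizeof(*c), GFP_KERNEL);
	if (!c) {
		kobject_put(&p->kobj);
		return ERR_PTR(-ENOMEM);
	}

	/*
	 * Adding c under p takes a reference on p, so the creator's own
	 * ref on p can be dropped unconditionally afterwards.
	 */
	ret = kobject_init_and_add(&c->kobj, &demo_child_ktype,
				   &p->kobj, "child");
	kobject_put(&p->kobj);
	if (ret) {
		kobject_put(&c->kobj);
		return ERR_PTR(ret);
	}

	ret = sysfs_create_group(&c->kobj, grp);
	if (ret) {
		/*
		 * No sysfs_remove_group() needed: the final put ends in
		 * __kobject_del() -> sysfs_remove_dir(), which removes any
		 * files in the directory before removing the directory.
		 */
		kobject_put(&c->kobj);
		return ERR_PTR(ret);
	}

	/*
	 * Teardown later is a single kobject_put(&c->kobj): dropping the
	 * child's last ref removes its group files and directory, then
	 * drops the parent ref taken above, releasing p as well.
	 */
	return c;
}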
Patch

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..d4857e457add 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -512,6 +512,19 @@  shmem_fallback_charge
 	falls back to using small pages even though the allocation was
 	successful.
 
+file_alloc
+	is incremented every time a file huge page is successfully
+	allocated.
+
+file_fallback
+	is incremented if a file huge page is attempted to be allocated
+	but fails and instead falls back to using small pages.
+
+file_fallback_charge
+	is incremented if a file huge page cannot be charged and instead
+	falls back to using small pages even though the allocation was
+	successful.
+
 split
 	is incremented every time a huge page is successfully split into
 	smaller orders. This can happen for a variety of reasons but a
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cb93b9009ce4..b4fba11976f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -117,6 +117,9 @@  enum mthp_stat_item {
 	MTHP_STAT_SHMEM_ALLOC,
 	MTHP_STAT_SHMEM_FALLBACK,
 	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
+	MTHP_STAT_FILE_ALLOC,
+	MTHP_STAT_FILE_FALLBACK,
+	MTHP_STAT_FILE_FALLBACK_CHARGE,
 	MTHP_STAT_SPLIT,
 	MTHP_STAT_SPLIT_FAILED,
 	MTHP_STAT_SPLIT_DEFERRED,
@@ -292,11 +295,10 @@  unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 
 struct thpsize {
 	struct kobject kobj;
-	struct list_head node;
 	int order;
 };
 
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+#define to_thpsize(_kobj) container_of(_kobj, struct thpsize, kobj)
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6e2f72d03176..f45a1ba6d9b6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -365,6 +365,7 @@  static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 #define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
 #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
+#define PAGECACHE_LARGE_ORDERS	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
 
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
@@ -562,14 +563,26 @@  static inline void *detach_page_private(struct page *page)
 }
 
 #ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
+struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
 #else
-static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+static inline struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 {
 	return folio_alloc_noprof(gfp, order);
 }
 #endif
 
+static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+{
+	struct folio *folio;
+
+	folio = __filemap_alloc_folio_noprof(gfp, order);
+
+	if (!folio)
+		count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+
+	return folio;
+}
+
 #define filemap_alloc_folio(...)				\
 	alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 53d5d0410b51..131d514fca29 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -963,6 +963,8 @@  int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 	int ret;
 
 	ret = mem_cgroup_charge(folio, NULL, gfp);
+	count_mthp_stat(folio_order(folio),
+		ret ? MTHP_STAT_FILE_FALLBACK_CHARGE : MTHP_STAT_FILE_ALLOC);
 	if (ret)
 		return ret;
 
@@ -990,7 +992,7 @@  int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 EXPORT_SYMBOL_GPL(filemap_add_folio);
 
 #ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct folio *folio;
@@ -1007,7 +1009,7 @@  struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 	}
 	return folio_alloc_noprof(gfp, order);
 }
-EXPORT_SYMBOL(filemap_alloc_folio_noprof);
+EXPORT_SYMBOL(__filemap_alloc_folio_noprof);
 #endif
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f9696c94e211..559553e2a662 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -452,8 +452,9 @@  static const struct attribute_group hugepage_attr_group = {
 
 static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
 static void thpsize_release(struct kobject *kobj);
+static void thpsize_child_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
-static LIST_HEAD(thpsize_list);
+static LIST_HEAD(thpsize_child_list);
 
 static ssize_t thpsize_enabled_show(struct kobject *kobj,
 				    struct kobj_attribute *attr, char *buf)
@@ -537,6 +538,18 @@  static const struct kobj_type thpsize_ktype = {
 	.sysfs_ops = &kobj_sysfs_ops,
 };
 
+static const struct kobj_type thpsize_child_ktype = {
+	.release = &thpsize_child_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+struct thpsize_child {
+	struct kobject kobj;
+	struct list_head node;
+};
+
+#define to_thpsize_child(_kobj) container_of(_kobj, struct thpsize_child, kobj)
+
 DEFINE_PER_CPU(struct mthp_stat, mthp_stats) = {{{0}}};
 
 static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
@@ -557,7 +570,7 @@  static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item)
 static ssize_t _name##_show(struct kobject *kobj,			\
 			struct kobj_attribute *attr, char *buf)		\
 {									\
-	int order = to_thpsize(kobj)->order;				\
+	int order = to_thpsize(kobj->parent)->order;			\
 									\
 	return sysfs_emit(buf, "%lu\n", sum_mthp_stat(order, _index));	\
 }									\
@@ -591,41 +604,93 @@  static struct attribute *stats_attrs[] = {
 };
 
 static struct attribute_group stats_attr_group = {
-	.name = "stats",
 	.attrs = stats_attrs,
 };
 
-static struct thpsize *thpsize_create(int order, struct kobject *parent)
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
+
+static struct attribute *file_stats_attrs[] = {
+	&file_alloc_attr.attr,
+	&file_fallback_attr.attr,
+	&file_fallback_charge_attr.attr,
+	NULL,
+};
+
+static struct attribute_group file_stats_attr_group = {
+	.attrs = file_stats_attrs,
+};
+
+static int thpsize_create(int order, struct kobject *parent)
 {
 	unsigned long size = (PAGE_SIZE << order) / SZ_1K;
+	struct thpsize_child *stats;
 	struct thpsize *thpsize;
 	int ret;
 
+	/*
+	 * Each child object (currently only "stats" directory) holds a
+	 * reference to the top-level thpsize object, so we can drop our ref to
+	 * the top-level once stats is setup. Then we just need to drop a
+	 * reference on any children to clean everything up. We can't just use
+	 * the attr group name for the stats subdirectory because there may be
+	 * multiple attribute groups to populate inside stats and overlaying
+	 * using the name property isn't supported in that way; each attr group
+	 * name, if provided, must be unique in the parent directory.
+	 */
+
 	thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
-	if (!thpsize)
-		return ERR_PTR(-ENOMEM);
+	if (!thpsize) {
+		ret = -ENOMEM;
+		goto err;
+	}
+	thpsize->order = order;
 
 	ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
 				   "hugepages-%lukB", size);
 	if (ret) {
 		kfree(thpsize);
-		return ERR_PTR(ret);
+		goto err;
 	}
 
-	ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
-	if (ret) {
+	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
+	if (!stats) {
 		kobject_put(&thpsize->kobj);
-		return ERR_PTR(ret);
+		ret = -ENOMEM;
+		goto err;
 	}
 
-	ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
+	ret = kobject_init_and_add(&stats->kobj, &thpsize_child_ktype,
+				   &thpsize->kobj, "stats");
+	kobject_put(&thpsize->kobj);
 	if (ret) {
-		kobject_put(&thpsize->kobj);
-		return ERR_PTR(ret);
+		kfree(stats);
+		goto err;
 	}
 
-	thpsize->order = order;
-	return thpsize;
+	if (BIT(order) & THP_ORDERS_ALL_ANON) {
+		ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
+		if (ret)
+			goto err_put;
+
+		ret = sysfs_create_group(&stats->kobj, &stats_attr_group);
+		if (ret)
+			goto err_put;
+	}
+
+	if (BIT(order) & PAGECACHE_LARGE_ORDERS) {
+		ret = sysfs_create_group(&stats->kobj, &file_stats_attr_group);
+		if (ret)
+			goto err_put;
+	}
+
+	list_add(&stats->node, &thpsize_child_list);
+	return 0;
+err_put:
+	kobject_put(&stats->kobj);
+err:
+	return ret;
 }
 
 static void thpsize_release(struct kobject *kobj)
@@ -633,10 +698,14 @@  static void thpsize_release(struct kobject *kobj)
 	kfree(to_thpsize(kobj));
 }
 
+static void thpsize_child_release(struct kobject *kobj)
+{
+	kfree(to_thpsize_child(kobj));
+}
+
 static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 {
 	int err;
-	struct thpsize *thpsize;
 	unsigned long orders;
 	int order;
 
@@ -665,16 +734,14 @@  static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 		goto remove_hp_group;
 	}
 
-	orders = THP_ORDERS_ALL_ANON;
+	orders = THP_ORDERS_ALL_ANON | PAGECACHE_LARGE_ORDERS;
 	order = highest_order(orders);
 	while (orders) {
-		thpsize = thpsize_create(order, *hugepage_kobj);
-		if (IS_ERR(thpsize)) {
+		err = thpsize_create(order, *hugepage_kobj);
+		if (err) {
 			pr_err("failed to create thpsize for order %d\n", order);
-			err = PTR_ERR(thpsize);
 			goto remove_all;
 		}
-		list_add(&thpsize->node, &thpsize_list);
 		order = next_order(&orders, order);
 	}
 
@@ -692,11 +759,11 @@  static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 
 static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
 {
-	struct thpsize *thpsize, *tmp;
+	struct thpsize_child *child, *tmp;
 
-	list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
-		list_del(&thpsize->node);
-		kobject_put(&thpsize->kobj);
+	list_for_each_entry_safe(child, tmp, &thpsize_child_list, node) {
+		list_del(&child->node);
+		kobject_put(&child->kobj);
 	}
 
 	sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
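
One note on the new PAGECACHE_LARGE_ORDERS mask, since its expression is a
little dense. The expansion below assumes MAX_PAGECACHE_ORDER is 9 (e.g.
x86-64 with 4K base pages); that value is configuration-dependent, so treat
the concrete numbers as an example only:

/*
 * PAGECACHE_LARGE_ORDERS, expanded for an assumed MAX_PAGECACHE_ORDER
 * of 9:
 *
 *   BIT(9 + 1) - 1              = 0x3ff  -> orders 0..9 set
 *   (BIT(9 + 1) - 1) & ~BIT(0)  = 0x3fe  -> orders 1..9 set
 *
 * i.e. a bitmask of every large (non-order-0) folio order the
 * pagecache can allocate, which hugepage_init_sysfs() ORs with
 * THP_ORDERS_ALL_ANON to decide which hugepages-*kB directories
 * to create.
 */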