
[v6,2/3] mm: zswap: zswap_store() extended to handle mTHP folios.

Message ID 20240829212705.6714-3-kanchana.p.sridhar@intel.com (mailing list archive)
State New
Series mm: ZSWAP swap-out of mTHP folios

Commit Message

Sridhar, Kanchana P Aug. 29, 2024, 9:27 p.m. UTC
zswap_store() will now process and store mTHP and PMD-size THP folios.

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP.
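
The config option only sets the default of the new "mthp_enabled" module
parameter added by this patch; as a writable bool parameter, it can
presumably also be toggled at runtime through the usual module parameter
sysfs path, e.g.:

  echo Y > /sys/module/zswap/parameters/mthp_enabled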

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
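
In outline, the per-page store loop and its unwind (condensed directly
from the diff below; not a separate implementation) look like:

	for (index = 0; index < nr_pages; ++index) {
		if (!zswap_store_page(folio, index, objcg, pool))
			goto put_pool;
	}
	ret = true;
	...
reject:
	/*
	 * Invalidate all offsets of the folio on failure so that
	 * writeback cannot overwrite the new data in the swapfile.
	 */
	if (!ret)
		zswap_delete_stored_offsets(tree, offset, nr_pages);
	return ret;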

This forms the basis for building batching of pages during zswap store
of large folios, by compressing batches of up to say, 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).

Also, addressed some of the RFC comments from the discussion in [1].

Made a minor edit in the comments for "struct zswap_entry" to delete
the comments related to "value" since same-filled page handling has
been removed from zswap.

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/Kconfig |   8 ++
 mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 190 insertions(+), 61 deletions(-)

Comments

Yosry Ahmed Aug. 29, 2024, 11:06 p.m. UTC | #1
On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>

I think "mm: zswap: support mTHP swapout in zswap_store()" is a better
subject. We usually use imperative tone for commit logs as much as
possible.

> zswap_store() will now process and store mTHP and PMD-size THP folios.
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.

This commit log is not ordered clearly. Please start by describing
what we are doing, which is basically making zswap_store() support
large folios by compressing them page by page. Then mention important
implementation details and the tunable and config options added at the
end. After that, refer to the RFC that this is based on.

>
> Co-developed-by: Ryan Roberts
> Signed-off-by:

This is probably supposed to be "Signed-off-by: Ryan Roberts". I am
not sure what the policy is for reusing patches sent earlier on the
mailing list. Did you talk to Ryan about this?

> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

The diff is hard to follow because there is a lot of refactoring mixed
in with the functional changes. Could you please break this down into
purely refactoring patches doing the groundwork, and then the actual
functional change patch(es) on top of them?

Thanks!
Chengming Zhou Sept. 2, 2024, 11:37 a.m. UTC | #2
On 2024/8/30 05:27, Kanchana P Sridhar wrote:
> zswap_store() will now process and store mTHP and PMD-size THP folios.
> 
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
> 
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
> 
>    "[RFC,v1] mm: zswap: Store large folios without splitting"
> 
>    [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> 
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
> 
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
> 
> Each page's swap offset is stored as a separate zswap entry.
> 
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> 
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
> 
> Also, addressed some of the RFC comments from the discussion in [1].
> 
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.
> 
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

The code looks ok, but I also find this patch a little hard to
review; maybe it's better to split it into small patches as Yosry suggested.

Thanks!

[...]
> +
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
> +bool zswap_store(struct folio *folio)
> +{
> +	long nr_pages = folio_nr_pages(folio);
> +	swp_entry_t swp = folio->swap;
> +	pgoff_t offset = swp_offset(swp);
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	bool ret = false;
> +	long index;
> +
> +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +	/* Storing large folios isn't enabled */
> +	if (!zswap_mthp_enabled && folio_test_large(folio))
> +		return false;
> +
> +	if (!zswap_enabled)
> +		goto reject;
> +
>   	/*
> -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> -	 * possibly stale entry which was previously stored at this offset.
> -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> +	 * Check cgroup limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of an
> +	 * mTHP store, and not within zswap_store_page() for each page
> +	 * in the mTHP. We do however check the zswap pool limits at the
> +	 * start of zswap_store_page(). What this means is, the cgroup
> +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +	 * However, the per-store-page zswap pool limits check should
> +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> +	 * reclaim implemented in the shrinker. If this assumption holds,
> +	 * the cgroup exceeding the zswap limits could potentially be
> +	 * resolved before the next zswap_store, and if it is not, the next
> +	 * zswap_store would fail the cgroup zswap limit check at the start.
>   	 */
> -	entry = xa_erase(tree, offset);
> -	if (entry)
> -		zswap_entry_free(entry);
> -	return false;
> +	objcg = get_obj_cgroup_from_folio(folio);
> +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (shrink_memcg(memcg)) {
> +			mem_cgroup_put(memcg);
> +			goto put_objcg;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	if (zswap_check_limits())
> +		goto put_objcg;
> +
> +	pool = zswap_pool_current_get();
> +	if (!pool)
> +		goto put_objcg;
> +
> +	if (objcg) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +			mem_cgroup_put(memcg);
> +			goto put_pool;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Store each page of the folio as a separate entry. If we fail to store
> +	 * a page, unwind by removing all the previous pages we stored.
> +	 */
> +	for (index = 0; index < nr_pages; ++index) {
> +		if (!zswap_store_page(folio, index, objcg, pool))
> +			goto put_pool;
> +	}
> +
> +	ret = true;
> +
> +put_pool:
> +	zswap_pool_put(pool);
> +put_objcg:
> +	obj_cgroup_put(objcg);
> +	if (zswap_pool_reached_full)
> +		queue_work(shrink_wq, &zswap_shrink_work);
> +reject:
> +	/*
> +	 * If the zswap store fails or zswap is disabled, we must invalidate
> +	 * the possibly stale entries which were previously stored at the
> +	 * offsets corresponding to each page of the folio. Otherwise,
> +	 * writeback could overwrite the new data in the swapfile.
> +	 */
> +	if (!ret)
> +		zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +	return ret;
>   }
>   
>   bool zswap_load(struct folio *folio)
Barry Song Sept. 16, 2024, 5:55 a.m. UTC | #3
On Fri, Aug 30, 2024 at 5:27 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now process and store mTHP and PMD-size THP folios.
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).

Hi Kanchana,
I'm not opposed to this patch, but I don't understand how iterating
through each page within an mTHP supports the use of Intel IAA,
as it involves compressing pages individually.

In the document 'by_n compression and decompression with Intel IAA' by
Andre Glover
(https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com),
it appears that zsmalloc/zram needs to support multi-page compression
and decompression to fully leverage the hardware's capabilities. Could
you clarify how this approach fits in?

In patch 2/3 of that series, you have:
"Add the 'by_n' attribute to the acomp_req. The 'by_n' attribute can be
used as a directive by acomp crypto algorithms for splitting compress and
decompress operations into "n" separate jobs."

How can you apply 'by_n' to a single page rather than to a large folio?

>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++
>  mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 190 insertions(+), 61 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b23913d4e47e..68c7b01120bd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 449914ea9919..3abf9810f0b7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -190,7 +198,6 @@ static struct shrinker *zswap_shrinker;
>   *              section for context.
>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
> - * value - value of the same-value filled pages which have same content
>   * objcg - the obj_cgroup that the compressed memory is charged to
>   * lru - handle to the pool's lru used to evict pages.
>   */
> @@ -876,7 +883,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         return 0;
>  }
>
> -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> @@ -894,7 +901,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>
>         dst = acomp_ctx->buffer;
>         sg_init_table(&input, 1);
> -       sg_set_folio(&input, folio, PAGE_SIZE, 0);
> +       sg_set_page(&input, page, PAGE_SIZE, 0);
>
>         /*
>          * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> @@ -1404,35 +1411,82 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Returns true if the entry was successfully
> + * stored in the xarray, and false otherwise.
> + */
> +static bool zswap_store_entry(struct xarray *tree,
> +                             struct zswap_entry *entry)
>  {
> -       swp_entry_t swp = folio->swap;
> -       pgoff_t offset = swp_offset(swp);
> -       struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry, *old;
> -       struct obj_cgroup *objcg = NULL;
> -       struct mem_cgroup *memcg = NULL;
> +       struct zswap_entry *old;
> +       pgoff_t offset = swp_offset(entry->swpentry);
>
> -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +       old = xa_store(tree, offset, entry, GFP_KERNEL);
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       if (xa_is_err(old)) {
> +               int err = xa_err(old);
> +
> +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +               zswap_reject_alloc_fail++;
>                 return false;
> +       }
>
> -       if (!zswap_enabled)
> -               goto check_old;
> +       /*
> +        * We may have had an existing entry that became stale when
> +        * the folio was redirtied and now the new version is being
> +        * swapped out. Get rid of the old.
> +        */
> +       if (old)
> +               zswap_entry_free(old);
>
> -       /* Check cgroup limits */
> -       objcg = get_obj_cgroup_from_folio(folio);
> -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (shrink_memcg(memcg)) {
> -                       mem_cgroup_put(memcg);
> -                       goto reject;
> -               }
> -               mem_cgroup_put(memcg);
> +       return true;
> +}
> +
> +/*
> + * If the zswap store fails or zswap is disabled, we must invalidate the
> + * possibly stale entries which were previously stored at the offsets
> + * corresponding to each page of the folio. Otherwise, writeback could
> + * overwrite the new data in the swapfile.
> + *
> + * This is called after the store of the i-th offset in a large folio has
> + * failed. All zswap entries in the folio must be deleted. This helps make
> + * sure that a swapped-out mTHP is either entirely stored in zswap, or
> + * entirely not stored in zswap.
> + *
> + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> + * All offsets for the folio are deleted from zswap in this case.
> + */
> +static void zswap_delete_stored_offsets(struct xarray *tree,
> +                                       pgoff_t offset,
> +                                       long nr_pages)
> +{
> +       struct zswap_entry *entry;
> +       long i;
> +
> +       for (i = 0; i < nr_pages; ++i) {
> +               entry = xa_erase(tree, offset + i);
> +               if (entry)
> +                       zswap_entry_free(entry);
>         }
> +}
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)
> +{
> +       swp_entry_t swp = folio->swap;
> +       int type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp) + index;
> +       struct page *page = folio_page(folio, index);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct zswap_entry *entry;
> +
> +       if (objcg)
> +               obj_cgroup_get(objcg);
>
>         if (zswap_check_limits())
>                 goto reject;
> @@ -1445,42 +1499,20 @@ bool zswap_store(struct folio *folio)
>         }
>
>         /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> +       if (!zswap_pool_get(pool))
>                 goto freepage;
>
> -       if (objcg) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> -                       mem_cgroup_put(memcg);
> -                       goto put_pool;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> +       entry->pool = pool;
>
> -       if (!zswap_compress(folio, entry))
> +       if (!zswap_compress(page, entry))
>                 goto put_pool;
>
> -       entry->swpentry = swp;
> +       entry->swpentry = swp_entry(type, offset);
>         entry->objcg = objcg;
>         entry->referenced = true;
>
> -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> -
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> +       if (!zswap_store_entry(tree, entry))
>                 goto store_failed;
> -       }
> -
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
>
>         if (objcg) {
>                 obj_cgroup_charge_zswap(objcg, entry->length);
> @@ -1511,23 +1543,112 @@ bool zswap_store(struct folio *folio)
>  store_failed:
>         zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> +       zswap_pool_put(pool);
>  freepage:
>         zswap_entry_cache_free(entry);
>  reject:
>         obj_cgroup_put(objcg);
>         if (zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +
> +       return false;
> +}
> +
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
> +bool zswap_store(struct folio *folio)
> +{
> +       long nr_pages = folio_nr_pages(folio);
> +       swp_entry_t swp = folio->swap;
> +       pgoff_t offset = swp_offset(swp);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct obj_cgroup *objcg = NULL;
> +       struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
> +
> +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +       /* Storing large folios isn't enabled */
> +       if (!zswap_mthp_enabled && folio_test_large(folio))
> +               return false;
> +
> +       if (!zswap_enabled)
> +               goto reject;
> +
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * Check cgroup limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of an
> +        * mTHP store, and not within zswap_store_page() for each page
> +        * in the mTHP. We do however check the zswap pool limits at the
> +        * start of zswap_store_page(). What this means is, the cgroup
> +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +        * However, the per-store-page zswap pool limits check should
> +        * hopefully trigger the cgroup aware and zswap LRU aware global
> +        * reclaim implemented in the shrinker. If this assumption holds,
> +        * the cgroup exceeding the zswap limits could potentially be
> +        * resolved before the next zswap_store, and if it is not, the next
> +        * zswap_store would fail the cgroup zswap limit check at the start.
>          */
> -       entry = xa_erase(tree, offset);
> -       if (entry)
> -               zswap_entry_free(entry);
> -       return false;
> +       objcg = get_obj_cgroup_from_folio(folio);
> +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (shrink_memcg(memcg)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_objcg;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       if (zswap_check_limits())
> +               goto put_objcg;
> +
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
> +
> +       if (objcg) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_pool;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Store each page of the folio as a separate entry. If we fail to store
> +        * a page, unwind by removing all the previous pages we stored.
> +        */
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool))
> +                       goto put_pool;
> +       }
> +
> +       ret = true;
> +
> +put_pool:
> +       zswap_pool_put(pool);
> +put_objcg:
> +       obj_cgroup_put(objcg);
> +       if (zswap_pool_reached_full)
> +               queue_work(shrink_wq, &zswap_shrink_work);
> +reject:
> +       /*
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
> +        */
> +       if (!ret)
> +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>

Thanks
Barry
Sridhar, Kanchana P Sept. 20, 2024, 1:57 a.m. UTC | #4
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, August 29, 2024 4:06 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> 
> I think "mm: zswap: support mTHP swapout in zswap_store()" is a better
> subject. We usually use imperative tone for commit logs as much as
> possible.

Sure, this is a much better subject, thanks! I will make this change in v7.

> 
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> 
> This commit log is not ordered clearly. Please start by describing
> what we are doing, which is basically making zswap_store() support
> large folios by compressing them page by page. Then mention important
> implementation details and the tunable and config options added at the
> end. After that, refer to the RFC that this is based on.

Thanks for these comments. Sure, I will incorporate in v7.

> 
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> 
> This is probably supposed to be "Signed-off-by: Ryan Roberts". I am
> not sure what the policy is for reusing patches sent earlier on the
> mailing list. Did you talk to Ryan about this?

You're right, this is intended to be "Signed-off-by: Ryan Roberts" once
Ryan has had a chance to review and indicate approval of attribution
as co-author.

I have been following the documentation guidelines for submitting
patches, as pertaining to co-development. Ryan is in the recipients list
and I am hoping he can indicate his approval for the reuse of his original
RFC.

Ryan, I would greatly appreciate your inputs on the reuse of your RFC,
and also any code review comments for improving the patchset!

> 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> The diff is hard to follow because there is a lot of refactoring mixed
> in with the functional changes. Could you please break this down into
> purely refactoring patches doing the groundwork, and then the actual
> functional change patch(es) on top of them?

Sure, I will do this and submit a v7. Appreciate your comments!

Thanks,
Kanchana

> 
> Thanks!
Sridhar, Kanchana P Sept. 20, 2024, 2:43 a.m. UTC | #5
Hi Chengming,

> -----Original Message-----
> From: Chengming Zhou <chengming.zhou@linux.dev>
> Sent: Monday, September 2, 2024 4:38 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org
> Cc: Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On 2024/8/30 05:27, Kanchana P Sridhar wrote:
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >    "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >    [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> The code looks ok, but I also find this patch is a little hard to
> review, maybe it's better to split it into small patches as Yosry suggested.

Definitely, will do so and submit a v7.

Thanks,
Kanchana

> 
> Thanks!
> 
> [...]
> > +
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> > +bool zswap_store(struct folio *folio)
> > +{
> > +	long nr_pages = folio_nr_pages(folio);
> > +	swp_entry_t swp = folio->swap;
> > +	pgoff_t offset = swp_offset(swp);
> > +	struct xarray *tree = swap_zswap_tree(swp);
> > +	struct obj_cgroup *objcg = NULL;
> > +	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> > +
> > +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +	/* Storing large folios isn't enabled */
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> > +		return false;
> > +
> > +	if (!zswap_enabled)
> > +		goto reject;
> > +
> >   	/*
> > -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> > -	 * possibly stale entry which was previously stored at this offset.
> > -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> > +	 * start of zswap_store_page(). What this means is, the cgroup
> > +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +	 * However, the per-store-page zswap pool limits check should
> > +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> > +	 * reclaim implemented in the shrinker. If this assumption holds,
> > +	 * the cgroup exceeding the zswap limits could potentially be
> > +	 * resolved before the next zswap_store, and if it is not, the next
> > +	 * zswap_store would fail the cgroup zswap limit check at the start.
> >   	 */
> > -	entry = xa_erase(tree, offset);
> > -	if (entry)
> > -		zswap_entry_free(entry);
> > -	return false;
> > +	objcg = get_obj_cgroup_from_folio(folio);
> > +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +		memcg = get_mem_cgroup_from_objcg(objcg);
> > +		if (shrink_memcg(memcg)) {
> > +			mem_cgroup_put(memcg);
> > +			goto put_objcg;
> > +		}
> > +		mem_cgroup_put(memcg);
> > +	}
> > +
> > +	if (zswap_check_limits())
> > +		goto put_objcg;
> > +
> > +	pool = zswap_pool_current_get();
> > +	if (!pool)
> > +		goto put_objcg;
> > +
> > +	if (objcg) {
> > +		memcg = get_mem_cgroup_from_objcg(objcg);
> > +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru,
> GFP_KERNEL)) {
> > +			mem_cgroup_put(memcg);
> > +			goto put_pool;
> > +		}
> > +		mem_cgroup_put(memcg);
> > +	}
> > +
> > +	/*
> > +	 * Store each page of the folio as a separate entry. If we fail to store
> > +	 * a page, unwind by removing all the previous pages we stored.
> > +	 */
> > +	for (index = 0; index < nr_pages; ++index) {
> > +		if (!zswap_store_page(folio, index, objcg, pool))
> > +			goto put_pool;
> > +	}
> > +
> > +	ret = true;
> > +
> > +put_pool:
> > +	zswap_pool_put(pool);
> > +put_objcg:
> > +	obj_cgroup_put(objcg);
> > +	if (zswap_pool_reached_full)
> > +		queue_work(shrink_wq, &zswap_shrink_work);
> > +reject:
> > +	/*
> > +	 * If the zswap store fails or zswap is disabled, we must invalidate
> > +	 * the possibly stale entries which were previously stored at the
> > +	 * offsets corresponding to each page of the folio. Otherwise,
> > +	 * writeback could overwrite the new data in the swapfile.
> > +	 */
> > +	if (!ret)
> > +		zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +	return ret;
> >   }
> >
> >   bool zswap_load(struct folio *folio)
Sridhar, Kanchana P Sept. 20, 2024, 8:53 p.m. UTC | #6
Hi Barry,

> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Sunday, September 15, 2024 10:55 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On Fri, Aug 30, 2024 at 5:27 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> 
> Hi Kanchana,
> I'm not opposed to this patch, but I don't understand how iterating
> through each page within an mTHP supports the use of Intel IAA,
> as it involves compressing pages individually.

Thanks for your insightful comments and questions!

With Intel IAA, we have the opportunity to make use of compression
and decompression engines in hardware to do parallel compressions during
swapout and parallel decompressions during swapin with readahead
(and eventually mTHP swapin of larger compressed buffers when this is
ready). If compressions can be parallelized, we can improve reclaim
performance. If decompressions can be parallelized, we can improve
do_swap_page() performance.

We have implemented compress batching within mTHP folios during
zswap store, as well as compress batching of any-order folios during
shrink_folio_list() -- swap_writepage() using a plug mechanism, similar
to the existing swap_write_unplug() implementation. Initially, our
solution works at the granularity of compressing PAGE_SIZE pages within
(many) folios in parallel, to maximize throughput with IAA and minimize
latency per folio store.

With IAA, we are able to submit a batch of compress/decompress jobs
and poll for their completion asynchronously (RFCs yet to be submitted).
This brings the benefit of parallel compression/decompression in hardware
without waiting for the jobs to complete synchronously. With zswap_store
batching within an mTHP folio, the "batch" comprises up to, say, 8 pages
in the mTHP. As mentioned above, we have extended this to construct batches
of any-order (m)THP folios during reclaim that can be processed by zswap_store
compress batching.
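
(As a purely illustrative aside, and not the actual RFC code: with the
existing crypto_acomp async interface, "submit a batch, then poll"
roughly amounts to the sketch below. The array and loop variable names
are hypothetical; only the crypto_*/acomp_* calls are existing kernel
APIs.)

	/*
	 * Hypothetical sketch: kick off "nr" page compressions before
	 * waiting on any of them, so that an async engine such as IAA can
	 * work on the whole batch in parallel. Request/buffer setup and
	 * error unwinding are omitted for brevity.
	 */
	for (i = 0; i < nr; i++) {
		crypto_init_wait(&waits[i]);
		acomp_request_set_callback(reqs[i], CRYPTO_TFM_REQ_MAY_BACKLOG,
					   crypto_req_done, &waits[i]);
		acomp_request_set_params(reqs[i], &inputs[i], &outputs[i],
					 PAGE_SIZE, PAGE_SIZE * 2);
		/* may return -EINPROGRESS/-EBUSY when the job is queued */
		errs[i] = crypto_acomp_compress(reqs[i]);
	}

	/* Now wait for the entire batch to complete. */
	for (i = 0; i < nr; i++)
		errs[i] = crypto_wait_req(errs[i], &waits[i]);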

We have also implemented decompress batching of 4K folios to improve
do_swap_page() performance. Using swapin_readahead(), we can prefetch a
batch of 4K folios in the kernel today; decompress batching then does the
zswap_load of this batch using parallel decompressions in IAA.

To utilize the IAA compress/decompress engines, we have developed the
respective batching interfaces from shrink_folio_list() and from
swapin_readahead(). Our experiments in multi-instance, highly contended
server scenarios under memory pressure have demonstrated significant
kernel swapout/swapin latency improvements, workload-level performance
improvements, and overall system-level memory savings as compared to
software compressors.

Needless to say, batching only improves performance in configurations
with Intel IAA, and it should not impact software compressors. 

> 
> In the document 'by_n compression and decompression with Intel IAA' by
> Andre Glover
> (https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.co
> m),
> it appears
> that zsmalloc/zram needs to support multi-page compression and
> decompression to fully
> leverage the hardware's capabilities. Could you clarify how this
> approach fits in?

We are also staying tuned in to the mTHP swapin progress being made by
yourself, Chuanhua and others. Our goal is to eventually be able to
swapout/swapin an mTHP as a single entity. In this case also, IAA byN
can compress/decompress a tunable number of chunks of an mTHP in
parallel [1].

As in earlier discussions, the IAA byN approach is dependent on the mTHP
swapin patchset [2] and associated zsmalloc/zram updates for storing larger
compressed buffers from ZRAM [3]. However, this will only address ZRAM.
Imho, this could be a more involved effort for ZSWAP, which would need the
mTHP swapin to be more generally applicable.

In the meantime, the IAA batching approach provides us with a way to work
with the existing kernel support for mTHP swapout/swapin as it pertains to
zswap.

[1] https://patchwork.kernel.org/project/linux-mm/cover/cover.1714581792.git.andre.glover@linux.intel.com/
[2] https://patchwork.kernel.org/project/linux-mm/cover/20240908232119.2157-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

> 
> In patch2/3 of that series, you have:
> "Add the 'by_n' attribute to the acomp_req. The 'by_n' attribute can be
> used a directive by acomp crypto algorithms for splitting compress and
> decompress operations into "n" separate jobs."
> 
> How can you apply 'by_n' to a single page rather than to a large folio?

In [1], Andre had introduced IAA byN as a new 'canned-by_n' algorithm.
In theory, it should be possible to apply this to input buffers of any size. However,
most of our testing and data posted in [1] was focused on using 64k mTHP
swapout/swapin with zram and your initial patchsets for [2-a] and [3].

[1] https://patchwork.kernel.org/project/linux-mm/cover/cover.1714581792.git.andre.glover@linux.intel.com/
[2-a] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

Thanks,
Kanchana

> 
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++
> >  mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------
> ------
> >  2 files changed, 190 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index b23913d4e47e..68c7b01120bd 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 449914ea9919..3abf9810f0b7 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled =
> IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -190,7 +198,6 @@ static struct shrinker *zswap_shrinker;
> >   *              section for context.
> >   * pool - the zswap_pool the entry's data is in
> >   * handle - zpool allocation handle that stores the compressed page data
> > - * value - value of the same-value filled pages which have same content
> >   * objcg - the obj_cgroup that the compressed memory is charged to
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> > @@ -876,7 +883,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu,
> struct hlist_node *node)
> >         return 0;
> >  }
> >
> > -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> > +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > @@ -894,7 +901,7 @@ static bool zswap_compress(struct folio *folio,
> struct zswap_entry *entry)
> >
> >         dst = acomp_ctx->buffer;
> >         sg_init_table(&input, 1);
> > -       sg_set_folio(&input, folio, PAGE_SIZE, 0);
> > +       sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> >         /*
> >          * We need PAGE_SIZE * 2 here since there maybe over-compression
> case,
> > @@ -1404,35 +1411,82 @@ static void shrink_worker(struct work_struct
> *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(struct xarray *tree,
> > +                             struct zswap_entry *entry)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_entry *old;
> > +       pgoff_t offset = swp_offset(entry->swpentry);
> >
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +       old = xa_store(tree, offset, entry, GFP_KERNEL);
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       if (xa_is_err(old)) {
> > +               int err = xa_err(old);
> > +
> > +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > +               zswap_reject_alloc_fail++;
> >                 return false;
> > +       }
> >
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > +       /*
> > +        * We may have had an existing entry that became stale when
> > +        * the folio was redirtied and now the new version is being
> > +        * swapped out. Get rid of the old.
> > +        */
> > +       if (old)
> > +               zswap_entry_free(old);
> >
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > +       return true;
> > +}
> > +
> > +/*
> > + * If the zswap store fails or zswap is disabled, we must invalidate the
> > + * possibly stale entries which were previously stored at the offsets
> > + * corresponding to each page of the folio. Otherwise, writeback could
> > + * overwrite the new data in the swapfile.
> > + *
> > + * This is called after the store of the i-th offset in a large folio has
> > + * failed. All zswap entries in the folio must be deleted. This helps make
> > + * sure that a swapped-out mTHP is either entirely stored in zswap, or
> > + * entirely not stored in zswap.
> > + *
> > + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> > + * All offsets for the folio are deleted from zswap in this case.
> > + */
> > +static void zswap_delete_stored_offsets(struct xarray *tree,
> > +                                       pgoff_t offset,
> > +                                       long nr_pages)
> > +{
> > +       struct zswap_entry *entry;
> > +       long i;
> > +
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               entry = xa_erase(tree, offset + i);
> > +               if (entry)
> > +                       zswap_entry_free(entry);
> >         }
> > +}
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + */
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> > +{
> > +       swp_entry_t swp = folio->swap;
> > +       int type = swp_type(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> > +       struct page *page = folio_page(folio, index);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct zswap_entry *entry;
> > +
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> >
> >         if (zswap_check_limits())
> >                 goto reject;
> > @@ -1445,42 +1499,20 @@ bool zswap_store(struct folio *folio)
> >         }
> >
> >         /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > +       if (!zswap_pool_get(pool))
> >                 goto freepage;
> >
> > -       if (objcg) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto put_pool;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > +       entry->pool = pool;
> >
> > -       if (!zswap_compress(folio, entry))
> > +       if (!zswap_compress(page, entry))
> >                 goto put_pool;
> >
> > -       entry->swpentry = swp;
> > +       entry->swpentry = swp_entry(type, offset);
> >         entry->objcg = objcg;
> >         entry->referenced = true;
> >
> > -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> > -       if (xa_is_err(old)) {
> > -               int err = xa_err(old);
> > -
> > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > -               zswap_reject_alloc_fail++;
> > +       if (!zswap_store_entry(tree, entry))
> >                 goto store_failed;
> > -       }
> > -
> > -       /*
> > -        * We may have had an existing entry that became stale when
> > -        * the folio was redirtied and now the new version is being
> > -        * swapped out. Get rid of the old.
> > -        */
> > -       if (old)
> > -               zswap_entry_free(old);
> >
> >         if (objcg) {
> >                 obj_cgroup_charge_zswap(objcg, entry->length);
> > @@ -1511,23 +1543,112 @@ bool zswap_store(struct folio *folio)
> >  store_failed:
> >         zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -       zswap_pool_put(entry->pool);
> > +       zswap_pool_put(pool);
> >  freepage:
> >         zswap_entry_cache_free(entry);
> >  reject:
> >         obj_cgroup_put(objcg);
> >         if (zswap_pool_reached_full)
> >                 queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +
> > +       return false;
> > +}
> > +
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> > +bool zswap_store(struct folio *folio)
> > +{
> > +       long nr_pages = folio_nr_pages(folio);
> > +       swp_entry_t swp = folio->swap;
> > +       pgoff_t offset = swp_offset(swp);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> > +
> > +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +       /* Storing large folios isn't enabled */
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> > +               return false;
> > +
> > +       if (!zswap_enabled)
> > +               goto reject;
> > +
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> >          */
> > -       entry = xa_erase(tree, offset);
> > -       if (entry)
> > -               zswap_entry_free(entry);
> > -       return false;
> > +       objcg = get_obj_cgroup_from_folio(folio);
> > +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (shrink_memcg(memcg)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_objcg;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       if (zswap_check_limits())
> > +               goto put_objcg;
> > +
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> > +
> > +       if (objcg) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_pool;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Store each page of the folio as a separate entry. If we fail to store
> > +        * a page, unwind by removing all the previous pages we stored.
> > +        */
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio, index, objcg, pool))
> > +                       goto put_pool;
> > +       }
> > +
> > +       ret = true;
> > +
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> > +       obj_cgroup_put(objcg);
> > +       if (zswap_pool_reached_full)
> > +               queue_work(shrink_wq, &zswap_shrink_work);
> > +reject:
> > +       /*
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> > +        */
> > +       if (!ret)
> > +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
> 
> Thanks
> Barry
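
To see the store/unwind pattern from the quoted hunk above in isolation: each page of the folio is stored as its own entry, and any failure invalidates everything stored so far, so the folio ends up either entirely in zswap or entirely absent. The sketch below is a minimal, self-contained user-space model of only that control flow; store_one_page(), delete_stored_offsets() and the stored[] array are illustrative stand-ins, not the kernel functions or data structures in the patch.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define FOLIO_PAGES 8			/* model an order-3 mTHP (8 subpages) */

static void *stored[FOLIO_PAGES];	/* stand-in for the per-offset xarray slots */

/* Pretend to compress and store one page; fail at "fail_at" to exercise the unwind. */
static bool store_one_page(long index, long fail_at)
{
	if (index == fail_at)
		return false;
	stored[index] = malloc(16);	/* stand-in for an allocated zswap_entry */
	return stored[index] != NULL;
}

/* Drop every entry recorded for the folio, mirroring the unwind on the reject path. */
static void delete_stored_offsets(long nr_pages)
{
	for (long i = 0; i < nr_pages; i++) {
		free(stored[i]);
		stored[i] = NULL;
	}
}

/* Either every page of the folio ends up stored, or none of them do. */
static bool store_folio(long nr_pages, long fail_at)
{
	bool ret = false;
	long index;

	for (index = 0; index < nr_pages; index++) {
		if (!store_one_page(index, fail_at))
			goto out;
	}
	ret = true;
out:
	if (!ret)
		delete_stored_offsets(nr_pages);
	return ret;
}

int main(void)
{
	printf("failure injected at page 5: %s\n",
	       store_folio(FOLIO_PAGES, 5) ? "stored" : "rejected and unwound");
	printf("no failure injected:        %s\n",
	       store_folio(FOLIO_PAGES, -1) ? "stored" : "rejected and unwound");
	delete_stored_offsets(FOLIO_PAGES);	/* clean up the successful store */
	return 0;
}

In the first call, pages 0-4 are stored, the injected failure at page 5 triggers the unwind, and the store reports failure with nothing left behind, which is the same observable outcome the reject path in the patch guarantees.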

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index b23913d4e47e..68c7b01120bd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,6 +59,14 @@  config ZSWAP_SHRINKER_DEFAULT_ON
 	  reducing the chance that cold pages will reside in the zswap pool
 	  and consume memory indefinitely.
 
+config ZSWAP_STORE_THP_DEFAULT_ON
+	bool "Store mTHP and THP folios in zswap"
+	depends on ZSWAP
+	default n
+	help
+	  If selected, zswap will process mTHP and THP folios by
+	  compressing and storing each 4K page in the large folio.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 449914ea9919..3abf9810f0b7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,14 @@  static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable zswap processing of mTHP folios.
+ * For now, only zswap_store will process mTHP folios.
+ */
+static bool zswap_mthp_enabled = IS_ENABLED(
+		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
+module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -190,7 +198,6 @@  static struct shrinker *zswap_shrinker;
  *              section for context.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
  */
@@ -876,7 +883,7 @@  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -894,7 +901,7 @@  static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1404,35 +1411,82 @@  static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
 
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
 		return false;
+	}
 
-	if (!zswap_enabled)
-		goto check_old;
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
 
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
+	return true;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate the
+ * possibly stale entries which were previously stored at the offsets
+ * corresponding to each page of the folio. Otherwise, writeback could
+ * overwrite the new data in the swapfile.
+ *
+ * This is called after the store of the i-th offset in a large folio has
+ * failed. All zswap entries in the folio must be deleted. This helps make
+ * sure that a swapped-out mTHP is either entirely stored in zswap, or
+ * entirely not stored in zswap.
+ *
+ * This is also called if zswap_store() is invoked, but zswap is not enabled.
+ * All offsets for the folio are deleted from zswap in this case.
+ */
+static void zswap_delete_stored_offsets(struct xarray *tree,
+					pgoff_t offset,
+					long nr_pages)
+{
+	struct zswap_entry *entry;
+	long i;
+
+	for (i = 0; i < nr_pages; ++i) {
+		entry = xa_erase(tree, offset + i);
+		if (entry)
+			zswap_entry_free(entry);
 	}
+}
+
+/*
+ * Stores the page at the specified "index" in the folio.
+ */
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
 
 	if (zswap_check_limits())
 		goto reject;
@@ -1445,42 +1499,20 @@  bool zswap_store(struct folio *folio)
 	}
 
 	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
+	if (!zswap_pool_get(pool))
 		goto freepage;
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	entry->pool = pool;
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(page, entry))
 		goto put_pool;
 
-	entry->swpentry = swp;
+	entry->swpentry = swp_entry(type, offset);
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
@@ -1511,23 +1543,112 @@  bool zswap_store(struct folio *folio)
 store_failed:
 	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
+	zswap_pool_put(pool);
 freepage:
 	zswap_entry_cache_free(entry);
 reject:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+
+	return false;
+}
+
+/*
+ * Store mTHP folios. Each page in the folio is compressed and stored
+ * sequentially as a separate zswap entry.
+ */
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	pgoff_t offset = swp_offset(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	/* Storing large folios isn't enabled */
+	if (!zswap_mthp_enabled && folio_test_large(folio))
+		return false;
+
+	if (!zswap_enabled)
+		goto reject;
+
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit is checked once, at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. The zswap pool limits, however, are checked at the
+	 * start of each zswap_store_page(). As a result, the cgroup can
+	 * exceed its zswap limit by at most (HPAGE_PMD_NR - 1) pages.
+	 * The per-page zswap pool limits check should nevertheless
+	 * trigger the cgroup-aware and zswap-LRU-aware global reclaim
+	 * implemented in the shrinker. If it does, the cgroup's excess
+	 * over its zswap limit may be resolved before the next
+	 * zswap_store; if it is not, the next zswap_store will fail the
+	 * cgroup zswap limit check at the start.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
+	}
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+reject:
+	/*
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
+	 */
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)