[v3,09/13] mm, swap: reduce contention on device lock

Message ID	20241230174621.61185-10-ryncsn@gmail.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kairui Song <ryncsn@gmail.com> To: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org>, Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>, Ryan Roberts <ryan.roberts@arm.com>, Hugh Dickins <hughd@google.com>, Yosry Ahmed <yosryahmed@google.com>, "Huang, Ying" <ying.huang@linux.alibaba.com>, Nhat Pham <nphamcs@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Kalesh Singh <kaleshsingh@google.com>, linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com> Subject: [PATCH v3 09/13] mm, swap: reduce contention on device lock Date: Tue, 31 Dec 2024 01:46:17 +0800 Message-ID: <20241230174621.61185-10-ryncsn@gmail.com> In-Reply-To: <20241230174621.61185-1-ryncsn@gmail.com> References: <20241230174621.61185-1-ryncsn@gmail.com> Reply-To: Kairui Song <kasong@tencent.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm, swap: rework of swap allocator locks \| expand [v3,00/13] mm, swap: rework of swap allocator locks [v3,01/13] mm, swap: minor clean up for swap entry allocation [v3,02/13] mm, swap: fold swap_info_get_cont in the only caller [v3,03/13] mm, swap: remove old allocation path for HDD [v3,04/13] mm, swap: use cluster lock for HDD [v3,05/13] mm, swap: clean up device availability check [v3,06/13] mm, swap: clean up plist removal and adding [v3,07/13] mm, swap: hold a reference during scan and cleanup flag usage [v3,08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes [v3,09/13] mm, swap: reduce contention on device lock [v3,10/13] mm, swap: simplify percpu cluster updating [v3,11/13] mm, swap: introduce a helper for retrieving cluster from offset [v3,12/13] mm, swap: use a global swap cluster for non-rotation devices [v3,13/13] mm, swap_slots: remove slot cache for freeing path

Kairui Song Dec. 30, 2024, 5:46 p.m. UTC

From: Kairui Song <kasong@tencent.com>

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock).

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

We have cleaned up other hard dependencies on si->lock. Following the
new cluster allocator design, most operations don't need to touch
si->lock at all. In practice, we only need to take si->lock when
moving clusters between lists.

To achieve this, this commit reworks the locking pattern of all
si->lock and ci->lock users, eliminates all usage of ci->lock inside
si->lock, and introduces a new design to avoid touching si->lock
unless needed.

For minimal contention and easier understanding of the system, two
ideas are introduced with the corresponding helpers: isolation and
relocation.

- Clusters will be `isolated` from the list when iterating the list
  to search for an allocatable cluster.

  This ensures other CPUs won't walk into the same cluster easily,
  and it releases si->lock after acquiring ci->lock, providing the
  only place that handles the inversion of two locks, and avoids
  contention.

  Iterating the cluster list almost always moves the cluster
  (free -> nonfull, nonfull -> frag, frag -> frag tail), but it
  doesn't know where the cluster should be moved to until scanning
  is done. So keeping the cluster off-list is a good option with
  low overhead.

  The off-list time window of a cluster is also minimal. In the worst
  case, one CPU will return the cluster after scanning the 512 entries
  on it, which we used to busy wait with a spin lock.

This is done with the new helper `cluster_isolate_lock`.

- Clusters will be `relocated` after allocation or freeing, according
  to their usage count and status.

  Allocations no longer hold si->lock now, and may drop ci->lock for
  reclaim, so the cluster could be moved to any location while no lock
  is held. Besides, isolation clears all flags when it takes the
  cluster off the list (the flags must be in sync with the list status,
  so cluster users don't need to touch si->lock for checking its list
  status). So the cluster has to be relocated to the right list
  according to its usage after allocation or freeing.

  Relocation is optional, if the cluster flags indicate it's already
  on the right list, it will skip touching the list or si->lock.

This is done with relocate_cluster after allocation or with
[partial_]free_cluster after freeing.

This handled usage of all kinds of clusters in a clean way.

Scanning and allocation by iterating the cluster list is handled by
"isolate - <scan / allocate> - relocate".

Scanning and allocation of per-CPU clusters will only involve
"<scan / allocate> - relocate", as it knows which cluster to lock
and use.

Freeing will only involve "relocate".

Each CPU will keep using its per-CPU cluster until the 512 entries
are all consumed. Freeing also has to free 512 entries to trigger
cluster movement in the best case, so si->lock is rarely touched.

Testing with building the Linux kernel with defconfig showed huge
improvement:

tiem make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test run)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333

After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time )
Sys time: 6265.02, Real time: 349.83

The allocation success rate also slightly improved as we sanitized the
usage of clusters with new defined helpers, previously dropping
si->lock or ci->lock during scan will cause cluster order shuffle.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   3 +-
 mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
 2 files changed, 246 insertions(+), 192 deletions(-)

Baoquan He Jan. 6, 2025, 10:12 a.m. UTC | #1

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip..
> +
> +/*
> + * Must be called after allocation, moves the cluster to full or frag list.
> + * Note: allocation doesn't acquire si lock, and may drop the ci lock for
> + * reclaim, so the cluster could be any where when called.
> + */
> +static void relocate_cluster(struct swap_info_struct *si,
> +			     struct swap_cluster_info *ci)
> +{
> +	lockdep_assert_held(&ci->lock);
> +
> +	/* Discard cluster must remain off-list or on discard list */
> +	if (cluster_is_discard(ci))
> +		return;
> +
> +	if (!ci->count) {
> +		free_cluster(si, ci);

relocate_cluster() is only called in alloc_swap_scan_cluster(), there
seems to be no chance to have 'ci->count == 0' case when allocating. Do
I miss anything here? 


> +	} else if (ci->count != SWAPFILE_CLUSTER) {
> +		if (ci->flags != CLUSTER_FLAG_FRAG)
> +			cluster_move(si, ci, &si->frag_clusters[ci->order],
> +				     CLUSTER_FLAG_FRAG);
> +	} else {
> +		if (ci->flags != CLUSTER_FLAG_FULL)
> +			cluster_move(si, ci, &si->full_clusters,
> +				     CLUSTER_FLAG_FULL);
> +	}
> +}
> +
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will not be
>   * added to free cluster list and its usage counter will be increased by 1.

Baoquan He Jan. 8, 2025, 11:09 a.m. UTC | #2

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip.....
> ---
>  include/linux/swap.h |   3 +-
>  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
>  2 files changed, 246 insertions(+), 192 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 339d7f0192ff..c4ff31cb6bde 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -291,6 +291,7 @@ enum swap_cluster_flags {
>   * throughput.
>   */
>  struct percpu_cluster {
> +	local_lock_t lock; /* Protect the percpu_cluster above */
>  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>  
> @@ -313,7 +314,7 @@ struct swap_info_struct {
>  					/* list of cluster that contains at least one free slot */
>  	struct list_head frag_clusters[SWAP_NR_ORDERS];
>  					/* list of cluster that are fragmented or contented */
> -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>  	unsigned int pages;		/* total of usable pages of swap */
>  	atomic_long_t inuse_pages;	/* number of those currently in use */
>  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7795a3d27273..dadd4fead689 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	folio_ref_sub(folio, nr_pages);
>  	folio_set_dirty(folio);
>  
> -	spin_lock(&si->lock);
>  	/* Only sinple page folio can be backed by zswap */
>  	if (nr_pages == 1)
>  		zswap_invalidate(entry);
>  	swap_entry_range_free(si, entry, nr_pages);
> -	spin_unlock(&si->lock);
>  	ret = nr_pages;
>  out_unlock:
>  	folio_unlock(folio);
> @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  
>  static inline bool cluster_is_free(struct swap_cluster_info *info)
>  {
> -	return info->flags == CLUSTER_FLAG_FREE;
> +	return info->count == 0;

This is a little confusing. Maybe we should add one and call it
cluster_is_empty(). Because discarded clusters are also be able to pass
the checking here.

> +}
> +
> +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> +{
> +	return info->flags == CLUSTER_FLAG_DISCARD;
> +}
> +
> +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> +{
> +	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> +		return false;
> +	if (!order)
> +		return true;
> +	return cluster_is_free(ci) || order == ci->order;
>  }
>  
>  static inline unsigned int cluster_index(struct swap_info_struct *si,
> @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
>  {
>  	VM_WARN_ON(ci->flags == new_flags);
>  	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> +	lockdep_assert_held(&ci->lock);
>  
> -	if (ci->flags == CLUSTER_FLAG_NONE) {
> +	spin_lock(&si->lock);
> +	if (ci->flags == CLUSTER_FLAG_NONE)
>  		list_add_tail(&ci->list, list);
> -	} else {
> -		if (ci->flags == CLUSTER_FLAG_FRAG) {
> -			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> -			si->frag_cluster_nr[ci->order]--;
> -		}
> +	else
>  		list_move_tail(&ci->list, list);
> -	}
> +	spin_unlock(&si->lock);
> +
> +	if (ci->flags == CLUSTER_FLAG_FRAG)
> +		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> +	else if (new_flags == CLUSTER_FLAG_FRAG)
> +		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
>  	ci->flags = new_flags;
> -	if (new_flags == CLUSTER_FLAG_FRAG)
> -		si->frag_cluster_nr[ci->order]++;
>  }
>  
>  /* Add a cluster to discard list and schedule it to do discard */
> @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -	lockdep_assert_held(&si->lock);
>  	lockdep_assert_held(&ci->lock);
>  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>  	ci->order = 0;
>  }
>  
> +/*
> + * Isolate and lock the first cluster that is not contented on a list,
> + * clean its flag before taken off-list. Cluster flag must be in sync
> + * with list status, so cluster updaters can always know the cluster
> + * list status without touching si lock.
> + *
> + * Note it's possible that all clusters on a list are contented so
> + * this returns NULL for an non-empty list.
> + */
> +static struct swap_cluster_info *cluster_isolate_lock(
> +		struct swap_info_struct *si, struct list_head *list)
> +{
> +	struct swap_cluster_info *ci, *ret = NULL;
> +
> +	spin_lock(&si->lock);
> +
> +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> +		goto out;
> +
> +	list_for_each_entry(ci, list, list) {
> +		if (!spin_trylock(&ci->lock))
> +			continue;
> +
> +		/* We may only isolate and clear flags of following lists */
> +		VM_BUG_ON(!ci->flags);
> +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> +			  ci->flags != CLUSTER_FLAG_FULL);
> +
> +		list_del(&ci->list);
> +		ci->flags = CLUSTER_FLAG_NONE;
> +		ret = ci;
> +		break;
> +	}
> +out:
> +	spin_unlock(&si->lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * Doing discard actually. After a cluster discard is finished, the cluster
> - * will be added to free cluster list. caller should hold si->lock.
> -*/
> -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> + * will be added to free cluster list. Discard cluster is a bit special as
> + * they don't participate in allocation or reclaim, so clusters marked as
> + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> + */
> +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>  	struct swap_cluster_info *ci;
> +	bool ret = false;
>  	unsigned int idx;
>  
> +	spin_lock(&si->lock);
>  	while (!list_empty(&si->discard_clusters)) {
>  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> +		/*
> +		 * Delete the cluster from list but don't clear its flags until
> +		 * discard is done, so isolation and relocation will skip it.
> +		 */
>  		list_del(&ci->list);

I don't understand above comment. ci has been taken off list. While
allocation need isolate from a usable list. Even though we clear
ci->flags now, how come isolation and relocation will touch it. I may
miss anything here.

> -		/* Must clear flag when taking a cluster off-list */
> -		ci->flags = CLUSTER_FLAG_NONE;
>  		idx = cluster_index(si, ci);
>  		spin_unlock(&si->lock);
> -
>  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>  				SWAPFILE_CLUSTER);
>  
> -		spin_lock(&si->lock);
>  		spin_lock(&ci->lock);
> -		__free_cluster(si, ci);
> +		/*
> +		 * Discard is done, clear its flags as it's now off-list,
> +		 * then return the cluster to allocation list.
> +		 */
> +		ci->flags = CLUSTER_FLAG_NONE;
>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  				0, SWAPFILE_CLUSTER);
> +		__free_cluster(si, ci);
>  		spin_unlock(&ci->lock);
> +		ret = true;
> +		spin_lock(&si->lock);
>  	}
> +	spin_unlock(&si->lock);
> +	return ret;
>  }
>  
>  static void swap_discard_work(struct work_struct *work)
......snip....
> @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
>  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>  					      unsigned char usage)
>  {
> -	struct percpu_cluster *cluster;
>  	struct swap_cluster_info *ci;
>  	unsigned int offset, found = 0;
>  
> -new_cluster:
> -	lockdep_assert_held(&si->lock);
> -	cluster = this_cpu_ptr(si->percpu_cluster);
> -	offset = cluster->next[order];
> +	/* Fast path using per CPU cluster */
> +	local_lock(&si->percpu_cluster->lock);
> +	offset = __this_cpu_read(si->percpu_cluster->next[order]);
>  	if (offset) {
> -		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> +		ci = lock_cluster(si, offset);
> +		/* Cluster could have been used by another order */
> +		if (cluster_is_usable(ci, order)) {
> +			if (cluster_is_free(ci))
> +				offset = cluster_offset(si, ci);
> +			offset = alloc_swap_scan_cluster(si, offset, &found,
> +							 order, usage);
> +		} else {
> +			unlock_cluster(ci);
> +		}
>  		if (found)
>  			goto done;
>  	}
>  
> -	if (!list_empty(&si->free_clusters)) {
> -		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> -		/*
> -		 * Either we didn't touch the cluster due to swapoff,
> -		 * or the allocation must success.
> -		 */
> -		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> -		goto done;
> +new_cluster:
> +	ci = cluster_isolate_lock(si, &si->free_clusters);
> +	if (ci) {
> +		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> +						 &found, order, usage);
> +		if (found)
> +			goto done;
>  	}
>  
>  	/* Try reclaim from full clusters if free clusters list is drained */
> @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		swap_reclaim_full_clusters(si, false);
>  
>  	if (order < PMD_ORDER) {
> -		unsigned int frags = 0;
> +		unsigned int frags = 0, frags_existing;
>  
> -		while (!list_empty(&si->nonfull_clusters[order])) {
> -			ci = list_first_entry(&si->nonfull_clusters[order],
> -					      struct swap_cluster_info, list);
> -			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> +		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
>  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>  							 &found, order, usage);
> -			frags++;
> +			/*
> +			 * With `fragmenting` set to true, it will surely take
                                 ~~~~~~~~~~~
                         wondering what 'fragmenting' means here.


> +			 * the cluster off nonfull list
> +			 */
>  			if (found)
>  				goto done;
> +			frags++;
>  		}
>  
> -		/*
> -		 * Nonfull clusters are moved to frag tail if we reached
> -		 * here, count them too, don't over scan the frag list.
> -		 */
> -		while (frags < si->frag_cluster_nr[order]) {
> -			ci = list_first_entry(&si->frag_clusters[order],
> -					      struct swap_cluster_info, list);
> +		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> +		while (frags < frags_existing &&
> +		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> +			atomic_long_dec(&si->frag_cluster_nr[order]);
>  			/*
> -			 * Rotate the frag list to iterate, they were all failing
> -			 * high order allocation or moved here due to per-CPU usage,
> -			 * this help keeping usable cluster ahead.
> +			 * Rotate the frag list to iterate, they were all
> +			 * failing high order allocation or moved here due to
> +			 * per-CPU usage, but they could contain newly released
> +			 * reclaimable (eg. lazy-freed swap cache) slots.
>  			 */
> -			list_move_tail(&ci->list, &si->frag_clusters[order]);
>  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>  							 &found, order, usage);
> -			frags++;
>  			if (found)
>  				goto done;
> +			frags++;
>  		}
>  	}
>  
> -	if (!list_empty(&si->discard_clusters)) {
> -		/*
> -		 * we don't have free cluster but have some clusters in
> -		 * discarding, do discard now and reclaim them, then
> -		 * reread cluster_next_cpu since we dropped si->lock
> -		 */
> -		swap_do_scheduled_discard(si);
> +	/*
> +	 * We don't have free cluster but have some clusters in
> +	 * discarding, do discard now and reclaim them, then
> +	 * reread cluster_next_cpu since we dropped si->lock
> +	 */
> +	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
>  		goto new_cluster;
> -	}
>  
>  	if (order)
>  		goto done;
.....

Kairui Song Jan. 9, 2025, 2:15 a.m. UTC | #3

On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
>

Thanks for the very detailed review!

> On 12/31/24 at 01:46am, Kairui Song wrote:
> ......snip.....
> > ---
> >  include/linux/swap.h |   3 +-
> >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> >  2 files changed, 246 insertions(+), 192 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 339d7f0192ff..c4ff31cb6bde 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> >   * throughput.
> >   */
> >  struct percpu_cluster {
> > +     local_lock_t lock; /* Protect the percpu_cluster above */
> >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >  };
> >
> > @@ -313,7 +314,7 @@ struct swap_info_struct {
> >                                       /* list of cluster that contains at least one free slot */
> >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> >                                       /* list of cluster that are fragmented or contented */
> > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> >       unsigned int pages;             /* total of usable pages of swap */
> >       atomic_long_t inuse_pages;      /* number of those currently in use */
> >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 7795a3d27273..dadd4fead689 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> >       folio_ref_sub(folio, nr_pages);
> >       folio_set_dirty(folio);
> >
> > -     spin_lock(&si->lock);
> >       /* Only sinple page folio can be backed by zswap */
> >       if (nr_pages == 1)
> >               zswap_invalidate(entry);
> >       swap_entry_range_free(si, entry, nr_pages);
> > -     spin_unlock(&si->lock);
> >       ret = nr_pages;
> >  out_unlock:
> >       folio_unlock(folio);
> > @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> > -     return info->flags == CLUSTER_FLAG_FREE;
> > +     return info->count == 0;
>
> This is a little confusing. Maybe we should add one and call it
> cluster_is_empty(). Because discarded clusters are also be able to pass
> the checking here.

Good idea, agree on this, this new name is better.

>
> > +}
> > +
> > +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> > +{
> > +     return info->flags == CLUSTER_FLAG_DISCARD;
> > +}
> > +
> > +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> > +{
> > +     if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> > +             return false;
> > +     if (!order)
> > +             return true;
> > +     return cluster_is_free(ci) || order == ci->order;
> >  }
> >
> >  static inline unsigned int cluster_index(struct swap_info_struct *si,
> > @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
> >  {
> >       VM_WARN_ON(ci->flags == new_flags);
> >       BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> > +     lockdep_assert_held(&ci->lock);
> >
> > -     if (ci->flags == CLUSTER_FLAG_NONE) {
> > +     spin_lock(&si->lock);
> > +     if (ci->flags == CLUSTER_FLAG_NONE)
> >               list_add_tail(&ci->list, list);
> > -     } else {
> > -             if (ci->flags == CLUSTER_FLAG_FRAG) {
> > -                     VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> > -                     si->frag_cluster_nr[ci->order]--;
> > -             }
> > +     else
> >               list_move_tail(&ci->list, list);
> > -     }
> > +     spin_unlock(&si->lock);
> > +
> > +     if (ci->flags == CLUSTER_FLAG_FRAG)
> > +             atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> > +     else if (new_flags == CLUSTER_FLAG_FRAG)
> > +             atomic_long_inc(&si->frag_cluster_nr[ci->order]);
> >       ci->flags = new_flags;
> > -     if (new_flags == CLUSTER_FLAG_FRAG)
> > -             si->frag_cluster_nr[ci->order]++;
> >  }
> >
> >  /* Add a cluster to discard list and schedule it to do discard */
> > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >
> >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -     lockdep_assert_held(&si->lock);
> >       lockdep_assert_held(&ci->lock);
> >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> >       ci->order = 0;
> >  }
> >
> > +/*
> > + * Isolate and lock the first cluster that is not contented on a list,
> > + * clean its flag before taken off-list. Cluster flag must be in sync
> > + * with list status, so cluster updaters can always know the cluster
> > + * list status without touching si lock.
> > + *
> > + * Note it's possible that all clusters on a list are contented so
> > + * this returns NULL for an non-empty list.
> > + */
> > +static struct swap_cluster_info *cluster_isolate_lock(
> > +             struct swap_info_struct *si, struct list_head *list)
> > +{
> > +     struct swap_cluster_info *ci, *ret = NULL;
> > +
> > +     spin_lock(&si->lock);
> > +
> > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > +             goto out;
> > +
> > +     list_for_each_entry(ci, list, list) {
> > +             if (!spin_trylock(&ci->lock))
> > +                     continue;
> > +
> > +             /* We may only isolate and clear flags of following lists */
> > +             VM_BUG_ON(!ci->flags);
> > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > +                       ci->flags != CLUSTER_FLAG_FULL);
> > +
> > +             list_del(&ci->list);
> > +             ci->flags = CLUSTER_FLAG_NONE;
> > +             ret = ci;
> > +             break;
> > +     }
> > +out:
> > +     spin_unlock(&si->lock);
> > +
> > +     return ret;
> > +}
> > +
> >  /*
> >   * Doing discard actually. After a cluster discard is finished, the cluster
> > - * will be added to free cluster list. caller should hold si->lock.
> > -*/
> > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > + * will be added to free cluster list. Discard cluster is a bit special as
> > + * they don't participate in allocation or reclaim, so clusters marked as
> > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > + */
> > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> >  {
> >       struct swap_cluster_info *ci;
> > +     bool ret = false;
> >       unsigned int idx;
> >
> > +     spin_lock(&si->lock);
> >       while (!list_empty(&si->discard_clusters)) {
> >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > +             /*
> > +              * Delete the cluster from list but don't clear its flags until
> > +              * discard is done, so isolation and relocation will skip it.
> > +              */
> >               list_del(&ci->list);
>
> I don't understand above comment. ci has been taken off list. While
> allocation need isolate from a usable list. Even though we clear
> ci->flags now, how come isolation and relocation will touch it. I may
> miss anything here.

There are many cases, one possible and common situation is that the
percpu cluster (si->percpu_cluster of another CPU) is still pointing
to it.

Also, this commit removed protection of si lock on allocation, and
allocation path may also drop ci lock to call reclaim, which means one
cluster could be used or freed by anyone before allocator reacquire
the ci lock again. In that case, the allocator could see a discard
cluster.

So we don't clear the discard flag, in case anyone misuse it.

I can add more inline comments on this, this is already some related
comments above the function relocate_cluster, could add some more
referencing that.

>
> > -             /* Must clear flag when taking a cluster off-list */
> > -             ci->flags = CLUSTER_FLAG_NONE;
> >               idx = cluster_index(si, ci);
> >               spin_unlock(&si->lock);
> > -
> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >                               SWAPFILE_CLUSTER);
> >
> > -             spin_lock(&si->lock);
> >               spin_lock(&ci->lock);
> > -             __free_cluster(si, ci);
> > +             /*
> > +              * Discard is done, clear its flags as it's now off-list,
> > +              * then return the cluster to allocation list.
> > +              */
> > +             ci->flags = CLUSTER_FLAG_NONE;
> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                               0, SWAPFILE_CLUSTER);
> > +             __free_cluster(si, ci);
> >               spin_unlock(&ci->lock);
> > +             ret = true;
> > +             spin_lock(&si->lock);
> >       }
> > +     spin_unlock(&si->lock);
> > +     return ret;
> >  }
> >
> >  static void swap_discard_work(struct work_struct *work)
> ......snip....
> > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> >                                             unsigned char usage)
> >  {
> > -     struct percpu_cluster *cluster;
> >       struct swap_cluster_info *ci;
> >       unsigned int offset, found = 0;
> >
> > -new_cluster:
> > -     lockdep_assert_held(&si->lock);
> > -     cluster = this_cpu_ptr(si->percpu_cluster);
> > -     offset = cluster->next[order];
> > +     /* Fast path using per CPU cluster */
> > +     local_lock(&si->percpu_cluster->lock);
> > +     offset = __this_cpu_read(si->percpu_cluster->next[order]);
> >       if (offset) {
> > -             offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > +             ci = lock_cluster(si, offset);
> > +             /* Cluster could have been used by another order */
> > +             if (cluster_is_usable(ci, order)) {
> > +                     if (cluster_is_free(ci))
> > +                             offset = cluster_offset(si, ci);
> > +                     offset = alloc_swap_scan_cluster(si, offset, &found,
> > +                                                      order, usage);
> > +             } else {
> > +                     unlock_cluster(ci);
> > +             }
> >               if (found)
> >                       goto done;
> >       }
> >
> > -     if (!list_empty(&si->free_clusters)) {
> > -             ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > -             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > -             /*
> > -              * Either we didn't touch the cluster due to swapoff,
> > -              * or the allocation must success.
> > -              */
> > -             VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > -             goto done;
> > +new_cluster:
> > +     ci = cluster_isolate_lock(si, &si->free_clusters);
> > +     if (ci) {
> > +             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > +                                              &found, order, usage);
> > +             if (found)
> > +                     goto done;
> >       }
> >
> >       /* Try reclaim from full clusters if free clusters list is drained */
> > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >               swap_reclaim_full_clusters(si, false);
> >
> >       if (order < PMD_ORDER) {
> > -             unsigned int frags = 0;
> > +             unsigned int frags = 0, frags_existing;
> >
> > -             while (!list_empty(&si->nonfull_clusters[order])) {
> > -                     ci = list_first_entry(&si->nonfull_clusters[order],
> > -                                           struct swap_cluster_info, list);
> > -                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > +             while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                        &found, order, usage);
> > -                     frags++;
> > +                     /*
> > +                      * With `fragmenting` set to true, it will surely take
>                                  ~~~~~~~~~~~
>                          wondering what 'fragmenting' means here.

This comment is a bit out of context indeed, it actually trying to say
the alloc_swap_scan_cluster call above should move the cluster to
tail. I'll update the comment.



>
> > +                      * the cluster off nonfull list
> > +                      */
> >                       if (found)
> >                               goto done;
> > +                     frags++;
> >               }
> >
> > -             /*
> > -              * Nonfull clusters are moved to frag tail if we reached
> > -              * here, count them too, don't over scan the frag list.
> > -              */
> > -             while (frags < si->frag_cluster_nr[order]) {
> > -                     ci = list_first_entry(&si->frag_clusters[order],
> > -                                           struct swap_cluster_info, list);
> > +             frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > +             while (frags < frags_existing &&
> > +                    (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > +                     atomic_long_dec(&si->frag_cluster_nr[order]);
> >                       /*
> > -                      * Rotate the frag list to iterate, they were all failing
> > -                      * high order allocation or moved here due to per-CPU usage,
> > -                      * this help keeping usable cluster ahead.
> > +                      * Rotate the frag list to iterate, they were all
> > +                      * failing high order allocation or moved here due to
> > +                      * per-CPU usage, but they could contain newly released
> > +                      * reclaimable (eg. lazy-freed swap cache) slots.
> >                        */
> > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                        &found, order, usage);
> > -                     frags++;
> >                       if (found)
> >                               goto done;
> > +                     frags++;
> >               }
> >       }
> >
> > -     if (!list_empty(&si->discard_clusters)) {
> > -             /*
> > -              * we don't have free cluster but have some clusters in
> > -              * discarding, do discard now and reclaim them, then
> > -              * reread cluster_next_cpu since we dropped si->lock
> > -              */
> > -             swap_do_scheduled_discard(si);
> > +     /*
> > +      * We don't have free cluster but have some clusters in
> > +      * discarding, do discard now and reclaim them, then
> > +      * reread cluster_next_cpu since we dropped si->lock
> > +      */
> > +     if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> >               goto new_cluster;
> > -     }
> >
> >       if (order)
> >               goto done;
> .....
>
>

Baoquan He Jan. 10, 2025, 11:23 a.m. UTC | #4

On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> >
> 
> Thanks for the very detailed review!
> 
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > >  include/linux/swap.h |   3 +-
> > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > >   * throughput.
> > >   */
> > >  struct percpu_cluster {
> > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > >  };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > >                                       /* list of cluster that contains at least one free slot */
> > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > >                                       /* list of cluster that are fragmented or contented */
> > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > >       unsigned int pages;             /* total of usable pages of swap */
> > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > >  {
> > > -     lockdep_assert_held(&si->lock);
> > >       lockdep_assert_held(&ci->lock);
> > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > >       ci->order = 0;
> > >  }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contented on a list,
> > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > + * with list status, so cluster updaters can always know the cluster
> > > + * list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contented so
> > > + * this returns NULL for an non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > +             struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > +     spin_lock(&si->lock);
> > > +
> > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > +             goto out;
> > > +
> > > +     list_for_each_entry(ci, list, list) {
> > > +             if (!spin_trylock(&ci->lock))
> > > +                     continue;
> > > +
> > > +             /* We may only isolate and clear flags of following lists */
> > > +             VM_BUG_ON(!ci->flags);
> > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > +             list_del(&ci->list);
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > +             ret = ci;
> > > +             break;
> > > +     }
> > > +out:
> > > +     spin_unlock(&si->lock);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  {
> > >       struct swap_cluster_info *ci;
> > > +     bool ret = false;
> > >       unsigned int idx;
> > >
> > > +     spin_lock(&si->lock);
> > >       while (!list_empty(&si->discard_clusters)) {
> > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > +             /*
> > > +              * Delete the cluster from list but don't clear its flags until
> > > +              * discard is done, so isolation and relocation will skip it.
> > > +              */
> > >               list_del(&ci->list);
> >
> > I don't understand above comment. ci has been taken off list. While
> > allocation need isolate from a usable list. Even though we clear
> > ci->flags now, how come isolation and relocation will touch it. I may
> > miss anything here.
> 
> There are many cases, one possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
> 
> Also, this commit removed protection of si lock on allocation, and
> allocation path may also drop ci lock to call reclaim, which means one
> cluster could be used or freed by anyone before allocator reacquire
> the ci lock again. In that case, the allocator could see a discard
> cluster.
> 
> So we don't clear the discard flag, in case anyone misuse it.
> 
> I can add more inline comments on this, this is already some related
> comments above the function relocate_cluster, could add some more
> referencing that.

Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and a ci could be got from non-full,
frag lists but later become discarded if that ci is freed on other cpu
during cluster_reclaim_range() invocation. I haven't got how isolation
could see a discarded ci in cluster_isolate_lock(). Could you help give
an exmaple on how that happen?

Surely, I understand keeping the discarded flag is very necessary so
that checking like cluster_is_usable() will return expected value.

And by the way, I haven't got when the ' if (!ci->count)' case could
happen in relocate_cluster() since we have filtered away discarded ci
with the 'if (cluster_is_discard(ci))' checking. I asked in another
thread, could you help explain it?

static void relocate_cluster(struct swap_info_struct *si,
                             struct swap_cluster_info *ci)
{               
        lockdep_assert_held(&ci->lock); 
                
        /* Discard cluster must remain off-list or on discard list */
        if (cluster_is_discard(ci))
                return;
                
        if (!ci->count) {
                free_cluster(si, ci);
...
}
> 
> >
> > > -             /* Must clear flag when taking a cluster off-list */
> > > -             ci->flags = CLUSTER_FLAG_NONE;
> > >               idx = cluster_index(si, ci);
> > >               spin_unlock(&si->lock);
> > > -
> > >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > >                               SWAPFILE_CLUSTER);
> > >
> > > -             spin_lock(&si->lock);
> > >               spin_lock(&ci->lock);
> > > -             __free_cluster(si, ci);
> > > +             /*
> > > +              * Discard is done, clear its flags as it's now off-list,
> > > +              * then return the cluster to allocation list.
> > > +              */
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >                               0, SWAPFILE_CLUSTER);
> > > +             __free_cluster(si, ci);
> > >               spin_unlock(&ci->lock);
> > > +             ret = true;
> > > +             spin_lock(&si->lock);
> > >       }
> > > +     spin_unlock(&si->lock);
> > > +     return ret;
> > >  }
> > >
> > >  static void swap_discard_work(struct work_struct *work)
> > ......snip....
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > >                                             unsigned char usage)
> > >  {
> > > -     struct percpu_cluster *cluster;
> > >       struct swap_cluster_info *ci;
> > >       unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > -     lockdep_assert_held(&si->lock);
> > > -     cluster = this_cpu_ptr(si->percpu_cluster);
> > > -     offset = cluster->next[order];
> > > +     /* Fast path using per CPU cluster */
> > > +     local_lock(&si->percpu_cluster->lock);
> > > +     offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > >       if (offset) {
> > > -             offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > +             ci = lock_cluster(si, offset);
> > > +             /* Cluster could have been used by another order */
> > > +             if (cluster_is_usable(ci, order)) {
> > > +                     if (cluster_is_free(ci))
> > > +                             offset = cluster_offset(si, ci);
> > > +                     offset = alloc_swap_scan_cluster(si, offset, &found,
> > > +                                                      order, usage);
> > > +             } else {
> > > +                     unlock_cluster(ci);
> > > +             }
> > >               if (found)
> > >                       goto done;
> > >       }
> > >
> > > -     if (!list_empty(&si->free_clusters)) {
> > > -             ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > -             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > -             /*
> > > -              * Either we didn't touch the cluster due to swapoff,
> > > -              * or the allocation must success.
> > > -              */
> > > -             VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > -             goto done;
> > > +new_cluster:
> > > +     ci = cluster_isolate_lock(si, &si->free_clusters);
> > > +     if (ci) {
> > > +             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > +                                              &found, order, usage);
> > > +             if (found)
> > > +                     goto done;
> > >       }
> > >
> > >       /* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >               swap_reclaim_full_clusters(si, false);
> > >
> > >       if (order < PMD_ORDER) {
> > > -             unsigned int frags = 0;
> > > +             unsigned int frags = 0, frags_existing;
> > >
> > > -             while (!list_empty(&si->nonfull_clusters[order])) {
> > > -                     ci = list_first_entry(&si->nonfull_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > -                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > +             while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > > +                     /*
> > > +                      * With `fragmenting` set to true, it will surely take
> >                                  ~~~~~~~~~~~
> >                          wondering what 'fragmenting' means here.
> 
> This comment is a bit out of context indeed, it actually trying to say
> the alloc_swap_scan_cluster call above should move the cluster to
> tail. I'll update the comment.
> 
> 
> 
> >
> > > +                      * the cluster off nonfull list
> > > +                      */
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >
> > > -             /*
> > > -              * Nonfull clusters are moved to frag tail if we reached
> > > -              * here, count them too, don't over scan the frag list.
> > > -              */
> > > -             while (frags < si->frag_cluster_nr[order]) {
> > > -                     ci = list_first_entry(&si->frag_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > +             frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > +             while (frags < frags_existing &&
> > > +                    (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > +                     atomic_long_dec(&si->frag_cluster_nr[order]);
> > >                       /*
> > > -                      * Rotate the frag list to iterate, they were all failing
> > > -                      * high order allocation or moved here due to per-CPU usage,
> > > -                      * this help keeping usable cluster ahead.
> > > +                      * Rotate the frag list to iterate, they were all
> > > +                      * failing high order allocation or moved here due to
> > > +                      * per-CPU usage, but they could contain newly released
> > > +                      * reclaimable (eg. lazy-freed swap cache) slots.
> > >                        */
> > > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >       }
> > >
> > > -     if (!list_empty(&si->discard_clusters)) {
> > > -             /*
> > > -              * we don't have free cluster but have some clusters in
> > > -              * discarding, do discard now and reclaim them, then
> > > -              * reread cluster_next_cpu since we dropped si->lock
> > > -              */
> > > -             swap_do_scheduled_discard(si);
> > > +     /*
> > > +      * We don't have free cluster but have some clusters in
> > > +      * discarding, do discard now and reclaim them, then
> > > +      * reread cluster_next_cpu since we dropped si->lock
> > > +      */
> > > +     if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > >               goto new_cluster;
> > > -     }
> > >
> > >       if (order)
> > >               goto done;
> > .....
> >
> >
>

Kairui Song Jan. 13, 2025, 6:33 a.m. UTC | #5

On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/09/25 at 10:15am, Kairui Song wrote:
> > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> >
> > Thanks for the very detailed review!
> >
> > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > ......snip.....
> > > > ---
> > > >  include/linux/swap.h |   3 +-
> > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > >   * throughput.
> > > >   */
> > > >  struct percpu_cluster {
> > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > >  };
> > > >
> > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > >                                       /* list of cluster that contains at least one free slot */
> > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > >                                       /* list of cluster that are fragmented or contented */
> > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > >       unsigned int pages;             /* total of usable pages of swap */
> > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 7795a3d27273..dadd4fead689 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> ...snip...
> > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > >
> > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > >  {
> > > > -     lockdep_assert_held(&si->lock);
> > > >       lockdep_assert_held(&ci->lock);
> > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > >       ci->order = 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > + * with list status, so cluster updaters can always know the cluster
> > > > + * list status without touching si lock.
> > > > + *
> > > > + * Note it's possible that all clusters on a list are contented so
> > > > + * this returns NULL for an non-empty list.
> > > > + */
> > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > +{
> > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > +
> > > > +     spin_lock(&si->lock);
> > > > +
> > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > +             goto out;
> > > > +
> > > > +     list_for_each_entry(ci, list, list) {
> > > > +             if (!spin_trylock(&ci->lock))
> > > > +                     continue;
> > > > +
> > > > +             /* We may only isolate and clear flags of following lists */
> > > > +             VM_BUG_ON(!ci->flags);
> > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > +
> > > > +             list_del(&ci->list);
> > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > +             ret = ci;
> > > > +             break;
> > > > +     }
> > > > +out:
> > > > +     spin_unlock(&si->lock);
> > > > +
> > > > +     return ret;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > -*/
> > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > + */
> > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > >  {
> > > >       struct swap_cluster_info *ci;
> > > > +     bool ret = false;
> > > >       unsigned int idx;
> > > >
> > > > +     spin_lock(&si->lock);
> > > >       while (!list_empty(&si->discard_clusters)) {
> > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > +             /*
> > > > +              * Delete the cluster from list but don't clear its flags until
> > > > +              * discard is done, so isolation and relocation will skip it.
> > > > +              */
> > > >               list_del(&ci->list);
> > >
> > > I don't understand above comment. ci has been taken off list. While
> > > allocation need isolate from a usable list. Even though we clear
> > > ci->flags now, how come isolation and relocation will touch it. I may
> > > miss anything here.
> >
> > There are many cases, one possible and common situation is that the
> > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > to it.
> >
> > Also, this commit removed protection of si lock on allocation, and
> > allocation path may also drop ci lock to call reclaim, which means one
> > cluster could be used or freed by anyone before allocator reacquire
> > the ci lock again. In that case, the allocator could see a discard
> > cluster.
> >
> > So we don't clear the discard flag, in case anyone misuse it.
> >
> > I can add more inline comments on this, this is already some related
> > comments above the function relocate_cluster, could add some more
> > referencing that.

Hi Baoquan,

>
> Thanks for your great explanation. I understand that si->percpu_cluster
> could point to a discarded ci, and a ci could be got from non-full,
> frag lists but later become discarded if that ci is freed on other cpu
> during cluster_reclaim_range() invocation. I haven't got how isolation
> could see a discarded ci in cluster_isolate_lock(). Could you help give
> an exmaple on how that happen?

cluster_isolate_lock shouldn't see a discard cluster, and there is a
VM_BUG_ON for that.

>
> Surely, I understand keeping the discarded flag is very necessary so
> that checking like cluster_is_usable() will return expected value.
>
> And by the way, I haven't got when the ' if (!ci->count)' case could
> happen in relocate_cluster() since we have filtered away discarded ci
> with the 'if (cluster_is_discard(ci))' checking. I asked in another
> thread, could you help explain it?

Many swap devices doesn't need discard so the cluster could be freed
directly. And actually the ci->count check in relocate_cluster is not
necessarily related to that.

The caller of relocate_cluster, may fail an allocation (eg. race with
swapoff) and that could end up calling relocate_cluster with a empty
cluster, such cluster should go to the free list (swapoff might fail
too).

The swapoff case is extremely rare but let's just be more robust here,
covering free cluster have almost no overhead but save a lot of
efforts. I can add some comments on this.

>
> static void relocate_cluster(struct swap_info_struct *si,
>                              struct swap_cluster_info *ci)
> {
>         lockdep_assert_held(&ci->lock);
>
>         /* Discard cluster must remain off-list or on discard list */
>         if (cluster_is_discard(ci))
>                 return;
>
>         if (!ci->count) {
>                 free_cluster(si, ci);
> ...
> }

Kairui Song Jan. 13, 2025, 8:07 a.m. UTC | #6

On Mon, Jan 13, 2025 at 2:33 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 01/09/25 at 10:15am, Kairui Song wrote:
> > > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > >
> > > Thanks for the very detailed review!
> > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > ......snip.....
> > > > > ---
> > > > >  include/linux/swap.h |   3 +-
> > > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > > >   * throughput.
> > > > >   */
> > > > >  struct percpu_cluster {
> > > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > > >  };
> > > > >
> > > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > > >                                       /* list of cluster that contains at least one free slot */
> > > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > > >                                       /* list of cluster that are fragmented or contented */
> > > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > > >       unsigned int pages;             /* total of usable pages of swap */
> > > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 7795a3d27273..dadd4fead689 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > ...snip...
> > > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > > >
> > > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > > >  {
> > > > > -     lockdep_assert_held(&si->lock);
> > > > >       lockdep_assert_held(&ci->lock);
> > > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > > >       ci->order = 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > > + * with list status, so cluster updaters can always know the cluster
> > > > > + * list status without touching si lock.
> > > > > + *
> > > > > + * Note it's possible that all clusters on a list are contented so
> > > > > + * this returns NULL for an non-empty list.
> > > > > + */
> > > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > > +{
> > > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > > +
> > > > > +     spin_lock(&si->lock);
> > > > > +
> > > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > > +             goto out;
> > > > > +
> > > > > +     list_for_each_entry(ci, list, list) {
> > > > > +             if (!spin_trylock(&ci->lock))
> > > > > +                     continue;
> > > > > +
> > > > > +             /* We may only isolate and clear flags of following lists */
> > > > > +             VM_BUG_ON(!ci->flags);
> > > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > > +
> > > > > +             list_del(&ci->list);
> > > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > > +             ret = ci;
> > > > > +             break;
> > > > > +     }
> > > > > +out:
> > > > > +     spin_unlock(&si->lock);
> > > > > +
> > > > > +     return ret;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > > -*/
> > > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > > + */
> > > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > >  {
> > > > >       struct swap_cluster_info *ci;
> > > > > +     bool ret = false;
> > > > >       unsigned int idx;
> > > > >
> > > > > +     spin_lock(&si->lock);
> > > > >       while (!list_empty(&si->discard_clusters)) {
> > > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > > +             /*
> > > > > +              * Delete the cluster from list but don't clear its flags until
> > > > > +              * discard is done, so isolation and relocation will skip it.
> > > > > +              */
> > > > >               list_del(&ci->list);
> > > >
> > > > I don't understand above comment. ci has been taken off list. While
> > > > allocation need isolate from a usable list. Even though we clear
> > > > ci->flags now, how come isolation and relocation will touch it. I may
> > > > miss anything here.
> > >
> > > There are many cases, one possible and common situation is that the
> > > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > > to it.
> > >
> > > Also, this commit removed protection of si lock on allocation, and
> > > allocation path may also drop ci lock to call reclaim, which means one
> > > cluster could be used or freed by anyone before allocator reacquire
> > > the ci lock again. In that case, the allocator could see a discard
> > > cluster.
> > >
> > > So we don't clear the discard flag, in case anyone misuse it.
> > >
> > > I can add more inline comments on this, this is already some related
> > > comments above the function relocate_cluster, could add some more
> > > referencing that.
>
> Hi Baoquan,
>
> >
> > Thanks for your great explanation. I understand that si->percpu_cluster
> > could point to a discarded ci, and a ci could be got from non-full,
> > frag lists but later become discarded if that ci is freed on other cpu
> > during cluster_reclaim_range() invocation. I haven't got how isolation
> > could see a discarded ci in cluster_isolate_lock(). Could you help give
> > an exmaple on how that happen?
>
> cluster_isolate_lock shouldn't see a discard cluster, and there is a
> VM_BUG_ON for that.

Oh, now I realize what you mean, the comment in
swap_do_scheduled_discard mentioned cluster_isolate_lock may see a
discard cluster, that is not true, it was added in an early version of
this series and forgot to update the comment. I'll just drop that.

>
> >
> > Surely, I understand keeping the discarded flag is very necessary so
> > that checking like cluster_is_usable() will return expected value.
> >
> > And by the way, I haven't got when the ' if (!ci->count)' case could
> > happen in relocate_cluster() since we have filtered away discarded ci
> > with the 'if (cluster_is_discard(ci))' checking. I asked in another
> > thread, could you help explain it?
>
> Many swap devices doesn't need discard so the cluster could be freed
> directly. And actually the ci->count check in relocate_cluster is not
> necessarily related to that.
>
> The caller of relocate_cluster, may fail an allocation (eg. race with
> swapoff) and that could end up calling relocate_cluster with a empty
> cluster, such cluster should go to the free list (swapoff might fail
> too).
>
> The swapoff case is extremely rare but let's just be more robust here,
> covering free cluster have almost no overhead but save a lot of
> efforts. I can add some comments on this.
>
> >
> > static void relocate_cluster(struct swap_info_struct *si,
> >                              struct swap_cluster_info *ci)
> > {
> >         lockdep_assert_held(&ci->lock);
> >
> >         /* Discard cluster must remain off-list or on discard list */
> >         if (cluster_is_discard(ci))
> >                 return;
> >
> >         if (!ci->count) {
> >                 free_cluster(si, ci);
> > ...
> > }

[v3,09/13] mm, swap: reduce contention on device lock

Commit Message

Comments

Patch