| Message ID | 20180622035151.6676-4-ying.huang@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
On Fri, Jun 22, 2018 at 11:51:33AM +0800, Huang, Ying wrote:
> +++ b/mm/swap_state.c
> @@ -433,7 +433,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         /*
>          * Swap entry may have been freed since our caller observed it.
>          */
> -       err = swapcache_prepare(entry);
> +       err = swapcache_prepare(entry, false);
>         if (err == -EEXIST) {
>                 radix_tree_preload_end();
>                 /*

This commit should be just a textual conflict.
Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jun 22, 2018 at 11:51:33AM +0800, Huang, Ying wrote:
>> +++ b/mm/swap_state.c
>> @@ -433,7 +433,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>         /*
>>          * Swap entry may have been freed since our caller observed it.
>>          */
>> -       err = swapcache_prepare(entry);
>> +       err = swapcache_prepare(entry, false);
>>         if (err == -EEXIST) {
>>                 radix_tree_preload_end();
>>                 /*
>
> This commit should be just a textual conflict.

Yes. Will check it.

Best Regards,
Huang, Ying
On Thu, Jun 21, 2018 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> From: Huang Ying <ying.huang@intel.com>
>
> To support swapping in a THP as a whole, we need to create PMD swap
> mappings during swapout and maintain the PMD swap mapping count.  This
> patch implements the support to increase the PMD swap mapping
> count (for swapout, fork, etc.) and to set the SWAP_HAS_CACHE flag (for
> swapin, etc.) for a huge swap cluster in the swap_duplicate() function
> family.  Although it implements only a part of the design of the swap
> reference count with PMD swap mapping, the whole design is described
> below to make it easy to understand the patch and the whole picture.
>
> A huge swap cluster is used to hold the contents of a swapped-out THP.
> After swapout, a PMD page mapping to the THP becomes a PMD swap
> mapping to the huge swap cluster via a swap entry in the PMD, while
> a PTE page mapping to a subpage of the THP becomes a PTE swap
> mapping to a swap slot in the huge swap cluster via a swap entry in
> the PTE.
>
> If there is no PMD swap mapping and the corresponding THP is removed
> from the page cache (reclaimed), the huge swap cluster will be split
> and become a normal swap cluster.
>
> The count (cluster_count()) of the huge swap cluster is
> SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + the PMD swap mapping count.  Because
> every swap slot in the huge swap cluster is mapped by a PTE or PMD, or
> has the SWAP_HAS_CACHE bit set, the usage count of the swap cluster is
> HPAGE_PMD_NR.  The PMD swap mapping count is recorded on top of that to
> make it easy to determine whether there are remaining PMD swap mappings.
>
> The count in swap_map[offset] is the sum of the PTE and PMD swap mapping
> counts.  This means that when we increase the PMD swap mapping count, we
> need to increase swap_map[offset] for all swap slots inside the swap
> cluster.  An alternative choice is to make swap_map[offset] record the
> PTE swap map count only, given that we have recorded the PMD swap mapping
> count in the count of the huge swap cluster.  But then we would need to
> increase swap_map[offset] when splitting the PMD swap mapping, which may
> fail because of memory allocation for swap count continuation.  That is
> hard to deal with, so we chose the current solution.
>
> The PMD swap mapping to a huge swap cluster may be split when unmapping
> part of the PMD mapping, etc.  That is easy, because only the count of
> the huge swap cluster needs to be changed.  When the last PMD swap
> mapping is gone and SWAP_HAS_CACHE is unset, we will split the huge swap
> cluster (clear the huge flag).  This makes it easy to reason about the
> cluster state.
>
> A huge swap cluster will be split when splitting the THP in the swap
> cache, failing to allocate a THP during swapin, etc.  But when
> splitting the huge swap cluster, we will not try to split all PMD swap
> mappings, because sometimes we don't have enough information available
> for that.  Later, when the PMD swap mapping is duplicated, swapped in,
> etc., the PMD swap mapping will be split and fall back to the PTE
> operation.
>
> When a THP is added into the swap cache, the SWAP_HAS_CACHE flag will be
> set in swap_map[offset] for all swap slots inside the huge swap
> cluster backing the THP.  This huge swap cluster will not be split
> unless the THP is split, even if its PMD swap mapping count drops to
> 0.  Later, when the THP is removed from the swap cache, the
> SWAP_HAS_CACHE flag will be cleared in swap_map[offset] for all swap
> slots inside the huge swap cluster, and the huge swap cluster will be
> split if its PMD swap mapping count is 0.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  include/linux/huge_mm.h |   5 +
>  include/linux/swap.h    |   9 +-
>  mm/memory.c             |   2 +-
>  mm/rmap.c               |   2 +-
>  mm/swap_state.c         |   2 +-
>  mm/swapfile.c           | 287 +++++++++++++++++++++++++++++++++---------------
>  6 files changed, 214 insertions(+), 93 deletions(-)

I'm probably missing some background, but I find the patch hard to
read. Can you disseminate some of this patch changelog into kernel-doc
commentary so it's easier to follow which helpers do what relative to
THP swap?

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index d3bbf6bea9e9..213d32e57c39 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -80,6 +80,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> +static inline bool thp_swap_supported(void)
> +{
> +       return IS_ENABLED(CONFIG_THP_SWAP);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define HPAGE_PMD_SHIFT PMD_SHIFT
>  #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f73eafcaf4e9..57aa655ab27d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int);
>  extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
> -extern int swap_duplicate(swp_entry_t);
> -extern int swapcache_prepare(swp_entry_t);
> +extern int swap_duplicate(swp_entry_t *entry, bool cluster);

This patch introduces a new flag to swap_duplicate(), but then all
usages still pass 'false', so why does this patch change the argument?
Seems this change belongs in another patch?

> +extern int swapcache_prepare(swp_entry_t entry, bool cluster);

Rather than add a cluster flag to these helpers, can the swp_entry_t
carry the cluster flag directly?
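The reference-count arithmetic described in the changelog is the crux of the design and is easy to lose track of. Below is a standalone toy model of it; the struct and helper names are invented for illustration and are not kernel code, only the arithmetic mirrors the changelog:

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512        /* subpages per THP on x86-64 */

/*
 * Toy model of the changelog's rule:
 *   cluster_count = SWAPFILE_CLUSTER (one usage per swap slot, held
 *                   via PTE, PMD, or SWAP_HAS_CACHE)
 *                 + number of PMD swap mappings
 */
struct cluster_model {
        unsigned int count;
        bool huge;
};

static unsigned int pmd_map_count(const struct cluster_model *c)
{
        return c->count - HPAGE_PMD_NR;
}

int main(void)
{
        /* Swapping out a THP: all 512 slots in use, one PMD mapping. */
        struct cluster_model c = { .count = HPAGE_PMD_NR + 1, .huge = true };

        c.count++;                      /* fork() duplicates the PMD mapping */
        assert(pmd_map_count(&c) == 2);

        c.count -= 2;                   /* both PMD mappings unmapped */
        assert(pmd_map_count(&c) == 0);
        /* Now, once SWAP_HAS_CACHE is also clear, the cluster may be
         * split, i.e. the huge flag cleared. */
        c.huge = false;
        return 0;
}
```

Concretely, on x86-64 HPAGE_PMD_NR is 512, so a swapped-out THP with a single PMD swap mapping has cluster_count() == 513.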
Dan Williams <dan.j.williams@gmail.com> writes:

> On Thu, Jun 21, 2018 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> From: Huang Ying <ying.huang@intel.com>
>>
>> To support swapping in a THP as a whole, we need to create PMD swap
>> mappings during swapout and maintain the PMD swap mapping count.
>>
>> [...]
>>
>>  include/linux/huge_mm.h |   5 +
>>  include/linux/swap.h    |   9 +-
>>  mm/memory.c             |   2 +-
>>  mm/rmap.c               |   2 +-
>>  mm/swap_state.c         |   2 +-
>>  mm/swapfile.c           | 287 +++++++++++++++++++++++++++++++++---------------
>>  6 files changed, 214 insertions(+), 93 deletions(-)
>
> I'm probably missing some background, but I find the patch hard to
> read. Can you disseminate some of this patch changelog into kernel-doc
> commentary so it's easier to follow which helpers do what relative to
> THP swap?

Yes. This is a good idea. Thanks for pointing it out. I will add more
kernel-doc commentary to make the code easier to understand.

>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f73eafcaf4e9..57aa655ab27d 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int);
>>  extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
>>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>  extern void swap_shmem_alloc(swp_entry_t);
>> -extern int swap_duplicate(swp_entry_t);
>> -extern int swapcache_prepare(swp_entry_t);
>> +extern int swap_duplicate(swp_entry_t *entry, bool cluster);
>
> This patch introduces a new flag to swap_duplicate(), but then all
> usages still pass 'false', so why does this patch change the argument?
> Seems this change belongs in another patch?

This patch just introduces the capability to deal with huge swap
entries in the swap_duplicate() family of functions. The first user of
the huge swap entry is in [PATCH -mm -v4 08/21] mm, THP, swap: Support
to read a huge swap cluster for swapin a THP, via swapcache_prepare().
Yes, it is generally better to put the implementation and the user into
one patch. But I found that in that way I would have to put most of the
code of this patchset into a single huge patch, which is not good for
review either. So I made a compromise and separated the implementation
and the users into different patches, to keep any single patch from
growing too large. Does this make sense to you?

>> +extern int swapcache_prepare(swp_entry_t entry, bool cluster);
>
> Rather than add a cluster flag to these helpers, can the swp_entry_t
> carry the cluster flag directly?

Matthew Wilcox suggested replacing the "cluster" flag with the number
of entries, to make the interface more flexible. And he suggested a
very smart way to encode nr_entries into the swp_entry_t, with
something like

  https://plus.google.com/117536210417097546339/posts/hvctn17WUZu

But I think we would need to

- encode the swap type, swap offset, and nr_entries into a new swp_entry_t
- call a function
- decode the new swp_entry_t in the function

So it appears that it doesn't bring real value, except reducing one
parameter. Or do you suggest something else?

Best Regards,
Huang, Ying
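For reference, one shape such an encoding could take, assuming huge swap entries are always cluster-aligned so a low offset bit is free. This is a speculative sketch with invented helper names, not necessarily the scheme from the linked post:

```c
#include <stdbool.h>

/* Mirrors the kernel's opaque swap-entry type. */
typedef struct { unsigned long val; } swp_entry_t;

#define SWAPFILE_CLUSTER 512UL

/* A cluster-aligned offset has its low 9 bits clear, so bit 0 can
 * flag "this entry covers a whole cluster". */
static inline swp_entry_t make_huge_swp_entry(swp_entry_t e)
{
        return (swp_entry_t){ e.val | 1 };
}

static inline bool swp_entry_is_huge(swp_entry_t e)
{
        return e.val & 1;
}

static inline unsigned long swp_entry_nr(swp_entry_t e)
{
        return swp_entry_is_huge(e) ? SWAPFILE_CLUSTER : 1;
}
```

As the reply above notes, the callee still has to decode the entry, so the main saving is one function parameter.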
> +static inline bool thp_swap_supported(void)
> +{
> +       return IS_ENABLED(CONFIG_THP_SWAP);
> +}

This seems like rather useless abstraction. Why do we need it?

...

> -static inline int swap_duplicate(swp_entry_t swp)
> +static inline int swap_duplicate(swp_entry_t *swp, bool cluster)
>  {
>         return 0;
>  }

FWIW, I despise true/false function arguments like this. When I see
this in code:

        swap_duplicate(&entry, false);

I have no idea what false does. I'd much rather see:

        enum do_swap_cluster {
                SWP_DO_CLUSTER,
                SWP_NO_CLUSTER
        };

so you see:

        swap_duplicate(&entry, SWP_NO_CLUSTER);

vs.

        swap_duplicate(&entry, SWP_DO_CLUSTER);

> diff --git a/mm/memory.c b/mm/memory.c
> index e9cac1c4fa69..f3900282e3da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -951,7 +951,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>                 swp_entry_t entry = pte_to_swp_entry(pte);
>
>                 if (likely(!non_swap_entry(entry))) {
> -                       if (swap_duplicate(entry) < 0)
> +                       if (swap_duplicate(&entry, false) < 0)
>                                 return entry.val;
>
>                         /* make sure dst_mm is on swapoff's mmlist. */

I'll also point out that in a multi-hundred-line patch, adding arguments
to an existing function would not be something I'd try to include in the
patch. I'd break it out separately unless absolutely necessary.

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f42b1b0cdc58..48e2c54385ee 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> [...]
> -static unsigned char __swap_entry_free(struct swap_info_struct *p,
> -                                      swp_entry_t entry, unsigned char usage)
> +static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
> +                                             struct swap_cluster_info *ci,
> +                                             unsigned long offset,
> +                                             unsigned char usage)
> [...]
> -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> +static int __swap_duplicate_locked(struct swap_info_struct *p,
> +                                  unsigned long offset, unsigned char usage)
> [...]
> -unlock_out:
> +out:
> +       return err;
> +}

... and that all looks like refactoring, not actively implementing PMD
swap support. That's unfortunate.

> +/*
> + * Verify that a swap entry is valid and increment its swap map count.
> + *
> + * Returns error code in following case.
> + * - success -> 0
> + * - swp_entry is invalid -> EINVAL
> + * - swp_entry is migration entry -> EINVAL
> + * - swap-cache reference is requested but there is already one. -> EEXIST
> + * - swap-cache reference is requested but the entry is not used. -> ENOENT
> + * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
> + */
> +static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> +{
> +       struct swap_info_struct *p;
> +       struct swap_cluster_info *ci;
> +       unsigned long offset;
> +       int err = -EINVAL;
> +
> +       p = get_swap_device(entry);
> +       if (!p)
> +               goto out;

Is this an error, or just for running into something like a migration
entry? Comments please.

> +       offset = swp_offset(entry);
> +       ci = lock_cluster_or_swap_info(p, offset);
> +       err = __swap_duplicate_locked(p, offset, usage);
>         unlock_cluster_or_swap_info(p, ci);
> +
> +       put_swap_device(p);
>  out:
> -       if (p)
> -               put_swap_device(p);
>         return err;
>  }

Not a comment on this patch, but lock_cluster_or_swap_info() is woefully
uncommented.

> +#ifdef CONFIG_THP_SWAP
> +static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
> +{
> +       struct swap_info_struct *si;
> +       struct swap_cluster_info *ci;
> +       unsigned long offset;
> +       unsigned char *map;
> +       int i, err = 0;

Instead of an #ifdef, is there a reason we can't just do:

        if (!IS_ENABLED(THP_SWAP))
                return 0;

?

> +       si = get_swap_device(*entry);
> +       if (!si) {
> +               err = -EINVAL;
> +               goto out;
> +       }
> +       offset = swp_offset(*entry);
> +       ci = lock_cluster(si, offset);

Could you explain a bit why we do lock_cluster() and not
lock_cluster_or_swap_info() here?

> +       if (cluster_is_free(ci)) {
> +               err = -ENOENT;
> +               goto unlock;
> +       }

Needs comments on how this could happen. We just took the lock, so I
assume this is some kind of race, but can you elaborate?

> +       if (!cluster_is_huge(ci)) {
> +               err = -ENOTDIR;
> +               goto unlock;
> +       }

Yikes! This function is the core of the new functionality and its
comment count is exactly 0. There was quite a long patch description,
which will surely be lost to the ages, but nothing in the code that
folks _will_ be looking at for decades to come.

Can we fix that?

> +       VM_BUG_ON(!is_cluster_offset(offset));
> +       VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);

So, by this point, we know we are looking at (or supposed to be looking
at) a cluster on the device?

> +       map = si->swap_map + offset;
> +       if (usage == SWAP_HAS_CACHE) {
> +               if (map[0] & SWAP_HAS_CACHE) {
> +                       err = -EEXIST;
> +                       goto unlock;
> +               }
> +               for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> +                       VM_BUG_ON(map[i] & SWAP_HAS_CACHE);
> +                       map[i] |= SWAP_HAS_CACHE;
> +               }

So, it's OK to race with the first entry, but after that it's a bug
because the tail pages should agree with the head page's state?

> +       } else {
> +               for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> +retry:
> +                       err = __swap_duplicate_locked(si, offset + i, usage);
> +                       if (err == -ENOMEM) {
> +                               struct page *page;
> +
> +                               page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);

I noticed that the non-clustering analog of this function takes a GFP
mask. Why not this one?

> +                               err = add_swap_count_continuation_locked(
> +                                       si, offset + i, page);
> +                               if (err) {
> +                                       *entry = swp_entry(si->type, offset+i);
> +                                       goto undup;
> +                               }
> +                               goto retry;
> +                       } else if (err)
> +                               goto undup;
> +               }
> +               cluster_set_count(ci, cluster_count(ci) + usage);
> +       }
> +unlock:
> +       unlock_cluster(ci);
> +       put_swap_device(si);
> +out:
> +       return err;
> +undup:
> +       for (i--; i >= 0; i--)
> +               __swap_entry_free_locked(
> +                       si, ci, offset + i, usage);
> +       goto unlock;
> +}

So, we've basically created a fork of the __swap_duplicate() code for
huge pages, along with a presumably new set of bugs and a second code
path to update. Was this unavoidable? Can we unify this any more with
the small pages path?

> -int swap_duplicate(swp_entry_t entry)
> +int swap_duplicate(swp_entry_t *entry, bool cluster)
>  {
>         int err = 0;
>
> -       while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
> -               err = add_swap_count_continuation(entry, GFP_ATOMIC);
> +       if (thp_swap_supported() && cluster)
> +               return __swap_duplicate_cluster(entry, 1);
> +
> +       while (!err && __swap_duplicate(*entry, 1) == -ENOMEM)
> +               err = add_swap_count_continuation(*entry, GFP_ATOMIC);
>         return err;
>  }

Reading this, I wonder whether this has been refactored as much as
possible. Both add_swap_count_continuation() and
__swap_duplicate_cluster() start off with the same get_swap_device()
dance.

> -/*
> - * add_swap_count_continuation - called when a swap count is duplicated
> - * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
> - * page of the original vmalloc'ed swap_map, to hold the continuation count
> - * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
> - * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.

This closes out with a lot of refactoring noise. Any chance that can be
isolated into another patch?
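On the unification question, one direction would be a single nr-based core loop shared by the 4K and PMD paths. A rough sketch built from the helpers this patch already introduces; `__swap_duplicate_nr()` is an invented name, and the -ENOMEM continuation retry is elided:

```c
/* Sketch: one core loop serving both cases.  Callers pass nr = 1 for a
 * single entry or nr = SWAPFILE_CLUSTER for a huge cluster; the caller
 * holds the cluster lock covering the whole range either way.
 */
static int __swap_duplicate_nr(struct swap_info_struct *si,
                               struct swap_cluster_info *ci,
                               unsigned long offset, int nr,
                               unsigned char usage)
{
        int i, err = 0;

        for (i = 0; i < nr; i++) {
                err = __swap_duplicate_locked(si, offset + i, usage);
                if (err)
                        goto undup;
        }
        return 0;
undup:
        /* Roll back the entries already duplicated, as the cluster
         * path's undup: label does in the patch. */
        while (--i >= 0)
                __swap_entry_free_locked(si, ci, offset + i, usage);
        return err;
}
```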
Dave Hansen <dave.hansen@linux.intel.com> writes:

>> +static inline bool thp_swap_supported(void)
>> +{
>> +       return IS_ENABLED(CONFIG_THP_SWAP);
>> +}
>
> This seems like rather useless abstraction. Why do we need it?

I just wanted to make it shorter: 19 vs. 27 characters. But if you
think IS_ENABLED(CONFIG_THP_SWAP) is much better, I can use that
instead.

>> -static inline int swap_duplicate(swp_entry_t swp)
>> +static inline int swap_duplicate(swp_entry_t *swp, bool cluster)
>>  {
>>         return 0;
>>  }
>
> FWIW, I despise true/false function arguments like this. When I see
> this in code:
>
>         swap_duplicate(&entry, false);
>
> I have no idea what false does. I'd much rather see:
>
>         enum do_swap_cluster {
>                 SWP_DO_CLUSTER,
>                 SWP_NO_CLUSTER
>         };
> [...]

Yes. A boolean parameter isn't good most of the time. Matthew Wilcox
suggested using

        swap_duplicate(&entry, HPAGE_PMD_NR);

vs.

        swap_duplicate(&entry, 1);

He thinks this makes the interface more flexible, to support other swap
entry sizes in the future. What do you think about that?

> I'll also point out that in a multi-hundred-line patch, adding arguments
> to an existing function would not be something I'd try to include in the
> patch. I'd break it out separately unless absolutely necessary.

You mean add another patch, which only adds arguments to the function,
but does not change the body of the function?

>> +       p = get_swap_device(entry);
>> +       if (!p)
>> +               goto out;
>
> Is this an error, or just for running into something like a migration
> entry? Comments please.

__swap_duplicate() may be called with an invalid swap entry, because
the swap device may be swapped off after we get the swap entry during a
page fault. Yes, I will add some comments here.

> Not a comment on this patch, but lock_cluster_or_swap_info() is woefully
> uncommented.

OK. Will add some comments for that.

>> +#ifdef CONFIG_THP_SWAP
>> +static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
>
> Instead of an #ifdef, is there a reason we can't just do:
>
>         if (!IS_ENABLED(THP_SWAP))
>                 return 0;
>
> ?

Good idea. Will do this for the whole patchset.

>> +       offset = swp_offset(*entry);
>> +       ci = lock_cluster(si, offset);
>
> Could you explain a bit why we do lock_cluster() and not
> lock_cluster_or_swap_info() here?

The code size of lock_cluster() is a little smaller, and I think it is
a little easier to read. But I know lock_cluster_or_swap_info() could
be used here without functionality problems. If we try to merge the
code for huge and normal swap entries, it could be used.

>> +       if (cluster_is_free(ci)) {
>> +               err = -ENOENT;
>> +               goto unlock;
>> +       }
>
> Needs comments on how this could happen. We just took the lock, so I
> assume this is some kind of race, but can you elaborate?

Sure. Will add some comments for this.

>> +       if (!cluster_is_huge(ci)) {
>> +               err = -ENOTDIR;
>> +               goto unlock;
>> +       }
>
> Yikes! This function is the core of the new functionality and its
> comment count is exactly 0. [...]
>
> Can we fix that?

Sure. Will add more comments.

>> +       VM_BUG_ON(!is_cluster_offset(offset));
>> +       VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
>
> So, by this point, we know we are looking at (or supposed to be looking
> at) a cluster on the device?

Yes.

>> +       map = si->swap_map + offset;
>> +       if (usage == SWAP_HAS_CACHE) {
>> +               if (map[0] & SWAP_HAS_CACHE) {
>> +                       err = -EEXIST;
>> +                       goto unlock;
>> +               }
>> +               for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> +                       VM_BUG_ON(map[i] & SWAP_HAS_CACHE);
>> +                       map[i] |= SWAP_HAS_CACHE;
>> +               }
>
> So, it's OK to race with the first entry, but after that it's a bug
> because the tail pages should agree with the head page's state?

Yes. Will add some comments about this.

>> +                       if (err == -ENOMEM) {
>> +                               struct page *page;
>> +
>> +                               page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);
>
> I noticed that the non-clustering analog of this function takes a GFP
> mask. Why not this one?

The value of gfp_mask is GFP_ATOMIC in swap_duplicate(), so they are
exactly the same.

> So, we've basically created a fork of the __swap_duplicate() code for
> huge pages, along with a presumably new set of bugs and a second code
> path to update. Was this unavoidable? Can we unify this any more with
> the small pages path?

Will discuss this in another thread.

> Reading this, I wonder whether this has been refactored as much as
> possible. Both add_swap_count_continuation() and
> __swap_duplicate_cluster() start off with the same get_swap_device()
> dance.

Yes. There's some duplicated code logic. Will think about how to
improve it.

> This closes out with a lot of refactoring noise. Any chance that can be
> isolated into another patch?

Sure. Will do that.

Best Regards,
Huang, Ying
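For concreteness, the IS_ENABLED() form agreed on here collapses the #ifdef/#else pair into one definition; note the macro needs the full symbol name, CONFIG_THP_SWAP. A sketch with the body abbreviated (the dead branch is compiled out but still type-checked, which is the usual argument for this pattern over #ifdef):

```c
static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
{
        struct swap_info_struct *si;

        if (!IS_ENABLED(CONFIG_THP_SWAP))
                return 0;

        si = get_swap_device(*entry);
        if (!si)
                return -EINVAL;
        /* ... cluster path exactly as in the patch ... */
        put_swap_device(si);
        return 0;
}
```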
> Yes. A boolean parameter isn't good most of the time. Matthew Wilcox
> suggested using
>
>         swap_duplicate(&entry, HPAGE_PMD_NR);
>
> vs.
>
>         swap_duplicate(&entry, 1);
>
> He thinks this makes the interface more flexible, to support other swap
> entry sizes in the future. What do you think about that?

That looks great to me too.

>>>                 if (likely(!non_swap_entry(entry))) {
>>> -                       if (swap_duplicate(entry) < 0)
>>> +                       if (swap_duplicate(&entry, false) < 0)
>>>                                 return entry.val;
>>>
>>>                         /* make sure dst_mm is on swapoff's mmlist. */
>>
>> I'll also point out that in a multi-hundred-line patch, adding arguments
>> to an existing function would not be something I'd try to include in the
>> patch. I'd break it out separately unless absolutely necessary.
>
> You mean add another patch, which only adds arguments to the function,
> but does not change the body of the function?

Yes. Or, just add the non-THP-swap version first.
Dave Hansen <dave.hansen@linux.intel.com> writes:

>> Yes. A boolean parameter isn't good most of the time. Matthew Wilcox
>> suggested using
>>
>>         swap_duplicate(&entry, HPAGE_PMD_NR);
>>
>> vs.
>>
>>         swap_duplicate(&entry, 1);
>>
>> He thinks this makes the interface more flexible, to support other swap
>> entry sizes in the future. What do you think about that?
>
> That looks great to me too.
>
>>> I'll also point out that in a multi-hundred-line patch, adding arguments
>>> to an existing function would not be something I'd try to include in the
>>> patch. I'd break it out separately unless absolutely necessary.
>>
>> You mean add another patch, which only adds arguments to the function,
>> but does not change the body of the function?
>
> Yes. Or, just add the non-THP-swap version first.

OK, will do this.

Best Regards,
Huang, Ying
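At the call sites, the direction agreed on above would read roughly as follows; this is a sketch of the interface shape, not the exact signature any later revision of the series settles on:

```c
/* The bool becomes a number of entries, so the size is visible
 * at every call site. */
extern int swap_duplicate(swp_entry_t *entry, int entry_size);

        /* PTE path, e.g. in copy_one_pte(): */
        if (swap_duplicate(&entry, 1) < 0)
                return entry.val;

        /* PMD path, duplicating a PMD swap mapping: */
        if (swap_duplicate(&entry, HPAGE_PMD_NR) < 0)
                goto fallback;      /* split and retry as PTEs */
```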
```diff
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3bbf6bea9e9..213d32e57c39 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,6 +80,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+static inline bool thp_swap_supported(void)
+{
+	return IS_ENABLED(CONFIG_THP_SWAP);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f73eafcaf4e9..57aa655ab27d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t);
+extern int swap_duplicate(swp_entry_t *entry, bool cluster);
+extern int swapcache_prepare(swp_entry_t entry, bool cluster);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
@@ -510,7 +510,8 @@ static inline void show_swap_cache_info(void)
 }
 
 #define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
-#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+#define swapcache_prepare(e, c)						\
+	({(is_migration_entry(e) || is_device_private_entry(e)); })
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
@@ -521,7 +522,7 @@ static inline void swap_shmem_alloc(swp_entry_t swp)
 {
 }
 
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate(swp_entry_t *swp, bool cluster)
 {
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index e9cac1c4fa69..f3900282e3da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -951,7 +951,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if (likely(!non_swap_entry(entry))) {
-			if (swap_duplicate(entry) < 0)
+			if (swap_duplicate(&entry, false) < 0)
 				return entry.val;
 
 			/* make sure dst_mm is on swapoff's mmlist. */
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..5f45d6325c40 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1556,7 +1556,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				break;
 			}
 
-			if (swap_duplicate(entry) < 0) {
+			if (swap_duplicate(&entry, false) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index dc312559f7df..b0575182e77b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -433,7 +433,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		err = swapcache_prepare(entry);
+		err = swapcache_prepare(entry, false);
 		if (err == -EEXIST) {
 			radix_tree_preload_end();
 			/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f42b1b0cdc58..48e2c54385ee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,6 +49,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 					unsigned char);
 static void free_swap_count_continuations(struct swap_info_struct *);
 static sector_t map_swap_entry(swp_entry_t, struct block_device**);
+static int add_swap_count_continuation_locked(struct swap_info_struct *si,
+					      unsigned long offset,
+					      struct page *page);
 
 DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -319,6 +322,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
 		spin_unlock(&si->lock);
 }
 
+static inline bool is_cluster_offset(unsigned long offset)
+{
+	return !(offset % SWAPFILE_CLUSTER);
+}
+
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
 {
 	return cluster_is_null(&list->head);
@@ -1166,16 +1174,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
-static unsigned char __swap_entry_free(struct swap_info_struct *p,
-				       swp_entry_t entry, unsigned char usage)
+static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
+					      struct swap_cluster_info *ci,
+					      unsigned long offset,
+					      unsigned char usage)
 {
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
 
-	ci = lock_cluster_or_swap_info(p, offset);
-
 	count = p->swap_map[offset];
 
 	has_cache = count & SWAP_HAS_CACHE;
@@ -1203,6 +1209,17 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
 
+	return usage;
+}
+
+static unsigned char __swap_entry_free(struct swap_info_struct *p,
+				       swp_entry_t entry, unsigned char usage)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+	usage = __swap_entry_free_locked(p, ci, offset, usage);
 	unlock_cluster_or_swap_info(p, ci);
 
 	return usage;
@@ -3450,32 +3467,12 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_locked(struct swap_info_struct *p,
+				   unsigned long offset, unsigned char usage)
 {
-	struct swap_info_struct *p;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
-	int err = -EINVAL;
-
-	p = get_swap_device(entry);
-	if (!p)
-		goto out;
-
-	offset = swp_offset(entry);
-	ci = lock_cluster_or_swap_info(p, offset);
+	int err = 0;
 
 	count = p->swap_map[offset];
 
@@ -3485,12 +3482,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	 */
 	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
 		err = -ENOENT;
-		goto unlock_out;
+		goto out;
 	}
 
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
-	err = 0;
 
 	if (usage == SWAP_HAS_CACHE) {
 
@@ -3517,11 +3513,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
 	p->swap_map[offset] = count | has_cache;
 
-unlock_out:
+out:
+	return err;
+}
+
+/*
+ * Verify that a swap entry is valid and increment its swap map count.
+ *
+ * Returns error code in following case.
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ */
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+	int err = -EINVAL;
+
+	p = get_swap_device(entry);
+	if (!p)
+		goto out;
+
+	offset = swp_offset(entry);
+	ci = lock_cluster_or_swap_info(p, offset);
+	err = __swap_duplicate_locked(p, offset, usage);
 	unlock_cluster_or_swap_info(p, ci);
+
+	put_swap_device(p);
 out:
-	if (p)
-		put_swap_device(p);
 	return err;
 }
 
@@ -3534,6 +3558,81 @@ void swap_shmem_alloc(swp_entry_t entry)
 	__swap_duplicate(entry, SWAP_MAP_SHMEM);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+	unsigned char *map;
+	int i, err = 0;
+
+	si = get_swap_device(*entry);
+	if (!si) {
+		err = -EINVAL;
+		goto out;
+	}
+	offset = swp_offset(*entry);
+	ci = lock_cluster(si, offset);
+	if (cluster_is_free(ci)) {
+		err = -ENOENT;
+		goto unlock;
+	}
+	if (!cluster_is_huge(ci)) {
+		err = -ENOTDIR;
+		goto unlock;
+	}
+	VM_BUG_ON(!is_cluster_offset(offset));
+	VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
+	map = si->swap_map + offset;
+	if (usage == SWAP_HAS_CACHE) {
+		if (map[0] & SWAP_HAS_CACHE) {
+			err = -EEXIST;
+			goto unlock;
+		}
+		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+			VM_BUG_ON(map[i] & SWAP_HAS_CACHE);
+			map[i] |= SWAP_HAS_CACHE;
+		}
+	} else {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+retry:
+			err = __swap_duplicate_locked(si, offset + i, usage);
+			if (err == -ENOMEM) {
+				struct page *page;
+
+				page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);
+				err = add_swap_count_continuation_locked(
+					si, offset + i, page);
+				if (err) {
+					*entry = swp_entry(si->type, offset+i);
+					goto undup;
+				}
+				goto retry;
+			} else if (err)
+				goto undup;
+		}
+		cluster_set_count(ci, cluster_count(ci) + usage);
+	}
+unlock:
+	unlock_cluster(ci);
+	put_swap_device(si);
+out:
+	return err;
+undup:
+	for (i--; i >= 0; i--)
+		__swap_entry_free_locked(
+			si, ci, offset + i, usage);
+	goto unlock;
+}
+#else
+static inline int __swap_duplicate_cluster(swp_entry_t *entry,
+					   unsigned char usage)
+{
+	return 0;
+}
+#endif
+
 /*
  * Increase reference count of swap entry by 1.
  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
@@ -3541,12 +3640,15 @@ void swap_shmem_alloc(swp_entry_t entry)
 * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
 * might occur if a page table entry has got corrupted.
 */
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate(swp_entry_t *entry, bool cluster)
 {
 	int err = 0;
 
-	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
+	if (thp_swap_supported() && cluster)
+		return __swap_duplicate_cluster(entry, 1);
+
+	while (!err && __swap_duplicate(*entry, 1) == -ENOMEM)
+		err = add_swap_count_continuation(*entry, GFP_ATOMIC);
 	return err;
 }
 
@@ -3558,9 +3660,12 @@ int swap_duplicate(swp_entry_t entry)
  * -EBUSY means there is a swap cache.
  * Note: return code is different from swap_duplicate().
  */
-int swapcache_prepare(swp_entry_t entry)
+int swapcache_prepare(swp_entry_t entry, bool cluster)
 {
-	return __swap_duplicate(entry, SWAP_HAS_CACHE);
+	if (thp_swap_supported() && cluster)
+		return __swap_duplicate_cluster(&entry, SWAP_HAS_CACHE);
+	else
+		return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
@@ -3590,51 +3695,13 @@ pgoff_t __page_file_index(struct page *page)
 }
 EXPORT_SYMBOL_GPL(__page_file_index);
 
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+static int add_swap_count_continuation_locked(struct swap_info_struct *si,
+					      unsigned long offset,
+					      struct page *page)
 {
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
 	struct page *head;
-	struct page *page;
 	struct page *list_page;
-	pgoff_t offset;
 	unsigned char count;
-	int ret = 0;
-
-	/*
-	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
-	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
-	 */
-	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
-	si = get_swap_device(entry);
-	if (!si) {
-		/*
-		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap device may be swapoff
-		 */
-		goto outer;
-	}
-	spin_lock(&si->lock);
-
-	offset = swp_offset(entry);
-
-	ci = lock_cluster(si, offset);
 
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
@@ -3644,13 +3711,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 		 * will race to add swap count continuation: we need to avoid
 		 * over-provisioning.
 		 */
-		goto out;
+		return 0;
 	}
 
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!page)
+		return -ENOMEM;
 
 	/*
 	 * We are fortunate that although vmalloc_to_page uses pte_offset_map,
@@ -3698,7 +3763,57 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	page = NULL;			/* now it's attached, don't free it */
 out_unlock_cont:
 	spin_unlock(&si->cont_lock);
-out:
+	if (page)
+		__free_page(page);
+	return 0;
+}
+
+/*
+ * add_swap_count_continuation - called when a swap count is duplicated
+ * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
+ * page of the original vmalloc'ed swap_map, to hold the continuation count
+ * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
+ * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
+ *
+ * These continuation pages are seldom referenced: the common paths all work
+ * on the original swap_map, only referring to a continuation page when the
+ * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
+ *
+ * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
+ * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
+ * can be called after dropping locks.
+ */
+int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	struct page *page;
+	unsigned long offset;
+	int ret = 0;
+
+	/*
+	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
+	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
+	 */
+	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
+
+	si = get_swap_device(entry);
+	if (!si) {
+		/*
+		 * An acceptable race has occurred since the failing
+		 * __swap_duplicate(): the swap device may be swapoff
+		 */
+		goto outer;
+	}
+	spin_lock(&si->lock);
+
+	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
+	ret = add_swap_count_continuation_locked(si, offset, page);
+	page = NULL;
+
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 	put_swap_device(si);
```
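As a reading aid for the swap_map manipulation in this patch: each swap_map byte packs a mapping count together with the SWAP_HAS_CACHE bit (0x40 in include/linux/swap.h), and the cluster path sets or clears that bit on all SWAPFILE_CLUSTER slots at once. A small userspace demonstration of the byte layout (not kernel code; only the constant and the masking mirror mm/swapfile.c):

```c
#include <assert.h>
#include <stdio.h>

#define SWAP_HAS_CACHE 0x40	/* matches include/linux/swap.h */

/* Same masking as swap_count() in mm/swapfile.c. */
static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

int main(void)
{
	unsigned char map = 0;		/* one swap_map[] slot */

	map += 1;			/* one PTE/PMD mapping reference */
	map |= SWAP_HAS_CACHE;		/* page added to the swap cache */
	assert(swap_count(map) == 1);
	assert(map & SWAP_HAS_CACHE);

	map &= ~SWAP_HAS_CACHE;		/* page removed from the swap cache */
	assert(map == 1);
	printf("count=%u\n", swap_count(map));
	return 0;
}
```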