diff mbox series

[RFC] mm: thp: implement THP reservations for anonymous memory

Message ID 1541746138-6706-1-git-send-email-anthony.yznaga@oracle.com (mailing list archive)
State New, archived
Headers show
Series [RFC] mm: thp: implement THP reservations for anonymous memory | expand

Commit Message

Anthony Yznaga Nov. 9, 2018, 6:48 a.m. UTC
When the THP enabled policy is "always", or the mode is "madvise" and a
region is marked as MADV_HUGEPAGE, a hugepage is allocated on a page
fault if the PMD is empty.  This yields the best VA translation
performance but increases memory consumption if a significant part of
the huge page is never accessed.

A while back a former colleague presented a patch to help address this
bloat [1]. Feedback from the community suggested investigating an alternate
approach to allocating THP hugepages using reservations, and since then
I have taken my colleague's work and expanded on it to implement a form
of reservation-based THP for private anonymous memory.  What I hope to
gain from this RFC is whether this approach is viable and what issues
there may be that I had not considered.  Apologies for the monolithic
patch.

The basic idea as outlined by Mel Gorman in [2] is:

1) On first fault in a sufficiently sized range, allocate a huge page
   sized and aligned block of base pages.  Map the base page
   corresponding to the fault address and hold the rest of the pages in
   reserve.
2) On subsequent faults in the range, map the pages from the reservation.
3) When enough pages have been mapped, promote the mapped pages and
   remaining pages in the reservation to a huge page.
4) When there is memory pressure, release the unused pages from their
   reservations.

[1] https://marc.info/?l=linux-mm&m=151631857310828&w=2
[2] https://lkml.org/lkml/2018/1/25/571

To test the idea I wrote a simple test that repeatedly forks children
where each child attempts to allocate a very large chunk of memory and
then touch either 1 page or a random number of pages in each huge page
region of the chunk.  On a machine with 256GB with a test chunk size of
16GB the test ends when the 17th child fails to map its chunk.  With THP
reservations enabled, the test ends when the 118th child fails.

Below are some additional implementation details and known issues.

User-visible files:

/sys/kernel/mm/transparent_hugepage/promotion_threshold

	The number of base pages within a huge page aligned region that
	must be faulted in before the region is eligible for promotion
	to a huge page.

 	1
 	On the first page fault in a huge page sized and aligned
 	region, allocate and map a huge page.

 	> 1
 	On the first page fault in a huge page sized and aligned
 	region, allocate and reserve a huge page sized and aligned
 	block of pages and map a single page from the reservation.
 	Continue to map pages from the reservation on subsequent
 	faults.  Once the number of pages faulted from the reservation
 	is equal to or greater than the promotion_threshold, the
 	reservation is eligible to be promoted to a huge page by
 	khugepaged.

	Currently the default value is HPAGE_PMD_NR / 2.

/sys/kernel/mm/transparent_hugepage/khugepaged/res_pages_collapsed

	The number of THP reservations promoted to huge pages
	by khugepaged.

	This total is also included in the total reported in pages_collapsed.

Counters added to /proc/vmstat:

nr_thp_reserved

	The total number of small pages in existing reservations
	that have not had a page fault since their respective
	reservation were created.  The amount is also included
	in the estimated total memory available as reported
	in MemAvailable in /proc/meminfo.

thp_res_alloc

	Incremented every time the pages for a reservation have been
	successfully allocated to handle a page fault.

thp_res_alloc_failed

	Incremented if pages could not successfully allocated for
	a reservation.

Known Issues:

- COW handling of reservations is insufficient.   While the pages of a
reservation are shared between parent and child after fork, currently
the reservation data structures are not shared and remain with the
parent.  A COW fault by the child allocates a new small page and a new
reservation is not allocated.  A COW fault by the parent allocates a new
small page and releases the reservation if one exists.

- If the pages in a reservation are remapped read-only (e.g. after fork
and child exit), khugepaged will never promote the pages to a huge page
until at least one page is written.

- A reservation is allocated even if the first fault on a pmd range maps
a zero page.  It may be more space efficient to allocate the reservation
on the first write fault.

- To facilitate the shrinker implementation, reservations are kept in a
global struct list_lru.  The list_lru internal implementation puts items
added to a list_lru on to per-node lists based on the node id derived
from the address of the item passed to list_lru_add().  For the current
reservations shrinker implementation this means that reservations will
be placed on the internal per-node list corresponding to the node where
the reservation data structure is located rather than the node where the
reserved pages are located.

- When a partly used reservation is promoted to a huge page, the unused
pages are not charged to a memcg.

- Minor code duplication to support mremap.

Other TBD:
- Performance testing
- shmem support
- Investigate promoting a reservation synchronously during fault handling
  rather than waiting for khugepaged to do the promotion.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/huge_mm.h       |   1 +
 include/linux/khugepaged.h    | 119 +++++++
 include/linux/memcontrol.h    |   5 +
 include/linux/mm_types.h      |   3 +
 include/linux/mmzone.h        |   1 +
 include/linux/vm_event_item.h |   2 +
 kernel/fork.c                 |   2 +
 mm/huge_memory.c              |  29 ++
 mm/khugepaged.c               | 739 ++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c               |  33 ++
 mm/memory.c                   |  37 ++-
 mm/mmap.c                     |  14 +
 mm/mremap.c                   |   5 +
 mm/page_alloc.c               |   5 +
 mm/rmap.c                     |   3 +
 mm/util.c                     |   5 +
 mm/vmstat.c                   |   3 +
 17 files changed, 975 insertions(+), 31 deletions(-)

Comments

Mel Gorman Nov. 9, 2018, 11:07 a.m. UTC | #1
On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> The basic idea as outlined by Mel Gorman in [2] is:
> 
> 1) On first fault in a sufficiently sized range, allocate a huge page
>    sized and aligned block of base pages.  Map the base page
>    corresponding to the fault address and hold the rest of the pages in
>    reserve.
> 2) On subsequent faults in the range, map the pages from the reservation.
> 3) When enough pages have been mapped, promote the mapped pages and
>    remaining pages in the reservation to a huge page.
> 4) When there is memory pressure, release the unused pages from their
>    reservations.
> 
> [1] https://marc.info/?l=linux-mm&m=151631857310828&w=2
> [2] https://lkml.org/lkml/2018/1/25/571
> 

I'm delighted to see someone try tackle this issue.

> To test the idea I wrote a simple test that repeatedly forks children
> where each child attempts to allocate a very large chunk of memory and
> then touch either 1 page or a random number of pages in each huge page
> region of the chunk.  On a machine with 256GB with a test chunk size of
> 16GB the test ends when the 17th child fails to map its chunk.  With THP
> reservations enabled, the test ends when the 118th child fails.
> 

That's a solid test case. I would suggest that primary metrics be fault
latency, memory consumption and successful promotions (as opposed to
successful allocations that we'd use with the existing THP setup).

> Below are some additional implementation details and known issues.
> 
> User-visible files:
> 

These all need to go into Documentation/ although I have no problems
with the fields themselves.

> /sys/kernel/mm/transparent_hugepage/promotion_threshold
> 
> 	The number of base pages within a huge page aligned region that
> 	must be faulted in before the region is eligible for promotion
> 	to a huge page.
> 

Initially, I would suggest making the default 1 and then set a higher
threshold in a subsequent patch. In the patch that introduces it,
show in the changelog that the performance of your code is identical
or close to identical as the existing approach. i.e. Show that the
worst-case scenario is performance-neutral.  Then reduce the threshold
so it can be demonstrated what the performance tradeoff is versus memory
consumption. There is going to be some loss due to the additional code
and the fact that THP is not used immediately for *some* workloads.

> /sys/kernel/mm/transparent_hugepage/khugepaged/res_pages_collapsed
> 
> 	The number of THP reservations promoted to huge pages
> 	by khugepaged.
> 
> 	This total is also included in the total reported in pages_collapsed.
> 

What is that not a vmstat like collapsed?

> Counters added to /proc/vmstat:
> 
> nr_thp_reserved
> 
> 	The total number of small pages in existing reservations
> 	that have not had a page fault since their respective
> 	reservation were created.  The amount is also included
> 	in the estimated total memory available as reported
> 	in MemAvailable in /proc/meminfo.
> 
> thp_res_alloc
> 
> 	Incremented every time the pages for a reservation have been
> 	successfully allocated to handle a page fault.
> 
> thp_res_alloc_failed
> 
> 	Incremented if pages could not successfully allocated for
> 	a reservation.
> 

Seems fair. It might need tracepoints for further debugging in the
future but lets wait until there is an actual problem that can be solved
by a tracepoint first.

> Known Issues:
> 
> - COW handling of reservations is insufficient.   While the pages of a
> reservation are shared between parent and child after fork, currently
> the reservation data structures are not shared and remain with the
> parent.  A COW fault by the child allocates a new small page and a new
> reservation is not allocated.  A COW fault by the parent allocates a new
> small page and releases the reservation if one exists.
> 

Maybe keep the reservations in the parent. I'm thinking specifically
about workloads like redis that fork to take a snapshot that don't
particularly care about THP but do care about the memory overhead due to
sparse addressing of memory.

> - If the pages in a reservation are remapped read-only (e.g. after fork
> and child exit), khugepaged will never promote the pages to a huge page
> until at least one page is written.
> 

I don't consider that a major limitation and I don't think it must be solved
in the first generation of the series. If this is a common occurance,
then it can be dealt with or else workaround by setting the threshold to
1 until it's resolved.

> - A reservation is allocated even if the first fault on a pmd range maps
> a zero page.  It may be more space efficient to allocate the reservation
> on the first write fault.
> 

Agreed but it doesn't kill the idea either. Reserving based on a zero-page
fault works counter to your goal of reducing overall memory consumption
so this should be fixed.

> - To facilitate the shrinker implementation, reservations are kept in a
> global struct list_lru.  The list_lru internal implementation puts items
> added to a list_lru on to per-node lists based on the node id derived
> from the address of the item passed to list_lru_add().  For the current
> reservations shrinker implementation this means that reservations will
> be placed on the internal per-node list corresponding to the node where
> the reservation data structure is located rather than the node where the
> reserved pages are located.
> 

Hmm, not super keen on a shrinker for this given that we probably want
to dump all reservations in the event of memory pressure and doing that
via a shrinker can be "fun".

> Other TBD:
> - Performance testing
> - shmem support
> - Investigate promoting a reservation synchronously during fault handling
>   rather than waiting for khugepaged to do the promotion.
> 

Kirill might disagree but I do not think that shmem support for this is
necessarily critical. THP was anonymous-only for a long time.

> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/linux/huge_mm.h       |   1 +
>  include/linux/khugepaged.h    | 119 +++++++
>  include/linux/memcontrol.h    |   5 +
>  include/linux/mm_types.h      |   3 +
>  include/linux/mmzone.h        |   1 +
>  include/linux/vm_event_item.h |   2 +
>  kernel/fork.c                 |   2 +
>  mm/huge_memory.c              |  29 ++
>  mm/khugepaged.c               | 739 ++++++++++++++++++++++++++++++++++++++++--
>  mm/memcontrol.c               |  33 ++
>  mm/memory.c                   |  37 ++-
>  mm/mmap.c                     |  14 +
>  mm/mremap.c                   |   5 +
>  mm/page_alloc.c               |   5 +
>  mm/rmap.c                     |   3 +
>  mm/util.c                     |   5 +
>  mm/vmstat.c                   |   3 +
>  17 files changed, 975 insertions(+), 31 deletions(-)
> 

This is somewhat intimidating though as a diffstat. Hopefully this can
be broken up. The rest of this is a drive-by review only as I'm about to
travel. Note that there will be others that may not even attempt to read
a patch of that magnitude unless *heavily* motivated by the potential of
the feature.

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fdcb45999b26..a2288f134d5d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -92,6 +92,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
>  extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  
>  extern unsigned long transparent_hugepage_flags;
> +extern unsigned int hugepage_promotion_threshold;
>  
>  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 082d1d2a5216..0011eb656ff3 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -2,6 +2,7 @@
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
>  
> +#include <linux/hashtable.h>
>  #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
>  
>  
> @@ -30,6 +31,64 @@ extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>  	(transparent_hugepage_flags &				\
>  	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
>  
> +struct thp_reservation {
> +	spinlock_t *lock;
> +	unsigned long haddr;
> +	struct page *page;
> +	struct vm_area_struct *vma;
> +	struct hlist_node node;
> +	struct list_head lru;
> +	int nr_unused;
> +};
> +

Document these fields, particularly page. Is page the first fault or the
base index of a reserved huge page for example. You also track VMA which
in the THP-specific case *might* be ok because we are not rmapping it
but you might be designing yourself into a corner there.

> +struct thp_resvs {
> +	atomic_t refcnt;
> +	spinlock_t res_hash_lock;
> +	DECLARE_HASHTABLE(res_hash, 7);
> +};
> +

Also needs documentation. It isn't clear what the relationship between
thp_resvs and thp_reservation is. Is this per-mm, per-VMA etc. As I
write this, I haven't looked at the rest of the patch and thp_resvs
tells me nothing about what this is for. It parses to me as THP Reserve
Versus...... Versus what? So obviously it has some other sensible
meaning.

> +#define	vma_thp_reservations(vma)	((vma)->thp_reservations)
> +
> +static inline void thp_resvs_fork(struct vm_area_struct *vma,
> +				  struct vm_area_struct *pvma)
> +{
> +	// XXX Do not share THP reservations for now
> +	vma->thp_reservations = NULL;
> +}
> +

Consider not sharing THP reservations between parent and child
full-stop. I think fundamentally it breaks if a child can use the
reservation because it means that neither child nor parent can promote
in-place.

> +void thp_resvs_new(struct vm_area_struct *vma);
> +
> +extern void __thp_resvs_put(struct thp_resvs *r);
> +static inline void thp_resvs_put(struct thp_resvs *r)
> +{
> +	if (r)
> +		__thp_resvs_put(r);
> +}
> +

Curious that this could be called with NULL

> +void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
> +				  unsigned long address, int delta);
> +
> +struct page *khugepaged_get_reserved_page(
> +	struct vm_area_struct *vma,
> +	unsigned long address);
> +
> +void khugepaged_reserve(struct vm_area_struct *vma,
> +			unsigned long address);
> +
> +void khugepaged_release_reservation(struct vm_area_struct *vma,
> +				    unsigned long address);
> +
> +void _khugepaged_reservations_fixup(struct vm_area_struct *src,
> +				   struct vm_area_struct *dst);
> +
> +void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
> +				      struct vm_area_struct *next, long adjust);
> +
> +void thp_reservations_mremap(struct vm_area_struct *vma,
> +		unsigned long old_addr, struct vm_area_struct *new_vma,
> +		unsigned long new_addr, unsigned long len,
> +		bool need_rmap_locks);
> +
>  static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
>  	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> @@ -56,6 +115,66 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
>  	return 0;
>  }
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#define	vma_thp_reservations(vma)	NULLo
> +

static inline for type safety check.

> +static inline void thp_resvs_fork(struct vm_area_struct *vma,
> +				  struct vm_area_struct *pvma)
> +{
> +}
> +
> +static inline void thp_resvs_new(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void __thp_resvs_put(struct thp_resvs *r)
> +{
> +}
> +
> +static inline void thp_resvs_put(struct thp_resvs *r)
> +{
> +}
> +
> +static inline void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
> +					      unsigned long address, int delta)
> +{
> +}
> +
> +static inline struct page *khugepaged_get_reserved_page(
> +	struct vm_area_struct *vma,
> +	unsigned long address)
> +{
> +	return NULL;
> +}
> +
> +static inline void khugepaged_reserve(struct vm_area_struct *vma,
> +			       unsigned long address)
> +{
> +}
> +
> +static inline void khugepaged_release_reservation(struct vm_area_struct *vma,
> +				    unsigned long address)
> +{
> +}
> +
> +static inline void _khugepaged_reservations_fixup(struct vm_area_struct *src,
> +				   struct vm_area_struct *dst)
> +{
> +}
> +
> +static inline void _khugepaged_move_reservations_adj(
> +				struct vm_area_struct *prev,
> +				struct vm_area_struct *next, long adjust)
> +{
> +}
> +
> +static inline void thp_reservations_mremap(struct vm_area_struct *vma,
> +		unsigned long old_addr, struct vm_area_struct *new_vma,
> +		unsigned long new_addr, unsigned long len,
> +		bool need_rmap_locks)
> +{
> +}
> +
>  static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
>  	return 0;
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 652f602167df..6342d5f67f75 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -787,6 +787,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void mem_cgroup_collapse_huge_fixup(struct page *head);
>  void mem_cgroup_split_huge_fixup(struct page *head);
>  #endif
>  
> @@ -1087,6 +1088,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	return 0;
>  }
>  
> +static inline void mem_cgroup_collapse_huge_fixup(struct page *head)
> +{
> +}
> +
>  static inline void mem_cgroup_split_huge_fixup(struct page *head)
>  {
>  }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..72a9f431145e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -322,6 +322,9 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	struct thp_resvs *thp_reservations;
> +#endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  } __randomize_layout;
>  

Why is this per-vma and not per address space? I'm not saying that's
good or bad but it makes sense to me that all in-place THP reservations
would inherently be about the address space. Per-vma seems unnecessarily
fine-grained and per-task would be insanity (because threads share an
address space).

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d4b0c79d2924..7deac5a1f25d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -181,6 +181,7 @@ enum node_stat_item {
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
>  	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
> +	NR_THP_RESERVED,	/* Unused small pages in THP reservations */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 47a3441cf4c4..f3d34db7e9d5 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		THP_ZERO_PAGE_ALLOC_FAILED,
>  		THP_SWPOUT,
>  		THP_SWPOUT_FALLBACK,
> +		THP_RES_ALLOC,
> +		THP_RES_ALLOC_FAILED,
>  #endif
>  #ifdef CONFIG_MEMORY_BALLOON
>  		BALLOON_INFLATE,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f0b58479534f..a15d1cda1958 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -527,6 +527,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		if (is_vm_hugetlb_page(tmp))
>  			reset_vma_resv_huge_pages(tmp);
>  
> +		thp_resvs_fork(tmp, mpnt);
> +
>  		/*
>  		 * Link in the new vma and copy the page table entries.
>  		 */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index deed97fba979..aa80b9c54d1c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -57,6 +57,8 @@
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>  
> +unsigned int hugepage_promotion_threshold __read_mostly = HPAGE_PMD_NR / 2;
> +
>  static struct shrinker deferred_split_shrinker;
>  

Mentioned already, default this to 1 initially and then
reduce it.

>  static atomic_t huge_zero_refcount;
> @@ -288,6 +290,28 @@ static ssize_t use_zero_page_store(struct kobject *kobj,
>  static struct kobj_attribute use_zero_page_attr =
>  	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
>  
> +static ssize_t promotion_threshold_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", hugepage_promotion_threshold);
> +}
> +static ssize_t promotion_threshold_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long promotion_threshold;
> +
> +	err = kstrtoul(buf, 10, &promotion_threshold);
> +	if (err || promotion_threshold < 1 || promotion_threshold > HPAGE_PMD_NR)
> +		return -EINVAL;
> +
> +	hugepage_promotion_threshold = promotion_threshold;
> +
> +	return count;
> +}

Look at sysctl.c and see how extra1 and extra2 can be used to set a
range of permitted values without hard-coding like this. You might need
to add a special local variable like "one" and "zero" in that file to
cover HPAGE_PMD_NR. It'll save you a few lines.

> +static struct kobj_attribute promotion_threshold_attr =
> +	__ATTR(promotion_threshold, 0644, promotion_threshold_show, promotion_threshold_store);
> +
>  static ssize_t hpage_pmd_size_show(struct kobject *kobj,
>  		struct kobj_attribute *attr, char *buf)
>  {
> @@ -318,6 +342,7 @@ static ssize_t debug_cow_store(struct kobject *kobj,
>  	&enabled_attr.attr,
>  	&defrag_attr.attr,
>  	&use_zero_page_attr.attr,
> +	&promotion_threshold_attr.attr,
>  	&hpage_pmd_size_attr.attr,
>  #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
>  	&shmem_enabled_attr.attr,

I suggest splitting out any debugging code into a separate patch.

> @@ -670,6 +695,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  	struct page *page;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>  
> +	if (hugepage_promotion_threshold > 1) {
> +		khugepaged_reserve(vma, vmf->address);
> +		return VM_FAULT_FALLBACK;
> +	}
>  	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>  		return VM_FAULT_FALLBACK;
>  	if (unlikely(anon_vma_prepare(vma)))

Add a comment on why exactly it's falling back. It's also not clear at a
glance what happens if this is the fault that reaches the threshold.

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a31d740e6cd1..55d380f8ce71 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -8,6 +8,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
> +#include <linux/shrinker.h>
>  #include <linux/mm_inline.h>
>  #include <linux/kthread.h>
>  #include <linux/khugepaged.h>
> @@ -56,6 +57,7 @@ enum scan_result {
>  /* default scan 8*512 pte (or vmas) every 30 second */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
> +static unsigned int khugepaged_res_pages_collapsed;
>  static unsigned int khugepaged_full_scans;
>  static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
>  /* during fragmentation poll the hugepage allocator once every minute */
> @@ -76,6 +78,445 @@ enum scan_result {
>  
>  static struct kmem_cache *mm_slot_cache __read_mostly;
>  
> +struct list_lru thp_reservations_lru;
> +
> +void thp_resvs_new(struct vm_area_struct *vma)
> +{
> +	struct thp_resvs *new = NULL;
> +
> +	if (hugepage_promotion_threshold == 1)
> +		goto done;
> +
> +	new = kzalloc(sizeof(struct thp_resvs), GFP_KERNEL);
> +	if (!new)
> +		goto done;
> +
> +	atomic_set(&new->refcnt, 1);
> +	spin_lock_init(&new->res_hash_lock);
> +	hash_init(new->res_hash);
> +
> +done:
> +	vma->thp_reservations = new;
> +}
> +

Odd flow. Init the VMA to have this as NULL and then just return if it's
unused instead of initialising it here. Not a biggie but I'm already
hung-up on thinking the reservations should be per-mm.

> +void __thp_resvs_put(struct thp_resvs *resv)
> +{
> +	if (!atomic_dec_and_test(&resv->refcnt))
> +		return;
> +
> +	kfree(resv);
> +}
> +

kfree without clearing the pointer to it looks like a recipe for use-after
free entertainments.

> +static struct thp_reservation *khugepaged_find_reservation(
> +	struct vm_area_struct *vma,
> +	unsigned long address)
> +{
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	struct thp_reservation *res = NULL;
> +
> +	if (!vma->thp_reservations)
> +		return NULL;
> +
> +	hash_for_each_possible(vma->thp_reservations->res_hash, res, node, haddr) {
> +		if (res->haddr == haddr)
> +			break;
> +	}
> +	return res;
> +}
> +
> +static void khugepaged_free_reservation(struct thp_reservation *res)
> +{
> +	struct page *page;
> +	int unused;
> +	int i;
> +
> +	list_lru_del(&thp_reservations_lru, &res->lru);
> +	hash_del(&res->node);
> +	page = res->page;
> +	unused = res->nr_unused;
> +
> +	kfree(res);
> +
> +	if (!PageCompound(page)) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++)
> +			put_page(page + i);
> +
> +		if (unused) {
> +			mod_node_page_state(page_pgdat(page), NR_THP_RESERVED,
> +					    -unused);
> +		}
> +	}
> +}
> +
> +void khugepaged_reserve(struct vm_area_struct *vma, unsigned long address)
> +{
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	struct thp_reservation *res;
> +	struct page *page;
> +	gfp_t gfp;
> +	int i;
> +
> +	if (!vma->thp_reservations)
> +		return;
> +	if (!vma_is_anonymous(vma))
> +		return;
> +	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> +		return;
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +
> +	if (khugepaged_find_reservation(vma, address)) {
> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
> +		return;
> +	}
> +
> +	/*
> +	 * Allocate the equivalent of a huge page but not as a compound page
> +	 */
> +	gfp = GFP_TRANSHUGE_LIGHT & ~__GFP_COMP;
> +	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> +	if (unlikely(!page)) {
> +		count_vm_event(THP_RES_ALLOC_FAILED);
> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
> +		return;
> +	}
> +

Put the cleanup at the end of the function and use gotos to pick what
point you unwind the work at. This will reduce duplicated code and make
it harder to introduce bugs in the cleanup if there are modifications to
this function.

> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		set_page_count(page + i, 1);
> +

split_page()?

> +	res = kzalloc(sizeof(*res), GFP_KERNEL);
> +	if (!res) {
> +		count_vm_event(THP_RES_ALLOC_FAILED);
> +		__free_pages(page, HPAGE_PMD_ORDER);
> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
> +		return;
> +	}
> +
> +	count_vm_event(THP_RES_ALLOC);
> +
> +	res->haddr = haddr;
> +	res->page = page;
> +	res->vma = vma;
> +	res->lock = &vma->thp_reservations->res_hash_lock;
> +	hash_add(vma->thp_reservations->res_hash, &res->node, haddr);
> +
> +	INIT_LIST_HEAD(&res->lru);
> +	list_lru_add(&thp_reservations_lru, &res->lru);
> +
> +	res->nr_unused = HPAGE_PMD_NR;
> +	mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, HPAGE_PMD_NR);
> +
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +
> +	khugepaged_enter(vma, vma->vm_flags);
> +}
> +

I'm undecided on the use of the LRU. I'm not sure it's worthwhile in the
initial phase to shrink reservations in LRU order. I think in the first
generation just blast all reservations if there is memory pressure and
add the shrinker as a separate patch. This has a worse-case scenario on
being no better than what we have today.

I'm partially saying this because shrinkers have historically being
a bit painful and difficult to analyse. Initially you want this to be
performance-neutral at worst and the use of shrinkers means we could
have corner cases where reservations cause page cache or mapped anonymous
pages to be prematurely discarded.

> +struct page *khugepaged_get_reserved_page(struct vm_area_struct *vma,
> +					  unsigned long address)
> +{
> +	struct thp_reservation *res;
> +	struct page *page;
> +
> +	if (!transparent_hugepage_enabled(vma))
> +		return NULL;
> +	if (!vma->thp_reservations)
> +		return NULL;
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +
> +	page = NULL;
> +	res = khugepaged_find_reservation(vma, address);
> +	if (res) {
> +		unsigned long offset = address & ~HPAGE_PMD_MASK;
> +
> +		page = res->page + (offset >> PAGE_SHIFT);
> +		get_page(page);
> +
> +		list_lru_del(&thp_reservations_lru, &res->lru);
> +		list_lru_add(&thp_reservations_lru, &res->lru);
> +
> +		dec_node_page_state(res->page, NR_THP_RESERVED);
> +	}
> +
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +
> +	return page;
> +}
> +
> +void khugepaged_release_reservation(struct vm_area_struct *vma,
> +				    unsigned long address)
> +{
> +	struct thp_reservation *res;
> +
> +	if (!vma->thp_reservations)
> +		return;
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +
> +	res = khugepaged_find_reservation(vma, address);
> +	if (!res)
> +		goto out;
> +
> +	khugepaged_free_reservation(res);
> +
> +out:
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +}
> +
> +/*
> + * Release all reservations covering a range in a VMA.
> + */
> +void __khugepaged_release_reservations(struct vm_area_struct *vma,
> +				       unsigned long addr, unsigned long len)
> +{
> +	struct thp_reservation *res;
> +	struct hlist_node *tmp;
> +	unsigned long eaddr;
> +	int i;
> +
> +	if (!vma->thp_reservations)
> +		return;
> +
> +	eaddr = addr + len;
> +	addr &= HPAGE_PMD_MASK;
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +
> +	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
> +		unsigned long hstart = res->haddr;
> +
> +		if (hstart >= addr && hstart < eaddr)
> +			khugepaged_free_reservation(res);
> +	}
> +
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +}
> +
> +static void __khugepaged_move_reservations(struct vm_area_struct *src,
> +					   struct vm_area_struct *dst,
> +					   unsigned long split_addr,
> +					   bool dst_is_below)
> +{
> +	struct thp_reservation *res;
> +	struct hlist_node *tmp;
> +	bool free_res = false;
> +	int i;
> +
> +	if (!src->thp_reservations)
> +		return;
> +
> +	if (!dst->thp_reservations)
> +		free_res = true;
> +
> +	spin_lock(&src->thp_reservations->res_hash_lock);
> +	if (!free_res)
> +		spin_lock(&dst->thp_reservations->res_hash_lock);
> +
> +	hash_for_each_safe(src->thp_reservations->res_hash, i, tmp, res, node) {
> +		unsigned long hstart = res->haddr;
> +
> +		/*
> +		 * Free the reservation if it straddles a non-aligned
> +		 * split address.
> +		 */
> +		if ((split_addr & ~HPAGE_PMD_MASK) &&
> +		    (hstart == (split_addr & HPAGE_PMD_MASK))) {
> +			khugepaged_free_reservation(res);
> +			continue;
> +		} else if (dst_is_below) {
> +			if (hstart >= split_addr)
> +				continue;
> +		} else if (hstart < split_addr) {
> +			continue;
> +		}
> +
> +		if (unlikely(free_res)) {
> +			khugepaged_free_reservation(res);
> +			continue;
> +		}
> +
> +		hash_del(&res->node);
> +		res->vma = dst;
> +		res->lock = &dst->thp_reservations->res_hash_lock;
> +		hash_add(dst->thp_reservations->res_hash, &res->node, res->haddr);
> +	}
> +
> +	if (!free_res)
> +		spin_unlock(&dst->thp_reservations->res_hash_lock);
> +	spin_unlock(&src->thp_reservations->res_hash_lock);
> +}
> +

Nothing jumped out here but I'm not reading as closely as I should.
Fundamentally any major issue here will result in memory corruption
of a type that will trigger quickly. Performance testing can be driven
by profiles.

> +/*
> + * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
> + */
> +static void take_rmap_locks(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_file)
> +		i_mmap_lock_write(vma->vm_file->f_mapping);
> +	if (vma->anon_vma)
> +		anon_vma_lock_write(vma->anon_vma);
> +}
> +
> +/*
> + * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
> + */
> +static void drop_rmap_locks(struct vm_area_struct *vma)
> +{
> +	if (vma->anon_vma)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (vma->vm_file)
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +}
> +
> +void thp_reservations_mremap(struct vm_area_struct *vma,
> +		unsigned long old_addr, struct vm_area_struct *new_vma,
> +		unsigned long new_addr, unsigned long len,
> +		bool need_rmap_locks)
> +{

Is mremap really worth optimising at this point? Would it be possible
instead to dump all reservations for a range (either VMA or mm) being
mremapped and just move it as normal? If so, do that and make mremap
handling a separate patch. Minimally, it would be nice to know if there
are mremap-intensive workloads that care deeply about preserving THP
reservations.

> +
> +	struct thp_reservation *res;
> +	unsigned long eaddr, offset;
> +	struct hlist_node *tmp;
> +	int i;
> +
> +	if (!vma->thp_reservations)
> +		return;
> +
> +	if (!new_vma->thp_reservations) {
> +		__khugepaged_release_reservations(vma, old_addr, len);
> +		return;
> +	}
> +
> +	/*
> +	 * Release all reservations if they will no longer be aligned
> +	 * in the new address range.
> +	 */
> +	if ((new_addr & ~HPAGE_PMD_MASK) != (old_addr & ~HPAGE_PMD_MASK)) {
> +		__khugepaged_release_reservations(vma, old_addr, len);
> +		return;
> +	}
> +
> +	if (need_rmap_locks)
> +		take_rmap_locks(vma);
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +	spin_lock(&new_vma->thp_reservations->res_hash_lock);
> +
> +	/*
> +	 * If the start or end addresses of the range are not huge page
> +	 * aligned, check for overlapping reservations and release them.
> +	 */
> +	if (old_addr & ~HPAGE_PMD_MASK) {
> +		res = khugepaged_find_reservation(vma, old_addr);
> +		if (res)
> +			khugepaged_free_reservation(res);
> +	}
> +
> +	eaddr = old_addr + len;
> +	if (eaddr & ~HPAGE_PMD_MASK) {
> +		res = khugepaged_find_reservation(vma, eaddr);
> +		if (res)
> +			khugepaged_free_reservation(res);
> +	}
> +
> +	offset = new_addr - old_addr;
> +
> +	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
> +		unsigned long hstart = res->haddr;
> +
> +		if (hstart < old_addr || hstart >= eaddr)
> +			continue;
> +
> +		hash_del(&res->node);
> +		res->lock = &new_vma->thp_reservations->res_hash_lock;
> +		res->vma = new_vma;
> +		res->haddr += offset;
> +		hash_add(new_vma->thp_reservations->res_hash, &res->node, res->haddr);
> +	}
> +
> +	spin_unlock(&new_vma->thp_reservations->res_hash_lock);
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +
> +	if (need_rmap_locks)
> +		drop_rmap_locks(vma);
> +
> +}
> +
> +/*
> + * Handle moving reservations for VMA merge cases 1, 6, 7, and 8 (see
> + * comments above vma_merge()) and when splitting a VMA.
> + *
> + * src is expected to be aligned with the start or end of dst
> + * src may be contained by dst or directly adjacent to dst
> + * Move all reservations if src is contained by dst.
> + * Otherwise move reservations no longer in the range of src
> + * to dst.
> + */
> +void _khugepaged_reservations_fixup(struct vm_area_struct *src,
> +				    struct vm_area_struct *dst)
> +{
> +	bool dst_is_below = false;
> +	unsigned long split_addr;
> +
> +	if (src->vm_start == dst->vm_start || src->vm_end == dst->vm_end) {
> +		split_addr = 0;
> +	} else if (src->vm_start == dst->vm_end) {
> +		split_addr = src->vm_start;
> +		dst_is_below = true;
> +	} else if (src->vm_end == dst->vm_start) {
> +		split_addr = src->vm_end;
> +	} else {
> +		WARN_ON(1);
> +		return;
> +	}
> +
> +	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
> +}
> +
> +/*
> + * Handle moving reservations for VMA merge cases 4 and 5 (see comments
> + * above vma_merge()).
> + */
> +void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
> +				       struct vm_area_struct *next, long adjust)
> +{
> +	unsigned long split_addr = next->vm_start;
> +	struct vm_area_struct *src, *dst;
> +	bool dst_is_below;
> +
> +	if (adjust < 0) {
> +		src = prev;
> +		dst = next;
> +		dst_is_below = false;
> +	} else {
> +		src = next;
> +		dst = prev;
> +		dst_is_below = true;
> +	}
> +
> +	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
> +}
> +
> +void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
> +				unsigned long address, int delta)
> +{
> +	struct thp_reservation *res;
> +
> +	if (!vma->thp_reservations)
> +		return;
> +
> +	spin_lock(&vma->thp_reservations->res_hash_lock);
> +
> +	res = khugepaged_find_reservation(vma, address);
> +	if (res) {
> +		WARN_ON((res->nr_unused == 0) || (res->nr_unused + delta < 0));
> +		if (res->nr_unused + delta >= 0)
> +			res->nr_unused += delta;
> +	}
> +
> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
> +}
> +
>  /**
>   * struct mm_slot - hash lookup from mm to mm_slot
>   * @hash: hash collision list
> @@ -197,6 +638,15 @@ static ssize_t pages_collapsed_show(struct kobject *kobj,
>  static struct kobj_attribute pages_collapsed_attr =
>  	__ATTR_RO(pages_collapsed);
>  
> +static ssize_t res_pages_collapsed_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr,
> +				    char *buf)
> +{
> +	return sprintf(buf, "%u\n", khugepaged_res_pages_collapsed);
> +}
> +static struct kobj_attribute res_pages_collapsed_attr =
> +	__ATTR_RO(res_pages_collapsed);
> +
>  static ssize_t full_scans_show(struct kobject *kobj,
>  			       struct kobj_attribute *attr,
>  			       char *buf)
> @@ -292,6 +742,7 @@ static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
>  	&scan_sleep_millisecs_attr.attr,
>  	&alloc_sleep_millisecs_attr.attr,
>  	&khugepaged_max_ptes_swap_attr.attr,
> +	&res_pages_collapsed_attr.attr,
>  	NULL,
>  };
>  
> @@ -342,8 +793,96 @@ int hugepage_madvise(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +/*
> + * thp_lru_free_reservation() - shrinker callback to release THP reservations
> + * and free unused pages
> + *
> + * Called from list_lru_shrink_walk() in thp_resvs_shrink_scan() to free
> + * up pages when the system is under memory pressure.
> + */
> +enum lru_status thp_lru_free_reservation(struct list_head *item,
> +					 struct list_lru_one *lru,
> +					 spinlock_t *lock,
> +					 void *cb_arg)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct thp_reservation *res = container_of(item,
> +						   struct thp_reservation,
> +						   lru);
> +	struct page *page;
> +	int unused;
> +	int i;
> +
> +	if (!spin_trylock(res->lock))
> +		goto err_get_res_lock_failed;
> +
> +	mm = res->vma->vm_mm;
> +	if (!mmget_not_zero(mm))
> +		goto err_mmget;
> +	if (!down_write_trylock(&mm->mmap_sem))
> +		goto err_down_write_mmap_sem_failed;
> +
> +	list_lru_isolate(lru, item);
> +	spin_unlock(lock);
> +
> +	hash_del(&res->node);
> +
> +	up_write(&mm->mmap_sem);
> +	mmput(mm);
> +
> +	spin_unlock(res->lock);
> +
> +	page = res->page;
> +	unused = res->nr_unused;
> +
> +	kfree(res);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		put_page(page + i);
> +
> +	if (unused)
> +		mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, -unused);
> +
> +	spin_lock(lock);
> +
> +	return LRU_REMOVED_RETRY;
> +
> +err_down_write_mmap_sem_failed:
> +	mmput_async(mm);
> +err_mmget:
> +	spin_unlock(res->lock);
> +err_get_res_lock_failed:
> +	return LRU_SKIP;
> +}
> +
> +static unsigned long
> +thp_resvs_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long ret = list_lru_shrink_count(&thp_reservations_lru, sc);
> +	return ret;
> +}
> +
> +static unsigned long
> +thp_resvs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long ret;
> +
> +	ret = list_lru_shrink_walk(&thp_reservations_lru, sc,
> +				   thp_lru_free_reservation, NULL);
> +	return ret;
> +}
> +
> +static struct shrinker thp_resvs_shrinker = {
> +	.count_objects = thp_resvs_shrink_count,
> +	.scan_objects = thp_resvs_shrink_scan,
> +	.seeks = DEFAULT_SEEKS,
> +	.flags = SHRINKER_NUMA_AWARE,
> +};
> +

As before, I think the shrinker stuff should be in separate patches. Get
the core approach working first and then add bits on top. If nothing
else, it means that a mistake in the shrinker behaviour will not kill
the overall idea just because it's a monolithic patch.

>  int __init khugepaged_init(void)
>  {
> +	int err;
> +
>  	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
>  					  sizeof(struct mm_slot),
>  					  __alignof__(struct mm_slot), 0, NULL);
> @@ -354,6 +893,17 @@ int __init khugepaged_init(void)
>  	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>  	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
>  
> +	// XXX should be in hugepage_init() so shrinker can be
> +	// unregistered if necessary.
> +	err = list_lru_init(&thp_reservations_lru);
> +	if (err == 0) {
> +		err = register_shrinker(&thp_resvs_shrinker);
> +		if (err) {
> +			list_lru_destroy(&thp_reservations_lru);
> +			return err;
> +		}
> +	}
> +
>  	return 0;
>  }
>  
> @@ -519,12 +1069,14 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte)
>  
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long address,
> -					pte_t *pte)
> +					pte_t *pte,
> +					struct thp_reservation *res)
>  {
>  	struct page *page = NULL;
>  	pte_t *_pte;
>  	int none_or_zero = 0, result = 0, referenced = 0;
>  	bool writable = false;
> +	bool is_reserved = res ? true : false;
>  
>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>  	     _pte++, address += PAGE_SIZE) {
> @@ -573,7 +1125,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 * The page must only be referenced by the scanned process
>  		 * and page swap cache.
>  		 */
> -		if (page_count(page) != 1 + PageSwapCache(page)) {
> +		if (page_count(page) != 1 + PageSwapCache(page) + is_reserved) {
>  			unlock_page(page);
>  			result = SCAN_PAGE_COUNT;
>  			goto out;
> @@ -631,6 +1183,68 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static void __collapse_huge_page_convert(pte_t *pte, struct page *page,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address,
> +				      spinlock_t *ptl)
> +{
> +	struct page *head = page;
> +	pte_t *_pte;
> +
> +	set_page_count(page, 1);
> +
> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +				_pte++, page++, address += PAGE_SIZE) {
> +		pte_t pteval = *_pte;
> +
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			clear_user_highpage(page, address);

Needs commenting. At a glance it's not clear what happens if the entire
range was zero pfns. Does that convert into one large allocated huge
page? If so, it goes counter to the goal of reducing memory usage.

> +			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> +			if (is_zero_pfn(pte_pfn(pteval))) {
> +				/*
> +				 * ptl mostly unnecessary.
> +				 */
> +				spin_lock(ptl);
> +				/*
> +				 * paravirt calls inside pte_clear here are
> +				 * superfluous.
> +				 */
> +				pte_clear(vma->vm_mm, address, _pte);
> +				spin_unlock(ptl);
> +			}
> +			dec_node_page_state(page, NR_THP_RESERVED);
> +		} else {
> +			dec_node_page_state(page, NR_ISOLATED_ANON +
> +					    page_is_file_cache(page));
> +			unlock_page(page);
> +			ClearPageActive(page);
> +			/*
> +			 * ptl mostly unnecessary, but preempt has to
> +			 * be disabled to update the per-cpu stats
> +			 * inside page_remove_rmap().
> +			 */
> +			spin_lock(ptl);
> +			/*
> +			 * paravirt calls inside pte_clear here are
> +			 * superfluous.
> +			 */
> +			pte_clear(vma->vm_mm, address, _pte);
> +			page_remove_rmap(page, false);
> +			spin_unlock(ptl);
> +			/*
> +			 * Swapping out a page in a reservation
> +			 * causes the reservation to be released
> +			 * therefore no pages in a reservation
> +			 * should be in swapcache.
> +			 */
> +			WARN_ON(PageSwapCache(page));
> +		}
> +	}
> +

The page table handling needs more careful review than I'm giving at the
moment. It's another reason why the core approach should be as simple as
possible as the patch has too much in it right now to keep it all in
mind.

I ran out of time at this point. Overall I don't see anything in there that
fundamentally kills the idea and I think it's workable. It does need to
be broken up though as reviewing a monolthic patch is going to be harder
to review and a single mistake at the edges (like shrinkers or mremap)
takes everything else with it. Focus on getting the reservation part right,
the zero page handling and the collapse. For any corner case, dump all the
reservations for a VMA or address space and let kcompactd clean it up later.
It can then be evaluated that it's performance-neutral relative to the
existing code and identify what workloads are helped by being clever.

Thanks!
Kirill A. Shutemov Nov. 9, 2018, 12:13 p.m. UTC | #2
On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> The basic idea as outlined by Mel Gorman in [2] is:
> 
> 1) On first fault in a sufficiently sized range, allocate a huge page
>    sized and aligned block of base pages.  Map the base page
>    corresponding to the fault address and hold the rest of the pages in
>    reserve.
> 2) On subsequent faults in the range, map the pages from the reservation.
> 3) When enough pages have been mapped, promote the mapped pages and
>    remaining pages in the reservation to a huge page.
> 4) When there is memory pressure, release the unused pages from their
>    reservations.

I haven't yet read the patch in details, but I'm skeptical about the
approach in general for few reasons:

- PTE page table retracting to replace it with huge PMD entry requires
  down_write(mmap_sem). It makes the approach not practical for many
  multi-threaded workloads.

  I don't see a way to avoid exclusive lock here. I will be glad to
  be proved otherwise.

- The promotion will also require TLB flush which might be prohibitively
  slow on big machines.

- Short living processes will fail to benefit from THP with the policy,
  even with plenty of free memory in the system: no time to promote to THP
  or, with synchronous promotion, cost will overweight the benefit.

The goal to reduce memory overhead of THP is admirable, but we need to be
careful not to kill THP benefit itself. The approach will reduce number of
THP mapped in the system and/or shift their allocation to later stage of
process lifetime.

The only way I see it can be useful is if it will be possible to apply the
policy on per-VMA basis. It will be very useful for malloc()
implementations, for instance. But as a global policy it's no-go to me.

Prove me wrong with performance data. :)
Mel Gorman Nov. 9, 2018, 1:11 p.m. UTC | #3
On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> > The basic idea as outlined by Mel Gorman in [2] is:
> > 
> > 1) On first fault in a sufficiently sized range, allocate a huge page
> >    sized and aligned block of base pages.  Map the base page
> >    corresponding to the fault address and hold the rest of the pages in
> >    reserve.
> > 2) On subsequent faults in the range, map the pages from the reservation.
> > 3) When enough pages have been mapped, promote the mapped pages and
> >    remaining pages in the reservation to a huge page.
> > 4) When there is memory pressure, release the unused pages from their
> >    reservations.
> 
> I haven't yet read the patch in details, but I'm skeptical about the
> approach in general for few reasons:
> 
> - PTE page table retracting to replace it with huge PMD entry requires
>   down_write(mmap_sem). It makes the approach not practical for many
>   multi-threaded workloads.
> 
>   I don't see a way to avoid exclusive lock here. I will be glad to
>   be proved otherwise.
> 

That problem is somewhat fundamental to the mmap_sem itself and
conceivably it could be alleviated by range-locking (if that gets
completed). The other thing to bear in mind is the timing. If the
promotion is in-place due to reservations, there isn't the allocation
overhead and the hold times *should* be short.

> - The promotion will also require TLB flush which might be prohibitively
>   slow on big machines.
> 

Which may be offset by either a) setting the threshold to 1 in cases
where the promtotion should always be immediate or b) offset by reduced
memory consumption potentially avoiding premature reclaim in others.

> - Short living processes will fail to benefit from THP with the policy,
>   even with plenty of free memory in the system: no time to promote to THP
>   or, with synchronous promotion, cost will overweight the benefit.
> 

Short-lived processes are also not going to be dominated by the TLB
refill cost so I think that's somewhat unfair. Potential means of
mediating this include per-task promotion thresholds via either prctl or
a task-wide policy inherited across exec

> The goal to reduce memory overhead of THP is admirable, but we need to be
> careful not to kill THP benefit itself. The approach will reduce number of
> THP mapped in the system and/or shift their allocation to later stage of
> process lifetime.
> 

While I agree with you, I also had suggested in review that the
threshold initially be set to 1 so it can be experiemented with by
people who are more concerned about memory consumption than reduced TLB
misses. While the general idea is not free of problems, I believe they
are fixable rather than fundamental.

> Prove me wrong with performance data. :)
> 

Agreed that this should be accompanied by performance data but I think I
laid out a reasonable approach here. If the default is a threshold of 1
and that is shown to be performance-neutral then incremental progress
can be made as opposed to an "all or nothing" approach.
Zi Yan Nov. 9, 2018, 3:34 p.m. UTC | #4
On 9 Nov 2018, at 8:11, Mel Gorman wrote:

> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>>> The basic idea as outlined by Mel Gorman in [2] is:
>>>
>>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>>    sized and aligned block of base pages.  Map the base page
>>>    corresponding to the fault address and hold the rest of the pages in
>>>    reserve.
>>> 2) On subsequent faults in the range, map the pages from the reservation.
>>> 3) When enough pages have been mapped, promote the mapped pages and
>>>    remaining pages in the reservation to a huge page.
>>> 4) When there is memory pressure, release the unused pages from their
>>>    reservations.
>>
>> I haven't yet read the patch in details, but I'm skeptical about the
>> approach in general for few reasons:
>>
>> - PTE page table retracting to replace it with huge PMD entry requires
>>   down_write(mmap_sem). It makes the approach not practical for many
>>   multi-threaded workloads.
>>
>>   I don't see a way to avoid exclusive lock here. I will be glad to
>>   be proved otherwise.
>>
>
> That problem is somewhat fundamental to the mmap_sem itself and
> conceivably it could be alleviated by range-locking (if that gets
> completed). The other thing to bear in mind is the timing. If the
> promotion is in-place due to reservations, there isn't the allocation
> overhead and the hold times *should* be short.
>

Is it possible to convert all these PTEs to migration entries during
the promotion and replace them with a huge PMD entry afterwards?
AFAIK, migrating pages does not require holding a mmap_sem.
Basically, it will act like migrating 512 base pages to a THP without
actually doing the page copy.

--
Best Regards
Yan Zi
Andrea Arcangeli Nov. 9, 2018, 7:51 p.m. UTC | #5
Hello,

On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> > The basic idea as outlined by Mel Gorman in [2] is:
> > 
> > 1) On first fault in a sufficiently sized range, allocate a huge page
> >    sized and aligned block of base pages.  Map the base page
> >    corresponding to the fault address and hold the rest of the pages in
> >    reserve.
> > 2) On subsequent faults in the range, map the pages from the reservation.
> > 3) When enough pages have been mapped, promote the mapped pages and
> >    remaining pages in the reservation to a huge page.
> > 4) When there is memory pressure, release the unused pages from their
> >    reservations.
> 
> I haven't yet read the patch in details, but I'm skeptical about the
> approach in general for few reasons:
> 
> - PTE page table retracting to replace it with huge PMD entry requires
>   down_write(mmap_sem). It makes the approach not practical for many
>   multi-threaded workloads.
> 
>   I don't see a way to avoid exclusive lock here. I will be glad to
>   be proved otherwise.
> 
> - The promotion will also require TLB flush which might be prohibitively
>   slow on big machines.
> 
> - Short living processes will fail to benefit from THP with the policy,
>   even with plenty of free memory in the system: no time to promote to THP
>   or, with synchronous promotion, cost will overweight the benefit.
> 
> The goal to reduce memory overhead of THP is admirable, but we need to be
> careful not to kill THP benefit itself. The approach will reduce number of
> THP mapped in the system and/or shift their allocation to later stage of
> process lifetime.
> 
> The only way I see it can be useful is if it will be possible to apply the
> policy on per-VMA basis. It will be very useful for malloc()
> implementations, for instance. But as a global policy it's no-go to me.

I'm also skeptical about this: the current design is quite
intentional. It's not a bug but a feature that we're not doing the
promotion.

Part of the tradeoff with THP is to use more RAM to save CPU, when you
use less RAM you're inherently already wasting some CPU just for the
reservation management and you don't get the immediate TLB benefit
anymore either.

And if you're in the camp that is concerned about the use of more RAM
or/and about the higher latency of COW faults, I'm afraid the
intermediate solution will be still slower than the already available
MADV_NOHUGEPAGE or enabled=madvise.

Apps like redis that will use more RAM during snapshot and that are
slowed down with THP needs to simply use MADV_NOHUGEPAGE which already
exists as an madvise from the very first kernel that supported
THP-anon. Same thing for other apps that use more RAM with THP and
that are on the losing end of the tradeoff.

Now about the implementation: the whole point of the reservation
complexity is to skip the khugepaged copy, so it can collapse in
place. Is skipping the copy worth it? Isn't the big cost the IPI
anyway to avoid leaving two simultaneous TLB mappings of different
granularity?

khugepaged is already tunable to specify a ratio of memory in use to
avoid wasting memory
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.

If you set max_ptes_none to half the default value, it'll only promote
pages that are half mapped, reducing the memory waste to 50% of what
it is by default.

So if you are ok to copy the memory that you promote to THP, you'd
just need a global THP mode to avoid allocating THP even when they're
available during the page fault (while still allowing khugepaged to
collapse hugepages in the background), and then reduce max_ptes_none
to get the desired promotion ratio.

Doing the copy will avoid the reservation there will be also more THP
available to use for those khugepaged users without losing them in
reservations. You won't have to worry about what to do when there's
memory pressure because you won't have to undo the reservation because
there was no reservation in the first place. That problem also goes
away with the copy.

So it sounds like you could achieve a similar runtime behavior with
much less complexity by reducing max_ptes_none and by doing the copy
and dropping all reservation code.

> Prove me wrong with performance data. :)

Same here.

Thanks,
Andrea
Anthony Yznaga Nov. 9, 2018, 11:37 p.m. UTC | #6
On 11/09/2018 03:07 AM, Mel Gorman wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>> The basic idea as outlined by Mel Gorman in [2] is:
>>
>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>    sized and aligned block of base pages.  Map the base page
>>    corresponding to the fault address and hold the rest of the pages in
>>    reserve.
>> 2) On subsequent faults in the range, map the pages from the reservation.
>> 3) When enough pages have been mapped, promote the mapped pages and
>>    remaining pages in the reservation to a huge page.
>> 4) When there is memory pressure, release the unused pages from their
>>    reservations.
>>
>> [1] https://marc.info/?l=linux-mm&m=151631857310828&w=2
>> [2] https://lkml.org/lkml/2018/1/25/571
>>
> I'm delighted to see someone try tackle this issue.
>
>> To test the idea I wrote a simple test that repeatedly forks children
>> where each child attempts to allocate a very large chunk of memory and
>> then touch either 1 page or a random number of pages in each huge page
>> region of the chunk.  On a machine with 256GB with a test chunk size of
>> 16GB the test ends when the 17th child fails to map its chunk.  With THP
>> reservations enabled, the test ends when the 118th child fails.
>>
> That's a solid test case. I would suggest that primary metrics be fault
> latency, memory consumption and successful promotions (as opposed to
> successful allocations that we'd use with the existing THP setup).
Okay.

>
>> Below are some additional implementation details and known issues.
>>
>> User-visible files:
>>
> These all need to go into Documentation/ although I have no problems
> with the fields themselves.
I'll add a Documentation patch.

>
>> /sys/kernel/mm/transparent_hugepage/promotion_threshold
>>
>> 	The number of base pages within a huge page aligned region that
>> 	must be faulted in before the region is eligible for promotion
>> 	to a huge page.
>>
> Initially, I would suggest making the default 1 and then set a higher
> threshold in a subsequent patch. In the patch that introduces it,
> show in the changelog that the performance of your code is identical
> or close to identical as the existing approach. i.e. Show that the
> worst-case scenario is performance-neutral.  Then reduce the threshold
> so it can be demonstrated what the performance tradeoff is versus memory
> consumption. There is going to be some loss due to the additional code
> and the fact that THP is not used immediately for *some* workloads.
Okay, I'll do that.
>
>> /sys/kernel/mm/transparent_hugepage/khugepaged/res_pages_collapsed
>>
>> 	The number of THP reservations promoted to huge pages
>> 	by khugepaged.
>>
>> 	This total is also included in the total reported in pages_collapsed.
>>
> What is that not a vmstat like collapsed?
I'm not sure I understand.  There isn't a collapsed vmstat, and pages_collapsed is a sysfs file. 
>
>> Counters added to /proc/vmstat:
>>
>> nr_thp_reserved
>>
>> 	The total number of small pages in existing reservations
>> 	that have not had a page fault since their respective
>> 	reservation were created.  The amount is also included
>> 	in the estimated total memory available as reported
>> 	in MemAvailable in /proc/meminfo.
>>
>> thp_res_alloc
>>
>> 	Incremented every time the pages for a reservation have been
>> 	successfully allocated to handle a page fault.
>>
>> thp_res_alloc_failed
>>
>> 	Incremented if pages could not successfully allocated for
>> 	a reservation.
>>
> Seems fair. It might need tracepoints for further debugging in the
> future but lets wait until there is an actual problem that can be solved
> by a tracepoint first.
>
>> Known Issues:
>>
>> - COW handling of reservations is insufficient.   While the pages of a
>> reservation are shared between parent and child after fork, currently
>> the reservation data structures are not shared and remain with the
>> parent.  A COW fault by the child allocates a new small page and a new
>> reservation is not allocated.  A COW fault by the parent allocates a new
>> small page and releases the reservation if one exists.
>>
> Maybe keep the reservations in the parent. I'm thinking specifically
> about workloads like redis that fork to take a snapshot that don't
> particularly care about THP but do care about the memory overhead due to
> sparse addressing of memory.

Okay.

>
>> - If the pages in a reservation are remapped read-only (e.g. after fork
>> and child exit), khugepaged will never promote the pages to a huge page
>> until at least one page is written.
>>
> I don't consider that a major limitation and I don't think it must be solved
> in the first generation of the series. If this is a common occurance,
> then it can be dealt with or else workaround by setting the threshold to
> 1 until it's resolved.
I agree that it's not a major problem.  One side-effect of this that I observed
was that a fully populated reservation could sit around for essentially forever
on the LRU list and potentially impede real progress by the shrinker, but I
think this could easily be addressed by removing a reservation from the LRU
list when it becomes fully populated.

>
>> - A reservation is allocated even if the first fault on a pmd range maps
>> a zero page.  It may be more space efficient to allocate the reservation
>> on the first write fault.
>>
> Agreed but it doesn't kill the idea either. Reserving based on a zero-page
> fault works counter to your goal of reducing overall memory consumption
> so this should be fixed.
Okay.

>
>> - To facilitate the shrinker implementation, reservations are kept in a
>> global struct list_lru.  The list_lru internal implementation puts items
>> added to a list_lru on to per-node lists based on the node id derived
>> from the address of the item passed to list_lru_add().  For the current
>> reservations shrinker implementation this means that reservations will
>> be placed on the internal per-node list corresponding to the node where
>> the reservation data structure is located rather than the node where the
>> reserved pages are located.
>>
> Hmm, not super keen on a shrinker for this given that we probably want
> to dump all reservations in the event of memory pressure and doing that
> via a shrinker can be "fun".

I'll implement a simple release of all reservations for comparison.

>> Other TBD:
>> - Performance testing
>> - shmem support
>> - Investigate promoting a reservation synchronously during fault handling
>>   rather than waiting for khugepaged to do the promotion.
>>
> Kirill might disagree but I do not think that shmem support for this is
> necessarily critical. THP was anonymous-only for a long time.
Okay.  At minimum it sounds like I should prove a benefit with anonymous
THP first.

>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>  include/linux/huge_mm.h       |   1 +
>>  include/linux/khugepaged.h    | 119 +++++++
>>  include/linux/memcontrol.h    |   5 +
>>  include/linux/mm_types.h      |   3 +
>>  include/linux/mmzone.h        |   1 +
>>  include/linux/vm_event_item.h |   2 +
>>  kernel/fork.c                 |   2 +
>>  mm/huge_memory.c              |  29 ++
>>  mm/khugepaged.c               | 739 ++++++++++++++++++++++++++++++++++++++++--
>>  mm/memcontrol.c               |  33 ++
>>  mm/memory.c                   |  37 ++-
>>  mm/mmap.c                     |  14 +
>>  mm/mremap.c                   |   5 +
>>  mm/page_alloc.c               |   5 +
>>  mm/rmap.c                     |   3 +
>>  mm/util.c                     |   5 +
>>  mm/vmstat.c                   |   3 +
>>  17 files changed, 975 insertions(+), 31 deletions(-)
>>
> This is somewhat intimidating though as a diffstat. Hopefully this can
> be broken up. The rest of this is a drive-by review only as I'm about to
> travel. Note that there will be others that may not even attempt to read
> a patch of that magnitude unless *heavily* motivated by the potential of
> the feature.
Understood.  I'll break things up before I submit anything further.

>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index fdcb45999b26..a2288f134d5d 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -92,6 +92,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
>>  extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>>  
>>  extern unsigned long transparent_hugepage_flags;
>> +extern unsigned int hugepage_promotion_threshold;
>>  
>>  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>  {
>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>> index 082d1d2a5216..0011eb656ff3 100644
>> --- a/include/linux/khugepaged.h
>> +++ b/include/linux/khugepaged.h
>> @@ -2,6 +2,7 @@
>>  #ifndef _LINUX_KHUGEPAGED_H
>>  #define _LINUX_KHUGEPAGED_H
>>  
>> +#include <linux/hashtable.h>
>>  #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
>>  
>>  
>> @@ -30,6 +31,64 @@ extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>>  	(transparent_hugepage_flags &				\
>>  	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
>>  
>> +struct thp_reservation {
>> +	spinlock_t *lock;
>> +	unsigned long haddr;
>> +	struct page *page;
>> +	struct vm_area_struct *vma;
>> +	struct hlist_node node;
>> +	struct list_head lru;
>> +	int nr_unused;
>> +};
>> +
> Document these fields, particularly page. Is page the first fault or the
> base index of a reserved huge page for example. You also track VMA which
> in the THP-specific case *might* be ok because we are not rmapping it
> but you might be designing yourself into a corner there.
I can get rid of the VMA pointer.  It's only used to simplify acquiring the VMA
pointer in collapse_huge_page() and by the shrinker code to get the mm pointer.

>
>> +struct thp_resvs {
>> +	atomic_t refcnt;
>> +	spinlock_t res_hash_lock;
>> +	DECLARE_HASHTABLE(res_hash, 7);
>> +};
>> +
> Also needs documentation. It isn't clear what the relationship between
> thp_resvs and thp_reservation is. Is this per-mm, per-VMA etc. As I
> write this, I haven't looked at the rest of the patch and thp_resvs
> tells me nothing about what this is for. It parses to me as THP Reserve
> Versus...... Versus what? So obviously it has some other sensible
> meaning.
Yeah, resvs isn't great.  It's supposed to be the plural of resv.  I'll come up
with something else.  A thp_resvs encompasses the per-VMA hashtable
that the per-reservation thp_reservation structures are hashed into.
>
>> +#define	vma_thp_reservations(vma)	((vma)->thp_reservations)
>> +
>> +static inline void thp_resvs_fork(struct vm_area_struct *vma,
>> +				  struct vm_area_struct *pvma)
>> +{
>> +	// XXX Do not share THP reservations for now
>> +	vma->thp_reservations = NULL;
>> +}
>> +
> Consider not sharing THP reservations between parent and child
> full-stop. I think fundamentally it breaks if a child can use the
> reservation because it means that neither child nor parent can promote
> in-place.
khugepaged already skips over pages while they are shared by child
and parent.  What I was envisioning as a more complete behavior
was to allocate a new reservation for the process that does the first
COW with the existing reservation then remaining with the other
process.  Additional COWs then populate the now separate
reservations.

>
>> +void thp_resvs_new(struct vm_area_struct *vma);
>> +
>> +extern void __thp_resvs_put(struct thp_resvs *r);
>> +static inline void thp_resvs_put(struct thp_resvs *r)
>> +{
>> +	if (r)
>> +		__thp_resvs_put(r);
>> +}
>> +
> Curious that this could be called with NULL

It's called when a VMA is freed.  vma->thp_reservations is passed
in as the argument and the value will be NULL for VMAs that don't
support THP reservations or VMAs that were created when
promotion_threshold==1.  Maybe it would be clearer to pass a
VMA pointer instead.

>
>> +void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
>> +				  unsigned long address, int delta);
>> +
>> +struct page *khugepaged_get_reserved_page(
>> +	struct vm_area_struct *vma,
>> +	unsigned long address);
>> +
>> +void khugepaged_reserve(struct vm_area_struct *vma,
>> +			unsigned long address);
>> +
>> +void khugepaged_release_reservation(struct vm_area_struct *vma,
>> +				    unsigned long address);
>> +
>> +void _khugepaged_reservations_fixup(struct vm_area_struct *src,
>> +				   struct vm_area_struct *dst);
>> +
>> +void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
>> +				      struct vm_area_struct *next, long adjust);
>> +
>> +void thp_reservations_mremap(struct vm_area_struct *vma,
>> +		unsigned long old_addr, struct vm_area_struct *new_vma,
>> +		unsigned long new_addr, unsigned long len,
>> +		bool need_rmap_locks);
>> +
>>  static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>>  {
>>  	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
>> @@ -56,6 +115,66 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
>>  	return 0;
>>  }
>>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +#define	vma_thp_reservations(vma)	NULLo
>> +
> static inline for type safety check.
Okay.

>
>> +static inline void thp_resvs_fork(struct vm_area_struct *vma,
>> +				  struct vm_area_struct *pvma)
>> +{
>> +}
>> +
>> +static inline void thp_resvs_new(struct vm_area_struct *vma)
>> +{
>> +}
>> +
>> +static inline void __thp_resvs_put(struct thp_resvs *r)
>> +{
>> +}
>> +
>> +static inline void thp_resvs_put(struct thp_resvs *r)
>> +{
>> +}
>> +
>> +static inline void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
>> +					      unsigned long address, int delta)
>> +{
>> +}
>> +
>> +static inline struct page *khugepaged_get_reserved_page(
>> +	struct vm_area_struct *vma,
>> +	unsigned long address)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline void khugepaged_reserve(struct vm_area_struct *vma,
>> +			       unsigned long address)
>> +{
>> +}
>> +
>> +static inline void khugepaged_release_reservation(struct vm_area_struct *vma,
>> +				    unsigned long address)
>> +{
>> +}
>> +
>> +static inline void _khugepaged_reservations_fixup(struct vm_area_struct *src,
>> +				   struct vm_area_struct *dst)
>> +{
>> +}
>> +
>> +static inline void _khugepaged_move_reservations_adj(
>> +				struct vm_area_struct *prev,
>> +				struct vm_area_struct *next, long adjust)
>> +{
>> +}
>> +
>> +static inline void thp_reservations_mremap(struct vm_area_struct *vma,
>> +		unsigned long old_addr, struct vm_area_struct *new_vma,
>> +		unsigned long new_addr, unsigned long len,
>> +		bool need_rmap_locks)
>> +{
>> +}
>> +
>>  static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>>  {
>>  	return 0;
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 652f602167df..6342d5f67f75 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -787,6 +787,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>>  }
>>  
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +void mem_cgroup_collapse_huge_fixup(struct page *head);
>>  void mem_cgroup_split_huge_fixup(struct page *head);
>>  #endif
>>  
>> @@ -1087,6 +1088,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>  	return 0;
>>  }
>>  
>> +static inline void mem_cgroup_collapse_huge_fixup(struct page *head)
>> +{
>> +}
>> +
>>  static inline void mem_cgroup_split_huge_fixup(struct page *head)
>>  {
>>  }
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 5ed8f6292a53..72a9f431145e 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -322,6 +322,9 @@ struct vm_area_struct {
>>  #ifdef CONFIG_NUMA
>>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>>  #endif
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	struct thp_resvs *thp_reservations;
>> +#endif
>>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>  } __randomize_layout;
>>  
> Why is this per-vma and not per address space? I'm not saying that's
> good or bad but it makes sense to me that all in-place THP reservations
> would inherently be about the address space. Per-vma seems unnecessarily
> fine-grained and per-task would be insanity (because threads share an
> address space).
One concern I had with per address space was the potential for
for increased lock contention when looking up and adding reservations
while handling faults in different VMAs.
A per-VMA reservations pointer also makes it straightforward to avoid
checking for reservations in VMAs that don't support them though the
check could be done in other ways.  I'll take another look at a per-MM
implementation.

>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d4b0c79d2924..7deac5a1f25d 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -181,6 +181,7 @@ enum node_stat_item {
>>  	NR_DIRTIED,		/* page dirtyings since bootup */
>>  	NR_WRITTEN,		/* page writings since bootup */
>>  	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
>> +	NR_THP_RESERVED,	/* Unused small pages in THP reservations */
>>  	NR_VM_NODE_STAT_ITEMS
>>  };
>>  
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 47a3441cf4c4..f3d34db7e9d5 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>  		THP_ZERO_PAGE_ALLOC_FAILED,
>>  		THP_SWPOUT,
>>  		THP_SWPOUT_FALLBACK,
>> +		THP_RES_ALLOC,
>> +		THP_RES_ALLOC_FAILED,
>>  #endif
>>  #ifdef CONFIG_MEMORY_BALLOON
>>  		BALLOON_INFLATE,
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f0b58479534f..a15d1cda1958 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -527,6 +527,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  		if (is_vm_hugetlb_page(tmp))
>>  			reset_vma_resv_huge_pages(tmp);
>>  
>> +		thp_resvs_fork(tmp, mpnt);
>> +
>>  		/*
>>  		 * Link in the new vma and copy the page table entries.
>>  		 */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index deed97fba979..aa80b9c54d1c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -57,6 +57,8 @@
>>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>>  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>>  
>> +unsigned int hugepage_promotion_threshold __read_mostly = HPAGE_PMD_NR / 2;
>> +
>>  static struct shrinker deferred_split_shrinker;
>>  
> Mentioned already, default this to 1 initially and then
> reduce it.
>
>>  static atomic_t huge_zero_refcount;
>> @@ -288,6 +290,28 @@ static ssize_t use_zero_page_store(struct kobject *kobj,
>>  static struct kobj_attribute use_zero_page_attr =
>>  	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
>>  
>> +static ssize_t promotion_threshold_show(struct kobject *kobj,
>> +		struct kobj_attribute *attr, char *buf)
>> +{
>> +	return sprintf(buf, "%u\n", hugepage_promotion_threshold);
>> +}
>> +static ssize_t promotion_threshold_store(struct kobject *kobj,
>> +		struct kobj_attribute *attr, const char *buf, size_t count)
>> +{
>> +	int err;
>> +	unsigned long promotion_threshold;
>> +
>> +	err = kstrtoul(buf, 10, &promotion_threshold);
>> +	if (err || promotion_threshold < 1 || promotion_threshold > HPAGE_PMD_NR)
>> +		return -EINVAL;
>> +
>> +	hugepage_promotion_threshold = promotion_threshold;
>> +
>> +	return count;
>> +}
> Look at sysctl.c and see how extra1 and extra2 can be used to set a
> range of permitted values without hard-coding like this. You might need
> to add a special local variable like "one" and "zero" in that file to
> cover HPAGE_PMD_NR. It'll save you a few lines.
That seems to apply to /proc files, but this is a sysfs file which I modeled
after others in khugepaged.c.

>
>> +static struct kobj_attribute promotion_threshold_attr =
>> +	__ATTR(promotion_threshold, 0644, promotion_threshold_show, promotion_threshold_store);
>> +
>>  static ssize_t hpage_pmd_size_show(struct kobject *kobj,
>>  		struct kobj_attribute *attr, char *buf)
>>  {
>> @@ -318,6 +342,7 @@ static ssize_t debug_cow_store(struct kobject *kobj,
>>  	&enabled_attr.attr,
>>  	&defrag_attr.attr,
>>  	&use_zero_page_attr.attr,
>> +	&promotion_threshold_attr.attr,
>>  	&hpage_pmd_size_attr.attr,
>>  #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
>>  	&shmem_enabled_attr.attr,
> I suggest splitting out any debugging code into a separate patch.
Okay.

>
>> @@ -670,6 +695,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>  	struct page *page;
>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>  
>> +	if (hugepage_promotion_threshold > 1) {
>> +		khugepaged_reserve(vma, vmf->address);
>> +		return VM_FAULT_FALLBACK;
>> +	}
>>  	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>>  		return VM_FAULT_FALLBACK;
>>  	if (unlikely(anon_vma_prepare(vma)))
> Add a comment on why exactly it's falling back. It's also not clear at a
> glance what happens if this is the fault that reaches the threshold.
I'll add a comment.  do_huge_pmd_anonymous_page() is only called
for the first fault in a PMD range so the check here is whether to immediately
allocate a huge page or to allocate a reservation and fallback to fault
in a small page from the reservation.

If promotion was done at fault time, do_anonymous_page() would probably
be the place to check whether the threshold had been reached.

>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index a31d740e6cd1..55d380f8ce71 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -8,6 +8,7 @@
>>  #include <linux/mmu_notifier.h>
>>  #include <linux/rmap.h>
>>  #include <linux/swap.h>
>> +#include <linux/shrinker.h>
>>  #include <linux/mm_inline.h>
>>  #include <linux/kthread.h>
>>  #include <linux/khugepaged.h>
>> @@ -56,6 +57,7 @@ enum scan_result {
>>  /* default scan 8*512 pte (or vmas) every 30 second */
>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>>  static unsigned int khugepaged_pages_collapsed;
>> +static unsigned int khugepaged_res_pages_collapsed;
>>  static unsigned int khugepaged_full_scans;
>>  static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
>>  /* during fragmentation poll the hugepage allocator once every minute */
>> @@ -76,6 +78,445 @@ enum scan_result {
>>  
>>  static struct kmem_cache *mm_slot_cache __read_mostly;
>>  
>> +struct list_lru thp_reservations_lru;
>> +
>> +void thp_resvs_new(struct vm_area_struct *vma)
>> +{
>> +	struct thp_resvs *new = NULL;
>> +
>> +	if (hugepage_promotion_threshold == 1)
>> +		goto done;
>> +
>> +	new = kzalloc(sizeof(struct thp_resvs), GFP_KERNEL);
>> +	if (!new)
>> +		goto done;
>> +
>> +	atomic_set(&new->refcnt, 1);
>> +	spin_lock_init(&new->res_hash_lock);
>> +	hash_init(new->res_hash);
>> +
>> +done:
>> +	vma->thp_reservations = new;
>> +}
>> +
> Odd flow. Init the VMA to have this as NULL and then just return if it's
> unused instead of initialising it here. Not a biggie but I'm already
> hung-up on thinking the reservations should be per-mm.
Okay.

>
>> +void __thp_resvs_put(struct thp_resvs *resv)
>> +{
>> +	if (!atomic_dec_and_test(&resv->refcnt))
>> +		return;
>> +
>> +	kfree(resv);
>> +}
>> +
> kfree without clearing the pointer to it looks like a recipe for use-after
> free entertainments.
It's only called right before freeing the vm_area_struct that points to it.

>
>> +static struct thp_reservation *khugepaged_find_reservation(
>> +	struct vm_area_struct *vma,
>> +	unsigned long address)
>> +{
>> +	unsigned long haddr = address & HPAGE_PMD_MASK;
>> +	struct thp_reservation *res = NULL;
>> +
>> +	if (!vma->thp_reservations)
>> +		return NULL;
>> +
>> +	hash_for_each_possible(vma->thp_reservations->res_hash, res, node, haddr) {
>> +		if (res->haddr == haddr)
>> +			break;
>> +	}
>> +	return res;
>> +}
>> +
>> +static void khugepaged_free_reservation(struct thp_reservation *res)
>> +{
>> +	struct page *page;
>> +	int unused;
>> +	int i;
>> +
>> +	list_lru_del(&thp_reservations_lru, &res->lru);
>> +	hash_del(&res->node);
>> +	page = res->page;
>> +	unused = res->nr_unused;
>> +
>> +	kfree(res);
>> +
>> +	if (!PageCompound(page)) {
>> +		for (i = 0; i < HPAGE_PMD_NR; i++)
>> +			put_page(page + i);
>> +
>> +		if (unused) {
>> +			mod_node_page_state(page_pgdat(page), NR_THP_RESERVED,
>> +					    -unused);
>> +		}
>> +	}
>> +}
>> +
>> +void khugepaged_reserve(struct vm_area_struct *vma, unsigned long address)
>> +{
>> +	unsigned long haddr = address & HPAGE_PMD_MASK;
>> +	struct thp_reservation *res;
>> +	struct page *page;
>> +	gfp_t gfp;
>> +	int i;
>> +
>> +	if (!vma->thp_reservations)
>> +		return;
>> +	if (!vma_is_anonymous(vma))
>> +		return;
>> +	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>> +		return;
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	if (khugepaged_find_reservation(vma, address)) {
>> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Allocate the equivalent of a huge page but not as a compound page
>> +	 */
>> +	gfp = GFP_TRANSHUGE_LIGHT & ~__GFP_COMP;
>> +	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>> +	if (unlikely(!page)) {
>> +		count_vm_event(THP_RES_ALLOC_FAILED);
>> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +		return;
>> +	}
>> +
> Put the cleanup at the end of the function and use gotos to pick what
> point you unwind the work at. This will reduce duplicated code and make
> it harder to introduce bugs in the cleanup if there are modifications to
> this function.
Okay.

>
>> +	for (i = 0; i < HPAGE_PMD_NR; i++)
>> +		set_page_count(page + i, 1);
>> +
> split_page()?
I didn't know about split_page().  It definitely looks appropriate.  I'll
change it.

>
>> +	res = kzalloc(sizeof(*res), GFP_KERNEL);
>> +	if (!res) {
>> +		count_vm_event(THP_RES_ALLOC_FAILED);
>> +		__free_pages(page, HPAGE_PMD_ORDER);
>> +		spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +		return;
>> +	}
>> +
>> +	count_vm_event(THP_RES_ALLOC);
>> +
>> +	res->haddr = haddr;
>> +	res->page = page;
>> +	res->vma = vma;
>> +	res->lock = &vma->thp_reservations->res_hash_lock;
>> +	hash_add(vma->thp_reservations->res_hash, &res->node, haddr);
>> +
>> +	INIT_LIST_HEAD(&res->lru);
>> +	list_lru_add(&thp_reservations_lru, &res->lru);
>> +
>> +	res->nr_unused = HPAGE_PMD_NR;
>> +	mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, HPAGE_PMD_NR);
>> +
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	khugepaged_enter(vma, vma->vm_flags);
>> +}
>> +
> I'm undecided on the use of the LRU. I'm not sure it's worthwhile in the
> initial phase to shrink reservations in LRU order. I think in the first
> generation just blast all reservations if there is memory pressure and
> add the shrinker as a separate patch. This has a worse-case scenario on
> being no better than what we have today.
>
> I'm partially saying this because shrinkers have historically being
> a bit painful and difficult to analyse. Initially you want this to be
> performance-neutral at worst and the use of shrinkers means we could
> have corner cases where reservations cause page cache or mapped anonymous
> pages to be prematurely discarded.
I'm not sure how something could get prematurely discarded, but I'm fine
with separating these changes and starting with simpler functionality
that just releases all reservations.

>
>> +struct page *khugepaged_get_reserved_page(struct vm_area_struct *vma,
>> +					  unsigned long address)
>> +{
>> +	struct thp_reservation *res;
>> +	struct page *page;
>> +
>> +	if (!transparent_hugepage_enabled(vma))
>> +		return NULL;
>> +	if (!vma->thp_reservations)
>> +		return NULL;
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	page = NULL;
>> +	res = khugepaged_find_reservation(vma, address);
>> +	if (res) {
>> +		unsigned long offset = address & ~HPAGE_PMD_MASK;
>> +
>> +		page = res->page + (offset >> PAGE_SHIFT);
>> +		get_page(page);
>> +
>> +		list_lru_del(&thp_reservations_lru, &res->lru);
>> +		list_lru_add(&thp_reservations_lru, &res->lru);
>> +
>> +		dec_node_page_state(res->page, NR_THP_RESERVED);
>> +	}
>> +
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	return page;
>> +}
>> +
>> +void khugepaged_release_reservation(struct vm_area_struct *vma,
>> +				    unsigned long address)
>> +{
>> +	struct thp_reservation *res;
>> +
>> +	if (!vma->thp_reservations)
>> +		return;
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	res = khugepaged_find_reservation(vma, address);
>> +	if (!res)
>> +		goto out;
>> +
>> +	khugepaged_free_reservation(res);
>> +
>> +out:
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +}
>> +
>> +/*
>> + * Release all reservations covering a range in a VMA.
>> + */
>> +void __khugepaged_release_reservations(struct vm_area_struct *vma,
>> +				       unsigned long addr, unsigned long len)
>> +{
>> +	struct thp_reservation *res;
>> +	struct hlist_node *tmp;
>> +	unsigned long eaddr;
>> +	int i;
>> +
>> +	if (!vma->thp_reservations)
>> +		return;
>> +
>> +	eaddr = addr + len;
>> +	addr &= HPAGE_PMD_MASK;
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
>> +		unsigned long hstart = res->haddr;
>> +
>> +		if (hstart >= addr && hstart < eaddr)
>> +			khugepaged_free_reservation(res);
>> +	}
>> +
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +}
>> +
>> +static void __khugepaged_move_reservations(struct vm_area_struct *src,
>> +					   struct vm_area_struct *dst,
>> +					   unsigned long split_addr,
>> +					   bool dst_is_below)
>> +{
>> +	struct thp_reservation *res;
>> +	struct hlist_node *tmp;
>> +	bool free_res = false;
>> +	int i;
>> +
>> +	if (!src->thp_reservations)
>> +		return;
>> +
>> +	if (!dst->thp_reservations)
>> +		free_res = true;
>> +
>> +	spin_lock(&src->thp_reservations->res_hash_lock);
>> +	if (!free_res)
>> +		spin_lock(&dst->thp_reservations->res_hash_lock);
>> +
>> +	hash_for_each_safe(src->thp_reservations->res_hash, i, tmp, res, node) {
>> +		unsigned long hstart = res->haddr;
>> +
>> +		/*
>> +		 * Free the reservation if it straddles a non-aligned
>> +		 * split address.
>> +		 */
>> +		if ((split_addr & ~HPAGE_PMD_MASK) &&
>> +		    (hstart == (split_addr & HPAGE_PMD_MASK))) {
>> +			khugepaged_free_reservation(res);
>> +			continue;
>> +		} else if (dst_is_below) {
>> +			if (hstart >= split_addr)
>> +				continue;
>> +		} else if (hstart < split_addr) {
>> +			continue;
>> +		}
>> +
>> +		if (unlikely(free_res)) {
>> +			khugepaged_free_reservation(res);
>> +			continue;
>> +		}
>> +
>> +		hash_del(&res->node);
>> +		res->vma = dst;
>> +		res->lock = &dst->thp_reservations->res_hash_lock;
>> +		hash_add(dst->thp_reservations->res_hash, &res->node, res->haddr);
>> +	}
>> +
>> +	if (!free_res)
>> +		spin_unlock(&dst->thp_reservations->res_hash_lock);
>> +	spin_unlock(&src->thp_reservations->res_hash_lock);
>> +}
>> +
> Nothing jumped out here but I'm not reading as closely as I should.
> Fundamentally any major issue here will result in memory corruption
> of a type that will trigger quickly. Performance testing can be driven
> by profiles.
>
>> +/*
>> + * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
>> + */
>> +static void take_rmap_locks(struct vm_area_struct *vma)
>> +{
>> +	if (vma->vm_file)
>> +		i_mmap_lock_write(vma->vm_file->f_mapping);
>> +	if (vma->anon_vma)
>> +		anon_vma_lock_write(vma->anon_vma);
>> +}
>> +
>> +/*
>> + * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
>> + */
>> +static void drop_rmap_locks(struct vm_area_struct *vma)
>> +{
>> +	if (vma->anon_vma)
>> +		anon_vma_unlock_write(vma->anon_vma);
>> +	if (vma->vm_file)
>> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
>> +}
>> +
>> +void thp_reservations_mremap(struct vm_area_struct *vma,
>> +		unsigned long old_addr, struct vm_area_struct *new_vma,
>> +		unsigned long new_addr, unsigned long len,
>> +		bool need_rmap_locks)
>> +{
> Is mremap really worth optimising at this point? Would it be possible
> instead to dump all reservations for a range (either VMA or mm) being
> mremapped and just move it as normal? If so, do that and make mremap
> handling a separate patch. Minimally, it would be nice to know if there
> are mremap-intensive workloads that care deeply about preserving THP
> reservations.
I can't say if mremap support is worth it.  I was just trying to be complete. :-)
I'll move the support into a separate patch.

>
>> +
>> +	struct thp_reservation *res;
>> +	unsigned long eaddr, offset;
>> +	struct hlist_node *tmp;
>> +	int i;
>> +
>> +	if (!vma->thp_reservations)
>> +		return;
>> +
>> +	if (!new_vma->thp_reservations) {
>> +		__khugepaged_release_reservations(vma, old_addr, len);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Release all reservations if they will no longer be aligned
>> +	 * in the new address range.
>> +	 */
>> +	if ((new_addr & ~HPAGE_PMD_MASK) != (old_addr & ~HPAGE_PMD_MASK)) {
>> +		__khugepaged_release_reservations(vma, old_addr, len);
>> +		return;
>> +	}
>> +
>> +	if (need_rmap_locks)
>> +		take_rmap_locks(vma);
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +	spin_lock(&new_vma->thp_reservations->res_hash_lock);
>> +
>> +	/*
>> +	 * If the start or end addresses of the range are not huge page
>> +	 * aligned, check for overlapping reservations and release them.
>> +	 */
>> +	if (old_addr & ~HPAGE_PMD_MASK) {
>> +		res = khugepaged_find_reservation(vma, old_addr);
>> +		if (res)
>> +			khugepaged_free_reservation(res);
>> +	}
>> +
>> +	eaddr = old_addr + len;
>> +	if (eaddr & ~HPAGE_PMD_MASK) {
>> +		res = khugepaged_find_reservation(vma, eaddr);
>> +		if (res)
>> +			khugepaged_free_reservation(res);
>> +	}
>> +
>> +	offset = new_addr - old_addr;
>> +
>> +	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
>> +		unsigned long hstart = res->haddr;
>> +
>> +		if (hstart < old_addr || hstart >= eaddr)
>> +			continue;
>> +
>> +		hash_del(&res->node);
>> +		res->lock = &new_vma->thp_reservations->res_hash_lock;
>> +		res->vma = new_vma;
>> +		res->haddr += offset;
>> +		hash_add(new_vma->thp_reservations->res_hash, &res->node, res->haddr);
>> +	}
>> +
>> +	spin_unlock(&new_vma->thp_reservations->res_hash_lock);
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	if (need_rmap_locks)
>> +		drop_rmap_locks(vma);
>> +
>> +}
>> +
>> +/*
>> + * Handle moving reservations for VMA merge cases 1, 6, 7, and 8 (see
>> + * comments above vma_merge()) and when splitting a VMA.
>> + *
>> + * src is expected to be aligned with the start or end of dst
>> + * src may be contained by dst or directly adjacent to dst
>> + * Move all reservations if src is contained by dst.
>> + * Otherwise move reservations no longer in the range of src
>> + * to dst.
>> + */
>> +void _khugepaged_reservations_fixup(struct vm_area_struct *src,
>> +				    struct vm_area_struct *dst)
>> +{
>> +	bool dst_is_below = false;
>> +	unsigned long split_addr;
>> +
>> +	if (src->vm_start == dst->vm_start || src->vm_end == dst->vm_end) {
>> +		split_addr = 0;
>> +	} else if (src->vm_start == dst->vm_end) {
>> +		split_addr = src->vm_start;
>> +		dst_is_below = true;
>> +	} else if (src->vm_end == dst->vm_start) {
>> +		split_addr = src->vm_end;
>> +	} else {
>> +		WARN_ON(1);
>> +		return;
>> +	}
>> +
>> +	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
>> +}
>> +
>> +/*
>> + * Handle moving reservations for VMA merge cases 4 and 5 (see comments
>> + * above vma_merge()).
>> + */
>> +void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
>> +				       struct vm_area_struct *next, long adjust)
>> +{
>> +	unsigned long split_addr = next->vm_start;
>> +	struct vm_area_struct *src, *dst;
>> +	bool dst_is_below;
>> +
>> +	if (adjust < 0) {
>> +		src = prev;
>> +		dst = next;
>> +		dst_is_below = false;
>> +	} else {
>> +		src = next;
>> +		dst = prev;
>> +		dst_is_below = true;
>> +	}
>> +
>> +	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
>> +}
>> +
>> +void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
>> +				unsigned long address, int delta)
>> +{
>> +	struct thp_reservation *res;
>> +
>> +	if (!vma->thp_reservations)
>> +		return;
>> +
>> +	spin_lock(&vma->thp_reservations->res_hash_lock);
>> +
>> +	res = khugepaged_find_reservation(vma, address);
>> +	if (res) {
>> +		WARN_ON((res->nr_unused == 0) || (res->nr_unused + delta < 0));
>> +		if (res->nr_unused + delta >= 0)
>> +			res->nr_unused += delta;
>> +	}
>> +
>> +	spin_unlock(&vma->thp_reservations->res_hash_lock);
>> +}
>> +
>>  /**
>>   * struct mm_slot - hash lookup from mm to mm_slot
>>   * @hash: hash collision list
>> @@ -197,6 +638,15 @@ static ssize_t pages_collapsed_show(struct kobject *kobj,
>>  static struct kobj_attribute pages_collapsed_attr =
>>  	__ATTR_RO(pages_collapsed);
>>  
>> +static ssize_t res_pages_collapsed_show(struct kobject *kobj,
>> +				    struct kobj_attribute *attr,
>> +				    char *buf)
>> +{
>> +	return sprintf(buf, "%u\n", khugepaged_res_pages_collapsed);
>> +}
>> +static struct kobj_attribute res_pages_collapsed_attr =
>> +	__ATTR_RO(res_pages_collapsed);
>> +
>>  static ssize_t full_scans_show(struct kobject *kobj,
>>  			       struct kobj_attribute *attr,
>>  			       char *buf)
>> @@ -292,6 +742,7 @@ static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
>>  	&scan_sleep_millisecs_attr.attr,
>>  	&alloc_sleep_millisecs_attr.attr,
>>  	&khugepaged_max_ptes_swap_attr.attr,
>> +	&res_pages_collapsed_attr.attr,
>>  	NULL,
>>  };
>>  
>> @@ -342,8 +793,96 @@ int hugepage_madvise(struct vm_area_struct *vma,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * thp_lru_free_reservation() - shrinker callback to release THP reservations
>> + * and free unused pages
>> + *
>> + * Called from list_lru_shrink_walk() in thp_resvs_shrink_scan() to free
>> + * up pages when the system is under memory pressure.
>> + */
>> +enum lru_status thp_lru_free_reservation(struct list_head *item,
>> +					 struct list_lru_one *lru,
>> +					 spinlock_t *lock,
>> +					 void *cb_arg)
>> +{
>> +	struct mm_struct *mm = NULL;
>> +	struct thp_reservation *res = container_of(item,
>> +						   struct thp_reservation,
>> +						   lru);
>> +	struct page *page;
>> +	int unused;
>> +	int i;
>> +
>> +	if (!spin_trylock(res->lock))
>> +		goto err_get_res_lock_failed;
>> +
>> +	mm = res->vma->vm_mm;
>> +	if (!mmget_not_zero(mm))
>> +		goto err_mmget;
>> +	if (!down_write_trylock(&mm->mmap_sem))
>> +		goto err_down_write_mmap_sem_failed;
>> +
>> +	list_lru_isolate(lru, item);
>> +	spin_unlock(lock);
>> +
>> +	hash_del(&res->node);
>> +
>> +	up_write(&mm->mmap_sem);
>> +	mmput(mm);
>> +
>> +	spin_unlock(res->lock);
>> +
>> +	page = res->page;
>> +	unused = res->nr_unused;
>> +
>> +	kfree(res);
>> +
>> +	for (i = 0; i < HPAGE_PMD_NR; i++)
>> +		put_page(page + i);
>> +
>> +	if (unused)
>> +		mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, -unused);
>> +
>> +	spin_lock(lock);
>> +
>> +	return LRU_REMOVED_RETRY;
>> +
>> +err_down_write_mmap_sem_failed:
>> +	mmput_async(mm);
>> +err_mmget:
>> +	spin_unlock(res->lock);
>> +err_get_res_lock_failed:
>> +	return LRU_SKIP;
>> +}
>> +
>> +static unsigned long
>> +thp_resvs_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>> +{
>> +	unsigned long ret = list_lru_shrink_count(&thp_reservations_lru, sc);
>> +	return ret;
>> +}
>> +
>> +static unsigned long
>> +thp_resvs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>> +{
>> +	unsigned long ret;
>> +
>> +	ret = list_lru_shrink_walk(&thp_reservations_lru, sc,
>> +				   thp_lru_free_reservation, NULL);
>> +	return ret;
>> +}
>> +
>> +static struct shrinker thp_resvs_shrinker = {
>> +	.count_objects = thp_resvs_shrink_count,
>> +	.scan_objects = thp_resvs_shrink_scan,
>> +	.seeks = DEFAULT_SEEKS,
>> +	.flags = SHRINKER_NUMA_AWARE,
>> +};
>> +
> As before, I think the shrinker stuff should be in separate patches. Get
> the core approach working first and then add bits on top. If nothing
> else, it means that a mistake in the shrinker behaviour will not kill
> the overall idea just because it's a monolithic patch.
>
>>  int __init khugepaged_init(void)
>>  {
>> +	int err;
>> +
>>  	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
>>  					  sizeof(struct mm_slot),
>>  					  __alignof__(struct mm_slot), 0, NULL);
>> @@ -354,6 +893,17 @@ int __init khugepaged_init(void)
>>  	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>>  	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
>>  
>> +	// XXX should be in hugepage_init() so shrinker can be
>> +	// unregistered if necessary.
>> +	err = list_lru_init(&thp_reservations_lru);
>> +	if (err == 0) {
>> +		err = register_shrinker(&thp_resvs_shrinker);
>> +		if (err) {
>> +			list_lru_destroy(&thp_reservations_lru);
>> +			return err;
>> +		}
>> +	}
>> +
>>  	return 0;
>>  }
>>  
>> @@ -519,12 +1069,14 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte)
>>  
>>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  					unsigned long address,
>> -					pte_t *pte)
>> +					pte_t *pte,
>> +					struct thp_reservation *res)
>>  {
>>  	struct page *page = NULL;
>>  	pte_t *_pte;
>>  	int none_or_zero = 0, result = 0, referenced = 0;
>>  	bool writable = false;
>> +	bool is_reserved = res ? true : false;
>>  
>>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>>  	     _pte++, address += PAGE_SIZE) {
>> @@ -573,7 +1125,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  		 * The page must only be referenced by the scanned process
>>  		 * and page swap cache.
>>  		 */
>> -		if (page_count(page) != 1 + PageSwapCache(page)) {
>> +		if (page_count(page) != 1 + PageSwapCache(page) + is_reserved) {
>>  			unlock_page(page);
>>  			result = SCAN_PAGE_COUNT;
>>  			goto out;
>> @@ -631,6 +1183,68 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  	return 0;
>>  }
>>  
>> +static void __collapse_huge_page_convert(pte_t *pte, struct page *page,
>> +				      struct vm_area_struct *vma,
>> +				      unsigned long address,
>> +				      spinlock_t *ptl)
>> +{
>> +	struct page *head = page;
>> +	pte_t *_pte;
>> +
>> +	set_page_count(page, 1);
>> +
>> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +				_pte++, page++, address += PAGE_SIZE) {
>> +		pte_t pteval = *_pte;
>> +
>> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +			clear_user_highpage(page, address);
> Needs commenting. At a glance it's not clear what happens if the entire
> range was zero pfns. Does that convert into one large allocated huge
> page? If so, it goes counter to the goal of reducing memory usage.
By default khugepaged skips page ranges that are entirely zero pfns
(khugepaged_max_ptes_none = HPAGE_PMD_NR - 1).
>
>> +			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> +			if (is_zero_pfn(pte_pfn(pteval))) {
>> +				/*
>> +				 * ptl mostly unnecessary.
>> +				 */
>> +				spin_lock(ptl);
>> +				/*
>> +				 * paravirt calls inside pte_clear here are
>> +				 * superfluous.
>> +				 */
>> +				pte_clear(vma->vm_mm, address, _pte);
>> +				spin_unlock(ptl);
>> +			}
>> +			dec_node_page_state(page, NR_THP_RESERVED);
>> +		} else {
>> +			dec_node_page_state(page, NR_ISOLATED_ANON +
>> +					    page_is_file_cache(page));
>> +			unlock_page(page);
>> +			ClearPageActive(page);
>> +			/*
>> +			 * ptl mostly unnecessary, but preempt has to
>> +			 * be disabled to update the per-cpu stats
>> +			 * inside page_remove_rmap().
>> +			 */
>> +			spin_lock(ptl);
>> +			/*
>> +			 * paravirt calls inside pte_clear here are
>> +			 * superfluous.
>> +			 */
>> +			pte_clear(vma->vm_mm, address, _pte);
>> +			page_remove_rmap(page, false);
>> +			spin_unlock(ptl);
>> +			/*
>> +			 * Swapping out a page in a reservation
>> +			 * causes the reservation to be released
>> +			 * therefore no pages in a reservation
>> +			 * should be in swapcache.
>> +			 */
>> +			WARN_ON(PageSwapCache(page));
>> +		}
>> +	}
>> +
> The page table handling needs more careful review than I'm giving at the
> moment. It's another reason why the core approach should be as simple as
> possible as the patch has too much in it right now to keep it all in
> mind.
>
> I ran out of time at this point. Overall I don't see anything in there that
> fundamentally kills the idea and I think it's workable. It does need to
> be broken up though as reviewing a monolthic patch is going to be harder
> to review and a single mistake at the edges (like shrinkers or mremap)
> takes everything else with it. Focus on getting the reservation part right,
> the zero page handling and the collapse. For any corner case, dump all the
> reservations for a VMA or address space and let kcompactd clean it up later.
> It can then be evaluated that it's performance-neutral relative to the
> existing code and identify what workloads are helped by being clever.
>
> Thanks!
>
Thank you for the feedback!

Anthony
Anthony Yznaga Nov. 10, 2018, 12:04 a.m. UTC | #7
On 11/09/2018 04:13 AM, Kirill A. Shutemov wrote:
> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>> The basic idea as outlined by Mel Gorman in [2] is:
>>
>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>    sized and aligned block of base pages.  Map the base page
>>    corresponding to the fault address and hold the rest of the pages in
>>    reserve.
>> 2) On subsequent faults in the range, map the pages from the reservation.
>> 3) When enough pages have been mapped, promote the mapped pages and
>>    remaining pages in the reservation to a huge page.
>> 4) When there is memory pressure, release the unused pages from their
>>    reservations.
> I haven't yet read the patch in details, but I'm skeptical about the
> approach in general for few reasons:
>
> - PTE page table retracting to replace it with huge PMD entry requires
>   down_write(mmap_sem). It makes the approach not practical for many
>   multi-threaded workloads.
>
>   I don't see a way to avoid exclusive lock here. I will be glad to
>   be proved otherwise.
>
> - The promotion will also require TLB flush which might be prohibitively
>   slow on big machines.
>
> - Short living processes will fail to benefit from THP with the policy,
>   even with plenty of free memory in the system: no time to promote to THP
>   or, with synchronous promotion, cost will overweight the benefit.
>
> The goal to reduce memory overhead of THP is admirable, but we need to be
> careful not to kill THP benefit itself. The approach will reduce number of
> THP mapped in the system and/or shift their allocation to later stage of
> process lifetime.
>
> The only way I see it can be useful is if it will be possible to apply the
> policy on per-VMA basis. It will be very useful for malloc()
> implementations, for instance. But as a global policy it's no-go to me.
I agree that this should not be a global policy.  For example, it seems to me
that a VMA where MADV_HUGEPAGE has been applied should get huge
pages on first faults (I need to fix that in my implementation).
>
> Prove me wrong with performance data. :)
I'll try.  :-)

Thanks for the comments!

Anthony
Anthony Yznaga Nov. 10, 2018, 12:39 a.m. UTC | #8
On 11/09/2018 07:34 AM, Zi Yan wrote:
> On 9 Nov 2018, at 8:11, Mel Gorman wrote:
>
>> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>>>> The basic idea as outlined by Mel Gorman in [2] is:
>>>>
>>>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>>>    sized and aligned block of base pages.  Map the base page
>>>>    corresponding to the fault address and hold the rest of the pages in
>>>>    reserve.
>>>> 2) On subsequent faults in the range, map the pages from the reservation.
>>>> 3) When enough pages have been mapped, promote the mapped pages and
>>>>    remaining pages in the reservation to a huge page.
>>>> 4) When there is memory pressure, release the unused pages from their
>>>>    reservations.
>>> I haven't yet read the patch in details, but I'm skeptical about the
>>> approach in general for few reasons:
>>>
>>> - PTE page table retracting to replace it with huge PMD entry requires
>>>   down_write(mmap_sem). It makes the approach not practical for many
>>>   multi-threaded workloads.
>>>
>>>   I don't see a way to avoid exclusive lock here. I will be glad to
>>>   be proved otherwise.
>>>
>> That problem is somewhat fundamental to the mmap_sem itself and
>> conceivably it could be alleviated by range-locking (if that gets
>> completed). The other thing to bear in mind is the timing. If the
>> promotion is in-place due to reservations, there isn't the allocation
>> overhead and the hold times *should* be short.
>>
> Is it possible to convert all these PTEs to migration entries during
> the promotion and replace them with a huge PMD entry afterwards?
> AFAIK, migrating pages does not require holding a mmap_sem.
> Basically, it will act like migrating 512 base pages to a THP without
> actually doing the page copy.
That's an interesting idea.  I'll look into it.

Thanks,
Anthony

>
> --
> Best Regards
> Yan Zi
Anthony Yznaga Nov. 10, 2018, 12:55 a.m. UTC | #9
On 11/09/2018 11:51 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>>> The basic idea as outlined by Mel Gorman in [2] is:
>>>
>>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>>    sized and aligned block of base pages.  Map the base page
>>>    corresponding to the fault address and hold the rest of the pages in
>>>    reserve.
>>> 2) On subsequent faults in the range, map the pages from the reservation.
>>> 3) When enough pages have been mapped, promote the mapped pages and
>>>    remaining pages in the reservation to a huge page.
>>> 4) When there is memory pressure, release the unused pages from their
>>>    reservations.
>> I haven't yet read the patch in details, but I'm skeptical about the
>> approach in general for few reasons:
>>
>> - PTE page table retracting to replace it with huge PMD entry requires
>>   down_write(mmap_sem). It makes the approach not practical for many
>>   multi-threaded workloads.
>>
>>   I don't see a way to avoid exclusive lock here. I will be glad to
>>   be proved otherwise.
>>
>> - The promotion will also require TLB flush which might be prohibitively
>>   slow on big machines.
>>
>> - Short living processes will fail to benefit from THP with the policy,
>>   even with plenty of free memory in the system: no time to promote to THP
>>   or, with synchronous promotion, cost will overweight the benefit.
>>
>> The goal to reduce memory overhead of THP is admirable, but we need to be
>> careful not to kill THP benefit itself. The approach will reduce number of
>> THP mapped in the system and/or shift their allocation to later stage of
>> process lifetime.
>>
>> The only way I see it can be useful is if it will be possible to apply the
>> policy on per-VMA basis. It will be very useful for malloc()
>> implementations, for instance. But as a global policy it's no-go to me.
> I'm also skeptical about this: the current design is quite
> intentional. It's not a bug but a feature that we're not doing the
> promotion.
>
> Part of the tradeoff with THP is to use more RAM to save CPU, when you
> use less RAM you're inherently already wasting some CPU just for the
> reservation management and you don't get the immediate TLB benefit
> anymore either.
>
> And if you're in the camp that is concerned about the use of more RAM
> or/and about the higher latency of COW faults, I'm afraid the
> intermediate solution will be still slower than the already available
> MADV_NOHUGEPAGE or enabled=madvise.
>
> Apps like redis that will use more RAM during snapshot and that are
> slowed down with THP needs to simply use MADV_NOHUGEPAGE which already
> exists as an madvise from the very first kernel that supported
> THP-anon. Same thing for other apps that use more RAM with THP and
> that are on the losing end of the tradeoff.
>
> Now about the implementation: the whole point of the reservation
> complexity is to skip the khugepaged copy, so it can collapse in
> place. Is skipping the copy worth it? Isn't the big cost the IPI
> anyway to avoid leaving two simultaneous TLB mappings of different
> granularity?
Good questions.  I'll take them into account when measuring performance.
I do wonder about other architectures (e.g. ARM) where the PMD
size may be significantly larger than 2MB.

>
> khugepaged is already tunable to specify a ratio of memory in use to
> avoid wasting memory
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.
>
> If you set max_ptes_none to half the default value, it'll only promote
> pages that are half mapped, reducing the memory waste to 50% of what
> it is by default.
>
> So if you are ok to copy the memory that you promote to THP, you'd
> just need a global THP mode to avoid allocating THP even when they're
> available during the page fault (while still allowing khugepaged to
> collapse hugepages in the background), and then reduce max_ptes_none
> to get the desired promotion ratio.
>
> Doing the copy will avoid the reservation there will be also more THP
> available to use for those khugepaged users without losing them in
> reservations. You won't have to worry about what to do when there's
> memory pressure because you won't have to undo the reservation because
> there was no reservation in the first place. That problem also goes
> away with the copy.
>
> So it sounds like you could achieve a similar runtime behavior with
> much less complexity by reducing max_ptes_none and by doing the copy
> and dropping all reservation code.

These are compelling arguments.  I will be sure to evaluate any
performance data against this alternate implementation/tuning.

Thank you for the comments.

Anthony

>
>> Prove me wrong with performance data. :)
> Same here.
>
> Thanks,
> Andrea
Kirill A. Shutemov Nov. 10, 2018, 9:35 a.m. UTC | #10
On Fri, Nov 09, 2018 at 10:34:07AM -0500, Zi Yan wrote:
> On 9 Nov 2018, at 8:11, Mel Gorman wrote:
> 
> > On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> >> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
> >>> The basic idea as outlined by Mel Gorman in [2] is:
> >>>
> >>> 1) On first fault in a sufficiently sized range, allocate a huge page
> >>>    sized and aligned block of base pages.  Map the base page
> >>>    corresponding to the fault address and hold the rest of the pages in
> >>>    reserve.
> >>> 2) On subsequent faults in the range, map the pages from the reservation.
> >>> 3) When enough pages have been mapped, promote the mapped pages and
> >>>    remaining pages in the reservation to a huge page.
> >>> 4) When there is memory pressure, release the unused pages from their
> >>>    reservations.
> >>
> >> I haven't yet read the patch in details, but I'm skeptical about the
> >> approach in general for few reasons:
> >>
> >> - PTE page table retracting to replace it with huge PMD entry requires
> >>   down_write(mmap_sem). It makes the approach not practical for many
> >>   multi-threaded workloads.
> >>
> >>   I don't see a way to avoid exclusive lock here. I will be glad to
> >>   be proved otherwise.
> >>
> >
> > That problem is somewhat fundamental to the mmap_sem itself and
> > conceivably it could be alleviated by range-locking (if that gets
> > completed). The other thing to bear in mind is the timing. If the
> > promotion is in-place due to reservations, there isn't the allocation
> > overhead and the hold times *should* be short.
> >
> 
> Is it possible to convert all these PTEs to migration entries during
> the promotion and replace them with a huge PMD entry afterwards?
> AFAIK, migrating pages does not require holding a mmap_sem.
> Basically, it will act like migrating 512 base pages to a THP without
> actually doing the page copy.

You'll still need down_write(mmap_sem) to convert PTE page table full of
migration entires to PMD entry. It's required at least to protect against
parallel MADV_DONTNEED that can zap migration entries under you.
Mel Gorman Nov. 10, 2018, 1:22 p.m. UTC | #11
On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> > I haven't yet read the patch in details, but I'm skeptical about the
> > approach in general for few reasons:
> > 
> > - PTE page table retracting to replace it with huge PMD entry requires
> >   down_write(mmap_sem). It makes the approach not practical for many
> >   multi-threaded workloads.
> > 
> >   I don't see a way to avoid exclusive lock here. I will be glad to
> >   be proved otherwise.
> > 
> > - The promotion will also require TLB flush which might be prohibitively
> >   slow on big machines.
> > 
> > - Short living processes will fail to benefit from THP with the policy,
> >   even with plenty of free memory in the system: no time to promote to THP
> >   or, with synchronous promotion, cost will overweight the benefit.
> > 
> > The goal to reduce memory overhead of THP is admirable, but we need to be
> > careful not to kill THP benefit itself. The approach will reduce number of
> > THP mapped in the system and/or shift their allocation to later stage of
> > process lifetime.
> > 
> > The only way I see it can be useful is if it will be possible to apply the
> > policy on per-VMA basis. It will be very useful for malloc()
> > implementations, for instance. But as a global policy it's no-go to me.
> 
> I'm also skeptical about this: the current design is quite
> intentional. It's not a bug but a feature that we're not doing the
> promotion.
> 

Understood. I think with two people with extensive THP experience being
skeptical about this, we should take a step back before Anthony spends
too much more time on this. It would be a shame to work extensively on
a series just to have it rejected.

> Part of the tradeoff with THP is to use more RAM to save CPU, when you
> use less RAM you're inherently already wasting some CPU just for the
> reservation management and you don't get the immediate TLB benefit
> anymore either.
> 

This is true, there is a gap where there is no THP benefit. The big
question is how many workloads, if any, suffer as a result of premature
reclaim due to sparse references of the address space consuming too much
memory. Anthony, do you have any benchmarks in mind? I don't because the
HPC workloads I'm aware of are usually sized to fit in memory regardless
of THP use.

> And if you're in the camp that is concerned about the use of more RAM
> or/and about the higher latency of COW faults, I'm afraid the
> intermediate solution will be still slower than the already available
> MADV_NOHUGEPAGE or enabled=madvise.
> 

Does that not prevent huge page usage? Maybe you can spell it out a bit
better. What is the set of system calls an application should make to
not use huge pages either for the address space or on a per-VMA basis
and defer to kcompactd? I know that can be tuned globally but that's not
quite the same thing given that multiple applications or containers can
be running with different requirements.

> Now about the implementation: the whole point of the reservation
> complexity is to skip the khugepaged copy, so it can collapse in
> place. Is skipping the copy worth it? Isn't the big cost the IPI
> anyway to avoid leaving two simultaneous TLB mappings of different
> granularity?
> 

Not necessarily. With THP anon in the simple case, it might be just a
single thread and kcompact so that's one IPI (kcompactd flushes local and
one IPI to the CPU the thread was running on assuming it's not migrating
excessively). It would scale up with the number of threads but I suspect
the main cost is the actual copying, page table manipulation and the
locking required.

> So if you are ok to copy the memory that you promote to THP, you'd
> just need a global THP mode to avoid allocating THP even when they're
> available during the page fault (while still allowing khugepaged to
> collapse hugepages in the background), and then reduce max_ptes_none
> to get the desired promotion ratio.
> 

As an aside, a universal benefit would be looking at reducing the time
to allocate the necessary huge page as we know that can be excessive. It
would be ortogonal to this series.

> > <SNIP>
> >
> > Prove me wrong with performance data. :)
> 
> Same here.
> 

Could you and Kirill outline what sort of workloads you would consider
acceptable for evaluating this series? One would assume it covers at
least the following, potentially with a number of workloads.

1. Evaluate the collapse and copying costs (probing the entire time
   spent in collapse_huge_page might do it)
2. Evaluate mmap_sem hold time during hugepage collapse
3. Estimate excessive RAM use due to unnecessary THP usage
4. Estimate the slowdown due to delayed THP usage 

1 and 2 would indicate how much time is lost due to not using
reservations. That potentially goes in the direction of simply making
this faster -- fragmentation reduction (posted but unreviewed), faster
compaction searches, better page isolation during compaction to
avoid free pages being reused before an order-9 is free.

3 should be straight-forward but 4 would be the hardest to evaluate
because it would have to be determimed if 4 is offset by improvements to
1-3. If 1-3 is improved enough, it might remove the motivation for the
series entirely.

In other words, if we agree on a workload in advance, it might bring
this the right direction and not accidentally throw Anthony down a hole
working on a series that never gets ack'd.

I'm not necessarily the best person to answer because my natural inclination
after the fragmentation series would be to keep using thpfiosacle
(from the fragmentation avoidance series) and work on improving the THP
allocation success rates and reduce latencies. I've tunnel vision on that
for the moment.

Thanks.
Andrea Arcangeli Nov. 10, 2018, 4:44 p.m. UTC | #12
On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
> > And if you're in the camp that is concerned about the use of more RAM
> > or/and about the higher latency of COW faults, I'm afraid the
> > intermediate solution will be still slower than the already available
> > MADV_NOHUGEPAGE or enabled=madvise.
> > 
> 
> Does that not prevent huge page usage? Maybe you can spell it out a bit

Yes it prevents huge page usage, but preventing the huge page usage is
also what is achieved with the reservation.

> better. What is the set of system calls an application should make to
> not use huge pages either for the address space or on a per-VMA basis
> and defer to kcompactd? I know that can be tuned globally but that's not
> quite the same thing given that multiple applications or containers can
> be running with different requirements.

Yes, in terms of inheritance that could be used to tune a container
we've only PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE
useless too, but then for microservices that should not be a
concern. How to make those sysfs tunables reentrant in namespaces is a
separate issue I think.

The difference is that with the reservation over time they can be
promoted, with MADV_NOHUGEPAGE they cannot become hugepages later, not
even khugepaged will scan that vma anymore.

The benefit of the reservation will showup in those regions that will
not become hugepages, so if you can predict beforehand that those
ranges don't benefit from THP, it's better if userland calls
madvise(MADV_NOHUGEPAGE) on the range and then there's no need to undo
the reservation later during memory pressure.

The reservation and promotion is a bit like auto-detecting when
MADV_NOHUGEPAGE should be set, so it boils down of how much of a
corner case that is.

I'm not so concerned about the RAM wasted because I don't think it's
very significant, after all the application can just do a smaller
malloc if it wants to reduce memory usage.

A massive amount of huge RAM waste is fairly rare and to the extreme
it could still be wasted even with 4k if the app uses only 1 bit from
every 4k page it allocates with malloc.

I'm more concerned about cases where THP is wasting CPU: like in redis
that is hurted by the 2M COWs. redis will map all pages and they will
be all promoted to THP also with the reservation logic applied, but
when the parent writes to the memory (after fork) it must trigger 4k
cows (not 2M cows) and in turn split the THP before the COW, or it
won't work as fast as with THP disabled. In addition we should try to
reuse the same IPI for the transhuge pmd split to cover the COW too.

If we add the reservation and that work makes zero difference for the
redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not
great in my view. It looks like we're trying to optimize issues that
are less critical.

The redis+THP case should be possible to optimize later with uffd WP
model (once completed, Peter Xu is working on it), and uffd WP will
also remove fork() and it'll convert it to a clone(). The granularity
of the fault is decided by the userland that way so when uffd
wrprotects a 4k fragment of a THP, the THP will be split during the
uffd mprotect ioctl.

> > Now about the implementation: the whole point of the reservation
> > complexity is to skip the khugepaged copy, so it can collapse in
> > place. Is skipping the copy worth it? Isn't the big cost the IPI
> > anyway to avoid leaving two simultaneous TLB mappings of different
> > granularity?
> > 
> 
> Not necessarily. With THP anon in the simple case, it might be just a
> single thread and kcompact so that's one IPI (kcompactd flushes local and
> one IPI to the CPU the thread was running on assuming it's not migrating
> excessively). It would scale up with the number of threads but I suspect
> the main cost is the actual copying, page table manipulation and the
> locking required.

Agreed, the IPI wouldn't be a concern for a single threaded app. I was
looking more at the worst case scenario. For a single threaded app the
locking should not be too bad either.

> As an aside, a universal benefit would be looking at reducing the time
> to allocate the necessary huge page as we know that can be excessive. It
> would be ortogonal to this series.

With what I suggested the allocation would happen as usual in
khugepaged at slow peace, without holding locks. So I don't see
obvious disadvantages in terms of THP allocation latency.

> Could you and Kirill outline what sort of workloads you would consider
> acceptable for evaluating this series? One would assume it covers at
> least the following, potentially with a number of workloads.

I would prefer to add intelligence to detect when COWs after fork
should be done at 2m or 4k granularity (in the latter case by
splitting the pmd before the actual COW while leaving the transhuge
pmd intact in the other mm), because that would save CPU (and it'd
automatically optimize redis). The snapshot process especially would
run faster as it will read with THP performance.

I'm more worried to ensure THP doesn't cause more CPU usage like it
happens to the above case in COWs, than to just try to save RAM when
the virtual ranges are only partially utilized by the app.

> 1. Evaluate the collapse and copying costs (probing the entire time
>    spent in collapse_huge_page might do it)
> 2. Evaluate mmap_sem hold time during hugepage collapse
> 3. Estimate excessive RAM use due to unnecessary THP usage
> 4. Estimate the slowdown due to delayed THP usage 
> 
> 1 and 2 would indicate how much time is lost due to not using
> reservations. That potentially goes in the direction of simply making
> this faster -- fragmentation reduction (posted but unreviewed), faster
> compaction searches, better page isolation during compaction to
> avoid free pages being reused before an order-9 is free.
> 
> 3 should be straight-forward but 4 would be the hardest to evaluate
> because it would have to be determimed if 4 is offset by improvements to
> 1-3. If 1-3 is improved enough, it might remove the motivation for the
> series entirely.
> 
> In other words, if we agree on a workload in advance, it might bring
> this the right direction and not accidentally throw Anthony down a hole
> working on a series that never gets ack'd.
> 
> I'm not necessarily the best person to answer because my natural inclination
> after the fragmentation series would be to keep using thpfiosacle
> (from the fragmentation avoidance series) and work on improving the THP
> allocation success rates and reduce latencies. I've tunnel vision on that
> for the moment.

Deciding the workloads is a good question indeed, but I would also be
curious to how many of those pages would not end up to be promoted
with this logic.

What's the number of pte_none that you require in each pmd to avoid
promotion? If it's just 1 then apps will run slower, if there's
partial utilization THP already helps. I've an hard time to think at
an ideal ratio, this is why max_ptes_none is 511 after all.

Can we start by counting the total number of pte_none() in all pmds
that can fit a THP according to vma->vm_start/end? The pagetable
dumper in debugfs may already provide the info we need by scanning all
mm and by printing the number of "none" pte that would generate
"wasted" memory (and marginally wasted CPU during copy/clear).

Then you can exactly tell how many pmds won't be promoted to transhuge
pmds with the patch applied in the real life workloads, even before
running any benchmark. It'd be good to be sure we're talking about a
significant number in real life workloads or there's not much to
optimize to begin with.

If the amount of RAM saved is significant in real life workloads and
in turn there's a chance of having a worthwhile tradeoff from the
reservation logic, then we can do the benchmarks because the behavior
will be different for the page fault, and it'll end up running slower
with the reservation logic.

Thanks,
Andrea
Anthony Yznaga Nov. 14, 2018, 11:15 p.m. UTC | #13
On 11/10/2018 08:44 AM, Andrea Arcangeli wrote:
> On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
>> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
>>> And if you're in the camp that is concerned about the use of more RAM
>>> or/and about the higher latency of COW faults, I'm afraid the
>>> intermediate solution will be still slower than the already available
>>> MADV_NOHUGEPAGE or enabled=madvise.
>>>
>> Does that not prevent huge page usage? Maybe you can spell it out a bit
> Yes it prevents huge page usage, but preventing the huge page usage is
> also what is achieved with the reservation.
>
>> better. What is the set of system calls an application should make to
>> not use huge pages either for the address space or on a per-VMA basis
>> and defer to kcompactd? I know that can be tuned globally but that's not
>> quite the same thing given that multiple applications or containers can
>> be running with different requirements.
> Yes, in terms of inheritance that could be used to tune a container
> we've only PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE
> useless too, but then for microservices that should not be a
> concern. How to make those sysfs tunables reentrant in namespaces is a
> separate issue I think.
>
> The difference is that with the reservation over time they can be
> promoted, with MADV_NOHUGEPAGE they cannot become hugepages later, not
> even khugepaged will scan that vma anymore.
>
> The benefit of the reservation will showup in those regions that will
> not become hugepages, so if you can predict beforehand that those
> ranges don't benefit from THP, it's better if userland calls
> madvise(MADV_NOHUGEPAGE) on the range and then there's no need to undo
> the reservation later during memory pressure.
>
> The reservation and promotion is a bit like auto-detecting when
> MADV_NOHUGEPAGE should be set, so it boils down of how much of a
> corner case that is.
>
> I'm not so concerned about the RAM wasted because I don't think it's
> very significant, after all the application can just do a smaller
> malloc if it wants to reduce memory usage.
>
> A massive amount of huge RAM waste is fairly rare and to the extreme
> it could still be wasted even with 4k if the app uses only 1 bit from
> every 4k page it allocates with malloc.
>
> I'm more concerned about cases where THP is wasting CPU: like in redis
> that is hurted by the 2M COWs. redis will map all pages and they will
> be all promoted to THP also with the reservation logic applied, but
> when the parent writes to the memory (after fork) it must trigger 4k
> cows (not 2M cows) and in turn split the THP before the COW, or it
> won't work as fast as with THP disabled. In addition we should try to
> reuse the same IPI for the transhuge pmd split to cover the COW too.
>
> If we add the reservation and that work makes zero difference for the
> redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not
> great in my view. It looks like we're trying to optimize issues that
> are less critical.
>
> The redis+THP case should be possible to optimize later with uffd WP
> model (once completed, Peter Xu is working on it), and uffd WP will
> also remove fork() and it'll convert it to a clone(). The granularity
> of the fault is decided by the userland that way so when uffd
> wrprotects a 4k fragment of a THP, the THP will be split during the
> uffd mprotect ioctl.
>
>>> Now about the implementation: the whole point of the reservation
>>> complexity is to skip the khugepaged copy, so it can collapse in
>>> place. Is skipping the copy worth it? Isn't the big cost the IPI
>>> anyway to avoid leaving two simultaneous TLB mappings of different
>>> granularity?
>>>
>> Not necessarily. With THP anon in the simple case, it might be just a
>> single thread and kcompact so that's one IPI (kcompactd flushes local and
>> one IPI to the CPU the thread was running on assuming it's not migrating
>> excessively). It would scale up with the number of threads but I suspect
>> the main cost is the actual copying, page table manipulation and the
>> locking required.
> Agreed, the IPI wouldn't be a concern for a single threaded app. I was
> looking more at the worst case scenario. For a single threaded app the
> locking should not be too bad either.
>
>> As an aside, a universal benefit would be looking at reducing the time
>> to allocate the necessary huge page as we know that can be excessive. It
>> would be ortogonal to this series.
> With what I suggested the allocation would happen as usual in
> khugepaged at slow peace, without holding locks. So I don't see
> obvious disadvantages in terms of THP allocation latency.
>
>> Could you and Kirill outline what sort of workloads you would consider
>> acceptable for evaluating this series? One would assume it covers at
>> least the following, potentially with a number of workloads.
> I would prefer to add intelligence to detect when COWs after fork
> should be done at 2m or 4k granularity (in the latter case by
> splitting the pmd before the actual COW while leaving the transhuge
> pmd intact in the other mm), because that would save CPU (and it'd
> automatically optimize redis). The snapshot process especially would
> run faster as it will read with THP performance.
And presumably to maintain the performance benefit in subsequent
snapshots the original split PMD would need to be re-promoted
prior to forking or promoted in the child during fork?

>
> I'm more worried to ensure THP doesn't cause more CPU usage like it
> happens to the above case in COWs, than to just try to save RAM when
> the virtual ranges are only partially utilized by the app.
>
>> 1. Evaluate the collapse and copying costs (probing the entire time
>>    spent in collapse_huge_page might do it)
>> 2. Evaluate mmap_sem hold time during hugepage collapse
>> 3. Estimate excessive RAM use due to unnecessary THP usage
>> 4. Estimate the slowdown due to delayed THP usage 
>>
>> 1 and 2 would indicate how much time is lost due to not using
>> reservations. That potentially goes in the direction of simply making
>> this faster -- fragmentation reduction (posted but unreviewed), faster
>> compaction searches, better page isolation during compaction to
>> avoid free pages being reused before an order-9 is free.
>>
>> 3 should be straight-forward but 4 would be the hardest to evaluate
>> because it would have to be determimed if 4 is offset by improvements to
>> 1-3. If 1-3 is improved enough, it might remove the motivation for the
>> series entirely.
>>
>> In other words, if we agree on a workload in advance, it might bring
>> this the right direction and not accidentally throw Anthony down a hole
>> working on a series that never gets ack'd.
>>
>> I'm not necessarily the best person to answer because my natural inclination
>> after the fragmentation series would be to keep using thpfiosacle
>> (from the fragmentation avoidance series) and work on improving the THP
>> allocation success rates and reduce latencies. I've tunnel vision on that
>> for the moment.
> Deciding the workloads is a good question indeed, but I would also be
> curious to how many of those pages would not end up to be promoted
> with this logic.
>
> What's the number of pte_none that you require in each pmd to avoid
> promotion? If it's just 1 then apps will run slower, if there's
> partial utilization THP already helps. I've an hard time to think at
> an ideal ratio, this is why max_ptes_none is 511 after all.
>
> Can we start by counting the total number of pte_none() in all pmds
> that can fit a THP according to vma->vm_start/end? The pagetable
> dumper in debugfs may already provide the info we need by scanning all
> mm and by printing the number of "none" pte that would generate
> "wasted" memory (and marginally wasted CPU during copy/clear).
>
> Then you can exactly tell how many pmds won't be promoted to transhuge
> pmds with the patch applied in the real life workloads, even before
> running any benchmark. It'd be good to be sure we're talking about a
> significant number in real life workloads or there's not much to
> optimize to begin with.
>
> If the amount of RAM saved is significant in real life workloads and
> in turn there's a chance of having a worthwhile tradeoff from the
> reservation logic, then we can do the benchmarks because the behavior
> will be different for the page fault, and it'll end up running slower
> with the reservation logic.

Thank you, Andrea and Mel, for the feedback.  I really appreciate it.
I'm going to proceed as suggested and evaluate the huge page
collapse and copy costs and perform more analysis on the potential
RAM savings.

Anthony

>
> Thanks,
> Andrea
Kirill A. Shutemov Nov. 20, 2018, 9:11 a.m. UTC | #14
On Sat, Nov 10, 2018 at 11:44:12AM -0500, Andrea Arcangeli wrote:
> I would prefer to add intelligence to detect when COWs after fork
> should be done at 2m or 4k granularity (in the latter case by
> splitting the pmd before the actual COW while leaving the transhuge
> pmd intact in the other mm), because that would save CPU (and it'd
> automatically optimize redis). The snapshot process especially would
> run faster as it will read with THP performance.

I would argue we should switch to 4k COW everywhere. But it requires some
work on khugepaged side to be able to recover THP back after multiple 4k
COW in the range. Currently khugepaged is not able to collapse PTE entires
backed by compound page back to PMD.

I have this on my todo list for long time, but...
Andrea Arcangeli Nov. 20, 2018, 5:04 p.m. UTC | #15
On Tue, Nov 20, 2018 at 12:11:22PM +0300, Kirill A. Shutemov wrote:
> On Sat, Nov 10, 2018 at 11:44:12AM -0500, Andrea Arcangeli wrote:
> > I would prefer to add intelligence to detect when COWs after fork
> > should be done at 2m or 4k granularity (in the latter case by
> > splitting the pmd before the actual COW while leaving the transhuge
> > pmd intact in the other mm), because that would save CPU (and it'd
> > automatically optimize redis). The snapshot process especially would
> > run faster as it will read with THP performance.
> 
> I would argue we should switch to 4k COW everywhere. But it requires some

We could do that if MADV_HUGEPAGE is not set for example. So there
would still be a way to force the 2M cows if something benefits from
them.

For example with binaries executed in tmpfs one could want 2M cows on
MAP_PRIVATE to keep all the executable in 2MB tlbs despite the memory
loss (but then there are those libs that apparently aren't released to
load the binaries into THP anon too for the same reason and with even
higher memory waste risk as unlike tmpfs nothing can be shared if you
run multiple copies of a go large binary or something).

Certainly it would help whenever fork() is used for snapshotting
purposes, but then fork() used for snapshotting purposes doesn't look
the best mechanism possible for atomic snapshots.

It would be interesting to know which other common workloads will
benefit, for workloads that unlike fork()-for-snapshot, are already as
optimal as it can get.

> work on khugepaged side to be able to recover THP back after multiple 4k
> COW in the range. Currently khugepaged is not able to collapse PTE entires
> backed by compound page back to PMD.

Yes this is also answering Anthony question about what shall happen
after to the 4k cows on the doublemap.

The thing is, by the time khugepaged comes around, the child will
hopefully already have quit, so it would be ideal if it can understand
the anon page isn't even shared anymore, it's fully private to the
process after holding the mmap_sem for writing, so if it's not-shared
anymore and mapcount is 1, khugepaged doesn't need to do the 2M cow of
the doublemap THP at all, it just needs to flush the 4k fragment back
to the THP and drop the doublemap and convert the readonly pte entries
to a writable pmd_trans_huge (if VM_WRITE is still set).

> I have this on my todo list for long time, but...

We're also slowly making progress on the uffd-wp to offer a hopefully
way more efficient way to do the snapshot than using fork(), then the
whole fork thing won't be an issue because there will be no fork.
Anthony Yznaga Jan. 25, 2019, 2:28 a.m. UTC | #16
On 11/14/18 3:15 PM, anthony.yznaga@oracle.com wrote:
> 
> 
> On 11/10/2018 08:44 AM, Andrea Arcangeli wrote:
>> On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
>>> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
>>>> And if you're in the camp that is concerned about the use of more RAM
>>>> or/and about the higher latency of COW faults, I'm afraid the
>>>> intermediate solution will be still slower than the already available
>>>> MADV_NOHUGEPAGE or enabled=madvise.
>>>>
>>> Does that not prevent huge page usage? Maybe you can spell it out a bit
>> Yes it prevents huge page usage, but preventing the huge page usage is
>> also what is achieved with the reservation.
>>
>>> better. What is the set of system calls an application should make to
>>> not use huge pages either for the address space or on a per-VMA basis
>>> and defer to kcompactd? I know that can be tuned globally but that's not
>>> quite the same thing given that multiple applications or containers can
>>> be running with different requirements.
>> Yes, in terms of inheritance that could be used to tune a container
>> we've only PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE
>> useless too, but then for microservices that should not be a
>> concern. How to make those sysfs tunables reentrant in namespaces is a
>> separate issue I think.
>>
>> The difference is that with the reservation over time they can be
>> promoted, with MADV_NOHUGEPAGE they cannot become hugepages later, not
>> even khugepaged will scan that vma anymore.
>>
>> The benefit of the reservation will showup in those regions that will
>> not become hugepages, so if you can predict beforehand that those
>> ranges don't benefit from THP, it's better if userland calls
>> madvise(MADV_NOHUGEPAGE) on the range and then there's no need to undo
>> the reservation later during memory pressure.
>>
>> The reservation and promotion is a bit like auto-detecting when
>> MADV_NOHUGEPAGE should be set, so it boils down of how much of a
>> corner case that is.
>>
>> I'm not so concerned about the RAM wasted because I don't think it's
>> very significant, after all the application can just do a smaller
>> malloc if it wants to reduce memory usage.
>>
>> A massive amount of huge RAM waste is fairly rare and to the extreme
>> it could still be wasted even with 4k if the app uses only 1 bit from
>> every 4k page it allocates with malloc.
>>
>> I'm more concerned about cases where THP is wasting CPU: like in redis
>> that is hurted by the 2M COWs. redis will map all pages and they will
>> be all promoted to THP also with the reservation logic applied, but
>> when the parent writes to the memory (after fork) it must trigger 4k
>> cows (not 2M cows) and in turn split the THP before the COW, or it
>> won't work as fast as with THP disabled. In addition we should try to
>> reuse the same IPI for the transhuge pmd split to cover the COW too.
>>
>> If we add the reservation and that work makes zero difference for the
>> redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not
>> great in my view. It looks like we're trying to optimize issues that
>> are less critical.
>>
>> The redis+THP case should be possible to optimize later with uffd WP
>> model (once completed, Peter Xu is working on it), and uffd WP will
>> also remove fork() and it'll convert it to a clone(). The granularity
>> of the fault is decided by the userland that way so when uffd
>> wrprotects a 4k fragment of a THP, the THP will be split during the
>> uffd mprotect ioctl.
>>
>>>> Now about the implementation: the whole point of the reservation
>>>> complexity is to skip the khugepaged copy, so it can collapse in
>>>> place. Is skipping the copy worth it? Isn't the big cost the IPI
>>>> anyway to avoid leaving two simultaneous TLB mappings of different
>>>> granularity?
>>>>
>>> Not necessarily. With THP anon in the simple case, it might be just a
>>> single thread and kcompact so that's one IPI (kcompactd flushes local and
>>> one IPI to the CPU the thread was running on assuming it's not migrating
>>> excessively). It would scale up with the number of threads but I suspect
>>> the main cost is the actual copying, page table manipulation and the
>>> locking required.
>> Agreed, the IPI wouldn't be a concern for a single threaded app. I was
>> looking more at the worst case scenario. For a single threaded app the
>> locking should not be too bad either.
>>
>>> As an aside, a universal benefit would be looking at reducing the time
>>> to allocate the necessary huge page as we know that can be excessive. It
>>> would be ortogonal to this series.
>> With what I suggested the allocation would happen as usual in
>> khugepaged at slow peace, without holding locks. So I don't see
>> obvious disadvantages in terms of THP allocation latency.
>>
>>> Could you and Kirill outline what sort of workloads you would consider
>>> acceptable for evaluating this series? One would assume it covers at
>>> least the following, potentially with a number of workloads.
>> I would prefer to add intelligence to detect when COWs after fork
>> should be done at 2m or 4k granularity (in the latter case by
>> splitting the pmd before the actual COW while leaving the transhuge
>> pmd intact in the other mm), because that would save CPU (and it'd
>> automatically optimize redis). The snapshot process especially would
>> run faster as it will read with THP performance.
> And presumably to maintain the performance benefit in subsequent
> snapshots the original split PMD would need to be re-promoted
> prior to forking or promoted in the child during fork?
> 
>>
>> I'm more worried to ensure THP doesn't cause more CPU usage like it
>> happens to the above case in COWs, than to just try to save RAM when
>> the virtual ranges are only partially utilized by the app.
>>
>>> 1. Evaluate the collapse and copying costs (probing the entire time
>>>    spent in collapse_huge_page might do it)
>>> 2. Evaluate mmap_sem hold time during hugepage collapse
>>> 3. Estimate excessive RAM use due to unnecessary THP usage
>>> 4. Estimate the slowdown due to delayed THP usage 
>>>
>>> 1 and 2 would indicate how much time is lost due to not using
>>> reservations. That potentially goes in the direction of simply making
>>> this faster -- fragmentation reduction (posted but unreviewed), faster
>>> compaction searches, better page isolation during compaction to
>>> avoid free pages being reused before an order-9 is free.
>>>
>>> 3 should be straight-forward but 4 would be the hardest to evaluate
>>> because it would have to be determimed if 4 is offset by improvements to
>>> 1-3. If 1-3 is improved enough, it might remove the motivation for the
>>> series entirely.
>>>
>>> In other words, if we agree on a workload in advance, it might bring
>>> this the right direction and not accidentally throw Anthony down a hole
>>> working on a series that never gets ack'd.
>>>
>>> I'm not necessarily the best person to answer because my natural inclination
>>> after the fragmentation series would be to keep using thpfiosacle
>>> (from the fragmentation avoidance series) and work on improving the THP
>>> allocation success rates and reduce latencies. I've tunnel vision on that
>>> for the moment.
>> Deciding the workloads is a good question indeed, but I would also be
>> curious to how many of those pages would not end up to be promoted
>> with this logic.
>>
>> What's the number of pte_none that you require in each pmd to avoid
>> promotion? If it's just 1 then apps will run slower, if there's
>> partial utilization THP already helps. I've an hard time to think at
>> an ideal ratio, this is why max_ptes_none is 511 after all.
>>
>> Can we start by counting the total number of pte_none() in all pmds
>> that can fit a THP according to vma->vm_start/end? The pagetable
>> dumper in debugfs may already provide the info we need by scanning all
>> mm and by printing the number of "none" pte that would generate
>> "wasted" memory (and marginally wasted CPU during copy/clear).
>>
>> Then you can exactly tell how many pmds won't be promoted to transhuge
>> pmds with the patch applied in the real life workloads, even before
>> running any benchmark. It'd be good to be sure we're talking about a
>> significant number in real life workloads or there's not much to
>> optimize to begin with.
>>
>> If the amount of RAM saved is significant in real life workloads and
>> in turn there's a chance of having a worthwhile tradeoff from the
>> reservation logic, then we can do the benchmarks because the behavior
>> will be different for the page fault, and it'll end up running slower
>> with the reservation logic.
> 
> Thank you, Andrea and Mel, for the feedback.  I really appreciate it.
> I'm going to proceed as suggested and evaluate the huge page
> collapse and copy costs and perform more analysis on the potential
> RAM savings.

Thanks again to everyone for the feedback.  To follow up on this, I was
unable to find a workload that could justify these changes.  If I had, I
suspect that Andrea's suggestion of a THP mode that simply avoided
allocating a hugepage on first fault would have sufficed.

I did find that khugepaged often spends the most time copying from base
pages to a huge page.  Separate from the original intent of mitigating
bloat, I explored using reservations to reduce the time in khugepaged by
allocating them for partially-mmap'd PMD-aligned regions of anon memory
in anticipation of the unmapped portion eventually being mapped (think
the tail portion of a heap).  The number of copies avoided was highly
dependent on workload and generally not very high, though, because
either a process was too short-lived for the reservation to be converted
by khugepaged or the process forked and a parent COW forced the
reservation to be released before conversion.  Too much overhead for too
little gain.  An application is better off using a THP-aware allocator.

Anthony

> 
> Anthony
> 
>>
>> Thanks,
>> Andrea
>
diff mbox series

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fdcb45999b26..a2288f134d5d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@  extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
+extern unsigned int hugepage_promotion_threshold;
 
 static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 082d1d2a5216..0011eb656ff3 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -2,6 +2,7 @@ 
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
 
+#include <linux/hashtable.h>
 #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
 
 
@@ -30,6 +31,64 @@  extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	(transparent_hugepage_flags &				\
 	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
 
+struct thp_reservation {
+	spinlock_t *lock;
+	unsigned long haddr;
+	struct page *page;
+	struct vm_area_struct *vma;
+	struct hlist_node node;
+	struct list_head lru;
+	int nr_unused;
+};
+
+struct thp_resvs {
+	atomic_t refcnt;
+	spinlock_t res_hash_lock;
+	DECLARE_HASHTABLE(res_hash, 7);
+};
+
+#define	vma_thp_reservations(vma)	((vma)->thp_reservations)
+
+static inline void thp_resvs_fork(struct vm_area_struct *vma,
+				  struct vm_area_struct *pvma)
+{
+	// XXX Do not share THP reservations for now
+	vma->thp_reservations = NULL;
+}
+
+void thp_resvs_new(struct vm_area_struct *vma);
+
+extern void __thp_resvs_put(struct thp_resvs *r);
+static inline void thp_resvs_put(struct thp_resvs *r)
+{
+	if (r)
+		__thp_resvs_put(r);
+}
+
+void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
+				  unsigned long address, int delta);
+
+struct page *khugepaged_get_reserved_page(
+	struct vm_area_struct *vma,
+	unsigned long address);
+
+void khugepaged_reserve(struct vm_area_struct *vma,
+			unsigned long address);
+
+void khugepaged_release_reservation(struct vm_area_struct *vma,
+				    unsigned long address);
+
+void _khugepaged_reservations_fixup(struct vm_area_struct *src,
+				   struct vm_area_struct *dst);
+
+void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
+				      struct vm_area_struct *next, long adjust);
+
+void thp_reservations_mremap(struct vm_area_struct *vma,
+		unsigned long old_addr, struct vm_area_struct *new_vma,
+		unsigned long new_addr, unsigned long len,
+		bool need_rmap_locks);
+
 static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
@@ -56,6 +115,66 @@  static inline int khugepaged_enter(struct vm_area_struct *vma,
 	return 0;
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define	vma_thp_reservations(vma)	NULL
+
+static inline void thp_resvs_fork(struct vm_area_struct *vma,
+				  struct vm_area_struct *pvma)
+{
+}
+
+static inline void thp_resvs_new(struct vm_area_struct *vma)
+{
+}
+
+static inline void __thp_resvs_put(struct thp_resvs *r)
+{
+}
+
+static inline void thp_resvs_put(struct thp_resvs *r)
+{
+}
+
+static inline void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
+					      unsigned long address, int delta)
+{
+}
+
+static inline struct page *khugepaged_get_reserved_page(
+	struct vm_area_struct *vma,
+	unsigned long address)
+{
+	return NULL;
+}
+
+static inline void khugepaged_reserve(struct vm_area_struct *vma,
+			       unsigned long address)
+{
+}
+
+static inline void khugepaged_release_reservation(struct vm_area_struct *vma,
+				    unsigned long address)
+{
+}
+
+static inline void _khugepaged_reservations_fixup(struct vm_area_struct *src,
+				   struct vm_area_struct *dst)
+{
+}
+
+static inline void _khugepaged_move_reservations_adj(
+				struct vm_area_struct *prev,
+				struct vm_area_struct *next, long adjust)
+{
+}
+
+static inline void thp_reservations_mremap(struct vm_area_struct *vma,
+		unsigned long old_addr, struct vm_area_struct *new_vma,
+		unsigned long new_addr, unsigned long len,
+		bool need_rmap_locks)
+{
+}
+
 static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	return 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602167df..6342d5f67f75 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -787,6 +787,7 @@  static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void mem_cgroup_collapse_huge_fixup(struct page *head);
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
@@ -1087,6 +1088,10 @@  unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	return 0;
 }
 
+static inline void mem_cgroup_collapse_huge_fixup(struct page *head)
+{
+}
+
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..72a9f431145e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -322,6 +322,9 @@  struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct thp_resvs *thp_reservations;
+#endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4b0c79d2924..7deac5a1f25d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -181,6 +181,7 @@  enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
+	NR_THP_RESERVED,	/* Unused small pages in THP reservations */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..f3d34db7e9d5 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -88,6 +88,8 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_RES_ALLOC,
+		THP_RES_ALLOC_FAILED,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b58479534f..a15d1cda1958 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -527,6 +527,8 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (is_vm_hugetlb_page(tmp))
 			reset_vma_resv_huge_pages(tmp);
 
+		thp_resvs_fork(tmp, mpnt);
+
 		/*
 		 * Link in the new vma and copy the page table entries.
 		 */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index deed97fba979..aa80b9c54d1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -57,6 +57,8 @@ 
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
+unsigned int hugepage_promotion_threshold __read_mostly = HPAGE_PMD_NR / 2;
+
 static struct shrinker deferred_split_shrinker;
 
 static atomic_t huge_zero_refcount;
@@ -288,6 +290,28 @@  static ssize_t use_zero_page_store(struct kobject *kobj,
 static struct kobj_attribute use_zero_page_attr =
 	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
 
+static ssize_t promotion_threshold_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", hugepage_promotion_threshold);
+}
+static ssize_t promotion_threshold_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int err;
+	unsigned long promotion_threshold;
+
+	err = kstrtoul(buf, 10, &promotion_threshold);
+	if (err || promotion_threshold < 1 || promotion_threshold > HPAGE_PMD_NR)
+		return -EINVAL;
+
+	hugepage_promotion_threshold = promotion_threshold;
+
+	return count;
+}
+static struct kobj_attribute promotion_threshold_attr =
+	__ATTR(promotion_threshold, 0644, promotion_threshold_show, promotion_threshold_store);
+
 static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -318,6 +342,7 @@  static ssize_t debug_cow_store(struct kobject *kobj,
 	&enabled_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
+	&promotion_threshold_attr.attr,
 	&hpage_pmd_size_attr.attr,
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
 	&shmem_enabled_attr.attr,
@@ -670,6 +695,10 @@  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 
+	if (hugepage_promotion_threshold > 1) {
+		khugepaged_reserve(vma, vmf->address);
+		return VM_FAULT_FALLBACK;
+	}
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a31d740e6cd1..55d380f8ce71 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -8,6 +8,7 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/shrinker.h>
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
 #include <linux/khugepaged.h>
@@ -56,6 +57,7 @@  enum scan_result {
 /* default scan 8*512 pte (or vmas) every 30 second */
 static unsigned int khugepaged_pages_to_scan __read_mostly;
 static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_res_pages_collapsed;
 static unsigned int khugepaged_full_scans;
 static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
@@ -76,6 +78,445 @@  enum scan_result {
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
+struct list_lru thp_reservations_lru;
+
+void thp_resvs_new(struct vm_area_struct *vma)
+{
+	struct thp_resvs *new = NULL;
+
+	if (hugepage_promotion_threshold == 1)
+		goto done;
+
+	new = kzalloc(sizeof(struct thp_resvs), GFP_KERNEL);
+	if (!new)
+		goto done;
+
+	atomic_set(&new->refcnt, 1);
+	spin_lock_init(&new->res_hash_lock);
+	hash_init(new->res_hash);
+
+done:
+	vma->thp_reservations = new;
+}
+
+void __thp_resvs_put(struct thp_resvs *resv)
+{
+	if (!atomic_dec_and_test(&resv->refcnt))
+		return;
+
+	kfree(resv);
+}
+
+static struct thp_reservation *khugepaged_find_reservation(
+	struct vm_area_struct *vma,
+	unsigned long address)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct thp_reservation *res = NULL;
+
+	if (!vma->thp_reservations)
+		return NULL;
+
+	hash_for_each_possible(vma->thp_reservations->res_hash, res, node, haddr) {
+		if (res->haddr == haddr)
+			break;
+	}
+	return res;
+}
+
+static void khugepaged_free_reservation(struct thp_reservation *res)
+{
+	struct page *page;
+	int unused;
+	int i;
+
+	list_lru_del(&thp_reservations_lru, &res->lru);
+	hash_del(&res->node);
+	page = res->page;
+	unused = res->nr_unused;
+
+	kfree(res);
+
+	if (!PageCompound(page)) {
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			put_page(page + i);
+
+		if (unused) {
+			mod_node_page_state(page_pgdat(page), NR_THP_RESERVED,
+					    -unused);
+		}
+	}
+}
+
+void khugepaged_reserve(struct vm_area_struct *vma, unsigned long address)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct thp_reservation *res;
+	struct page *page;
+	gfp_t gfp;
+	int i;
+
+	if (!vma->thp_reservations)
+		return;
+	if (!vma_is_anonymous(vma))
+		return;
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return;
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+
+	if (khugepaged_find_reservation(vma, address)) {
+		spin_unlock(&vma->thp_reservations->res_hash_lock);
+		return;
+	}
+
+	/*
+	 * Allocate the equivalent of a huge page but not as a compound page
+	 */
+	gfp = GFP_TRANSHUGE_LIGHT & ~__GFP_COMP;
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+	if (unlikely(!page)) {
+		count_vm_event(THP_RES_ALLOC_FAILED);
+		spin_unlock(&vma->thp_reservations->res_hash_lock);
+		return;
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		set_page_count(page + i, 1);
+
+	res = kzalloc(sizeof(*res), GFP_KERNEL);
+	if (!res) {
+		count_vm_event(THP_RES_ALLOC_FAILED);
+		__free_pages(page, HPAGE_PMD_ORDER);
+		spin_unlock(&vma->thp_reservations->res_hash_lock);
+		return;
+	}
+
+	count_vm_event(THP_RES_ALLOC);
+
+	res->haddr = haddr;
+	res->page = page;
+	res->vma = vma;
+	res->lock = &vma->thp_reservations->res_hash_lock;
+	hash_add(vma->thp_reservations->res_hash, &res->node, haddr);
+
+	INIT_LIST_HEAD(&res->lru);
+	list_lru_add(&thp_reservations_lru, &res->lru);
+
+	res->nr_unused = HPAGE_PMD_NR;
+	mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, HPAGE_PMD_NR);
+
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+
+	khugepaged_enter(vma, vma->vm_flags);
+}
+
+struct page *khugepaged_get_reserved_page(struct vm_area_struct *vma,
+					  unsigned long address)
+{
+	struct thp_reservation *res;
+	struct page *page;
+
+	if (!transparent_hugepage_enabled(vma))
+		return NULL;
+	if (!vma->thp_reservations)
+		return NULL;
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+
+	page = NULL;
+	res = khugepaged_find_reservation(vma, address);
+	if (res) {
+		unsigned long offset = address & ~HPAGE_PMD_MASK;
+
+		page = res->page + (offset >> PAGE_SHIFT);
+		get_page(page);
+
+		list_lru_del(&thp_reservations_lru, &res->lru);
+		list_lru_add(&thp_reservations_lru, &res->lru);
+
+		dec_node_page_state(res->page, NR_THP_RESERVED);
+	}
+
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+
+	return page;
+}
+
+void khugepaged_release_reservation(struct vm_area_struct *vma,
+				    unsigned long address)
+{
+	struct thp_reservation *res;
+
+	if (!vma->thp_reservations)
+		return;
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+
+	res = khugepaged_find_reservation(vma, address);
+	if (!res)
+		goto out;
+
+	khugepaged_free_reservation(res);
+
+out:
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+}
+
+/*
+ * Release all reservations covering a range in a VMA.
+ */
+void __khugepaged_release_reservations(struct vm_area_struct *vma,
+				       unsigned long addr, unsigned long len)
+{
+	struct thp_reservation *res;
+	struct hlist_node *tmp;
+	unsigned long eaddr;
+	int i;
+
+	if (!vma->thp_reservations)
+		return;
+
+	eaddr = addr + len;
+	addr &= HPAGE_PMD_MASK;
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+
+	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
+		unsigned long hstart = res->haddr;
+
+		if (hstart >= addr && hstart < eaddr)
+			khugepaged_free_reservation(res);
+	}
+
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+}
+
+static void __khugepaged_move_reservations(struct vm_area_struct *src,
+					   struct vm_area_struct *dst,
+					   unsigned long split_addr,
+					   bool dst_is_below)
+{
+	struct thp_reservation *res;
+	struct hlist_node *tmp;
+	bool free_res = false;
+	int i;
+
+	if (!src->thp_reservations)
+		return;
+
+	if (!dst->thp_reservations)
+		free_res = true;
+
+	spin_lock(&src->thp_reservations->res_hash_lock);
+	if (!free_res)
+		spin_lock(&dst->thp_reservations->res_hash_lock);
+
+	hash_for_each_safe(src->thp_reservations->res_hash, i, tmp, res, node) {
+		unsigned long hstart = res->haddr;
+
+		/*
+		 * Free the reservation if it straddles a non-aligned
+		 * split address.
+		 */
+		if ((split_addr & ~HPAGE_PMD_MASK) &&
+		    (hstart == (split_addr & HPAGE_PMD_MASK))) {
+			khugepaged_free_reservation(res);
+			continue;
+		} else if (dst_is_below) {
+			if (hstart >= split_addr)
+				continue;
+		} else if (hstart < split_addr) {
+			continue;
+		}
+
+		if (unlikely(free_res)) {
+			khugepaged_free_reservation(res);
+			continue;
+		}
+
+		hash_del(&res->node);
+		res->vma = dst;
+		res->lock = &dst->thp_reservations->res_hash_lock;
+		hash_add(dst->thp_reservations->res_hash, &res->node, res->haddr);
+	}
+
+	if (!free_res)
+		spin_unlock(&dst->thp_reservations->res_hash_lock);
+	spin_unlock(&src->thp_reservations->res_hash_lock);
+}
+
+/*
+ * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
+ */
+static void take_rmap_locks(struct vm_area_struct *vma)
+{
+	if (vma->vm_file)
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+	if (vma->anon_vma)
+		anon_vma_lock_write(vma->anon_vma);
+}
+
+/*
+ * XXX dup from mm/mremap.c.  Move thp_reservations_mremap() to mm/mremap.c?
+ */
+static void drop_rmap_locks(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (vma->vm_file)
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
+void thp_reservations_mremap(struct vm_area_struct *vma,
+		unsigned long old_addr, struct vm_area_struct *new_vma,
+		unsigned long new_addr, unsigned long len,
+		bool need_rmap_locks)
+{
+
+	struct thp_reservation *res;
+	unsigned long eaddr, offset;
+	struct hlist_node *tmp;
+	int i;
+
+	if (!vma->thp_reservations)
+		return;
+
+	if (!new_vma->thp_reservations) {
+		__khugepaged_release_reservations(vma, old_addr, len);
+		return;
+	}
+
+	/*
+	 * Release all reservations if they will no longer be aligned
+	 * in the new address range.
+	 */
+	if ((new_addr & ~HPAGE_PMD_MASK) != (old_addr & ~HPAGE_PMD_MASK)) {
+		__khugepaged_release_reservations(vma, old_addr, len);
+		return;
+	}
+
+	if (need_rmap_locks)
+		take_rmap_locks(vma);
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+	spin_lock(&new_vma->thp_reservations->res_hash_lock);
+
+	/*
+	 * If the start or end addresses of the range are not huge page
+	 * aligned, check for overlapping reservations and release them.
+	 */
+	if (old_addr & ~HPAGE_PMD_MASK) {
+		res = khugepaged_find_reservation(vma, old_addr);
+		if (res)
+			khugepaged_free_reservation(res);
+	}
+
+	eaddr = old_addr + len;
+	if (eaddr & ~HPAGE_PMD_MASK) {
+		res = khugepaged_find_reservation(vma, eaddr);
+		if (res)
+			khugepaged_free_reservation(res);
+	}
+
+	offset = new_addr - old_addr;
+
+	hash_for_each_safe(vma->thp_reservations->res_hash, i, tmp, res, node) {
+		unsigned long hstart = res->haddr;
+
+		if (hstart < old_addr || hstart >= eaddr)
+			continue;
+
+		hash_del(&res->node);
+		res->lock = &new_vma->thp_reservations->res_hash_lock;
+		res->vma = new_vma;
+		res->haddr += offset;
+		hash_add(new_vma->thp_reservations->res_hash, &res->node, res->haddr);
+	}
+
+	spin_unlock(&new_vma->thp_reservations->res_hash_lock);
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+
+	if (need_rmap_locks)
+		drop_rmap_locks(vma);
+
+}
+
+/*
+ * Handle moving reservations for VMA merge cases 1, 6, 7, and 8 (see
+ * comments above vma_merge()) and when splitting a VMA.
+ *
+ * src is expected to be aligned with the start or end of dst
+ * src may be contained by dst or directly adjacent to dst
+ * Move all reservations if src is contained by dst.
+ * Otherwise move reservations no longer in the range of src
+ * to dst.
+ */
+void _khugepaged_reservations_fixup(struct vm_area_struct *src,
+				    struct vm_area_struct *dst)
+{
+	bool dst_is_below = false;
+	unsigned long split_addr;
+
+	if (src->vm_start == dst->vm_start || src->vm_end == dst->vm_end) {
+		split_addr = 0;
+	} else if (src->vm_start == dst->vm_end) {
+		split_addr = src->vm_start;
+		dst_is_below = true;
+	} else if (src->vm_end == dst->vm_start) {
+		split_addr = src->vm_end;
+	} else {
+		WARN_ON(1);
+		return;
+	}
+
+	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
+}
+
+/*
+ * Handle moving reservations for VMA merge cases 4 and 5 (see comments
+ * above vma_merge()).
+ */
+void _khugepaged_move_reservations_adj(struct vm_area_struct *prev,
+				       struct vm_area_struct *next, long adjust)
+{
+	unsigned long split_addr = next->vm_start;
+	struct vm_area_struct *src, *dst;
+	bool dst_is_below;
+
+	if (adjust < 0) {
+		src = prev;
+		dst = next;
+		dst_is_below = false;
+	} else {
+		src = next;
+		dst = prev;
+		dst_is_below = true;
+	}
+
+	__khugepaged_move_reservations(src, dst, split_addr, dst_is_below);
+}
+
+void khugepaged_mod_resv_unused(struct vm_area_struct *vma,
+				unsigned long address, int delta)
+{
+	struct thp_reservation *res;
+
+	if (!vma->thp_reservations)
+		return;
+
+	spin_lock(&vma->thp_reservations->res_hash_lock);
+
+	res = khugepaged_find_reservation(vma, address);
+	if (res) {
+		WARN_ON((res->nr_unused == 0) || (res->nr_unused + delta < 0));
+		if (res->nr_unused + delta >= 0)
+			res->nr_unused += delta;
+	}
+
+	spin_unlock(&vma->thp_reservations->res_hash_lock);
+}
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -197,6 +638,15 @@  static ssize_t pages_collapsed_show(struct kobject *kobj,
 static struct kobj_attribute pages_collapsed_attr =
 	__ATTR_RO(pages_collapsed);
 
+static ssize_t res_pages_collapsed_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_res_pages_collapsed);
+}
+static struct kobj_attribute res_pages_collapsed_attr =
+	__ATTR_RO(res_pages_collapsed);
+
 static ssize_t full_scans_show(struct kobject *kobj,
 			       struct kobj_attribute *attr,
 			       char *buf)
@@ -292,6 +742,7 @@  static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
 	&scan_sleep_millisecs_attr.attr,
 	&alloc_sleep_millisecs_attr.attr,
 	&khugepaged_max_ptes_swap_attr.attr,
+	&res_pages_collapsed_attr.attr,
 	NULL,
 };
 
@@ -342,8 +793,96 @@  int hugepage_madvise(struct vm_area_struct *vma,
 	return 0;
 }
 
+/*
+ * thp_lru_free_reservation() - shrinker callback to release THP reservations
+ * and free unused pages
+ *
+ * Called from list_lru_shrink_walk() in thp_resvs_shrink_scan() to free
+ * up pages when the system is under memory pressure.
+ */
+enum lru_status thp_lru_free_reservation(struct list_head *item,
+					 struct list_lru_one *lru,
+					 spinlock_t *lock,
+					 void *cb_arg)
+{
+	struct mm_struct *mm = NULL;
+	struct thp_reservation *res = container_of(item,
+						   struct thp_reservation,
+						   lru);
+	struct page *page;
+	int unused;
+	int i;
+
+	if (!spin_trylock(res->lock))
+		goto err_get_res_lock_failed;
+
+	mm = res->vma->vm_mm;
+	if (!mmget_not_zero(mm))
+		goto err_mmget;
+	if (!down_write_trylock(&mm->mmap_sem))
+		goto err_down_write_mmap_sem_failed;
+
+	list_lru_isolate(lru, item);
+	spin_unlock(lock);
+
+	hash_del(&res->node);
+
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+
+	spin_unlock(res->lock);
+
+	page = res->page;
+	unused = res->nr_unused;
+
+	kfree(res);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		put_page(page + i);
+
+	if (unused)
+		mod_node_page_state(page_pgdat(page), NR_THP_RESERVED, -unused);
+
+	spin_lock(lock);
+
+	return LRU_REMOVED_RETRY;
+
+err_down_write_mmap_sem_failed:
+	mmput_async(mm);
+err_mmget:
+	spin_unlock(res->lock);
+err_get_res_lock_failed:
+	return LRU_SKIP;
+}
+
+static unsigned long
+thp_resvs_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long ret = list_lru_shrink_count(&thp_reservations_lru, sc);
+	return ret;
+}
+
+static unsigned long
+thp_resvs_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long ret;
+
+	ret = list_lru_shrink_walk(&thp_reservations_lru, sc,
+				   thp_lru_free_reservation, NULL);
+	return ret;
+}
+
+static struct shrinker thp_resvs_shrinker = {
+	.count_objects = thp_resvs_shrink_count,
+	.scan_objects = thp_resvs_shrink_scan,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE,
+};
+
 int __init khugepaged_init(void)
 {
+	int err;
+
 	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
 					  sizeof(struct mm_slot),
 					  __alignof__(struct mm_slot), 0, NULL);
@@ -354,6 +893,17 @@  int __init khugepaged_init(void)
 	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
 	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
 
+	// XXX should be in hugepage_init() so shrinker can be
+	// unregistered if necessary.
+	err = list_lru_init(&thp_reservations_lru);
+	if (err == 0) {
+		err = register_shrinker(&thp_resvs_shrinker);
+		if (err) {
+			list_lru_destroy(&thp_reservations_lru);
+			return err;
+		}
+	}
+
 	return 0;
 }
 
@@ -519,12 +1069,14 @@  static void release_pte_pages(pte_t *pte, pte_t *_pte)
 
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
-					pte_t *pte)
+					pte_t *pte,
+					struct thp_reservation *res)
 {
 	struct page *page = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, result = 0, referenced = 0;
 	bool writable = false;
+	bool is_reserved = res ? true : false;
 
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, address += PAGE_SIZE) {
@@ -573,7 +1125,7 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		 * The page must only be referenced by the scanned process
 		 * and page swap cache.
 		 */
-		if (page_count(page) != 1 + PageSwapCache(page)) {
+		if (page_count(page) != 1 + PageSwapCache(page) + is_reserved) {
 			unlock_page(page);
 			result = SCAN_PAGE_COUNT;
 			goto out;
@@ -631,6 +1183,68 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	return 0;
 }
 
+static void __collapse_huge_page_convert(pte_t *pte, struct page *page,
+				      struct vm_area_struct *vma,
+				      unsigned long address,
+				      spinlock_t *ptl)
+{
+	struct page *head = page;
+	pte_t *_pte;
+
+	set_page_count(page, 1);
+
+	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+				_pte++, page++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			clear_user_highpage(page, address);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+			if (is_zero_pfn(pte_pfn(pteval))) {
+				/*
+				 * ptl mostly unnecessary.
+				 */
+				spin_lock(ptl);
+				/*
+				 * paravirt calls inside pte_clear here are
+				 * superfluous.
+				 */
+				pte_clear(vma->vm_mm, address, _pte);
+				spin_unlock(ptl);
+			}
+			dec_node_page_state(page, NR_THP_RESERVED);
+		} else {
+			dec_node_page_state(page, NR_ISOLATED_ANON +
+					    page_is_file_cache(page));
+			unlock_page(page);
+			ClearPageActive(page);
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pte_clear(vma->vm_mm, address, _pte);
+			page_remove_rmap(page, false);
+			spin_unlock(ptl);
+			/*
+			 * Swapping out a page in a reservation
+			 * causes the reservation to be released
+			 * therefore no pages in a reservation
+			 * should be in swapcache.
+			 */
+			WARN_ON(PageSwapCache(page));
+		}
+	}
+
+	prep_compound_page(head, HPAGE_PMD_ORDER);
+	prep_transhuge_page(head);
+}
+
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
@@ -934,7 +1548,8 @@  static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 static void collapse_huge_page(struct mm_struct *mm,
 				   unsigned long address,
 				   struct page **hpage,
-				   int node, int referenced)
+				   int node, int referenced,
+				   struct thp_reservation *res)
 {
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
@@ -947,6 +1562,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t gfp;
+	bool is_reserved = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -959,30 +1575,38 @@  static void collapse_huge_page(struct mm_struct *mm,
 	 * sync compaction, and we do not need to hold the mmap_sem during
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
-	up_read(&mm->mmap_sem);
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
-		goto out_nolock;
-	}
+	if (res) {
+		new_page = res->page;
+		vma = res->vma;
+		is_reserved = true;
+	} else {
+		up_read(&mm->mmap_sem);
+		new_page = khugepaged_alloc_page(hpage, gfp, node);
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out_nolock;
-	}
+		if (!new_page) {
+			result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+			goto out_nolock;
+		}
 
-	down_read(&mm->mmap_sem);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result) {
-		mem_cgroup_cancel_charge(new_page, memcg, true);
-		up_read(&mm->mmap_sem);
-		goto out_nolock;
+		if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+			result = SCAN_CGROUP_CHARGE_FAIL;
+			goto out_nolock;
+		}
+
+		down_read(&mm->mmap_sem);
+		result = hugepage_vma_revalidate(mm, address, &vma);
+		if (result) {
+			mem_cgroup_cancel_charge(new_page, memcg, true);
+			up_read(&mm->mmap_sem);
+			goto out_nolock;
+		}
 	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
 		result = SCAN_PMD_NULL;
-		mem_cgroup_cancel_charge(new_page, memcg, true);
+		if (is_reserved == false)
+			mem_cgroup_cancel_charge(new_page, memcg, true);
 		up_read(&mm->mmap_sem);
 		goto out_nolock;
 	}
@@ -993,7 +1617,8 @@  static void collapse_huge_page(struct mm_struct *mm,
 	 * Continuing to collapse causes inconsistency.
 	 */
 	if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
-		mem_cgroup_cancel_charge(new_page, memcg, true);
+		if (is_reserved == false)
+			mem_cgroup_cancel_charge(new_page, memcg, true);
 		up_read(&mm->mmap_sem);
 		goto out_nolock;
 	}
@@ -1014,6 +1639,42 @@  static void collapse_huge_page(struct mm_struct *mm,
 
 	anon_vma_lock_write(vma->anon_vma);
 
+	/*
+	 * Revalidate the reservation now that the locking guarantees it will
+	 * not be released.
+	 */
+	if (is_reserved) {
+		int pgs_inuse;
+
+		res = khugepaged_find_reservation(vma, address);
+		if (!res) {
+			anon_vma_unlock_write(vma->anon_vma);
+			result = SCAN_VMA_CHECK;
+			goto out_up_write;
+		}
+
+		/*
+		 * XXX highly unlikely that the check in khugepage_scan_pmd()
+		 * would pass and this one would fail.
+		 */
+		pgs_inuse = HPAGE_PMD_NR - res->nr_unused;
+		if (pgs_inuse < hugepage_promotion_threshold) {
+			result = SCAN_PAGE_COUNT;
+			goto out_up_write;
+		}
+
+		new_page = res->page;
+
+		/* XXX
+		 * If some pages in the reservation are unused at this point,
+		 * they should be charged to a memcg if applicable.  Need to
+		 * determine the right way to do this when no further faults
+		 * can happen and the reservation will not be released.
+		 * mem_cgroup_try_charge() works by charging one page (or
+		 * huge page) at a time.
+		 */
+	}
+
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
@@ -1032,7 +1693,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte);
+	isolated = __collapse_huge_page_isolate(vma, address, pte, res);
 	spin_unlock(pte_ptl);
 
 	if (unlikely(!isolated)) {
@@ -1057,7 +1718,12 @@  static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock_write(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
+	if (is_reserved) {
+		__collapse_huge_page_convert(pte, new_page, vma, address, pte_ptl);
+		khugepaged_free_reservation(res);
+	} else {
+		__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
+	}
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -1075,7 +1741,10 @@  static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-	mem_cgroup_commit_charge(new_page, memcg, false, true);
+	if (is_reserved)
+		mem_cgroup_collapse_huge_fixup(new_page);
+	else
+		mem_cgroup_commit_charge(new_page, memcg, false, true);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -1085,6 +1754,10 @@  static void collapse_huge_page(struct mm_struct *mm,
 	*hpage = NULL;
 
 	khugepaged_pages_collapsed++;
+
+	if (is_reserved)
+		khugepaged_res_pages_collapsed++;
+
 	result = SCAN_SUCCEED;
 out_up_write:
 	up_write(&mm->mmap_sem);
@@ -1092,7 +1765,8 @@  static void collapse_huge_page(struct mm_struct *mm,
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 out:
-	mem_cgroup_cancel_charge(new_page, memcg, true);
+	if (is_reserved == false)
+		mem_cgroup_cancel_charge(new_page, memcg, true);
 	goto out_up_write;
 }
 
@@ -1109,6 +1783,7 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 	bool writable = false;
+	struct thp_reservation *res = 0;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1184,12 +1859,22 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 			goto out_unmap;
 		}
 
+		res = khugepaged_find_reservation(vma, address);
+		if (res) {
+			int pgs_inuse = HPAGE_PMD_NR - res->nr_unused;
+
+			if (pgs_inuse < hugepage_promotion_threshold) {
+				result = SCAN_PAGE_COUNT;
+				goto out_unmap;
+			}
+		}
+
 		/*
 		 * cannot use mapcount: can't collapse if there's a gup pin.
 		 * The page must only be referenced by the scanned process
 		 * and page swap cache.
 		 */
-		if (page_count(page) != 1 + PageSwapCache(page)) {
+		if (page_count(page) != 1 + PageSwapCache(page) + (res ? 1 : 0)) {
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
@@ -1213,7 +1898,7 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
-		collapse_huge_page(mm, address, hpage, node, referenced);
+		collapse_huge_page(mm, address, hpage, node, referenced, res);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..9b9e4d3a6205 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2662,6 +2662,39 @@  void memcg_kmem_uncharge(struct page *page, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 /*
+ * Fix up the mem_cgroup field of the head and tail pages of a compound
+ * page that has been converted from a reservation into a huge page.
+ */
+void mem_cgroup_collapse_huge_fixup(struct page *head)
+{
+	int i;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/*
+	 * Some pages may already have mem_cgroup == NULL if only some of
+	 * the pages in the reservation were faulted in when it was converted.
+	 */
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		if (head[i].mem_cgroup != NULL) {
+			if (i != 0)
+				head->mem_cgroup = head[i].mem_cgroup;
+			else
+				i++;
+			break;
+		}
+	}
+	for (; i < HPAGE_PMD_NR; i++)
+		head[i].mem_cgroup = NULL;
+
+	if (WARN_ON(head->mem_cgroup == NULL))
+		return;
+
+	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, HPAGE_PMD_NR);
+}
+
+/*
  * Because tail pages are not marked as "used", set it. We're under
  * zone_lru_lock and migration entries setup in all page mappings.
  */
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..91df155c3991 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@ 
 #include <linux/pagemap.h>
 #include <linux/memremap.h>
 #include <linux/ksm.h>
+#include <linux/khugepaged.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
 #include <linux/delayacct.h>
@@ -1438,6 +1439,7 @@  static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		khugepaged_release_reservation(vma, addr);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -2494,16 +2496,29 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	const unsigned long mmun_start = vmf->address & PAGE_MASK;
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
+	bool pg_from_reservation = false;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_page = alloc_zeroed_user_highpage_movable(vma,
+		new_page = khugepaged_get_reserved_page(vma, vmf->address);
+		if (!new_page) {
+			new_page = alloc_zeroed_user_highpage_movable(vma,
 							      vmf->address);
+		} else {
+			clear_user_highpage(new_page, vmf->address);
+			pg_from_reservation = true;
+		}
+
 		if (!new_page)
 			goto oom;
 	} else {
+		/*
+		 * XXX If there's a THP reservation, for now just
+		 * release it since they're not shared on fork.
+		 */
+		khugepaged_release_reservation(vma, vmf->address);
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
@@ -2578,6 +2593,9 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			page_remove_rmap(old_page, false);
 		}
 
+		if (pg_from_reservation)
+			khugepaged_mod_resv_unused(vma, vmf->address, -1);
+
 		/* Free the old page.. */
 		new_page = old_page;
 		page_copied = 1;
@@ -3124,6 +3142,7 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	bool pg_from_reservation = false;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3169,9 +3188,16 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
-	if (!page)
-		goto oom;
+
+	page = khugepaged_get_reserved_page(vma, vmf->address);
+	if (!page) {
+		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+		if (!page)
+			goto oom;
+	} else {
+		clear_user_highpage(page, vmf->address);
+		pg_from_reservation = true;
+	}
 
 	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg,
 					false))
@@ -3205,6 +3231,9 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
+	if (pg_from_reservation)
+		khugepaged_mod_resv_unused(vma, vmf->address, -1);
+
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
diff --git a/mm/mmap.c b/mm/mmap.c
index f7cd9cb966c0..a1979392273b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,6 +182,7 @@  static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
+	thp_resvs_put(vma_thp_reservations(vma));
 	vm_area_free(vma);
 	return next;
 }
@@ -839,6 +840,7 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (adjust_next) {
 		next->vm_start += adjust_next << PAGE_SHIFT;
 		next->vm_pgoff += adjust_next;
+		_khugepaged_move_reservations_adj(vma, next, adjust_next);
 	}
 
 	if (root) {
@@ -849,6 +851,8 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
+		_khugepaged_reservations_fixup(next, vma);
+
 		/*
 		 * vma_merge has merged next into vma, and needs
 		 * us to remove next before dropping the locks.
@@ -875,6 +879,8 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		 * (it may either follow vma or precede it).
 		 */
 		__insert_vm_struct(mm, insert);
+
+		_khugepaged_reservations_fixup(vma, insert);
 	} else {
 		if (start_changed)
 			vma_gap_update(vma);
@@ -1780,6 +1786,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	} else {
 		vma_set_anonymous(vma);
+		thp_resvs_new(vma);
 	}
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -2640,6 +2647,9 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (err)
 		goto out_free_mpol;
 
+	if (vma_thp_reservations(vma))
+		thp_resvs_new(new);
+
 	if (new->vm_file)
 		get_file(new->vm_file);
 
@@ -2657,6 +2667,7 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 
 	/* Clean everything up if vma_adjust failed. */
+	thp_resvs_put(vma_thp_reservations(new));
 	if (new->vm_ops && new->vm_ops->close)
 		new->vm_ops->close(new);
 	if (new->vm_file)
@@ -2992,6 +3003,7 @@  static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
 	vma->vm_page_prot = vm_get_page_prot(flags);
+	thp_resvs_new(vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	perf_event_mmap(vma);
@@ -3205,6 +3217,8 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			goto out_free_vma;
 		if (anon_vma_clone(new_vma, vma))
 			goto out_free_mempol;
+		if (vma_thp_reservations(vma))
+			thp_resvs_new(new_vma);
 		if (new_vma->vm_file)
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
diff --git a/mm/mremap.c b/mm/mremap.c
index a9617e72e6b7..194c20cfce73 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -24,6 +24,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/khugepaged.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -294,6 +295,8 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	thp_reservations_mremap(vma, old_addr, new_vma, new_addr, old_len,
+				need_rmap_locks);
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -308,6 +311,8 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
 		 */
+		thp_reservations_mremap(new_vma, new_addr, vma, old_addr,
+					moved_len, true);
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
 		vma = new_vma;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..0118775ab31a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4740,6 +4740,11 @@  long si_mem_available(void)
 	available += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES) >>
 		PAGE_SHIFT;
 
+	/*
+	 * Unused small pages in THP reservations
+	 */
+	available += global_node_page_state(NR_THP_RESERVED);
+
 	if (available < 0)
 		available = 0;
 	return available;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1e79fac3186b..859fa1b1030c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -65,6 +65,7 @@ 
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/khugepaged.h>
 
 #include <asm/tlbflush.h>
 
@@ -1646,6 +1647,8 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			dec_mm_counter(mm, mm_counter_file(page));
 		}
 discard:
+		khugepaged_release_reservation(vma, address);
+
 		/*
 		 * No need to call mmu_notifier_invalidate_range() it has be
 		 * done above for all cases requiring it to happen under page
diff --git a/mm/util.c b/mm/util.c
index 9e3ebd2ef65f..e5617de04006 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -689,6 +689,11 @@  int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 			NR_INDIRECTLY_RECLAIMABLE_BYTES) >> PAGE_SHIFT;
 
 		/*
+		 * Unused small pages in THP reservations
+		 */
+		free += global_node_page_state(NR_THP_RESERVED);
+
+		/*
 		 * Leave reserved pages. The pages are not for anonymous pages.
 		 */
 		if (free <= totalreserve_pages)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7878da76abf2..49c51c2b03f4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1162,6 +1162,7 @@  int fragmentation_index(struct zone *zone, unsigned int order)
 	"nr_dirtied",
 	"nr_written",
 	"", /* nr_indirectly_reclaimable */
+	"nr_thp_reserved",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
@@ -1263,6 +1264,8 @@  int fragmentation_index(struct zone *zone, unsigned int order)
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
 	"thp_swpout_fallback",
+	"thp_res_alloc",
+	"thp_res_alloc_failed",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",