
[RFC,1/8] hugetlb: add per-hstate mutex to synchronize user adjustments

Message ID 20210319224209.150047-2-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Series make hugetlb put_page safe for all calling contexts

Commit Message

Mike Kravetz March 19, 2021, 10:42 p.m. UTC
The number of hugetlb pages can be adjusted by writing to the
sysfs/proc files nr_hugepages, nr_hugepages_mempolicy or
nr_overcommit_hugepages.  There is nothing to prevent two
concurrent modifications via these files.  The underlying routine
set_max_huge_pages() makes assumptions that only one occurrence is
running at a time.  Specifically, alloc_pool_huge_page uses a
hstate specific variable without any synchronization.

Add a mutex to the hstate and use it to only allow one hugetlb
page adjustment at a time.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c            | 5 +++++
 2 files changed, 6 insertions(+)

Comments

Michal Hocko March 22, 2021, 1:59 p.m. UTC | #1
On Fri 19-03-21 15:42:02, Mike Kravetz wrote:
> The number of hugetlb pages can be adjusted by writing to the
> sysfs/proc files nr_hugepages, nr_hugepages_mempolicy or
> nr_overcommit_hugepages.  There is nothing to prevent two
> concurrent modifications via these files.  The underlying routine
> set_max_huge_pages() makes assumptions that only one occurrence is
> running at a time.  Specifically, alloc_pool_huge_page uses a
> hstate specific variable without any synchronization.

From the above it is not really clear whether the unsynchronized nature
of set_max_huge_pages is really a problem or a mere annoyance. I suspect
the latter because counters are properly synchronized with the
hugetlb_lock. It would be great to clarify that.
 
> Add a mutex to the hstate and use it to only allow one hugetlb
> page adjustment at a time.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h | 1 +
>  mm/hugetlb.c            | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index cccd1aab69dd..f42d44050548 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -555,6 +555,7 @@ HPAGEFLAG(Freed, freed)
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> +	struct mutex mutex;
>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5b1ab1f427c5..d5be25f910e8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2601,6 +2601,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	else
>  		return -ENOMEM;
>  
> +	/* mutex prevents concurrent adjustments for the same hstate */
> +	mutex_lock(&h->mutex);
>  	spin_lock(&hugetlb_lock);
>  
>  	/*
> @@ -2633,6 +2635,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>  		if (count > persistent_huge_pages(h)) {
>  			spin_unlock(&hugetlb_lock);
> +			mutex_unlock(&h->mutex);
>  			NODEMASK_FREE(node_alloc_noretry);
>  			return -EINVAL;
>  		}
> @@ -2707,6 +2710,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  out:
>  	h->max_huge_pages = persistent_huge_pages(h);
>  	spin_unlock(&hugetlb_lock);
> +	mutex_unlock(&h->mutex);
>  
>  	NODEMASK_FREE(node_alloc_noretry);
>  
> @@ -3194,6 +3198,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>  	BUG_ON(order == 0);
>  	h = &hstates[hugetlb_max_hstate++];
> +	mutex_init(&h->mutex);
>  	h->order = order;
>  	h->mask = ~(huge_page_size(h) - 1);
>  	for (i = 0; i < MAX_NUMNODES; ++i)
> -- 
> 2.30.2
>
Mike Kravetz March 22, 2021, 4:57 p.m. UTC | #2
On 3/22/21 6:59 AM, Michal Hocko wrote:
> On Fri 19-03-21 15:42:02, Mike Kravetz wrote:
>> The number of hugetlb pages can be adjusted by writing to the
>> sysfs/proc files nr_hugepages, nr_hugepages_mempolicy or
>> nr_overcommit_hugepages.  There is nothing to prevent two
>> concurrent modifications via these files.  The underlying routine
>> set_max_huge_pages() makes assumptions that only one occurrence is
>> running at a time.  Specifically, alloc_pool_huge_page uses a
>> hstate specific variable without any synchronization.
> 
> From the above it is not really clear whether the unsynchronized nature
> of set_max_huge_pages is really a problem or a mere annoyance. I suspect
> the latter because counters are properly synchronized with the
> hugetlb_lock. It would be great to clarify that.
>  

It is a problem and an annoyance.

The problem is that alloc_pool_huge_page -> for_each_node_mask_to_alloc is
called after dropping the hugetlb lock.  for_each_node_mask_to_alloc
uses the helper hstate_next_node_to_alloc which uses and modifies
h->next_nid_to_alloc.  Worst case would be two instances of set_max_huge_pages
trying to allocate pages on different sets of nodes.  Pages could get
allocated on the wrong nodes.
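
To illustrate, here is a simplified sketch of the helper involved (close
to, but not necessarily identical to, the code in mm/hugetlb.c); the
problematic part is the unlocked read-modify-write of h->next_nid_to_alloc:

/*
 * Sketch of hstate_next_node_to_alloc().  It is reached from
 * alloc_pool_huge_page() after hugetlb_lock has been dropped,
 * so nothing serializes updates to h->next_nid_to_alloc.
 */
static int hstate_next_node_to_alloc(struct hstate *h,
				     nodemask_t *nodes_allowed)
{
	int nid;

	/* pick a node based on the shared per-hstate cursor */
	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);

	/*
	 * Unlocked read-modify-write: two concurrent callers of
	 * set_max_huge_pages(), possibly with different nodes_allowed
	 * masks, can interleave here and leave the cursor pointing at
	 * a node the other caller did not ask for.
	 */
	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

	return nid;
}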

I really doubt this problem has ever been experienced in practice.
However, when looking at the code it was a real annoyance. :)

I'll update the commit message to be more clear.
Michal Hocko March 23, 2021, 7:48 a.m. UTC | #3
On Mon 22-03-21 09:57:14, Mike Kravetz wrote:
> On 3/22/21 6:59 AM, Michal Hocko wrote:
> > On Fri 19-03-21 15:42:02, Mike Kravetz wrote:
> >> The number of hugetlb pages can be adjusted by writing to the
> >> sysfs/proc files nr_hugepages, nr_hugepages_mempolicy or
> >> nr_overcommit_hugepages.  There is nothing to prevent two
> >> concurrent modifications via these files.  The underlying routine
> >> set_max_huge_pages() makes assumptions that only one occurrence is
> >> running at a time.  Specifically, alloc_pool_huge_page uses a
> >> hstate specific variable without any synchronization.
> > 
> > From the above it is not really clear whether the unsynchronized nature
> > of set_max_huge_pages is really a problem or a mere annoyance. I suspect
> > the latter because counters are properly synchronized with the
> > hugetlb_lock. It would be great to clarify that.
> >  
> 
> It is a problem and an annoyance.
> 
> The problem is that alloc_pool_huge_page -> for_each_node_mask_to_alloc is
> called after dropping the hugetlb lock.  for_each_node_mask_to_alloc
> uses the helper hstate_next_node_to_alloc which uses and modifies
> h->next_nid_to_alloc.  Worst case would be two instances of set_max_huge_pages
> trying to allocate pages on different sets of nodes.  Pages could get
> allocated on the wrong nodes.

Yes, that is what I meant by the annoyance. On the other hand, parallel
access to a global knob maintaining a global resource should be expected
to have some side effects without external synchronization, unless it is
explicitly documented that such access is synchronized internally.

> I really doubt this problem has ever been experienced in practice.
> However, when looking at the code it was a real annoyance. :)

IMHO it would be a bit of a stretch to consider it a real life problem.
 
> I'll update the commit message to be more clear.

Thanks! Clarification will definitely help.

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..f42d44050548 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -555,6 +555,7 @@ HPAGEFLAG(Freed, freed)
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
+	struct mutex mutex;
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b1ab1f427c5..d5be25f910e8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2601,6 +2601,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	else
 		return -ENOMEM;
 
+	/* mutex prevents concurrent adjustments for the same hstate */
+	mutex_lock(&h->mutex);
 	spin_lock(&hugetlb_lock);
 
 	/*
@@ -2633,6 +2635,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			mutex_unlock(&h->mutex);
 			NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
@@ -2707,6 +2710,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 out:
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	mutex_unlock(&h->mutex);
 
 	NODEMASK_FREE(node_alloc_noretry);
 
@@ -3194,6 +3198,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order == 0);
 	h = &hstates[hugetlb_max_hstate++];
+	mutex_init(&h->mutex);
 	h->order = order;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)