[v10,3/4] mm, thp: introduce FOLL_SPLIT_PMD

Message ID 20190730052305.3672336-4-songliubraving@fb.com (mailing list archive)
State New, archived
Series: THP aware uprobe

Commit Message

Song Liu July 30, 2019, 5:23 a.m. UTC
This patch introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says,
FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct; the underlying
huge page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need to use regular pages for
now, but would like to switch back to the huge page and huge pmd later. One
such example is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h | 1 +
 mm/gup.c           | 8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)
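
For illustration, a caller passes the new flag through the normal GUP API.
A minimal sketch (assuming the get_user_pages_remote() signature of this
kernel generation; mm and vaddr stand in for whatever the caller already
holds, and the real user is the uprobe patch later in this series):

	struct page *page;
	struct vm_area_struct *vma;
	long nr;

	/*
	 * Look up the page at vaddr. If it is mapped by a huge pmd,
	 * FOLL_SPLIT_PMD splits the pmd into a pte table first but
	 * leaves the underlying huge page intact.
	 */
	nr = get_user_pages_remote(NULL, mm, vaddr, 1,
				   FOLL_FORCE | FOLL_SPLIT_PMD,
				   &page, &vma, NULL);
	if (nr <= 0)
		return nr ? nr : -EFAULT;
	/* ... modify the page through regular ptes ... */
	put_page(page);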

Comments

Oleg Nesterov July 30, 2019, 4:11 p.m. UTC | #1
I don't understand this code, so I can't review, but.

On 07/29, Song Liu wrote:
>
> This patch introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says,
> FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct; the underlying
> huge page stays as-is.
>
> FOLL_SPLIT_PMD is useful for cases where we need to use regular pages for
> now, but would like to switch back to the huge page and huge pmd later. One
> such example is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.

So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes)
and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm().

Hmm.

> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>  		int ret;
>  		page = pmd_page(*pmd);
>  		if (is_huge_zero_page(page)) {
> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			split_huge_pmd(vma, pmd, address);
>  			if (pmd_trans_unstable(pmd))
>  				ret = -EBUSY;
> -		} else {
> +		} else if (flags & FOLL_SPLIT) {
>  			if (unlikely(!try_get_page(page))) {
>  				spin_unlock(ptl);
>  				return ERR_PTR(-ENOMEM);
> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			put_page(page);
>  			if (pmd_none(*pmd))
>  				return no_page_table(vma, flags);
> +		} else {  /* flags & FOLL_SPLIT_PMD */
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, address);
> +			ret = pte_alloc(mm, pmd);

I fail to understand why this differs from the is_huge_zero_page() case above.

Anyway, ret = pte_alloc(mm, pmd) can't be correct. If __pte_alloc() fails pte_alloc()
will return 1. This will fool the IS_ERR(page) check in __get_user_pages().
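
For reference, IS_ERR() only recognizes the top MAX_ERRNO worth of pointer
values as errors, so ERR_PTR(1) slips through. Roughly, paraphrasing
include/linux/err.h:

	#define MAX_ERRNO	4095
	#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= \
					 (unsigned long)-MAX_ERRNO)

	static inline bool IS_ERR(const void *ptr)
	{
		return IS_ERR_VALUE((unsigned long)ptr);
	}

	/*
	 * ERR_PTR(1) == (void *)1, far below -MAX_ERRNO, so IS_ERR()
	 * returns false and the bogus value would go on to be used as
	 * a struct page pointer.
	 */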

Oleg.
Song Liu July 30, 2019, 5:42 p.m. UTC | #2
> On Jul 30, 2019, at 9:11 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> I don't understand this code, so I can't review, but.
> 
> On 07/29, Song Liu wrote:
>> 
>> This patch introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says,
>> FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct; the underlying
>> huge page stays as-is.
>> 
>> FOLL_SPLIT_PMD is useful for cases where we need to use regular pages for
>> now, but would like to switch back to the huge page and huge pmd later. One
>> such example is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
> 
> So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes)
> and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm().
> 
> Hmm.

I think this is what we want. :) 

FOLL_SPLIT is the fallback solution for users who cannot handle THP. With
more THP aware code, there will be fewer users of FOLL_SPLIT. 

> 
>> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 		spin_unlock(ptl);
>> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>> 	}
>> -	if (flags & FOLL_SPLIT) {
>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>> 		int ret;
>> 		page = pmd_page(*pmd);
>> 		if (is_huge_zero_page(page)) {
>> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			split_huge_pmd(vma, pmd, address);
>> 			if (pmd_trans_unstable(pmd))
>> 				ret = -EBUSY;
>> -		} else {
>> +		} else if (flags & FOLL_SPLIT) {
>> 			if (unlikely(!try_get_page(page))) {
>> 				spin_unlock(ptl);
>> 				return ERR_PTR(-ENOMEM);
>> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			put_page(page);
>> 			if (pmd_none(*pmd))
>> 				return no_page_table(vma, flags);
>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, address);
>> +			ret = pte_alloc(mm, pmd);
> 
> I fail to understand why this differs from the is_huge_zero_page() case above.

split_huge_pmd() handles is_huge_zero_page() differently. In this case, we 
cannot use the pmd_trans_unstable() check. 

> 
> Anyway, ret = pte_alloc(mm, pmd) can't be correct. If __pte_alloc() fails pte_alloc()
> will return 1. This will fool the IS_ERR(page) check in __get_user_pages().

Great catch! Let me fix it.
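
Something like this, maybe (untested sketch, just mapping a pte_alloc()
failure to a real errno so the caller gets a proper ERR_PTR):

	} else {	/* flags & FOLL_SPLIT_PMD */
		spin_unlock(ptl);
		split_huge_pmd(vma, pmd, address);
		ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
	}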

Thanks,
Song
Oleg Nesterov July 31, 2019, 3:18 p.m. UTC | #3
On 07/30, Song Liu wrote:
>
>
> > On Jul 30, 2019, at 9:11 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes)
> > and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm().
> >
> > Hmm.
>
> I think this is what we want. :)

We? I don't ;)

> FOLL_SPLIT is the fallback solution for users who cannot handle THP.

and again, we have a single user: thp_split_mm(). I do not know if it
can use FOLL_SPLIT_PMD or not, maybe you can take a look...

> With
> more THP aware code, there will be fewer users of FOLL_SPLIT.

Fewer than 1? Good ;)

> >> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 		spin_unlock(ptl);
> >> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >> 	}
> >> -	if (flags & FOLL_SPLIT) {
> >> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> >> 		int ret;
> >> 		page = pmd_page(*pmd);
> >> 		if (is_huge_zero_page(page)) {
> >> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 			split_huge_pmd(vma, pmd, address);
> >> 			if (pmd_trans_unstable(pmd))
> >> 				ret = -EBUSY;
> >> -		} else {
> >> +		} else if (flags & FOLL_SPLIT) {
> >> 			if (unlikely(!try_get_page(page))) {
> >> 				spin_unlock(ptl);
> >> 				return ERR_PTR(-ENOMEM);
> >> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 			put_page(page);
> >> 			if (pmd_none(*pmd))
> >> 				return no_page_table(vma, flags);
> >> +		} else {  /* flags & FOLL_SPLIT_PMD */
> >> +			spin_unlock(ptl);
> >> +			split_huge_pmd(vma, pmd, address);
> >> +			ret = pte_alloc(mm, pmd);
> >
> > I fail to understand why this differs from the is_huge_zero_page() case above.
>
> split_huge_pmd() handles is_huge_zero_page() differently. In this case, we
> cannot use the pmd_trans_unstable() check.

Please correct me, but iiuc the problem is not that split_huge_pmd() handles
is_huge_zero_page() differently; the problem is that __split_huge_pmd_locked()
handles the !vma_is_anonymous(vma) case differently and returns with
pmd_none() == T after pmdp_huge_clear_flush_notify(). This means that
pmd_trans_unstable() will report the pmd as unstable.
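
Roughly, paraphrasing the asm-generic helper, a cleared pmd counts as
"unstable":

	static inline int pmd_trans_unstable(pmd_t *pmd)
	{
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
		/* returns 1 for a pmd_none() pmd; follow_pmd_mask()
		 * then turns that into -EBUSY */
		return pmd_none_or_trans_huge_or_clear_bad(pmd);
	#else
		return 0;
	#endif
	}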

Now, I don't understand why we need pmd_trans_unstable() after
split_huge_pmd(huge-zero-pmd), but whatever the reason, why can't we
unify both cases?

IOW, could you explain why the patch below is wrong?

Oleg.


--- x/mm/gup.c
+++ x/mm/gup.c
@@ -399,14 +399,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT) {
+	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
 		page = pmd_page(*pmd);
-		if (is_huge_zero_page(page)) {
+		if ((flags & FOLL_SPLIT_PMD) || is_huge_zero_page(page)) {
 			spin_unlock(ptl);
-			ret = 0;
 			split_huge_pmd(vma, pmd, address);
-			if (pmd_trans_unstable(pmd))
+			ret = 0;
+			if (pte_alloc(mm, pmd))
+				ret = -ENOMEM;
+			else if (pmd_trans_unstable(pmd))
 				ret = -EBUSY;
 		} else {
 			if (unlikely(!try_get_page(page))) {
Song Liu July 31, 2019, 5:10 p.m. UTC | #4
> On Jul 31, 2019, at 8:18 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 07/30, Song Liu wrote:
>> 
>> 
>>> On Jul 30, 2019, at 9:11 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>> So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes)
>>> and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm().
>>> 
>>> Hmm.
>> 
>> I think this is what we want. :)
> 
> We? I don't ;)
> 
>> FOLL_SPLIT is the fallback solution for users who cannot handle THP.
> 
> and again, we have a single user: thp_split_mm(). I do not know if it
> can use FOLL_SPLIT_PMD or not, may be you can take a look...

I haven't played with s390, so it's going to take me some time to ramp up.
I will add it to my to-do list.

> 
>> With
>> more THP aware code, there will be fewer users of FOLL_SPLIT.
> 
> Fewer than 1? Good ;)

Yes! It will be great if thp_split_mm() can use FOLL_SPLIT_PMD 
instead. 

> 
>>>> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 		spin_unlock(ptl);
>>>> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>>>> 	}
>>>> -	if (flags & FOLL_SPLIT) {
>>>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>>>> 		int ret;
>>>> 		page = pmd_page(*pmd);
>>>> 		if (is_huge_zero_page(page)) {
>>>> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 			split_huge_pmd(vma, pmd, address);
>>>> 			if (pmd_trans_unstable(pmd))
>>>> 				ret = -EBUSY;
>>>> -		} else {
>>>> +		} else if (flags & FOLL_SPLIT) {
>>>> 			if (unlikely(!try_get_page(page))) {
>>>> 				spin_unlock(ptl);
>>>> 				return ERR_PTR(-ENOMEM);
>>>> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 			put_page(page);
>>>> 			if (pmd_none(*pmd))
>>>> 				return no_page_table(vma, flags);
>>>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>>>> +			spin_unlock(ptl);
>>>> +			split_huge_pmd(vma, pmd, address);
>>>> +			ret = pte_alloc(mm, pmd);
>>> 
>>> I fail to understand why this differs from the is_huge_zero_page() case above.
>> 
>> split_huge_pmd() handles is_huge_zero_page() differently. In this case, we
>> cannot use the pmd_trans_unstable() check.
> 
> Please correct me, but iiuc the problem is not that split_huge_pmd() handles
> is_huge_zero_page() differently; the problem is that __split_huge_pmd_locked()
> handles the !vma_is_anonymous(vma) case differently and returns with
> pmd_none() == T after pmdp_huge_clear_flush_notify(). This means that
> pmd_trans_unstable() will report the pmd as unstable.

Agreed. 

> 
> Now, I don't understand why we need pmd_trans_unstable() after
> split_huge_pmd(huge-zero-pmd), but whatever the reason, why can't we
> unify both cases?
> 
> IOW, could you explain why the patch below is wrong?

I _think_ the following patch works (I haven't fully tested it yet). But I
am not sure it is the best approach. By keeping the two cases separate, we
don't duplicate much code, and it stays clear that the two cases are handled
differently. Therefore, I would prefer to keep them separate for now.

Thanks,
Song

> 
> 
> --- x/mm/gup.c
> +++ x/mm/gup.c
> @@ -399,14 +399,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> 		spin_unlock(ptl);
> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> 		int ret;
> 		page = pmd_page(*pmd);
> -		if (is_huge_zero_page(page)) {
> +		if ((flags & FOLL_SPLIT_PMD) || is_huge_zero_page(page)) {
> 			spin_unlock(ptl);
> -			ret = 0;
> 			split_huge_pmd(vma, pmd, address);
> -			if (pmd_trans_unstable(pmd))
> +			ret = 0;
> +			if (pte_alloc(mm, pmd))
> +				ret = -ENOMEM;
> +			else if (pmd_trans_unstable(pmd))
> 				ret = -EBUSY;
> 		} else {
> 			if (unlikely(!try_get_page(page))) {
>
Oleg Nesterov Aug. 1, 2019, 3:04 p.m. UTC | #5
On 07/31, Song Liu wrote:
>
> > On Jul 31, 2019, at 8:18 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Now, I don't understand why we need pmd_trans_unstable() after
> > split_huge_pmd(huge-zero-pmd), but whatever the reason, why can't we
> > unify both cases?
> >
> > IOW, could you explain why the patch below is wrong?
>
> I _think_ the following patch works (haven't fully tested yet). But I am not
> sure whether this is the best. By separating the two cases, we don't duplicate
> much code. And it is clear that the two cases are handled differently.
> Therefore, I would prefer to keep these separate for now.

I disagree. I think this separation makes the code less readable and harder
to understand, exactly because it handles the two cases differently and it
is absolutely not clear why.

But I can't argue, please forget.

Oleg.

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f189176dabed..74db879711eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2614,6 +2614,7 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..3c514e223ce3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -399,7 +399,7 @@  static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT) {
+	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
 		page = pmd_page(*pmd);
 		if (is_huge_zero_page(page)) {
@@ -408,7 +408,7 @@  static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			split_huge_pmd(vma, pmd, address);
 			if (pmd_trans_unstable(pmd))
 				ret = -EBUSY;
-		} else {
+		} else if (flags & FOLL_SPLIT) {
 			if (unlikely(!try_get_page(page))) {
 				spin_unlock(ptl);
 				return ERR_PTR(-ENOMEM);
@@ -420,6 +420,10 @@  static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			put_page(page);
 			if (pmd_none(*pmd))
 				return no_page_table(vma, flags);
+		} else {  /* flags & FOLL_SPLIT_PMD */
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, address);
+			ret = pte_alloc(mm, pmd);
 		}
 
 		return ret ? ERR_PTR(ret) :