Message ID | 20241216165105.56185-10-dev.jain@arm.com
---|---
State | New
Series | khugepaged: Asynchronous mTHP collapse
On 16.12.24 17:51, Dev Jain wrote:
> In contrast to PMD-collapse, we do not need to operate on two levels of
> pagetable simultaneously. Therefore, downgrade the mmap lock from write to
> read mode. Still take the anon_vma lock in exclusive mode so as not to waste
> time in the rmap path, which is going to fail anyway since the PTEs are
> going to be changed. Under the PTL, copy page contents, clear the PTEs,
> remove folio pins, and (try to) unmap the old folios. Set the PTEs to the
> new folio using the set_ptes() API.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
[...]
> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  			src = page_folio(src_page);
>  			if (!folio_test_large(src))
>  				release_pte_folio(src);
> -			/*
> -			 * ptl mostly unnecessary, but preempt has to
> -			 * be disabled to update the per-cpu stats
> -			 * inside folio_remove_rmap_pte().
> -			 */
> -			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> -			spin_unlock(ptl);
> +			if (order == HPAGE_PMD_ORDER) {
> +				/*
> +				 * ptl mostly unnecessary, but preempt has to
> +				 * be disabled to update the per-cpu stats
> +				 * inside folio_remove_rmap_pte().
> +				 */
> +				spin_lock(ptl);
> +				ptep_clear(vma->vm_mm, address, _pte);
> +				folio_remove_rmap_pte(src, src_page, vma);
> +				spin_unlock(ptl);
> +			} else {
> +				ptep_clear(vma->vm_mm, address, _pte);
> +				folio_remove_rmap_pte(src, src_page, vma);
> +			}

As I've talked to Nico about this code recently ... :)

Are you clearing the PTE after the copy succeeded? If so, where is the
TLB flush?

How do you sync against concurrent write access + GUP-fast?

The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check if
there are unexpected page references (e.g., GUP) and back off if so,
(3) copy the page content, (4) set the updated PTE/PMD.

To Nico, I suggested doing it simple initially, and still clear the
high-level PMD entry + flush under the mmap write lock, then re-map the
PTE table after modifying the page table. It's not as efficient, but
"harder to get wrong".

Maybe that's already happening, but I stumbled over this clearing logic
in __collapse_huge_page_copy_succeeded(), so I'm curious.
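For readers following the thread, here is a minimal sketch of the ordering
described above, applied to a PTE-mapped range. This is not the patch's code:
it assumes the PTL is held across the whole sequence, that every PTE in the
range maps a present, normal page (as an isolate step would have verified),
that collapse_range_sketch() and collapse_expected_refs() are hypothetical
helpers, and that a real implementation would not put a HPAGE_PMD_NR-sized
PTE backup on the stack. Teardown of the old pages (rmap, refcounts) is
omitted.

/* Placeholder: a real check would mirror khugepaged's refcount expectations. */
static bool collapse_expected_refs(struct folio *folio)
{
	return folio_ref_count(folio) == folio_mapcount(folio);
}

/* Hedged sketch of the clear+flush -> check refs -> copy -> remap ordering. */
static int collapse_range_sketch(struct vm_area_struct *vma, pte_t *pte,
				 unsigned long addr, struct folio *new_folio,
				 unsigned int nr)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t orig[HPAGE_PMD_NR];	/* backup for the back-off path */
	unsigned int i;

	/* (1) Clear the PTEs and flush the TLB: no CPU can write through them. */
	for (i = 0; i < nr; i++)
		orig[i] = ptep_get_and_clear(mm, addr + i * PAGE_SIZE, pte + i);
	flush_tlb_range(vma, addr, addr + nr * PAGE_SIZE);

	/* (2) Back off if e.g. GUP still holds unexpected references. */
	for (i = 0; i < nr; i++) {
		struct page *page = vm_normal_page(vma, addr + i * PAGE_SIZE, orig[i]);

		if (page && !collapse_expected_refs(page_folio(page)))
			goto restore;
	}

	/* (3) Only now is the copy stable: nobody can modify the source. */
	for (i = 0; i < nr; i++) {
		struct page *src = vm_normal_page(vma, addr + i * PAGE_SIZE, orig[i]);

		if (copy_mc_user_highpage(folio_page(new_folio, i), src,
					  addr + i * PAGE_SIZE, vma))
			goto restore;
	}

	/* (4) Install the new mapping over the whole range. */
	set_ptes(mm, addr, pte,
		 maybe_mkwrite(mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), vma),
		 nr);
	return 0;

restore:
	/* Put the original PTEs back; the TLB was already flushed in (1). */
	for (i = 0; i < nr; i++)
		set_pte_at(mm, addr + i * PAGE_SIZE, pte + i, orig[i]);
	return -EAGAIN;
}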
On Mon, Dec 16, 2024 at 9:09 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.12.24 17:51, Dev Jain wrote:
[...]
> > +			if (order == HPAGE_PMD_ORDER) {
> > +				/*
> > +				 * ptl mostly unnecessary, but preempt has to
> > +				 * be disabled to update the per-cpu stats
> > +				 * inside folio_remove_rmap_pte().
> > +				 */
> > +				spin_lock(ptl);
> > +				ptep_clear(vma->vm_mm, address, _pte);
> > +				folio_remove_rmap_pte(src, src_page, vma);
> > +				spin_unlock(ptl);

I think it is ok not to take the ptl here, since preemption is already
disabled at this point by pte_map(); pte_unmap() is called after the copy.

> > +			} else {
> > +				ptep_clear(vma->vm_mm, address, _pte);
> > +				folio_remove_rmap_pte(src, src_page, vma);
> > +			}
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check if
> there are unexpected page references (e.g., GUP) and back off if so,
> (3) copy the page content, (4) set the updated PTE/PMD.

Yeah, IIRC either the PMD is not cleared or tlb_remove_table_sync_one() is
not called here, so concurrent GUP-fast may change the refcount after the
refcount check.

> To Nico, I suggested doing it simple initially, and still clear the
> high-level PMD entry + flush under the mmap write lock, then re-map the
> PTE table after modifying the page table. It's not as efficient, but
> "harder to get wrong".
>
> Maybe that's already happening, but I stumbled over this clearing logic
> in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
> --
> Cheers,
>
> David / dhildenb
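For context, this is roughly the dance the existing PMD-order collapse path
does before it trusts any refcount check. It is a condensed sketch from
memory, not a verbatim quote of mm/khugepaged.c, and the locals vma, address,
pmd and pmd_ptl are assumed to be the usual collapse_huge_page() ones:

	pmd_t _pmd;

	spin_lock(pmd_ptl);
	/* Clear the PMD and flush the TLB: no new mappings, no stale entries. */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	spin_unlock(pmd_ptl);
	/* Wait out any GUP-fast walker that raced with the clear. */
	tlb_remove_table_sync_one();

	/* Only after this point are the refcount checks in isolate stable. */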
On 16/12/24 10:36 pm, David Hildenbrand wrote:
> On 16.12.24 17:51, Dev Jain wrote:
[...]
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check if
> there are unexpected page references (e.g., GUP) and back off if so,
> (3) copy the page content, (4) set the updated PTE/PMD.

Thanks... we need to ensure GUP-fast does not write while we are copying the
contents, so (2) will ensure that GUP-fast sees the cleared PTE and backs off.

> To Nico, I suggested doing it simple initially, and still clear the
> high-level PMD entry + flush under the mmap write lock, then re-map the
> PTE table after modifying the page table. It's not as efficient, but
> "harder to get wrong".
>
> Maybe that's already happening, but I stumbled over this clearing logic
> in __collapse_huge_page_copy_succeeded(), so I'm curious.

No, I am not even touching the PMD. I guess the sequence you described
should work? I just need to reverse the copying and PTE clearing order to
implement this sequence.
On 17.12.24 11:07, Dev Jain wrote:
>
> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>> On 16.12.24 17:51, Dev Jain wrote:
[...]
>> As I've talked to Nico about this code recently ... :)
>>
>> Are you clearing the PTE after the copy succeeded? If so, where is the
>> TLB flush?
>>
>> How do you sync against concurrent write access + GUP-fast?
>>
>> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check if
>> there are unexpected page references (e.g., GUP) and back off if so,
>> (3) copy the page content, (4) set the updated PTE/PMD.
>
> Thanks... we need to ensure GUP-fast does not write while we are copying
> the contents, so (2) will ensure that GUP-fast sees the cleared PTE and
> backs off.

Yes, and of course you also have to make sure the CPU cannot concurrently
modify the page content while/after you copy it, but before you unmap+flush.

>> To Nico, I suggested doing it simple initially, and still clear the
>> high-level PMD entry + flush under the mmap write lock, then re-map the
>> PTE table after modifying the page table. It's not as efficient, but
>> "harder to get wrong".
>>
>> Maybe that's already happening, but I stumbled over this clearing logic
>> in __collapse_huge_page_copy_succeeded(), so I'm curious.
>
> No, I am not even touching the PMD. I guess the sequence you described
> should work? I just need to reverse the copying and PTE clearing order to
> implement this sequence.

That would work, but you really have to hold the PTL for the whole period:
from when you temporarily clear the PTEs + flush the TLB, over the copy,
until you re-insert the updated PTEs.

When having to back off (restore the original PTEs), or for the copying
itself, you'll likely need access to the original PTEs, which were already
cleared. So you likely need a temporary copy of the original PTEs somehow.

That's why temporarily clearing the PMD under the mmap write lock is easier
to implement, at the cost of requiring the mmap lock in write mode like PMD
collapse does.
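To illustrate the simpler variant suggested above, here is a hedged sketch,
not code from this series: with the mmap lock held for write, the PMD is
detached with pmdp_collapse_flush(), GUP-fast is waited out with
tlb_remove_table_sync_one(), the PTE-level work happens on the detached
table, and the table is re-installed with pmd_populate() afterwards, so no
per-PTE backup is needed. The function shape and locals are illustrative.

/*
 * Hedged sketch only. Caller holds the mmap lock in write mode and has
 * already set up mmu notifiers for the range.
 */
static void collapse_detach_pmd_sketch(struct vm_area_struct *vma, pmd_t *pmd,
				       unsigned long haddr)
{
	spinlock_t *pmd_ptl = pmd_lock(vma->vm_mm, pmd);
	pmd_t orig_pmd;

	/* Nobody can walk or fault on this range once the PMD is gone. */
	orig_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	spin_unlock(pmd_ptl);
	tlb_remove_table_sync_one();

	/*
	 * ... copy into the new folio and rewrite the detached PTE table at
	 * leisure; on failure, the untouched table can simply be put back ...
	 */

	pmd_ptl = pmd_lock(vma->vm_mm, pmd);
	pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
	spin_unlock(pmd_ptl);
}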
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88beebef773e..8040b130e677 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist, int order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1UL << order);
 	     _pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 			if (is_zero_pfn(pte_pfn(pteval))) {
-				/*
-				 * ptl mostly unnecessary.
-				 */
-				spin_lock(ptl);
-				ptep_clear(vma->vm_mm, address, _pte);
-				spin_unlock(ptl);
+				if (order == HPAGE_PMD_ORDER) {
+					/*
+					 * ptl mostly unnecessary.
+					 */
+					spin_lock(ptl);
+					ptep_clear(vma->vm_mm, address, _pte);
+					spin_unlock(ptl);
+				} else {
+					ptep_clear(vma->vm_mm, address, _pte);
+				}
 				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
 			}
 		} else {
@@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			src = page_folio(src_page);
 			if (!folio_test_large(src))
 				release_pte_folio(src);
-			/*
-			 * ptl mostly unnecessary, but preempt has to
-			 * be disabled to update the per-cpu stats
-			 * inside folio_remove_rmap_pte().
-			 */
-			spin_lock(ptl);
-			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src, src_page, vma);
-			spin_unlock(ptl);
+			if (order == HPAGE_PMD_ORDER) {
+				/*
+				 * ptl mostly unnecessary, but preempt has to
+				 * be disabled to update the per-cpu stats
+				 * inside folio_remove_rmap_pte().
+				 */
+				spin_lock(ptl);
+				ptep_clear(vma->vm_mm, address, _pte);
+				folio_remove_rmap_pte(src, src_page, vma);
+				spin_unlock(ptl);
+			} else {
+				ptep_clear(vma->vm_mm, address, _pte);
+				folio_remove_rmap_pte(src, src_page, vma);
+			}
 			free_page_and_swap_cache(src_page);
 		}
 	}
@@ -807,7 +816,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, int order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -815,7 +824,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -834,7 +843,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
 						 compound_pagelist, order);
@@ -1196,7 +1205,7 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
@@ -1228,6 +1237,61 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
 	return result;
 }
 
+/* Enter with mmap read lock */
+static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
+		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+		struct folio *folio, int order)
+{
+	int result;
+	struct mmu_notifier_range range;
+	spinlock_t *pte_ptl;
+	LIST_HEAD(compound_pagelist);
+	pte_t *pte;
+	pte_t entry;
+	int nr_pages = folio_nr_pages(folio);
+
+	anon_vma_lock_write(vma->anon_vma);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
+				address + (PAGE_SIZE << order));
+	mmu_notifier_invalidate_range_start(&range);
+
+	pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
+	if (pte)
+		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      &compound_pagelist, order);
+	else
+		result = SCAN_PMD_NULL;
+
+	if (unlikely(result != SCAN_SUCCEED))
+		goto out_up_read;
+
+	anon_vma_unlock_write(vma->anon_vma);
+
+	__folio_mark_uptodate(folio);
+	entry = mk_pte(&folio->page, vma->vm_page_prot);
+	entry = maybe_mkwrite(entry, vma);
+
+	result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
+					   vma, address, pte_ptl,
+					   &compound_pagelist, order);
+	if (unlikely(result != SCAN_SUCCEED))
+		goto out_up_read;
+
+	folio_ref_add(folio, nr_pages - 1);
+	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(folio, vma);
+	deferred_split_folio(folio, false);
+	set_ptes(mm, address, pte, entry, nr_pages);
+	update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
+	pte_unmap_unlock(pte, pte_ptl);
+	mmu_notifier_invalidate_range_end(&range);
+	result = SCAN_SUCCEED;
+
+out_up_read:
+	mmap_read_unlock(mm);
+	return result;
+}
+
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped, int order,
 			      struct collapse_control *cc)
@@ -1276,6 +1340,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	if (order == HPAGE_PMD_ORDER)
 		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+	else
+		result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
 
 	if (result == SCAN_SUCCEED)
 		folio = NULL;
In contrast to PMD-collapse, we do not need to operate on two levels of
pagetable simultaneously. Therefore, downgrade the mmap lock from write to
read mode. Still take the anon_vma lock in exclusive mode so as not to waste
time in the rmap path, which is going to fail anyway since the PTEs are going
to be changed. Under the PTL, copy page contents, clear the PTEs, remove
folio pins, and (try to) unmap the old folios. Set the PTEs to the new folio
using the set_ptes() API.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Note: I have been trying hard to get rid of the locks in here: we still take
the PTL around the page copying; dropping the PTL and taking it after the
copying could lead to a deadlock, for example:

  khugepaged                    madvise(MADV_COLD)
  folio_lock()                  lock(ptl)
  lock(ptl)                     folio_lock()

We could create a locked-folio list, drop both locks altogether, take the
PTL, do everything __collapse_huge_page_isolate() does *except* the
isolation, and then try locking the folios again, but that would reduce the
efficiency of khugepaged and almost looks like a forced solution :)

Please note the following discussion if anyone is interested:
https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
(Apologies for not CCing the mailing list from the start)

 mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 87 insertions(+), 21 deletions(-)