
[v4,5/6] mm: vmscan: Avoid split during shrink_folio_list()

Message ID 20240311150058.1122862-6-ryan.roberts@arm.com (mailing list archive)
State New
Series Swap-out mTHP without splitting

Commit Message

Ryan Roberts March 11, 2024, 3 p.m. UTC
Now that swap supports storing all mTHP sizes, avoid splitting large
folios before swap-out. This benefits performance of the swap-out path
by eliding split_folio_to_list(), which is expensive, and also sets us
up for swapping in large folios in a future series.

If the folio is partially mapped, we continue to split it since we want
to avoid the extra IO overhead and storage of writing out pages
unnecessarily.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/vmscan.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Barry Song March 11, 2024, 10:30 p.m. UTC | #1
On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Now that swap supports storing all mTHP sizes, avoid splitting large
> folios before swap-out. This benefits performance of the swap-out path
> by eliding split_folio_to_list(), which is expensive, and also sets us
> up for swapping in large folios in a future series.
>
> If the folio is partially mapped, we continue to split it since we want
> to avoid the extra IO overhead and storage of writing out pages
> unnecessarily.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/vmscan.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf7d4cf47f1a..0ebec99e04c6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         if (!can_split_folio(folio, NULL))
>                                                 goto activate_locked;
>                                         /*
> -                                        * Split folios without a PMD map right
> -                                        * away. Chances are some or all of the
> -                                        * tail pages can be freed without IO.
> +                                        * Split partially mapped folios
> +                                        * right away. Chances are some or all
> +                                        * of the tail pages can be freed
> +                                        * without IO.
>                                          */
> -                                       if (!folio_entire_mapcount(folio) &&
> +                                       if (!list_empty(&folio->_deferred_list) &&

Hi Ryan,

After reconsidering our previous discussion about PMD-mapped large
folios, I've been pondering the possibility of a PMD-mapped THP being
mapped by multiple processes. In such a scenario, if one process unmaps
a portion of the folio while others retain the entire mapping, how
should the system handle this? Would the large folio be placed on the
deferred list? If so, splitting it might not yield benefits, as neither
I/O nor swap slots would increase in this case by not splitting it.

Regarding PTE-mapped large folios, the absence of an indicator like
"entire_map" makes it challenging to identify cases where the entire
folio is mapped, so splitting seems to be the only viable solution in
such circumstances.

>                                             split_folio_to_list(folio,
>                                                                 folio_list))
>                                                 goto activate_locked;
> --
> 2.25.1

Thanks
Barry
Ryan Roberts March 12, 2024, 8:12 a.m. UTC | #2
On 11/03/2024 22:30, Barry Song wrote:
> On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> unnecessarily.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/vmscan.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index cf7d4cf47f1a..0ebec99e04c6 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                                         if (!can_split_folio(folio, NULL))
>>                                                 goto activate_locked;
>>                                         /*
>> -                                        * Split folios without a PMD map right
>> -                                        * away. Chances are some or all of the
>> -                                        * tail pages can be freed without IO.
>> +                                        * Split partially mapped folios
>> +                                        * right away. Chances are some or all
>> +                                        * of the tail pages can be freed
>> +                                        * without IO.
>>                                          */
>> -                                       if (!folio_entire_mapcount(folio) &&
>> +                                       if (!list_empty(&folio->_deferred_list) &&
> 
> Hi Ryan,
> After reconsidering our previous discussion about PMD-mapped large
> folios, I've pondered
> the possibility of PMD-mapped Transparent Huge Pages (THPs) being
> mapped by multiple
> processes. In such a scenario, if one process decides to unmap a
> portion of the folio while
> others retain the entire mapping, it raises questions about how the
> system should handle
> this situation. Would the large folio be placed in a deferred list? 

No - if the large folio is entirely mapped (via PMD), then the folio will not be
put on the deferred split list in the first place. See __folio_remove_rmap():

	last = (last < ENTIRELY_MAPPED);

means that nr will never be incremented above 0. (_nr_pages_mapped is
incremented by ENTIRELY_MAPPED for every PMD map).

> If
> so, splitting it might not
> yield benefits, as neither I/O nor swap slots would increase in this
> case by not splitting it.
> 
> Regarding PTE-mapped large folios, the absence of an indicator like
> "entire_map" makes it
> challenging to identify cases where the entire folio is mapped. Thus,
> splitting seems to be
> the only viable solution in such circumstances.
> 
>>                                             split_folio_to_list(folio,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
>> --
>> 2.25.1
> 
> Thanks
> Barry
Barry Song March 12, 2024, 8:40 a.m. UTC | #3
On Tue, Mar 12, 2024 at 9:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/03/2024 22:30, Barry Song wrote:
> > On Mon, Mar 11, 2024 at 11:01 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Now that swap supports storing all mTHP sizes, avoid splitting large
> >> folios before swap-out. This benefits performance of the swap-out path
> >> by eliding split_folio_to_list(), which is expensive, and also sets us
> >> up for swapping in large folios in a future series.
> >>
> >> If the folio is partially mapped, we continue to split it since we want
> >> to avoid the extra IO overhead and storage of writing out pages
> >> unnecessarily.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/vmscan.c | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index cf7d4cf47f1a..0ebec99e04c6 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                         if (!can_split_folio(folio, NULL))
> >>                                                 goto activate_locked;
> >>                                         /*
> >> -                                        * Split folios without a PMD map right
> >> -                                        * away. Chances are some or all of the
> >> -                                        * tail pages can be freed without IO.
> >> +                                        * Split partially mapped folios
> >> +                                        * right away. Chances are some or all
> >> +                                        * of the tail pages can be freed
> >> +                                        * without IO.
> >>                                          */
> >> -                                       if (!folio_entire_mapcount(folio) &&
> >> +                                       if (!list_empty(&folio->_deferred_list) &&
> >
> > Hi Ryan,
> > After reconsidering our previous discussion about PMD-mapped large
> > folios, I've pondered
> > the possibility of PMD-mapped Transparent Huge Pages (THPs) being
> > mapped by multiple
> > processes. In such a scenario, if one process decides to unmap a
> > portion of the folio while
> > others retain the entire mapping, it raises questions about how the
> > system should handle
> > this situation. Would the large folio be placed in a deferred list?
>
> No - if the large folio is entirely mapped (via PMD), then the folio will not be
> put on the deferred split list in the first place. See __folio_remove_rmap():
>
>         last = (last < ENTIRELY_MAPPED);
>
> means that nr will never be incremented above 0. (_nr_pages_mapped is
> incremented by ENTIRELY_MAPPED for every PMD map).

You are right, I missed this part: we break early in RMAP_LEVEL_PTE, so
we never get to "if (nr)". Thanks for your clarification. Now we have
unified code for both PMD-mapped and PTE-mapped large folios. Feel free
to add,

Reviewed-by: Barry Song <v-songbaohua@oppo.com>

>
> > If
> > so, splitting it might not
> > yield benefits, as neither I/O nor swap slots would increase in this
> > case by not splitting it.
> >
> > Regarding PTE-mapped large folios, the absence of an indicator like
> > "entire_map" makes it
> > challenging to identify cases where the entire folio is mapped. Thus,
> > splitting seems to be
> > the only viable solution in such circumstances.
> >
> >>                                             split_folio_to_list(folio,
> >>                                                                 folio_list))
> >>                                                 goto activate_locked;
> >> --
> >> 2.25.1
>

Thanks
Barry
David Hildenbrand March 15, 2024, 10:43 a.m. UTC | #4
On 11.03.24 16:00, Ryan Roberts wrote:
> Now that swap supports storing all mTHP sizes, avoid splitting large
> folios before swap-out. This benefits performance of the swap-out path
> by eliding split_folio_to_list(), which is expensive, and also sets us
> up for swapping in large folios in a future series.
> 
> If the folio is partially mapped, we continue to split it since we want
> to avoid the extra IO overhead and storage of writing out pages
> unnecessarily.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   mm/vmscan.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf7d4cf47f1a..0ebec99e04c6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   					if (!can_split_folio(folio, NULL))
>   						goto activate_locked;
>   					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split partially mapped folios
> +					 * right away. Chances are some or all
> +					 * of the tail pages can be freed
> +					 * without IO.
>   					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (!list_empty(&folio->_deferred_list) &&
>   					    split_folio_to_list(folio,
>   								folio_list))
>   						goto activate_locked;

Not sure if we might have to annotate that with data_race().

Reviewed-by: David Hildenbrand <david@redhat.com>
Ryan Roberts March 15, 2024, 10:49 a.m. UTC | #5
On 15/03/2024 10:43, David Hildenbrand wrote:
> On 11.03.24 16:00, Ryan Roberts wrote:
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> unnecessarily.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   mm/vmscan.c | 9 +++++----
>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index cf7d4cf47f1a..0ebec99e04c6 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>> *folio_list,
>>                       if (!can_split_folio(folio, NULL))
>>                           goto activate_locked;
>>                       /*
>> -                     * Split folios without a PMD map right
>> -                     * away. Chances are some or all of the
>> -                     * tail pages can be freed without IO.
>> +                     * Split partially mapped folios
>> +                     * right away. Chances are some or all
>> +                     * of the tail pages can be freed
>> +                     * without IO.
>>                        */
>> -                    if (!folio_entire_mapcount(folio) &&
>> +                    if (!list_empty(&folio->_deferred_list) &&
>>                           split_folio_to_list(folio,
>>                                   folio_list))
>>                           goto activate_locked;
> 
> Not sure if we might have to annotate that with data_race().

I asked that exact question to Matthew in another context but didn't get
a response. There are examples of checking whether the deferred list is
empty both with and without data_race() in the code base. But
list_empty() is implemented like this:

static inline int list_empty(const struct list_head *head)
{
	return READ_ONCE(head->next) == head;
}

So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
sufficient for KCSAN?


> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

Thanks!
David Hildenbrand March 15, 2024, 11:12 a.m. UTC | #6
On 15.03.24 11:49, Ryan Roberts wrote:
> On 15/03/2024 10:43, David Hildenbrand wrote:
>> On 11.03.24 16:00, Ryan Roberts wrote:
>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>> folios before swap-out. This benefits performance of the swap-out path
>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>> up for swapping in large folios in a future series.
>>>
>>> If the folio is partially mapped, we continue to split it since we want
>>> to avoid the extra IO overhead and storage of writing out pages
>>> unnecessarily.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    mm/vmscan.c | 9 +++++----
>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>> *folio_list,
>>>                        if (!can_split_folio(folio, NULL))
>>>                            goto activate_locked;
>>>                        /*
>>> -                     * Split folios without a PMD map right
>>> -                     * away. Chances are some or all of the
>>> -                     * tail pages can be freed without IO.
>>> +                     * Split partially mapped folios
>>> +                     * right away. Chances are some or all
>>> +                     * of the tail pages can be freed
>>> +                     * without IO.
>>>                         */
>>> -                    if (!folio_entire_mapcount(folio) &&
>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>                            split_folio_to_list(folio,
>>>                                    folio_list))
>>>                            goto activate_locked;
>>
>> Not sure if we might have to annotate that with data_race().
> 
> I asked that exact question to Matthew in another context but didn't get a
> response. There are examples of checking if the deferred list is empty with and
> without data_race() in the code base. But list_empty() is implemented like this:
> 
> static inline int list_empty(const struct list_head *head)
> {
> 	return READ_ONCE(head->next) == head;
> }
> 
> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
> sufficient for KCSAN?

Yeah, there is only one use of data_race with that list.

It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is 
in deferred list").

Looks like that was added right in v1 of that change [1], so my best 
guess is that it is not actually required.

If not required, likely we should just cleanup the single user.

[1] 
https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
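
If the annotation really isn't needed, the cleanup would presumably be a
one-liner along these lines (hypothetical sketch against the single site
added by f3ebdf042df4; the surrounding context is from memory and
untested):

```diff
-	if (data_race(list_empty(&folio->_deferred_list)))
+	if (list_empty(&folio->_deferred_list))
		return;
```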
Ryan Roberts March 15, 2024, 11:38 a.m. UTC | #7
Hi Yin Fengwei,

On 15/03/2024 11:12, David Hildenbrand wrote:
> On 15.03.24 11:49, Ryan Roberts wrote:
>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>> folios before swap-out. This benefits performance of the swap-out path
>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>> up for swapping in large folios in a future series.
>>>>
>>>> If the folio is partially mapped, we continue to split it since we want
>>>> to avoid the extra IO overhead and storage of writing out pages
>>>> unnecessarily.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>    mm/vmscan.c | 9 +++++----
>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>> *folio_list,
>>>>                        if (!can_split_folio(folio, NULL))
>>>>                            goto activate_locked;
>>>>                        /*
>>>> -                     * Split folios without a PMD map right
>>>> -                     * away. Chances are some or all of the
>>>> -                     * tail pages can be freed without IO.
>>>> +                     * Split partially mapped folios
>>>> +                     * right away. Chances are some or all
>>>> +                     * of the tail pages can be freed
>>>> +                     * without IO.
>>>>                         */
>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>                            split_folio_to_list(folio,
>>>>                                    folio_list))
>>>>                            goto activate_locked;
>>>
>>> Not sure if we might have to annotate that with data_race().
>>
>> I asked that exact question to Matthew in another context but didn't get a
>> response. There are examples of checking if the deferred list is empty with and
>> without data_race() in the code base. But list_empty() is implemented like this:
>>
>> static inline int list_empty(const struct list_head *head)
>> {
>>     return READ_ONCE(head->next) == head;
>> }
>>
>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>> sufficient for KCSAN?
> 
> Yeah, there is only one use of data_race with that list.
> 
> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
> deferred list").
> 
> Looks like that was added right in v1 of that change [1], so my best guess is
> that it is not actually required.
> 
> If not required, likely we should just cleanup the single user.
> 
> [1]
> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/

Do you have any recollection of why you added the data_race() markup?

Thanks,
Ryan

>
Huang, Ying March 18, 2024, 2:16 a.m. UTC | #8
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi Yin Fengwei,
>
> On 15/03/2024 11:12, David Hildenbrand wrote:
>> On 15.03.24 11:49, Ryan Roberts wrote:
>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>> up for swapping in large folios in a future series.
>>>>>
>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>> unnecessarily.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>    mm/vmscan.c | 9 +++++----
>>>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>> *folio_list,
>>>>>                        if (!can_split_folio(folio, NULL))
>>>>>                            goto activate_locked;
>>>>>                        /*
>>>>> -                     * Split folios without a PMD map right
>>>>> -                     * away. Chances are some or all of the
>>>>> -                     * tail pages can be freed without IO.
>>>>> +                     * Split partially mapped folios
>>>>> +                     * right away. Chances are some or all
>>>>> +                     * of the tail pages can be freed
>>>>> +                     * without IO.
>>>>>                         */
>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>                            split_folio_to_list(folio,
>>>>>                                    folio_list))
>>>>>                            goto activate_locked;
>>>>
>>>> Not sure if we might have to annotate that with data_race().
>>>
>>> I asked that exact question to Matthew in another context but didn't get a
>>> response. There are examples of checking if the deferred list is empty with and
>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>
>>> static inline int list_empty(const struct list_head *head)
>>> {
>>>     return READ_ONCE(head->next) == head;
>>> }
>>>
>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>> sufficient for KCSAN?
>> 
>> Yeah, there is only one use of data_race with that list.
>> 
>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>> deferred list").
>> 
>> Looks like that was added right in v1 of that change [1], so my best guess is
>> that it is not actually required.
>> 
>> If not required, likely we should just cleanup the single user.
>> 
>> [1]
>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>
> Do you have any recollection of why you added the data_race() markup?

Per my understanding, this is used to mark that the code intentionally
accesses folio->_deferred_list without the lock, while
folio->_deferred_list may be changed in parallel.  IIUC, this is what
data_race() is used for.  Or is my understanding wrong?

--
Best Regards,
Huang, Ying
Yin Fengwei March 18, 2024, 10 a.m. UTC | #9
On 3/18/2024 10:16 AM, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Hi Yin Fengwei,
>>
>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>> up for swapping in large folios in a future series.
>>>>>>
>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>> unnecessarily.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>     mm/vmscan.c | 9 +++++----
>>>>>>     1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>> *folio_list,
>>>>>>                         if (!can_split_folio(folio, NULL))
>>>>>>                             goto activate_locked;
>>>>>>                         /*
>>>>>> -                     * Split folios without a PMD map right
>>>>>> -                     * away. Chances are some or all of the
>>>>>> -                     * tail pages can be freed without IO.
>>>>>> +                     * Split partially mapped folios
>>>>>> +                     * right away. Chances are some or all
>>>>>> +                     * of the tail pages can be freed
>>>>>> +                     * without IO.
>>>>>>                          */
>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>                             split_folio_to_list(folio,
>>>>>>                                     folio_list))
>>>>>>                             goto activate_locked;
>>>>>
>>>>> Not sure if we might have to annotate that with data_race().
>>>>
>>>> I asked that exact question to Matthew in another context but didn't get a
>>>> response. There are examples of checking if the deferred list is empty with and
>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>
>>>> static inline int list_empty(const struct list_head *head)
>>>> {
>>>>      return READ_ONCE(head->next) == head;
>>>> }
>>>>
>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>> sufficient for KCSAN?
I don't think READ_ONCE() can replace the lock.

>>>
>>> Yeah, there is only one use of data_race with that list.
>>>
>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>> deferred list").
>>>
>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>> that it is not actually required.
>>>
>>> If not required, likely we should just cleanup the single user.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>
>> Do you have any recollection of why you added the data_race() markup?
> 
> Per my understanding, this is used to mark that the code accesses
> folio->_deferred_list without lock intentionally, while
> folio->_deferred_list may be changed in parallel.  IIUC, this is what
> data_race() is used for.  Or, my understanding is wrong?
Yes. This is my understanding also.


Regards
Yin, Fengwei

> 
> --
> Best Regards,
> Huang, Ying
David Hildenbrand March 18, 2024, 10:05 a.m. UTC | #10
On 18.03.24 11:00, Yin, Fengwei wrote:
> 
> 
> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> Hi Yin Fengwei,
>>>
>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>> up for swapping in large folios in a future series.
>>>>>>>
>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>> unnecessarily.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> ---
>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>>> *folio_list,
>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>                              goto activate_locked;
>>>>>>>                          /*
>>>>>>> -                     * Split folios without a PMD map right
>>>>>>> -                     * away. Chances are some or all of the
>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>> +                     * Split partially mapped folios
>>>>>>> +                     * right away. Chances are some or all
>>>>>>> +                     * of the tail pages can be freed
>>>>>>> +                     * without IO.
>>>>>>>                           */
>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>                              split_folio_to_list(folio,
>>>>>>>                                      folio_list))
>>>>>>>                              goto activate_locked;
>>>>>>
>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>
>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>> response. There are examples of checking if the deferred list is empty with and
>>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>>
>>>>> static inline int list_empty(const struct list_head *head)
>>>>> {
>>>>>       return READ_ONCE(head->next) == head;
>>>>> }
>>>>>
>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>>> sufficient for KCSAN?
> I don't think READ_ONCE() can replace the lock.
> 
>>>>
>>>> Yeah, there is only one use of data_race with that list.
>>>>
>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>> deferred list").
>>>>
>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>> that it is not actually required.
>>>>
>>>> If not required, likely we should just cleanup the single user.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>
>>> Do you have any recollection of why you added the data_race() markup?
>>
>> Per my understanding, this is used to mark that the code accesses
>> folio->_deferred_list without lock intentionally, while
>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>> data_race() is used for.  Or, my understanding is wrong?
> Yes. This is my understanding also.

Why don't we have a data_race() in deferred_split_folio() then, before 
taking the lock?

It's used a bit inconsistently here.
Ryan Roberts March 18, 2024, 3:35 p.m. UTC | #11
On 18/03/2024 10:05, David Hildenbrand wrote:
> On 18.03.24 11:00, Yin, Fengwei wrote:
>>
>>
>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Hi Yin Fengwei,
>>>>
>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>
>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>> unnecessarily.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> ---
>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>> list_head
>>>>>>>> *folio_list,
>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>                              goto activate_locked;
>>>>>>>>                          /*
>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>> +                     * Split partially mapped folios
>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>> +                     * without IO.
>>>>>>>>                           */
>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>                                      folio_list))
>>>>>>>>                              goto activate_locked;
>>>>>>>
>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>
>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>> with and
>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>> this:
>>>>>>
>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>> {
>>>>>>       return READ_ONCE(head->next) == head;
>>>>>> }
>>>>>>
>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>> not
>>>>>> sufficient for KCSAN?
>> I don't think READ_ONCE() can replace the lock.

But it doesn't ensure we get a consistent value and that the compiler orders the
load correctly. There are lots of patterns in the kernel that use READ_ONCE()
without a lock and they don't use data_race() - e.g. ptep_get_lockless().

It sounds like none of us really understand what data_race() is for, so I guess
I'll just do a KCSAN build and invoke the code path to see if it complains.


>>
>>>>>
>>>>> Yeah, there is only one use of data_race with that list.
>>>>>
>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>> deferred list").
>>>>>
>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>> that it is not actually required.
>>>>>
>>>>> If not required, likely we should just cleanup the single user.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>>
>>>> Do you have any recollection of why you added the data_race() markup?
>>>
>>> Per my understanding, this is used to mark that the code accesses
>>> folio->_deferred_list without lock intentionally, while
>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>> data_race() is used for.  Or, my understanding is wrong?
>> Yes. This is my understanding also.
> 
> Why don't we have a data_race() in deferred_split_folio() then, before taking
> the lock?
> 
> It's used a bit inconsistently here.
>
Ryan Roberts March 18, 2024, 3:36 p.m. UTC | #12
On 18/03/2024 15:35, Ryan Roberts wrote:
> On 18/03/2024 10:05, David Hildenbrand wrote:
>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> Hi Yin Fengwei,
>>>>>
>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>
>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>> unnecessarily.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> ---
>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>> list_head
>>>>>>>>> *folio_list,
>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>>                          /*
>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>> +                     * Split partially mapped folios
>>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>>> +                     * without IO.
>>>>>>>>>                           */
>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>                                      folio_list))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>
>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>
>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>> with and
>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>> this:
>>>>>>>
>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>> {
>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>> }
>>>>>>>
>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>> not
>>>>>>> sufficient for KCSAN?
>>> I don't think READ_ONCE() can replace the lock.
> 
> But it doesn't ensure we get a consistent value and that the compiler orders the

Sorry - fat fingers... I meant it *does* ensure we get a consistent value (i.e.
untorn)

> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
> 
> It sounds like none of us really understand what data_race() is for, so I guess
> I'll just do a KCSAN build and invoke the code path to see if it complains.
> 
> 
>>>
>>>>>>
>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>
>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>> deferred list").
>>>>>>
>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>> that it is not actually required.
>>>>>>
>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>>>
>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>
>>>> Per my understanding, this is used to mark that the code accesses
>>>> folio->_deferred_list without lock intentionally, while
>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>> data_race() is used for.  Or, my understanding is wrong?
>>> Yes. This is my understanding also.
>>
>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>> the lock?
>>
>> It's used a bit inconsistently here.
>>
>
Yin Fengwei March 19, 2024, 2:20 a.m. UTC | #13
On 3/18/24 23:35, Ryan Roberts wrote:
> On 18/03/2024 10:05, David Hildenbrand wrote:
>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> Hi Yin Fengwei,
>>>>>
>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>
>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>> unnecessarily.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> ---
>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>> list_head
>>>>>>>>> *folio_list,
>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>>                          /*
>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>> +                     * Split partially mapped folios
>>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>>> +                     * without IO.
>>>>>>>>>                           */
>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>                                      folio_list))
>>>>>>>>>                              goto activate_locked;
>>>>>>>>
>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>
>>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>> with and
>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>> this:
>>>>>>>
>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>> {
>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>> }
>>>>>>>
>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>> not
>>>>>>> sufficient for KCSAN?
>>> I don't think READ_ONCE() can replace the lock.
> 
> But it doesn't ensure we get a consistent value and that the compiler orders the
> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
They (ptep_get_lockless() and deferred_list) have different access patterns
(or race patterns) here, so I don't think they are comparable.

> 
> It sounds like none of us really understand what data_race() is for, so I guess
> I'll just do a KCSAN build and invoke the code path to see if it complains.
The READ_ONCE() in list_empty() will also shut down KCSAN.

> 
> 
>>>
>>>>>>
>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>
>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>> deferred list").
>>>>>>
>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>> that it is not actually required.
>>>>>>
>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>>>
>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>
>>>> Per my understanding, this is used to mark that the code accesses
>>>> folio->_deferred_list without lock intentionally, while
>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>> data_race() is used for.  Or, my understanding is wrong?
>>> Yes. This is my understanding also.
>>
>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>> the lock?
>>
>> It's used a bit inconsistently here.
>>
>
Yin Fengwei March 19, 2024, 2:31 a.m. UTC | #14
On 3/18/24 18:05, David Hildenbrand wrote:
> On 18.03.24 11:00, Yin, Fengwei wrote:
>>
>>
>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Hi Yin Fengwei,
>>>>
>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>
>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>> unnecessarily.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> ---
>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head
>>>>>>>> *folio_list,
>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>                              goto activate_locked;
>>>>>>>>                          /*
>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>> +                     * Split partially mapped folios
>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>> +                     * without IO.
>>>>>>>>                           */
>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>                                      folio_list))
>>>>>>>>                              goto activate_locked;
>>>>>>>
>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>
>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>> response. There are examples of checking if the deferred list is empty with and
>>>>>> without data_race() in the code base. But list_empty() is implemented like this:
>>>>>>
>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>> {
>>>>>>       return READ_ONCE(head->next) == head;
>>>>>> }
>>>>>>
>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps not
>>>>>> sufficient for KCSAN?
>> I don't think READ_ONCE() can replace the lock.
>>
>>>>>
>>>>> Yeah, there is only one use of data_race with that list.
>>>>>
>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>> deferred list").
>>>>>
>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>> that it is not actually required.
>>>>>
>>>>> If not required, likely we should just cleanup the single user.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>>
>>>> Do you have any recollection of why you added the data_race() markup?
>>>
>>> Per my understanding, this is used to mark that the code accesses
>>> folio->_deferred_list without lock intentionally, while
>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>> data_race() is used for.  Or, my understanding is wrong?
>> Yes. This is my understanding also.
> 
> Why don't we have a data_race() in deferred_split_folio() then, before taking the lock?
No idea why no data_race() was added there, but I think we should add one.

Regards
Yin, Fengwei

> 
> It's used a bit inconsistently here.
>
Ryan Roberts March 19, 2024, 2:40 p.m. UTC | #15
On 19/03/2024 02:20, Yin Fengwei wrote:
> 
> 
> On 3/18/24 23:35, Ryan Roberts wrote:
>> On 18/03/2024 10:05, David Hildenbrand wrote:
>>> On 18.03.24 11:00, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 3/18/2024 10:16 AM, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> Hi Yin Fengwei,
>>>>>>
>>>>>> On 15/03/2024 11:12, David Hildenbrand wrote:
>>>>>>> On 15.03.24 11:49, Ryan Roberts wrote:
>>>>>>>> On 15/03/2024 10:43, David Hildenbrand wrote:
>>>>>>>>> On 11.03.24 16:00, Ryan Roberts wrote:
>>>>>>>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>>>>>>>> folios before swap-out. This benefits performance of the swap-out path
>>>>>>>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>>>>>>>> up for swapping in large folios in a future series.
>>>>>>>>>>
>>>>>>>>>> If the folio is partially mapped, we continue to split it since we want
>>>>>>>>>> to avoid the extra IO overhead and storage of writing out pages
>>>>>>>>>> unnecessarily.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>> ---
>>>>>>>>>>      mm/vmscan.c | 9 +++++----
>>>>>>>>>>      1 file changed, 5 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>>> index cf7d4cf47f1a..0ebec99e04c6 100644
>>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>>> @@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct
>>>>>>>>>> list_head
>>>>>>>>>> *folio_list,
>>>>>>>>>>                          if (!can_split_folio(folio, NULL))
>>>>>>>>>>                              goto activate_locked;
>>>>>>>>>>                          /*
>>>>>>>>>> -                     * Split folios without a PMD map right
>>>>>>>>>> -                     * away. Chances are some or all of the
>>>>>>>>>> -                     * tail pages can be freed without IO.
>>>>>>>>>> +                     * Split partially mapped folios
>>>>>>>>>> +                     * right away. Chances are some or all
>>>>>>>>>> +                     * of the tail pages can be freed
>>>>>>>>>> +                     * without IO.
>>>>>>>>>>                           */
>>>>>>>>>> -                    if (!folio_entire_mapcount(folio) &&
>>>>>>>>>> +                    if (!list_empty(&folio->_deferred_list) &&
>>>>>>>>>>                              split_folio_to_list(folio,
>>>>>>>>>>                                      folio_list))
>>>>>>>>>>                              goto activate_locked;
>>>>>>>>>
>>>>>>>>> Not sure if we might have to annotate that with data_race().
>>>>>>>>
>>>>>>>> I asked that exact question to Matthew in another context but didn't get a
>>>>>>>> response. There are examples of checking if the deferred list is empty
>>>>>>>> with and
>>>>>>>> without data_race() in the code base. But list_empty() is implemented like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> static inline int list_empty(const struct list_head *head)
>>>>>>>> {
>>>>>>>>       return READ_ONCE(head->next) == head;
>>>>>>>> }
>>>>>>>>
>>>>>>>> So I assumed the READ_ONCE() makes everything safe without a lock? Perhaps
>>>>>>>> not
>>>>>>>> sufficient for KCSAN?
>>>> I don't think READ_ONCE() can replace the lock.
>>
>> But it doesn't ensure we get a consistent value and that the compiler orders the
>> load correctly. There are lots of patterns in the kernel that use READ_ONCE()
>> without a lock and they don't use data_race() - e.g. ptep_get_lockless().
> They (ptep_get_lockless() and deferred_list) have different access patterns
> (or race patterns) here, so I don't think they are comparable.
> 
>>
>> It sounds like none of us really understand what data_race() is for, so I guess
>> I'll just do a KCSAN build and invoke the code path to see if it complains.
> The READ_ONCE() in list_empty() will also shut down KCSAN.

OK, I found some time to run the test with KCSAN; nothing fires.

But then I read the docs and looked at the code a bit.
Documentation/dev-tools/kcsan.rst states:

    In an execution, two memory accesses form a *data race* if they *conflict*,
    they happen concurrently in different threads, and at least one of them is a
    *plain access*; they *conflict* if both access the same memory location, and
    at least one is a write.

It also clarifies that READ_ONCE() is a "marked access". So we would have a data
race if there was a concurrent, *plain* write to folio->_deferred_list.next.
This can occur in a couple of places I believe, for example:

deferred_split_folio()
  list_add_tail()
    __list_add()
      new->next = next;

deferred_split_scan()
  list_move()
    list_add()
      __list_add()
        new->next = next;

So if either deferred_split_folio() or deferred_split_scan() can run
concurrently with shrink_folio_list() for the same folio (I believe both
can), then we have a race, and this list_empty() check needs to be marked
with data_race(). The race is safe by design, but it does need to be marked.

I'll fix this in my next version.

Thanks,
Ryan


> 
>>
>>
>>>>
>>>>>>>
>>>>>>> Yeah, there is only one use of data_race with that list.
>>>>>>>
>>>>>>> It was added in f3ebdf042df4 ("THP: avoid lock when check whether THP is in
>>>>>>> deferred list").
>>>>>>>
>>>>>>> Looks like that was added right in v1 of that change [1], so my best guess is
>>>>>>> that it is not actually required.
>>>>>>>
>>>>>>> If not required, likely we should just cleanup the single user.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-mm/20230417075643.3287513-2-fengwei.yin@intel.com/
>>>>>>
>>>>>> Do you have any recollection of why you added the data_race() markup?
>>>>>
>>>>> Per my understanding, this is used to mark that the code accesses
>>>>> folio->_deferred_list without lock intentionally, while
>>>>> folio->_deferred_list may be changed in parallel.  IIUC, this is what
>>>>> data_race() is used for.  Or, my understanding is wrong?
>>>> Yes. This is my understanding also.
>>>
>>> Why don't we have a data_race() in deferred_split_folio() then, before taking
>>> the lock?
>>>
>>> It's used a bit inconsistently here.
>>>
>>

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf7d4cf47f1a..0ebec99e04c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1222,11 +1222,12 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split partially mapped folios
+					 * right away. Chances are some or all
+					 * of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (!list_empty(&folio->_deferred_list) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;