Message ID | 20220510211743.95831-1-minchan@kernel.org (mailing list archive)
State | New
Series | [v4] mm: fix is_pinnable_page against on cma page
On 5/10/22 14:17, Minchan Kim wrote: > Pages on CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA > so current is_pinnable_page could miss CMA pages which has MIGRATE_ > ISOLATE. It ends up pinning CMA pages as longterm at pin_user_pages > APIs so CMA allocation keep failed until the pin is released. > > > CPU 0 CPU 1 - Task B > > cma_alloc > alloc_contig_range > pin_user_pages_fast(FOLL_LONGTERM) > change pageblock as MIGRATE_ISOLATE > internal_get_user_pages_fast > lockless_pages_from_mm > gup_pte_range > try_grab_folio > is_pinnable_page > return true; > So, pinned the page successfully. > page migration failure with pinned page > .. > .. After 30 sec > unpin_user_page(page) > > CMA allocation succeeded after 30 sec. Hi Minchan, Thanks for spelling out how this works, that really speeds up the review and helps others quickly learn what is going on with the code. For my own information, mainly: where is CMA blocking, so that it waits (apparently) for the during of the pin, before proceeding? (Or is the caller retrying?) I noticed a few minor points but was too slow to reply, notes below: > > The CMA allocation path protects the migration type change race > using zone->lock but what GUP path need to know is just whether the > page is on CMA area or not rather than exact migration type. > Thus, we don't need zone->lock but just checks migration type in > either of (MIGRATE_ISOLATE and MIGRATE_CMA). > > Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause > rejecting of pinning pages on MIGRATE_ISOLATE pageblocks even > though it's neither CMA nor movable zone if the page is temporarily > unmovable. However, such a migration failure by unexpected temporal > refcount holding is general issue, not only come from MIGRATE_ISOLATE > and the MIGRATE_ISOLATE is also transient state like other temporal > elevated refcount problem. > > Cc: David Hildenbrand <david@redhat.com> > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > * from v3 - https://lore.kernel.org/all/20220509153430.4125710-1-minchan@kernel.org/ > * Fix typo and adding more description - akpm > > * from v2 - https://lore.kernel.org/all/20220505064429.2818496-1-minchan@kernel.org/ > * Use __READ_ONCE instead of volatile - akpm > > * from v1 - https://lore.kernel.org/all/20220502173558.2510641-1-minchan@kernel.org/ > * fix build warning - lkp > * fix refetching issue of migration type > * add side effect on !ZONE_MOVABLE and !MIGRATE_CMA in description - david > > include/linux/mm.h | 15 +++++++++++++-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 6acca5cecbc5..cbf79eb790e0 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1625,8 +1625,19 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, > #ifdef CONFIG_MIGRATION > static inline bool is_pinnable_page(struct page *page) > { > - return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) || > - is_zero_pfn(page_to_pfn(page)); > +#ifdef CONFIG_CMA > + /* > + * use volatile to use local variable mt instead of > + * refetching mt value. > + */ This comment is stale and should therefore be deleted. > + int __mt = get_pageblock_migratetype(page); > + int mt = __READ_ONCE(__mt); Although I saw the email discussion about this in v2, that discussion didn't go far enough. It started with "don't use volatile", and went on to "try __READ_ONCE() instead", but it should have continued on to "you don't need this at all". Because you don't. 
There is nothing you are racing with, and adding __READ_ONCE() in order to avoid a completely not-going-to-happen compiler re-invocation of a significant code block is just very wrong. So let's just let it go entirely. :) > + > + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) MIGRATE_ISOLATE is not always defined, and must therefore be protected with a check on CONFIG_MEMORY_ISOLATION...oh never mind, I see in mm/Kconfig: config CMA ... select MEMORY_ISOLATION ...so that's OK. What a tangled web, I wonder if enum migratetype really needs to be sliced up like that, but that's a separate topic. > + return false; > +#endif > + > + return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page))); And actually this area is getting rather nested with the various ifdefs, and it is probably time to refactor them a bit, considering the above point about MIGRATE_ISOLATE. I had something in mind (which is why I delayed my feedback), along the lines of merging _ISOLATE and _CMA and the ifdefs. But it's just a fine point and not critical of course, just a thought. > } > #else > static inline bool is_pinnable_page(struct page *page) thanks,
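Reassembled from the diff quoted above, the CONFIG_MIGRATION variant of the function under review reads roughly as follows (the comment is the one flagged as stale, and __READ_ONCE() is the part questioned above):

static inline bool is_pinnable_page(struct page *page)
{
#ifdef CONFIG_CMA
        /*
         * use volatile to use local variable mt instead of
         * refetching mt value.
         */
        int __mt = get_pageblock_migratetype(page);
        int mt = __READ_ONCE(__mt);

        if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
                return false;
#endif

        return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
}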
Hi John, On Tue, May 10, 2022 at 03:56:37PM -0700, John Hubbard wrote: > On 5/10/22 14:17, Minchan Kim wrote: > > Pages on CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA > > so current is_pinnable_page could miss CMA pages which has MIGRATE_ > > ISOLATE. It ends up pinning CMA pages as longterm at pin_user_pages > > APIs so CMA allocation keep failed until the pin is released. > > > > > > CPU 0 CPU 1 - Task B > > > > cma_alloc > > alloc_contig_range > > pin_user_pages_fast(FOLL_LONGTERM) > > change pageblock as MIGRATE_ISOLATE > > internal_get_user_pages_fast > > lockless_pages_from_mm > > gup_pte_range > > try_grab_folio > > is_pinnable_page > > return true; > > So, pinned the page successfully. > > page migration failure with pinned page > > .. > > .. After 30 sec > > unpin_user_page(page) > > > > CMA allocation succeeded after 30 sec. > > Hi Minchan, > > Thanks for spelling out how this works, that really speeds up the > review and helps others quickly learn what is going on with the code. > > For my own information, mainly: where is CMA blocking, so that > it waits (apparently) for the during of the pin, before proceeding? > (Or is the caller retrying?) cma_alloc would fail in the first place, since it couldn't migrate the page out due to the elevated refcount; cma_alloc would proceed to the next pageblocks and keep migrating pages out, but it ends up failing the allocation because the user tries to allocate entire CMA pageblocks, not just part of them, so a single pinned page makes the whole CMA allocation fail. From then on, the user (e.g., dmabuf) could retry a few more times, but it keeps failing until Task B releases the refcount on the page. > > I noticed a few minor points but was too slow to reply, notes below: > > > > > The CMA allocation path protects the migration type change race > > using zone->lock but what GUP path need to know is just whether the > > page is on CMA area or not rather than exact migration type. > > Thus, we don't need zone->lock but just checks migration type in > > either of (MIGRATE_ISOLATE and MIGRATE_CMA). > > > > Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause > > rejecting of pinning pages on MIGRATE_ISOLATE pageblocks even > > though it's neither CMA nor movable zone if the page is temporarily > > unmovable. However, such a migration failure by unexpected temporal > > refcount holding is general issue, not only come from MIGRATE_ISOLATE > > and the MIGRATE_ISOLATE is also transient state like other temporal > > elevated refcount problem.
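A minimal sketch of the retry pattern described here, assuming a hypothetical CMA user; the helper name, retry count, and alignment argument are illustrative and not taken from dmabuf or any real driver:

#include <linux/cma.h>

#define NR_CMA_RETRIES  5       /* hypothetical retry policy */

/*
 * cma_alloc() -> alloc_contig_range() must migrate away every page in the
 * requested range, so one long-term pinned page fails the whole attempt;
 * retrying only succeeds once the GUP user drops its pin (30 sec later in
 * the trace above).
 */
static struct page *try_cma_alloc(struct cma *cma, unsigned long count)
{
        struct page *page;
        int i;

        for (i = 0; i < NR_CMA_RETRIES; i++) {
                page = cma_alloc(cma, count, 0, false);
                if (page)
                        return page;
        }
        return NULL;
}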
> > > > Cc: David Hildenbrand <david@redhat.com> > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > * from v3 - https://lore.kernel.org/all/20220509153430.4125710-1-minchan@kernel.org/ > > * Fix typo and adding more description - akpm > > > > * from v2 - https://lore.kernel.org/all/20220505064429.2818496-1-minchan@kernel.org/ > > * Use __READ_ONCE instead of volatile - akpm > > > > * from v1 - https://lore.kernel.org/all/20220502173558.2510641-1-minchan@kernel.org/ > > * fix build warning - lkp > > * fix refetching issue of migration type > > * add side effect on !ZONE_MOVABLE and !MIGRATE_CMA in description - david > > > > include/linux/mm.h | 15 +++++++++++++-- > > 1 file changed, 13 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 6acca5cecbc5..cbf79eb790e0 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1625,8 +1625,19 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, > > #ifdef CONFIG_MIGRATION > > static inline bool is_pinnable_page(struct page *page) > > { > > - return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) || > > - is_zero_pfn(page_to_pfn(page)); > > +#ifdef CONFIG_CMA > > + /* > > + * use volatile to use local variable mt instead of > > + * refetching mt value. > > + */ > > This comment is stale and should therefore be deleted. Yeah. > > > + int __mt = get_pageblock_migratetype(page); > > + int mt = __READ_ONCE(__mt); > > Although I saw the email discussion about this in v2, that discussion > didn't go far enough. It started with "don't use volatile", and went > on to "try __READ_ONCE() instead", but it should have continued on > to "you don't need this at all". That's really what I want to hear from experts so wanted to learn "Why". How could we prevent refetching of the mt if we don't use __READ_ONCE or volatile there? > > Because you don't. There is nothing you are racing with, and adding > __READ_ONCE() in order to avoid a completely not-going-to-happen > compiler re-invocation of a significant code block is just very wrong. > > So let's just let it go entirely. :) Yeah, once it's clear for everyone, I am happy to remove the unnecessary lines. > > > + > > + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) > > MIGRATE_ISOLATE is not always defined, and must therefore be protected > with a check on CONFIG_MEMORY_ISOLATION...oh never mind, I see in > mm/Kconfig: > > config CMA > ... > select MEMORY_ISOLATION > > ...so that's OK. What a tangled web, I wonder if enum migratetype > really needs to be sliced up like that, but that's a separate topic. > > > + return false; > > +#endif > > + > > + return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page))); > > And actually this area is getting rather nested with the various ifdefs, > and it is probably time to refactor them a bit, considering the above > point about MIGRATE_ISOLATE. I had something in mind (which is why I > delayed my feedback), along the lines of merging _ISOLATE and _CMA and > the ifdefs. But it's just a fine point and not critical of course, just > a thought. Glad to hear someone is looking that.
On 5/10/22 4:31 PM, Minchan Kim wrote: >>> + int __mt = get_pageblock_migratetype(page); >>> + int mt = __READ_ONCE(__mt); >> >> Although I saw the email discussion about this in v2, that discussion >> didn't go far enough. It started with "don't use volatile", and went >> on to "try __READ_ONCE() instead", but it should have continued on >> to "you don't need this at all". > > That's really what I want to hear from experts so wanted to learn > "Why". How could we prevent refetching of the mt if we don't use > __READ_ONCE or volatile there? > >> >> Because you don't. There is nothing you are racing with, and adding >> __READ_ONCE() in order to avoid a completely not-going-to-happen >> compiler re-invocation of a significant code block is just very wrong. >> >> So let's just let it go entirely. :) > > Yeah, once it's clear for everyone, I am happy to remove the > unnecessary lines. > >> >>> + >>> + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) >> With or without __READ_ONCE() or volatile or anything else, this code will do what you want. Which is: loosely check for either of the above. What functional problem do you think you are preventing with __READ_ONCE()? Because I don't see one. thanks,
On Tue, May 10, 2022 at 04:58:13PM -0700, John Hubbard wrote: > On 5/10/22 4:31 PM, Minchan Kim wrote: > > > > + int __mt = get_pageblock_migratetype(page); > > > > + int mt = __READ_ONCE(__mt); > > > > > > Although I saw the email discussion about this in v2, that discussion > > > didn't go far enough. It started with "don't use volatile", and went > > > on to "try __READ_ONCE() instead", but it should have continued on > > > to "you don't need this at all". > > > > That's really what I want to hear from experts so wanted to learn > > "Why". How could we prevent refetching of the mt if we don't use > > __READ_ONCE or volatile there? > > > > > > > > Because you don't. There is nothing you are racing with, and adding > > > __READ_ONCE() in order to avoid a completely not-going-to-happen > > > compiler re-invocation of a significant code block is just very wrong. > > > > > > So let's just let it go entirely. :) > > > > Yeah, once it's clear for everyone, I am happy to remove the > > unnecessary lines. > > > > > > > > > + > > > > + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) > > > > > With or without __READ_ONCE() or volatile or anything else, > this code will do what you want. Which is: loosely check > for either of the above. > > What functional problem do you think you are preventing > with __READ_ONCE()? Because I don't see one. I discussed the issue at v1 so please take a look. https://lore.kernel.org/all/YnFvmc+eMoXvLCWf@google.com/
On 5/10/22 17:09, Minchan Kim wrote: > On Tue, May 10, 2022 at 04:58:13PM -0700, John Hubbard wrote: >> On 5/10/22 4:31 PM, Minchan Kim wrote: >>>>> + int __mt = get_pageblock_migratetype(page); >>>>> + int mt = __READ_ONCE(__mt); >>>> >>>> Although I saw the email discussion about this in v2, that discussion >>>> didn't go far enough. It started with "don't use volatile", and went >>>> on to "try __READ_ONCE() instead", but it should have continued on >>>> to "you don't need this at all". >>> >>> That's really what I want to hear from experts so wanted to learn >>> "Why". How could we prevent refetching of the mt if we don't use >>> __READ_ONCE or volatile there? >>> >>>> >>>> Because you don't. There is nothing you are racing with, and adding >>>> __READ_ONCE() in order to avoid a completely not-going-to-happen >>>> compiler re-invocation of a significant code block is just very wrong. >>>> >>>> So let's just let it go entirely. :) >>> >>> Yeah, once it's clear for everyone, I am happy to remove the >>> unnecessary lines. >>> >>>> >>>>> + >>>>> + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) >>>> >> >> With or without __READ_ONCE() or volatile or anything else, >> this code will do what you want. Which is: loosely check >> for either of the above. >> >> What functional problem do you think you are preventing >> with __READ_ONCE()? Because I don't see one. > > I discussed the issue at v1 so please take a look. > > https://lore.kernel.org/all/YnFvmc+eMoXvLCWf@google.com/ I read that, but there was never any real justification there for needing to prevent a re-read of mt, just a preference: "I'd like to keep use the local variable mt's value in folloing conditions checks instead of refetching the value from get_pageblock_migratetype." But I don't believe that there is any combination of values of mt that will cause a problem here. I also think that once we pull in experts, they will tell us that the compiler is not going to re-run a non-trivial function to re-fetch a value, but I'm not one of those experts, so that's still arguable. But imagine what the kernel code would look like if every time we call a large function, we have to consider if it actually gets called some arbitrary number of times, due to (anti-) optimizations by the compiler. This seems like something that is not really happening. thanks,
On Tue, May 10, 2022 at 09:32:05PM -0700, John Hubbard wrote: > On 5/10/22 17:09, Minchan Kim wrote: > > On Tue, May 10, 2022 at 04:58:13PM -0700, John Hubbard wrote: > > > On 5/10/22 4:31 PM, Minchan Kim wrote: > > > > > > + int __mt = get_pageblock_migratetype(page); > > > > > > + int mt = __READ_ONCE(__mt); > > > > > > > > > > Although I saw the email discussion about this in v2, that discussion > > > > > didn't go far enough. It started with "don't use volatile", and went > > > > > on to "try __READ_ONCE() instead", but it should have continued on > > > > > to "you don't need this at all". > > > > > > > > That's really what I want to hear from experts so wanted to learn > > > > "Why". How could we prevent refetching of the mt if we don't use > > > > __READ_ONCE or volatile there? > > > > > > > > > > > > > > Because you don't. There is nothing you are racing with, and adding > > > > > __READ_ONCE() in order to avoid a completely not-going-to-happen > > > > > compiler re-invocation of a significant code block is just very wrong. > > > > > > > > > > So let's just let it go entirely. :) > > > > > > > > Yeah, once it's clear for everyone, I am happy to remove the > > > > unnecessary lines. > > > > > > > > > > > > > > > + > > > > > > + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) > > > > > > > > > > > With or without __READ_ONCE() or volatile or anything else, > > > this code will do what you want. Which is: loosely check > > > for either of the above. > > > > > > What functional problem do you think you are preventing > > > with __READ_ONCE()? Because I don't see one. > > > > I discussed the issue at v1 so please take a look. > > > > https://lore.kernel.org/all/YnFvmc+eMoXvLCWf@google.com/ > > I read that, but there was never any real justification there for needing > to prevent a re-read of mt, just a preference: "I'd like to keep use the local > variable mt's value in folloing conditions checks instead of refetching > the value from get_pageblock_migratetype." > > But I don't believe that there is any combination of values of mt that > will cause a problem here. > > I also think that once we pull in experts, they will tell us that the > compiler is not going to re-run a non-trivial function to re-fetch a > value, but I'm not one of those experts, so that's still arguable. But > imagine what the kernel code would look like if every time we call > a large function, we have to consider if it actually gets called some > arbitrary number of times, due to (anti-) optimizations by the compiler. > This seems like something that is not really happening. Maybe I am being paranoid, since I have heard of too many subtle things the compiler can do to high level language code, so I wanted to be careful, especially when we do lockless stuff. And who knows whether the large(?) function gets changed into a small(?) one later on? I'd like to hear from the experts to decide it.
On 5/11/22 2:46 PM, Minchan Kim wrote: >> I read that, but there was never any real justification there for needing >> to prevent a re-read of mt, just a preference: "I'd like to keep use the local >> variable mt's value in folloing conditions checks instead of refetching >> the value from get_pageblock_migratetype." >> >> But I don't believe that there is any combination of values of mt that >> will cause a problem here. >> >> I also think that once we pull in experts, they will tell us that the >> compiler is not going to re-run a non-trivial function to re-fetch a >> value, but I'm not one of those experts, so that's still arguable. But >> imagine what the kernel code would look like if every time we call >> a large function, we have to consider if it actually gets called some >> arbitrary number of times, due to (anti-) optimizations by the compiler. >> This seems like something that is not really happening. > > Maybe, I might be paranoid since I have heard too subtle things > about how compiler could changes high level language code so wanted > be careful especially when we do lockless-stuff. > > Who cares when we change the large(?) function to small(?) function > later on? I'd like to hear from experts to decide it. > Yes. But one thing that is still unanswered, that I think you can answer, is: even if the compiler *did* re-read the mt variable, what problems could that cause? I claim "no problems", because there is no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause problems here. Any if that's true, then we can leave the experts alone, because the answer is there without knowing what happens exactly to mt. thanks,
On Wed, May 11, 2022 at 03:25:49PM -0700, John Hubbard wrote: > On 5/11/22 2:46 PM, Minchan Kim wrote: > > > I read that, but there was never any real justification there for needing > > > to prevent a re-read of mt, just a preference: "I'd like to keep use the local > > > variable mt's value in folloing conditions checks instead of refetching > > > the value from get_pageblock_migratetype." > > > > > > But I don't believe that there is any combination of values of mt that > > > will cause a problem here. > > > > > > I also think that once we pull in experts, they will tell us that the > > > compiler is not going to re-run a non-trivial function to re-fetch a > > > value, but I'm not one of those experts, so that's still arguable. But > > > imagine what the kernel code would look like if every time we call > > > a large function, we have to consider if it actually gets called some > > > arbitrary number of times, due to (anti-) optimizations by the compiler. > > > This seems like something that is not really happening. > > > > Maybe, I might be paranoid since I have heard too subtle things > > about how compiler could changes high level language code so wanted > > be careful especially when we do lockless-stuff. > > > > Who cares when we change the large(?) function to small(?) function > > later on? I'd like to hear from experts to decide it. > > > > Yes. But one thing that is still unanswered, that I think you can > answer, is: even if the compiler *did* re-read the mt variable, what > problems could that cause? I claim "no problems", because there is > no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause > problems here. What scenario I am concerning with __READ_ONCE so compiler inlining get_pageblock_migratetype two times are CPU 0 CPU 1 alloc_contig_range is_pinnable_page start_isolate_page_range set_pageblock_migratetype(MIGRATE_ISOLATE) if (get_pageeblock_migratetype(page) == MIGRATE_CMA) so it's false undo: set_pageblock_migratetype(MIGRATE_CMA) if (get_pageeblock_migratetype(page) == MIGRATE_ISOLATE) so it's false In the end, CMA memory would be pinned by CPU 0 process so CMA allocation keep failed until the process release the refcount. > > Any if that's true, then we can leave the experts alone, because > the answer is there without knowing what happens exactly to mt. > > thanks, > > -- > John Hubbard > NVIDIA >
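Written out as straight-line code, with the helper deliberately re-read for each comparison, the interleaving described above looks roughly like this (illustrative only; the enum values and helper are stand-ins, not the kernel's):

#include <stdbool.h>

enum { MIGRATE_CMA = 4, MIGRATE_ISOLATE = 5 };  /* stand-ins, not the kernel's values */

static int pageblock_mt;        /* the pageblock bits CPU 1 flips back and forth */

static int read_migratetype(void)
{
        return pageblock_mt;    /* each call is a fresh read of shared state */
}

/* CPU 0's check, as if the compiler re-expanded the helper per comparison. */
static bool is_pinnable_racy(void)
{
        /* CPU 1 has just set the pageblock to MIGRATE_ISOLATE... */
        if (read_migratetype() == MIGRATE_CMA)          /* sees ISOLATE: false */
                return false;
        /* ...CPU 1's undo path now sets it back to MIGRATE_CMA... */
        if (read_migratetype() == MIGRATE_ISOLATE)      /* sees CMA: false */
                return false;
        return true;    /* the CMA page is wrongly treated as pinnable */
}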
On 5/11/22 15:37, Minchan Kim wrote: >> Yes. But one thing that is still unanswered, that I think you can >> answer, is: even if the compiler *did* re-read the mt variable, what >> problems could that cause? I claim "no problems", because there is >> no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause >> problems here. > > What scenario I am concerning with __READ_ONCE so compiler > inlining get_pageblock_migratetype two times are > > CPU 0 CPU 1 > alloc_contig_range > is_pinnable_page start_isolate_page_range > set_pageblock_migratetype(MIGRATE_ISOLATE) > if (get_pageeblock_migratetype(page) == MIGRATE_CMA) > so it's false > undo: > set_pageblock_migratetype(MIGRATE_CMA) > > if (get_pageeblock_migratetype(page) == MIGRATE_ISOLATE) > so it's false > > In the end, CMA memory would be pinned by CPU 0 process > so CMA allocation keep failed until the process release the > refcount. > OK, so the code checks the wrong item each time. But the code really only needs to know "is either _CMA or _ISOLATE set?". And so you can just sidestep the entire question by writing it like this: int mt = get_pageblock_migratetype(page); if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) return false; ...yes? thanks,
On Wed, May 11, 2022 at 03:49:06PM -0700, John Hubbard wrote: > On 5/11/22 15:37, Minchan Kim wrote: > > > Yes. But one thing that is still unanswered, that I think you can > > > answer, is: even if the compiler *did* re-read the mt variable, what > > > problems could that cause? I claim "no problems", because there is > > > no combination of 0, _CMA, _ISOLATE, _CMA|ISOLATE that will cause > > > problems here. > > > > What scenario I am concerning with __READ_ONCE so compiler > > inlining get_pageblock_migratetype two times are > > > > CPU 0 CPU 1 > > alloc_contig_range > > is_pinnable_page start_isolate_page_range > > set_pageblock_migratetype(MIGRATE_ISOLATE) > > if (get_pageeblock_migratetype(page) == MIGRATE_CMA) > > so it's false > > undo: > > set_pageblock_migratetype(MIGRATE_CMA) > > if (get_pageeblock_migratetype(page) == MIGRATE_ISOLATE) > > so it's false > > > > In the end, CMA memory would be pinned by CPU 0 process > > so CMA allocation keep failed until the process release the > > refcount. > > > > OK, so the code checks the wrong item each time. But the code really > only needs to know "is either _CMA or _ISOLATE set?". And so you Yes. > can just sidestep the entire question by writing it like this: > > int mt = get_pageblock_migratetype(page); > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) > return false; I am confused. Isn't it same question? set_pageblock_migratetype(MIGRATE_ISOLATE) if (get_pageblock_migrate(page) & MIGRATE_CMA) set_pageblock_migratetype(MIGRATE_CMA) if (get_pageblock_migrate(page) & MIGRATE_ISOLATE)
On 5/11/22 16:08, Minchan Kim wrote: >> OK, so the code checks the wrong item each time. But the code really >> only needs to know "is either _CMA or _ISOLATE set?". And so you > > Yes. > >> can just sidestep the entire question by writing it like this: >> >> int mt = get_pageblock_migratetype(page); >> >> if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) >> return false; > > I am confused. Isn't it same question? > > set_pageblock_migratetype(MIGRATE_ISOLATE) > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > set_pageblock_migratetype(MIGRATE_CMA) > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) Well no, because the "&" operation is a single operation on the CPU, and isn't going to get split up like that. thanks,
On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote: > On 5/11/22 16:08, Minchan Kim wrote: > > > OK, so the code checks the wrong item each time. But the code really > > > only needs to know "is either _CMA or _ISOLATE set?". And so you > > > > Yes. > > > > > can just sidestep the entire question by writing it like this: > > > > > > int mt = get_pageblock_migratetype(page); > > > > > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) > > > return false; > > > > I am confused. Isn't it same question? > > > > set_pageblock_migratetype(MIGRATE_ISOLATE) > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > > set_pageblock_migratetype(MIGRATE_CMA) > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > Well no, because the "&" operation is a single operation on the CPU, and > isn't going to get split up like that. Oh, if that's true, yeah, I could live with it. Thanks, let me post next revision with commenting about that.
On Wed, May 11, 2022 at 04:15:17PM -0700, Minchan Kim wrote: > On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote: > > On 5/11/22 16:08, Minchan Kim wrote: > > > > OK, so the code checks the wrong item each time. But the code really > > > > only needs to know "is either _CMA or _ISOLATE set?". And so you > > > > > > Yes. > > > > > > > can just sidestep the entire question by writing it like this: > > > > > > > > int mt = get_pageblock_migratetype(page); > > > > > > > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) > > > > return false; > > > > > > I am confused. Isn't it same question? > > > > > > set_pageblock_migratetype(MIGRATE_ISOLATE) > > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > > > > set_pageblock_migratetype(MIGRATE_CMA) > > > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > > > Well no, because the "&" operation is a single operation on the CPU, and > > isn't going to get split up like that. > > Oh, if that's true, yeah, I could live with it. > > Thanks, let me post next revision with commenting about that. This is delta to confirm before posting next revision. Are you okay with this one? diff --git a/include/linux/mm.h b/include/linux/mm.h index cbf79eb790e0..7b2df6780552 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1626,14 +1626,14 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, static inline bool is_pinnable_page(struct page *page) { #ifdef CONFIG_CMA + int mt = get_pageblock_migratetype(page); + /* - * use volatile to use local variable mt instead of - * refetching mt value. + * "&" operation would prevent compiler split up + * get_pageblock_migratetype two times for each + * condition check: refetching mt value two times. */ - int __mt = get_pageblock_migratetype(page); - int mt = __READ_ONCE(__mt); - - if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) + if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) return false; #endif
On 5/11/22 16:28, Minchan Kim wrote: > This is delta to confirm before posting next revision. > > Are you okay with this one? Yes, just maybe delete the comment: > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index cbf79eb790e0..7b2df6780552 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1626,14 +1626,14 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, > static inline bool is_pinnable_page(struct page *page) > { > #ifdef CONFIG_CMA > + int mt = get_pageblock_migratetype(page); > + > /* > - * use volatile to use local variable mt instead of > - * refetching mt value. > + * "&" operation would prevent compiler split up > + * get_pageblock_migratetype two times for each > + * condition check: refetching mt value two times. > */ If you wanted a useful comment here, I think a description of what is running at the same time would be good. But otherwise, I'd just delete the entire comment, because a micro-comment about "&" is not really worth the vertical space here IMHO. > - int __mt = get_pageblock_migratetype(page); > - int mt = __READ_ONCE(__mt); > - > - if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) > + if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) > return false; > #endif > Anyway, I'm comfortable with this now: Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks,
On Wed, May 11, 2022 at 04:13:10PM -0700, John Hubbard wrote: > On 5/11/22 16:08, Minchan Kim wrote: > > > OK, so the code checks the wrong item each time. But the code really > > > only needs to know "is either _CMA or _ISOLATE set?". And so you > > > > Yes. > > > > > can just sidestep the entire question by writing it like this: > > > > > > int mt = get_pageblock_migratetype(page); > > > > > > if (mt & (MIGRATE_ISOLATE | MIGRATE_CMA)) > > > return false; > > > > I am confused. Isn't it same question? > > > > set_pageblock_migratetype(MIGRATE_ISOLATE) > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > > set_pageblock_migratetype(MIGRATE_CMA) > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > Well no, because the "&" operation is a single operation on the CPU, and > isn't going to get split up like that. Chiming in a bit late... The usual way that this sort of thing causes trouble is if there is a single store instruction that changes the value from MIGRATE_ISOLATE to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, and then combine the results. This could give a zero outcome where the underlying variable never had the value zero. Is this sort of thing low probability? Definitely. Isn't this sort of thing prohibited? Definitely not. So what you have will likely work for at least a while longer, but it is not guaranteed and it forces you to think a lot harder about what the current implementations of the compiler can and cannot do to you. The following LWN article goes through some of the possible optimizations (vandalisms?) in this area: https://lwn.net/Articles/793253/ In the end, it is your code, so you get to decide how much you would like to keep track of what compilers get up to over time. ;-) Thanx, Paul
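For concreteness, here is the shape of the transformation Paul describes, using hypothetical single-bit flags rather than the real migratetype encoding (a sketch of what a compiler is permitted to emit, not what any particular compiler does today):

#include <stdbool.h>

#define FLAG_A  0x1     /* hypothetical: plays the role of MIGRATE_ISOLATE */
#define FLAG_B  0x2     /* hypothetical: plays the role of MIGRATE_CMA */

int shared_word;        /* another CPU changes this from FLAG_A to FLAG_B */

/* What the source says: one read, one mask test. */
bool check_as_written(void)
{
        return shared_word & (FLAG_A | FLAG_B);
}

/*
 * What the compiler may legally turn that into: two reads, two masks,
 * results combined.  If the concurrent FLAG_A -> FLAG_B store lands
 * between the two reads (read #1 sees FLAG_A, read #2 sees FLAG_B),
 * this returns false even though shared_word never held a value
 * outside the mask -- the "zero outcome" Paul mentions.
 */
bool check_as_transformed(void)
{
        bool saw_b = shared_word & FLAG_B;      /* read #1: FLAG_A & FLAG_B == 0 */
        bool saw_a = shared_word & FLAG_A;      /* read #2: FLAG_B & FLAG_A == 0 */

        return saw_a || saw_b;
}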
On 5/11/22 16:45, Paul E. McKenney wrote: >> >> Well no, because the "&" operation is a single operation on the CPU, and >> isn't going to get split up like that. > > Chiming in a bit late... Much appreciated! > > The usual way that this sort of thing causes trouble is if there is a > single store instruction that changes the value from MIGRATE_ISOLATE > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, Doing an AND twice for "x & constant" this definitely blows my mind. Is nothing sacred? :) > and then combine the results. This could give a zero outcome where the > underlying variable never had the value zero. > > Is this sort of thing low probability? > > Definitely. > > Isn't this sort of thing prohibited? > > Definitely not. > > So what you have will likely work for at least a while longer, but it > is not guaranteed and it forces you to think a lot harder about what > the current implementations of the compiler can and cannot do to you. > > The following LWN article goes through some of the possible optimizations > (vandalisms?) in this area: https://lwn.net/Articles/793253/ > hmm, I don't think we hit any of those cases, do we? Because here, the "write" side is via a non-inline function that I just don't believe the compiler is allowed to call twice. Or is it? Minchan's earlier summary: CPU 0 CPU1 set_pageblock_migratetype(MIGRATE_ISOLATE) if (get_pageblock_migrate(page) & MIGRATE_CMA) set_pageblock_migratetype(MIGRATE_CMA) if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) ...where set_pageblock_migratetype() is not inline. thanks,
On Wed, May 11, 2022 at 04:57:04PM -0700, John Hubbard wrote: > On 5/11/22 16:45, Paul E. McKenney wrote: > > > > > > Well no, because the "&" operation is a single operation on the CPU, and > > > isn't going to get split up like that. > > > > Chiming in a bit late... > > Much appreciated! > > > The usual way that this sort of thing causes trouble is if there is a > > single store instruction that changes the value from MIGRATE_ISOLATE > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, > > Doing an AND twice for "x & constant" this definitely blows my mind. Is > nothing sacred? :) Apparently there is not much sacred to compiler writers in search of additional optimizations. :-/ > > and then combine the results. This could give a zero outcome where the > > underlying variable never had the value zero. > > > > Is this sort of thing low probability? > > > > Definitely. > > > > Isn't this sort of thing prohibited? > > > > Definitely not. > > > > So what you have will likely work for at least a while longer, but it > > is not guaranteed and it forces you to think a lot harder about what > > the current implementations of the compiler can and cannot do to you. > > > > The following LWN article goes through some of the possible optimizations > > (vandalisms?) in this area: https://lwn.net/Articles/793253/ > > hmm, I don't think we hit any of those cases, do we? Because here, the > "write" side is via a non-inline function that I just don't believe the > compiler is allowed to call twice. Or is it? Not yet. But if link-time optimizations (LTO) continue their march, I wouldn't feel safe ruling it out... > Minchan's earlier summary: > > CPU 0 CPU1 > > > set_pageblock_migratetype(MIGRATE_ISOLATE) > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > set_pageblock_migratetype(MIGRATE_CMA) > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > ...where set_pageblock_migratetype() is not inline. ...especially if the code is reorganized for whatever reason. > thanks, > -- > John Hubbard > NVIDIA But again: > > In the end, it is your code, so you get to decide how much you would > > like to keep track of what compilers get up to over time. ;-) Thanx, Paul
On 5/11/22 16:57, John Hubbard wrote: > On 5/11/22 16:45, Paul E. McKenney wrote: >>> >>> Well no, because the "&" operation is a single operation on the CPU, and >>> isn't going to get split up like that. >> >> Chiming in a bit late... > > Much appreciated! > >> >> The usual way that this sort of thing causes trouble is if there is a >> single store instruction that changes the value from MIGRATE_ISOLATE >> to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, > > Doing an AND twice for "x & constant" this definitely blows my mind. Is > nothing sacred? :) > >> and then combine the results. This could give a zero outcome where the >> underlying variable never had the value zero. >> >> Is this sort of thing low probability? >> >> Definitely. >> >> Isn't this sort of thing prohibited? >> >> Definitely not. >> >> So what you have will likely work for at least a while longer, but it >> is not guaranteed and it forces you to think a lot harder about what >> the current implementations of the compiler can and cannot do to you. >> >> The following LWN article goes through some of the possible optimizations >> (vandalisms?) in this area: https://lwn.net/Articles/793253/ >> > > hmm, I don't think we hit any of those cases, do we? Because here, the > "write" side is via a non-inline function that I just don't believe the > compiler is allowed to call twice. Or is it? > > Minchan's earlier summary: > > CPU 0                        CPU1 > > >                              set_pageblock_migratetype(MIGRATE_ISOLATE) > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > >                              set_pageblock_migratetype(MIGRATE_CMA) > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > ...where set_pageblock_migratetype() is not inline. > > thanks, Let me try to say this more clearly: I don't think that the following __READ_ONCE() statement can actually help anything, given that get_pageblock_migratetype() is non-inlined: + int __mt = get_pageblock_migratetype(page); + int mt = __READ_ONCE(__mt); + + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) + return false; Am I missing anything here? thanks,
On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote: > On 5/11/22 16:57, John Hubbard wrote: > > On 5/11/22 16:45, Paul E. McKenney wrote: > > > > > > > > Well no, because the "&" operation is a single operation on the CPU, and > > > > isn't going to get split up like that. > > > > > > Chiming in a bit late... > > > > Much appreciated! > > > > > > > > The usual way that this sort of thing causes trouble is if there is a > > > single store instruction that changes the value from MIGRATE_ISOLATE > > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, > > > > Doing an AND twice for "x & constant" this definitely blows my mind. Is > > nothing sacred? :) > > > > > and then combine the results. This could give a zero outcome where the > > > underlying variable never had the value zero. > > > > > > Is this sort of thing low probability? > > > > > > Definitely. > > > > > > Isn't this sort of thing prohibited? > > > > > > Definitely not. > > > > > > So what you have will likely work for at least a while longer, but it > > > is not guaranteed and it forces you to think a lot harder about what > > > the current implementations of the compiler can and cannot do to you. > > > > > > The following LWN article goes through some of the possible optimizations > > > (vandalisms?) in this area: https://lwn.net/Articles/793253/ > > > > > > > hmm, I don't think we hit any of those cases, do we? Because here, the > > "write" side is via a non-inline function that I just don't believe the > > compiler is allowed to call twice. Or is it? > > > > Minchan's earlier summary: > > > > CPU 0                        CPU1 > > > > > >                              set_pageblock_migratetype(MIGRATE_ISOLATE) > > > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > >                              set_pageblock_migratetype(MIGRATE_CMA) > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > > > ...where set_pageblock_migratetype() is not inline. > > > > thanks, > > Let me try to say this more clearly: I don't think that the following > __READ_ONCE() statement can actually help anything, given that > get_pageblock_migratetype() is non-inlined: > > + int __mt = get_pageblock_migratetype(page); > + int mt = __READ_ONCE(__mt); > + > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > + return false; > > > Am I missing anything here? In the absence of future aggression from link-time optimizations (LTO), you are missing nothing. Thanx, Paul
On Wed, May 11, 2022 at 05:22:07PM -0700, Paul E. McKenney wrote: > On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote: > > On 5/11/22 16:57, John Hubbard wrote: > > > On 5/11/22 16:45, Paul E. McKenney wrote: > > > > > > > > > > Well no, because the "&" operation is a single operation on the CPU, and > > > > > isn't going to get split up like that. > > > > > > > > Chiming in a bit late... > > > > > > Much appreciated! > > > > > > > > > > > The usual way that this sort of thing causes trouble is if there is a > > > > single store instruction that changes the value from MIGRATE_ISOLATE > > > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, > > > > > > Doing an AND twice for "x & constant" this definitely blows my mind. Is > > > nothing sacred? :) > > > > > > > and then combine the results. This could give a zero outcome where the > > > > underlying variable never had the value zero. > > > > > > > > Is this sort of thing low probability? > > > > > > > > Definitely. > > > > > > > > Isn't this sort of thing prohibited? > > > > > > > > Definitely not. > > > > > > > > So what you have will likely work for at least a while longer, but it > > > > is not guaranteed and it forces you to think a lot harder about what > > > > the current implementations of the compiler can and cannot do to you. > > > > > > > > The following LWN article goes through some of the possible optimizations > > > > (vandalisms?) in this area: https://lwn.net/Articles/793253/ > > > > > > > > > > hmm, I don't think we hit any of those cases, do we? Because here, the > > > "write" side is via a non-inline function that I just don't believe the > > > compiler is allowed to call twice. Or is it? > > > > > > Minchan's earlier summary: > > > > > > CPU 0                        CPU1 > > > > > > > > >                              set_pageblock_migratetype(MIGRATE_ISOLATE) > > > > > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > > > >                              set_pageblock_migratetype(MIGRATE_CMA) > > > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > > > > > ...where set_pageblock_migratetype() is not inline. > > > > > > thanks, > > > > Let me try to say this more clearly: I don't think that the following > > __READ_ONCE() statement can actually help anything, given that > > get_pageblock_migratetype() is non-inlined: > > > > + int __mt = get_pageblock_migratetype(page); > > + int mt = __READ_ONCE(__mt); > > + > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > + return false; > > > > > > Am I missing anything here? > > In the absence of future aggression from link-time optimizations (LTO), > you are missing nothing. A thing I want to note is Android kernel uses LTO full mode.
On 5/11/22 17:26, Minchan Kim wrote: >>> Let me try to say this more clearly: I don't think that the following >>> __READ_ONCE() statement can actually help anything, given that >>> get_pageblock_migratetype() is non-inlined: >>> >>> + int __mt = get_pageblock_migratetype(page); >>> + int mt = __READ_ONCE(__mt); >>> + >>> + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) >>> + return false; >>> >>> >>> Am I missing anything here? >> >> In the absence of future aggression from link-time optimizations (LTO), >> you are missing nothing. > > A thing I want to note is Android kernel uses LTO full mode. Thanks Paul for explaining the state of things. Minchan, how about something like very close to your original draft, then, but with a little note, and the "&" as well: int __mt = get_pageblock_migratetype(page); /* * Defend against future compiler LTO features, or code refactoring * that inlines the above function, by forcing a single read. Because, this * routine races with set_pageblock_migratetype(), and we want to avoid * reading zero, when actually one or the other flags was set. */ int mt = __READ_ONCE(__mt); if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) return false; ...which should make everyone comfortable and protected from the future sins of the compiler and linker teams? :) thanks,
On Wed, May 11, 2022 at 05:26:55PM -0700, Minchan Kim wrote: > On Wed, May 11, 2022 at 05:22:07PM -0700, Paul E. McKenney wrote: > > On Wed, May 11, 2022 at 05:12:32PM -0700, John Hubbard wrote: > > > On 5/11/22 16:57, John Hubbard wrote: > > > > On 5/11/22 16:45, Paul E. McKenney wrote: > > > > > > > > > > > > Well no, because the "&" operation is a single operation on the CPU, and > > > > > > isn't going to get split up like that. > > > > > > > > > > Chiming in a bit late... > > > > > > > > Much appreciated! > > > > > > > > > > > > > > The usual way that this sort of thing causes trouble is if there is a > > > > > single store instruction that changes the value from MIGRATE_ISOLATE > > > > > to MIGRATE_CMA, and if the compiler decides to fetch twice, AND twice, > > > > > > > > Doing an AND twice for "x & constant" this definitely blows my mind. Is > > > > nothing sacred? :) > > > > > > > > > and then combine the results. This could give a zero outcome where the > > > > > underlying variable never had the value zero. > > > > > > > > > > Is this sort of thing low probability? > > > > > > > > > > Definitely. > > > > > > > > > > Isn't this sort of thing prohibited? > > > > > > > > > > Definitely not. > > > > > > > > > > So what you have will likely work for at least a while longer, but it > > > > > is not guaranteed and it forces you to think a lot harder about what > > > > > the current implementations of the compiler can and cannot do to you. > > > > > > > > > > The following LWN article goes through some of the possible optimizations > > > > > (vandalisms?) in this area: https://lwn.net/Articles/793253/ > > > > > > > > > > > > > hmm, I don't think we hit any of those cases, do we? Because here, the > > > > "write" side is via a non-inline function that I just don't believe the > > > > compiler is allowed to call twice. Or is it? > > > > > > > > Minchan's earlier summary: > > > > > > > > CPU 0                        CPU1 > > > > > > > > > > > >                              set_pageblock_migratetype(MIGRATE_ISOLATE) > > > > > > > > if (get_pageblock_migrate(page) & MIGRATE_CMA) > > > > > > > >                              set_pageblock_migratetype(MIGRATE_CMA) > > > > > > > > if (get_pageblock_migrate(page) & MIGRATE_ISOLATE) > > > > > > > > ...where set_pageblock_migratetype() is not inline. > > > > > > > > thanks, > > > > > > Let me try to say this more clearly: I don't think that the following > > > __READ_ONCE() statement can actually help anything, given that > > > get_pageblock_migratetype() is non-inlined: > > > > > > + int __mt = get_pageblock_migratetype(page); > > > + int mt = __READ_ONCE(__mt); > > > + > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > > + return false; > > > > > > > > > Am I missing anything here? > > > > In the absence of future aggression from link-time optimizations (LTO), > > you are missing nothing. > > A thing I want to note is Android kernel uses LTO full mode. I doubt that current LTO can do this sort of optimized inlining, at least, not yet. Thanx, Paul
On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote: > On 5/11/22 17:26, Minchan Kim wrote: > > > > Let me try to say this more clearly: I don't think that the following > > > > __READ_ONCE() statement can actually help anything, given that > > > > get_pageblock_migratetype() is non-inlined: > > > > > > > > + int __mt = get_pageblock_migratetype(page); > > > > + int mt = __READ_ONCE(__mt); > > > > + > > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > > > + return false; > > > > > > > > > > > > Am I missing anything here? > > > > > > In the absence of future aggression from link-time optimizations (LTO), > > > you are missing nothing. > > > > A thing I want to note is Android kernel uses LTO full mode. > > Thanks Paul for explaining the state of things. > > Minchan, how about something like very close to your original draft, > then, but with a little note, and the "&" as well: > > int __mt = get_pageblock_migratetype(page); > > /* > * Defend against future compiler LTO features, or code refactoring > * that inlines the above function, by forcing a single read. Because, this > * routine races with set_pageblock_migratetype(), and we want to avoid > * reading zero, when actually one or the other flags was set. > */ > int mt = __READ_ONCE(__mt); > > if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > return false; > > > ...which should make everyone comfortable and protected from the > future sins of the compiler and linker teams? :) This would work, but it would force a store to the stack and an immediate reload. Which might be OK on this code path. But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask() would likely generate the same code that is produced today. word = READ_ONCE(bitmap[word_bitidx]); But I could easily have missed a turn in that cascade of functions. ;-) Or there might be some code path that really hates a READ_ONCE() in that place. Thanx, Paul
On 5/11/22 17:49, Paul E. McKenney wrote: >> Thanks Paul for explaining the state of things. >> >> Minchan, how about something like very close to your original draft, >> then, but with a little note, and the "&" as well: >> >> int __mt = get_pageblock_migratetype(page); >> >> /* >> * Defend against future compiler LTO features, or code refactoring >> * that inlines the above function, by forcing a single read. Because, this >> * routine races with set_pageblock_migratetype(), and we want to avoid >> * reading zero, when actually one or the other flags was set. >> */ >> int mt = __READ_ONCE(__mt); >> >> if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) >> return false; >> >> >> ...which should make everyone comfortable and protected from the >> future sins of the compiler and linker teams? :) > > This would work, but it would force a store to the stack and an immediate > reload. Which might be OK on this code path. > > But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask() > would likely generate the same code that is produced today. > > word = READ_ONCE(bitmap[word_bitidx]); Ah right, I like that much, much better. The READ_ONCE is placed where it actually clearly matters, rather than several layers removed. > > But I could easily have missed a turn in that cascade of functions. ;-) > > Or there might be some code path that really hates a READ_ONCE() in > that place. I certainly hope not. I see free_one_page(), among other things, calls this. But having the correct READ_ONCE() in a central place seems worth it, unless we know that this will cause a measurable slowdown somewhere. thanks,
On Wed, May 11, 2022 at 05:49:49PM -0700, Paul E. McKenney wrote: > On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote: > > On 5/11/22 17:26, Minchan Kim wrote: > > > > > Let me try to say this more clearly: I don't think that the following > > > > > __READ_ONCE() statement can actually help anything, given that > > > > > get_pageblock_migratetype() is non-inlined: > > > > > > > > > > + int __mt = get_pageblock_migratetype(page); > > > > > + int mt = __READ_ONCE(__mt); > > > > > + > > > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > > > > + return false; > > > > > > > > > > > > > > > Am I missing anything here? > > > > > > > > In the absence of future aggression from link-time optimizations (LTO), > > > > you are missing nothing. > > > > > > A thing I want to note is Android kernel uses LTO full mode. > > > > Thanks Paul for explaining the state of things. > > > > Minchan, how about something like very close to your original draft, > > then, but with a little note, and the "&" as well: > > > > int __mt = get_pageblock_migratetype(page); > > > > /* > > * Defend against future compiler LTO features, or code refactoring > > * that inlines the above function, by forcing a single read. Because, this > > * routine races with set_pageblock_migratetype(), and we want to avoid > > * reading zero, when actually one or the other flags was set. > > */ > > int mt = __READ_ONCE(__mt); > > > > if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > return false; > > > > > > ...which should make everyone comfortable and protected from the > > future sins of the compiler and linker teams? :) > > This would work, but it would force a store to the stack and an immediate > reload. Which might be OK on this code path. > > But using READ_ONCE() in (I think?) __get_pfnblock_flags_mask() > would likely generate the same code that is produced today. > > word = READ_ONCE(bitmap[word_bitidx]); > > But I could easily have missed a turn in that cascade of functions. ;-) > > Or there might be some code path that really hates a READ_ONCE() in > that place. My worry about chaning __get_pfnblock_flags_mask is it's called multiple hot places in mm codes so I didn't want to add overhead to them.
On Wed, May 11, 2022 at 05:34:52PM -0700, John Hubbard wrote: > On 5/11/22 17:26, Minchan Kim wrote: > > > > Let me try to say this more clearly: I don't think that the following > > > > __READ_ONCE() statement can actually help anything, given that > > > > get_pageblock_migratetype() is non-inlined: > > > > > > > > + int __mt = get_pageblock_migratetype(page); > > > > + int mt = __READ_ONCE(__mt); > > > > + > > > > + if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > > > > + return false; > > > > > > > > > > > > Am I missing anything here? > > > > > > In the absence of future aggression from link-time optimizations (LTO), > > > you are missing nothing. > > > > A thing I want to note is Android kernel uses LTO full mode. > > Thanks Paul for explaining the state of things. > > Minchan, how about something like very close to your original draft, > then, but with a little note, and the "&" as well: > > int __mt = get_pageblock_migratetype(page); > > /* > * Defend against future compiler LTO features, or code refactoring > * that inlines the above function, by forcing a single read. Because, this > * routine races with set_pageblock_migratetype(), and we want to avoid > * reading zero, when actually one or the other flags was set. > */ > int mt = __READ_ONCE(__mt); > > if (mt & (MIGRATE_CMA | MIGRATE_ISOLATE)) > return false; > > > ...which should make everyone comfortable and protected from the > future sins of the compiler and linker teams? :) Will take. Thanks.
On 5/11/22 18:03, Minchan Kim wrote: >> >> Or there might be some code path that really hates a READ_ONCE() in >> that place. > > My worry about chaning __get_pfnblock_flags_mask is it's called > multiple hot places in mm codes so I didn't want to add overhead > to them. ...unless it really does generate the same code as is already there, right? Let me check that real quick. thanks,
On 5/11/22 18:08, John Hubbard wrote: > On 5/11/22 18:03, Minchan Kim wrote: >>> >>> Or there might be some code path that really hates a READ_ONCE() in >>> that place. >> >> My worry about chaning __get_pfnblock_flags_mask is it's called >> multiple hot places in mm codes so I didn't want to add overhead >> to them. > > ...unless it really does generate the same code as is already there, > right? Let me check that real quick. > It does change the generated code slightly. I don't know if this will affect performance here or not. But just for completeness, here you go: free_one_page() originally has this (just showing the changed parts): mov 0x8(%rdx,%rax,8),%rbx and $0x3f,%ecx shr %cl,%rbx and $0x7, And after applying this diff: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e42038382c1..df1f8e9a294f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, word_bitidx = bitidx / BITS_PER_LONG; bitidx &= (BITS_PER_LONG-1); - word = bitmap[word_bitidx]; + word = READ_ONCE(bitmap[word_bitidx]); return (word >> bitidx) & mask; } ...it now does an extra memory dereference: lea 0x8(%rdx,%rax,8),%rax and $0x3f,%ecx mov (%rax),%rbx shr %cl,%rbx and $0x7,%ebx thanks,
On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote: > On 5/11/22 18:08, John Hubbard wrote: > > On 5/11/22 18:03, Minchan Kim wrote: > > > > > > > > Or there might be some code path that really hates a READ_ONCE() in > > > > that place. > > > > > > My worry about chaning __get_pfnblock_flags_mask is it's called > > > multiple hot places in mm codes so I didn't want to add overhead > > > to them. > > > > ...unless it really does generate the same code as is already there, > > right? Let me check that real quick. > > > > It does change the generated code slightly. I don't know if this will > affect performance here or not. But just for completeness, here you go: > > free_one_page() originally has this (just showing the changed parts): > > mov 0x8(%rdx,%rax,8),%rbx > and $0x3f,%ecx > shr %cl,%rbx > and $0x7, > > > And after applying this diff: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e42038382c1..df1f8e9a294f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct > page *page, > word_bitidx = bitidx / BITS_PER_LONG; > bitidx &= (BITS_PER_LONG-1); > > - word = bitmap[word_bitidx]; > + word = READ_ONCE(bitmap[word_bitidx]); > return (word >> bitidx) & mask; > } > > > ...it now does an extra memory dereference: > > lea 0x8(%rdx,%rax,8),%rax > and $0x3f,%ecx > mov (%rax),%rbx > shr %cl,%rbx > and $0x7,%ebx Thanks for checking, John. I don't want to have the READ_ONCE in __get_pfnblock_flags_mask at the moment, even though it's only an extra memory dereference for a specific architecture and a specific compiler, unless other callsites *do* need it. We made that choice (i.e., having __READ_ONCE in is_pinnable_page) for *potential* cases (e.g., aggressive LTO in the future), and what is an extra memory dereference now could *potentially* become multiple instructions with further changes or a different compiler and/or arch, now or later. It's out of scope for this patch.
On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote: > On 5/11/22 18:08, John Hubbard wrote: > > On 5/11/22 18:03, Minchan Kim wrote: > > > > > > > > Or there might be some code path that really hates a READ_ONCE() in > > > > that place. > > > > > > My worry about chaning __get_pfnblock_flags_mask is it's called > > > multiple hot places in mm codes so I didn't want to add overhead > > > to them. > > > > ...unless it really does generate the same code as is already there, > > right? Let me check that real quick. > > > > It does change the generated code slightly. I don't know if this will > affect performance here or not. But just for completeness, here you go: > > free_one_page() originally has this (just showing the changed parts): > > mov 0x8(%rdx,%rax,8),%rbx > and $0x3f,%ecx > shr %cl,%rbx > and $0x7, > > > And after applying this diff: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e42038382c1..df1f8e9a294f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct > page *page, > word_bitidx = bitidx / BITS_PER_LONG; > bitidx &= (BITS_PER_LONG-1); > > - word = bitmap[word_bitidx]; > + word = READ_ONCE(bitmap[word_bitidx]); > return (word >> bitidx) & mask; > } > > > ...it now does an extra memory dereference: > > lea 0x8(%rdx,%rax,8),%rax > and $0x3f,%ecx > mov (%rax),%rbx > shr %cl,%rbx > and $0x7,%ebx That could indeed be a bad thing on a fastpath. Thanx, Paul
On 5/11/22 20:44, Minchan Kim wrote: > Thanks for checking, John. > > I don't want to have the READ_ONCE in __get_pfnblock_flags_mask > atm even though it's an extra memory dereference for specific > architecutre and specific compiler unless other callsites *do* > need it. > > We choose the choice(i.e., having __READ_ONCE in is_pinanble_ > page) for *potential* cases(e.g., aggressive LTO for future) > and if it's an extra memory dereference atm, it would be multiple > instructions *potentailly* as having more change or different > compiler and/or arches now or later. It's out of scope in > this patch. Agreed! thanks,
On Wed, May 11, 2022 at 08:44:43PM -0700, Minchan Kim wrote: > On Wed, May 11, 2022 at 07:18:56PM -0700, John Hubbard wrote: > > On 5/11/22 18:08, John Hubbard wrote: > > > On 5/11/22 18:03, Minchan Kim wrote: > > > > > > > > > > Or there might be some code path that really hates a READ_ONCE() in > > > > > that place. > > > > > > > > My worry about chaning __get_pfnblock_flags_mask is it's called > > > > multiple hot places in mm codes so I didn't want to add overhead > > > > to them. > > > > > > ...unless it really does generate the same code as is already there, > > > right? Let me check that real quick. > > > > > > > It does change the generated code slightly. I don't know if this will > > affect performance here or not. But just for completeness, here you go: > > > > free_one_page() originally has this (just showing the changed parts): > > > > mov 0x8(%rdx,%rax,8),%rbx > > and $0x3f,%ecx > > shr %cl,%rbx > > and $0x7, > > > > > > And after applying this diff: > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 0e42038382c1..df1f8e9a294f 100644 > > +++ b/mm/page_alloc.c > > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct > > page *page, > > word_bitidx = bitidx / BITS_PER_LONG; > > bitidx &= (BITS_PER_LONG-1); > > > > - word = bitmap[word_bitidx]; > > + word = READ_ONCE(bitmap[word_bitidx]); > > return (word >> bitidx) & mask; > > } > > > > > > ...it now does an extra memory dereference: > > > > lea 0x8(%rdx,%rax,8),%rax > > and $0x3f,%ecx > > mov (%rax),%rbx > > shr %cl,%rbx > > and $0x7,%ebx Where is the extra memory reference? 'lea' is not a memory reference, it is just some maths? > Thanks for checking, John. > > I don't want to have the READ_ONCE in __get_pfnblock_flags_mask > atm even though it's an extra memory dereference for specific > architecutre and specific compiler unless other callsites *do* > need it. If a callpath is called under locking or not under locking then I would expect to have two call chains clearly marked what their locking conditions are. ie __get_pfn_block_flags_mask_unlocked() - and obviously clearly document and check what the locking requirements are of the locked path. IMHO putting a READ_ONCE on something that is not a memory load from shared data is nonsense - if a simple == has a stability risk then so does the '(word >> bitidx) & mask'. Jason
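For readers following along, a minimal sketch of the locked/unlocked split Jason describes might look something like the code below. This is only an illustration layered on top of the existing __get_pfnblock_flags_mask(): the function names, the lockdep assertion, and the helper calls are assumptions written from memory of mm/page_alloc.c, not code from this series.

static unsigned long __get_pfnblock_flags_mask_locked(const struct page *page,
						      unsigned long pfn,
						      unsigned long mask)
{
	unsigned long *bitmap = get_pageblock_bitmap(page, pfn);
	unsigned long bitidx = pfn_to_bitidx(page, pfn);
	unsigned long word_bitidx = bitidx / BITS_PER_LONG;

	/* Assumed contract: callers on this path hold zone->lock. */
	lockdep_assert_held(&page_zone(page)->lock);
	bitidx &= (BITS_PER_LONG - 1);
	return (bitmap[word_bitidx] >> bitidx) & mask;
}

static unsigned long __get_pfnblock_flags_mask_unlocked(const struct page *page,
							unsigned long pfn,
							unsigned long mask)
{
	unsigned long *bitmap = get_pageblock_bitmap(page, pfn);
	unsigned long bitidx = pfn_to_bitidx(page, pfn);
	unsigned long word_bitidx = bitidx / BITS_PER_LONG;
	unsigned long word;

	bitidx &= (BITS_PER_LONG - 1);
	/* Lockless caller (e.g. GUP): sample the shared word exactly once. */
	word = READ_ONCE(bitmap[word_bitidx]);
	return (word >> bitidx) & mask;
}

Splitting the names this way would let the hot, zone->lock-protected callers keep the plain load while documenting exactly which caller relies on the lockless read.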
On 5/17/22 07:00, Jason Gunthorpe wrote: >>> It does change the generated code slightly. I don't know if this will >>> affect performance here or not. But just for completeness, here you go: >>> >>> free_one_page() originally has this (just showing the changed parts): >>> >>> mov 0x8(%rdx,%rax,8),%rbx >>> and $0x3f,%ecx >>> shr %cl,%rbx >>> and $0x7, >>> >>> >>> And after applying this diff: >>> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index 0e42038382c1..df1f8e9a294f 100644 >>> +++ b/mm/page_alloc.c >>> @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct >>> page *page, >>> word_bitidx = bitidx / BITS_PER_LONG; >>> bitidx &= (BITS_PER_LONG-1); >>> >>> - word = bitmap[word_bitidx]; >>> + word = READ_ONCE(bitmap[word_bitidx]); >>> return (word >> bitidx) & mask; >>> } >>> >>> >>> ...it now does an extra memory dereference: >>> >>> lea 0x8(%rdx,%rax,8),%rax >>> and $0x3f,%ecx >>> mov (%rax),%rbx >>> shr %cl,%rbx >>> and $0x7,%ebx > > Where is the extra memory reference? 'lea' is not a memory reference, > it is just some maths? If you compare this to the snippet above, you'll see that there is an extra mov statement, and that one dereferences a pointer from %rax: mov (%rax),%rbx > >> Thanks for checking, John. >> >> I don't want to have the READ_ONCE in __get_pfnblock_flags_mask >> atm even though it's an extra memory dereference for specific >> architecutre and specific compiler unless other callsites *do* >> need it. > > If a callpath is called under locking or not under locking then I > would expect to have two call chains clearly marked what their locking > conditions are. ie __get_pfn_block_flags_mask_unlocked() - and __get_pfn_block_flags_mask_unlocked() would definitely clarify things, and allow some clear documentation, good idea. I haven't checked to see if some code could keep using the normal __get_pfn_block_flags_mask(), but if it could, that would help with the problem of keeping the fast path fast. > obviously clearly document and check what the locking requirements are > of the locked path. > > IMHO putting a READ_ONCE on something that is not a memory load from > shared data is nonsense - if a simple == has a stability risk then so > does the '(word >> bitidx) & mask'. > > Jason Doing something like this: int __x = y(); int x = READ_ONCE(__x); is just awful! I agree. Really, y() should handle any barriers, because otherwise it really does look pointless, and people reading the code need something that is clearer. My first reaction was that this was pointless and wrong, and it turns out that that's only about 80% true: as long as LTO-of-the-future doesn't arrive, and as long as no one refactors y() to be inline. thanks,
On Tue, May 17, 2022 at 11:12:37AM -0700, John Hubbard wrote: > On 5/17/22 07:00, Jason Gunthorpe wrote: > > > > It does change the generated code slightly. I don't know if this will > > > > affect performance here or not. But just for completeness, here you go: > > > > > > > > free_one_page() originally has this (just showing the changed parts): > > > > > > > > mov 0x8(%rdx,%rax,8),%rbx > > > > and $0x3f,%ecx > > > > shr %cl,%rbx > > > > and $0x7, > > > > > > > > > > > > And after applying this diff: > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > index 0e42038382c1..df1f8e9a294f 100644 > > > > +++ b/mm/page_alloc.c > > > > @@ -482,7 +482,7 @@ unsigned long __get_pfnblock_flags_mask(const struct > > > > page *page, > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > - word = bitmap[word_bitidx]; > > > > + word = READ_ONCE(bitmap[word_bitidx]); > > > > return (word >> bitidx) & mask; > > > > } > > > > > > > > > > > > ...it now does an extra memory dereference: > > > > > > > > lea 0x8(%rdx,%rax,8),%rax > > > > and $0x3f,%ecx > > > > mov (%rax),%rbx > > > > shr %cl,%rbx > > > > and $0x7,%ebx > > > > Where is the extra memory reference? 'lea' is not a memory reference, > > it is just some maths? > > If you compare this to the snippet above, you'll see that there is > an extra mov statement, and that one dereferences a pointer from > %rax: > > mov (%rax),%rbx That is the same move as: mov 0x8(%rdx,%rax,8),%rbx Except that the EA calculation was done in advance and stored in rax. lea isn't a memory reference, it is just computing the pointer value that 0x8(%rdx,%rax,8) represents. ie the lea computes %rax = %rdx + %rax*8 + 8 Which is then fed into the mov. Maybe it is an optimization to allow one pipe to do the shr and another to do the EA - IDK, seems like a random thing for the compiler to do. > > IMHO putting a READ_ONCE on something that is not a memory load from > > shared data is nonsense - if a simple == has a stability risk then so > > does the '(word >> bitidx) & mask'. > > Doing something like this: > > int __x = y(); > int x = READ_ONCE(__x); > > is just awful! I agree. Really, y() should handle any barriers, because > otherwise it really does look pointless, and people reading the code > need something that is clearer. My first reaction was that this was > pointless and wrong, and it turns out that that's only about 80% true: > as long as LTO-of-the-future doesn't arrive, and as long as no one > refactors y() to be inline. It is always pointless, because look at the whole flow: y(): word = bitmap[word_bitidx]; return (word >> bitidx) & mask; int __x = y() int x = READ_ONCE(__x) The requirement here is for a coherent read of bitmap[word_bitidx] that is not mangled by multi-load, load tearing or other theoretical compiler problems. Stated more clearly, if bitmap[] has values {A,B,C} then the result of all this in x should always be {A',B',C'} according to the given maths, never, ever an impossible value D. The nature of the compiler problems we are trying to prevent is that the compiler may take a computation that seems deterministic and corrupt it somehow by making an assumption that the memory is stable when it is not. For instance == may not work right if the compiler decides to do: if (READ_ONCE(a) == 1 && READ_ONCE(a) == 1) This also means that >> and & could malfunction.
So when LTO expands and inlines everything and you get this: if (READ_ONCE((bitmap[word_bitidx] >> bitidx) & mask) == MIGRATE_CMA) It is still not safe because the >> and & could have multi-loaded or torn the read in some way that it is no longer a sane calculation. (for instance imagine that the compiler decides to read the value byte-at-a-time without READ_ONCE) ie the memory may have had A racing into C, but x was computed into D, not A'. Paul can correct me, but I understand we do not have a list of allowed operations that are exempted from the READ_ONCE() requirement. ie it is not just conditional branching that requires READ_ONCE(). This is why READ_ONCE() must always be on the memory load, because the point is to sanitize away the uncertainty that comes with an unlocked read of unstable memory contents. READ_ONCE() samples the value in memory, and removes all tearing, multiload, etc "instability" that may affect downstream computations. In this way downstream computations become reliable. Jason
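To make the placement point concrete, here is a small sketch (not literal kernel code; the parameters just mirror the names in the page_alloc.c snippet quoted above) contrasting the pattern being objected to with the one being recommended:

/* READ_ONCE() on a private local: the shared read itself stays naked,
 * so nothing stops the compiler from tearing or re-fetching it. */
static int migratetype_laundered(unsigned long *bitmap, unsigned long word_bitidx,
				 unsigned long bitidx, unsigned long mask)
{
	int __mt = (bitmap[word_bitidx] >> bitidx) & mask;

	return READ_ONCE(__mt);
}

/* READ_ONCE() on the memory load: one stable sample feeds all the maths. */
static int migratetype_sampled(unsigned long *bitmap, unsigned long word_bitidx,
			       unsigned long bitidx, unsigned long mask)
{
	unsigned long word = READ_ONCE(bitmap[word_bitidx]);

	return (word >> bitidx) & mask;
}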
On 5/17/22 12:28, Jason Gunthorpe wrote: >> If you compare this to the snippet above, you'll see that there is >> an extra mov statement, and that one dereferences a pointer from >> %rax: >> >> mov (%rax),%rbx > > That is the same move as: > > mov 0x8(%rdx,%rax,8),%rbx > > Except that the EA calculation was done in advance and stored in rax. > > lea isn't a memory reference, it is just computing the pointer value > that 0x8(%rdx,%rax,8) represents. ie the lea computes > > %rax = %rdx + %rax*8 + 6 > > Which is then fed into the mov. Maybe it is an optimization to allow > one pipe to do the shr and an other to the EA - IDK, seems like a > random thing for the compiler to do. Apologies for getting that wrong, and thanks for walking me through the asm. [...] > > Paul can correct me, but I understand we do not have a list of allowed > operations that are exempted from the READ_ONCE() requirement. ie it > is not just conditional branching that requires READ_ONCE(). > > This is why READ_ONCE() must always be on the memory load, because the > point is to sanitize away the uncertainty that comes with an unlocked > read of unstable memory contents. READ_ONCE() samples the value in > memory, and removes all tearing, multiload, etc "instability" that may > effect down stream computations. In this way down stream compulations > become reliable. > > Jason So then: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e42038382c1..b404f87e2682 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, word_bitidx = bitidx / BITS_PER_LONG; bitidx &= (BITS_PER_LONG-1); - word = bitmap[word_bitidx]; + /* + * This races, without locks, with set_pageblock_migratetype(). Ensure + * a consistent (non-tearing) read of the memory array, so that results, + * even though racy, are not corrupted. + */ + word = READ_ONCE(bitmap[word_bitidx]); return (word >> bitidx) & mask; } thanks,
On Tue, May 17, 2022 at 01:12:02PM -0700, John Hubbard wrote: > On 5/17/22 12:28, Jason Gunthorpe wrote: > > > If you compare this to the snippet above, you'll see that there is > > > an extra mov statement, and that one dereferences a pointer from > > > %rax: > > > > > > mov (%rax),%rbx > > > > That is the same move as: > > > > mov 0x8(%rdx,%rax,8),%rbx > > > > Except that the EA calculation was done in advance and stored in rax. > > > > lea isn't a memory reference, it is just computing the pointer value > > that 0x8(%rdx,%rax,8) represents. ie the lea computes > > > > %rax = %rdx + %rax*8 + 6 > > > > Which is then fed into the mov. Maybe it is an optimization to allow > > one pipe to do the shr and an other to the EA - IDK, seems like a > > random thing for the compiler to do. Maybe an optimization suppressed due to the volatile nature of the load? If so, perhaps it might be considered a compiler bug. Though it is quite difficult to get optimization bugs involving volatile to be taken seriously. > Apologies for getting that wrong, and thanks for walking me through the > asm. > > [...] > > > > Paul can correct me, but I understand we do not have a list of allowed > > operations that are exempted from the READ_ONCE() requirement. ie it > > is not just conditional branching that requires READ_ONCE(). > > > > This is why READ_ONCE() must always be on the memory load, because the > > point is to sanitize away the uncertainty that comes with an unlocked > > read of unstable memory contents. READ_ONCE() samples the value in > > memory, and removes all tearing, multiload, etc "instability" that may > > effect down stream computations. In this way down stream compulations > > become reliable. > > > > Jason > > So then: Works for me! Acked-by: Paul E. McKenney <paulmck@kernel.org> Thanx, Paul > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e42038382c1..b404f87e2682 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > word_bitidx = bitidx / BITS_PER_LONG; > bitidx &= (BITS_PER_LONG-1); > > - word = bitmap[word_bitidx]; > + /* > + * This races, without locks, with set_pageblock_migratetype(). Ensure > + * a consistent (non-tearing) read of the memory array, so that results, > + * even though racy, are not corrupted. > + */ > + word = READ_ONCE(bitmap[word_bitidx]); > return (word >> bitidx) & mask; > } > > > thanks, > -- > John Hubbard > NVIDIA
On Tue, May 17, 2022 at 01:12:02PM -0700, John Hubbard wrote: > On 5/17/22 12:28, Jason Gunthorpe wrote: > > > If you compare this to the snippet above, you'll see that there is > > > an extra mov statement, and that one dereferences a pointer from > > > %rax: > > > > > > mov (%rax),%rbx > > > > That is the same move as: > > > > mov 0x8(%rdx,%rax,8),%rbx > > > > Except that the EA calculation was done in advance and stored in rax. > > > > lea isn't a memory reference, it is just computing the pointer value > > that 0x8(%rdx,%rax,8) represents. ie the lea computes > > > > %rax = %rdx + %rax*8 + 6 > > > > Which is then fed into the mov. Maybe it is an optimization to allow > > one pipe to do the shr and an other to the EA - IDK, seems like a > > random thing for the compiler to do. > > Apologies for getting that wrong, and thanks for walking me through the > asm. > > [...] > > > > Paul can correct me, but I understand we do not have a list of allowed > > operations that are exempted from the READ_ONCE() requirement. ie it > > is not just conditional branching that requires READ_ONCE(). > > > > This is why READ_ONCE() must always be on the memory load, because the > > point is to sanitize away the uncertainty that comes with an unlocked > > read of unstable memory contents. READ_ONCE() samples the value in > > memory, and removes all tearing, multiload, etc "instability" that may > > effect down stream computations. In this way down stream compulations > > become reliable. > > > > Jason > > So then: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 0e42038382c1..b404f87e2682 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > word_bitidx = bitidx / BITS_PER_LONG; > bitidx &= (BITS_PER_LONG-1); > > - word = bitmap[word_bitidx]; > + /* > + * This races, without locks, with set_pageblock_migratetype(). Ensure set_pfnblock_flags_mask would be better? > + * a consistent (non-tearing) read of the memory array, so that results, Thanks for proceeding and suggestion, John. IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring to make the code inline in *future* could potentially cause forcing refetching(i.e., re-read) tie bitmap[word_bitidx]. If so, shouldn't the comment be the one you helped before? /* * Defend against future compiler LTO features, or code refactoring * that inlines the above function, by forcing a single read. Because, * re-reads of bitmap[word_bitidx] by inlining could cause trouble * for whom believe they use a local variable for the value. */ [1] e58469bafd05, mm: page_alloc: use word-based accesses for get/set pageblock bitmaps > + * even though racy, are not corrupted. > + */ > + word = READ_ONCE(bitmap[word_bitidx]); > return (word >> bitidx) & mask; > } > > > thanks, > -- > John Hubbard > NVIDIA
On 5/23/22 09:33, Minchan Kim wrote: ... >> So then: >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 0e42038382c1..b404f87e2682 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, >> word_bitidx = bitidx / BITS_PER_LONG; >> bitidx &= (BITS_PER_LONG-1); >> >> - word = bitmap[word_bitidx]; >> + /* >> + * This races, without locks, with set_pageblock_migratetype(). Ensure > > set_pfnblock_flags_mask would be better? > >> + * a consistent (non-tearing) read of the memory array, so that results, > > Thanks for proceeding and suggestion, John. > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. Did it? [1] fixed something, but I'm not sure we can claim that that code is now safe against tearing in all possible cases, especially given the recent discussion here. Specifically, having this code do a read, then follow that up with calculations, seems correct. Anything else is sketchy... > > The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring > to make the code inline in *future* could potentially cause forcing refetching(i.e., > re-read) tie bitmap[word_bitidx]. > > If so, shouldn't the comment be the one you helped before? Well, maybe updated to something like this? /* * This races, without locks, with set_pageblock_migratetype(). Ensure * a consistent (non-tearing) read of the memory array, so that results, * even though racy, are not corrupted--even if this function is * refactored and/or inlined. */ thanks,
On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > On 5/23/22 09:33, Minchan Kim wrote: > ... > > > So then: > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 0e42038382c1..b404f87e2682 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > word_bitidx = bitidx / BITS_PER_LONG; > > > bitidx &= (BITS_PER_LONG-1); > > > > > > - word = bitmap[word_bitidx]; > > > + /* > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > set_pfnblock_flags_mask would be better? > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > Thanks for proceeding and suggestion, John. > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > Did it? [1] fixed something, but I'm not sure we can claim that that > code is now safe against tearing in all possible cases, especially given > the recent discussion here. Specifically, having this code do a read, > then follow that up with calculations, seems correct. Anything else is The load tearing you are trying to explain in the comment would be solved by [1] since the bits will always align on a word and accessing word size based on word aligned address is always atomic so there is no load tearing problem IIUC. Instead of the tearing problem, what we are trying to solve with READ_ONCE is to prevent refetching when the function would be inlined in the future. > sketchy... > > > > > The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring > > to make the code inline in *future* could potentially cause forcing refetching(i.e., > > re-read) tie bitmap[word_bitidx]. > > > > If so, shouldn't the comment be the one you helped before? > > Well, maybe updated to something like this? > > /* > * This races, without locks, with set_pageblock_migratetype(). Ensure set_pageblock_migratetype is more upper level function so it would be better fit to say set_pfnblock_flags_mask. > * a consistent (non-tearing) read of the memory array, so that results, So tearing problem should't already happen by [1] so I am trying to explain refetching(or re-read) problem in the comment. > * even though racy, are not corrupted--even if this function is The value is already atomic so I don't think it could be corrupted even though it would be inlined in the future. Please correct me if I miss something. > * refactored and/or inlined. > */
On 5/23/22 10:16 PM, Minchan Kim wrote: > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: >> On 5/23/22 09:33, Minchan Kim wrote: >> ... >>>> So then: >>>> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index 0e42038382c1..b404f87e2682 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, >>>> word_bitidx = bitidx / BITS_PER_LONG; >>>> bitidx &= (BITS_PER_LONG-1); >>>> >>>> - word = bitmap[word_bitidx]; >>>> + /* >>>> + * This races, without locks, with set_pageblock_migratetype(). Ensure >>> set_pfnblock_flags_mask would be better? >>>> + * a consistent (non-tearing) read of the memory array, so that results, >>> >>> Thanks for proceeding and suggestion, John. >>> >>> IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. >> >> Did it? [1] fixed something, but I'm not sure we can claim that that >> code is now safe against tearing in all possible cases, especially given >> the recent discussion here. Specifically, having this code do a read, >> then follow that up with calculations, seems correct. Anything else is > > The load tearing you are trying to explain in the comment would be > solved by [1] since the bits will always align on a word and accessing > word size based on word aligned address is always atomic so there is > no load tearing problem IIUC. > > Instead of the tearing problem, what we are trying to solve with > READ_ONCE is to prevent refetching when the function would be > inlined in the future. > I'm perhaps using "tearing" as too broad of a term, maybe just removing the "(non-tearing)" part would fix up the comment. >> sketchy... >> >>> >>> The concern in our dicussion was aggressive compiler(e.g., LTO) or code refactoring >>> to make the code inline in *future* could potentially cause forcing refetching(i.e., >>> re-read) tie bitmap[word_bitidx]. >>> >>> If so, shouldn't the comment be the one you helped before? >> >> Well, maybe updated to something like this? >> >> /* >> * This races, without locks, with set_pageblock_migratetype(). Ensure > > set_pageblock_migratetype is more upper level function so it would > be better fit to say set_pfnblock_flags_mask. OK > >> * a consistent (non-tearing) read of the memory array, so that results, > > So tearing problem should't already happen by [1] so I am trying to > explain refetching(or re-read) problem in the comment. > >> * even though racy, are not corrupted--even if this function is > > The value is already atomic so I don't think it could be corrupted > even though it would be inlined in the future. > > Please correct me if I miss something. > >> * refactored and/or inlined. >> */ > thanks,
On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote: > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > > On 5/23/22 09:33, Minchan Kim wrote: > > ... > > > > So then: > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > index 0e42038382c1..b404f87e2682 100644 > > > > +++ b/mm/page_alloc.c > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > - word = bitmap[word_bitidx]; > > > > + /* > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > > set_pfnblock_flags_mask would be better? > > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > > > Thanks for proceeding and suggestion, John. > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > > > Did it? [1] fixed something, but I'm not sure we can claim that that > > code is now safe against tearing in all possible cases, especially given > > the recent discussion here. Specifically, having this code do a read, > > then follow that up with calculations, seems correct. Anything else is > > The load tearing you are trying to explain in the comment would be > solved by [1] since the bits will always align on a word and accessing > word size based on word aligned address is always atomic so there is > no load tearing problem IIUC. That is not technically true. It is exactly the sort of thing READ_ONCE is intended to guard against. > Instead of the tearing problem, what we are trying to solve with > READ_ONCE is to prevent refetching when the function would be > inlined in the future. It is the same problem, who is to say it doesn't refetch while doing the maths? Jason
On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote: > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote: > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > > > On 5/23/22 09:33, Minchan Kim wrote: > > > ... > > > > > So then: > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > index 0e42038382c1..b404f87e2682 100644 > > > > > +++ b/mm/page_alloc.c > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > > > - word = bitmap[word_bitidx]; > > > > > + /* > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > > > set_pfnblock_flags_mask would be better? > > > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > > > > > Thanks for proceeding and suggestion, John. > > > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > > > > > Did it? [1] fixed something, but I'm not sure we can claim that that > > > code is now safe against tearing in all possible cases, especially given > > > the recent discussion here. Specifically, having this code do a read, > > > then follow that up with calculations, seems correct. Anything else is > > > > The load tearing you are trying to explain in the comment would be > > solved by [1] since the bits will always align on a word and accessing > > word size based on word aligned address is always atomic so there is > > no load tearing problem IIUC. > > That is not technically true. It is exactly the sort of thing > READ_ONCE is intended to guard against. Oh, does word access based on the aligned address still happen load tearing? I just referred to https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759 > > > Instead of the tearing problem, what we are trying to solve with > > READ_ONCE is to prevent refetching when the function would be > > inlined in the future. > > It is the same problem, who is to say it doesn't refetch while doing > the maths? I didn't say it doesn't refetch the value without the READ_ONCE. What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching" issue rather than "tearing" issue in specific __get_pfnblock_flags_mask context because I though there is no load-tearing issue there since bitmap is word-aligned/accessed. No? If the load tearing would still happens in the bitmap, it would be better to keep last suggestion from John. + /* + * This races, without locks, with set_pfnblock_flags_mask(). Ensure + * a consistent read of the memory array, so that results, even though + * racy, are not corrupted. + */
On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote: > On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote: > > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote: > > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > > > > On 5/23/22 09:33, Minchan Kim wrote: > > > > ... > > > > > > So then: > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > index 0e42038382c1..b404f87e2682 100644 > > > > > > +++ b/mm/page_alloc.c > > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > > > > > - word = bitmap[word_bitidx]; > > > > > > + /* > > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > > > > set_pfnblock_flags_mask would be better? > > > > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > > > > > > > Thanks for proceeding and suggestion, John. > > > > > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > > > > > > > Did it? [1] fixed something, but I'm not sure we can claim that that > > > > code is now safe against tearing in all possible cases, especially given > > > > the recent discussion here. Specifically, having this code do a read, > > > > then follow that up with calculations, seems correct. Anything else is > > > > > > The load tearing you are trying to explain in the comment would be > > > solved by [1] since the bits will always align on a word and accessing > > > word size based on word aligned address is always atomic so there is > > > no load tearing problem IIUC. > > > > That is not technically true. It is exactly the sort of thing > > READ_ONCE is intended to guard against. > > Oh, does word access based on the aligned address still happen > load tearing? > > I just referred to > https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759 I read that as saying load tearing is technically allowed but doesn't happen in gcc, and so must use the _ONCE macros. > I didn't say it doesn't refetch the value without the READ_ONCE. > > What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching" > issue rather than "tearing" issue in specific __get_pfnblock_flags_mask > context because I though there is no load-tearing issue there since > bitmap is word-aligned/accessed. No? It does both. AFAIK our memory model has no guarentees on what naked C statements will do. Tearing, multi-load, etc - it is all technically permitted. Use the proper accessors. Jason
On Tue, May 24, 2022 at 12:48:31PM -0300, Jason Gunthorpe wrote: > On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote: > > On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote: > > > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote: > > > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > > > > > On 5/23/22 09:33, Minchan Kim wrote: > > > > > ... > > > > > > > So then: > > > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > > index 0e42038382c1..b404f87e2682 100644 > > > > > > > +++ b/mm/page_alloc.c > > > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > > > > > > > - word = bitmap[word_bitidx]; > > > > > > > + /* > > > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > > > > > set_pfnblock_flags_mask would be better? > > > > > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > > > > > > > > > Thanks for proceeding and suggestion, John. > > > > > > > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > > > > > > > > > Did it? [1] fixed something, but I'm not sure we can claim that that > > > > > code is now safe against tearing in all possible cases, especially given > > > > > the recent discussion here. Specifically, having this code do a read, > > > > > then follow that up with calculations, seems correct. Anything else is > > > > > > > > The load tearing you are trying to explain in the comment would be > > > > solved by [1] since the bits will always align on a word and accessing > > > > word size based on word aligned address is always atomic so there is > > > > no load tearing problem IIUC. > > > > > > That is not technically true. It is exactly the sort of thing > > > READ_ONCE is intended to guard against. > > > > Oh, does word access based on the aligned address still happen > > load tearing? > > > > I just referred to > > https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759 > > I read that as saying load tearing is technically allowed but doesn't > happen in gcc, and so must use the _ONCE macros. This is in fact the intent, except... And as that passage goes on to state, there really are compilers (such as GCC) that tear stores of constants to machine aligned/sized locations. In short, use of the _ONCE() macros can save you a lot of pain. > > I didn't say it doesn't refetch the value without the READ_ONCE. > > > > What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching" > > issue rather than "tearing" issue in specific __get_pfnblock_flags_mask > > context because I though there is no load-tearing issue there since > > bitmap is word-aligned/accessed. No? > > It does both. AFAIK our memory model has no guarentees on what naked C > statements will do. Tearing, multi-load, etc - it is all technically > permitted. Use the proper accessors. I am with Jason on this one. In fact, I believe that any naked C-language access to mutable shared variables should have a comment stating why the compiler cannot mangle that access. Thanx, Paul
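As a concrete illustration of the store tearing Paul mentions, consider a sketch like the following (the asm in the comment is an example of what a compiler is permitted to emit, not output captured from this kernel):

void set_flag_word(unsigned long *p)
{
	/* A plain store of a 64-bit constant through a non-volatile pointer... */
	*p = 0x0123456789abcdefUL;
	/*
	 * ...may legally be split into two 32-bit immediate stores, e.g. on
	 * x86-64 something like:
	 *
	 *	movl $0x89abcdef, (%rdi)
	 *	movl $0x01234567, 4(%rdi)
	 *
	 * A lockless reader can then observe a half-updated value. Using
	 * WRITE_ONCE(*p, ...) on the store and READ_ONCE(*p) on the read
	 * rules that out by making both accesses volatile-qualified.
	 */
}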
On Tue, May 24, 2022 at 09:37:28AM -0700, Paul E. McKenney wrote: > On Tue, May 24, 2022 at 12:48:31PM -0300, Jason Gunthorpe wrote: > > On Tue, May 24, 2022 at 08:43:27AM -0700, Minchan Kim wrote: > > > On Tue, May 24, 2022 at 11:19:37AM -0300, Jason Gunthorpe wrote: > > > > On Mon, May 23, 2022 at 10:16:58PM -0700, Minchan Kim wrote: > > > > > On Mon, May 23, 2022 at 07:55:25PM -0700, John Hubbard wrote: > > > > > > On 5/23/22 09:33, Minchan Kim wrote: > > > > > > ... > > > > > > > > So then: > > > > > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > > > index 0e42038382c1..b404f87e2682 100644 > > > > > > > > +++ b/mm/page_alloc.c > > > > > > > > @@ -482,7 +482,12 @@ unsigned long __get_pfnblock_flags_mask(const struct page *page, > > > > > > > > word_bitidx = bitidx / BITS_PER_LONG; > > > > > > > > bitidx &= (BITS_PER_LONG-1); > > > > > > > > > > > > > > > > - word = bitmap[word_bitidx]; > > > > > > > > + /* > > > > > > > > + * This races, without locks, with set_pageblock_migratetype(). Ensure > > > > > > > set_pfnblock_flags_mask would be better? > > > > > > > > + * a consistent (non-tearing) read of the memory array, so that results, > > > > > > > > > > > > > > Thanks for proceeding and suggestion, John. > > > > > > > > > > > > > > IIUC, the load tearing wouldn't be an issue since [1] fixed the issue. > > > > > > > > > > > > Did it? [1] fixed something, but I'm not sure we can claim that that > > > > > > code is now safe against tearing in all possible cases, especially given > > > > > > the recent discussion here. Specifically, having this code do a read, > > > > > > then follow that up with calculations, seems correct. Anything else is > > > > > > > > > > The load tearing you are trying to explain in the comment would be > > > > > solved by [1] since the bits will always align on a word and accessing > > > > > word size based on word aligned address is always atomic so there is > > > > > no load tearing problem IIUC. > > > > > > > > That is not technically true. It is exactly the sort of thing > > > > READ_ONCE is intended to guard against. > > > > > > Oh, does word access based on the aligned address still happen > > > load tearing? > > > > > > I just referred to > > > https://elixir.bootlin.com/linux/latest/source/Documentation/memory-barriers.txt#L1759 > > > > I read that as saying load tearing is technically allowed but doesn't > > happen in gcc, and so must use the _ONCE macros. > > This is in fact the intent, except... > > And as that passage goes on to state, there really are compilers (such > as GCC) that tear stores of constants to machine aligned/sized locations. > > In short, use of the _ONCE() macros can save you a lot of pain. Thanks for the correction, Jason and Paul > > > > I didn't say it doesn't refetch the value without the READ_ONCE. > > > > > > What I am saying is READ_ONCE(bitmap_word_bitidx] prevents "refetching" > > > issue rather than "tearing" issue in specific __get_pfnblock_flags_mask > > > context because I though there is no load-tearing issue there since > > > bitmap is word-aligned/accessed. No? > > > > It does both. AFAIK our memory model has no guarentees on what naked C > > statements will do. Tearing, multi-load, etc - it is all technically > > permitted. Use the proper accessors. Seems like there was some misunderstanding here. I didn't mean not to use READ_ONCE for the bitmap but wanted to have more concrete comment. 
Since you guys corrected me that "even a word-aligned access could be wrong without READ_ONCE", I would keep the comment John suggested. > > I am with Jason on this one. > > In fact, I believe that any naked C-language access to mutable shared variables should have a comment stating why the compiler cannot mangle that access. Agreed. Thanks!
diff --git a/include/linux/mm.h b/include/linux/mm.h index 6acca5cecbc5..cbf79eb790e0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1625,8 +1625,19 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, #ifdef CONFIG_MIGRATION static inline bool is_pinnable_page(struct page *page) { - return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) || - is_zero_pfn(page_to_pfn(page)); +#ifdef CONFIG_CMA + /* + * use volatile to use local variable mt instead of + * refetching mt value. + */ + int __mt = get_pageblock_migratetype(page); + int mt = __READ_ONCE(__mt); + + if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) + return false; +#endif + + return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page))); } #else static inline bool is_pinnable_page(struct page *page)
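Given where the discussion above appears to land (a READ_ONCE() on the bitmap load inside __get_pfnblock_flags_mask(), with no __READ_ONCE() on a local), a simplified follow-up version of the hunk above might look like the sketch below. This is a guess at the direction of the next revision, not the hunk that was merged:

#ifdef CONFIG_MIGRATION
static inline bool is_pinnable_page(struct page *page)
{
#ifdef CONFIG_CMA
	/*
	 * get_pageblock_migratetype() funnels into __get_pfnblock_flags_mask(),
	 * whose READ_ONCE() of the pageblock bitmap word gives one
	 * racy-but-stable migratetype value here; no local-variable
	 * __READ_ONCE() dance is needed.
	 */
	int mt = get_pageblock_migratetype(page);

	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
		return false;
#endif
	return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
}
#endif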
Pages on CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so current is_pinnable_page could miss CMA pages which has MIGRATE_ ISOLATE. It ends up pinning CMA pages as longterm at pin_user_pages APIs so CMA allocation keep failed until the pin is released. CPU 0 CPU 1 - Task B cma_alloc alloc_contig_range pin_user_pages_fast(FOLL_LONGTERM) change pageblock as MIGRATE_ISOLATE internal_get_user_pages_fast lockless_pages_from_mm gup_pte_range try_grab_folio is_pinnable_page return true; So, pinned the page successfully. page migration failure with pinned page .. .. After 30 sec unpin_user_page(page) CMA allocation succeeded after 30 sec. The CMA allocation path protects the migration type change race using zone->lock but what GUP path need to know is just whether the page is on CMA area or not rather than exact migration type. Thus, we don't need zone->lock but just checks migration type in either of (MIGRATE_ISOLATE and MIGRATE_CMA). Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause rejecting of pinning pages on MIGRATE_ISOLATE pageblocks even though it's neither CMA nor movable zone if the page is temporarily unmovable. However, such a migration failure by unexpected temporal refcount holding is general issue, not only come from MIGRATE_ISOLATE and the MIGRATE_ISOLATE is also transient state like other temporal elevated refcount problem. Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Minchan Kim <minchan@kernel.org> --- * from v3 - https://lore.kernel.org/all/20220509153430.4125710-1-minchan@kernel.org/ * Fix typo and adding more description - akpm * from v2 - https://lore.kernel.org/all/20220505064429.2818496-1-minchan@kernel.org/ * Use __READ_ONCE instead of volatile - akpm * from v1 - https://lore.kernel.org/all/20220502173558.2510641-1-minchan@kernel.org/ * fix build warning - lkp * fix refetching issue of migration type * add side effect on !ZONE_MOVABLE and !MIGRATE_CMA in description - david include/linux/mm.h | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)