mm/sparsemem: fix race in accessing memory_section->usage

Message ID 1697202267-23600-1-git-send-email-quic_charante@quicinc.com (mailing list archive)
State New
Series mm/sparsemem: fix race in accessing memory_section->usage

Commit Message

Charan Teja Kalla Oct. 13, 2023, 1:04 p.m. UTC
The race below is observed on a PFN which falls into the device memory
region, with a system memory layout where the PFNs are arranged as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since the normal zone's start and
end PFNs span the device memory PFNs as well, compaction will walk the
device memory PFNs too, although those walks end up as a NOP (because
pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When
another core concurrently removes the section mappings for the
ZONE_DEVICE region that the PFN currently being compacted belongs to,
the result is a kernel crash with CONFIG_SPARSEMEM_VMEMMAP enabled.

compact_zone()			memunmap_pages
-------------			---------------
__pageblock_pfn_to_page
   ......
 (a)pfn_valid():
     valid_section()//return true
			      (b)__remove_pages()->
				  sparse_remove_section()->
				    section_deactivate():
				    [Free the array ms->usage and set
				     ms->usage = NULL]
     pfn_section_valid()
     [Access ms->usage which
     is NULL]

NOTE: From the above, the race reduces to one between
pfn_valid()/pfn_section_valid() and section deactivation, with
SPARSEMEM_VMEMMAP enabled.

Commit b943f045a9af ("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
SECTION_HAS_MEM_MAP, with the expectation that valid_section() then
returns false and ms->usage is not accessed.

Fix this issue with the following steps:
a) Clear SECTION_HAS_MEM_MAP before freeing ->usage.
b) An RCU-protected read-side critical section will either see NULL once
SECTION_HAS_MEM_MAP is cleared, or can still safely access ->usage.
c) Synchronize RCU on the write side and then free ->usage. No attempt
will be made to access ->usage after this, as SECTION_HAS_MEM_MAP is
cleared and valid_section() therefore returns false.

Since section_deactivate() is a rare operation that only occurs in the
hot-remove path, the impact of synchronize_rcu() should be negligible.
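
Condensed, the ordering in a)-c) looks like this (sketch only; the full
diff is at the bottom of this page):

	/* write side: section_deactivate() */
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;	/* a) valid_section() now false */
	if (!PageReserved(virt_to_page(ms->usage))) {
		synchronize_rcu();		/* c) wait out in-flight readers ... */
		kfree(ms->usage);		/*    ... then free the array        */
		ms->usage = NULL;
	}

	/* read side: pfn_valid() */
	rcu_read_lock();			/* b) readers run under RCU */
	if (!valid_section(ms)) {
		rcu_read_unlock();
		return 0;
	}
	ret = early_section(ms) || pfn_section_valid(ms, pfn);
	rcu_read_unlock();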

Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 include/linux/mmzone.h | 11 +++++++++--
 mm/sparse.c            | 14 ++++++++------
 2 files changed, 17 insertions(+), 8 deletions(-)

Comments

Andrew Morton Oct. 14, 2023, 10:25 p.m. UTC | #1
On Fri, 13 Oct 2023 18:34:27 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> The below race is observed on a PFN which falls into the device memory
> region with the system memory configuration where PFN's are such that
> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
> end pfn contains the device memory PFN's as well, the compaction
> triggered will try on the device memory PFN's too though they end up in
> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
> sections). When from other core, the section mappings are being removed
> for the ZONE_DEVICE region, that the PFN in question belongs to,
> on which compaction is currently being operated is resulting into the
> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.

Seems this bug is four years old, yes?  It must be quite hard to hit.

When people review this, please offer opinions on whether a fix should
be backported into -stable kernels, thanks.

> compact_zone()			memunmap_page
> -------------			---------------
> __pageblock_pfn_to_page
>    ......
>  (a)pfn_valid():
>      valid_section()//return true
> 			      (b)__remove_pages()->
> 				  sparse_remove_section()->
> 				    section_deactivate():
> 				    [Free the array ms->usage and set
> 				     ms->usage = NULL]
>      pfn_section_valid()
>      [Access ms->usage which
>      is NULL]
> 
> NOTE: From the above it can be said that the race is reduced to between
> the pfn_valid()/pfn_section_valid() and the section deactivate with
> SPASEMEM_VMEMAP enabled.
> 
> The commit b943f045a9af("mm/sparse: fix kernel crash with
> pfn_section_valid check") tried to address the same problem by clearing
> the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
> false thus ms->usage is not accessed.
> 
> Fix this issue by the below steps:
> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
> b) RCU protected read side critical section will either return NULL when
> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
> c) Synchronize the rcu on the write side and free the ->usage. No
> attempt will be made to access ->usage after this as the
> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
> 
> Since the section_deactivate() is a rare operation and will come in the
> hot remove path, impact of synchronize_rcu() should be negligble.
> 
> Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
David Hildenbrand Oct. 16, 2023, 8:23 a.m. UTC | #2
On 15.10.23 00:25, Andrew Morton wrote:
> On Fri, 13 Oct 2023 18:34:27 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
> 
>> The below race is observed on a PFN which falls into the device memory
>> region with the system memory configuration where PFN's are such that
>> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
>> end pfn contains the device memory PFN's as well, the compaction
>> triggered will try on the device memory PFN's too though they end up in
>> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
>> sections). When from other core, the section mappings are being removed
>> for the ZONE_DEVICE region, that the PFN in question belongs to,
>> on which compaction is currently being operated is resulting into the
>> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.
> 
> Seems this bug is four years old, yes?  It must be quite hard to hit.

From the description, it's not quite clear to me if this was actually 
hit -- usually people include the dmesg bug/crash info.

> 
> When people review this, please offer opinions on whether a fix should
> be backported into -stable kernels, thanks.
> 
>> compact_zone()			memunmap_page
>> -------------			---------------
>> __pageblock_pfn_to_page
>>     ......
>>   (a)pfn_valid():
>>       valid_section()//return true
>> 			      (b)__remove_pages()->
>> 				  sparse_remove_section()->
>> 				    section_deactivate():
>> 				    [Free the array ms->usage and set
>> 				     ms->usage = NULL]
>>       pfn_section_valid()
>>       [Access ms->usage which
>>       is NULL]
>>
>> NOTE: From the above it can be said that the race is reduced to between
>> the pfn_valid()/pfn_section_valid() and the section deactivate with
>> SPASEMEM_VMEMAP enabled.
>>
>> The commit b943f045a9af("mm/sparse: fix kernel crash with
>> pfn_section_valid check") tried to address the same problem by clearing
>> the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
>> false thus ms->usage is not accessed.
>>
>> Fix this issue by the below steps:
>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
>> b) RCU protected read side critical section will either return NULL when
>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
>> c) Synchronize the rcu on the write side and free the ->usage. No
>> attempt will be made to access ->usage after this as the
>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.


This affects any kind of memory hotunplug. When hotunplugging memory we 
will end up calling synchronize_rcu() for each and every memory section, 
which sounds extremely wasteful.

Can't we find a way to kfree_rcu() that thing and read/write the pointer 
using READ_ONCE()/WRITE_ONCE() instead?
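
A minimal sketch of that direction, assuming struct mem_section_usage
grew a struct rcu_head member (named 'rcu' here purely for
illustration):

	/* write side in section_deactivate(), sketch only */
	if (!PageReserved(virt_to_page(ms->usage))) {
		struct mem_section_usage *usage = ms->usage;

		WRITE_ONCE(ms->usage, NULL);	/* new readers observe NULL */
		kfree_rcu(usage, rcu);		/* freed after a grace period, no
						 * per-section synchronize_rcu() */
	}

	/* read side, still inside rcu_read_lock()/rcu_read_unlock() */
	usage = READ_ONCE(ms->usage);
	ret = usage ? test_bit(subsection_map_index(pfn), usage->subsection_map) : 0;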
Pavan Kondeti Oct. 16, 2023, 10:33 a.m. UTC | #3
On Fri, Oct 13, 2023 at 06:34:27PM +0530, Charan Teja Kalla wrote:
> The below race is observed on a PFN which falls into the device memory
> region with the system memory configuration where PFN's are such that
> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
> end pfn contains the device memory PFN's as well, the compaction
> triggered will try on the device memory PFN's too though they end up in
> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
> sections). When from other core, the section mappings are being removed
> for the ZONE_DEVICE region, that the PFN in question belongs to,
> on which compaction is currently being operated is resulting into the
> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.
> 
> compact_zone()			memunmap_pages
> -------------			---------------
> __pageblock_pfn_to_page
>    ......
>  (a)pfn_valid():
>      valid_section()//return true
> 			      (b)__remove_pages()->
> 				  sparse_remove_section()->
> 				    section_deactivate():
> 				    [Free the array ms->usage and set
> 				     ms->usage = NULL]
>      pfn_section_valid()
>      [Access ms->usage which
>      is NULL]
> 
> NOTE: From the above it can be said that the race is reduced to between
> the pfn_valid()/pfn_section_valid() and the section deactivate with
> SPASEMEM_VMEMAP enabled.
> 
> The commit b943f045a9af("mm/sparse: fix kernel crash with
> pfn_section_valid check") tried to address the same problem by clearing
> the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns
> false thus ms->usage is not accessed.
> 
> Fix this issue by the below steps:
> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
> b) RCU protected read side critical section will either return NULL when
> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
> c) Synchronize the rcu on the write side and free the ->usage. No
> attempt will be made to access ->usage after this as the
> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
> 
> Since the section_deactivate() is a rare operation and will come in the
> hot remove path, impact of synchronize_rcu() should be negligble.

struct mem_section_usage has other fields like pageblock_flags. Do we
need to protect their readers with RCU as well? Also, can we annotate
the usage field in struct mem_section with __rcu and use RCU accessors
like rcu_dereference() when accessing mem_section::usage?
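
For illustration, the annotation being asked about would look roughly
like the below (abridged sketch, not a posted patch):

	struct mem_section {
		unsigned long section_mem_map;
		struct mem_section_usage __rcu *usage;	/* checked by sparse/lockdep-RCU */
	};

	/* readers fetch the pointer through the RCU accessor */
	rcu_read_lock();
	usage = rcu_dereference(ms->usage);
	if (usage)
		ret = test_bit(subsection_map_index(pfn), usage->subsection_map);
	rcu_read_unlock();

	/* writers publish/retire it with the matching accessors */
	rcu_assign_pointer(ms->usage, usage);	/* section_activate()   */
	RCU_INIT_POINTER(ms->usage, NULL);	/* section_deactivate() */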

> 
> Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
>  include/linux/mmzone.h | 11 +++++++++--
>  mm/sparse.c            | 14 ++++++++------
>  2 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc..c877396 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1987,6 +1987,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>  static inline int pfn_valid(unsigned long pfn)
>  {
>  	struct mem_section *ms;
> +	int ret;
>  
>  	/*
>  	 * Ensure the upper PAGE_SHIFT bits are clear in the
> @@ -2000,13 +2001,19 @@ static inline int pfn_valid(unsigned long pfn)
>  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>  		return 0;
>  	ms = __pfn_to_section(pfn);
> -	if (!valid_section(ms))
> +	rcu_read_lock();
> +	if (!valid_section(ms)) {
> +		rcu_read_unlock();
>  		return 0;
> +	}
>  	/*
>  	 * Traditionally early sections always returned pfn_valid() for
>  	 * the entire section-sized span.
>  	 */
> -	return early_section(ms) || pfn_section_valid(ms, pfn);
> +	ret = early_section(ms) || pfn_section_valid(ms, pfn);
> +	rcu_read_unlock();
> +
> +	return ret;
>  }
>  #endif
>  
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e5..ca7dbe1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -792,6 +792,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		unsigned long section_nr = pfn_to_section_nr(pfn);
>  
>  		/*
> +		 * Mark the section invalid so that valid_section()
> +		 * return false. This prevents code from dereferencing
> +		 * ms->usage array.
> +		 */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> +

This trick may not be needed if we add proper NULL checks around ms->usage. We are anyway
introducing a new rule that this check needs to be done under the RCU read lock, so why not revisit it?

> +		/*
>  		 * When removing an early section, the usage map is kept (as the
>  		 * usage maps of other sections fall into the same page). It
>  		 * will be re-used when re-adding the section - which is then no
> @@ -799,16 +806,11 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		 * was allocated during boot.
>  		 */
>  		if (!PageReserved(virt_to_page(ms->usage))) {
> +			synchronize_rcu();
>  			kfree(ms->usage);
>  			ms->usage = NULL;
>  		}

If we add NULL checks around ms->usage, this becomes

tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
synchronize_rcu();
kfree(tmp);

BTW, do we come here holding any global locks? If yes, synchronize_rcu() can add delays in releasing
the lock. In that case we may have to go for an async RCU free.
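
An asynchronous variant along those lines might look roughly like the
sketch below (the 'rcu' field in mem_section_usage and the
hotplug_locked() condition are placeholders, not existing code):

	/* assumes 'struct rcu_head rcu' is added to struct mem_section_usage */
	static void mem_section_usage_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct mem_section_usage, rcu));
	}

	/* in section_deactivate(), replacing synchronize_rcu() + kfree() */
	tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
	if (tmp)
		call_rcu(&tmp->rcu, mem_section_usage_free_rcu);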

>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> -		/*
> -		 * Mark the section invalid so that valid_section()
> -		 * return false. This prevents code from dereferencing
> -		 * ms->usage array.
> -		 */
> -		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
>  	/*
> 

Thanks,
Pavan
Charan Teja Kalla Oct. 16, 2023, 1:38 p.m. UTC | #4
Thanks Andrew/David,

On 10/16/2023 1:53 PM, David Hildenbrand wrote:
>>> The below race is observed on a PFN which falls into the device memory
>>> region with the system memory configuration where PFN's are such that
>>> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
>>> end pfn contains the device memory PFN's as well, the compaction
>>> triggered will try on the device memory PFN's too though they end up in
>>> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
>>> sections). When from other core, the section mappings are being removed
>>> for the ZONE_DEVICE region, that the PFN in question belongs to,
>>> on which compaction is currently being operated is resulting into the
>>> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.
>>
>> Seems this bug is four years old, yes?  It must be quite hard to hit.
> 
> From the description, it's not quite clear to me if this was actually
> hit -- usually people include the dmesg bug/crash info.

On Snapdragon SoCs, with the mentioned memory configuration where PFNs
are laid out as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to
see a bunch of issues daily while testing on a device farm.

I note that from next time onwards I will send the dmesg bug/crash info
for these types of issues. For this particular issue, the log is below.
Though the log does not point directly at pfn_section_valid()'s
ms->usage access, when we loaded this dump into the Lauterbach T32
tool, it does point there.

[  540.578056] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[  540.578068] Mem abort info:
[  540.578070]   ESR = 0x0000000096000005
[  540.578073]   EC = 0x25: DABT (current EL), IL = 32 bits
[  540.578077]   SET = 0, FnV = 0
[  540.578080]   EA = 0, S1PTW = 0
[  540.578082]   FSC = 0x05: level 1 translation fault
[  540.578085] Data abort info:
[  540.578086]   ISV = 0, ISS = 0x00000005
[  540.578088]   CM = 0, WnR = 0
[  540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS
BTYPE=--)
[  540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
[  540.579454] lr : compact_zone+0x994/0x1058
[  540.579460] sp : ffffffc03579b510
[  540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:
000000000000000c
[  540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:
ffffffc03579b640
[  540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:
0000000000000000
[  540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:
ffffffdebf66d140
[  540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:
00000000009f4bff
[  540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:
0000000000000001
[  540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :
ffffff897d2cd440
[  540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :
ffffffc03579b5b4
[  540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :
0000000000000001
[  540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :
0000000000235800
[  540.579524] Call trace:
[  540.579527]  __pageblock_pfn_to_page+0x6c/0x14c
[  540.579533]  compact_zone+0x994/0x1058
[  540.579536]  try_to_compact_pages+0x128/0x378
[  540.579540]  __alloc_pages_direct_compact+0x80/0x2b0
[  540.579544]  __alloc_pages_slowpath+0x5c0/0xe10
[  540.579547]  __alloc_pages+0x250/0x2d0
[  540.579550]  __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
[  540.579561]  iommu_dma_alloc+0xa0/0x320
[  540.579565]  dma_alloc_attrs+0xd4/0x108

>>> Fix this issue by the below steps:
>>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
>>> b) RCU protected read side critical section will either return NULL when
>>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
>>> c) Synchronize the rcu on the write side and free the ->usage. No
>>> attempt will be made to access ->usage after this as the
>>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
> 
> 
> This affects any kind of memory hotunplug. When hotunplugging memory we
> will end up calling synchronize_rcu() for each and every memory section,
> which sounds extremely wasteful.
> 
> Can't we find a way to kfree_rcu() that thing and read/write the pointer
> using READ?ONCE?WRITE_ONCE instead?

I was inspired to use synchronize_rcu() by [1], where we used it in
offline_page_ext(). And my limited understanding is that a user can
trigger the offline operation more often than the unplug operation.

I agree that there is scope to use kfree_rcu() here, unlike in [1]. Let
me check for a way to use it.

[1]
https://lore.kernel.org/all/1661496993-11473-1-git-send-email-quic_charante@quicinc.com/

Thanks,
Charan
Andrew Morton Oct. 16, 2023, 10:34 p.m. UTC | #5
On Mon, 16 Oct 2023 19:08:00 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> > From the description, it's not quite clear to me if this was actually
> > hit -- usually people include the dmesg bug/crash info.
> 
> On Snapdragon SoC,  with the mentioned memory configuration of PFN's as
> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL],  we are able to see bunch of
> issues daily while testing on a device farm.
> 
> I note that from next time on wards will send the demsg bug/crash info
> for these type of issues. For this particular issue below is the log.
> Though the below log is not directly pointing to the
> pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32
> lauterbach tool, it is pointing.
> 
> [  540.578056] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000000
> [  540.578068] Mem abort info:
> [  540.578070]   ESR = 0x0000000096000005
> [  540.578073]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  540.578077]   SET = 0, FnV = 0
> [  540.578080]   EA = 0, S1PTW = 0
> [  540.578082]   FSC = 0x05: level 1 translation fault
> [  540.578085] Data abort info:
> [  540.578086]   ISV = 0, ISS = 0x00000005
> [  540.578088]   CM = 0, WnR = 0
> [  540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS
> BTYPE=--)
> [  540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579454] lr : compact_zone+0x994/0x1058
> [  540.579460] sp : ffffffc03579b510
> [  540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:
> 000000000000000c
> [  540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:
> ffffffc03579b640
> [  540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:
> 0000000000000000
> [  540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:
> ffffffdebf66d140
> [  540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:
> 00000000009f4bff
> [  540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:
> 0000000000000001
> [  540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :
> ffffff897d2cd440
> [  540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :
> ffffffc03579b5b4
> [  540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :
> 0000000000000001
> [  540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :
> 0000000000235800
> [  540.579524] Call trace:
> [  540.579527]  __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579533]  compact_zone+0x994/0x1058
> [  540.579536]  try_to_compact_pages+0x128/0x378
> [  540.579540]  __alloc_pages_direct_compact+0x80/0x2b0
> [  540.579544]  __alloc_pages_slowpath+0x5c0/0xe10
> [  540.579547]  __alloc_pages+0x250/0x2d0
> [  540.579550]  __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
> [  540.579561]  iommu_dma_alloc+0xa0/0x320
> [  540.579565]  dma_alloc_attrs+0xd4/0x108

Thanks.  I added the above info to the changelog, added a cc:stable and
I added a note-to-myself that a new version of the fix may be
forthcoming.
Charan Teja Kalla Oct. 17, 2023, 2:10 p.m. UTC | #6
Thanks Pavan!!

On 10/16/2023 4:03 PM, Pavan Kondeti wrote:
>> Fix this issue by the below steps:
>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
>> b) RCU protected read side critical section will either return NULL when
>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
>> c) Synchronize the rcu on the write side and free the ->usage. No
>> attempt will be made to access ->usage after this as the
>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
>>
>> Since the section_deactivate() is a rare operation and will come in the
>> hot remove path, impact of synchronize_rcu() should be negligble.
> struct mem_section_usage has other field like pageblock_flags. Do we
> need to protect its readers with RCU? Also can we annotate usage field
> in struct mem_section with __rcu and use RCU accessors like
> rcu_dereference() while using memsection::usage field?

Good question about the pageblock_flags! I think we rely on
get_pageblock_bitmap() to read ms->usage->pageblock_flags by passing a
struct page*.

1) All the functions I have come across that call
get_pageblock_bitmap()/get_pfnblock_flags_mask() pass a page* that they
got from the buddy allocator. I think we are safe here, as the device
pages (from which the problem is coming) will never be onlined/added to
buddy.

2) There are functions in compaction which operate directly on PFNs
through pfn_to_online_page(). Since that returns NULL for device pages,
I think we are safe here too.

3) alloc_contig_range() also operates on PFNs directly, and TMK, the
[start, end) range passed to it will not contain hole/offlined pages.
I think we are safe here too.

Maybe David/other reviewers can comment if there are mistakes in the
above.

I am currently working on using the __rcu accessors.

> 
>>  		/*
>> +		 * Mark the section invalid so that valid_section()
>> +		 * return false. This prevents code from dereferencing
>> +		 * ms->usage array.
>> +		 */
>> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>> +
> This trick may not be needed if we add proper NULL checks around ms->usage. We are anyway
> introducing a new rule this check needs to be done under RCU lock, so why not revisit it?
> 

I think this is for valid_section().

>>  		 * was allocated during boot.
>>  		 */
>>  		if (!PageReserved(virt_to_page(ms->usage))) {
>> +			synchronize_rcu();
>>  			kfree(ms->usage);
>>  			ms->usage = NULL;
>>  		}
> If we add NULL checks around ms->usage, this becomes
> 
> tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
> syncrhonize_rcu();
> kfree(tmp);
Per David's input, I am working on using kfree_rcu().

Thanks,
Charan
David Hildenbrand Oct. 17, 2023, 2:53 p.m. UTC | #7
On 17.10.23 16:10, Charan Teja Kalla wrote:
> Thanks Pavan!!
> 
> On 10/16/2023 4:03 PM, Pavan Kondeti wrote:
>>> Fix this issue by the below steps:
>>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
>>> b) RCU protected read side critical section will either return NULL when
>>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
>>> c) Synchronize the rcu on the write side and free the ->usage. No
>>> attempt will be made to access ->usage after this as the
>>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
>>>
>>> Since the section_deactivate() is a rare operation and will come in the
>>> hot remove path, impact of synchronize_rcu() should be negligble.
>> struct mem_section_usage has other field like pageblock_flags. Do we
>> need to protect its readers with RCU? Also can we annotate usage field
>> in struct mem_section with __rcu and use RCU accessors like
>> rcu_dereference() while using memsection::usage field?
> 
> Good question about the pageblock_flags!! I think we rely on the
> get_pageblock_bitmap() to read the ms->usage->pageblock_flags by passing
> struct page*.
> 
> 1) All the functions that I have come across calling
> get_pageblock_bitmap()/get_pfnblock_flags_mask() passing the page* which
> it get from buddy. I think we are safe here as the device pages(from
> which the problem is coming will never be onlined/added to buddy)
> 
> 2) There are functions in compaction which directly operate on the pfn's
> through pfn_to_online_page(). As for device pages, it is going to return
> NULL, I think we are safe here too.
> 
> 3) alloc_contig_range() which also operate on the pfn's directly, and
> TMK, we will not pass the [start , end) values of this function to
> contains the hole/offlined pages. I think we are safe here too.
> 
> May be David/other reviewers can help in commenting if there are some
> mistakes above.

Sounds reasonable to me; most PFN walkers shouldn't deal with pageblock
flags. alloc_contig_range() is certainly interesting; I suspect it's
fine, but we had better double-check.
David Hildenbrand Oct. 18, 2023, 7:52 a.m. UTC | #8
On 16.10.23 15:38, Charan Teja Kalla wrote:
> Thanks Andrew/David,
> 
> On 10/16/2023 1:53 PM, David Hildenbrand wrote:
>>>> The below race is observed on a PFN which falls into the device memory
>>>> region with the system memory configuration where PFN's are such that
>>>> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
>>>> end pfn contains the device memory PFN's as well, the compaction
>>>> triggered will try on the device memory PFN's too though they end up in
>>>> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
>>>> sections). When from other core, the section mappings are being removed
>>>> for the ZONE_DEVICE region, that the PFN in question belongs to,
>>>> on which compaction is currently being operated is resulting into the
>>>> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.
>>>
>>> Seems this bug is four years old, yes?  It must be quite hard to hit.
>>
>>  From the description, it's not quite clear to me if this was actually
>> hit -- usually people include the dmesg bug/crash info.
> 
> On Snapdragon SoC,  with the mentioned memory configuration of PFN's as
> [ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL],  we are able to see bunch of
> issues daily while testing on a device farm.
> 
> I note that from next time on wards will send the demsg bug/crash info
> for these type of issues. For this particular issue below is the log.
> Though the below log is not directly pointing to the
> pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32
> lauterbach tool, it is pointing.
> 
> [  540.578056] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000000
> [  540.578068] Mem abort info:
> [  540.578070]   ESR = 0x0000000096000005
> [  540.578073]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  540.578077]   SET = 0, FnV = 0
> [  540.578080]   EA = 0, S1PTW = 0
> [  540.578082]   FSC = 0x05: level 1 translation fault
> [  540.578085] Data abort info:
> [  540.578086]   ISV = 0, ISS = 0x00000005
> [  540.578088]   CM = 0, WnR = 0
> [  540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS
> BTYPE=--)
> [  540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579454] lr : compact_zone+0x994/0x1058
> [  540.579460] sp : ffffffc03579b510
> [  540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:
> 000000000000000c
> [  540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:
> ffffffc03579b640
> [  540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:
> 0000000000000000
> [  540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:
> ffffffdebf66d140
> [  540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:
> 00000000009f4bff
> [  540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:
> 0000000000000001
> [  540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :
> ffffff897d2cd440
> [  540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :
> ffffffc03579b5b4
> [  540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :
> 0000000000000001
> [  540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :
> 0000000000235800
> [  540.579524] Call trace:
> [  540.579527]  __pageblock_pfn_to_page+0x6c/0x14c
> [  540.579533]  compact_zone+0x994/0x1058
> [  540.579536]  try_to_compact_pages+0x128/0x378
> [  540.579540]  __alloc_pages_direct_compact+0x80/0x2b0
> [  540.579544]  __alloc_pages_slowpath+0x5c0/0xe10
> [  540.579547]  __alloc_pages+0x250/0x2d0
> [  540.579550]  __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
> [  540.579561]  iommu_dma_alloc+0xa0/0x320
> [  540.579565]  dma_alloc_attrs+0xd4/0x108
> 
>>>> Fix this issue by the below steps:
>>>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
>>>> b) RCU protected read side critical section will either return NULL when
>>>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
>>>> c) Synchronize the rcu on the write side and free the ->usage. No
>>>> attempt will be made to access ->usage after this as the
>>>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.
>>
>>
>> This affects any kind of memory hotunplug. When hotunplugging memory we
>> will end up calling synchronize_rcu() for each and every memory section,
>> which sounds extremely wasteful.
>>
>> Can't we find a way to kfree_rcu() that thing and read/write the pointer
>> using READ?ONCE?WRITE_ONCE instead?
> 
> I am inspired to use the synchronize_rcu() because of [1] where we did
> use it in offline_page_ext(). And my limited understanding is that, a
> user can trigger the offline operation more often than the unplug operation.

In theory yes. In practice there are not many use cases where we do that.

Further, page_ext is already not used that frequently (page owner, 
young, idle tracking only), especially in most production (!debug) 
environments.

> 
> I agree here that there is a scope to use kfree_rcu() unlike in [1]. Let
> me check for a way to use it.

Exactly, if we could use that, it would be ideal.
Andrew Morton Oct. 25, 2023, 9:35 p.m. UTC | #9
On Tue, 17 Oct 2023 19:40:15 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> >>  		 * was allocated during boot.
> >>  		 */
> >>  		if (!PageReserved(virt_to_page(ms->usage))) {
> >> +			synchronize_rcu();
> >>  			kfree(ms->usage);
> >>  			ms->usage = NULL;
> >>  		}
> > If we add NULL checks around ms->usage, this becomes
> > 
> > tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
> > syncrhonize_rcu();
> > kfree(tmp);
> Per David input, I am working on using kfree_rcu().

How's it coming along?

Given that we're at 6.6-rc7 and given that this issue is causing
daily crashes in your device farm, I'm thinking that we use the current
version of your patch for 6.6 and for -stable.  We can look at the
kfree_rcu() optimization for later kernel releases?
David Hildenbrand Oct. 26, 2023, 7 a.m. UTC | #10
On 25.10.23 23:35, Andrew Morton wrote:
> On Tue, 17 Oct 2023 19:40:15 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
> 
>>>>   		 * was allocated during boot.
>>>>   		 */
>>>>   		if (!PageReserved(virt_to_page(ms->usage))) {
>>>> +			synchronize_rcu();
>>>>   			kfree(ms->usage);
>>>>   			ms->usage = NULL;
>>>>   		}
>>> If we add NULL checks around ms->usage, this becomes
>>>
>>> tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
>>> syncrhonize_rcu();
>>> kfree(tmp);
>> Per David input, I am working on using kfree_rcu().
> 
> How's it coming along?
> 
> Given that we're at 6.6-rc7 and given that this issue is causing
> daily crashes in your device farm, I'm thinking that we use the current
> version of your patch for 6.6 and for -stable.  We can look at the
> kfree_rcu() optimization for later kernel releases?

Any particular reason we have to rush this in? It's been seen by one 
company in a testing farm; there were no other reports, especially not 
from production systems. ... and the issue seems to be quite old.
Charan Teja Kalla Oct. 26, 2023, 7:18 a.m. UTC | #11
Thanks Andrew/David,

On 10/26/2023 12:30 PM, David Hildenbrand wrote:
>>
>> How's it coming along?
>>

I was on vacation... Will send V2 by Monday IST.
>> Given that we're at 6.6-rc7 and given that this issue is causing
>> daily crashes in your device farm, I'm thinking that we use the current
>> version of your patch for 6.6 and for -stable.  We can look at the
>> kfree_rcu() optimization for later kernel releases?
> 
> Any particular reason we have to rush this in? It's been seen by one
> company in a testing farm; there were no other reports, especially not
> from production systems. ... and the issue seems to be quite old.

I think we can work on merging the final solution here rather than an
incremental one. At least internally, we are unblocked with V1.

Thanks,
Charan
Alexander Potapenko Jan. 15, 2024, 6:44 p.m. UTC | #12
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>

Hi folks,

(adding KMSAN reviewers and IBM people who are currently porting KMSAN to other
architectures, plus Paul for his opinion on refactoring RCU)

this patch broke x86 KMSAN in a subtle way.

For every memory access in the code instrumented by KMSAN we call
kmsan_get_metadata() to obtain the metadata for the memory being accessed. For
virtual memory the metadata pointers are stored in the corresponding `struct
page`, therefore we need to call virt_to_page() to get them.

According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr)
returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to
call virt_addr_valid() as well.

To avoid recursion, kmsan_get_metadata() must not call instrumented code,
therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c
to check whether a virtual address is valid or not.

But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU
API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(),
so there is an infinite recursion now. I do not think it is correct to stop that
recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in
kmsan_get_metadata(): that would prevent instrumented functions called from
within the runtime from tracking the shadow values, which might introduce false
positives.

I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock() into
KMSAN code to prevent it from being instrumented, but that might require factoring
out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this
is feasible?

Another option is to cut some edges in the code calling virt_to_page(). First,
my observation is that virt_addr_valid() is quite rare in the kernel code, i.e.
not all cases of calling virt_to_page() are covered with it. Second, every
memory access to KMSAN metadata residing in virt_to_page(kaddr)->shadow always
accompanies an access to `kaddr` itself, so if there is a race on a PFN then
the access to `kaddr` will probably also trigger a fault. Third, KMSAN metadata
accesses are inherently non-atomic, and even if we ensure pfn_valid() is
returning a consistent value for a single memory access, calling it twice may
already return different results.

Considering the above, how bad would it be to drop synchronization for KMSAN's
version of pfn_valid() called from kmsan_virt_addr_valid()?
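
Concretely, dropping that synchronization would amount to an
unsynchronized, KMSAN-private copy of pfn_valid(), along these lines
(purely illustrative sketch; the helper name is made up):

	/* mirrors pfn_valid() minus the rcu_read_lock()/unlock() pair,
	 * accepting the benign race described above */
	static inline bool kmsan_pfn_valid_unlocked(unsigned long pfn)
	{
		struct mem_section *ms;

		if (PHYS_PFN(PFN_PHYS(pfn)) != pfn)
			return false;
		if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
			return false;
		ms = __pfn_to_section(pfn);
		if (!valid_section(ms))
			return false;
		return early_section(ms) || pfn_section_valid(ms, pfn);
	}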
Marco Elver Jan. 15, 2024, 8:34 p.m. UTC | #13
On Mon, 15 Jan 2024 at 19:44, Alexander Potapenko <glider@google.com> wrote:
>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Marco Elver <elver@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: kasan-dev@googlegroups.com
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
>
> Hi folks,
>
> (adding KMSAN reviewers and IBM people who are currently porting KMSAN to other
> architectures, plus Paul for his opinion on refactoring RCU)
>
> this patch broke x86 KMSAN in a subtle way.
>
> For every memory access in the code instrumented by KMSAN we call
> kmsan_get_metadata() to obtain the metadata for the memory being accessed. For
> virtual memory the metadata pointers are stored in the corresponding `struct
> page`, therefore we need to call virt_to_page() to get them.
>
> According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr)
> returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to
> call virt_addr_valid() as well.
>
> To avoid recursion, kmsan_get_metadata() must not call instrumented code,
> therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c
> to check whether a virtual address is valid or not.
>
> But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU
> API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(),
> so there is an infinite recursion now. I do not think it is correct to stop that
> recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in
> kmsan_get_metadata(): that would prevent instrumented functions called from
> within the runtime from tracking the shadow values, which might introduce false
> positives.
>
> I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock(), into
> KMSAN code to prevent it from being instrumented, but that might require factoring
> out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this
> is feasible?

__rcu_read_lock/unlock() is only outlined in PREEMPT_RCU. Not sure that helps.

Otherwise, there is rcu_read_lock_sched_notrace() which does the bare
minimum and is static inline.

Does that help?

> Another option is to cut some edges in the code calling virt_to_page(). First,
> my observation is that virt_addr_valid() is quite rare in the kernel code, i.e.
> not all cases of calling virt_to_page() are covered with it. Second, every
> memory access to KMSAN metadata residing in virt_to_page(kaddr)->shadow always
> accompanies an access to `kaddr` itself, so if there is a race on a PFN then
> the access to `kaddr` will probably also trigger a fault. Third, KMSAN metadata
> accesses are inherently non-atomic, and even if we ensure pfn_valid() is
> returning a consistent value for a single memory access, calling it twice may
> already return different results.
>
> Considering the above, how bad would it be to drop synchronization for KMSAN's
> version of pfn_valid() called from kmsan_virt_addr_valid()?
Marco Elver Jan. 17, 2024, 7:18 p.m. UTC | #14
On Mon, Jan 15, 2024 at 09:34PM +0100, Marco Elver wrote:
> On Mon, 15 Jan 2024 at 19:44, Alexander Potapenko <glider@google.com> wrote:
> >
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Marco Elver <elver@google.com>
> > Cc: Dmitry Vyukov <dvyukov@google.com>
> > Cc: kasan-dev@googlegroups.com
> > Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> > Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
> >
> > Hi folks,
> >
> > (adding KMSAN reviewers and IBM people who are currently porting KMSAN to other
> > architectures, plus Paul for his opinion on refactoring RCU)
> >
> > this patch broke x86 KMSAN in a subtle way.
> >
> > For every memory access in the code instrumented by KMSAN we call
> > kmsan_get_metadata() to obtain the metadata for the memory being accessed. For
> > virtual memory the metadata pointers are stored in the corresponding `struct
> > page`, therefore we need to call virt_to_page() to get them.
> >
> > According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr)
> > returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to
> > call virt_addr_valid() as well.
> >
> > To avoid recursion, kmsan_get_metadata() must not call instrumented code,
> > therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c
> > to check whether a virtual address is valid or not.
> >
> > But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU
> > API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(),
> > so there is an infinite recursion now. I do not think it is correct to stop that
> > recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in
> > kmsan_get_metadata(): that would prevent instrumented functions called from
> > within the runtime from tracking the shadow values, which might introduce false
> > positives.
> >
> > I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock(), into
> > KMSAN code to prevent it from being instrumented, but that might require factoring
> > out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this
> > is feasible?
> 
> __rcu_read_lock/unlock() is only outlined in PREEMPT_RCU. Not sure that helps.
> 
> Otherwise, there is rcu_read_lock_sched_notrace() which does the bare
> minimum and is static inline.
> 
> Does that help?

Hrm, rcu_read_unlock_sched_notrace() can still call
__preempt_schedule_notrace(), which is again instrumented by KMSAN.

This patch gets me a working kernel:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4ed33b127821..2d62df462d88 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2000,6 +2000,7 @@ static inline int pfn_valid(unsigned long pfn)
 {
 	struct mem_section *ms;
 	int ret;
+	unsigned long flags;
 
 	/*
 	 * Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2013,9 +2014,9 @@ static inline int pfn_valid(unsigned long pfn)
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	ms = __pfn_to_section(pfn);
-	rcu_read_lock();
+	local_irq_save(flags);
 	if (!valid_section(ms)) {
-		rcu_read_unlock();
+		local_irq_restore(flags);
 		return 0;
 	}
 	/*
@@ -2023,7 +2024,7 @@ static inline int pfn_valid(unsigned long pfn)
 	 * the entire section-sized span.
 	 */
 	ret = early_section(ms) || pfn_section_valid(ms, pfn);
-	rcu_read_unlock();
+	local_irq_restore(flags);
 
 	return ret;
 }

Disabling interrupts is a little heavy handed - it also assumes the
current RCU implementation. There is
preempt_enable_no_resched_notrace(), but that might be worse because it
breaks scheduling guarantees.

That being said, whatever we do here should be wrapped in some
rcu_read_lock/unlock_<newvariant>() helper.

Is there an existing helper we can use? If not, we need a variant that
can be used from extremely constrained contexts that can't even call
into the scheduler. And if we want pfn_valid() to switch to it, it also
should be fast.

Thanks,
-- Marco
Alexander Potapenko Jan. 18, 2024, 9:01 a.m. UTC | #15
>
> Hrm, rcu_read_unlock_sched_notrace() can still call
> __preempt_schedule_notrace(), which is again instrumented by KMSAN.
>
> This patch gets me a working kernel:
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4ed33b127821..2d62df462d88 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -2000,6 +2000,7 @@ static inline int pfn_valid(unsigned long pfn)
>  {
>         struct mem_section *ms;
>         int ret;
> +       unsigned long flags;
>
>         /*
>          * Ensure the upper PAGE_SHIFT bits are clear in the
> @@ -2013,9 +2014,9 @@ static inline int pfn_valid(unsigned long pfn)
>         if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>                 return 0;
>         ms = __pfn_to_section(pfn);
> -       rcu_read_lock();
> +       local_irq_save(flags);
>         if (!valid_section(ms)) {
> -               rcu_read_unlock();
> +               local_irq_restore(flags);
>                 return 0;
>         }
>         /*
> @@ -2023,7 +2024,7 @@ static inline int pfn_valid(unsigned long pfn)
>          * the entire section-sized span.
>          */
>         ret = early_section(ms) || pfn_section_valid(ms, pfn);
> -       rcu_read_unlock();
> +       local_irq_restore(flags);
>
>         return ret;
>  }
>
> Disabling interrupts is a little heavy handed - it also assumes the
> current RCU implementation. There is
> preempt_enable_no_resched_notrace(), but that might be worse because it
> breaks scheduling guarantees.
>
> That being said, whatever we do here should be wrapped in some
> rcu_read_lock/unlock_<newvariant>() helper.

We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c
(or the x86-specific KMSAN header, depending on whether people are
seeing the problem on s390 and Power) with some header magic.
But that's probably more fragile than adding a helper.

>
> Is there an existing helper we can use? If not, we need a variant that
> can be used from extremely constrained contexts that can't even call
> into the scheduler. And if we want pfn_valid() to switch to it, it also
> should be fast.
Marco Elver Jan. 18, 2024, 9:43 a.m. UTC | #16
On Thu, Jan 18, 2024 at 10:01AM +0100, Alexander Potapenko wrote:
> >
> > Hrm, rcu_read_unlock_sched_notrace() can still call
> > __preempt_schedule_notrace(), which is again instrumented by KMSAN.
> >
> > This patch gets me a working kernel:
> >
[...]
> > Disabling interrupts is a little heavy handed - it also assumes the
> > current RCU implementation. There is
> > preempt_enable_no_resched_notrace(), but that might be worse because it
> > breaks scheduling guarantees.
> >
> > That being said, whatever we do here should be wrapped in some
> > rcu_read_lock/unlock_<newvariant>() helper.
> 
> We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c
> (or the x86-specific KMSAN header, depending on whether people are
> seeing the problem on s390 and Power) with some header magic.
> But that's probably more fragile than adding a helper.
> 
> >
> > Is there an existing helper we can use? If not, we need a variant that
> > can be used from extremely constrained contexts that can't even call
> > into the scheduler. And if we want pfn_valid() to switch to it, it also
> > should be fast.

The patch below also gets me a working kernel. For pfn_valid(), using
rcu_read_lock_sched() should be reasonable: its critical section is very
small, and it also lets pfn_valid() be called from more constrained
contexts again (like KMSAN).

Within KMSAN we also have to suppress reschedules. This is again not
ideal, but since it's limited to KMSAN it should be tolerable.

WDYT?

------ >8 ------

diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h
index 8fa6ac0e2d76..bbb1ba102129 100644
--- a/arch/x86/include/asm/kmsan.h
+++ b/arch/x86/include/asm/kmsan.h
@@ -64,6 +64,7 @@ static inline bool kmsan_virt_addr_valid(void *addr)
 {
 	unsigned long x = (unsigned long)addr;
 	unsigned long y = x - __START_KERNEL_map;
+	bool ret;
 
 	/* use the carry flag to determine if x was < __START_KERNEL_map */
 	if (unlikely(x > y)) {
@@ -79,7 +80,21 @@ static inline bool kmsan_virt_addr_valid(void *addr)
 			return false;
 	}
 
-	return pfn_valid(x >> PAGE_SHIFT);
+	/*
+	 * pfn_valid() relies on RCU, and may call into the scheduler on exiting
+	 * the critical section. However, this would result in recursion with
+	 * KMSAN. Therefore, disable preemption here, and re-enable preemption
+	 * below while suppressing rescheduls to avoid recursion.
+	 *
+	 * Note, this sacrifices occasionally breaking scheduling guarantees.
+	 * Although, a kernel compiled with KMSAN has already given up on any
+	 * performance guarantees due to being heavily instrumented.
+	 */
+	preempt_disable();
+	ret = pfn_valid(x >> PAGE_SHIFT);
+	preempt_enable_no_resched();
+
+	return ret;
 }
 
 #endif /* !MODULE */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4ed33b127821..a497f189d988 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2013,9 +2013,9 @@ static inline int pfn_valid(unsigned long pfn)
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	ms = __pfn_to_section(pfn);
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	if (!valid_section(ms)) {
-		rcu_read_unlock();
+		rcu_read_unlock_sched();
 		return 0;
 	}
 	/*
@@ -2023,7 +2023,7 @@ static inline int pfn_valid(unsigned long pfn)
 	 * the entire section-sized span.
 	 */
 	ret = early_section(ms) || pfn_section_valid(ms, pfn);
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return ret;
 }
Paul E. McKenney Jan. 25, 2024, 1:20 p.m. UTC | #17
On Thu, Jan 18, 2024 at 10:43:06AM +0100, Marco Elver wrote:
> On Thu, Jan 18, 2024 at 10:01AM +0100, Alexander Potapenko wrote:
> > >
> > > Hrm, rcu_read_unlock_sched_notrace() can still call
> > > __preempt_schedule_notrace(), which is again instrumented by KMSAN.
> > >
> > > This patch gets me a working kernel:
> > >
> [...]
> > > Disabling interrupts is a little heavy handed - it also assumes the
> > > current RCU implementation. There is
> > > preempt_enable_no_resched_notrace(), but that might be worse because it
> > > breaks scheduling guarantees.
> > >
> > > That being said, whatever we do here should be wrapped in some
> > > rcu_read_lock/unlock_<newvariant>() helper.
> > 
> > We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c
> > (or the x86-specific KMSAN header, depending on whether people are
> > seeing the problem on s390 and Power) with some header magic.
> > But that's probably more fragile than adding a helper.
> > 
> > >
> > > Is there an existing helper we can use? If not, we need a variant that
> > > can be used from extremely constrained contexts that can't even call
> > > into the scheduler. And if we want pfn_valid() to switch to it, it also
> > > should be fast.
> 
> The below patch also gets me a working kernel. For pfn_valid(), using
> rcu_read_lock_sched() should be reasonable, given its critical section
> is very small and also enables it to be called from more constrained
> contexts again (like KMSAN).
> 
> Within KMSAN we also have to suppress reschedules. This is again not
> ideal, but since it's limited to KMSAN should be tolerable.
> 
> WDYT?

I like this one better from a purely selfish RCU perspective.  ;-)

							Thanx, Paul

> ------ >8 ------
> 
> diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h
> index 8fa6ac0e2d76..bbb1ba102129 100644
> --- a/arch/x86/include/asm/kmsan.h
> +++ b/arch/x86/include/asm/kmsan.h
> @@ -64,6 +64,7 @@ static inline bool kmsan_virt_addr_valid(void *addr)
>  {
>  	unsigned long x = (unsigned long)addr;
>  	unsigned long y = x - __START_KERNEL_map;
> +	bool ret;
>  
>  	/* use the carry flag to determine if x was < __START_KERNEL_map */
>  	if (unlikely(x > y)) {
> @@ -79,7 +80,21 @@ static inline bool kmsan_virt_addr_valid(void *addr)
>  			return false;
>  	}
>  
> -	return pfn_valid(x >> PAGE_SHIFT);
> +	/*
> +	 * pfn_valid() relies on RCU, and may call into the scheduler on exiting
> +	 * the critical section. However, this would result in recursion with
> +	 * KMSAN. Therefore, disable preemption here, and re-enable preemption
> +	 * below while suppressing rescheduls to avoid recursion.
> +	 *
> +	 * Note, this sacrifices occasionally breaking scheduling guarantees.
> +	 * Although, a kernel compiled with KMSAN has already given up on any
> +	 * performance guarantees due to being heavily instrumented.
> +	 */
> +	preempt_disable();
> +	ret = pfn_valid(x >> PAGE_SHIFT);
> +	preempt_enable_no_resched();
> +
> +	return ret;
>  }
>  
>  #endif /* !MODULE */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4ed33b127821..a497f189d988 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -2013,9 +2013,9 @@ static inline int pfn_valid(unsigned long pfn)
>  	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>  		return 0;
>  	ms = __pfn_to_section(pfn);
> -	rcu_read_lock();
> +	rcu_read_lock_sched();
>  	if (!valid_section(ms)) {
> -		rcu_read_unlock();
> +		rcu_read_unlock_sched();
>  		return 0;
>  	}
>  	/*
> @@ -2023,7 +2023,7 @@ static inline int pfn_valid(unsigned long pfn)
>  	 * the entire section-sized span.
>  	 */
>  	ret = early_section(ms) || pfn_section_valid(ms, pfn);
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched();
>  
>  	return ret;
>  }

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc..c877396 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1987,6 +1987,7 @@  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 static inline int pfn_valid(unsigned long pfn)
 {
 	struct mem_section *ms;
+	int ret;
 
 	/*
 	 * Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2000,13 +2001,19 @@  static inline int pfn_valid(unsigned long pfn)
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	ms = __pfn_to_section(pfn);
-	if (!valid_section(ms))
+	rcu_read_lock();
+	if (!valid_section(ms)) {
+		rcu_read_unlock();
 		return 0;
+	}
 	/*
 	 * Traditionally early sections always returned pfn_valid() for
 	 * the entire section-sized span.
 	 */
-	return early_section(ms) || pfn_section_valid(ms, pfn);
+	ret = early_section(ms) || pfn_section_valid(ms, pfn);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e5..ca7dbe1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -792,6 +792,13 @@  static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		unsigned long section_nr = pfn_to_section_nr(pfn);
 
 		/*
+		 * Mark the section invalid so that valid_section()
+		 * return false. This prevents code from dereferencing
+		 * ms->usage array.
+		 */
+		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
+		/*
 		 * When removing an early section, the usage map is kept (as the
 		 * usage maps of other sections fall into the same page). It
 		 * will be re-used when re-adding the section - which is then no
@@ -799,16 +806,11 @@  static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		 * was allocated during boot.
 		 */
 		if (!PageReserved(virt_to_page(ms->usage))) {
+			synchronize_rcu();
 			kfree(ms->usage);
 			ms->usage = NULL;
 		}
 		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-		/*
-		 * Mark the section invalid so that valid_section()
-		 * return false. This prevents code from dereferencing
-		 * ms->usage array.
-		 */
-		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
 	}
 
 	/*