Message ID | 20211105035241.1239751-1-lang.yu@amd.com |
---|---|
State | New |
Series | mm/kmemleak: Avoid scanning potential huge holes |
On 05.11.21 04:52, Lang Yu wrote:
> When using devm_request_free_mem_region() and
> devm_memremap_pages() to add ZONE_DEVICE memory, if the requested
> free mem region's PFN is huge (e.g., 0x400000000; we found that on
> some AMD APUs, amdkfd svm will request such a free mem region),
> node_end_pfn() will also be huge (see move_pfn_range_to_zone()).
> That creates a huge hole between node_start_pfn() and node_end_pfn().
>
> In such a case, the following code snippet was actually just doing
> a busy test_bit() loop over the huge hole.
>
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 	struct page *page = pfn_to_online_page(pfn);
> 	if (!page)
> 		continue;
> 	...
> }
>
> So we got a soft lockup:
>
> watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
> CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
> RIP: 0010:pfn_to_online_page+0x5/0xd0
> Call Trace:
>  ? kmemleak_scan+0x16a/0x440
>  kmemleak_write+0x306/0x3a0
>  ? common_file_perm+0x72/0x170
>  full_proxy_write+0x5c/0x90
>  vfs_write+0xb9/0x260
>  ksys_write+0x67/0xe0
>  __x64_sys_write+0x1a/0x20
>  do_syscall_64+0x3b/0xc0
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> I did some tests with the patch.
>
> (1) amdgpu module unloaded
>
> before the patch:
>
> real	0m0.976s
> user	0m0.000s
> sys	0m0.968s
>
> after the patch:
>
> real	0m0.981s
> user	0m0.000s
> sys	0m0.973s
>
> (2) amdgpu module loaded
>
> before the patch:
>
> real	0m35.365s
> user	0m0.000s
> sys	0m35.354s
>
> after the patch:
>
> real	0m1.049s
> user	0m0.000s
> sys	0m1.042s
>
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>  mm/kmemleak.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index b57383c17cf6..d07444613a84 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1403,6 +1403,7 @@ static void kmemleak_scan(void)
>  {
>  	unsigned long flags;
>  	struct kmemleak_object *object;
> +	struct zone *zone;
>  	int i;
>  	int new_leaks = 0;
>
> @@ -1443,9 +1444,9 @@ static void kmemleak_scan(void)
>  	 * Struct page scanning for each node.
>  	 */
>  	get_online_mems();
> -	for_each_online_node(i) {
> -		unsigned long start_pfn = node_start_pfn(i);
> -		unsigned long end_pfn = node_end_pfn(i);
> +	for_each_populated_zone(zone) {
> +		unsigned long start_pfn = zone->zone_start_pfn;
> +		unsigned long end_pfn = zone_end_pfn(zone);
>  		unsigned long pfn;
>
>  		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> @@ -1455,7 +1456,7 @@ static void kmemleak_scan(void)
>  				continue;
>
>  			/* only scan pages belonging to this node */
> -			if (page_to_nid(page) != i)
> +			if (page_to_nid(page) != zone_to_nid(zone))

With overlapping zones you might rescan ranges ... instead we should do:

	/* only scan pages belonging to this zone */
	if (zone != page_zone(page))
		...

Or alternatively:

	/* only scan pages belonging to this node */
	if (page_to_nid(page) != zone_to_nid(zone))
		continue;
	/* only scan pages belonging to this zone */
	if (page_zonenum(page) != zone_idx(zone))
		continue;
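A minimal sketch of what the scan loop looks like with the page_zone()
check David suggests folded in. This is an illustration of the
suggestion only, assuming the v1 patch above is applied; it is not a
posted v2:

	get_online_mems();
	for_each_populated_zone(zone) {
		unsigned long start_pfn = zone->zone_start_pfn;
		unsigned long end_pfn = zone_end_pfn(zone);
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_online_page(pfn);

			if (!page)
				continue;
			/*
			 * Zones can overlap (e.g. ZONE_NORMAL and
			 * ZONE_MOVABLE), so a PFN may be spanned by
			 * several zones while belonging to exactly one.
			 * Comparing the zone structure itself also
			 * implicitly checks the node, so each online
			 * PFN is scanned exactly once.
			 */
			if (page_zone(page) != zone)
				continue;
			/* only scan if page is in use */
			if (page_count(page) == 0)
				continue;
			/* pages are not physically contiguous */
			scan_block(page, page + 1, NULL);
		}
	}
	put_online_mems();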
Hi Lang,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Lang-Yu/mm-kmemleak-Avoid-scanning-potential-huge-holes/20211105-115425
base: https://github.com/hnaz/linux-mm master
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/d455f236523de8c6b91774a74eb3163d2ff16e2e
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Lang-Yu/mm-kmemleak-Avoid-scanning-potential-huge-holes/20211105-115425
git checkout d455f236523de8c6b91774a74eb3163d2ff16e2e
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=xtensa
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
mm/kmemleak.c: In function 'kmemleak_scan':
>> mm/kmemleak.c:1407:13: error: unused variable 'i' [-Werror=unused-variable]
    1407 |         int i;
         |             ^
cc1: all warnings being treated as errors
vim +/i +1407 mm/kmemleak.c
04609ccc40c4e8 Catalin Marinas 2009-10-28 1396
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1397 /*
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1398 * Scan data sections and all the referenced memory blocks allocated via the
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1399 * kernel's standard allocators. This function must be called with the
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1400 * scan_mutex held.
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1401 */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1402 static void kmemleak_scan(void)
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1403 {
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1404 unsigned long flags;
04609ccc40c4e8 Catalin Marinas 2009-10-28 1405 struct kmemleak_object *object;
d455f236523de8 Lang Yu 2021-11-05 1406 struct zone *zone;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 @1407 int i;
4698c1f2bbe44c Catalin Marinas 2009-06-26 1408 int new_leaks = 0;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1409
acf4968ec9dea4 Catalin Marinas 2009-06-26 1410 jiffies_last_scan = jiffies;
acf4968ec9dea4 Catalin Marinas 2009-06-26 1411
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1412 /* prepare the kmemleak_object's */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1413 rcu_read_lock();
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1414 list_for_each_entry_rcu(object, &object_list, object_list) {
8c96f1bc6fc49c He Zhe 2020-01-30 1415 raw_spin_lock_irqsave(&object->lock, flags);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1416 #ifdef DEBUG
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1417 /*
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1418 * With a few exceptions there should be a maximum of
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1419 * 1 reference to any object at this point.
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1420 */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1421 if (atomic_read(&object->use_count) > 1) {
ae281064be1643 Joe Perches 2009-06-23 1422 pr_debug("object->use_count = %d\n",
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1423 atomic_read(&object->use_count));
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1424 dump_object_info(object);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1425 }
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1426 #endif
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1427 /* reset the reference count (whiten the object) */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1428 object->count = 0;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1429 if (color_gray(object) && get_object(object))
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1430 list_add_tail(&object->gray_list, &gray_list);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1431
8c96f1bc6fc49c He Zhe 2020-01-30 1432 raw_spin_unlock_irqrestore(&object->lock, flags);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1433 }
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1434 rcu_read_unlock();
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1435
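The error is a direct fallout of the v1 patch: after the switch to
for_each_populated_zone(), the node iterator i has no user left. A
respin would presumably just drop the declaration, along these lines
(a sketch, not a posted fix):

--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1404,7 +1404,6 @@ static void kmemleak_scan(void)
 	unsigned long flags;
 	struct kmemleak_object *object;
 	struct zone *zone;
-	int i;
 	int new_leaks = 0;
 
 	jiffies_last_scan = jiffies;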
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
> On 05.11.21 04:52, Lang Yu wrote:
> > When using devm_request_free_mem_region() and
> > devm_memremap_pages() to add ZONE_DEVICE memory, ...

[...]

> >  			/* only scan pages belonging to this node */
> > -			if (page_to_nid(page) != i)
> > +			if (page_to_nid(page) != zone_to_nid(zone))
>
> With overlapping zones you might rescan ranges ... instead we should do:
>
> 	/* only scan pages belonging to this zone */
> 	if (zone != page_zone(page))
> 		...
>
> Or alternatively:
>
> 	/* only scan pages belonging to this node */
> 	if (page_to_nid(page) != zone_to_nid(zone))
> 		continue;
> 	/* only scan pages belonging to this zone */
> 	if (page_zonenum(page) != zone_idx(zone))
> 		continue;

The original code has covered that, i.e.,
only scan pages belonging to this node.
I didn't change that behavior.

Thanks,
Lang

> --
> Thanks,
>
> David / dhildenb
>
On 08.11.21 08:27, Lang Yu wrote:
> On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
>> On 05.11.21 04:52, Lang Yu wrote:
>>> When using devm_request_free_mem_region() and
>>> devm_memremap_pages() to add ZONE_DEVICE memory, ...

[...]

>>>  			/* only scan pages belonging to this node */
>>> -			if (page_to_nid(page) != i)
>>> +			if (page_to_nid(page) != zone_to_nid(zone))
>>
>> With overlapping zones you might rescan ranges ... instead we should do:
>>
>> 	/* only scan pages belonging to this zone */
>> 	if (zone != page_zone(page))
>> 		...
>>
>> Or alternatively:
>>
>> 	/* only scan pages belonging to this node */
>> 	if (page_to_nid(page) != zone_to_nid(zone))
>> 		continue;
>> 	/* only scan pages belonging to this zone */
>> 	if (page_zonenum(page) != zone_idx(zone))
>> 		continue;
>
> The original code has covered that, i.e.,
> only scan pages belonging to this node.
> I didn't change that behavior.

Again, you can easily have overlapping zones -- ZONE_NORMAL and
ZONE_MOVABLE -- in which case, a PFN is spanned by multiple zones, but
only belongs to a single zone.

The original code would scan each PFN exactly once, as it was iterating
the node PFNs. Your changed code might scan a single PFN multiple times,
if it's spanned by multiple zones.
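A two-line illustration of the distinction (hypothetical snippet, not
from the thread): whether a zone merely spans a PFN is a pure range
check, while the zone a page belongs to is recorded in page->flags and
recovered via page_zone():

	/* "spanned by" vs. "belongs to" for a given pfn */
	bool spanned = zone_spans_pfn(zone, pfn);		/* range check only */
	bool belongs = page_zone(pfn_to_page(pfn)) == zone;	/* authoritative */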
On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:
> On 08.11.21 08:27, Lang Yu wrote:
> > On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
> >> On 05.11.21 04:52, Lang Yu wrote:

[...]

> > The original code has covered that, i.e.,
> > only scan pages belonging to this node.
> > I didn't change that behavior.
>
> Again, you can easily have overlapping zones -- ZONE_NORMAL and
> ZONE_MOVABLE -- in which case, a PFN is spanned by multiple zones, but
> only belongs to a single zone.
>
> The original code would scan each PFN exactly once, as it was iterating
> the node PFNs. Your changed code might scan a single PFN multiple times,
> if it's spanned by multiple zones.

Did you mean a single PFN is shared by multiple zones belonging to the
same node here? Thanks!

Regards,
Lang

> --
> Thanks,
>
> David / dhildenb
>
On 08.11.21 10:06, Lang Yu wrote:
> On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:
>> On 08.11.21 08:27, Lang Yu wrote:

[...]

> Did you mean a single PFN is shared by multiple zones belonging to the
> same node here? Thanks!

Not shared, spanned. A PFN always belongs to exactly one ZONE+NODE, but
might be "spanned" by multiple nodes or multiple zones, because nodes
and zones can overlap. We can get the actual zone of a PFN via
page_zone(page) in my example above. Note that checking for the zone
structure (not the zone number/idx) implicitly checks for the node.

Let's take a look at an example:

[root@vm-0 ~]# cat /sys/devices/system/memory/memory32/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory33/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory34/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory35/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory36/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory37/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory38/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory39/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory40/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory41/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory42/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory43/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory44/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory45/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory46/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory47/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory48/valid_zones

# cat /proc/zoneinfo
Node 0, zone      DMA
  ...
  spanned  4095
  present  3998
  managed  3977
  ...
  start_pfn: 1
Node 0, zone    DMA32
  ...
  spanned  1044480
  present  782304
  managed  765920
  ...
  start_pfn: 4096
Node 0, zone   Normal
  ...
  spanned  524288
  present  393216
  managed  365736
  ...
  start_pfn: 1048576
Node 0, zone  Movable
  ...
  spanned  229376
  present  131072
  managed  131072
  start_pfn: 1310720

So Normal spans:

1048576 -> 1572863

And Movable spans:

1310720 -> 1540095

Both zones overlap.
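The endpoints above follow from start_pfn + spanned - 1. A throwaway
userspace C program (hypothetical, just to check the arithmetic)
reproduces the overlap:

#include <stdio.h>

/* Recompute the zone spans quoted above: a zone spans
 * [start_pfn, start_pfn + spanned - 1]. Numbers are from the
 * /proc/zoneinfo output in the mail.
 */
int main(void)
{
	unsigned long normal_start = 1048576, normal_spanned = 524288;
	unsigned long movable_start = 1310720, movable_spanned = 229376;
	unsigned long normal_end = normal_start + normal_spanned - 1;
	unsigned long movable_end = movable_start + movable_spanned - 1;

	printf("Normal  spans %lu -> %lu\n", normal_start, normal_end);
	printf("Movable spans %lu -> %lu\n", movable_start, movable_end);

	/* Movable starts before Normal ends, so the spans overlap and a
	 * naive per-zone PFN walk would visit 1310720..1540095 twice. */
	if (movable_start <= normal_end)
		printf("overlap: %lu -> %lu\n", movable_start,
		       movable_end < normal_end ? movable_end : normal_end);
	return 0;
}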
On Mon, Nov 08, 2021 at 10:24:43AM +0100, David Hildenbrand wrote:
> On 08.11.21 10:06, Lang Yu wrote:
> > On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:

[...]

> Not shared, spanned. A PFN always belongs to exactly one ZONE+NODE, but
> might be "spanned" by multiple nodes or multiple zones, because nodes
> and zones can overlap. We can get the actual zone of a PFN via
> page_zone(page) in my example above. Note that checking for the zone
> structure (not the zone number/idx) implicitly checks for the node.

[...]

> So Normal spans:
>
> 1048576 -> 1572863
>
> And Movable spans:
>
> 1310720 -> 1540095
>
> Both zones overlap.

Thanks for your clarification! I got it.

Regards,
Lang

> --
> Thanks,
>
> David / dhildenb
>
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index b57383c17cf6..d07444613a84 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1403,6 +1403,7 @@ static void kmemleak_scan(void)
 {
 	unsigned long flags;
 	struct kmemleak_object *object;
+	struct zone *zone;
 	int i;
 	int new_leaks = 0;
 
@@ -1443,9 +1444,9 @@ static void kmemleak_scan(void)
 	 * Struct page scanning for each node.
 	 */
 	get_online_mems();
-	for_each_online_node(i) {
-		unsigned long start_pfn = node_start_pfn(i);
-		unsigned long end_pfn = node_end_pfn(i);
+	for_each_populated_zone(zone) {
+		unsigned long start_pfn = zone->zone_start_pfn;
+		unsigned long end_pfn = zone_end_pfn(zone);
 		unsigned long pfn;
 
 		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -1455,7 +1456,7 @@ static void kmemleak_scan(void)
 				continue;
 
 			/* only scan pages belonging to this node */
-			if (page_to_nid(page) != i)
+			if (page_to_nid(page) != zone_to_nid(zone))
 				continue;
 			/* only scan if page is in use */
 			if (page_count(page) == 0)
When using devm_request_free_mem_region() and
devm_memremap_pages() to add ZONE_DEVICE memory, if the requested
free mem region's PFN is huge (e.g., 0x400000000; we found that on
some AMD APUs, amdkfd svm will request such a free mem region),
node_end_pfn() will also be huge (see move_pfn_range_to_zone()).
That creates a huge hole between node_start_pfn() and node_end_pfn().

In such a case, the following code snippet was actually just doing
a busy test_bit() loop over the huge hole.

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	struct page *page = pfn_to_online_page(pfn);
	if (!page)
		continue;
	...
}

So we got a soft lockup:

watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
RIP: 0010:pfn_to_online_page+0x5/0xd0
Call Trace:
 ? kmemleak_scan+0x16a/0x440
 kmemleak_write+0x306/0x3a0
 ? common_file_perm+0x72/0x170
 full_proxy_write+0x5c/0x90
 vfs_write+0xb9/0x260
 ksys_write+0x67/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x3b/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

I did some tests with the patch.

(1) amdgpu module unloaded

before the patch:

real	0m0.976s
user	0m0.000s
sys	0m0.968s

after the patch:

real	0m0.981s
user	0m0.000s
sys	0m0.973s

(2) amdgpu module loaded

before the patch:

real	0m35.365s
user	0m0.000s
sys	0m35.354s

after the patch:

real	0m1.049s
user	0m0.000s
sys	0m1.042s

Signed-off-by: Lang Yu <lang.yu@amd.com>
---
 mm/kmemleak.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)