Message ID | 20211105035241.1239751-1-lang.yu@amd.com |
---|---|
State | New |
Series | mm/kmemleak: Avoid scanning potential huge holes |
On 05.11.21 04:52, Lang Yu wrote:
> When using devm_request_free_mem_region() and
> devm_memremap_pages() to add ZONE_DEVICE memory, if the requested
> free mem region's PFN is huge (e.g., 0x400000000; we found that on
> some AMD APUs, amdkfd svm will request such a free mem region),
> node_end_pfn() will also be huge (see move_pfn_range_to_zone()).
> That creates a huge hole between node_start_pfn() and node_end_pfn().
>
> In such a case, the following code snippet was actually just doing
> a busy test_bit() loop over the huge hole.
>
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 	struct page *page = pfn_to_online_page(pfn);
> 	if (!page)
> 		continue;
> 	...
> }
>
> So we got a soft lockup:
>
> watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
> CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
> RIP: 0010:pfn_to_online_page+0x5/0xd0
> Call Trace:
>  ? kmemleak_scan+0x16a/0x440
>  kmemleak_write+0x306/0x3a0
>  ? common_file_perm+0x72/0x170
>  full_proxy_write+0x5c/0x90
>  vfs_write+0xb9/0x260
>  ksys_write+0x67/0xe0
>  __x64_sys_write+0x1a/0x20
>  do_syscall_64+0x3b/0xc0
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> I did some tests with the patch.
>
> (1) amdgpu module unloaded
>
> before the patch:
>
> real	0m0.976s
> user	0m0.000s
> sys	0m0.968s
>
> after the patch:
>
> real	0m0.981s
> user	0m0.000s
> sys	0m0.973s
>
> (2) amdgpu module loaded
>
> before the patch:
>
> real	0m35.365s
> user	0m0.000s
> sys	0m35.354s
>
> after the patch:
>
> real	0m1.049s
> user	0m0.000s
> sys	0m1.042s
>
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>  mm/kmemleak.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index b57383c17cf6..d07444613a84 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1403,6 +1403,7 @@ static void kmemleak_scan(void)
>  {
>  	unsigned long flags;
>  	struct kmemleak_object *object;
> +	struct zone *zone;
>  	int i;
>  	int new_leaks = 0;
>
> @@ -1443,9 +1444,9 @@ static void kmemleak_scan(void)
>  	 * Struct page scanning for each node.
>  	 */
>  	get_online_mems();
> -	for_each_online_node(i) {
> -		unsigned long start_pfn = node_start_pfn(i);
> -		unsigned long end_pfn = node_end_pfn(i);
> +	for_each_populated_zone(zone) {
> +		unsigned long start_pfn = zone->zone_start_pfn;
> +		unsigned long end_pfn = zone_end_pfn(zone);
>  		unsigned long pfn;
>
>  		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> @@ -1455,7 +1456,7 @@ static void kmemleak_scan(void)
>  				continue;
>
>  			/* only scan pages belonging to this node */
> -			if (page_to_nid(page) != i)
> +			if (page_to_nid(page) != zone_to_nid(zone))

With overlapping zones you might rescan ranges ... instead we should do:

	/* only scan pages belonging to this zone */
	if (zone != page_zone(page))
		...

Or alternatively:

	/* only scan pages belonging to this node */
	if (page_to_nid(page) != zone_to_nid(zone))
		continue;
	/* only scan pages belonging to this zone */
	if (page_zonenum(page) != zone_idx(zone))
		continue;
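A minimal sketch of what the scan loop looks like with the page_zone()
check David suggests folded in. This is an illustration of the
suggestion only, assuming the v1 patch above is applied; it is not a
posted v2:

	get_online_mems();
	for_each_populated_zone(zone) {
		unsigned long start_pfn = zone->zone_start_pfn;
		unsigned long end_pfn = zone_end_pfn(zone);
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_online_page(pfn);

			if (!page)
				continue;
			/*
			 * Zones can overlap (e.g. ZONE_NORMAL and
			 * ZONE_MOVABLE), so a PFN may be spanned by
			 * several zones while belonging to exactly one.
			 * Comparing the zone structure itself also
			 * implicitly checks the node, so each online
			 * PFN is scanned exactly once.
			 */
			if (page_zone(page) != zone)
				continue;
			/* only scan if page is in use */
			if (page_count(page) == 0)
				continue;
			/* pages are not physically contiguous */
			scan_block(page, page + 1, NULL);
		}
	}
	put_online_mems();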
Hi Lang,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
url: https://github.com/0day-ci/linux/commits/Lang-Yu/mm-kmemleak-Avoid-scanning-potential-huge-holes/20211105-115425
base: https://github.com/hnaz/linux-mm master
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/d455f236523de8c6b91774a74eb3163d2ff16e2e
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Lang-Yu/mm-kmemleak-Avoid-scanning-potential-huge-holes/20211105-115425
git checkout d455f236523de8c6b91774a74eb3163d2ff16e2e
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=xtensa
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
mm/kmemleak.c: In function 'kmemleak_scan':
>> mm/kmemleak.c:1407:13: error: unused variable 'i' [-Werror=unused-variable]
    1407 |         int i;
         |             ^
cc1: all warnings being treated as errors
vim +/i +1407 mm/kmemleak.c
04609ccc40c4e8 Catalin Marinas 2009-10-28 1396
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1397 /*
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1398 * Scan data sections and all the referenced memory blocks allocated via the
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1399 * kernel's standard allocators. This function must be called with the
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1400 * scan_mutex held.
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1401 */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1402 static void kmemleak_scan(void)
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1403 {
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1404 unsigned long flags;
04609ccc40c4e8 Catalin Marinas 2009-10-28 1405 struct kmemleak_object *object;
d455f236523de8 Lang Yu 2021-11-05 1406 struct zone *zone;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 @1407 int i;
4698c1f2bbe44c Catalin Marinas 2009-06-26 1408 int new_leaks = 0;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1409
acf4968ec9dea4 Catalin Marinas 2009-06-26 1410 jiffies_last_scan = jiffies;
acf4968ec9dea4 Catalin Marinas 2009-06-26 1411
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1412 /* prepare the kmemleak_object's */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1413 rcu_read_lock();
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1414 list_for_each_entry_rcu(object, &object_list, object_list) {
8c96f1bc6fc49c He Zhe 2020-01-30 1415 raw_spin_lock_irqsave(&object->lock, flags);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1416 #ifdef DEBUG
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1417 /*
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1418 * With a few exceptions there should be a maximum of
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1419 * 1 reference to any object at this point.
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1420 */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1421 if (atomic_read(&object->use_count) > 1) {
ae281064be1643 Joe Perches 2009-06-23 1422 pr_debug("object->use_count = %d\n",
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1423 atomic_read(&object->use_count));
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1424 dump_object_info(object);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1425 }
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1426 #endif
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1427 /* reset the reference count (whiten the object) */
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1428 object->count = 0;
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1429 if (color_gray(object) && get_object(object))
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1430 list_add_tail(&object->gray_list, &gray_list);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1431
8c96f1bc6fc49c He Zhe 2020-01-30 1432 raw_spin_unlock_irqrestore(&object->lock, flags);
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1433 }
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1434 rcu_read_unlock();
3c7b4e6b8be4c1 Catalin Marinas 2009-06-11 1435
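The error is a direct fallout of the v1 patch: after the switch to
for_each_populated_zone(), the node iterator i has no user left. A
respin would presumably just drop the declaration, along these lines
(a sketch, not a posted fix):

--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1404,7 +1404,6 @@ static void kmemleak_scan(void)
 	unsigned long flags;
 	struct kmemleak_object *object;
 	struct zone *zone;
-	int i;
 	int new_leaks = 0;
 
 	jiffies_last_scan = jiffies;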
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
> On 05.11.21 04:52, Lang Yu wrote:
> > When using devm_request_free_mem_region() and
> > devm_memremap_pages() to add ZONE_DEVICE memory, ...

[...]

> >  			/* only scan pages belonging to this node */
> > -			if (page_to_nid(page) != i)
> > +			if (page_to_nid(page) != zone_to_nid(zone))
>
> With overlapping zones you might rescan ranges ... instead we should do:
>
> 	/* only scan pages belonging to this zone */
> 	if (zone != page_zone(page))
> 		...
>
> Or alternatively:
>
> 	/* only scan pages belonging to this node */
> 	if (page_to_nid(page) != zone_to_nid(zone))
> 		continue;
> 	/* only scan pages belonging to this zone */
> 	if (page_zonenum(page) != zone_idx(zone))
> 		continue;

The original code has covered that, i.e.,
only scan pages belonging to this node.
I didn't change that behavior.

Thanks,
Lang

> --
> Thanks,
>
> David / dhildenb
>
On 08.11.21 08:27, Lang Yu wrote:
> On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
>> On 05.11.21 04:52, Lang Yu wrote:
>>> When using devm_request_free_mem_region() and
>>> devm_memremap_pages() to add ZONE_DEVICE memory, ...

[...]

>>>  			/* only scan pages belonging to this node */
>>> -			if (page_to_nid(page) != i)
>>> +			if (page_to_nid(page) != zone_to_nid(zone))
>>
>> With overlapping zones you might rescan ranges ... instead we should do:
>>
>> 	/* only scan pages belonging to this zone */
>> 	if (zone != page_zone(page))
>> 		...
>>
>> Or alternatively:
>>
>> 	/* only scan pages belonging to this node */
>> 	if (page_to_nid(page) != zone_to_nid(zone))
>> 		continue;
>> 	/* only scan pages belonging to this zone */
>> 	if (page_zonenum(page) != zone_idx(zone))
>> 		continue;
>
> The original code has covered that, i.e.,
> only scan pages belonging to this node.
> I didn't change that behavior.

Again, you can easily have overlapping zones -- ZONE_NORMAL and
ZONE_MOVABLE -- in which case, a PFN is spanned by multiple zones, but
only belongs to a single zone.

The original code would scan each PFN exactly once, as it was iterating
the node PFNs. Your changed code might scan a single PFN multiple times,
if it's spanned by multiple zones.
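A two-line illustration of the distinction (hypothetical snippet, not
from the thread): whether a zone merely spans a PFN is a pure range
check, while the zone a page belongs to is recorded in page->flags and
recovered via page_zone():

	/* "spanned by" vs. "belongs to" for a given pfn */
	bool spanned = zone_spans_pfn(zone, pfn);		/* range check only */
	bool belongs = page_zone(pfn_to_page(pfn)) == zone;	/* authoritative */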
On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:
> On 08.11.21 08:27, Lang Yu wrote:
> > On Fri, Nov 05, 2021 at 02:14:50PM +0100, David Hildenbrand wrote:
> >> On 05.11.21 04:52, Lang Yu wrote:

[...]

> > The original code has covered that, i.e.,
> > only scan pages belonging to this node.
> > I didn't change that behavior.
>
> Again, you can easily have overlapping zones -- ZONE_NORMAL and
> ZONE_MOVABLE -- in which case, a PFN is spanned by multiple zones, but
> only belongs to a single zone.
>
> The original code would scan each PFN exactly once, as it was iterating
> the node PFNs. Your changed code might scan a single PFN multiple times,
> if it's spanned by multiple zones.

Did you mean a single PFN is shared by multiple zones belonging to the
same node here? Thanks!

Regards,
Lang

> --
> Thanks,
>
> David / dhildenb
>
On 08.11.21 10:06, Lang Yu wrote:
> On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:
>> On 08.11.21 08:27, Lang Yu wrote:

[...]

> Did you mean a single PFN is shared by multiple zones belonging to the
> same node here? Thanks!

Not shared, spanned. A PFN always belongs to exactly one ZONE+NODE, but
might be "spanned" by multiple nodes or multiple zones, because nodes
and zones can overlap. We can get the actual zone of a PFN via
page_zone(page) in my example above. Note that checking for the zone
structure (not the zone number/idx) implicitly checks for the node.

Let's take a look at an example:

[root@vm-0 ~]# cat /sys/devices/system/memory/memory32/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory33/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory34/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory35/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory36/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory37/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory38/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory39/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory40/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory41/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory42/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory43/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory44/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory45/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory46/valid_zones
Movable
[root@vm-0 ~]# cat /sys/devices/system/memory/memory47/valid_zones
Normal
[root@vm-0 ~]# cat /sys/devices/system/memory/memory48/valid_zones

# cat /proc/zoneinfo
Node 0, zone      DMA
  ...
  spanned  4095
  present  3998
  managed  3977
  ...
  start_pfn: 1
Node 0, zone    DMA32
  ...
  spanned  1044480
  present  782304
  managed  765920
  ...
  start_pfn: 4096
Node 0, zone   Normal
  ...
  spanned  524288
  present  393216
  managed  365736
  ...
  start_pfn: 1048576
Node 0, zone  Movable
  ...
  spanned  229376
  present  131072
  managed  131072
  start_pfn: 1310720

So Normal spans:

1048576 -> 1572863

And Movable spans:

1310720 -> 1540095

Both zones overlap.
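The endpoints above follow from start_pfn + spanned - 1. A throwaway
userspace C program (hypothetical, just to check the arithmetic)
reproduces the overlap:

#include <stdio.h>

/* Recompute the zone spans quoted above: a zone spans
 * [start_pfn, start_pfn + spanned - 1]. Numbers are from the
 * /proc/zoneinfo output in the mail.
 */
int main(void)
{
	unsigned long normal_start = 1048576, normal_spanned = 524288;
	unsigned long movable_start = 1310720, movable_spanned = 229376;
	unsigned long normal_end = normal_start + normal_spanned - 1;
	unsigned long movable_end = movable_start + movable_spanned - 1;

	printf("Normal  spans %lu -> %lu\n", normal_start, normal_end);
	printf("Movable spans %lu -> %lu\n", movable_start, movable_end);

	/* Movable starts before Normal ends, so the spans overlap and a
	 * naive per-zone PFN walk would visit 1310720..1540095 twice. */
	if (movable_start <= normal_end)
		printf("overlap: %lu -> %lu\n", movable_start,
		       movable_end < normal_end ? movable_end : normal_end);
	return 0;
}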
On Mon, Nov 08, 2021 at 10:24:43AM +0100, David Hildenbrand wrote:
> On 08.11.21 10:06, Lang Yu wrote:
> > On Mon, Nov 08, 2021 at 09:23:16AM +0100, David Hildenbrand wrote:

[...]

> Not shared, spanned. A PFN always belongs to exactly one ZONE+NODE, but
> might be "spanned" by multiple nodes or multiple zones, because nodes
> and zones can overlap. We can get the actual zone of a PFN via
> page_zone(page) in my example above. Note that checking for the zone
> structure (not the zone number/idx) implicitly checks for the node.

[...]

> So Normal spans:
>
> 1048576 -> 1572863
>
> And Movable spans:
>
> 1310720 -> 1540095
>
> Both zones overlap.

Thanks for your clarification! I got it.

Regards,
Lang

> --
> Thanks,
>
> David / dhildenb
>
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index b57383c17cf6..d07444613a84 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1403,6 +1403,7 @@ static void kmemleak_scan(void)
 {
 	unsigned long flags;
 	struct kmemleak_object *object;
+	struct zone *zone;
 	int i;
 	int new_leaks = 0;
 
@@ -1443,9 +1444,9 @@ static void kmemleak_scan(void)
 	 * Struct page scanning for each node.
 	 */
 	get_online_mems();
-	for_each_online_node(i) {
-		unsigned long start_pfn = node_start_pfn(i);
-		unsigned long end_pfn = node_end_pfn(i);
+	for_each_populated_zone(zone) {
+		unsigned long start_pfn = zone->zone_start_pfn;
+		unsigned long end_pfn = zone_end_pfn(zone);
 		unsigned long pfn;
 
 		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -1455,7 +1456,7 @@ static void kmemleak_scan(void)
 				continue;
 
 			/* only scan pages belonging to this node */
-			if (page_to_nid(page) != i)
+			if (page_to_nid(page) != zone_to_nid(zone))
 				continue;
 			/* only scan if page is in use */
 			if (page_count(page) == 0)
When using devm_request_free_mem_region() and
devm_memremap_pages() to add ZONE_DEVICE memory, if the requested
free mem region's PFN is huge (e.g., 0x400000000; we found that on
some AMD APUs, amdkfd svm will request such a free mem region),
node_end_pfn() will also be huge (see move_pfn_range_to_zone()).
That creates a huge hole between node_start_pfn() and node_end_pfn().

In such a case, the following code snippet was actually just doing
a busy test_bit() loop over the huge hole.

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	struct page *page = pfn_to_online_page(pfn);
	if (!page)
		continue;
	...
}

So we got a soft lockup:

watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
RIP: 0010:pfn_to_online_page+0x5/0xd0
Call Trace:
 ? kmemleak_scan+0x16a/0x440
 kmemleak_write+0x306/0x3a0
 ? common_file_perm+0x72/0x170
 full_proxy_write+0x5c/0x90
 vfs_write+0xb9/0x260
 ksys_write+0x67/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x3b/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

I did some tests with the patch.

(1) amdgpu module unloaded

before the patch:

real	0m0.976s
user	0m0.000s
sys	0m0.968s

after the patch:

real	0m0.981s
user	0m0.000s
sys	0m0.973s

(2) amdgpu module loaded

before the patch:

real	0m35.365s
user	0m0.000s
sys	0m35.354s

after the patch:

real	0m1.049s
user	0m0.000s
sys	0m1.042s

Signed-off-by: Lang Yu <lang.yu@amd.com>
---
 mm/kmemleak.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)