
mm: hugetlb: independent PMD page table shared count

Message ID 20241214104401.1052550-1-liushixin2@huawei.com (mailing list archive)
State New
Series: mm: hugetlb: independent PMD page table shared count

Commit Message

Liu Shixin Dec. 14, 2024, 10:44 a.m. UTC
The folio refcount may be increased unexpectedly through try_get_folio() by
callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
to check whether a PMD page table is shared. The check is incorrect if the
refcount is increased by such a caller, and this can cause the page table
to be leaked:

 BUG: Bad page state in process sh  pfn:109324
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
 page_type: f2(table)
 raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
 page dumped because: nonzero mapcount
 ...
 CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ #7
 Tainted: [B]=BAD_PAGE
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  show_stack+0x20/0x38 (C)
  dump_stack_lvl+0x80/0xf8
  dump_stack+0x18/0x28
  bad_page+0x8c/0x130
  free_page_is_bad_report+0xa4/0xb0
  free_unref_page+0x3cc/0x620
  __folio_put+0xf4/0x158
  split_huge_pages_all+0x1e0/0x3e8
  split_huge_pages_write+0x25c/0x2d8
  full_proxy_write+0x64/0xd8
  vfs_write+0xcc/0x280
  ksys_write+0x70/0x110
  __arm64_sys_write+0x24/0x38
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0xc8/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x34/0x128
  el0t_64_sync_handler+0xc8/0xd0
  el0t_64_sync+0x190/0x198

The issue may be triggered by damon, offline_page, page_idle, etc., which
can increase the refcount of a page table.

Fix it by introducing an independent PMD page table shared count.

Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 28 ++++++++++++++++++++++++++++
 mm/hugetlb.c             | 16 +++++++---------
 3 files changed, 36 insertions(+), 9 deletions(-)
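
For context, here is an annotated copy of the pre-patch check in
huge_pmd_unshare() that the diff below removes; the comments are
explanatory additions sketching the failure mode, not part of the
original code:

	/*
	 * page_count() conflates two things: the "shared by N mms"
	 * references taken in huge_pmd_share() and transient speculative
	 * references such as the one from try_get_folio().  A speculative
	 * reference makes an unshared page table look shared: the pud is
	 * cleared and one reference dropped, the hugetlb pages the table
	 * maps are never unmapped, and the final folio_put() from the
	 * speculative holder frees the still-constructed page table.
	 */
	BUG_ON(page_count(virt_to_page(ptep)) == 0);
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;	/* believed unshared: caller unmaps and frees */

	pud_clear(pud);
	put_page(virt_to_page(ptep));	/* believed shared: drop one sharer ref */
	mm_dec_nr_pmds(mm);
	return 1;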

Comments

kernel test robot Dec. 15, 2024, 5:11 a.m. UTC | #1
Hi Liu,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Liu-Shixin/mm-hugetlb-independent-PMD-page-table-shared-count/20241214-184912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241214104401.1052550-1-liushixin2%40huawei.com
patch subject: [PATCH] mm: hugetlb: independent PMD page table shared count
config: i386-buildonly-randconfig-006-20241215 (https://download.01.org/0day-ci/archive/20241215/202412151253.G1eTYt78-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412151253.G1eTYt78-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412151253.G1eTYt78-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:21:
>> include/linux/mm.h:3192:21: error: called object type 'void *' is not a function or function pointer
    3192 |         ptdesc_pmd_pts_init(ptdesc);
         |         ~~~~~~~~~~~~~~~~~~~^
   1 error generated.
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=1966107866
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1263: prepare0] Error 2 shuffle=1966107866
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=1966107866
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=1966107866
   make: Target 'prepare' not remade because of errors.


vim +3192 include/linux/mm.h

  3184	
  3185	static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
  3186	{
  3187		struct folio *folio = ptdesc_folio(ptdesc);
  3188	
  3189		if (!pmd_ptlock_init(ptdesc))
  3190			return false;
  3191		__folio_set_pgtable(folio);
> 3192		ptdesc_pmd_pts_init(ptdesc);
  3193		lruvec_stat_add_folio(folio, NR_PAGETABLE);
  3194		return true;
  3195	}
  3196
kernel test robot Dec. 15, 2024, 5:11 a.m. UTC | #2
Hi Liu,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Liu-Shixin/mm-hugetlb-independent-PMD-page-table-shared-count/20241214-184912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241214104401.1052550-1-liushixin2%40huawei.com
patch subject: [PATCH] mm: hugetlb: independent PMD page table shared count
config: i386-buildonly-randconfig-001-20241215 (https://download.01.org/0day-ci/archive/20241215/202412151252.7ESa61O9-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412151252.7ESa61O9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412151252.7ESa61O9-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/uapi/linux/posix_types.h:5,
                    from include/uapi/linux/types.h:14,
                    from include/linux/types.h:6,
                    from include/linux/kasan-checks.h:5,
                    from include/asm-generic/rwonce.h:26,
                    from ./arch/x86/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:337,
                    from include/linux/build_bug.h:5,
                    from include/linux/container_of.h:5,
                    from include/linux/list.h:5,
                    from include/linux/swait.h:5,
                    from include/linux/completion.h:12,
                    from include/linux/crypto.h:15,
                    from arch/x86/kernel/asm-offsets.c:9:
   include/linux/mm.h: In function 'pagetable_pmd_ctor':
>> include/linux/stddef.h:8:14: error: called object is not a function or function pointer
       8 | #define NULL ((void *)0)
         |              ^
   include/linux/mm_types.h:553:29: note: in expansion of macro 'NULL'
     553 | #define ptdesc_pmd_pts_init NULL
         |                             ^~~~
   include/linux/mm.h:3192:9: note: in expansion of macro 'ptdesc_pmd_pts_init'
    3192 |         ptdesc_pmd_pts_init(ptdesc);
         |         ^~~~~~~~~~~~~~~~~~~
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=1659946440
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1263: prepare0] Error 2 shuffle=1659946440
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=1659946440
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=1659946440
   make: Target 'prepare' not remade because of errors.


vim +8 include/linux/stddef.h

^1da177e4c3f415 Linus Torvalds   2005-04-16  6  
^1da177e4c3f415 Linus Torvalds   2005-04-16  7  #undef NULL
^1da177e4c3f415 Linus Torvalds   2005-04-16 @8  #define NULL ((void *)0)
6e2182874324727 Richard Knutsson 2006-09-30  9
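
Both reports point at the same root cause: in the
!CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING branch, ptdesc_pmd_pts_init is
defined as NULL, yet pagetable_pmd_ctor() calls it unconditionally. A
minimal fix, sketched here as an assumption rather than taken from a
posted v2, is to make the stub an empty inline function so the call
compiles away when PMD sharing is disabled:

#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
{
	atomic_set(&ptdesc->pt_share_count, 0);
}
#else
/* Empty stub: keeps the unconditional call in pagetable_pmd_ctor()
 * buildable when PMD page table sharing is compiled out. */
static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
{
}
#endif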
kernel test robot Dec. 15, 2024, 7:45 a.m. UTC | #3
Hi Liu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Liu-Shixin/mm-hugetlb-independent-PMD-page-table-shared-count/20241214-184912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241214104401.1052550-1-liushixin2%40huawei.com
patch subject: [PATCH] mm: hugetlb: independent PMD page table shared count
config: hexagon-randconfig-001-20241215 (https://download.01.org/0day-ci/archive/20241215/202412151541.SQEZStgy-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 2dc22615fd46ab2566d0f26d5ba234ab12dc4bf8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412151541.SQEZStgy-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412151541.SQEZStgy-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/dma/direct.c:7:
   In file included from include/linux/memblock.h:12:
   include/linux/mm.h:3192:21: error: called object type 'void *' is not a function or function pointer
    3192 |         ptdesc_pmd_pts_init(ptdesc);
         |         ~~~~~~~~~~~~~~~~~~~^
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
     146 |                 if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     147 |                     phys_limit < DMA_BIT_MASK(64) &&
         |                     ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
     148 |                     !(gfp & (GFP_DMA32 | GFP_DMA))) {
         |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:73:54: note: expanded from macro 'DMA_BIT_MASK'
      73 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                            ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:57:52: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                    ^~~~
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
     146 |                 if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     147 |                     phys_limit < DMA_BIT_MASK(64) &&
         |                     ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
     148 |                     !(gfp & (GFP_DMA32 | GFP_DMA))) {
         |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:73:54: note: expanded from macro 'DMA_BIT_MASK'
      73 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                            ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:57:61: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                             ^~~~
>> kernel/dma/direct.c:147:20: warning: shift count >= width of type [-Wshift-count-overflow]
     146 |                 if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     147 |                     phys_limit < DMA_BIT_MASK(64) &&
         |                     ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
     148 |                     !(gfp & (GFP_DMA32 | GFP_DMA))) {
         |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:73:54: note: expanded from macro 'DMA_BIT_MASK'
      73 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                            ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:57:86: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                                     ~~~~~~~~~~~~~~~~~^~~~~
   include/linux/compiler.h:68:3: note: expanded from macro '__trace_if_value'
      68 |         (cond) ?                                        \
         |          ^~~~
   3 warnings and 1 error generated.


vim +147 kernel/dma/direct.c

aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig      2021-10-21  117  
26749b3201ab05e kernel/dma/direct.c Christoph Hellwig      2020-06-15  118  static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
92826e967535db2 kernel/dma/direct.c Christoph Hellwig      2022-04-23  119  		gfp_t gfp, bool allow_highmem)
a8463d4b0e47d1f lib/dma-noop.c      Christian Borntraeger  2016-02-02  120  {
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig      2019-08-20  121  	int node = dev_to_node(dev);
080321d3b3139b3 lib/dma-direct.c    Christoph Hellwig      2017-12-22  122  	struct page *page = NULL;
a7ba70f1787f977 kernel/dma/direct.c Nicolas Saenz Julienne 2019-11-21  123  	u64 phys_limit;
a8463d4b0e47d1f lib/dma-noop.c      Christian Borntraeger  2016-02-02  124  
633d5fce78a61e8 kernel/dma/direct.c David Rientjes         2020-06-11  125  	WARN_ON_ONCE(!PAGE_ALIGNED(size));
633d5fce78a61e8 kernel/dma/direct.c David Rientjes         2020-06-11  126  
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig      2021-10-21  127  	if (is_swiotlb_for_alloc(dev))
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig      2021-10-21  128  		return dma_direct_alloc_swiotlb(dev, size);
aea7e2a86a94b25 kernel/dma/direct.c Christoph Hellwig      2021-10-21  129  
25a4ce564921db0 kernel/dma/direct.c Petr Tesarik           2023-02-20  130  	gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
633d5fce78a61e8 kernel/dma/direct.c David Rientjes         2020-06-11  131  	page = dma_alloc_contiguous(dev, size, gfp);
92826e967535db2 kernel/dma/direct.c Christoph Hellwig      2022-04-23  132  	if (page) {
92826e967535db2 kernel/dma/direct.c Christoph Hellwig      2022-04-23  133  		if (!dma_coherent_ok(dev, page_to_phys(page), size) ||
92826e967535db2 kernel/dma/direct.c Christoph Hellwig      2022-04-23  134  		    (!allow_highmem && PageHighMem(page))) {
633d5fce78a61e8 kernel/dma/direct.c David Rientjes         2020-06-11  135  			dma_free_contiguous(dev, page, size);
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig      2019-08-20  136  			page = NULL;
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig      2019-08-20  137  		}
92826e967535db2 kernel/dma/direct.c Christoph Hellwig      2022-04-23  138  	}
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  139  again:
90ae409f9eb3bca kernel/dma/direct.c Christoph Hellwig      2019-08-20  140  	if (!page)
633d5fce78a61e8 kernel/dma/direct.c David Rientjes         2020-06-11  141  		page = alloc_pages_node(node, gfp, get_order(size));
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  142  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
f689a3ab7b8ece9 kernel/dma/direct.c Chen Yu                2024-08-31  143  		__free_pages(page, get_order(size));
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  144  		page = NULL;
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  145  
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  146  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
a7ba70f1787f977 kernel/dma/direct.c Nicolas Saenz Julienne 2019-11-21 @147  		    phys_limit < DMA_BIT_MASK(64) &&
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  148  		    !(gfp & (GFP_DMA32 | GFP_DMA))) {
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  149  			gfp |= GFP_DMA32;
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  150  			goto again;
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  151  		}
de7eab301de7886 lib/dma-direct.c    Takashi Iwai           2018-04-16  152  
fbce251baa6e357 kernel/dma/direct.c Christoph Hellwig      2019-02-13  153  		if (IS_ENABLED(CONFIG_ZONE_DMA) && !(gfp & GFP_DMA)) {
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  154  			gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  155  			goto again;
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  156  		}
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  157  	}
95f183916d4b0bc lib/dma-direct.c    Christoph Hellwig      2018-01-09  158  
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig      2018-11-04  159  	return page;
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig      2018-11-04  160  }
b18814e767a4455 kernel/dma/direct.c Christoph Hellwig      2018-11-04  161
David Hildenbrand Dec. 16, 2024, 3:34 p.m. UTC | #4
On 14.12.24 11:44, Liu Shixin wrote:
> The folio refcount may be increased unexpectedly through try_get_folio() by
> callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
> to check whether a PMD page table is shared. The check is incorrect if the
> refcount is increased by such a caller, and this can cause the page table
> to be leaked:

Are you sure it is "leaked" ?

I assume what happens is that we end up freeing a page table without 
calling its destructor. That's why page freeing code complains about 
"nonzero mapcount" (overlaid by something else).

>
>   BUG: Bad page state in process sh  pfn:109324
>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
>   flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
>   page_type: f2(table)
>   raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
>   raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
>   page dumped because: nonzero mapcount
>   ...
>   CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ #7
>   Tainted: [B]=BAD_PAGE
>   Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
>   Call trace:
>    show_stack+0x20/0x38 (C)
>    dump_stack_lvl+0x80/0xf8
>    dump_stack+0x18/0x28
>    bad_page+0x8c/0x130
>    free_page_is_bad_report+0xa4/0xb0
>    free_unref_page+0x3cc/0x620
>    __folio_put+0xf4/0x158
>    split_huge_pages_all+0x1e0/0x3e8
>    split_huge_pages_write+0x25c/0x2d8
>    full_proxy_write+0x64/0xd8
>    vfs_write+0xcc/0x280
>    ksys_write+0x70/0x110
>    __arm64_sys_write+0x24/0x38
>    invoke_syscall+0x50/0x120
>    el0_svc_common.constprop.0+0xc8/0xf0
>    do_el0_svc+0x24/0x38
>    el0_svc+0x34/0x128
>    el0t_64_sync_handler+0xc8/0xd0
>    el0t_64_sync+0x190/0x198
> 
> The issue may be triggered by damon, offline_page, page_idle, etc., which
> can increase the refcount of a page table.

Right, many do have a racy folio_test_lru() check in there that 
prevents "most of the harm", but not all of them.
Jane Chu Dec. 16, 2024, 6:33 p.m. UTC | #5
On 12/14/2024 2:44 AM, Liu Shixin wrote:

> The folio refcount may be increased unexpectedly through try_get_folio() by
> callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
> to check whether a PMD page table is shared. The check is incorrect if the
> refcount is increased by such a caller, and this can cause the page table
> to be leaked:

hugetlb and THP don't overlap, right? How does split_huge_pages() end 
up messing up huge_pmd_share()?

Am I missing something?

>
>   BUG: Bad page state in process sh  pfn:109324
>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
>   flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
>   page_type: f2(table)
>   raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
>   raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
>   page dumped because: nonzero mapcount
>   ...
>   CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ #7
>   Tainted: [B]=BAD_PAGE
>   Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
>   Call trace:
>    show_stack+0x20/0x38 (C)
>    dump_stack_lvl+0x80/0xf8
>    dump_stack+0x18/0x28
>    bad_page+0x8c/0x130
>    free_page_is_bad_report+0xa4/0xb0
>    free_unref_page+0x3cc/0x620
>    __folio_put+0xf4/0x158
>    split_huge_pages_all+0x1e0/0x3e8
>    split_huge_pages_write+0x25c/0x2d8
>    full_proxy_write+0x64/0xd8
>    vfs_write+0xcc/0x280
>    ksys_write+0x70/0x110
>    __arm64_sys_write+0x24/0x38
>    invoke_syscall+0x50/0x120
>    el0_svc_common.constprop.0+0xc8/0xf0
>    do_el0_svc+0x24/0x38
>    el0_svc+0x34/0x128
>    el0t_64_sync_handler+0xc8/0xd0
>    el0t_64_sync+0x190/0x198
>
> The issue may be triggered by damon, offline_page, page_idle, etc., which
> can increase the refcount of a page table.
>
> Fix it by introducing an independent PMD page table shared count.
>
> Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>

thanks,

-jane

> ---
>   include/linux/mm.h       |  1 +
>   include/linux/mm_types.h | 28 ++++++++++++++++++++++++++++
>   mm/hugetlb.c             | 16 +++++++---------
>   3 files changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c39c4945946c..50fbf2a1b0ad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3115,6 +3115,7 @@ static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
>   	if (!pmd_ptlock_init(ptdesc))
>   		return false;
>   	__folio_set_pgtable(folio);
> +	ptdesc_pmd_pts_init(ptdesc);
>   	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>   	return true;
>   }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7361a8f3ab68..740b765ac91e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -445,6 +445,7 @@ FOLIO_MATCH(compound_head, _head_2a);
>    * @pt_index:         Used for s390 gmap.
>    * @pt_mm:            Used for x86 pgds.
>    * @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
> + * @pt_share_count:   Used for HugeTLB PMD page table share count.
>    * @_pt_pad_2:        Padding to ensure proper alignment.
>    * @ptl:              Lock for the page table.
>    * @__page_type:      Same as page->page_type. Unused for page tables.
> @@ -471,6 +472,9 @@ struct ptdesc {
>   		pgoff_t pt_index;
>   		struct mm_struct *pt_mm;
>   		atomic_t pt_frag_refcount;
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +		atomic_t pt_share_count;
> +#endif
>   	};
>   
>   	union {
> @@ -516,6 +520,30 @@ static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
>   	const struct page *:		(const struct ptdesc *)(p),	\
>   	struct page *:			(struct ptdesc *)(p)))
>   
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
> +{
> +	atomic_set(&ptdesc->pt_share_count, 0);
> +}
> +
> +static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
> +{
> +	atomic_inc(&ptdesc->pt_share_count);
> +}
> +
> +static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
> +{
> +	atomic_dec(&ptdesc->pt_share_count);
> +}
> +
> +static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
> +{
> +	return atomic_read(&ptdesc->pt_share_count);
> +}
> +#else
> +#define ptdesc_pmd_pts_init NULL
> +#endif
> +
>   /*
>    * Used for sizing the vmemmap region on some architectures
>    */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea2ed8e301ef..60846b060b87 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7212,7 +7212,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>   			spte = hugetlb_walk(svma, saddr,
>   					    vma_mmu_pagesize(svma));
>   			if (spte) {
> -				get_page(virt_to_page(spte));
> +				ptdesc_pmd_pts_inc(virt_to_ptdesc(spte));
>   				break;
>   			}
>   		}
> @@ -7227,7 +7227,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>   				(pmd_t *)((unsigned long)spte & PAGE_MASK));
>   		mm_inc_nr_pmds(mm);
>   	} else {
> -		put_page(virt_to_page(spte));
> +		ptdesc_pmd_pts_dec(virt_to_ptdesc(spte));
>   	}
>   	spin_unlock(&mm->page_table_lock);
>   out:
> @@ -7239,10 +7239,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>   /*
>    * unmap huge page backed by shared pte.
>    *
> - * Hugetlb pte page is ref counted at the time of mapping.  If pte is shared
> - * indicated by page_count > 1, unmap is achieved by clearing pud and
> - * decrementing the ref count. If count == 1, the pte page is not shared.
> - *
>    * Called with page table lock held.
>    *
>    * returns: 1 successfully unmapped a shared pte page
> @@ -7251,18 +7247,20 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>   int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>   					unsigned long addr, pte_t *ptep)
>   {
> +	unsigned long sz = huge_page_size(hstate_vma(vma));
>   	pgd_t *pgd = pgd_offset(mm, addr);
>   	p4d_t *p4d = p4d_offset(pgd, addr);
>   	pud_t *pud = pud_offset(p4d, addr);
>   
>   	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>   	hugetlb_vma_assert_locked(vma);
> -	BUG_ON(page_count(virt_to_page(ptep)) == 0);
> -	if (page_count(virt_to_page(ptep)) == 1)
> +	if (sz != PMD_SIZE)
> +		return 0;
> +	if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
>   		return 0;
>   
>   	pud_clear(pud);
> -	put_page(virt_to_page(ptep));
> +	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
>   	mm_dec_nr_pmds(mm);
>   	return 1;
>   }
Liu Shixin Dec. 17, 2024, 2:02 a.m. UTC | #6
On 2024/12/16 23:34, David Hildenbrand wrote:
> On 14.12.24 11:44, Liu Shixin wrote:
>> The folio refcount may be increased unexpectedly through try_get_folio() by
>> callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
>> to check whether a PMD page table is shared. The check is incorrect if the
>> refcount is increased by such a caller, and this can cause the page table
>> to be leaked:
>
> Are you sure it is "leaked" ?
>
> I assume what happens is that we end up freeing a page table without calling its destructor. That's why page freeing code complains about "nonzero mapcount" (overlaid by something else).

1. The page table itself will be discarded after reporting the "nonzero mapcount".

2. The HugeTLB page mapped by the page table misses being freed, since we treat the page table as shared
    and a shared page table will not be unmapped.

>
>>   BUG: Bad page state in process sh  pfn:109324
>>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
>>   flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
>>   page_type: f2(table)
>>   raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
>>   raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
>>   page dumped because: nonzero mapcount
>>   ...
>>   CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ #7
>>   Tainted: [B]=BAD_PAGE
>>   Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
>>   Call trace:
>>    show_stack+0x20/0x38 (C)
>>    dump_stack_lvl+0x80/0xf8
>>    dump_stack+0x18/0x28
>>    bad_page+0x8c/0x130
>>    free_page_is_bad_report+0xa4/0xb0
>>    free_unref_page+0x3cc/0x620
>>    __folio_put+0xf4/0x158
>>    split_huge_pages_all+0x1e0/0x3e8
>>    split_huge_pages_write+0x25c/0x2d8
>>    full_proxy_write+0x64/0xd8
>>    vfs_write+0xcc/0x280
>>    ksys_write+0x70/0x110
>>    __arm64_sys_write+0x24/0x38
>>    invoke_syscall+0x50/0x120
>>    el0_svc_common.constprop.0+0xc8/0xf0
>>    do_el0_svc+0x24/0x38
>>    el0_svc+0x34/0x128
>>    el0t_64_sync_handler+0xc8/0xd0
>>    el0t_64_sync+0x190/0x198
>>
>> The issue may be triggered by damon, offline_page, page_idle, etc., which
>> can increase the refcount of a page table.
>
> Right, many do have a racy folio_test_lru() check in there that prevents "most of the harm", but not all of them.
Yes, this makes the problem nearly impossible to trigger for some functions, but it is not really safe.

thanks,
>
>
David Hildenbrand Dec. 17, 2024, 11:55 a.m. UTC | #7
On 16.12.24 19:33, jane.chu@oracle.com wrote:
> On 12/14/2024 2:44 AM, Liu Shixin wrote:
> 
>> The folio refcount may be increased unexpectedly through try_get_folio() by
>> callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
>> to check whether a PMD page table is shared. The check is incorrect if the
>> refcount is increased by such a caller, and this can cause the page table
>> to be leaked:
> 
> hugetlb and THP don't overlap, right? How does split_huge_pages() end
> up messing up huge_pmd_share()?
> 
> Am I missing something?
> 

It first grabs a reference and then checks whether it's a THP. So we can 
end up grabbing anything temporarily.


In general, we'll have to be smarter about what we try grabbing, but 
handling races for now is tricky.
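
A sketch of that "grab first, check later" pattern (simplified from
gup/split_huge_pages-style code; the function name is illustrative):

static struct folio *try_grab_thp_candidate(struct page *page)
{
	/* Speculative reference: briefly pins whatever the pfn currently
	 * backs, which may be a hugetlb PMD page table. */
	struct folio *folio = try_get_folio(page, 1);

	if (!folio)
		return NULL;
	if (!folio_test_large(folio)) {
		/* Not a THP after all: drop the reference, but the bump
		 * was already visible to page_count() readers such as
		 * huge_pmd_unshare(). */
		folio_put(folio);
		return NULL;
	}
	return folio;
}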
David Hildenbrand Dec. 17, 2024, 11:56 a.m. UTC | #8
On 17.12.24 03:02, Liu Shixin wrote:
> 
> 
> On 2024/12/16 23:34, David Hildenbrand wrote:
>> On 14.12.24 11:44, Liu Shixin wrote:
>>> The folio refcount may be increased unexpectedly through try_get_folio() by
>>> callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
>>> to check whether a PMD page table is shared. The check is incorrect if the
>>> refcount is increased by such a caller, and this can cause the page table
>>> to be leaked:
>>
>> Are you sure it is "leaked" ?
>>
>> I assume what happens is that we end up freeing a page table without calling its destructor. That's why page freeing code complains about "nonzero mapcount" (overlaid by something else).
> 
> 1. The page table itself will be discarded after reporting the "nonzero mapcount".
> 
> 2. The HugeTLB page mapped by the page table misses being freed, since we treat the page table as shared
>      and a shared page table will not be unmapped.

Ah, the page table still maps something; that makes sense. So we're 
leaking that indeed.
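
With the patch applied (full diff below), huge_pmd_unshare() no longer
consults the folio refcount at all. Here is an annotated copy of the new
logic; the comments on the counter's meaning are inferred from the diff,
not stated in it:

	if (sz != PMD_SIZE)
		return 0;	/* sharing only happens at PMD granularity */
	/* pt_share_count: 0 means only this mm maps through the PMD page
	 * table, N means N other mms also share it.  Speculative folio
	 * references no longer perturb this test. */
	if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
		return 0;

	pud_clear(pud);
	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
	mm_dec_nr_pmds(mm);
	return 1;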

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c39c4945946c..50fbf2a1b0ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3115,6 +3115,7 @@  static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 	if (!pmd_ptlock_init(ptdesc))
 		return false;
 	__folio_set_pgtable(folio);
+	ptdesc_pmd_pts_init(ptdesc);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
 	return true;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..740b765ac91e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,6 +445,7 @@  FOLIO_MATCH(compound_head, _head_2a);
  * @pt_index:         Used for s390 gmap.
  * @pt_mm:            Used for x86 pgds.
  * @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
+ * @pt_share_count:   Used for HugeTLB PMD page table share count.
  * @_pt_pad_2:        Padding to ensure proper alignment.
  * @ptl:              Lock for the page table.
  * @__page_type:      Same as page->page_type. Unused for page tables.
@@ -471,6 +472,9 @@  struct ptdesc {
 		pgoff_t pt_index;
 		struct mm_struct *pt_mm;
 		atomic_t pt_frag_refcount;
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+		atomic_t pt_share_count;
+#endif
 	};
 
 	union {
@@ -516,6 +520,30 @@  static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 	const struct page *:		(const struct ptdesc *)(p),	\
 	struct page *:			(struct ptdesc *)(p)))
 
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+	atomic_set(&ptdesc->pt_share_count, 0);
+}
+
+static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
+{
+	atomic_inc(&ptdesc->pt_share_count);
+}
+
+static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
+{
+	atomic_dec(&ptdesc->pt_share_count);
+}
+
+static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
+{
+	return atomic_read(&ptdesc->pt_share_count);
+}
+#else
+#define ptdesc_pmd_pts_init NULL
+#endif
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea2ed8e301ef..60846b060b87 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7212,7 +7212,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 			spte = hugetlb_walk(svma, saddr,
 					    vma_mmu_pagesize(svma));
 			if (spte) {
-				get_page(virt_to_page(spte));
+				ptdesc_pmd_pts_inc(virt_to_ptdesc(spte));
 				break;
 			}
 		}
@@ -7227,7 +7227,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 		mm_inc_nr_pmds(mm);
 	} else {
-		put_page(virt_to_page(spte));
+		ptdesc_pmd_pts_dec(virt_to_ptdesc(spte));
 	}
 	spin_unlock(&mm->page_table_lock);
 out:
@@ -7239,10 +7239,6 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 /*
  * unmap huge page backed by shared pte.
  *
- * Hugetlb pte page is ref counted at the time of mapping.  If pte is shared
- * indicated by page_count > 1, unmap is achieved by clearing pud and
- * decrementing the ref count. If count == 1, the pte page is not shared.
- *
  * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
@@ -7251,18 +7247,20 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
+	unsigned long sz = huge_page_size(hstate_vma(vma));
 	pgd_t *pgd = pgd_offset(mm, addr);
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
 
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	hugetlb_vma_assert_locked(vma);
-	BUG_ON(page_count(virt_to_page(ptep)) == 0);
-	if (page_count(virt_to_page(ptep)) == 1)
+	if (sz != PMD_SIZE)
+		return 0;
+	if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
 		return 0;
 
 	pud_clear(pud);
-	put_page(virt_to_page(ptep));
+	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
 	mm_dec_nr_pmds(mm);
 	return 1;
 }