diff mbox series

[1/5] mm: Free non-hugetlb large folios in a batch

Message ID 20240405153228.2563754-2-willy@infradead.org (mailing list archive)
State New
Headers show
Series Clean up __folio_put() | expand

Commit Message

Matthew Wilcox April 5, 2024, 3:32 p.m. UTC
free_unref_folios() can now handle non-hugetlb large folios, so keep
normal large folios in the batch.  hugetlb folios still need to be
handled specially.  I believe that folios freed using put_pages_list()
cannot be accounted to a memcg (or the small folios would trip the "page
still charged to cgroup" warning), but put an assertion in to check that.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Peter Xu April 24, 2024, 3:20 p.m. UTC | #1
On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> free_unref_folios() can now handle non-hugetlb large folios, so keep
> normal large folios in the batch.  hugetlb folios still need to be
> handled specially.  I believe that folios freed using put_pages_list()
> cannot be accounted to a memcg (or the small folios would trip the "page
> still charged to cgroup" warning), but put an assertion in to check that.

There is such a user: the iommu code uses put_pages_list() to free IOMMU
page tables, and those can be memcg accounted; since 2023, iommu_map()
has used GFP_KERNEL_ACCOUNT.

I hit the panic below when testing my local branch on top of
mm-everything while running some VFIO workloads.

For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
iommu allocations").

I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg will
then be properly taken care of later in free_pages_prepare().  A fixup
that fixes this crash for me is attached at the end.

Thanks,

[   10.092411] kernel BUG at mm/swap.c:152!
[   10.092686] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   10.093034] CPU: 3 PID: 634 Comm: vfio-pci-mmap-t Tainted: G        W          6.9.0-rc4-peterx+ #2
[   10.093628] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   10.094361] RIP: 0010:put_pages_list+0x12b/0x150
[   10.094675] Code: 6d 08 48 81 c4 00 01 00 00 5b 5d c3 cc cc cc cc 48 c7 c6 f0 fd 9f 82 e8 63 e8 03 00 0f 0b 48 c7 c6 48 00 a0 82 e8 55 e8 03 00 <0f> 0b 48 c7 c6 28 fe 9f 82 e8 47f
[   10.095896] RSP: 0018:ffffc9000221bc50 EFLAGS: 00010282
[   10.096242] RAX: 0000000000000038 RBX: ffffea00042695c0 RCX: 0000000000000000
[   10.096707] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[   10.097177] RBP: ffffc9000221bd68 R08: 0000000000000000 R09: 0000000000000003
[   10.097642] R10: ffffc9000221bb08 R11: ffffffff8335db48 R12: ffff8881070172c0
[   10.098113] R13: ffff888102fd0000 R14: ffff888107017210 R15: ffff888110a6c7c0
[   10.098586] FS:  0000000000000000(0000) GS:ffff888276a00000(0000) knlGS:0000000000000000
[   10.099117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.099494] CR2: 00007f1910000000 CR3: 000000000323c006 CR4: 0000000000770ef0
[   10.099972] PKRU: 55555554
[   10.100154] Call Trace:
[   10.100321]  <TASK>
[   10.100466]  ? die+0x32/0x80
[   10.100666]  ? do_trap+0xd9/0x100
[   10.100897]  ? put_pages_list+0x12b/0x150
[   10.101168]  ? put_pages_list+0x12b/0x150
[   10.101434]  ? do_error_trap+0x81/0x110
[   10.101688]  ? put_pages_list+0x12b/0x150
[   10.101957]  ? exc_invalid_op+0x4c/0x60
[   10.102216]  ? put_pages_list+0x12b/0x150
[   10.102484]  ? asm_exc_invalid_op+0x16/0x20
[   10.102771]  ? put_pages_list+0x12b/0x150
[   10.103026]  ? 0xffffffff81000000
[   10.103246]  ? dma_pte_list_pagetables.isra.0+0x38/0xa0
[   10.103592]  ? dma_pte_list_pagetables.isra.0+0x9b/0xa0
[   10.103933]  ? dma_pte_clear_level+0x18c/0x1a0
[   10.104228]  ? domain_unmap+0x65/0x130
[   10.104481]  ? domain_unmap+0xe6/0x130
[   10.104735]  domain_exit+0x47/0x80
[   10.104968]  vfio_iommu_type1_detach_group+0x3f1/0x5f0
[   10.105308]  ? vfio_group_detach_container+0x3c/0x1a0
[   10.105644]  vfio_group_detach_container+0x60/0x1a0
[   10.105977]  vfio_group_fops_release+0x46/0x80
[   10.106274]  __fput+0x9a/0x2d0
[   10.106479]  task_work_run+0x55/0x90
[   10.106717]  do_exit+0x32f/0xb70
[   10.106945]  ? _raw_spin_unlock_irq+0x24/0x50
[   10.107237]  do_group_exit+0x32/0xa0
[   10.107481]  __x64_sys_exit_group+0x14/0x20
[   10.107760]  do_syscall_64+0x75/0x190
[   10.108007]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

==================================

diff --git a/mm/swap.c b/mm/swap.c
index f0d478eee292..8ae5cd4ed180 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -149,7 +149,6 @@ void put_pages_list(struct list_head *pages)
                        free_huge_folio(folio);
                        continue;
                }
-               VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
                /* LRU flag must be clear because it's passed using the lru */
                if (folio_batch_add(&fbatch, folio) > 0)
                        continue;
Matthew Wilcox April 25, 2024, 3:39 a.m. UTC | #2
On Wed, Apr 24, 2024 at 11:20:28AM -0400, Peter Xu wrote:
> On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> > free_unref_folios() can now handle non-hugetlb large folios, so keep
> > normal large folios in the batch.  hugetlb folios still need to be
> > handled specially.  I believe that folios freed using put_pages_list()
> > cannot be accounted to a memcg (or the small folios would trip the "page
> > still charged to cgroup" warning), but put an assertion in to check that.
> 
> There is such a user: the iommu code uses put_pages_list() to free
> IOMMU page tables, and those can be memcg accounted; since 2023,
> iommu_map() has used GFP_KERNEL_ACCOUNT.
> 
> I hit the panic below when testing my local branch on top of
> mm-everything while running some VFIO workloads.
> 
> For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
> iommu allocations").
> 
> I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg will
> then be properly taken care of later in free_pages_prepare().  A fixup
> that fixes this crash for me is attached at the end.

Yes, I think you're right.

I was concerned about the deferred split list / memcg charge problem,
but (a) page table pages can't ever be on the deferred split list, (b)
just passing them through to free_unref_folios() works fine.  The problem
was that folios_put_refs() was uncharging a batch before passing them
to free_unref_folios().

That does bring up the question though ... should we be uncharging
these folios as a batch for better performance?  Do you have a workload
which frees a lot of page tables?  Presumably an exit would do that.
If so, does adding a call to mem_cgroup_uncharge_folios() before calling
free_unref_folios() improve performance in any noticeable way?
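[Editor's note: the suggested batched uncharge would presumably be a
small addition in put_pages_list(); an untested sketch, with approximate
context lines, relying on mem_cgroup_uncharge_folios() skipping folios
that carry no memcg charge:]

```diff
 		/* LRU flag must be clear because it's passed using the lru */
 		if (folio_batch_add(&fbatch, folio) > 0)
 			continue;
+		mem_cgroup_uncharge_folios(&fbatch);
 		free_unref_folios(&fbatch);
 	}
-	if (fbatch.nr)
+	if (fbatch.nr) {
+		mem_cgroup_uncharge_folios(&fbatch);
 		free_unref_folios(&fbatch);
+	}
```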

In the meantime, this patch:

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

although I think Andrew will just fold it into
"mm: free non-hugetlb large folios in a batch"

Andrew, if you do do that, please also edit out the last couple of
sentences from the commit message:

    free_unref_folios() can now handle non-hugetlb large folios, so keep
    normal large folios in the batch.  hugetlb folios still need to be handled
-   specially.  I believe that folios freed using put_pages_list() cannot be
-   accounted to a memcg (or the small folios would trip the "page still
-   charged to cgroup" warning), but put an assertion in to check that.
+   specially.
Peter Xu April 25, 2024, 3 p.m. UTC | #3
On Thu, Apr 25, 2024 at 04:39:14AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 24, 2024 at 11:20:28AM -0400, Peter Xu wrote:
> > On Fri, Apr 05, 2024 at 04:32:23PM +0100, Matthew Wilcox (Oracle) wrote:
> > > free_unref_folios() can now handle non-hugetlb large folios, so keep
> > > normal large folios in the batch.  hugetlb folios still need to be
> > > handled specially.  I believe that folios freed using put_pages_list()
> > > cannot be accounted to a memcg (or the small folios would trip the "page
> > > still charged to cgroup" warning), but put an assertion in to check that.
> > 
> > There is such a user: the iommu code uses put_pages_list() to free
> > IOMMU page tables, and those can be memcg accounted; since 2023,
> > iommu_map() has used GFP_KERNEL_ACCOUNT.
> > 
> > I hit the panic below when testing my local branch on top of
> > mm-everything while running some VFIO workloads.
> > 
> > For this specific vfio use case, see 160912fc3d4a ("vfio/type1: account
> > iommu allocations").
> > 
> > I think we should remove the VM_BUG_ON_FOLIO() line, as the memcg
> > will then be properly taken care of later in free_pages_prepare().
> > A fixup that fixes this crash for me is attached at the end.
> 
> Yes, I think you're right.
> 
> I was concerned about the deferred split list / memcg charge problem,
> but (a) page table pages can't ever be on the deferred split list, (b)
> just passing them through to free_unref_folios() works fine.  The problem
> was that folios_put_refs() was uncharging a batch before passing them
> to free_unref_folios().
> 
> That does bring up the question though ... should we be uncharging
> these folios as a batch for better performance?  Do you have a workload
> which frees a lot of page tables?  Presumably an exit would do that.
> If so, does adding a call to mem_cgroup_uncharge_folios() before calling
> free_unref_folios() improve performance in any noticeable way?

Looks like something worth trying indeed.

The trace I hit was in an exit path, but we can double check whether it
can also happen in some iommu hot paths like unmap; maybe such a change
would be easier to justify in that case?

AFAIU, from my reading of the current iommu page table management, it is
more aggressive than the CPU page tables about freeing page table pages,
so such a batched release can happen during an iommu unmap too, not only
at exit:

intel_iommu_unmap
  domain_unmap
    dma_pte_clear_level
      dma_pte_list_pagetables

But it is still worth checking in a test; perhaps the easiest way is to
use ioctl(VFIO_IOMMU_[UN]MAP_DMA).

> 
> In the meantime, this patch:
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> although I think Andrew will just fold it into
> "mm: free non-hugetlb large folios in a batch"
> 
> Andrew, if you do do that, please also edit out the last couple of
> sentences from the commit message:
> 
>     free_unref_folios() can now handle non-hugetlb large folios, so keep
>     normal large folios in the batch.  hugetlb folios still need to be handled
> -   specially.  I believe that folios freed using put_pages_list() cannot be
> -   accounted to a memcg (or the small folios would trip the "page still
> -   charged to cgroup" warning), but put an assertion in to check that.
> +   specially.

Yes, we should drop these lines too.

Thanks,
Andrew Morton April 25, 2024, 8:54 p.m. UTC | #4
On Thu, 25 Apr 2024 04:39:14 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> Andrew, if you do do that, please also edit out the last couple of
> sentences from the commit message:
> 
>     free_unref_folios() can now handle non-hugetlb large folios, so keep
>     normal large folios in the batch.  hugetlb folios still need to be handled
> -   specially.  I believe that folios freed using put_pages_list() cannot be
> -   accounted to a memcg (or the small folios would trip the "page still
> -   charged to cgroup" warning), but put an assertion in to check that.
> +   specially.

Done, thanks.
diff mbox series

Patch

diff --git a/mm/swap.c b/mm/swap.c
index f72364e92d5f..4643e0d53124 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -158,10 +158,11 @@  void put_pages_list(struct list_head *pages)
 	list_for_each_entry_safe(folio, next, pages, lru) {
 		if (!folio_put_testzero(folio))
 			continue;
-		if (folio_test_large(folio)) {
-			__folio_put_large(folio);
+		if (folio_test_hugetlb(folio)) {
+			free_huge_folio(folio);
 			continue;
 		}
+		VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
 		/* LRU flag must be clear because it's passed using the lru */
 		if (folio_batch_add(&fbatch, folio) > 0)
 			continue;