
[05/12] powerpc: add pte_free_defer() for pgtables sharing page

Message ID 28eb289f-ea2c-8eb9-63bb-9f7d7b9ccc11@google.com (mailing list archive)
State: New
Series: mm: free retracted page table by RCU

Commit Message

Hugh Dickins May 29, 2023, 6:20 a.m. UTC
Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

Comments

Matthew Wilcox (Oracle) May 29, 2023, 2:02 p.m. UTC | #1
On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}

This can't be safe (on ppc).  IIRC you might have up to 16x4k page
tables sharing one 64kB page.  So if you have two page tables from the
same page being defer-freed simultaneously, you'll reuse the rcu_head
and I cannot imagine things go well from that point.
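
To spell out the collision with a hypothetical interleaving (fragA and
fragB are illustrative names for two pte fragments within one such page):

	/* both fragments resolve to the same struct page */
	pte_free_defer(mm, fragA);	/* call_rcu(&page->rcu_head, ...) */
	pte_free_defer(mm, fragB);	/* call_rcu() on the SAME rcu_head,
					 * before fragA's callback has run:
					 * corrupts RCU's callback list */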

I have no idea how to solve this problem.
Hugh Dickins May 29, 2023, 2:36 p.m. UTC | #2
On Mon, 29 May 2023, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	call_rcu(&page->rcu_head, pte_free_now);
> > +}
> 
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.

Oh yes, of course, thanks for catching that so quickly.
So my s390 and sparc implementations will be equally broken.

> 
> I have no idea how to solve this problem.

I do: I'll have to go back to the more complicated implementation we
actually ran with on powerpc - I was thinking those complications just
related to deposit/withdraw matters, forgetting the one-rcu_head issue.

It uses large (0x10000) increments of the page refcount, avoiding
call_rcu() when already active.
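
Roughly, the defer side has this shape (a sketch only, details differ in
the real patch; pte_free_now is the RCU callback as before):

	#define PTE_FREE_DEFERRED 0x10000	/* beyond any PTE_FRAG_NR */

	void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
	{
		struct page *page = virt_to_page(pgtable);

		/*
		 * If a bias was already present, an RCU callback is
		 * already pending and owns the one rcu_head: the bias
		 * just added records this deferral for that callback,
		 * and the rcu_head must not be queued a second time.
		 */
		if (atomic_add_return(PTE_FREE_DEFERRED,
				      &page->pt_frag_refcount) <
		    2 * PTE_FREE_DEFERRED)
			call_rcu(&page->rcu_head, pte_free_now);
	}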

It's not a complication I had wanted to explain or test for now,
but we shall have to.  It should apply equally well to sparc, but s390
is more of a problem, since s390 already has its own refcount cleverness.

Thanks, I must dash, out much of the day.

Hugh
Hugh Dickins June 7, 2023, 3:49 a.m. UTC | #3
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> > On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> > > 
> > > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > > >  		__free_page(page);
> > > >  	}
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > > > +
> > > > +static void pte_free_now(struct rcu_head *head)
> > > > +{
> > > > +	struct page *page;
> > > > +	int refcount;
> > > > +
> > > > +	page = container_of(head, struct page, rcu_head);
> > > > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > > +				     &page->pt_frag_refcount);
> > > > +	if (refcount < PTE_FREE_DEFERRED) {
> > > > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > > > +		return;
> > > > +	}
> > > 
> > > From what I can tell power doesn't recycle the sub fragment into any
> > > kind of free list. It just waits for the last fragment to be unused
> > > and then frees the whole page.

Yes, it's relatively simple in that way: not as sophisticated as s390.

> > > 
> > > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> > > 
> > > 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > > 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > > 		if (!kernel)
> > > 			pgtable_pte_page_dtor(page);
> > > 		call_rcu(&page->rcu_head, free_page_rcu);
> > 
> > We need to be careful about the lock being freed in pgtable_pte_page_dtor():
> > in Hugh's series, IIUC, the spinlock needs to stay live for the RCU grace
> > period alongside the page itself.  So even to do it that way, we would also
> > need to RCU-call pgtable_pte_page_dtor() when needed.

Thanks, Peter, yes that's right.

> 
> Er yes, I botched that: the dtor and the free_page should be in the
> rcu callback function.

But it was just a botched detail, and won't have answered Jason's doubt.
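
For concreteness, the corrected shape would be something like this (a
sketch only; free_page_rcu is the placeholder name from the suggestion
above, not an existing kernel function, and it assumes the !kernel case):

	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/* the ptlock must remain valid until after the grace period */
		pgtable_pte_page_dtor(page);
		__free_page(page);
	}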

I had three (or perhaps it amounts to two) reasons for doing it this way:
none of which may seem good enough reasons to you.  Certainly I'd agree
that the way it's done seems... arcane.

One, as I've indicated before, I don't actually dare to go all
the way into RCU freeing of all page tables for powerpc (or any other):
I should think it's a good idea that everyone wants in the end, but I'm
limited by my time and competence - and dread of losing my way in the
mmu_gather TLB #ifdef maze.  It's work for someone else, not me.

(Could pte_free_defer() do as you suggest, without changing pte_fragment_free()
itself?  No, that doesn't work out: if defer does, say, the decrement of
pt_frag_refcount from 2 to 1, then pte_fragment_free() does the decrement
from 1 to 0, and the page is freed without deferral.)
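
Concretely, that unwanted interleaving (hypothetical trace, with two
fragments in use on the page):

	/* pt_frag_refcount == 2 */
	pte_free_defer(mm, A);	/* 2 -> 1: A's free deferred, grace
				 * period pending */
	pte_free(mm, B);	/* 1 -> 0: pte_fragment_free() frees the
				 * whole page at once, without waiting for
				 * the grace period A's deferral required */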

Two, this was the code I'd worked out before, and was used in production,
so I had confidence in it - it was just my mistake that I'd forgotten the
single rcu_head issue, and thought I could avoid it in the initial posting.
powerpc has changed around since then, but apparently not in any way that
affects this.  And it's too easy to agree in review that something can be
simpler, without bringing back to mind why the complications are there.

Three (just an explanation of why the old code was like this), powerpc
relies on THP's page table deposit+withdraw protocol, even for shmem/
file THPs.  I've skirted that issue in this series, by sticking with
retract_page_tables(), not attempting to insert huge pmd immediately.
But if a huge pmd is inserted to replace a page-table pmd, then that page
table must be deposited: pte_free_defer() as written protects the deposited
page table from then being freed without deferral (rather like in the
example above).

But it does not protect it from being withdrawn and reused within that
grace period.  Jann has grave doubts whether that can ever be allowed
(or perhaps I should grant him certainty, and examples that it cannot).
I did convince myself, back in the day, that it was safe here: but I'll
have to put in a lot more thought to re-justify it now, and on the way
may instead be completely persuaded by Jann.

Not very good reasons: good enough, or can you supply a better patch?

Thanks,
Hugh

Patch

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@  static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch uses pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..3a3dac77faf2 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,21 @@  void pte_fragment_free(unsigned long *table, int kernel)
 		__free_page(page);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_fragment_free((unsigned long *)page_to_virt(page), 0);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */