diff mbox series

[-next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

Message ID 1559656836-24940-1-git-send-email-cai@lca.pw (mailing list archive)
State New, archived
Headers show
Series [-next] arm64/mm: fix a bogus GFP flag in pgd_alloc() | expand

Commit Message

Qian Cai June 4, 2019, 2 p.m. UTC
The commit "arm64: switch to generic version of pte allocation"
introduced endless failures during boot like,

kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
-2 parent: cgroup)

It turns out __GFP_ACCOUNT is passed to kernel page table allocations
and then later memcg finds out those don't belong to any cgroup.

backtrace:
  kobject_add_internal
  kobject_init_and_add
  sysfs_slab_add+0x1a8
  __kmem_cache_create
  create_cache
  memcg_create_kmem_cache
  memcg_kmem_cache_create_func
  process_one_work
  worker_thread
  kthread

Signed-off-by: Qian Cai <cai@lca.pw>
---
 arch/arm64/mm/pgd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Mark Rutland June 4, 2019, 2:23 p.m. UTC | #1
On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> The commit "arm64: switch to generic version of pte allocation"
> introduced endless failures during boot like,
> 
> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> -2 parent: cgroup)
> 
> It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> and then later memcg finds out those don't belong to any cgroup.

Mike, I understood from [1] that this wasn't expected to be a problem,
as the accounting should bypass kernel threads.

Was that assumption wrong, or is something different happening here?

> 
> backtrace:
>   kobject_add_internal
>   kobject_init_and_add
>   sysfs_slab_add+0x1a8
>   __kmem_cache_create
>   create_cache
>   memcg_create_kmem_cache
>   memcg_kmem_cache_create_func
>   process_one_work
>   worker_thread
>   kthread
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  arch/arm64/mm/pgd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> index 769516cb6677..53c48f5c8765 100644
> --- a/arch/arm64/mm/pgd.c
> +++ b/arch/arm64/mm/pgd.c
> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>  	if (PGD_SIZE == PAGE_SIZE)
>  		return (pgd_t *)__get_free_page(gfp);
>  	else
> -		return kmem_cache_alloc(pgd_cache, gfp);
> +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);

This is used to allocate PGDs for both user and kernel pagetables (e.g.
for the efi runtime services), so while this may fix the regression, I'm
not sure it's the right fix.

Do we need a separate pgd_alloc_kernel()?

Thanks,
Mark.

[1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx
Mark Rutland June 4, 2019, 2:30 p.m. UTC | #2
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> > 
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> > 
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
> 
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
> 
> Was that assumption wrong, or is something different happening here?
> 
> > backtrace:
> >   kobject_add_internal
> >   kobject_init_and_add
> >   sysfs_slab_add+0x1a8
> >   __kmem_cache_create
> >   create_cache
> >   memcg_create_kmem_cache
> >   memcg_kmem_cache_create_func
> >   process_one_work
> >   worker_thread
> >   kthread
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  arch/arm64/mm/pgd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> >  	if (PGD_SIZE == PAGE_SIZE)
> >  		return (pgd_t *)__get_free_page(gfp);
> >  	else
> > -		return kmem_cache_alloc(pgd_cache, gfp);
> > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> 
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.

I see that since [1], pgd_alloc() was updated to special-case the
init_mm, which is not sufficient for cases like:

	efi_mm.pgd = pgd_alloc(&efi_mm)

... which occurs in a kthread.

So let's have a pgd_alloc_kernel() to make that explicit.

Thanks,
Mark.
Mike Rapoport June 4, 2019, 2:54 p.m. UTC | #3
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> > 
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> > 
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
> 
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
> 
> Was that assumption wrong, or is something different happening here?

I was under impression that all allocations are going through
__memcg_kmem_charge() which does the bypass.

Apparently, it's not the case :(

> > 
> > backtrace:
> >   kobject_add_internal
> >   kobject_init_and_add
> >   sysfs_slab_add+0x1a8
> >   __kmem_cache_create
> >   create_cache
> >   memcg_create_kmem_cache
> >   memcg_kmem_cache_create_func
> >   process_one_work
> >   worker_thread
> >   kthread
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  arch/arm64/mm/pgd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> >  	if (PGD_SIZE == PAGE_SIZE)
> >  		return (pgd_t *)__get_free_page(gfp);
> >  	else
> > -		return kmem_cache_alloc(pgd_cache, gfp);
> > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> 
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.

Me neither.
 
> Do we need a separate pgd_alloc_kernel()?
 
I'd like to take a closer look at memcg paths once again before adding
pgd_alloc_kernel().

Johannes, Roman, can you please advise anything?

> Thanks,
> Mark.
> 
> [1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx
>
Mike Rapoport June 5, 2019, 9:33 p.m. UTC | #4
On Tue, Jun 04, 2019 at 03:30:20PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > The commit "arm64: switch to generic version of pte allocation"
> > > introduced endless failures during boot like,
> > > 
> > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > -2 parent: cgroup)
> > > 
> > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > and then later memcg finds out those don't belong to any cgroup.
> > 
> > Mike, I understood from [1] that this wasn't expected to be a problem,
> > as the accounting should bypass kernel threads.
> > 
> > Was that assumption wrong, or is something different happening here?
> > 
> > > backtrace:
> > >   kobject_add_internal
> > >   kobject_init_and_add
> > >   sysfs_slab_add+0x1a8
> > >   __kmem_cache_create
> > >   create_cache
> > >   memcg_create_kmem_cache
> > >   memcg_kmem_cache_create_func
> > >   process_one_work
> > >   worker_thread
> > >   kthread
> > > 
> > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > ---
> > >  arch/arm64/mm/pgd.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > index 769516cb6677..53c48f5c8765 100644
> > > --- a/arch/arm64/mm/pgd.c
> > > +++ b/arch/arm64/mm/pgd.c
> > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > >  	if (PGD_SIZE == PAGE_SIZE)
> > >  		return (pgd_t *)__get_free_page(gfp);
> > >  	else
> > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > 
> > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > for the efi runtime services), so while this may fix the regression, I'm
> > not sure it's the right fix.
> 
> I see that since [1], pgd_alloc() was updated to special-case the
> init_mm, which is not sufficient for cases like:
> 
> 	efi_mm.pgd = pgd_alloc(&efi_mm)
> 
> ... which occurs in a kthread.
> 
> So let's have a pgd_alloc_kernel() to make that explicit.

I've hit "send" before seeing this one :)

Well, to be completely on the safe side an explicit pgd_alloc_kernel()
sounds right. Then it won't be subject to future changes in memcg and will
always "Do The Right Thing".
 
> Thanks,
> Mark.
>
Will Deacon June 10, 2019, 11:43 a.m. UTC | #5
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> > 
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> > 
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
> 
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
> 
> Was that assumption wrong, or is something different happening here?
> 
> > 
> > backtrace:
> >   kobject_add_internal
> >   kobject_init_and_add
> >   sysfs_slab_add+0x1a8
> >   __kmem_cache_create
> >   create_cache
> >   memcg_create_kmem_cache
> >   memcg_kmem_cache_create_func
> >   process_one_work
> >   worker_thread
> >   kthread
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  arch/arm64/mm/pgd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> >  	if (PGD_SIZE == PAGE_SIZE)
> >  		return (pgd_t *)__get_free_page(gfp);
> >  	else
> > -		return kmem_cache_alloc(pgd_cache, gfp);
> > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> 
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.
> 
> Do we need a separate pgd_alloc_kernel()?

So can I take the above for -rc5, or is somebody else working on a different
fix to implement pgd_alloc_kernel()?

/confused

Will
Qian Cai June 10, 2019, 5:26 p.m. UTC | #6
On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > The commit "arm64: switch to generic version of pte allocation"
> > > introduced endless failures during boot like,
> > > 
> > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > -2 parent: cgroup)
> > > 
> > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > and then later memcg finds out those don't belong to any cgroup.
> > 
> > Mike, I understood from [1] that this wasn't expected to be a problem,
> > as the accounting should bypass kernel threads.
> > 
> > Was that assumption wrong, or is something different happening here?
> > 
> > > 
> > > backtrace:
> > >   kobject_add_internal
> > >   kobject_init_and_add
> > >   sysfs_slab_add+0x1a8
> > >   __kmem_cache_create
> > >   create_cache
> > >   memcg_create_kmem_cache
> > >   memcg_kmem_cache_create_func
> > >   process_one_work
> > >   worker_thread
> > >   kthread
> > > 
> > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > ---
> > >  arch/arm64/mm/pgd.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > index 769516cb6677..53c48f5c8765 100644
> > > --- a/arch/arm64/mm/pgd.c
> > > +++ b/arch/arm64/mm/pgd.c
> > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > >  	if (PGD_SIZE == PAGE_SIZE)
> > >  		return (pgd_t *)__get_free_page(gfp);
> > >  	else
> > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > 
> > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > for the efi runtime services), so while this may fix the regression, I'm
> > not sure it's the right fix.
> > 
> > Do we need a separate pgd_alloc_kernel()?
> 
> So can I take the above for -rc5, or is somebody else working on a different
> fix to implement pgd_alloc_kernel()?

The offensive commit "arm64: switch to generic version of pte allocation" is not
yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
Andrew will push this out any time sooner given it is broken.
Mark Rutland June 11, 2019, 10:03 a.m. UTC | #7
On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > The commit "arm64: switch to generic version of pte allocation"
> > > > introduced endless failures during boot like,
> > > > 
> > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > -2 parent: cgroup)
> > > > 
> > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > and then later memcg finds out those don't belong to any cgroup.
> > > 
> > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > as the accounting should bypass kernel threads.
> > > 
> > > Was that assumption wrong, or is something different happening here?
> > > 
> > > > 
> > > > backtrace:
> > > >   kobject_add_internal
> > > >   kobject_init_and_add
> > > >   sysfs_slab_add+0x1a8
> > > >   __kmem_cache_create
> > > >   create_cache
> > > >   memcg_create_kmem_cache
> > > >   memcg_kmem_cache_create_func
> > > >   process_one_work
> > > >   worker_thread
> > > >   kthread
> > > > 
> > > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > > ---
> > > >  arch/arm64/mm/pgd.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > index 769516cb6677..53c48f5c8765 100644
> > > > --- a/arch/arm64/mm/pgd.c
> > > > +++ b/arch/arm64/mm/pgd.c
> > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > >  	if (PGD_SIZE == PAGE_SIZE)
> > > >  		return (pgd_t *)__get_free_page(gfp);
> > > >  	else
> > > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > 
> > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > for the efi runtime services), so while this may fix the regression, I'm
> > > not sure it's the right fix.
> > > 
> > > Do we need a separate pgd_alloc_kernel()?
> > 
> > So can I take the above for -rc5, or is somebody else working on a different
> > fix to implement pgd_alloc_kernel()?
> 
> The offensive commit "arm64: switch to generic version of pte allocation" is not
> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> Andrew will push this out any time sooner given it is broken.

I'd assumed that Mike would respin these patches to implement and use
pgd_alloc_kernel() (or take gfp flags) and the updated patches would
replace these in akpm's tree.

Mike, could you confirm what your plan is? I'm happy to review/test
updated patches for arm64.

Thanks,
Mark.
Mike Rapoport June 11, 2019, 12:41 p.m. UTC | #8
On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > introduced endless failures during boot like,
> > > > > 
> > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > -2 parent: cgroup)
> > > > > 
> > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > 
> > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > as the accounting should bypass kernel threads.
> > > > 
> > > > Was that assumption wrong, or is something different happening here?
> > > > 
> > > > > 
> > > > > backtrace:
> > > > >   kobject_add_internal
> > > > >   kobject_init_and_add
> > > > >   sysfs_slab_add+0x1a8
> > > > >   __kmem_cache_create
> > > > >   create_cache
> > > > >   memcg_create_kmem_cache
> > > > >   memcg_kmem_cache_create_func
> > > > >   process_one_work
> > > > >   worker_thread
> > > > >   kthread
> > > > > 
> > > > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > > > ---
> > > > >  arch/arm64/mm/pgd.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > --- a/arch/arm64/mm/pgd.c
> > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > >  	if (PGD_SIZE == PAGE_SIZE)
> > > > >  		return (pgd_t *)__get_free_page(gfp);
> > > > >  	else
> > > > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > 
> > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > not sure it's the right fix.
> > > > 
> > > > Do we need a separate pgd_alloc_kernel()?
> > > 
> > > So can I take the above for -rc5, or is somebody else working on a different
> > > fix to implement pgd_alloc_kernel()?
> > 
> > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > Andrew will push this out any time sooner given it is broken.
> 
> I'd assumed that Mike would respin these patches to implement and use
> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> replace these in akpm's tree.
> 
> Mike, could you confirm what your plan is? I'm happy to review/test
> updated patches for arm64.

Sorry for the delay, I'm mostly offline these days.

I wanted to understand first what is the reason for the failure. I've tried
to reproduce it with qemu, but I failed to find a bootable configuration
that will have PGD_SIZE != PAGE_SIZE :(

Qian Cai, can you share what is your environment and the kernel config?
 
> Thanks,
> Mark.
>
Qian Cai June 11, 2019, 12:46 p.m. UTC | #9
> On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
>> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
>>> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
>>>> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
>>>>> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
>>>>>> The commit "arm64: switch to generic version of pte allocation"
>>>>>> introduced endless failures during boot like,
>>>>>> 
>>>>>> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
>>>>>> -2 parent: cgroup)
>>>>>> 
>>>>>> It turns out __GFP_ACCOUNT is passed to kernel page table allocations
>>>>>> and then later memcg finds out those don't belong to any cgroup.
>>>>> 
>>>>> Mike, I understood from [1] that this wasn't expected to be a problem,
>>>>> as the accounting should bypass kernel threads.
>>>>> 
>>>>> Was that assumption wrong, or is something different happening here?
>>>>> 
>>>>>> 
>>>>>> backtrace:
>>>>>>   kobject_add_internal
>>>>>>   kobject_init_and_add
>>>>>>   sysfs_slab_add+0x1a8
>>>>>>   __kmem_cache_create
>>>>>>   create_cache
>>>>>>   memcg_create_kmem_cache
>>>>>>   memcg_kmem_cache_create_func
>>>>>>   process_one_work
>>>>>>   worker_thread
>>>>>>   kthread
>>>>>> 
>>>>>> Signed-off-by: Qian Cai <cai@lca.pw>
>>>>>> ---
>>>>>>  arch/arm64/mm/pgd.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
>>>>>> index 769516cb6677..53c48f5c8765 100644
>>>>>> --- a/arch/arm64/mm/pgd.c
>>>>>> +++ b/arch/arm64/mm/pgd.c
>>>>>> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>>>>>  	if (PGD_SIZE == PAGE_SIZE)
>>>>>>  		return (pgd_t *)__get_free_page(gfp);
>>>>>>  	else
>>>>>> -		return kmem_cache_alloc(pgd_cache, gfp);
>>>>>> +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>>>>> 
>>>>> This is used to allocate PGDs for both user and kernel pagetables (e.g.
>>>>> for the efi runtime services), so while this may fix the regression, I'm
>>>>> not sure it's the right fix.
>>>>> 
>>>>> Do we need a separate pgd_alloc_kernel()?
>>>> 
>>>> So can I take the above for -rc5, or is somebody else working on a different
>>>> fix to implement pgd_alloc_kernel()?
>>> 
>>> The offensive commit "arm64: switch to generic version of pte allocation" is not
>>> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
>>> Andrew will push this out any time sooner given it is broken.
>> 
>> I'd assumed that Mike would respin these patches to implement and use
>> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
>> replace these in akpm's tree.
>> 
>> Mike, could you confirm what your plan is? I'm happy to review/test
>> updated patches for arm64.
> 
> Sorry for the delay, I'm mostly offline these days.
> 
> I wanted to understand first what is the reason for the failure. I've tried
> to reproduce it with qemu, but I failed to find a bootable configuration
> that will have PGD_SIZE != PAGE_SIZE :(
> 
> Qian Cai, can you share what is your environment and the kernel config?


https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

# dmidecode
Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: HPE
        Product Name: Apollo 70             
        Version: X1
        Wake-up Type: Power Switch
        Family: CN99XX
Mark Rutland June 11, 2019, 1:02 p.m. UTC | #10
On Tue, Jun 11, 2019 at 03:41:19PM +0300, Mike Rapoport wrote:
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > > introduced endless failures during boot like,
> > > > > > 
> > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > > -2 parent: cgroup)
> > > > > > 
> > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > > 
> > > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > > as the accounting should bypass kernel threads.
> > > > > 
> > > > > Was that assumption wrong, or is something different happening here?
> > > > > 
> > > > > > 
> > > > > > backtrace:
> > > > > >   kobject_add_internal
> > > > > >   kobject_init_and_add
> > > > > >   sysfs_slab_add+0x1a8
> > > > > >   __kmem_cache_create
> > > > > >   create_cache
> > > > > >   memcg_create_kmem_cache
> > > > > >   memcg_kmem_cache_create_func
> > > > > >   process_one_work
> > > > > >   worker_thread
> > > > > >   kthread
> > > > > > 
> > > > > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > > > > ---
> > > > > >  arch/arm64/mm/pgd.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > > --- a/arch/arm64/mm/pgd.c
> > > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > >  	if (PGD_SIZE == PAGE_SIZE)
> > > > > >  		return (pgd_t *)__get_free_page(gfp);
> > > > > >  	else
> > > > > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > > > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > > 
> > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > > not sure it's the right fix.
> > > > > 
> > > > > Do we need a separate pgd_alloc_kernel()?
> > > > 
> > > > So can I take the above for -rc5, or is somebody else working on a different
> > > > fix to implement pgd_alloc_kernel()?
> > > 
> > > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > > Andrew will push this out any time sooner given it is broken.
> > 
> > I'd assumed that Mike would respin these patches to implement and use
> > pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> > replace these in akpm's tree.
> > 
> > Mike, could you confirm what your plan is? I'm happy to review/test
> > updated patches for arm64.
> 
> Sorry for the delay, I'm mostly offline these days.
> 
> I wanted to understand first what is the reason for the failure. I've tried
> to reproduce it with qemu, but I failed to find a bootable configuration
> that will have PGD_SIZE != PAGE_SIZE :(

This is the case with 48-bit VA and 64K pages. In that case we have
three levels of table, and the PGD is 1/16th of a page, as it only needs
to resolve 9 bits of virtual address rather than 13.

If you build defconfig + ARM64_64K_PAGES=y, that should be the case:

[mark@lakrids:~/src/linux]% usekorg 8.1.0 aarch64-linux-objdump -d arch/arm64/mm/pgd.o     

arch/arm64/mm/pgd.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <pgd_alloc>:
   0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
   4:   90000000        adrp    x0, 0 <pgd_alloc>
   8:   5281b801        mov     w1, #0xdc0                      // #3520
   c:   910003fd        mov     x29, sp
  10:   f9400000        ldr     x0, [x0]
  14:   94000000        bl      0 <kmem_cache_alloc>
  18:   a8c17bfd        ldp     x29, x30, [sp], #16
  1c:   d65f03c0        ret

0000000000000020 <pgd_free>:
  20:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
  24:   90000000        adrp    x0, 0 <pgd_alloc>
  28:   910003fd        mov     x29, sp
  2c:   f9400000        ldr     x0, [x0]
  30:   94000000        bl      0 <kmem_cache_free>
  34:   a8c17bfd        ldp     x29, x30, [sp], #16
  38:   d65f03c0        ret

Disassembly of section .init.text:

0000000000000000 <pgd_cache_init>:
   0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
   4:   52804002        mov     w2, #0x200                      // #512
   8:   d2800004        mov     x4, #0x0                        // #0
   c:   910003fd        mov     x29, sp
  10:   2a0203e1        mov     w1, w2
  14:   52a00083        mov     w3, #0x40000                    // #262144
  18:   90000000        adrp    x0, 0 <pgd_cache_init>
  1c:   91000000        add     x0, x0, #0x0
  20:   94000000        bl      0 <kmem_cache_create>
  24:   90000001        adrp    x1, 0 <pgd_cache_init>
  28:   a8c17bfd        ldp     x29, x30, [sp], #16
  2c:   f9000020        str     x0, [x1]
  30:   d65f03c0        ret

Thanks,
Mark.
Mike Rapoport June 12, 2019, 6:57 a.m. UTC | #11
Hi,

On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote:
> 
> > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > Sorry for the delay, I'm mostly offline these days.
> > 
> > I wanted to understand first what is the reason for the failure. I've tried
> > to reproduce it with qemu, but I failed to find a bootable configuration
> > that will have PGD_SIZE != PAGE_SIZE :(
> > 
> > Qian Cai, can you share what is your environment and the kernel config?
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
> 
> # lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  4
> Core(s) per socket:  32
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           Cavium
> Model:               1
> Model name:          ThunderX2 99xx
> Stepping:            0x1
> BogoMIPS:            400.00
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-127
> NUMA node1 CPU(s):   128-255
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
> 
> # dmidecode
> Handle 0x0001, DMI type 1, 27 bytes
> System Information
>         Manufacturer: HPE
>         Product Name: Apollo 70             
>         Version: X1
>         Wake-up Type: Power Switch
>         Family: CN99XX
> 

Can you please also send the entire log when the failure happens?

Another question, is the problem exist with PGD_SIZE == PAGE_SIZE?
Qian Cai June 12, 2019, 6:35 p.m. UTC | #12
On Wed, 2019-06-12 at 09:57 +0300, Mike Rapoport wrote:
> Hi,
> 
> On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote:
> > 
> > > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > 
> > > Sorry for the delay, I'm mostly offline these days.
> > > 
> > > I wanted to understand first what is the reason for the failure. I've
> > > tried
> > > to reproduce it with qemu, but I failed to find a bootable configuration
> > > that will have PGD_SIZE != PAGE_SIZE :(
> > > 
> > > Qian Cai, can you share what is your environment and the kernel config?
> > 
> > https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
> > 
> > # lscpu
> > Architecture:        aarch64
> > Byte Order:          Little Endian
> > CPU(s):              256
> > On-line CPU(s) list: 0-255
> > Thread(s) per core:  4
> > Core(s) per socket:  32
> > Socket(s):           2
> > NUMA node(s):        2
> > Vendor ID:           Cavium
> > Model:               1
> > Model name:          ThunderX2 99xx
> > Stepping:            0x1
> > BogoMIPS:            400.00
> > L1d cache:           32K
> > L1i cache:           32K
> > L2 cache:            256K
> > L3 cache:            32768K
> > NUMA node0 CPU(s):   0-127
> > NUMA node1 CPU(s):   128-255
> > Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> > cpuid asimdrdm
> > 
> > # dmidecode
> > Handle 0x0001, DMI type 1, 27 bytes
> > System Information
> >         Manufacturer: HPE
> >         Product Name: Apollo 70             
> >         Version: X1
> >         Wake-up Type: Power Switch
> >         Family: CN99XX
> > 
> 
> Can you please also send the entire log when the failure happens?

https://cailca.github.io/files/dmesg.txt

> Another question, is the problem exist with PGD_SIZE == PAGE_SIZE?

No.
Mike Rapoport June 13, 2019, 12:11 p.m. UTC | #13
(added Roman)

On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > introduced endless failures during boot like,
> > > > > 
> > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > -2 parent: cgroup)
> > > > > 
> > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > 
> > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > as the accounting should bypass kernel threads.
> > > > 
> > > > Was that assumption wrong, or is something different happening here?
> > > > 
> > > > > 
> > > > > backtrace:
> > > > >   kobject_add_internal
> > > > >   kobject_init_and_add
> > > > >   sysfs_slab_add+0x1a8
> > > > >   __kmem_cache_create
> > > > >   create_cache
> > > > >   memcg_create_kmem_cache
> > > > >   memcg_kmem_cache_create_func
> > > > >   process_one_work
> > > > >   worker_thread
> > > > >   kthread
> > > > > 
> > > > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > > > ---
> > > > >  arch/arm64/mm/pgd.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > --- a/arch/arm64/mm/pgd.c
> > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > >  	if (PGD_SIZE == PAGE_SIZE)
> > > > >  		return (pgd_t *)__get_free_page(gfp);
> > > > >  	else
> > > > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > 
> > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > not sure it's the right fix.
> > > > 
> > > > Do we need a separate pgd_alloc_kernel()?
> > > 
> > > So can I take the above for -rc5, or is somebody else working on a different
> > > fix to implement pgd_alloc_kernel()?
> > 
> > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > Andrew will push this out any time sooner given it is broken.
> 
> I'd assumed that Mike would respin these patches to implement and use
> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> replace these in akpm's tree.
> 
> Mike, could you confirm what your plan is? I'm happy to review/test
> updated patches for arm64.

The log Qian Cai posted at [1] and partially cited below confirms that the
failure happens when *user* PGDs are allocated and the addition of
__GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
issue.

I'm still failing to reproduce it with qemu and I'm not really familiar
with slub/memcg code to say anything smart about it. Will keep looking.

Note, that as failures start way after efi_virtmap_init() that allocates a
PGD for efi_mm, there are no real fixes required for the original series,
except that the check for mm == &init_mm I copied for some reason from
powerpc is bogus and can be removed.

I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
we won't run into issues with memcg in the future.

[   82.125966] Freeing unused kernel memory: 28672K
[   87.940365] Checked W+X mappings: passed, no W+X pages found
[   87.946769] Run /init as init process
[   88.040040] systemd[1]: System time before build time, advancing clock.
[   88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or directory
[   88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
[   88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
+SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
+GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
default-hierarchy=legacy)
[   88.498398] systemd[1]: Detected architecture arm64.
[   88.506517] systemd[1]: Running in initial RAM disk.
[   89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
[   90.222658] random: systemd: uninitialized urandom read (16 bytes read)
[   90.230072] systemd[1]: Reached target Swap.
[   90.240205] random: systemd: uninitialized urandom read (16 bytes read)
[   90.251088] systemd[1]: Reached target Timers.
[   90.261303] random: systemd: uninitialized urandom read (16 bytes read)
[   90.271209] systemd[1]: Listening on udev Control Socket.
[   90.283238] systemd[1]: Reached target Local File Systems.
[   90.296232] systemd[1]: Reached target Slices.
[   90.307239] systemd[1]: Listening on udev Kernel Socket.
[   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error: -2 parent: cgroup)
[   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
[   90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes left
[   90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error: -2 parent: cgroup)
 
> Thanks,
> Mark.
> 

[1] https://cailca.github.io/files/dmesg.txt
Qian Cai June 13, 2019, 1:22 p.m. UTC | #14
On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote:
> The log Qian Cai posted at [1] and partially cited below confirms that the
> failure happens when *user* PGDs are allocated and the addition of
> __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> issue.
> 
> I'm still failing to reproduce it with qemu and I'm not really familiar
> with slub/memcg code to say anything smart about it. Will keep looking.
> 
> Note, that as failures start way after efi_virtmap_init() that allocates a
> PGD for efi_mm, there are no real fixes required for the original series,
> except that the check for mm == &init_mm I copied for some reason from
> powerpc is bogus and can be removed.

Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm.
For example,

[  132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd-
udevd.service) (error: -2 parent: cgroup)
[  132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted:
G        W         5.2.0-rc4-next-20190613+ #8
[  132.807356][ T1889] Hardware name: HPE Apollo
70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.9 03/01/2019
[  132.817872][ T1889] Call trace:
[  132.821017][ T1889]  dump_backtrace+0x0/0x268
[  132.825372][ T1889]  show_stack+0x20/0x2c
[  132.829380][ T1889]  dump_stack+0xb4/0x108
[  132.833475][ T1889]  pgd_alloc+0x34/0x5c
[  132.837396][ T1889]  mm_init+0x27c/0x32c
[  132.841315][ T1889]  dup_mm+0x84/0x7b4
[  132.845061][ T1889]  copy_process+0xf20/0x24cc
[  132.849500][ T1889]  _do_fork+0xa4/0x66c
[  132.853420][ T1889]  __arm64_sys_clone+0x114/0x1b4
[  132.858208][ T1889]  el0_svc_handler+0x198/0x260
[  132.862821][ T1889]  el0_svc+0x8/0xc

> 
> I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
> we won't run into issues with memcg in the future.
> 
> [   82.125966] Freeing unused kernel memory: 28672K
> [   87.940365] Checked W+X mappings: passed, no W+X pages found
> [   87.946769] Run /init as init process
> [   88.040040] systemd[1]: System time before build time, advancing clock.
> [   88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or
> directory
> [   88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
> [   88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
> +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
> +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
> default-hierarchy=legacy)
> [   88.498398] systemd[1]: Detected architecture arm64.
> [   88.506517] systemd[1]: Running in initial RAM disk.
> [   89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
> [   90.222658] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.230072] systemd[1]: Reached target Swap.
> [   90.240205] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.251088] systemd[1]: Reached target Timers.
> [   90.261303] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.271209] systemd[1]: Listening on udev Control Socket.
> [   90.283238] systemd[1]: Reached target Local File Systems.
> [   90.296232] systemd[1]: Reached target Slices.
> [   90.307239] systemd[1]: Listening on udev Kernel Socket.
> [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope)
> (error: -2 parent: cgroup)
> [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error:
> -2 parent: cgroup)
> [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-
> setup.service) (error: -2 parent: cgroup)
> [   90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes
> left
> [   90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error:
> -2 parent: cgroup)
>  
> > Thanks,
> > Mark.
> > 
> 
> [1] https://cailca.github.io/files/dmesg.txt
>
Mike Rapoport June 13, 2019, 7:44 p.m. UTC | #15
On Thu, Jun 13, 2019 at 09:22:36AM -0400, Qian Cai wrote:
> On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote:
> > The log Qian Cai posted at [1] and partially cited below confirms that the
> > failure happens when *user* PGDs are allocated and the addition of
> > __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> > issue.
> > 
> > I'm still failing to reproduce it with qemu and I'm not really familiar
> > with slub/memcg code to say anything smart about it. Will keep looking.
> > 
> > Note, that as failures start way after efi_virtmap_init() that allocates a
> > PGD for efi_mm, there are no real fixes required for the original series,
> > except that the check for mm == &init_mm I copied for some reason from
> > powerpc is bogus and can be removed.
> 
> Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm.
> For example,

Here we allocate the pgd for a user process and it should be accounted.

Actually, the whole point of changing the gfp flags in arm64::pgd_alloc()
was to enable the accounting for memory occupied by user pgds, just like
x86 and powerpc do.

> [  132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd-
> udevd.service) (error: -2 parent: cgroup)
> [  132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted:
> G        W         5.2.0-rc4-next-20190613+ #8
> [  132.807356][ T1889] Hardware name: HPE Apollo
> 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.9 03/01/2019
> [  132.817872][ T1889] Call trace:
> [  132.821017][ T1889]  dump_backtrace+0x0/0x268
> [  132.825372][ T1889]  show_stack+0x20/0x2c
> [  132.829380][ T1889]  dump_stack+0xb4/0x108
> [  132.833475][ T1889]  pgd_alloc+0x34/0x5c
> [  132.837396][ T1889]  mm_init+0x27c/0x32c
> [  132.841315][ T1889]  dup_mm+0x84/0x7b4
> [  132.845061][ T1889]  copy_process+0xf20/0x24cc
> [  132.849500][ T1889]  _do_fork+0xa4/0x66c
> [  132.853420][ T1889]  __arm64_sys_clone+0x114/0x1b4
> [  132.858208][ T1889]  el0_svc_handler+0x198/0x260
> [  132.862821][ T1889]  el0_svc+0x8/0xc
> 
> > 
> > I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
> > we won't run into issues with memcg in the future.
> > 
> > [   82.125966] Freeing unused kernel memory: 28672K
> > [   87.940365] Checked W+X mappings: passed, no W+X pages found
> > [   87.946769] Run /init as init process
> > [   88.040040] systemd[1]: System time before build time, advancing clock.
> > [   88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or
> > directory
> > [   88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
> > [   88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
> > +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
> > +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
> > default-hierarchy=legacy)
> > [   88.498398] systemd[1]: Detected architecture arm64.
> > [   88.506517] systemd[1]: Running in initial RAM disk.
> > [   89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
> > [   90.222658] random: systemd: uninitialized urandom read (16 bytes read)
> > [   90.230072] systemd[1]: Reached target Swap.
> > [   90.240205] random: systemd: uninitialized urandom read (16 bytes read)
> > [   90.251088] systemd[1]: Reached target Timers.
> > [   90.261303] random: systemd: uninitialized urandom read (16 bytes read)
> > [   90.271209] systemd[1]: Listening on udev Control Socket.
> > [   90.283238] systemd[1]: Reached target Local File Systems.
> > [   90.296232] systemd[1]: Reached target Slices.
> > [   90.307239] systemd[1]: Listening on udev Kernel Socket.
> > [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope)
> > (error: -2 parent: cgroup)
> > [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error:
> > -2 parent: cgroup)
> > [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-
> > setup.service) (error: -2 parent: cgroup)
> > [   90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes
> > left
> > [   90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error:
> > -2 parent: cgroup)
> >  
> > > Thanks,
> > > Mark.
> > > 
> > 
> > [1] https://cailca.github.io/files/dmesg.txt
> >
Mike Rapoport June 17, 2019, 3:12 p.m. UTC | #16
On Thu, Jun 13, 2019 at 03:11:01PM +0300, Mike Rapoport wrote:
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > > introduced endless failures during boot like,
> > > > > > 
> > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > > -2 parent: cgroup)
> > > > > > 
> > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > > 
> > > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > > as the accounting should bypass kernel threads.
> > > > > 
> > > > > Was that assumption wrong, or is something different happening here?
> > > > > 
> > > > > > 
> > > > > > backtrace:
> > > > > >   kobject_add_internal
> > > > > >   kobject_init_and_add
> > > > > >   sysfs_slab_add+0x1a8
> > > > > >   __kmem_cache_create
> > > > > >   create_cache
> > > > > >   memcg_create_kmem_cache
> > > > > >   memcg_kmem_cache_create_func
> > > > > >   process_one_work
> > > > > >   worker_thread
> > > > > >   kthread
> > > > > > 
> > > > > > Signed-off-by: Qian Cai <cai@lca.pw>
> > > > > > ---
> > > > > >  arch/arm64/mm/pgd.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > > --- a/arch/arm64/mm/pgd.c
> > > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > >  	if (PGD_SIZE == PAGE_SIZE)
> > > > > >  		return (pgd_t *)__get_free_page(gfp);
> > > > > >  	else
> > > > > > -		return kmem_cache_alloc(pgd_cache, gfp);
> > > > > > +		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > > 
> > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > > not sure it's the right fix.
> > > > > 
> > > > > Do we need a separate pgd_alloc_kernel()?
> > > > 
> > > > So can I take the above for -rc5, or is somebody else working on a different
> > > > fix to implement pgd_alloc_kernel()?
> > > 
> > > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > > Andrew will push this out any time sooner given it is broken.
> > 
> > I'd assumed that Mike would respin these patches to implement and use
> > pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> > replace these in akpm's tree.
> > 
> > Mike, could you confirm what your plan is? I'm happy to review/test
> > updated patches for arm64.
> 
> The log Qian Cai posted at [1] and partially cited below confirms that the
> failure happens when *user* PGDs are allocated and the addition of
> __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> issue.

Indeed the accounting of the PGD memory uncovered a dangling pointer to
pgd_cache :)

The pgd_cache was initialized twice and it made memcg and slub sysfs go
nuts. To be frank, I've got lost in their cross-initialization,
cross-referencing and update sequences, but for sure extra initialization
of pgd_cache was bogus.

I've double checked the 'if (mm == &init_mm)' and it's not needed. The EFI
PGD is allocated before memcg is up and other kernel allocations of pgd (if
we'll have any) would be bypassed by memcg_kmem_bypass().

Andrew, can you please add the patch below as an incremental fix?

With this the arm64::pgd_alloc() should be in the right shape.


From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Mon, 17 Jun 2019 17:37:43 +0300
Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice

When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
memory. That cache was initialized twice: first through
pgtable_cache_init() alias and then as an override for weak
pgd_cache_init().

After enabling accounting for the PGD memory, this created a confusion for
memcg and slub sysfs code which resulted in the following errors:

[   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)

Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
initialization in pgd_cache_init() resolves the problem and allows
accounting of PGD memory.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/arm64/include/asm/pgtable.h | 3 +--
 arch/arm64/mm/pgd.c              | 5 +----
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3191b9f..c7a802d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -851,8 +851,7 @@ extern int kern_addr_valid(unsigned long addr);
 
 #include <asm-generic/pgtable.h>
 
-void pgd_cache_init(void);
-#define pgtable_cache_init	pgd_cache_init
+static inline void pgtable_cache_init(void) { }
 
 /*
  * On AArch64, the cache coherency is handled via the set_pte_at() function.
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 53c48f5..3f0a744 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -32,13 +32,10 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	gfp_t gfp = GFP_PGTABLE_USER;
 
-	if (unlikely(mm == &init_mm))
-		gfp = GFP_PGTABLE_KERNEL;
-
 	if (PGD_SIZE == PAGE_SIZE)
 		return (pgd_t *)__get_free_page(gfp);
 	else
-		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
+		return kmem_cache_alloc(pgd_cache, gfp);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
Will Deacon June 17, 2019, 4:36 p.m. UTC | #17
Hi Mike,

On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote:
> Andrew, can you please add the patch below as an incremental fix?
> 
> With this the arm64::pgd_alloc() should be in the right shape.
> 
> 
> From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@linux.ibm.com>
> Date: Mon, 17 Jun 2019 17:37:43 +0300
> Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice
> 
> When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
> memory. That cache was initialized twice: first through
> pgtable_cache_init() alias and then as an override for weak
> pgd_cache_init().
> 
> After enabling accounting for the PGD memory, this created a confusion for
> memcg and slub sysfs code which resulted in the following errors:
> 
> [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
> 
> Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
> initialization in pgd_cache_init() resolves the problem and allows
> accounting of PGD memory.
> 
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 3 +--
>  arch/arm64/mm/pgd.c              | 5 +----
>  2 files changed, 2 insertions(+), 6 deletions(-)

Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache
during mm initialization") due to an unlucky naming conflict!

In which case, I'd actually prefer to take this fix asap via the arm64
tree. Is that ok?

Will
Mike Rapoport June 18, 2019, 6:12 a.m. UTC | #18
On Mon, Jun 17, 2019 at 05:36:30PM +0100, Will Deacon wrote:
> Hi Mike,
> 
> On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote:
> > Andrew, can you please add the patch below as an incremental fix?
> > 
> > With this the arm64::pgd_alloc() should be in the right shape.
> > 
> > 
> > From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > Date: Mon, 17 Jun 2019 17:37:43 +0300
> > Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice
> > 
> > When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
> > memory. That cache was initialized twice: first through
> > pgtable_cache_init() alias and then as an override for weak
> > pgd_cache_init().
> > 
> > After enabling accounting for the PGD memory, this created a confusion for
> > memcg and slub sysfs code which resulted in the following errors:
> > 
> > [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> > [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> > [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
> > 
> > Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
> > initialization in pgd_cache_init() resolves the problem and allows
> > accounting of PGD memory.
> > 
> > Reported-by: Qian Cai <cai@lca.pw>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 3 +--
> >  arch/arm64/mm/pgd.c              | 5 +----
> >  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache
> during mm initialization") due to an unlucky naming conflict!
> 
> In which case, I'd actually prefer to take this fix asap via the arm64
> tree. Is that ok?

I suppose so, it just won't apply as is. Would you like a patch against the
current upstream?

> Will
Will Deacon June 18, 2019, 6:54 a.m. UTC | #19
On Tue, Jun 18, 2019 at 09:12:59AM +0300, Mike Rapoport wrote:
> On Mon, Jun 17, 2019 at 05:36:30PM +0100, Will Deacon wrote:
> > On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote:
> > > Andrew, can you please add the patch below as an incremental fix?
> > > 
> > > With this the arm64::pgd_alloc() should be in the right shape.
> > > 
> > > 
> > > From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > Date: Mon, 17 Jun 2019 17:37:43 +0300
> > > Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice
> > > 
> > > When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
> > > memory. That cache was initialized twice: first through
> > > pgtable_cache_init() alias and then as an override for weak
> > > pgd_cache_init().
> > > 
> > > After enabling accounting for the PGD memory, this created a confusion for
> > > memcg and slub sysfs code which resulted in the following errors:
> > > 
> > > [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> > > [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> > > [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
> > > 
> > > Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
> > > initialization in pgd_cache_init() resolves the problem and allows
> > > accounting of PGD memory.
> > > 
> > > Reported-by: Qian Cai <cai@lca.pw>
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > ---
> > >  arch/arm64/include/asm/pgtable.h | 3 +--
> > >  arch/arm64/mm/pgd.c              | 5 +----
> > >  2 files changed, 2 insertions(+), 6 deletions(-)
> > 
> > Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache
> > during mm initialization") due to an unlucky naming conflict!
> > 
> > In which case, I'd actually prefer to take this fix asap via the arm64
> > tree. Is that ok?
> 
> I suppose so, it just won't apply as is. Would you like a patch against the
> current upstream?

Yes, please. I'm assuming it's a straightforward change (please shout if it
isn't).

Will
diff mbox series

Patch

diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 769516cb6677..53c48f5c8765 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -38,7 +38,7 @@  pgd_t *pgd_alloc(struct mm_struct *mm)
 	if (PGD_SIZE == PAGE_SIZE)
 		return (pgd_t *)__get_free_page(gfp);
 	else
-		return kmem_cache_alloc(pgd_cache, gfp);
+		return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)