Message ID | 1559656836-24940-1-git-send-email-cai@lca.pw (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [-next] arm64/mm: fix a bogus GFP flag in pgd_alloc() | expand |
On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > The commit "arm64: switch to generic version of pte allocation" > introduced endless failures during boot like, > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > -2 parent: cgroup) > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > and then later memcg finds out those don't belong to any cgroup. Mike, I understood from [1] that this wasn't expected to be a problem, as the accounting should bypass kernel threads. Was that assumption wrong, or is something different happening here? > > backtrace: > kobject_add_internal > kobject_init_and_add > sysfs_slab_add+0x1a8 > __kmem_cache_create > create_cache > memcg_create_kmem_cache > memcg_kmem_cache_create_func > process_one_work > worker_thread > kthread > > Signed-off-by: Qian Cai <cai@lca.pw> > --- > arch/arm64/mm/pgd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > index 769516cb6677..53c48f5c8765 100644 > --- a/arch/arm64/mm/pgd.c > +++ b/arch/arm64/mm/pgd.c > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > if (PGD_SIZE == PAGE_SIZE) > return (pgd_t *)__get_free_page(gfp); > else > - return kmem_cache_alloc(pgd_cache, gfp); > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); This is used to allocate PGDs for both user and kernel pagetables (e.g. for the efi runtime services), so while this may fix the regression, I'm not sure it's the right fix. Do we need a separate pgd_alloc_kernel()? Thanks, Mark. [1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > The commit "arm64: switch to generic version of pte allocation" > > introduced endless failures during boot like, > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > -2 parent: cgroup) > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > and then later memcg finds out those don't belong to any cgroup. > > Mike, I understood from [1] that this wasn't expected to be a problem, > as the accounting should bypass kernel threads. > > Was that assumption wrong, or is something different happening here? > > > backtrace: > > kobject_add_internal > > kobject_init_and_add > > sysfs_slab_add+0x1a8 > > __kmem_cache_create > > create_cache > > memcg_create_kmem_cache > > memcg_kmem_cache_create_func > > process_one_work > > worker_thread > > kthread > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > --- > > arch/arm64/mm/pgd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > index 769516cb6677..53c48f5c8765 100644 > > --- a/arch/arm64/mm/pgd.c > > +++ b/arch/arm64/mm/pgd.c > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > if (PGD_SIZE == PAGE_SIZE) > > return (pgd_t *)__get_free_page(gfp); > > else > > - return kmem_cache_alloc(pgd_cache, gfp); > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > for the efi runtime services), so while this may fix the regression, I'm > not sure it's the right fix. I see that since [1], pgd_alloc() was updated to special-case the init_mm, which is not sufficient for cases like: efi_mm.pgd = pgd_alloc(&efi_mm) ... which occurs in a kthread. So let's have a pgd_alloc_kernel() to make that explicit. Thanks, Mark.
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > The commit "arm64: switch to generic version of pte allocation" > > introduced endless failures during boot like, > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > -2 parent: cgroup) > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > and then later memcg finds out those don't belong to any cgroup. > > Mike, I understood from [1] that this wasn't expected to be a problem, > as the accounting should bypass kernel threads. > > Was that assumption wrong, or is something different happening here? I was under impression that all allocations are going through __memcg_kmem_charge() which does the bypass. Apparently, it's not the case :( > > > > backtrace: > > kobject_add_internal > > kobject_init_and_add > > sysfs_slab_add+0x1a8 > > __kmem_cache_create > > create_cache > > memcg_create_kmem_cache > > memcg_kmem_cache_create_func > > process_one_work > > worker_thread > > kthread > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > --- > > arch/arm64/mm/pgd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > index 769516cb6677..53c48f5c8765 100644 > > --- a/arch/arm64/mm/pgd.c > > +++ b/arch/arm64/mm/pgd.c > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > if (PGD_SIZE == PAGE_SIZE) > > return (pgd_t *)__get_free_page(gfp); > > else > > - return kmem_cache_alloc(pgd_cache, gfp); > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > for the efi runtime services), so while this may fix the regression, I'm > not sure it's the right fix. Me neither. > Do we need a separate pgd_alloc_kernel()? I'd like to take a closer look at memcg paths once again before adding pgd_alloc_kernel(). Johannes, Roman, can you please advise anything? > Thanks, > Mark. > > [1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx >
On Tue, Jun 04, 2019 at 03:30:20PM +0100, Mark Rutland wrote: > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > The commit "arm64: switch to generic version of pte allocation" > > > introduced endless failures during boot like, > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > -2 parent: cgroup) > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > and then later memcg finds out those don't belong to any cgroup. > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > as the accounting should bypass kernel threads. > > > > Was that assumption wrong, or is something different happening here? > > > > > backtrace: > > > kobject_add_internal > > > kobject_init_and_add > > > sysfs_slab_add+0x1a8 > > > __kmem_cache_create > > > create_cache > > > memcg_create_kmem_cache > > > memcg_kmem_cache_create_func > > > process_one_work > > > worker_thread > > > kthread > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > --- > > > arch/arm64/mm/pgd.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > index 769516cb6677..53c48f5c8765 100644 > > > --- a/arch/arm64/mm/pgd.c > > > +++ b/arch/arm64/mm/pgd.c > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > if (PGD_SIZE == PAGE_SIZE) > > > return (pgd_t *)__get_free_page(gfp); > > > else > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > for the efi runtime services), so while this may fix the regression, I'm > > not sure it's the right fix. > > I see that since [1], pgd_alloc() was updated to special-case the > init_mm, which is not sufficient for cases like: > > efi_mm.pgd = pgd_alloc(&efi_mm) > > ... which occurs in a kthread. > > So let's have a pgd_alloc_kernel() to make that explicit. I've hit "send" before seeing this one :) Well, to be completely on the safe side an explicit pgd_alloc_kernel() sounds right. Then it won't be subject to future changes in memcg and will always "Do The Right Thing". > Thanks, > Mark. >
On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > The commit "arm64: switch to generic version of pte allocation" > > introduced endless failures during boot like, > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > -2 parent: cgroup) > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > and then later memcg finds out those don't belong to any cgroup. > > Mike, I understood from [1] that this wasn't expected to be a problem, > as the accounting should bypass kernel threads. > > Was that assumption wrong, or is something different happening here? > > > > > backtrace: > > kobject_add_internal > > kobject_init_and_add > > sysfs_slab_add+0x1a8 > > __kmem_cache_create > > create_cache > > memcg_create_kmem_cache > > memcg_kmem_cache_create_func > > process_one_work > > worker_thread > > kthread > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > --- > > arch/arm64/mm/pgd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > index 769516cb6677..53c48f5c8765 100644 > > --- a/arch/arm64/mm/pgd.c > > +++ b/arch/arm64/mm/pgd.c > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > if (PGD_SIZE == PAGE_SIZE) > > return (pgd_t *)__get_free_page(gfp); > > else > > - return kmem_cache_alloc(pgd_cache, gfp); > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > for the efi runtime services), so while this may fix the regression, I'm > not sure it's the right fix. > > Do we need a separate pgd_alloc_kernel()? So can I take the above for -rc5, or is somebody else working on a different fix to implement pgd_alloc_kernel()? /confused Will
On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > The commit "arm64: switch to generic version of pte allocation" > > > introduced endless failures during boot like, > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > -2 parent: cgroup) > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > and then later memcg finds out those don't belong to any cgroup. > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > as the accounting should bypass kernel threads. > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > backtrace: > > > kobject_add_internal > > > kobject_init_and_add > > > sysfs_slab_add+0x1a8 > > > __kmem_cache_create > > > create_cache > > > memcg_create_kmem_cache > > > memcg_kmem_cache_create_func > > > process_one_work > > > worker_thread > > > kthread > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > --- > > > arch/arm64/mm/pgd.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > index 769516cb6677..53c48f5c8765 100644 > > > --- a/arch/arm64/mm/pgd.c > > > +++ b/arch/arm64/mm/pgd.c > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > if (PGD_SIZE == PAGE_SIZE) > > > return (pgd_t *)__get_free_page(gfp); > > > else > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > for the efi runtime services), so while this may fix the regression, I'm > > not sure it's the right fix. > > > > Do we need a separate pgd_alloc_kernel()? > > So can I take the above for -rc5, or is somebody else working on a different > fix to implement pgd_alloc_kernel()? The offensive commit "arm64: switch to generic version of pte allocation" is not yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt Andrew will push this out any time sooner given it is broken.
On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > > The commit "arm64: switch to generic version of pte allocation" > > > > introduced endless failures during boot like, > > > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > > -2 parent: cgroup) > > > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > > and then later memcg finds out those don't belong to any cgroup. > > > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > > as the accounting should bypass kernel threads. > > > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > > > > backtrace: > > > > kobject_add_internal > > > > kobject_init_and_add > > > > sysfs_slab_add+0x1a8 > > > > __kmem_cache_create > > > > create_cache > > > > memcg_create_kmem_cache > > > > memcg_kmem_cache_create_func > > > > process_one_work > > > > worker_thread > > > > kthread > > > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > > --- > > > > arch/arm64/mm/pgd.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > > index 769516cb6677..53c48f5c8765 100644 > > > > --- a/arch/arm64/mm/pgd.c > > > > +++ b/arch/arm64/mm/pgd.c > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > > if (PGD_SIZE == PAGE_SIZE) > > > > return (pgd_t *)__get_free_page(gfp); > > > > else > > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > > for the efi runtime services), so while this may fix the regression, I'm > > > not sure it's the right fix. > > > > > > Do we need a separate pgd_alloc_kernel()? > > > > So can I take the above for -rc5, or is somebody else working on a different > > fix to implement pgd_alloc_kernel()? > > The offensive commit "arm64: switch to generic version of pte allocation" is not > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt > Andrew will push this out any time sooner given it is broken. I'd assumed that Mike would respin these patches to implement and use pgd_alloc_kernel() (or take gfp flags) and the updated patches would replace these in akpm's tree. Mike, could you confirm what your plan is? I'm happy to review/test updated patches for arm64. Thanks, Mark.
On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote: > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > > > The commit "arm64: switch to generic version of pte allocation" > > > > > introduced endless failures during boot like, > > > > > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > > > -2 parent: cgroup) > > > > > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > > > and then later memcg finds out those don't belong to any cgroup. > > > > > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > > > as the accounting should bypass kernel threads. > > > > > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > > > > > > > backtrace: > > > > > kobject_add_internal > > > > > kobject_init_and_add > > > > > sysfs_slab_add+0x1a8 > > > > > __kmem_cache_create > > > > > create_cache > > > > > memcg_create_kmem_cache > > > > > memcg_kmem_cache_create_func > > > > > process_one_work > > > > > worker_thread > > > > > kthread > > > > > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > > > --- > > > > > arch/arm64/mm/pgd.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > > > index 769516cb6677..53c48f5c8765 100644 > > > > > --- a/arch/arm64/mm/pgd.c > > > > > +++ b/arch/arm64/mm/pgd.c > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > > > if (PGD_SIZE == PAGE_SIZE) > > > > > return (pgd_t *)__get_free_page(gfp); > > > > > else > > > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > > > for the efi runtime services), so while this may fix the regression, I'm > > > > not sure it's the right fix. > > > > > > > > Do we need a separate pgd_alloc_kernel()? > > > > > > So can I take the above for -rc5, or is somebody else working on a different > > > fix to implement pgd_alloc_kernel()? > > > > The offensive commit "arm64: switch to generic version of pte allocation" is not > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt > > Andrew will push this out any time sooner given it is broken. > > I'd assumed that Mike would respin these patches to implement and use > pgd_alloc_kernel() (or take gfp flags) and the updated patches would > replace these in akpm's tree. > > Mike, could you confirm what your plan is? I'm happy to review/test > updated patches for arm64. Sorry for the delay, I'm mostly offline these days. I wanted to understand first what is the reason for the failure. I've tried to reproduce it with qemu, but I failed to find a bootable configuration that will have PGD_SIZE != PAGE_SIZE :( Qian Cai, can you share what is your environment and the kernel config? > Thanks, > Mark. >
> On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote: > > On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote: >> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: >>> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: >>>> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: >>>>> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: >>>>>> The commit "arm64: switch to generic version of pte allocation" >>>>>> introduced endless failures during boot like, >>>>>> >>>>>> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: >>>>>> -2 parent: cgroup) >>>>>> >>>>>> It turns out __GFP_ACCOUNT is passed to kernel page table allocations >>>>>> and then later memcg finds out those don't belong to any cgroup. >>>>> >>>>> Mike, I understood from [1] that this wasn't expected to be a problem, >>>>> as the accounting should bypass kernel threads. >>>>> >>>>> Was that assumption wrong, or is something different happening here? >>>>> >>>>>> >>>>>> backtrace: >>>>>> kobject_add_internal >>>>>> kobject_init_and_add >>>>>> sysfs_slab_add+0x1a8 >>>>>> __kmem_cache_create >>>>>> create_cache >>>>>> memcg_create_kmem_cache >>>>>> memcg_kmem_cache_create_func >>>>>> process_one_work >>>>>> worker_thread >>>>>> kthread >>>>>> >>>>>> Signed-off-by: Qian Cai <cai@lca.pw> >>>>>> --- >>>>>> arch/arm64/mm/pgd.c | 2 +- >>>>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c >>>>>> index 769516cb6677..53c48f5c8765 100644 >>>>>> --- a/arch/arm64/mm/pgd.c >>>>>> +++ b/arch/arm64/mm/pgd.c >>>>>> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) >>>>>> if (PGD_SIZE == PAGE_SIZE) >>>>>> return (pgd_t *)__get_free_page(gfp); >>>>>> else >>>>>> - return kmem_cache_alloc(pgd_cache, gfp); >>>>>> + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); >>>>> >>>>> This is used to allocate PGDs for both user and kernel pagetables (e.g. >>>>> for the efi runtime services), so while this may fix the regression, I'm >>>>> not sure it's the right fix. >>>>> >>>>> Do we need a separate pgd_alloc_kernel()? >>>> >>>> So can I take the above for -rc5, or is somebody else working on a different >>>> fix to implement pgd_alloc_kernel()? >>> >>> The offensive commit "arm64: switch to generic version of pte allocation" is not >>> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt >>> Andrew will push this out any time sooner given it is broken. >> >> I'd assumed that Mike would respin these patches to implement and use >> pgd_alloc_kernel() (or take gfp flags) and the updated patches would >> replace these in akpm's tree. >> >> Mike, could you confirm what your plan is? I'm happy to review/test >> updated patches for arm64. > > Sorry for the delay, I'm mostly offline these days. > > I wanted to understand first what is the reason for the failure. I've tried > to reproduce it with qemu, but I failed to find a bootable configuration > that will have PGD_SIZE != PAGE_SIZE :( > > Qian Cai, can you share what is your environment and the kernel config? https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config # lscpu Architecture: aarch64 Byte Order: Little Endian CPU(s): 256 On-line CPU(s) list: 0-255 Thread(s) per core: 4 Core(s) per socket: 32 Socket(s): 2 NUMA node(s): 2 Vendor ID: Cavium Model: 1 Model name: ThunderX2 99xx Stepping: 0x1 BogoMIPS: 400.00 L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 32768K NUMA node0 CPU(s): 0-127 NUMA node1 CPU(s): 128-255 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm # dmidecode Handle 0x0001, DMI type 1, 27 bytes System Information Manufacturer: HPE Product Name: Apollo 70 Version: X1 Wake-up Type: Power Switch Family: CN99XX
On Tue, Jun 11, 2019 at 03:41:19PM +0300, Mike Rapoport wrote: > On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote: > > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: > > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > > > > The commit "arm64: switch to generic version of pte allocation" > > > > > > introduced endless failures during boot like, > > > > > > > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > > > > -2 parent: cgroup) > > > > > > > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > > > > and then later memcg finds out those don't belong to any cgroup. > > > > > > > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > > > > as the accounting should bypass kernel threads. > > > > > > > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > > > > > > > > > > backtrace: > > > > > > kobject_add_internal > > > > > > kobject_init_and_add > > > > > > sysfs_slab_add+0x1a8 > > > > > > __kmem_cache_create > > > > > > create_cache > > > > > > memcg_create_kmem_cache > > > > > > memcg_kmem_cache_create_func > > > > > > process_one_work > > > > > > worker_thread > > > > > > kthread > > > > > > > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > > > > --- > > > > > > arch/arm64/mm/pgd.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > > > > index 769516cb6677..53c48f5c8765 100644 > > > > > > --- a/arch/arm64/mm/pgd.c > > > > > > +++ b/arch/arm64/mm/pgd.c > > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > > > > if (PGD_SIZE == PAGE_SIZE) > > > > > > return (pgd_t *)__get_free_page(gfp); > > > > > > else > > > > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > > > > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > > > > for the efi runtime services), so while this may fix the regression, I'm > > > > > not sure it's the right fix. > > > > > > > > > > Do we need a separate pgd_alloc_kernel()? > > > > > > > > So can I take the above for -rc5, or is somebody else working on a different > > > > fix to implement pgd_alloc_kernel()? > > > > > > The offensive commit "arm64: switch to generic version of pte allocation" is not > > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt > > > Andrew will push this out any time sooner given it is broken. > > > > I'd assumed that Mike would respin these patches to implement and use > > pgd_alloc_kernel() (or take gfp flags) and the updated patches would > > replace these in akpm's tree. > > > > Mike, could you confirm what your plan is? I'm happy to review/test > > updated patches for arm64. > > Sorry for the delay, I'm mostly offline these days. > > I wanted to understand first what is the reason for the failure. I've tried > to reproduce it with qemu, but I failed to find a bootable configuration > that will have PGD_SIZE != PAGE_SIZE :( This is the case with 48-bit VA and 64K pages. In that case we have three levels of table, and the PGD is 1/16th of a page, as it only needs to resolve 9 bits of virtual address rather than 13. If you build defconfig + ARM64_64K_PAGES=y, that should be the case: [mark@lakrids:~/src/linux]% usekorg 8.1.0 aarch64-linux-objdump -d arch/arm64/mm/pgd.o arch/arm64/mm/pgd.o: file format elf64-littleaarch64 Disassembly of section .text: 0000000000000000 <pgd_alloc>: 0: a9bf7bfd stp x29, x30, [sp, #-16]! 4: 90000000 adrp x0, 0 <pgd_alloc> 8: 5281b801 mov w1, #0xdc0 // #3520 c: 910003fd mov x29, sp 10: f9400000 ldr x0, [x0] 14: 94000000 bl 0 <kmem_cache_alloc> 18: a8c17bfd ldp x29, x30, [sp], #16 1c: d65f03c0 ret 0000000000000020 <pgd_free>: 20: a9bf7bfd stp x29, x30, [sp, #-16]! 24: 90000000 adrp x0, 0 <pgd_alloc> 28: 910003fd mov x29, sp 2c: f9400000 ldr x0, [x0] 30: 94000000 bl 0 <kmem_cache_free> 34: a8c17bfd ldp x29, x30, [sp], #16 38: d65f03c0 ret Disassembly of section .init.text: 0000000000000000 <pgd_cache_init>: 0: a9bf7bfd stp x29, x30, [sp, #-16]! 4: 52804002 mov w2, #0x200 // #512 8: d2800004 mov x4, #0x0 // #0 c: 910003fd mov x29, sp 10: 2a0203e1 mov w1, w2 14: 52a00083 mov w3, #0x40000 // #262144 18: 90000000 adrp x0, 0 <pgd_cache_init> 1c: 91000000 add x0, x0, #0x0 20: 94000000 bl 0 <kmem_cache_create> 24: 90000001 adrp x1, 0 <pgd_cache_init> 28: a8c17bfd ldp x29, x30, [sp], #16 2c: f9000020 str x0, [x1] 30: d65f03c0 ret Thanks, Mark.
Hi, On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote: > > > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > Sorry for the delay, I'm mostly offline these days. > > > > I wanted to understand first what is the reason for the failure. I've tried > > to reproduce it with qemu, but I failed to find a bootable configuration > > that will have PGD_SIZE != PAGE_SIZE :( > > > > Qian Cai, can you share what is your environment and the kernel config? > > https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config > > # lscpu > Architecture: aarch64 > Byte Order: Little Endian > CPU(s): 256 > On-line CPU(s) list: 0-255 > Thread(s) per core: 4 > Core(s) per socket: 32 > Socket(s): 2 > NUMA node(s): 2 > Vendor ID: Cavium > Model: 1 > Model name: ThunderX2 99xx > Stepping: 0x1 > BogoMIPS: 400.00 > L1d cache: 32K > L1i cache: 32K > L2 cache: 256K > L3 cache: 32768K > NUMA node0 CPU(s): 0-127 > NUMA node1 CPU(s): 128-255 > Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm > > # dmidecode > Handle 0x0001, DMI type 1, 27 bytes > System Information > Manufacturer: HPE > Product Name: Apollo 70 > Version: X1 > Wake-up Type: Power Switch > Family: CN99XX > Can you please also send the entire log when the failure happens? Another question, is the problem exist with PGD_SIZE == PAGE_SIZE?
On Wed, 2019-06-12 at 09:57 +0300, Mike Rapoport wrote: > Hi, > > On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote: > > > > > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > > > Sorry for the delay, I'm mostly offline these days. > > > > > > I wanted to understand first what is the reason for the failure. I've > > > tried > > > to reproduce it with qemu, but I failed to find a bootable configuration > > > that will have PGD_SIZE != PAGE_SIZE :( > > > > > > Qian Cai, can you share what is your environment and the kernel config? > > > > https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config > > > > # lscpu > > Architecture: aarch64 > > Byte Order: Little Endian > > CPU(s): 256 > > On-line CPU(s) list: 0-255 > > Thread(s) per core: 4 > > Core(s) per socket: 32 > > Socket(s): 2 > > NUMA node(s): 2 > > Vendor ID: Cavium > > Model: 1 > > Model name: ThunderX2 99xx > > Stepping: 0x1 > > BogoMIPS: 400.00 > > L1d cache: 32K > > L1i cache: 32K > > L2 cache: 256K > > L3 cache: 32768K > > NUMA node0 CPU(s): 0-127 > > NUMA node1 CPU(s): 128-255 > > Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics > > cpuid asimdrdm > > > > # dmidecode > > Handle 0x0001, DMI type 1, 27 bytes > > System Information > > Manufacturer: HPE > > Product Name: Apollo 70 > > Version: X1 > > Wake-up Type: Power Switch > > Family: CN99XX > > > > Can you please also send the entire log when the failure happens? https://cailca.github.io/files/dmesg.txt > Another question, is the problem exist with PGD_SIZE == PAGE_SIZE? No.
(added Roman) On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote: > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > > > The commit "arm64: switch to generic version of pte allocation" > > > > > introduced endless failures during boot like, > > > > > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > > > -2 parent: cgroup) > > > > > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > > > and then later memcg finds out those don't belong to any cgroup. > > > > > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > > > as the accounting should bypass kernel threads. > > > > > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > > > > > > > backtrace: > > > > > kobject_add_internal > > > > > kobject_init_and_add > > > > > sysfs_slab_add+0x1a8 > > > > > __kmem_cache_create > > > > > create_cache > > > > > memcg_create_kmem_cache > > > > > memcg_kmem_cache_create_func > > > > > process_one_work > > > > > worker_thread > > > > > kthread > > > > > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > > > --- > > > > > arch/arm64/mm/pgd.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > > > index 769516cb6677..53c48f5c8765 100644 > > > > > --- a/arch/arm64/mm/pgd.c > > > > > +++ b/arch/arm64/mm/pgd.c > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > > > if (PGD_SIZE == PAGE_SIZE) > > > > > return (pgd_t *)__get_free_page(gfp); > > > > > else > > > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > > > for the efi runtime services), so while this may fix the regression, I'm > > > > not sure it's the right fix. > > > > > > > > Do we need a separate pgd_alloc_kernel()? > > > > > > So can I take the above for -rc5, or is somebody else working on a different > > > fix to implement pgd_alloc_kernel()? > > > > The offensive commit "arm64: switch to generic version of pte allocation" is not > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt > > Andrew will push this out any time sooner given it is broken. > > I'd assumed that Mike would respin these patches to implement and use > pgd_alloc_kernel() (or take gfp flags) and the updated patches would > replace these in akpm's tree. > > Mike, could you confirm what your plan is? I'm happy to review/test > updated patches for arm64. The log Qian Cai posted at [1] and partially cited below confirms that the failure happens when *user* PGDs are allocated and the addition of __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another issue. I'm still failing to reproduce it with qemu and I'm not really familiar with slub/memcg code to say anything smart about it. Will keep looking. Note, that as failures start way after efi_virtmap_init() that allocates a PGD for efi_mm, there are no real fixes required for the original series, except that the check for mm == &init_mm I copied for some reason from powerpc is bogus and can be removed. I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure we won't run into issues with memcg in the future. [ 82.125966] Freeing unused kernel memory: 28672K [ 87.940365] Checked W+X mappings: passed, no W+X pages found [ 87.946769] Run /init as init process [ 88.040040] systemd[1]: System time before build time, advancing clock. [ 88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or directory [ 88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left [ 88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy) [ 88.498398] systemd[1]: Detected architecture arm64. [ 88.506517] systemd[1]: Running in initial RAM disk. [ 89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left [ 90.222658] random: systemd: uninitialized urandom read (16 bytes read) [ 90.230072] systemd[1]: Reached target Swap. [ 90.240205] random: systemd: uninitialized urandom read (16 bytes read) [ 90.251088] systemd[1]: Reached target Timers. [ 90.261303] random: systemd: uninitialized urandom read (16 bytes read) [ 90.271209] systemd[1]: Listening on udev Control Socket. [ 90.283238] systemd[1]: Reached target Local File Systems. [ 90.296232] systemd[1]: Reached target Slices. [ 90.307239] systemd[1]: Listening on udev Kernel Socket. [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error: -2 parent: cgroup) [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup) [ 90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes left [ 90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error: -2 parent: cgroup) > Thanks, > Mark. > [1] https://cailca.github.io/files/dmesg.txt
On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote: > The log Qian Cai posted at [1] and partially cited below confirms that the > failure happens when *user* PGDs are allocated and the addition of > __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another > issue. > > I'm still failing to reproduce it with qemu and I'm not really familiar > with slub/memcg code to say anything smart about it. Will keep looking. > > Note, that as failures start way after efi_virtmap_init() that allocates a > PGD for efi_mm, there are no real fixes required for the original series, > except that the check for mm == &init_mm I copied for some reason from > powerpc is bogus and can be removed. Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm. For example, [ 132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd- udevd.service) (error: -2 parent: cgroup) [ 132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted: G W 5.2.0-rc4-next-20190613+ #8 [ 132.807356][ T1889] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.9 03/01/2019 [ 132.817872][ T1889] Call trace: [ 132.821017][ T1889] dump_backtrace+0x0/0x268 [ 132.825372][ T1889] show_stack+0x20/0x2c [ 132.829380][ T1889] dump_stack+0xb4/0x108 [ 132.833475][ T1889] pgd_alloc+0x34/0x5c [ 132.837396][ T1889] mm_init+0x27c/0x32c [ 132.841315][ T1889] dup_mm+0x84/0x7b4 [ 132.845061][ T1889] copy_process+0xf20/0x24cc [ 132.849500][ T1889] _do_fork+0xa4/0x66c [ 132.853420][ T1889] __arm64_sys_clone+0x114/0x1b4 [ 132.858208][ T1889] el0_svc_handler+0x198/0x260 [ 132.862821][ T1889] el0_svc+0x8/0xc > > I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure > we won't run into issues with memcg in the future. > > [ 82.125966] Freeing unused kernel memory: 28672K > [ 87.940365] Checked W+X mappings: passed, no W+X pages found > [ 87.946769] Run /init as init process > [ 88.040040] systemd[1]: System time before build time, advancing clock. > [ 88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or > directory > [ 88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left > [ 88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT > +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT > +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 > default-hierarchy=legacy) > [ 88.498398] systemd[1]: Detected architecture arm64. > [ 88.506517] systemd[1]: Running in initial RAM disk. > [ 89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left > [ 90.222658] random: systemd: uninitialized urandom read (16 bytes read) > [ 90.230072] systemd[1]: Reached target Swap. > [ 90.240205] random: systemd: uninitialized urandom read (16 bytes read) > [ 90.251088] systemd[1]: Reached target Timers. > [ 90.261303] random: systemd: uninitialized urandom read (16 bytes read) > [ 90.271209] systemd[1]: Listening on udev Control Socket. > [ 90.283238] systemd[1]: Reached target Local File Systems. > [ 90.296232] systemd[1]: Reached target Slices. > [ 90.307239] systemd[1]: Listening on udev Kernel Socket. > [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) > (error: -2 parent: cgroup) > [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error: > -2 parent: cgroup) > [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles- > setup.service) (error: -2 parent: cgroup) > [ 90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes > left > [ 90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error: > -2 parent: cgroup) > > > Thanks, > > Mark. > > > > [1] https://cailca.github.io/files/dmesg.txt >
On Thu, Jun 13, 2019 at 09:22:36AM -0400, Qian Cai wrote: > On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote: > > The log Qian Cai posted at [1] and partially cited below confirms that the > > failure happens when *user* PGDs are allocated and the addition of > > __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another > > issue. > > > > I'm still failing to reproduce it with qemu and I'm not really familiar > > with slub/memcg code to say anything smart about it. Will keep looking. > > > > Note, that as failures start way after efi_virtmap_init() that allocates a > > PGD for efi_mm, there are no real fixes required for the original series, > > except that the check for mm == &init_mm I copied for some reason from > > powerpc is bogus and can be removed. > > Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm. > For example, Here we allocate the pgd for a user process and it should be accounted. Actually, the whole point of changing the gfp flags in arm64::pgd_alloc() was to enable the accounting for memory occupied by user pgds, just like x86 and powerpc do. > [ 132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd- > udevd.service) (error: -2 parent: cgroup) > [ 132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted: > G W 5.2.0-rc4-next-20190613+ #8 > [ 132.807356][ T1889] Hardware name: HPE Apollo > 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.9 03/01/2019 > [ 132.817872][ T1889] Call trace: > [ 132.821017][ T1889] dump_backtrace+0x0/0x268 > [ 132.825372][ T1889] show_stack+0x20/0x2c > [ 132.829380][ T1889] dump_stack+0xb4/0x108 > [ 132.833475][ T1889] pgd_alloc+0x34/0x5c > [ 132.837396][ T1889] mm_init+0x27c/0x32c > [ 132.841315][ T1889] dup_mm+0x84/0x7b4 > [ 132.845061][ T1889] copy_process+0xf20/0x24cc > [ 132.849500][ T1889] _do_fork+0xa4/0x66c > [ 132.853420][ T1889] __arm64_sys_clone+0x114/0x1b4 > [ 132.858208][ T1889] el0_svc_handler+0x198/0x260 > [ 132.862821][ T1889] el0_svc+0x8/0xc > > > > > I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure > > we won't run into issues with memcg in the future. > > > > [ 82.125966] Freeing unused kernel memory: 28672K > > [ 87.940365] Checked W+X mappings: passed, no W+X pages found > > [ 87.946769] Run /init as init process > > [ 88.040040] systemd[1]: System time before build time, advancing clock. > > [ 88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or > > directory > > [ 88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left > > [ 88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT > > +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT > > +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 > > default-hierarchy=legacy) > > [ 88.498398] systemd[1]: Detected architecture arm64. > > [ 88.506517] systemd[1]: Running in initial RAM disk. > > [ 89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left > > [ 90.222658] random: systemd: uninitialized urandom read (16 bytes read) > > [ 90.230072] systemd[1]: Reached target Swap. > > [ 90.240205] random: systemd: uninitialized urandom read (16 bytes read) > > [ 90.251088] systemd[1]: Reached target Timers. > > [ 90.261303] random: systemd: uninitialized urandom read (16 bytes read) > > [ 90.271209] systemd[1]: Listening on udev Control Socket. > > [ 90.283238] systemd[1]: Reached target Local File Systems. > > [ 90.296232] systemd[1]: Reached target Slices. > > [ 90.307239] systemd[1]: Listening on udev Kernel Socket. > > [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) > > (error: -2 parent: cgroup) > > [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error: > > -2 parent: cgroup) > > [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles- > > setup.service) (error: -2 parent: cgroup) > > [ 90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes > > left > > [ 90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error: > > -2 parent: cgroup) > > > > > Thanks, > > > Mark. > > > > > > > [1] https://cailca.github.io/files/dmesg.txt > >
On Thu, Jun 13, 2019 at 03:11:01PM +0300, Mike Rapoport wrote: > On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote: > > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote: > > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote: > > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote: > > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote: > > > > > > The commit "arm64: switch to generic version of pte allocation" > > > > > > introduced endless failures during boot like, > > > > > > > > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: > > > > > > -2 parent: cgroup) > > > > > > > > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations > > > > > > and then later memcg finds out those don't belong to any cgroup. > > > > > > > > > > Mike, I understood from [1] that this wasn't expected to be a problem, > > > > > as the accounting should bypass kernel threads. > > > > > > > > > > Was that assumption wrong, or is something different happening here? > > > > > > > > > > > > > > > > > backtrace: > > > > > > kobject_add_internal > > > > > > kobject_init_and_add > > > > > > sysfs_slab_add+0x1a8 > > > > > > __kmem_cache_create > > > > > > create_cache > > > > > > memcg_create_kmem_cache > > > > > > memcg_kmem_cache_create_func > > > > > > process_one_work > > > > > > worker_thread > > > > > > kthread > > > > > > > > > > > > Signed-off-by: Qian Cai <cai@lca.pw> > > > > > > --- > > > > > > arch/arm64/mm/pgd.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c > > > > > > index 769516cb6677..53c48f5c8765 100644 > > > > > > --- a/arch/arm64/mm/pgd.c > > > > > > +++ b/arch/arm64/mm/pgd.c > > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) > > > > > > if (PGD_SIZE == PAGE_SIZE) > > > > > > return (pgd_t *)__get_free_page(gfp); > > > > > > else > > > > > > - return kmem_cache_alloc(pgd_cache, gfp); > > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); > > > > > > > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g. > > > > > for the efi runtime services), so while this may fix the regression, I'm > > > > > not sure it's the right fix. > > > > > > > > > > Do we need a separate pgd_alloc_kernel()? > > > > > > > > So can I take the above for -rc5, or is somebody else working on a different > > > > fix to implement pgd_alloc_kernel()? > > > > > > The offensive commit "arm64: switch to generic version of pte allocation" is not > > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt > > > Andrew will push this out any time sooner given it is broken. > > > > I'd assumed that Mike would respin these patches to implement and use > > pgd_alloc_kernel() (or take gfp flags) and the updated patches would > > replace these in akpm's tree. > > > > Mike, could you confirm what your plan is? I'm happy to review/test > > updated patches for arm64. > > The log Qian Cai posted at [1] and partially cited below confirms that the > failure happens when *user* PGDs are allocated and the addition of > __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another > issue. Indeed the accounting of the PGD memory uncovered a dangling pointer to pgd_cache :) The pgd_cache was initialized twice and it made memcg and slub sysfs go nuts. To be frank, I've got lost in their cross-initialization, cross-referencing and update sequences, but for sure extra initialization of pgd_cache was bogus. I've double checked the 'if (mm == &init_mm)' and it's not needed. The EFI PGD is allocated before memcg is up and other kernel allocations of pgd (if we'll have any) would be bypassed by memcg_kmem_bypass(). Andrew, can you please add the patch below as an incremental fix? With this the arm64::pgd_alloc() should be in the right shape. From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001 From: Mike Rapoport <rppt@linux.ibm.com> Date: Mon, 17 Jun 2019 17:37:43 +0300 Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD memory. That cache was initialized twice: first through pgtable_cache_init() alias and then as an override for weak pgd_cache_init(). After enabling accounting for the PGD memory, this created a confusion for memcg and slub sysfs code which resulted in the following errors: [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup) Removing the alias from pgtable_cache_init() and keeping the only pgd_cache initialization in pgd_cache_init() resolves the problem and allows accounting of PGD memory. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> --- arch/arm64/include/asm/pgtable.h | 3 +-- arch/arm64/mm/pgd.c | 5 +---- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 3191b9f..c7a802d 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -851,8 +851,7 @@ extern int kern_addr_valid(unsigned long addr); #include <asm-generic/pgtable.h> -void pgd_cache_init(void); -#define pgtable_cache_init pgd_cache_init +static inline void pgtable_cache_init(void) { } /* * On AArch64, the cache coherency is handled via the set_pte_at() function. diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c index 53c48f5..3f0a744 100644 --- a/arch/arm64/mm/pgd.c +++ b/arch/arm64/mm/pgd.c @@ -32,13 +32,10 @@ pgd_t *pgd_alloc(struct mm_struct *mm) { gfp_t gfp = GFP_PGTABLE_USER; - if (unlikely(mm == &init_mm)) - gfp = GFP_PGTABLE_KERNEL; - if (PGD_SIZE == PAGE_SIZE) return (pgd_t *)__get_free_page(gfp); else - return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); + return kmem_cache_alloc(pgd_cache, gfp); } void pgd_free(struct mm_struct *mm, pgd_t *pgd)
Hi Mike, On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote: > Andrew, can you please add the patch below as an incremental fix? > > With this the arm64::pgd_alloc() should be in the right shape. > > > From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001 > From: Mike Rapoport <rppt@linux.ibm.com> > Date: Mon, 17 Jun 2019 17:37:43 +0300 > Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice > > When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD > memory. That cache was initialized twice: first through > pgtable_cache_init() alias and then as an override for weak > pgd_cache_init(). > > After enabling accounting for the PGD memory, this created a confusion for > memcg and slub sysfs code which resulted in the following errors: > > [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup) > > Removing the alias from pgtable_cache_init() and keeping the only pgd_cache > initialization in pgd_cache_init() resolves the problem and allows > accounting of PGD memory. > > Reported-by: Qian Cai <cai@lca.pw> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > --- > arch/arm64/include/asm/pgtable.h | 3 +-- > arch/arm64/mm/pgd.c | 5 +---- > 2 files changed, 2 insertions(+), 6 deletions(-) Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache during mm initialization") due to an unlucky naming conflict! In which case, I'd actually prefer to take this fix asap via the arm64 tree. Is that ok? Will
On Mon, Jun 17, 2019 at 05:36:30PM +0100, Will Deacon wrote: > Hi Mike, > > On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote: > > Andrew, can you please add the patch below as an incremental fix? > > > > With this the arm64::pgd_alloc() should be in the right shape. > > > > > > From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001 > > From: Mike Rapoport <rppt@linux.ibm.com> > > Date: Mon, 17 Jun 2019 17:37:43 +0300 > > Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice > > > > When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD > > memory. That cache was initialized twice: first through > > pgtable_cache_init() alias and then as an override for weak > > pgd_cache_init(). > > > > After enabling accounting for the PGD memory, this created a confusion for > > memcg and slub sysfs code which resulted in the following errors: > > > > [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > > [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > > [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup) > > > > Removing the alias from pgtable_cache_init() and keeping the only pgd_cache > > initialization in pgd_cache_init() resolves the problem and allows > > accounting of PGD memory. > > > > Reported-by: Qian Cai <cai@lca.pw> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > > --- > > arch/arm64/include/asm/pgtable.h | 3 +-- > > arch/arm64/mm/pgd.c | 5 +---- > > 2 files changed, 2 insertions(+), 6 deletions(-) > > Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache > during mm initialization") due to an unlucky naming conflict! > > In which case, I'd actually prefer to take this fix asap via the arm64 > tree. Is that ok? I suppose so, it just won't apply as is. Would you like a patch against the current upstream? > Will
On Tue, Jun 18, 2019 at 09:12:59AM +0300, Mike Rapoport wrote: > On Mon, Jun 17, 2019 at 05:36:30PM +0100, Will Deacon wrote: > > On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote: > > > Andrew, can you please add the patch below as an incremental fix? > > > > > > With this the arm64::pgd_alloc() should be in the right shape. > > > > > > > > > From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001 > > > From: Mike Rapoport <rppt@linux.ibm.com> > > > Date: Mon, 17 Jun 2019 17:37:43 +0300 > > > Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice > > > > > > When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD > > > memory. That cache was initialized twice: first through > > > pgtable_cache_init() alias and then as an override for weak > > > pgd_cache_init(). > > > > > > After enabling accounting for the PGD memory, this created a confusion for > > > memcg and slub sysfs code which resulted in the following errors: > > > > > > [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > > > [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup) > > > [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup) > > > > > > Removing the alias from pgtable_cache_init() and keeping the only pgd_cache > > > initialization in pgd_cache_init() resolves the problem and allows > > > accounting of PGD memory. > > > > > > Reported-by: Qian Cai <cai@lca.pw> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > > > --- > > > arch/arm64/include/asm/pgtable.h | 3 +-- > > > arch/arm64/mm/pgd.c | 5 +---- > > > 2 files changed, 2 insertions(+), 6 deletions(-) > > > > Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache > > during mm initialization") due to an unlucky naming conflict! > > > > In which case, I'd actually prefer to take this fix asap via the arm64 > > tree. Is that ok? > > I suppose so, it just won't apply as is. Would you like a patch against the > current upstream? Yes, please. I'm assuming it's a straightforward change (please shout if it isn't). Will
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c index 769516cb6677..53c48f5c8765 100644 --- a/arch/arm64/mm/pgd.c +++ b/arch/arm64/mm/pgd.c @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) if (PGD_SIZE == PAGE_SIZE) return (pgd_t *)__get_free_page(gfp); else - return kmem_cache_alloc(pgd_cache, gfp); + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL); } void pgd_free(struct mm_struct *mm, pgd_t *pgd)
The commit "arm64: switch to generic version of pte allocation" introduced endless failures during boot like, kobject_add_internal failed for pgd_cache(285:chronyd.service) (error: -2 parent: cgroup) It turns out __GFP_ACCOUNT is passed to kernel page table allocations and then later memcg finds out those don't belong to any cgroup. backtrace: kobject_add_internal kobject_init_and_add sysfs_slab_add+0x1a8 __kmem_cache_create create_cache memcg_create_kmem_cache memcg_kmem_cache_create_func process_one_work worker_thread kthread Signed-off-by: Qian Cai <cai@lca.pw> --- arch/arm64/mm/pgd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)