diff mbox series

kernel BUG at mm/huge_memory.c:2613!

Message ID 20200619001938.GA135965@carbon.dhcp.thefacebook.com (mailing list archive)
State New, archived
Headers show
Series kernel BUG at mm/huge_memory.c:2613! | expand

Commit Message

Roman Gushchin June 19, 2020, 12:19 a.m. UTC
Hi!

I was consistently hitting a VM_BUG_ON_PAGE() in split_huge_page_to_list()
when running vanilla 5.8-rc1 on my desktop. It was happening on every boot
during the system start. I haven't seen this issue on 5.7.

It looks like split_huge_page() expects the page to be locked,
but it hasn't been changed from 5.7. I do not see any suspicious
commits around the call side either.

I've tried the following patch:

--
From 4af38fbf06a9354fadf22a78f1a42dfbb24fbc3a Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Thu, 18 Jun 2020 16:33:47 -0700
Subject: [PATCH] iommu/dma: lock page before calling split_huge_page()

split_huge_page() expects a locked page. The following stacktrace
is generated if debug is on. Fix this by locking the page before
passing it to split_huge_page().

[   24.861385] page:ffffef044fb1fa00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 head:ffffef044fb1fa00 order:2 compound_mapcount:0 compound_pincount:0
[   24.861389] flags: 0x17ffffc0010000(head)
[   24.861393] raw: 0017ffffc0010000 dead000000000100 dead000000000122 0000000000000000
[   24.861395] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[   24.861396] page dumped because: VM_BUG_ON_PAGE(!PageLocked(head))
[   24.861411] ------------[ cut here ]------------
[   24.861413] kernel BUG at mm/huge_memory.c:2613!
[   24.861428] invalid opcode: 0000 [#1] SMP NOPTI
[   24.861432] CPU: 10 PID: 1505 Comm: pulseaudio Not tainted 5.8.0-rc1+ #689
[   24.861433] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
[   24.861441] RIP: 0010:split_huge_page_to_list+0x731/0xae0
[   24.861444] Code: 44 00 00 8b 47 34 85 c0 0f 84 b4 02 00 00 f0 ff 4f 34 75 c2 e8 e0 12 f7 ff eb bb 48 c7 c6 d0 16 39 ba 4c 89 c7 e8 ef 85 f9 ff <0f> 0b 48 c7 44 24 10 ff ff ff ff 31 db e9 bb fa ff ff 48 8b 7c 24
[   24.861446] RSP: 0018:ffffc1030254bb50 EFLAGS: 00010286
[   24.861449] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9b54cee98d08
[   24.861451] RDX: 00000000ffffffd8 RSI: 0000000000000000 RDI: ffff9b54cee98d00
[   24.861452] RBP: ffffef044fb1fa00 R08: 0000000000000547 R09: 0000000000000003
[   24.861454] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b54df37f188
[   24.861455] R13: ffff9b54df355000 R14: ffffef044fb1fa00 R15: ffffef044fb1fa00
[   24.861458] FS:  00007fd2dc132880(0000) GS:ffff9b54cee80000(0000) knlGS:0000000000000000
[   24.861460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.861461] CR2: 00007fd2cb100000 CR3: 00000003feb16000 CR4: 00000000003406e0
[   24.861464] Call Trace:
[   24.861473]  ? __mod_lruvec_state+0x41/0xf0
[   24.861478]  ? __alloc_pages_nodemask+0x15c/0x320
[   24.861483]  iommu_dma_alloc+0x316/0x580
[   24.861496]  snd_dma_alloc_pages+0xdf/0x160 [snd_pcm]
[   24.861508]  snd_dma_alloc_pages_fallback+0x5d/0x80 [snd_pcm]
[   24.861516]  snd_malloc_sgbuf_pages+0x166/0x380 [snd_pcm]
[   24.861523]  ? snd_pcm_hw_refine+0x29d/0x310 [snd_pcm]
[   24.861529]  ? _cond_resched+0x16/0x40
[   24.861535]  snd_dma_alloc_pages+0x64/0x160 [snd_pcm]
[   24.861542]  snd_pcm_lib_malloc_pages+0x136/0x1d0 [snd_pcm]
[   24.861550]  ? snd_pcm_lib_ioctl+0x167/0x210 [snd_pcm]
[   24.861556]  snd_pcm_hw_params+0x3c0/0x490 [snd_pcm]
[   24.861563]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
[   24.861571]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
[   24.861578]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
[   24.861583]  ksys_ioctl+0x82/0xc0
[   24.861587]  __x64_sys_ioctl+0x16/0x20
[   24.861593]  do_syscall_64+0x4d/0x90
[   24.861597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/iommu/dma-iommu.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Yang Shi June 19, 2020, 12:46 a.m. UTC | #1
On Thu, Jun 18, 2020 at 5:19 PM Roman Gushchin <guro@fb.com> wrote:
>
> Hi!
>
> I was consistently hitting a VM_BUG_ON_PAGE() in split_huge_page_to_list()
> when running vanilla 5.8-rc1 on my desktop. It was happening on every boot
> during the system start. I haven't seen this issue on 5.7.
>
> It looks like split_huge_page() expects the page to be locked,
> but it hasn't been changed from 5.7. I do not see any suspicious
> commits around the call side either.
>
> I've tried the following patch:
>
> --
> From 4af38fbf06a9354fadf22a78f1a42dfbb24fbc3a Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 18 Jun 2020 16:33:47 -0700
> Subject: [PATCH] iommu/dma: lock page before calling split_huge_page()
>
> split_huge_page() expects a locked page. The following stacktrace
> is generated if debug is on. Fix this by locking the page before
> passing it to split_huge_page().
>
> [   24.861385] page:ffffef044fb1fa00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 head:ffffef044fb1fa00 order:2 compound_mapcount:0 compound_pincount:0
> [   24.861389] flags: 0x17ffffc0010000(head)
> [   24.861393] raw: 0017ffffc0010000 dead000000000100 dead000000000122 0000000000000000
> [   24.861395] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [   24.861396] page dumped because: VM_BUG_ON_PAGE(!PageLocked(head))
> [   24.861411] ------------[ cut here ]------------
> [   24.861413] kernel BUG at mm/huge_memory.c:2613!
> [   24.861428] invalid opcode: 0000 [#1] SMP NOPTI
> [   24.861432] CPU: 10 PID: 1505 Comm: pulseaudio Not tainted 5.8.0-rc1+ #689
> [   24.861433] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> [   24.861441] RIP: 0010:split_huge_page_to_list+0x731/0xae0
> [   24.861444] Code: 44 00 00 8b 47 34 85 c0 0f 84 b4 02 00 00 f0 ff 4f 34 75 c2 e8 e0 12 f7 ff eb bb 48 c7 c6 d0 16 39 ba 4c 89 c7 e8 ef 85 f9 ff <0f> 0b 48 c7 44 24 10 ff ff ff ff 31 db e9 bb fa ff ff 48 8b 7c 24
> [   24.861446] RSP: 0018:ffffc1030254bb50 EFLAGS: 00010286
> [   24.861449] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9b54cee98d08
> [   24.861451] RDX: 00000000ffffffd8 RSI: 0000000000000000 RDI: ffff9b54cee98d00
> [   24.861452] RBP: ffffef044fb1fa00 R08: 0000000000000547 R09: 0000000000000003
> [   24.861454] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b54df37f188
> [   24.861455] R13: ffff9b54df355000 R14: ffffef044fb1fa00 R15: ffffef044fb1fa00
> [   24.861458] FS:  00007fd2dc132880(0000) GS:ffff9b54cee80000(0000) knlGS:0000000000000000
> [   24.861460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   24.861461] CR2: 00007fd2cb100000 CR3: 00000003feb16000 CR4: 00000000003406e0
> [   24.861464] Call Trace:
> [   24.861473]  ? __mod_lruvec_state+0x41/0xf0
> [   24.861478]  ? __alloc_pages_nodemask+0x15c/0x320
> [   24.861483]  iommu_dma_alloc+0x316/0x580
> [   24.861496]  snd_dma_alloc_pages+0xdf/0x160 [snd_pcm]
> [   24.861508]  snd_dma_alloc_pages_fallback+0x5d/0x80 [snd_pcm]
> [   24.861516]  snd_malloc_sgbuf_pages+0x166/0x380 [snd_pcm]
> [   24.861523]  ? snd_pcm_hw_refine+0x29d/0x310 [snd_pcm]
> [   24.861529]  ? _cond_resched+0x16/0x40
> [   24.861535]  snd_dma_alloc_pages+0x64/0x160 [snd_pcm]
> [   24.861542]  snd_pcm_lib_malloc_pages+0x136/0x1d0 [snd_pcm]
> [   24.861550]  ? snd_pcm_lib_ioctl+0x167/0x210 [snd_pcm]
> [   24.861556]  snd_pcm_hw_params+0x3c0/0x490 [snd_pcm]
> [   24.861563]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> [   24.861571]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> [   24.861578]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> [   24.861583]  ksys_ioctl+0x82/0xc0
> [   24.861587]  __x64_sys_ioctl+0x16/0x20
> [   24.861593]  do_syscall_64+0x4d/0x90
> [   24.861597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  drivers/iommu/dma-iommu.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 4959f5df21bd..31e4e305d8d5 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -24,6 +24,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/vmalloc.h>
>  #include <linux/crash_dump.h>
> +#include <linux/pagemap.h>
>
>  struct iommu_dma_msi_page {
>         struct list_head        list;
> @@ -549,8 +550,15 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>                         if (!PageCompound(page)) {
>                                 split_page(page, order);
>                                 break;
> -                       } else if (!split_huge_page(page)) {
> -                               break;
> +                       } else {
> +                               int err;
> +
> +                               lock_page(page);
> +                               err = split_huge_page(page);
> +                               unlock_page(page);

Yes, THP split does need the page locked, in addition it needs the
caller hold a pin on the page too (refcount bump).

But, I don't get how the code could even really work by a quick look.
Actually split_huge_page() assumes the passed in THP is user THP (anon
or file cache) and the order is PMD order However, it looks the iommu
driver just wants to allocate a bunch of base pages by allocating a
huge page (could by any order if I read the code correctly) then split
them to base pages. I don't think this is the correct approach IMO.
Anyway I'm not iommu expert, if I miss anything please feel free to
let me know.

> +
> +                               if (!err)
> +                                       break;
>                         }
>                         __free_pages(page, order);
>                 }
> --
> 2.26.2
>
>
> --
>
> But applying it made the kernel panic somewhere else:
>
> [   25.148419] BUG: unable to handle page fault for address: ffffb1a9c2429000
> [   25.148424] #PF: supervisor write access in kernel mode
> [   25.148426] #PF: error_code(0x000b) - reserved bit violation
> [   25.148427] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3e9938067 PTE 8000112400b4b163
> [   25.148433] Oops: 000b [#1] SMP NOPTI
> [   25.148436] CPU: 10 PID: 1504 Comm: pulseaudio Not tainted 5.8.0-rc1+ #690
> [   25.148438] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> [   25.148445] RIP: 0010:__memset+0x24/0x30
> [   25.148448] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> [   25.148450] RSP: 0018:ffffb1a9c2497e08 EFLAGS: 00010216
> [   25.148453] RAX: 0000000000000000 RBX: ffffa089ab428000 RCX: 00000000000008a0
> [   25.148454] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1a9c2429000
> [   25.148456] RBP: ffffa089b0036c00 R08: ffffa089c84c56e8 R09: ffffb1a9c2429000
> [   25.148457] R10: ffffb1a9c2429000 R11: ffffa089ae3c1800 R12: 0000000000000000
> [   25.148458] R13: ffffa089aa828600 R14: ffffffffc0f82880 R15: ffffa089c5121200
> [   25.148461] FS:  00007f533f679880(0000) GS:ffffa089cee80000(0000) knlGS:0000000000000000
> [   25.148463] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   25.148464] CR2: ffffb1a9c2429000 CR3: 0000000405f42000 CR4: 00000000003406e0
> [   25.148466] Call Trace:
> [   25.148479]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> [   25.148488]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> [   25.148496]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> [   25.148504]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
>
>
> Any ideas? Is it a known issue?
> It might be that some changes revealed one or more old bugs.
>
> Does the proposed patch look sane?
>
> Thanks!
>
Roman Gushchin June 19, 2020, 1:14 a.m. UTC | #2
On Thu, Jun 18, 2020 at 05:46:24PM -0700, Yang Shi wrote:
> On Thu, Jun 18, 2020 at 5:19 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Hi!
> >
> > I was consistently hitting a VM_BUG_ON_PAGE() in split_huge_page_to_list()
> > when running vanilla 5.8-rc1 on my desktop. It was happening on every boot
> > during the system start. I haven't seen this issue on 5.7.
> >
> > It looks like split_huge_page() expects the page to be locked,
> > but it hasn't been changed from 5.7. I do not see any suspicious
> > commits around the call side either.
> >
> > I've tried the following patch:
> >
> > --
> > From 4af38fbf06a9354fadf22a78f1a42dfbb24fbc3a Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Thu, 18 Jun 2020 16:33:47 -0700
> > Subject: [PATCH] iommu/dma: lock page before calling split_huge_page()
> >
> > split_huge_page() expects a locked page. The following stacktrace
> > is generated if debug is on. Fix this by locking the page before
> > passing it to split_huge_page().
> >
> > [   24.861385] page:ffffef044fb1fa00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 head:ffffef044fb1fa00 order:2 compound_mapcount:0 compound_pincount:0
> > [   24.861389] flags: 0x17ffffc0010000(head)
> > [   24.861393] raw: 0017ffffc0010000 dead000000000100 dead000000000122 0000000000000000
> > [   24.861395] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > [   24.861396] page dumped because: VM_BUG_ON_PAGE(!PageLocked(head))
> > [   24.861411] ------------[ cut here ]------------
> > [   24.861413] kernel BUG at mm/huge_memory.c:2613!
> > [   24.861428] invalid opcode: 0000 [#1] SMP NOPTI
> > [   24.861432] CPU: 10 PID: 1505 Comm: pulseaudio Not tainted 5.8.0-rc1+ #689
> > [   24.861433] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> > [   24.861441] RIP: 0010:split_huge_page_to_list+0x731/0xae0
> > [   24.861444] Code: 44 00 00 8b 47 34 85 c0 0f 84 b4 02 00 00 f0 ff 4f 34 75 c2 e8 e0 12 f7 ff eb bb 48 c7 c6 d0 16 39 ba 4c 89 c7 e8 ef 85 f9 ff <0f> 0b 48 c7 44 24 10 ff ff ff ff 31 db e9 bb fa ff ff 48 8b 7c 24
> > [   24.861446] RSP: 0018:ffffc1030254bb50 EFLAGS: 00010286
> > [   24.861449] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9b54cee98d08
> > [   24.861451] RDX: 00000000ffffffd8 RSI: 0000000000000000 RDI: ffff9b54cee98d00
> > [   24.861452] RBP: ffffef044fb1fa00 R08: 0000000000000547 R09: 0000000000000003
> > [   24.861454] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b54df37f188
> > [   24.861455] R13: ffff9b54df355000 R14: ffffef044fb1fa00 R15: ffffef044fb1fa00
> > [   24.861458] FS:  00007fd2dc132880(0000) GS:ffff9b54cee80000(0000) knlGS:0000000000000000
> > [   24.861460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   24.861461] CR2: 00007fd2cb100000 CR3: 00000003feb16000 CR4: 00000000003406e0
> > [   24.861464] Call Trace:
> > [   24.861473]  ? __mod_lruvec_state+0x41/0xf0
> > [   24.861478]  ? __alloc_pages_nodemask+0x15c/0x320
> > [   24.861483]  iommu_dma_alloc+0x316/0x580
> > [   24.861496]  snd_dma_alloc_pages+0xdf/0x160 [snd_pcm]
> > [   24.861508]  snd_dma_alloc_pages_fallback+0x5d/0x80 [snd_pcm]
> > [   24.861516]  snd_malloc_sgbuf_pages+0x166/0x380 [snd_pcm]
> > [   24.861523]  ? snd_pcm_hw_refine+0x29d/0x310 [snd_pcm]
> > [   24.861529]  ? _cond_resched+0x16/0x40
> > [   24.861535]  snd_dma_alloc_pages+0x64/0x160 [snd_pcm]
> > [   24.861542]  snd_pcm_lib_malloc_pages+0x136/0x1d0 [snd_pcm]
> > [   24.861550]  ? snd_pcm_lib_ioctl+0x167/0x210 [snd_pcm]
> > [   24.861556]  snd_pcm_hw_params+0x3c0/0x490 [snd_pcm]
> > [   24.861563]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> > [   24.861571]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> > [   24.861578]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> > [   24.861583]  ksys_ioctl+0x82/0xc0
> > [   24.861587]  __x64_sys_ioctl+0x16/0x20
> > [   24.861593]  do_syscall_64+0x4d/0x90
> > [   24.861597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  drivers/iommu/dma-iommu.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 4959f5df21bd..31e4e305d8d5 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/scatterlist.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/crash_dump.h>
> > +#include <linux/pagemap.h>
> >
> >  struct iommu_dma_msi_page {
> >         struct list_head        list;
> > @@ -549,8 +550,15 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> >                         if (!PageCompound(page)) {
> >                                 split_page(page, order);
> >                                 break;
> > -                       } else if (!split_huge_page(page)) {
> > -                               break;
> > +                       } else {
> > +                               int err;
> > +
> > +                               lock_page(page);
> > +                               err = split_huge_page(page);
> > +                               unlock_page(page);
> 
> Yes, THP split does need the page locked, in addition it needs the
> caller hold a pin on the page too (refcount bump).
> 
> But, I don't get how the code could even really work by a quick look.
> Actually split_huge_page() assumes the passed in THP is user THP (anon
> or file cache) and the order is PMD order However, it looks the iommu
> driver just wants to allocate a bunch of base pages by allocating a
> huge page (could by any order if I read the code correctly) then split
> them to base pages. I don't think this is the correct approach IMO.
> Anyway I'm not iommu expert, if I miss anything please feel free to
> let me know.

I agree. The whole

	page = alloc_pages_node(nid, alloc_flags, order);
	if (!page)
		continue;
	if (!order)
		break;
	if (!PageCompound(page)) {
		split_page(page, order);
		break;
	} else if (!split_huge_page(page)) {
		break;
	}

looks very suspicious to me.
My wild guess is that gfp flags changed somewhere above, so we hit
the branch which was never hit before.
Yang Shi June 19, 2020, 1:20 a.m. UTC | #3
On Thu, Jun 18, 2020 at 6:15 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 05:46:24PM -0700, Yang Shi wrote:
> > On Thu, Jun 18, 2020 at 5:19 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Hi!
> > >
> > > I was consistently hitting a VM_BUG_ON_PAGE() in split_huge_page_to_list()
> > > when running vanilla 5.8-rc1 on my desktop. It was happening on every boot
> > > during the system start. I haven't seen this issue on 5.7.
> > >
> > > It looks like split_huge_page() expects the page to be locked,
> > > but it hasn't been changed from 5.7. I do not see any suspicious
> > > commits around the call side either.
> > >
> > > I've tried the following patch:
> > >
> > > --
> > > From 4af38fbf06a9354fadf22a78f1a42dfbb24fbc3a Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <guro@fb.com>
> > > Date: Thu, 18 Jun 2020 16:33:47 -0700
> > > Subject: [PATCH] iommu/dma: lock page before calling split_huge_page()
> > >
> > > split_huge_page() expects a locked page. The following stacktrace
> > > is generated if debug is on. Fix this by locking the page before
> > > passing it to split_huge_page().
> > >
> > > [   24.861385] page:ffffef044fb1fa00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 head:ffffef044fb1fa00 order:2 compound_mapcount:0 compound_pincount:0
> > > [   24.861389] flags: 0x17ffffc0010000(head)
> > > [   24.861393] raw: 0017ffffc0010000 dead000000000100 dead000000000122 0000000000000000
> > > [   24.861395] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > [   24.861396] page dumped because: VM_BUG_ON_PAGE(!PageLocked(head))
> > > [   24.861411] ------------[ cut here ]------------
> > > [   24.861413] kernel BUG at mm/huge_memory.c:2613!
> > > [   24.861428] invalid opcode: 0000 [#1] SMP NOPTI
> > > [   24.861432] CPU: 10 PID: 1505 Comm: pulseaudio Not tainted 5.8.0-rc1+ #689
> > > [   24.861433] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> > > [   24.861441] RIP: 0010:split_huge_page_to_list+0x731/0xae0
> > > [   24.861444] Code: 44 00 00 8b 47 34 85 c0 0f 84 b4 02 00 00 f0 ff 4f 34 75 c2 e8 e0 12 f7 ff eb bb 48 c7 c6 d0 16 39 ba 4c 89 c7 e8 ef 85 f9 ff <0f> 0b 48 c7 44 24 10 ff ff ff ff 31 db e9 bb fa ff ff 48 8b 7c 24
> > > [   24.861446] RSP: 0018:ffffc1030254bb50 EFLAGS: 00010286
> > > [   24.861449] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffff9b54cee98d08
> > > [   24.861451] RDX: 00000000ffffffd8 RSI: 0000000000000000 RDI: ffff9b54cee98d00
> > > [   24.861452] RBP: ffffef044fb1fa00 R08: 0000000000000547 R09: 0000000000000003
> > > [   24.861454] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b54df37f188
> > > [   24.861455] R13: ffff9b54df355000 R14: ffffef044fb1fa00 R15: ffffef044fb1fa00
> > > [   24.861458] FS:  00007fd2dc132880(0000) GS:ffff9b54cee80000(0000) knlGS:0000000000000000
> > > [   24.861460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   24.861461] CR2: 00007fd2cb100000 CR3: 00000003feb16000 CR4: 00000000003406e0
> > > [   24.861464] Call Trace:
> > > [   24.861473]  ? __mod_lruvec_state+0x41/0xf0
> > > [   24.861478]  ? __alloc_pages_nodemask+0x15c/0x320
> > > [   24.861483]  iommu_dma_alloc+0x316/0x580
> > > [   24.861496]  snd_dma_alloc_pages+0xdf/0x160 [snd_pcm]
> > > [   24.861508]  snd_dma_alloc_pages_fallback+0x5d/0x80 [snd_pcm]
> > > [   24.861516]  snd_malloc_sgbuf_pages+0x166/0x380 [snd_pcm]
> > > [   24.861523]  ? snd_pcm_hw_refine+0x29d/0x310 [snd_pcm]
> > > [   24.861529]  ? _cond_resched+0x16/0x40
> > > [   24.861535]  snd_dma_alloc_pages+0x64/0x160 [snd_pcm]
> > > [   24.861542]  snd_pcm_lib_malloc_pages+0x136/0x1d0 [snd_pcm]
> > > [   24.861550]  ? snd_pcm_lib_ioctl+0x167/0x210 [snd_pcm]
> > > [   24.861556]  snd_pcm_hw_params+0x3c0/0x490 [snd_pcm]
> > > [   24.861563]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> > > [   24.861571]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> > > [   24.861578]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> > > [   24.861583]  ksys_ioctl+0x82/0xc0
> > > [   24.861587]  __x64_sys_ioctl+0x16/0x20
> > > [   24.861593]  do_syscall_64+0x4d/0x90
> > > [   24.861597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  drivers/iommu/dma-iommu.c | 12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > > index 4959f5df21bd..31e4e305d8d5 100644
> > > --- a/drivers/iommu/dma-iommu.c
> > > +++ b/drivers/iommu/dma-iommu.c
> > > @@ -24,6 +24,7 @@
> > >  #include <linux/scatterlist.h>
> > >  #include <linux/vmalloc.h>
> > >  #include <linux/crash_dump.h>
> > > +#include <linux/pagemap.h>
> > >
> > >  struct iommu_dma_msi_page {
> > >         struct list_head        list;
> > > @@ -549,8 +550,15 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> > >                         if (!PageCompound(page)) {
> > >                                 split_page(page, order);
> > >                                 break;
> > > -                       } else if (!split_huge_page(page)) {
> > > -                               break;
> > > +                       } else {
> > > +                               int err;
> > > +
> > > +                               lock_page(page);
> > > +                               err = split_huge_page(page);
> > > +                               unlock_page(page);
> >
> > Yes, THP split does need the page locked, in addition it needs the
> > caller hold a pin on the page too (refcount bump).
> >
> > But, I don't get how the code could even really work by a quick look.
> > Actually split_huge_page() assumes the passed in THP is user THP (anon
> > or file cache) and the order is PMD order However, it looks the iommu
> > driver just wants to allocate a bunch of base pages by allocating a
> > huge page (could by any order if I read the code correctly) then split
> > them to base pages. I don't think this is the correct approach IMO.
> > Anyway I'm not iommu expert, if I miss anything please feel free to
> > let me know.
>
> I agree. The whole
>
>         page = alloc_pages_node(nid, alloc_flags, order);
>         if (!page)
>                 continue;
>         if (!order)
>                 break;
>         if (!PageCompound(page)) {
>                 split_page(page, order);
>                 break;
>         } else if (!split_huge_page(page)) {
>                 break;
>         }
>
> looks very suspicious to me.
> My wild guess is that gfp flags changed somewhere above, so we hit
> the branch which was never hit before.

It seems so. The page flag has PG_head set then hit the branch, but
the order is 2. I guess __GFP_COMP is set in gfp flag somewhere.
Andrea Arcangeli June 19, 2020, 2:40 a.m. UTC | #4
Hello,

On Thu, Jun 18, 2020 at 06:14:49PM -0700, Roman Gushchin wrote:
> I agree. The whole
> 
> 	page = alloc_pages_node(nid, alloc_flags, order);
> 	if (!page)
> 		continue;
> 	if (!order)
> 		break;
> 	if (!PageCompound(page)) {
> 		split_page(page, order);
> 		break;
> 	} else if (!split_huge_page(page)) {
> 		break;
> 	}
> 
> looks very suspicious to me.
> My wild guess is that gfp flags changed somewhere above, so we hit
> the branch which was never hit before.

Right to be suspicious about the above: split_huge_page on a regular
page allocated by a driver was never meant to work.

The PageLocked BUG_ON is just a symptom of a bigger issue, basically
split_huge_page it may survive, but it'll stay compound and in turn it
must be freed as compound.

The respective free method doesn't even contemplate freeing compound
pages, the only way the free method can survive, is by removing
__GFP_COMP forcefully in the allocation that was perhaps set here
(there are that many __GFP_COMP in that directory):

static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
{
	gfp_t gfp_flags;

	gfp_flags = GFP_KERNEL
		| __GFP_COMP	/* compound page lets parts be mapped */

And I'm not sure what the comment means here, compound or non compound
doesn't make a difference when you map it, it's not a THP, the
mappings must be handled manually so nothing should check PG_compound
anyway in the mapping code.

Something like this may improve things, it's an untested quick hack,
but this assumes it's always a bug to setup a compound page for these
DMA allocations and given the API it's probably a correct
assumption.. Compound is slower, unless you need it, you can avoid it
and then split_page will give contiguous memory page granular. Ideally
the code shouldn't call split_page at all and it should free it all at
once by keeping track of the order and by returning the order to the
caller, something the API can't do right now as it returns a plain
array that can only represent individual small pages.

Once this is resolved, you may want to check your config, iommu passthrough
sounds more optimal for a soundcard.

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f68a62c3c32b..3dfbc010fa83 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -499,6 +499,10 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 
 	/* IOMMU can map any pages, so himem can also be used here */
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
+	if (unlikely(gfp & __GFP_COMP)) {
+		WARN();
+		gfp &= ~__GFP_COMP;
+	}
 
 	while (count) {
 		struct page *page = NULL;
@@ -522,13 +526,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 				continue;
 			if (!order)
 				break;
-			if (!PageCompound(page)) {
-				split_page(page, order);
-				break;
-			} else if (!split_huge_page(page)) {
-				break;
-			}
-			__free_pages(page, order);
+			split_page(page, order);
+			break;
 		}
 		if (!page) {
 			__iommu_dma_free_pages(pages, i);
diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
index 6850d13aa98c..378f5a36ec5f 100644
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -28,7 +28,6 @@ static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
 	gfp_t gfp_flags;
 
 	gfp_flags = GFP_KERNEL
-		| __GFP_COMP	/* compound page lets parts be mapped */
 		| __GFP_NORETRY /* don't trigger OOM-killer */
 		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
 	dmab->area = dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr,
Roman Gushchin June 19, 2020, 6:55 p.m. UTC | #5
On Thu, Jun 18, 2020 at 10:40:26PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Thu, Jun 18, 2020 at 06:14:49PM -0700, Roman Gushchin wrote:
> > I agree. The whole
> > 
> > 	page = alloc_pages_node(nid, alloc_flags, order);
> > 	if (!page)
> > 		continue;
> > 	if (!order)
> > 		break;
> > 	if (!PageCompound(page)) {
> > 		split_page(page, order);
> > 		break;
> > 	} else if (!split_huge_page(page)) {
> > 		break;
> > 	}
> > 
> > looks very suspicious to me.
> > My wild guess is that gfp flags changed somewhere above, so we hit
> > the branch which was never hit before.
> 
> Right to be suspicious about the above: split_huge_page on a regular
> page allocated by a driver was never meant to work.
> 
> The PageLocked BUG_ON is just a symptom of a bigger issue, basically
> split_huge_page it may survive, but it'll stay compound and in turn it
> must be freed as compound.
> 
> The respective free method doesn't even contemplate freeing compound
> pages, the only way the free method can survive, is by removing
> __GFP_COMP forcefully in the allocation that was perhaps set here
> (there are that many __GFP_COMP in that directory):
> 
> static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
> {
> 	gfp_t gfp_flags;
> 
> 	gfp_flags = GFP_KERNEL
> 		| __GFP_COMP	/* compound page lets parts be mapped */
> 
> And I'm not sure what the comment means here, compound or non compound
> doesn't make a difference when you map it, it's not a THP, the
> mappings must be handled manually so nothing should check PG_compound
> anyway in the mapping code.
> 
> Something like this may improve things, it's an untested quick hack,
> but this assumes it's always a bug to setup a compound page for these
> DMA allocations and given the API it's probably a correct
> assumption.. Compound is slower, unless you need it, you can avoid it
> and then split_page will give contiguous memory page granular. Ideally
> the code shouldn't call split_page at all and it should free it all at
> once by keeping track of the order and by returning the order to the
> caller, something the API can't do right now as it returns a plain
> array that can only represent individual small pages.
> 
> Once this is resolved, you may want to check your config, iommu passthrough
> sounds more optimal for a soundcard.

It's based on the default Fedora 32 config + all defaults for the 5.6..5.8-rc1
difference. But thanks for the advice!

> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index f68a62c3c32b..3dfbc010fa83 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -499,6 +499,10 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>  
>  	/* IOMMU can map any pages, so himem can also be used here */
>  	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> +	if (unlikely(gfp & __GFP_COMP)) {
> +		WARN();
> +		gfp &= ~__GFP_COMP;
> +	}
>  
>  	while (count) {
>  		struct page *page = NULL;
> @@ -522,13 +526,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>  				continue;
>  			if (!order)
>  				break;
> -			if (!PageCompound(page)) {
> -				split_page(page, order);
> -				break;
> -			} else if (!split_huge_page(page)) {
> -				break;
> -			}
> -			__free_pages(page, order);
> +			split_page(page, order);
> +			break;
>  		}
>  		if (!page) {
>  			__iommu_dma_free_pages(pages, i);
> diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
> index 6850d13aa98c..378f5a36ec5f 100644
> --- a/sound/core/memalloc.c
> +++ b/sound/core/memalloc.c
> @@ -28,7 +28,6 @@ static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
>  	gfp_t gfp_flags;
>  
>  	gfp_flags = GFP_KERNEL
> -		| __GFP_COMP	/* compound page lets parts be mapped */
>  		| __GFP_NORETRY /* don't trigger OOM-killer */
>  		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
>  	dmab->area = dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr,
> 
> 

The patch looks very good to me, but unfortunately it seems to reveal
the next layer of problems:

[   23.864671] page:ffffcd0a8fc3a3c0 refcount:0 mapcount:0
mapping:0000000000000000 index:0x0
[   23.864674] flags: 0x17ffffc0000800(arch_1)
[   23.864678] raw: 0017ffffc0000800 0000000000000000 ffffcd0a8fc3a388
0000000000000000
[   23.864680] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[   23.864705] page dumped because: VM_BUG_ON_PAGE(((unsigned int)
page_ref_count(page) + 127u <= 127u))
[   23.864771] ------------[ cut here ]------------
[   23.864771] kernel BUG at include/linux/mm.h:1137!
[   23.864783] invalid opcode: 0000 [#1] SMP NOPTI
[   23.864787] CPU: 14 PID: 1876 Comm: alsa-sink-HDMI  Not tainted
5.8.0-rc1+ #697
[   23.864788] Hardware name: Gigabyte Technology Co., Ltd.
AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
[   23.864799] RIP: 0010:snd_pcm_mmap_data_fault+0xe3/0x110 [snd_pcm]
[   23.864802] Code: 00 00 80 48 2b 05 f5 01 7a e1 48 01 f8 48 c1 e8
0c 48 c1 e0 06 48 03 05 d3 01 7a e1 eb 83 48 c7 c6 88 a9 cb c0 e8 bd
4f 5e e0 <0f> 0b 4c 89 c7 e8 23 53 60 e0 e9 68 ff ff ff e8 b9 af 00 00
e9 5e
[   23.864804] RSP: 0000:ffffa77b82477df0 EFLAGS: 00010286
[   23.864808] RAX: 0000000000000000 RBX: ffffa77b82477e30 RCX: 0000000000000000
[   23.864812] RDX: ffff90f78efa7060 RSI: 0000000000000000 RDI: ffff90f78ef98d00
[   23.864813] RBP: ffff90f76d164bb8 R08: 0000000000000548 R09: 0000000000000003
[   23.864815] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa77b82477e30
[   23.864817] R13: 00000000000007b8 R14: ffff90f76d164bb8 R15: ffff90f78d1eb300
[   23.864820] FS:  00007f971efbf700(0000) GS:ffff90f78ef80000(0000)
knlGS:0000000000000000
[   23.864821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.864823] CR2: 00007f971efcffa0 CR3: 00000003ffaa4000 CR4: 00000000003406e0
[   23.864825] Call Trace:
[   23.864834]  __do_fault+0x36/0x100
[   23.864838]  handle_mm_fault+0x11e3/0x1970
[   23.864846]  do_user_addr_fault+0x1f9/0x490
[   23.864853]  exc_page_fault+0x81/0x1a0
[   23.864858]  ? asm_exc_page_fault+0x8/0x30
[   23.864861]  asm_exc_page_fault+0x1e/0x30
[   23.864864] RIP: 0033:0x7f9730baba08
[   23.864866] Code: Bad RIP value.
[   23.864867] RSP: 002b:00007f971efbc538 EFLAGS: 00010202
[   23.864869] RAX: 00007f971efc0000 RBX: 0000000000000001 RCX: 00007f971efc0080
[   23.864871] RDX: 000000000000ffc0 RSI: 0000000000000000 RDI: 00007f971efc0000
[   23.864872] RBP: 00007f971efbec90 R08: 0000000000000004 R09: 000000000000ffc0
[   23.864874] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f971efbec90
[   23.864876] R13: 00005630003cc1b0 R14: 00007f971efbc640 R15: 00005630003cc140
[   23.864880] Modules linked in: xt_CHECKSUM xt_MASQUERADE
xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp
llc ccm nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ip_set nfnetlink ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat
iwlmvm edac_mce_amd mac80211 kvm_amd snd_hda_codec_realtek libarc4
snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio iwlwifi kvm
snd_hda_intel snd_intel_dspcfg irqbypass snd_usb_audio cfg80211
snd_hda_codec wmi_bmof snd_usbmidi_lib snd_rawmidi mc snd_hda_core
pcspkr snd_seq snd_hwdep snd_seq_device k10temp btusb snd_pcm
sp5100_tco btrtl i2c_piix4 btbcm btintel
[   23.864918]  bluetooth snd_timer snd soundcore ecdh_generic rfkill
ecc gpio_amdpt gpio_generic acpi_cpufreq ip_tables amdgpu iommu_v2
gpu_sched i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul
crc32_pclmul drm ghash_clmulni_intel ccp r8169 nvme nvme_core wmi
pinctrl_amd btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq
fuse
[   23.864940] ---[ end trace a6b3dead26df473b ]---
[   23.864947] RIP: 0010:snd_pcm_mmap_data_fault+0xe3/0x110 [snd_pcm]
[   23.864951] Code: 00 00 80 48 2b 05 f5 01 7a e1 48 01 f8 48 c1 e8
0c 48 c1 e0 06 48 03 05 d3 01 7a e1 eb 83 48 c7 c6 88 a9 cb c0 e8 bd
4f 5e e0 <0f> 0b 4c 89 c7 e8 23 53 60 e0 e9 68 ff ff ff e8 b9 af 00 00
e9 5e
[   23.864954] RSP: 0000:ffffa77b82477df0 EFLAGS: 00010286
[   23.864956] RAX: 0000000000000000 RBX: ffffa77b82477e30 RCX: 0000000000000000
[   23.864958] RDX: ffff90f78efa7060 RSI: 0000000000000000 RDI: ffff90f78ef98d00
[   23.864960] RBP: ffff90f76d164bb8 R08: 0000000000000548 R09: 0000000000000003
[   23.864961] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa77b82477e30
[   23.864963] R13: 00000000000007b8 R14: ffff90f76d164bb8 R15: ffff90f78d1eb300
[   23.864965] FS:  00007f971efbf700(0000) GS:ffff90f78ef80000(0000)
knlGS:0000000000000000
[   23.864967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.864968] CR2: 00007f971efcffa0 CR3: 00000003ffaa4000 CR4: 00000000003406e0
[   39.037054] rfkill: input handler enabled
[   40.287524] BUG: unable to handle page fault for address: ffffa77b833df000
[   40.287529] #PF: supervisor write access in kernel mode
[   40.287531] #PF: error_code(0x000b) - reserved bit violation
[   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
PTE 80001688033d9163
[   40.287538] Oops: 000b [#2] SMP NOPTI
[   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G      D
      5.8.0-rc1+ #697
[   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
[   40.287550] RIP: 0010:__memset+0x24/0x30
[   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
d1 f3
[   40.287556] RSP: 0018:ffffa77b827a7e08 EFLAGS: 00010216
[   40.287558] RAX: 0000000000000000 RBX: ffff90f77dced800 RCX: 00000000000008a0
[   40.287560] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa77b833df000
[   40.287561] RBP: ffff90f7898c7000 R08: ffff90f78c507768 R09: ffffa77b833df000
[   40.287563] R10: ffffa77b833df000 R11: ffff90f7839f2d40 R12: 0000000000000000
[   40.287564] R13: ffff90f76a802e00 R14: ffffffffc0cb8880 R15: ffff90f770f4e700
[   40.287567] FS:  00007f3d8e8df880(0000) GS:ffff90f78ee40000(0000)
knlGS:0000000000000000
[   40.287569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.287570] CR2: ffffa77b833df000 CR3: 00000003fa556000 CR4: 00000000003406e0
[   40.287572] Call Trace:
[   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
[   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
[   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
[   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
[   40.287613]  ksys_ioctl+0x82/0xc0
[   40.287617]  __x64_sys_ioctl+0x16/0x20
[   40.287622]  do_syscall_64+0x4d/0x90
[   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   40.287630] RIP: 0033:0x7f3d8f21047b
[   40.287631] Code: Bad RIP value.
[   40.287632] RSP: 002b:00007ffe22938b08 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[   40.287635] RAX: ffffffffffffffda RBX: 00007ffe22938d50 RCX: 00007f3d8f21047b
[   40.287636] RDX: 00007ffe22938d50 RSI: 00000000c2604111 RDI: 000000000000001d
[   40.287637] RBP: 000055dc9480fdf0 R08: 0000000000000000 R09: 0000000000000000
[   40.287639] R10: 0000000000000004 R11: 0000000000000246 R12: 000055dc9480fd70
[   40.287640] R13: 00007ffe22938b44 R14: 0000000000000000 R15: 00007ffe22938d50
[   40.287644] Modules linked in: xt_CHECKSUM xt_MASQUERADE
xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp
llc ccm nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ip_set nfnetlink ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat
iwlmvm edac_mce_amd mac80211 kvm_amd snd_hda_codec_realtek libarc4
snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio iwlwifi kvm
snd_hda_intel snd_intel_dspcfg irqbypass snd_usb_audio cfg80211
snd_hda_codec wmi_bmof snd_usbmidi_lib snd_rawmidi mc snd_hda_core
pcspkr snd_seq snd_hwdep snd_seq_device k10temp btusb snd_pcm
sp5100_tco btrtl i2c_piix4 btbcm btintel
[   40.287688]  bluetooth snd_timer snd soundcore ecdh_generic rfkill
ecc gpio_amdpt gpio_generic acpi_cpufreq ip_tables amdgpu iommu_v2
gpu_sched i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul
crc32_pclmul drm ghash_clmulni_intel ccp r8169 nvme nvme_core wmi
pinctrl_amd btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq
fuse
[   40.287708] CR2: ffffa77b833df000
[   40.287711] ---[ end trace a6b3dead26df473c ]---
[   40.287717] RIP: 0010:snd_pcm_mmap_data_fault+0xe3/0x110 [snd_pcm]
[   40.287720] Code: 00 00 80 48 2b 05 f5 01 7a e1 48 01 f8 48 c1 e8
0c 48 c1 e0 06 48 03 05 d3 01 7a e1 eb 83 48 c7 c6 88 a9 cb c0 e8 bd
4f 5e e0 <0f> 0b 4c 89 c7 e8 23 53 60 e0 e9 68 ff ff ff e8 b9 af 00 00
e9 5e
[   40.287722] RSP: 0000:ffffa77b82477df0 EFLAGS: 00010286
[   40.287724] RAX: 0000000000000000 RBX: ffffa77b82477e30 RCX: 0000000000000000
[   40.287725] RDX: ffff90f78efa7060 RSI: 0000000000000000 RDI: ffff90f78ef98d00
[   40.287726] RBP: ffff90f76d164bb8 R08: 0000000000000548 R09: 0000000000000003
[   40.287728] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa77b82477e30
[   40.287729] R13: 00000000000007b8 R14: ffff90f76d164bb8 R15: ffff90f78d1eb300
[   40.287731] FS:  00007f3d8e8df880(0000) GS:ffff90f78ee40000(0000)
knlGS:0000000000000000
[   40.287733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.287734] CR2: ffffa77b833df000 CR3: 00000003fa556000 CR4: 00000000003406e0
David Rientjes June 19, 2020, 8:56 p.m. UTC | #6
On Fri, 19 Jun 2020, Roman Gushchin wrote:

> [   40.287524] BUG: unable to handle page fault for address: ffffa77b833df000
> [   40.287529] #PF: supervisor write access in kernel mode
> [   40.287531] #PF: error_code(0x000b) - reserved bit violation
> [   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
> PTE 80001688033d9163
> [   40.287538] Oops: 000b [#2] SMP NOPTI
> [   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G      D
>       5.8.0-rc1+ #697
> [   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
> AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> [   40.287550] RIP: 0010:__memset+0x24/0x30
> [   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> d1 f3
> [   40.287556] RSP: 0018:ffffa77b827a7e08 EFLAGS: 00010216
> [   40.287558] RAX: 0000000000000000 RBX: ffff90f77dced800 RCX: 00000000000008a0
> [   40.287560] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa77b833df000
> [   40.287561] RBP: ffff90f7898c7000 R08: ffff90f78c507768 R09: ffffa77b833df000
> [   40.287563] R10: ffffa77b833df000 R11: ffff90f7839f2d40 R12: 0000000000000000
> [   40.287564] R13: ffff90f76a802e00 R14: ffffffffc0cb8880 R15: ffff90f770f4e700
> [   40.287567] FS:  00007f3d8e8df880(0000) GS:ffff90f78ee40000(0000)
> knlGS:0000000000000000
> [   40.287569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   40.287570] CR2: ffffa77b833df000 CR3: 00000003fa556000 CR4: 00000000003406e0
> [   40.287572] Call Trace:
> [   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> [   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> [   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> [   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> [   40.287613]  ksys_ioctl+0x82/0xc0
> [   40.287617]  __x64_sys_ioctl+0x16/0x20
> [   40.287622]  do_syscall_64+0x4d/0x90
> [   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hi Roman,

If you have CONFIG_AMD_MEM_ENCRYPT set, this should be resolved by

commit dbed452a078d56bc7f1abecc3edd6a75e8e4484e
Author: David Rientjes <rientjes@google.com>
Date:   Thu Jun 11 00:25:57 2020 -0700

    dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL

Or you might want to wait for 5.8-rc2 instead which includes this fix.
Roman Gushchin June 19, 2020, 10:57 p.m. UTC | #7
On Fri, Jun 19, 2020 at 01:56:28PM -0700, David Rientjes wrote:
> On Fri, 19 Jun 2020, Roman Gushchin wrote:
> 
> > [   40.287524] BUG: unable to handle page fault for address: ffffa77b833df000
> > [   40.287529] #PF: supervisor write access in kernel mode
> > [   40.287531] #PF: error_code(0x000b) - reserved bit violation
> > [   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
> > PTE 80001688033d9163
> > [   40.287538] Oops: 000b [#2] SMP NOPTI
> > [   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G      D
> >       5.8.0-rc1+ #697
> > [   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
> > AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> > [   40.287550] RIP: 0010:__memset+0x24/0x30
> > [   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> > d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> > 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> > d1 f3
> > [   40.287556] RSP: 0018:ffffa77b827a7e08 EFLAGS: 00010216
> > [   40.287558] RAX: 0000000000000000 RBX: ffff90f77dced800 RCX: 00000000000008a0
> > [   40.287560] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa77b833df000
> > [   40.287561] RBP: ffff90f7898c7000 R08: ffff90f78c507768 R09: ffffa77b833df000
> > [   40.287563] R10: ffffa77b833df000 R11: ffff90f7839f2d40 R12: 0000000000000000
> > [   40.287564] R13: ffff90f76a802e00 R14: ffffffffc0cb8880 R15: ffff90f770f4e700
> > [   40.287567] FS:  00007f3d8e8df880(0000) GS:ffff90f78ee40000(0000)
> > knlGS:0000000000000000
> > [   40.287569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   40.287570] CR2: ffffa77b833df000 CR3: 00000003fa556000 CR4: 00000000003406e0
> > [   40.287572] Call Trace:
> > [   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> > [   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> > [   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> > [   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> > [   40.287613]  ksys_ioctl+0x82/0xc0
> > [   40.287617]  __x64_sys_ioctl+0x16/0x20
> > [   40.287622]  do_syscall_64+0x4d/0x90
> > [   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Hi Roman,
> 
> If you have CONFIG_AMD_MEM_ENCRYPT set, this should be resolved by
> 
> commit dbed452a078d56bc7f1abecc3edd6a75e8e4484e
> Author: David Rientjes <rientjes@google.com>
> Date:   Thu Jun 11 00:25:57 2020 -0700
> 
>     dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
> 
> Or you might want to wait for 5.8-rc2 instead which includes this fix.
> 

Hello, David!

Thank you for pointing at it! Unfortunately, there must be something wrong
with drivers, your patch didn't help much. I still see the same stacktrace.

I'll try again after 5.8-rc2 will be out.

Thanks!
David Rientjes June 21, 2020, 8:05 p.m. UTC | #8
On Fri, 19 Jun 2020, Roman Gushchin wrote:

> > > [   40.287524] BUG: unable to handle page fault for address: ffffa77b833df000
> > > [   40.287529] #PF: supervisor write access in kernel mode
> > > [   40.287531] #PF: error_code(0x000b) - reserved bit violation
> > > [   40.287532] PGD 40d14e067 P4D 40d14e067 PUD 40d14f067 PMD 3ec54d067
> > > PTE 80001688033d9163
> > > [   40.287538] Oops: 000b [#2] SMP NOPTI
> > > [   40.287542] CPU: 9 PID: 1986 Comm: pulseaudio Tainted: G      D
> > >       5.8.0-rc1+ #697
> > > [   40.287544] Hardware name: Gigabyte Technology Co., Ltd.
> > > AB350-Gaming/AB350-Gaming-CF, BIOS F25 01/16/2019
> > > [   40.287550] RIP: 0010:__memset+0x24/0x30
> > > [   40.287553] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> > > d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> > > 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> > > d1 f3
> > > [   40.287556] RSP: 0018:ffffa77b827a7e08 EFLAGS: 00010216
> > > [   40.287558] RAX: 0000000000000000 RBX: ffff90f77dced800 RCX: 00000000000008a0
> > > [   40.287560] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa77b833df000
> > > [   40.287561] RBP: ffff90f7898c7000 R08: ffff90f78c507768 R09: ffffa77b833df000
> > > [   40.287563] R10: ffffa77b833df000 R11: ffff90f7839f2d40 R12: 0000000000000000
> > > [   40.287564] R13: ffff90f76a802e00 R14: ffffffffc0cb8880 R15: ffff90f770f4e700
> > > [   40.287567] FS:  00007f3d8e8df880(0000) GS:ffff90f78ee40000(0000)
> > > knlGS:0000000000000000
> > > [   40.287569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   40.287570] CR2: ffffa77b833df000 CR3: 00000003fa556000 CR4: 00000000003406e0
> > > [   40.287572] Call Trace:
> > > [   40.287584]  snd_pcm_hw_params+0x3fd/0x490 [snd_pcm]
> > > [   40.287593]  snd_pcm_common_ioctl+0x1c5/0x1110 [snd_pcm]
> > > [   40.287601]  ? snd_pcm_info_user+0x64/0x80 [snd_pcm]
> > > [   40.287608]  snd_pcm_ioctl+0x23/0x30 [snd_pcm]
> > > [   40.287613]  ksys_ioctl+0x82/0xc0
> > > [   40.287617]  __x64_sys_ioctl+0x16/0x20
> > > [   40.287622]  do_syscall_64+0x4d/0x90
> > > [   40.287627]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Hi Roman,
> > 
> > If you have CONFIG_AMD_MEM_ENCRYPT set, this should be resolved by
> > 
> > commit dbed452a078d56bc7f1abecc3edd6a75e8e4484e
> > Author: David Rientjes <rientjes@google.com>
> > Date:   Thu Jun 11 00:25:57 2020 -0700
> > 
> >     dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
> > 
> > Or you might want to wait for 5.8-rc2 instead which includes this fix.
> > 
> 
> Hello, David!
> 
> Thank you for pointing at it! Unfortunately, there must be something wrong
> with drivers, your patch didn't help much. I still see the same stacktrace.
> 

Wow, I'm very surprised.  Do you have CONFIG_AMD_MEM_ENCRYPT enabled?

Adding Takashi.
Joerg Roedel June 22, 2020, 12:46 p.m. UTC | #9
+ Robin

Robin, any idea on this?

On Thu, Jun 18, 2020 at 10:40:26PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Thu, Jun 18, 2020 at 06:14:49PM -0700, Roman Gushchin wrote:
> > I agree. The whole
> > 
> > 	page = alloc_pages_node(nid, alloc_flags, order);
> > 	if (!page)
> > 		continue;
> > 	if (!order)
> > 		break;
> > 	if (!PageCompound(page)) {
> > 		split_page(page, order);
> > 		break;
> > 	} else if (!split_huge_page(page)) {
> > 		break;
> > 	}
> > 
> > looks very suspicious to me.
> > My wild guess is that gfp flags changed somewhere above, so we hit
> > the branch which was never hit before.
> 
> Right to be suspicious about the above: split_huge_page on a regular
> page allocated by a driver was never meant to work.
> 
> The PageLocked BUG_ON is just a symptom of a bigger issue, basically
> split_huge_page it may survive, but it'll stay compound and in turn it
> must be freed as compound.
> 
> The respective free method doesn't even contemplate freeing compound
> pages, the only way the free method can survive, is by removing
> __GFP_COMP forcefully in the allocation that was perhaps set here
> (there are that many __GFP_COMP in that directory):
> 
> static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
> {
> 	gfp_t gfp_flags;
> 
> 	gfp_flags = GFP_KERNEL
> 		| __GFP_COMP	/* compound page lets parts be mapped */
> 
> And I'm not sure what the comment means here, compound or non compound
> doesn't make a difference when you map it, it's not a THP, the
> mappings must be handled manually so nothing should check PG_compound
> anyway in the mapping code.
> 
> Something like this may improve things, it's an untested quick hack,
> but this assumes it's always a bug to setup a compound page for these
> DMA allocations and given the API it's probably a correct
> assumption.. Compound is slower, unless you need it, you can avoid it
> and then split_page will give contiguous memory page granular. Ideally
> the code shouldn't call split_page at all and it should free it all at
> once by keeping track of the order and by returning the order to the
> caller, something the API can't do right now as it returns a plain
> array that can only represent individual small pages.
> 
> Once this is resolved, you may want to check your config, iommu passthrough
> sounds more optimal for a soundcard.
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index f68a62c3c32b..3dfbc010fa83 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -499,6 +499,10 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>  
>  	/* IOMMU can map any pages, so himem can also be used here */
>  	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> +	if (unlikely(gfp & __GFP_COMP)) {
> +		WARN();
> +		gfp &= ~__GFP_COMP;
> +	}
>  
>  	while (count) {
>  		struct page *page = NULL;
> @@ -522,13 +526,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>  				continue;
>  			if (!order)
>  				break;
> -			if (!PageCompound(page)) {
> -				split_page(page, order);
> -				break;
> -			} else if (!split_huge_page(page)) {
> -				break;
> -			}
> -			__free_pages(page, order);
> +			split_page(page, order);
> +			break;
>  		}
>  		if (!page) {
>  			__iommu_dma_free_pages(pages, i);
> diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
> index 6850d13aa98c..378f5a36ec5f 100644
> --- a/sound/core/memalloc.c
> +++ b/sound/core/memalloc.c
> @@ -28,7 +28,6 @@ static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
>  	gfp_t gfp_flags;
>  
>  	gfp_flags = GFP_KERNEL
> -		| __GFP_COMP	/* compound page lets parts be mapped */
>  		| __GFP_NORETRY /* don't trigger OOM-killer */
>  		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
>  	dmab->area = dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr,
Robin Murphy June 22, 2020, 3:30 p.m. UTC | #10
On 2020-06-22 13:46, Joerg Roedel wrote:
> + Robin
> 
> Robin, any idea on this?

After a bit of archaeology, this dates back to the original review:

https://lore.kernel.org/linux-arm-kernel/54C285D4.3070802@arm.com/
https://lore.kernel.org/linux-arm-kernel/54DA2666.9030003@arm.com/

In summary: originally this inherited from other arch code that did 
simply strip __GFP_COMP; that was deemed questionable because of the 
nonsensical comment about CONFIG_HUGETLBFS that was stuck to it; the 
current code is like it is because in 5 and a half years nobody said 
that it's wrong :)

If there actually *are* good reasons for stripping __GFP_COMP, then I've 
certainly no objection to doing so.

Robin.

> On Thu, Jun 18, 2020 at 10:40:26PM -0400, Andrea Arcangeli wrote:
>> Hello,
>>
>> On Thu, Jun 18, 2020 at 06:14:49PM -0700, Roman Gushchin wrote:
>>> I agree. The whole
>>>
>>> 	page = alloc_pages_node(nid, alloc_flags, order);
>>> 	if (!page)
>>> 		continue;
>>> 	if (!order)
>>> 		break;
>>> 	if (!PageCompound(page)) {
>>> 		split_page(page, order);
>>> 		break;
>>> 	} else if (!split_huge_page(page)) {
>>> 		break;
>>> 	}
>>>
>>> looks very suspicious to me.
>>> My wild guess is that gfp flags changed somewhere above, so we hit
>>> the branch which was never hit before.
>>
>> Right to be suspicious about the above: split_huge_page on a regular
>> page allocated by a driver was never meant to work.
>>
>> The PageLocked BUG_ON is just a symptom of a bigger issue, basically
>> split_huge_page it may survive, but it'll stay compound and in turn it
>> must be freed as compound.
>>
>> The respective free method doesn't even contemplate freeing compound
>> pages, the only way the free method can survive, is by removing
>> __GFP_COMP forcefully in the allocation that was perhaps set here
>> (there are that many __GFP_COMP in that directory):
>>
>> static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
>> {
>> 	gfp_t gfp_flags;
>>
>> 	gfp_flags = GFP_KERNEL
>> 		| __GFP_COMP	/* compound page lets parts be mapped */
>>
>> And I'm not sure what the comment means here, compound or non compound
>> doesn't make a difference when you map it, it's not a THP, the
>> mappings must be handled manually so nothing should check PG_compound
>> anyway in the mapping code.
>>
>> Something like this may improve things, it's an untested quick hack,
>> but this assumes it's always a bug to setup a compound page for these
>> DMA allocations and given the API it's probably a correct
>> assumption.. Compound is slower, unless you need it, you can avoid it
>> and then split_page will give contiguous memory page granular. Ideally
>> the code shouldn't call split_page at all and it should free it all at
>> once by keeping track of the order and by returning the order to the
>> caller, something the API can't do right now as it returns a plain
>> array that can only represent individual small pages.
>>
>> Once this is resolved, you may want to check your config, iommu passthrough
>> sounds more optimal for a soundcard.
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index f68a62c3c32b..3dfbc010fa83 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -499,6 +499,10 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>>   
>>   	/* IOMMU can map any pages, so himem can also be used here */
>>   	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>> +	if (unlikely(gfp & __GFP_COMP)) {
>> +		WARN();
>> +		gfp &= ~__GFP_COMP;
>> +	}
>>   
>>   	while (count) {
>>   		struct page *page = NULL;
>> @@ -522,13 +526,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>>   				continue;
>>   			if (!order)
>>   				break;
>> -			if (!PageCompound(page)) {
>> -				split_page(page, order);
>> -				break;
>> -			} else if (!split_huge_page(page)) {
>> -				break;
>> -			}
>> -			__free_pages(page, order);
>> +			split_page(page, order);
>> +			break;
>>   		}
>>   		if (!page) {
>>   			__iommu_dma_free_pages(pages, i);
>> diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
>> index 6850d13aa98c..378f5a36ec5f 100644
>> --- a/sound/core/memalloc.c
>> +++ b/sound/core/memalloc.c
>> @@ -28,7 +28,6 @@ static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
>>   	gfp_t gfp_flags;
>>   
>>   	gfp_flags = GFP_KERNEL
>> -		| __GFP_COMP	/* compound page lets parts be mapped */
>>   		| __GFP_NORETRY /* don't trigger OOM-killer */
>>   		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
>>   	dmab->area = dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr,
Andrea Arcangeli June 22, 2020, 9:56 p.m. UTC | #11
Hello,

On Mon, Jun 22, 2020 at 04:30:41PM +0100, Robin Murphy wrote:
> On 2020-06-22 13:46, Joerg Roedel wrote:
> > + Robin
> > 
> > Robin, any idea on this?
> 
> After a bit of archaeology, this dates back to the original review:
> 
> https://lore.kernel.org/linux-arm-kernel/54C285D4.3070802@arm.com/
> https://lore.kernel.org/linux-arm-kernel/54DA2666.9030003@arm.com/
> 
> In summary: originally this inherited from other arch code that did 
> simply strip __GFP_COMP; that was deemed questionable because of the 
> nonsensical comment about CONFIG_HUGETLBFS that was stuck to it; the 
> current code is like it is because in 5 and a half years nobody said 
> that it's wrong :)
> 
> If there actually *are* good reasons for stripping __GFP_COMP, then I've 
> certainly no objection to doing so.

The main question is if there's any good reasons for not forbidding
__GFP_COMP to be specified in the callers. The reason given in the
comment isn't convincing.

I don't see how a caller that gets a pointer can care about how the
page structure looks like and in turn why it's asking for __GFP_COMP.

As far as I can tell there are two orthogonal issues in play here:

1) The comment about __GFP_COMP facilitating the sound driver to do
   partial mapping doesn't make much sense. It's probably best to
   WARN_ON immediately in dma_alloc_coherent if __GFP_COMP is
   specified, not only down the call stack in the
   __iommu_dma_alloc_pages() path.

   Note: the CMA paths would already ignore __GFP_COMP if it's
   specified so that __GFP_COMP request can already be ignored. It
   sounds preferable to warn the caller it's asking something it can't
   get, than to silently ignore __GFP_COMP.

   On a side note: hugetlbfs/THP pages can only be allocated with
   __GFP_COMP because for example put_page() must work on all tail
   pages (you can't call compound_head() unless the tail page is part
   of a compound page). But for private driver pages mapped by
   remap_pfn_range, any full or partial mapping is done manually and
   nobody can call GUP on VM_PFNMAP|VM_IO anyway (there's not even the
   requirement of a page struct backing those mappings in fact).

2) __iommu_dma_alloc_pages cannot use __GFP_COMP if it intends to
   return an array of small pages, which is the only thing that the
   current sg_alloc_table_from_pages() supports in input. split_page
   will work as expected to generate small pages from non-compound
   order>0 pages, incidentally it's implement on mm/page_alloc.c, not
   in huge_memory.c.

   split_huge_page as opposed is not intended to be used on newly
   allocated compound page. Maybe we should renamed it to
   split_trans_huge_page to make it more explicit, since it won't even
   work on hugetlbfs (compound) pages.

Thanks,
Andrea
diff mbox series

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 4959f5df21bd..31e4e305d8d5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -24,6 +24,7 @@ 
 #include <linux/scatterlist.h>
 #include <linux/vmalloc.h>
 #include <linux/crash_dump.h>
+#include <linux/pagemap.h>
 
 struct iommu_dma_msi_page {
 	struct list_head	list;
@@ -549,8 +550,15 @@  static struct page **__iommu_dma_alloc_pages(struct device *dev,
 			if (!PageCompound(page)) {
 				split_page(page, order);
 				break;
-			} else if (!split_huge_page(page)) {
-				break;
+			} else {
+				int err;
+
+				lock_page(page);
+				err = split_huge_page(page);
+				unlock_page(page);
+
+				if (!err)
+					break;
 			}
 			__free_pages(page, order);
 		}