[git,pull] drm urgent for 6.10-rc1

Message ID CAPM=9tx_KS1qc8E1kUB5PPBvO9EKHNkk7hYWu-WwWJ6os=otJA@mail.gmail.com (mailing list archive)
State New, archived

Pull-request

https://gitlab.freedesktop.org/drm/kernel.git tags/drm-next-2024-05-16

Message

Dave Airlie May 16, 2024, 2:53 a.m. UTC
Hi Linus,

Here is the buddy allocator fix I picked up from the list, please apply.

Dave.

drm-next-2024-05-16:
drm urgent for 6.10-rc1 merge:

buddy:
- fix breakage in buddy allocator.

The following changes since commit 275654c02f0ba09d409c36d71dc238e470741e30:

  Merge tag 'drm-xe-next-fixes-2024-05-09-1' of
https://gitlab.freedesktop.org/drm/xe/kernel into drm-next (2024-05-10
12:41:34 +1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/drm/kernel.git tags/drm-next-2024-05-16

for you to fetch changes up to 431c590c3ab0469dfedad3a832fe73556396ee52:

  drm/tests: Add a unit test for range bias allocation (2024-05-16
12:50:14 +1000)

----------------------------------------------------------------
drm urgent for 6.10-rc1 merge:

buddy:
- fix breakage in buddy allocator.

----------------------------------------------------------------
Arunpravin Paneer Selvam (2):
      drm/buddy: Fix the range bias clear memory allocation issue
      drm/tests: Add a unit test for range bias allocation

 drivers/gpu/drm/drm_buddy.c            |  3 ++-
 drivers/gpu/drm/tests/drm_buddy_test.c | 36 +++++++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 2 deletions(-)

Comments

pr-tracker-bot@kernel.org May 16, 2024, 3:53 p.m. UTC | #1
The pull request you sent on Thu, 16 May 2024 12:53:52 +1000:

> https://gitlab.freedesktop.org/drm/kernel.git tags/drm-next-2024-05-16

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/972a2543e3dd87f7310d65944b857631b4290e12

Thank you!
Linus Torvalds May 16, 2024, 5:25 p.m. UTC | #2
On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
>
> Here is the buddy allocator fix I picked up from the list, please apply.

So I removed my reverts, and am running a kernel that includes the
merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
warnings as per below.

I was going to say that the difference is that now they trigger
through the page fault path (amdgpu_gem_fault) while previously they
triggered through the system call path and amdgpu_drm_ioctl. But it
turns out it's both in both cases, and it just happened to be one or
the other in the particular warnings that I cut-and-pasted.
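
For anyone wanting to check the fault-versus-ioctl split on their own logs, the following is a minimal sketch rather than anything used in this thread. It assumes every report contains a "Call Trace:" line and that the entry point can be recognised by the symbol names mentioned above (amdgpu_gem_fault, amdgpu_drm_ioctl). The helper name classify_traces and the labels are hypothetical.

```python
from collections import Counter

# Symbols taken from the traces discussed in this thread; the labels
# are only for illustration.
ENTRY_POINTS = {
    "amdgpu_gem_fault": "page fault",
    "amdgpu_drm_ioctl": "ioctl",
}


def classify_traces(log_text):
    """Count warning reports by the entry point seen in each call trace.

    Assumption: each report contains exactly one "Call Trace:" line, so
    splitting on that marker yields one chunk per report.
    """
    counts = Counter()
    for chunk in log_text.split("Call Trace:")[1:]:
        label = "other"
        for symbol, name in ENTRY_POINTS.items():
            if symbol in chunk:
                label = name
                break
        counts[label] += 1
    return counts
```

Running it over saved dmesg output gives a per-path count, which makes it easy to see that both paths show up rather than relying on whichever report happened to be pasted.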

As before, there are tens of thousands of them after being up for less
than an hour, so this is not some kind of rare thing.

The machine hasn't _crashed_ yet, though. But I'm going to be out and
about and working on my laptop the rest of the day, so I won't be able
to test.

(And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
in the WARN isn't some official kernel, I have about ten private
patches that I keep testing in my tree, so if you wondered what the
heck that git version is, it's not going to match anything you see,
but the ~ten patches also aren't relevant to this).

Nothing unusual in the config, although this is clang-built. Shouldn't
matter, never has before.

            Linus

---
CPU: 28 PID: 3326 Comm: mutter-x11-fram Tainted: G        W
6.9.0-08295-gfd39ab3b5289 #64
Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40
AORUS MASTER, BIOS F7 09/07/2022
RIP: 0010:__force_merge+0x14f/0x180 [drm_buddy]
Code: 74 0d 49 8b 44 24 18 48 d3 e0 49 29 44 24 30 4c 89 e7 ba 01 00
00 00 e8 9f 00 00 00 44 39 e8 73 1f 49 8b 04 24 e9 25 ff ff ff <0f> 0b
4c 39 c3 75 a3 eb 99 b8 f4 ff ff ff c3 b8 f4 ff ff ff eb 02
RSP: 0000:ffff9e350314baa0 EFLAGS: 00010246
RAX: ffff974a227a4a00 RBX: ffff974a2d024b88 RCX: 000000000b8eb800
RDX: ffff974a2d024bf8 RSI: ffff974a2d024bd0 RDI: ffff974a2d024bb0
RBP: 0000000000000000 R08: ffff974a2d024b88 R09: 0000000000001000
R10: 0000000000000800 R11: 0000000000000000 R12: ffff974a2198fa18
R13: 0000000000000009 R14: 0000000010000000 R15: 0000000000000000
FS:  00007f56a78b6540(0000) GS:ffff97591e700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5688040000 CR3: 0000000198cc9000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ? __warn+0xc1/0x190
 ? __force_merge+0x14f/0x180 [drm_buddy]
 ? report_bug+0x129/0x1a0
 ? handle_bug+0x3d/0x70
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __force_merge+0x14f/0x180 [drm_buddy]
 drm_buddy_alloc_blocks+0x249/0x400 [drm_buddy]
 ? __cond_resched+0x16/0x40
 amdgpu_vram_mgr_new+0x204/0x3f0 [amdgpu]
 ttm_resource_alloc+0x31/0x120 [ttm]
 ttm_bo_alloc_resource+0xbc/0x260 [ttm]
 ? memcg_account_kmem+0x4a/0xe0
 ? ttm_resource_compatible+0xbb/0xe0 [ttm]
 ttm_bo_validate+0x9f/0x210 [ttm]
 ? __alloc_pages+0x129/0x210
 amdgpu_bo_fault_reserve_notify+0x98/0x110 [amdgpu]
 amdgpu_gem_fault+0x53/0xd0 [amdgpu]
 __do_fault+0x41/0x140
 do_pte_missing+0x453/0xfd0
 handle_mm_fault+0x73c/0x1090
 do_user_addr_fault+0x2e2/0x6f0
 exc_page_fault+0x56/0x110
 asm_exc_page_fault+0x22/0x30
Alex Deucher May 16, 2024, 6:31 p.m. UTC | #3
On Thu, May 16, 2024 at 2:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
> >
> > Here is the buddy allocator fix I picked up from the list, please apply.
>
> So I removed my reverts, and am running a kernel that includes the
> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
> warnings as per below.
>
> I was going to say that the difference is that now they trigger
> through the page fault path (amdgpu_gem_fault) while previously they
> triggered through the system call path and amdgpu_drm_ioctl. But it
> turns out it's both in both cases, and it just happened to be one or
> the other in the particular warnings that I cut-and-pasted.
>
> As before, there are tens of thousands of them after being up for less
> than an hour, so this is not some kind of rare thing.
>
> The machine hasn't _crashed_ yet, though. But I'm going to be out and
> about and working on my laptop the rest of the day, so I won't be able
> to test.
>
> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> in the WARN isn't some official kernel, I have about ten private
> patches that I keep testing in my tree, so if you wondered what the
> heck that git version is, it's not going to match anything you see,
> but the ~ten patches also aren't relevant to this).
>
> Nothing unusual in the config, although this is clang-built. Shouldn't
> matter, never has before.

Arun is investigating and trying to repro it.  You still have a
polaris based GPU right?

Thanks,

Alex

Paneer Selvam, Arunpravin May 16, 2024, 9:57 p.m. UTC | #4
On 5/17/2024 12:01 AM, Alex Deucher wrote:
> On Thu, May 16, 2024 at 2:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
>>> Here is the buddy allocator fix I picked up from the list, please apply.
>> So I removed my reverts, and am running a kernel that includes the
>> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
>> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
>> warnings as per below.
>>
>> I was going to say that the difference is that now they trigger
>> through the page fault path (amdgpu_gem_fault) while previously they
>> triggered through the system call path and amdgpu_drm_ioctl. But it
>> turns out it's both in both cases, and it just happened to be one or
>> the other in the particular warnings that I cut-and-pasted.
>>
>> As before, there are tens of thousands of them after being up for less
>> than an hour, so this is not some kind of rare thing.
>>
>> The machine hasn't _crashed_ yet, though. But I'm going to be out and
>> about and working on my laptop the rest of the day, so I won't be able
>> to test.
>>
>> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
>> in the WARN isn't some official kernel, I have about ten private
>> patches that I keep testing in my tree, so if you wondered what the
>> heck that git version is, it's not going to match anything you see,
>> but the ~ten patches also aren't relevant to this).
>>
>> Nothing unusual in the config, although this is clang-built. Shouldn't
>> matter, never has before.
> Arun is investigating and trying to repro it.  You still have a
> polaris based GPU right?
We haven't been able to reproduce it across a variety of GPUs. Would it
be possible to send your dmesg logs and kernel config? I will check this
on the same GPU you are using.

Thanks,
Arun.
Dave Airlie May 17, 2024, 1:08 a.m. UTC | #5
> >>
> >> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> >> in the WARN isn't some official kernel, I have about ten private
> >> patches that I keep testing in my tree, so if you wondered what the
> >> heck that git version is, it's not going to match anything you see,
> >> but the ~ten patches also aren't relevant to this).
> >>
> >> Nothing unusual in the config, although this is clang-built. Shouldn't
> >> matter, never has before.
> > Arun is investigating and trying to repro it.  You still have a
> > polaris based GPU right?
> We haven't been able to reproduce it across a variety of GPUs. Would it
> be possible to send your dmesg logs and kernel config? I will check this
> on the same GPU you are using.

I just installed my RX480 polaris card in my AMD test machine, and
with current origin/master I'm not seeing this at all.

Running an F40 GNOME desktop, doing firefox etc.

Linus, do you see it at boot straight away?

Dave.
Alex Deucher May 17, 2024, 1:55 p.m. UTC | #6
On Thu, May 16, 2024 at 2:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
> >
> > Here is the buddy allocator fix I picked up from the list, please apply.
>
> So I removed my reverts, and am running a kernel that includes the
> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
> warnings as per below.
>
> I was going to say that the difference is that now they trigger
> through the page fault path (amdgpu_gem_fault) while previously they
> triggered through the system call path and amdgpu_drm_ioctl. But it
> turns out it's both in both cases, and it just happened to be one or
> the other in the particular warnings that I cut-and-pasted.
>
> As before, there are tens of thousands of them after being up for less
> than an hour, so this is not some kind of rare thing.
>
> The machine hasn't _crashed_ yet, though. But I'm going to be out and
> about and working on my laptop the rest of the day, so I won't be able
> to test.
>
> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> in the WARN isn't some official kernel, I have about ten private
> patches that I keep testing in my tree, so if you wondered what the
> heck that git version is, it's not going to match anything you see,
> but the ~ten patches also aren't relevant to this).
>
> Nothing unusual in the config, although this is clang-built. Shouldn't
> matter, never has before.

Can you try this patch?
https://patchwork.freedesktop.org/patch/594539/

Alex

Linus Torvalds May 17, 2024, 7:22 p.m. UTC | #7
On Thu, 16 May 2024 at 18:08, Dave Airlie <airlied@gmail.com> wrote:
>
> Linus, do you see it at boot straight away?

Ok, back at that computer now, and yes, I see those messages right
away. In fact, they seem to happen before gnome even starts up, ie I
see those messages long before the first messages from gnome-session:

    May 17 12:07:17 tr3970x kernel: WARNING: CPU: 4 PID: 1067 at
drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]
    .. lots and lots and lots of them ..
    ...
    May 17 12:07:23 tr3970x systemd-cryptsetup[982]: ...
    ...
    May 17 12:07:25 tr3970x systemd[1]: Reached target basic.target
    ...
    May 17 12:07:25 tr3970x systemd[1]: Mounted sysroot.mount - /sysroot.
    ...
    May 17 12:07:25 tr3970x systemd[1]: Switching root.
    ...
    May 17 12:07:36 tr3970x gnome-session[2824]: ..
    ...
    May 17 12:07:36 tr3970x gnome-shell[2836]: Obtained a high
priority EGL context
    May 17 12:07:36 tr3970x kernel: WARNING: CPU: 31 PID: 2836 at
drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]
    .. lots of warnings resume ...

IOW, it happens already during the graphical boot before I have even
typed in my disk encryption password.

Then it starts again when gnome starts.

I just checked: I have exactly 8192 warnings from the early boot
before the first gnome warning. Which sounds like too round a number
to be an accident.
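
A count like that can be reproduced from a saved journal with something along these lines. This is only a sketch: it assumes one log entry per line (the excerpt above is wrapped by the mail client), that the warnings can be recognised by "WARNING:" together with "__force_merge", and that "gnome-session[" marks the first gnome message. The function names are hypothetical.

```python
def count_early_warnings(lines, stop_marker="gnome-session["):
    """Count __force_merge warnings seen before the first stop_marker line."""
    count = 0
    for line in lines:
        if stop_marker in line:
            break
        if "WARNING:" in line and "__force_merge" in line:
            count += 1
    return count


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; used to check how 'round' a count is."""
    return n > 0 and n & (n - 1) == 0
```

8192 is 2^13, so is_power_of_two(8192) is true, which is what makes the number look like a fixed limit somewhere rather than a coincidence.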

I will try the patch Alex pointed at next:

    https://patchwork.freedesktop.org/patch/594539/

and see if that fixes it for me.

                 Linus
Linus Torvalds May 17, 2024, 7:27 p.m. UTC | #8
On Fri, 17 May 2024 at 06:55, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Can you try this patch?
> https://patchwork.freedesktop.org/patch/594539/

Ack. This seems to fix it for me - unless the problem is random and
only happens sometimes, and I've just been *very* unlucky until now.

                Linus