mbox series

[v4,00/14] drm: Add a driver for CSF-based Mali GPUs

Message ID 20240122163047.1954733-1-boris.brezillon@collabora.com (mailing list archive)
Headers show
Series drm: Add a driver for CSF-based Mali GPUs | expand

Message

Boris Brezillon Jan. 22, 2024, 4:30 p.m. UTC
Hello,

This is the 4th version of the kernel driver for Mali CSF-based GPUs.

A branch based on drm-misc-next and containing all the dependencies
that are not yet available in drm-misc-next here[1], and another [2]
containing extra patches to have things working on rk3588. The CSF
firmware binary can be found here[3], and should be placed under
/lib/firmware/arm/mali/arch10.8/mali_csffw.bin.

The mesa MR adding v10 support on top of panthor is available here [4].

Steve, I intentionally dropped your R-b on "drm/panthor: Add the heap
logical block" and "drm/panthor: Add the scheduler logical block"
because the tiler-OOM handling changed enough to require a new review
IMHO.

Regarding the GPL2+MIT relicensing, I collected Clément's R-b for the
devfreq code, but am still lacking Alexey Sheplyakov for some bits in
panthor_gpu.c. The rest of the code is either new, or covered by the
Linaro, Arm and Collabora acks.

And here is a non-exhaustive changelog, check each commit for a detailed
changelog.

v4:
- Fix various bugs in the VM logic
- Address comments from Steven, Liviu, Ketil and Chris
- Move tiler OOM handling out of the scheduler interrupt handling path
  so we can properly recover when the system runs out of memory, and
  panthor is blocked trying to allocate heap chunks
- Rework the heap locking to support concurrent chunk allocation. Not
  sure if this is supposed to happen, but we need to be robust against
  userspace passing the same heap context to two scheduling groups.
  Wasn't needed before the tiler_oom rework, because heap allocation
  base serialized by the scheduler lock.
- Make kernel BO destruction robust to NULL/ERR pointers

v3;
- Quite a few changes at the MMU/sched level to make the fix some
  race conditions and deadlocks
- Addition of the a sync-only VM_BIND operation (to support
  vkQueueSparseBind with zero commands).
- Addition of a VM_GET_STATE ioctl
- Various cosmetic changes (see the commit changelogs for more details)
- Various fixes (see the commit changelogs for more details)

v2:
- Rename the driver (pancsf -> panthor)
- Split the commit adding the driver to ease review
- Use drm_sched for dependency tracking/job submission
- Add a VM_BIND ioctl
- Add the concept of exclusive VM for BOs that are only ever mapped to a
  single VM
- Document the code and uAPI
- Add a DT binding doc

Regards,

Boris

[1]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4
[2]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4+rk3588
[3]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
[4]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26358

Boris Brezillon (13):
  drm/panthor: Add uAPI
  drm/panthor: Add GPU register definitions
  drm/panthor: Add the device logical block
  drm/panthor: Add the GPU logical block
  drm/panthor: Add GEM logical block
  drm/panthor: Add the devfreq logical block
  drm/panthor: Add the MMU/VM logical block
  drm/panthor: Add the FW logical block
  drm/panthor: Add the heap logical block
  drm/panthor: Add the scheduler logical block
  drm/panthor: Add the driver frontend block
  drm/panthor: Allow driver compilation
  drm/panthor: Add an entry to MAINTAINERS

Liviu Dudau (1):
  dt-bindings: gpu: mali-valhall-csf: Add support for Arm Mali CSF GPUs

 .../bindings/gpu/arm,mali-valhall-csf.yaml    |  147 +
 Documentation/gpu/driver-uapi.rst             |    5 +
 MAINTAINERS                                   |   11 +
 drivers/gpu/drm/Kconfig                       |    2 +
 drivers/gpu/drm/Makefile                      |    1 +
 drivers/gpu/drm/panthor/Kconfig               |   23 +
 drivers/gpu/drm/panthor/Makefile              |   15 +
 drivers/gpu/drm/panthor/panthor_devfreq.c     |  283 ++
 drivers/gpu/drm/panthor/panthor_devfreq.h     |   25 +
 drivers/gpu/drm/panthor/panthor_device.c      |  544 +++
 drivers/gpu/drm/panthor/panthor_device.h      |  393 ++
 drivers/gpu/drm/panthor/panthor_drv.c         | 1470 +++++++
 drivers/gpu/drm/panthor/panthor_fw.c          | 1334 +++++++
 drivers/gpu/drm/panthor/panthor_fw.h          |  504 +++
 drivers/gpu/drm/panthor/panthor_gem.c         |  228 ++
 drivers/gpu/drm/panthor/panthor_gem.h         |  144 +
 drivers/gpu/drm/panthor/panthor_gpu.c         |  482 +++
 drivers/gpu/drm/panthor/panthor_gpu.h         |   52 +
 drivers/gpu/drm/panthor/panthor_heap.c        |  596 +++
 drivers/gpu/drm/panthor/panthor_heap.h        |   39 +
 drivers/gpu/drm/panthor/panthor_mmu.c         | 2760 +++++++++++++
 drivers/gpu/drm/panthor/panthor_mmu.h         |  102 +
 drivers/gpu/drm/panthor/panthor_regs.h        |  239 ++
 drivers/gpu/drm/panthor/panthor_sched.c       | 3500 +++++++++++++++++
 drivers/gpu/drm/panthor/panthor_sched.h       |   48 +
 include/uapi/drm/panthor_drm.h                |  945 +++++
 26 files changed, 13892 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
 create mode 100644 drivers/gpu/drm/panthor/Kconfig
 create mode 100644 drivers/gpu/drm/panthor/Makefile
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
 create mode 100644 include/uapi/drm/panthor_drm.h

Comments

Heiko Stuebner Jan. 23, 2024, 4:18 p.m. UTC | #1
Am Montag, 22. Januar 2024, 17:30:31 CET schrieb Boris Brezillon:
> Hello,
> 
> This is the 4th version of the kernel driver for Mali CSF-based GPUs.
> 
> A branch based on drm-misc-next and containing all the dependencies
> that are not yet available in drm-misc-next here[1], and another [2]
> containing extra patches to have things working on rk3588. The CSF
> firmware binary can be found here[3], and should be placed under
> /lib/firmware/arm/mali/arch10.8/mali_csffw.bin.
> 
> The mesa MR adding v10 support on top of panthor is available here [4].
> 
> Steve, I intentionally dropped your R-b on "drm/panthor: Add the heap
> logical block" and "drm/panthor: Add the scheduler logical block"
> because the tiler-OOM handling changed enough to require a new review
> IMHO.
> 
> Regarding the GPL2+MIT relicensing, I collected Clément's R-b for the
> devfreq code, but am still lacking Alexey Sheplyakov for some bits in
> panthor_gpu.c. The rest of the code is either new, or covered by the
> Linaro, Arm and Collabora acks.

On a rk3588-jaguar with a corresponding mesa build, series:
Tested-by: Heiko Stuebner <heiko@sntech.de>


Thanks for all the work on this
Heiko
Boris Brezillon Jan. 29, 2024, 7:58 a.m. UTC | #2
On Mon, 22 Jan 2024 17:30:31 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hello,
> 
> This is the 4th version of the kernel driver for Mali CSF-based GPUs.
> 
> A branch based on drm-misc-next and containing all the dependencies
> that are not yet available in drm-misc-next here[1], and another [2]
> containing extra patches to have things working on rk3588. The CSF
> firmware binary can be found here[3], and should be placed under
> /lib/firmware/arm/mali/arch10.8/mali_csffw.bin.
> 
> The mesa MR adding v10 support on top of panthor is available here [4].
> 
> Steve, I intentionally dropped your R-b on "drm/panthor: Add the heap
> logical block" and "drm/panthor: Add the scheduler logical block"
> because the tiler-OOM handling changed enough to require a new review
> IMHO.
> 
> Regarding the GPL2+MIT relicensing, I collected Clément's R-b for the
> devfreq code, but am still lacking Alexey Sheplyakov for some bits in
> panthor_gpu.c. The rest of the code is either new, or covered by the
> Linaro, Arm and Collabora acks.
> 
> And here is a non-exhaustive changelog, check each commit for a detailed
> changelog.
> 
> v4:
> - Fix various bugs in the VM logic
> - Address comments from Steven, Liviu, Ketil and Chris
> - Move tiler OOM handling out of the scheduler interrupt handling path
>   so we can properly recover when the system runs out of memory, and
>   panthor is blocked trying to allocate heap chunks
> - Rework the heap locking to support concurrent chunk allocation. Not
>   sure if this is supposed to happen, but we need to be robust against
>   userspace passing the same heap context to two scheduling groups.
>   Wasn't needed before the tiler_oom rework, because heap allocation
>   base serialized by the scheduler lock.
> - Make kernel BO destruction robust to NULL/ERR pointers
> 
> v3;
> - Quite a few changes at the MMU/sched level to make the fix some
>   race conditions and deadlocks
> - Addition of the a sync-only VM_BIND operation (to support
>   vkQueueSparseBind with zero commands).
> - Addition of a VM_GET_STATE ioctl
> - Various cosmetic changes (see the commit changelogs for more details)
> - Various fixes (see the commit changelogs for more details)
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Split the commit adding the driver to ease review
> - Use drm_sched for dependency tracking/job submission
> - Add a VM_BIND ioctl
> - Add the concept of exclusive VM for BOs that are only ever mapped to a
>   single VM
> - Document the code and uAPI
> - Add a DT binding doc
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4
> [2]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4+rk3588
> [3]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin

Here's a link to a more recent/maintained libmali tree:

[3]https://github.com/JeffyCN/mirrors/raw/libmali/firmware/g610/mali_csffw.bin


> [4]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26358
> 
> Boris Brezillon (13):
>   drm/panthor: Add uAPI
>   drm/panthor: Add GPU register definitions
>   drm/panthor: Add the device logical block
>   drm/panthor: Add the GPU logical block
>   drm/panthor: Add GEM logical block
>   drm/panthor: Add the devfreq logical block
>   drm/panthor: Add the MMU/VM logical block
>   drm/panthor: Add the FW logical block
>   drm/panthor: Add the heap logical block
>   drm/panthor: Add the scheduler logical block
>   drm/panthor: Add the driver frontend block
>   drm/panthor: Allow driver compilation
>   drm/panthor: Add an entry to MAINTAINERS
> 
> Liviu Dudau (1):
>   dt-bindings: gpu: mali-valhall-csf: Add support for Arm Mali CSF GPUs
> 
>  .../bindings/gpu/arm,mali-valhall-csf.yaml    |  147 +
>  Documentation/gpu/driver-uapi.rst             |    5 +
>  MAINTAINERS                                   |   11 +
>  drivers/gpu/drm/Kconfig                       |    2 +
>  drivers/gpu/drm/Makefile                      |    1 +
>  drivers/gpu/drm/panthor/Kconfig               |   23 +
>  drivers/gpu/drm/panthor/Makefile              |   15 +
>  drivers/gpu/drm/panthor/panthor_devfreq.c     |  283 ++
>  drivers/gpu/drm/panthor/panthor_devfreq.h     |   25 +
>  drivers/gpu/drm/panthor/panthor_device.c      |  544 +++
>  drivers/gpu/drm/panthor/panthor_device.h      |  393 ++
>  drivers/gpu/drm/panthor/panthor_drv.c         | 1470 +++++++
>  drivers/gpu/drm/panthor/panthor_fw.c          | 1334 +++++++
>  drivers/gpu/drm/panthor/panthor_fw.h          |  504 +++
>  drivers/gpu/drm/panthor/panthor_gem.c         |  228 ++
>  drivers/gpu/drm/panthor/panthor_gem.h         |  144 +
>  drivers/gpu/drm/panthor/panthor_gpu.c         |  482 +++
>  drivers/gpu/drm/panthor/panthor_gpu.h         |   52 +
>  drivers/gpu/drm/panthor/panthor_heap.c        |  596 +++
>  drivers/gpu/drm/panthor/panthor_heap.h        |   39 +
>  drivers/gpu/drm/panthor/panthor_mmu.c         | 2760 +++++++++++++
>  drivers/gpu/drm/panthor/panthor_mmu.h         |  102 +
>  drivers/gpu/drm/panthor/panthor_regs.h        |  239 ++
>  drivers/gpu/drm/panthor/panthor_sched.c       | 3500 +++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_sched.h       |   48 +
>  include/uapi/drm/panthor_drm.h                |  945 +++++
>  26 files changed, 13892 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
>  create mode 100644 drivers/gpu/drm/panthor/Kconfig
>  create mode 100644 drivers/gpu/drm/panthor/Makefile
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
>  create mode 100644 include/uapi/drm/panthor_drm.h
>
Andy Yan Jan. 29, 2024, 9:20 a.m. UTC | #3
Hi Boris:

Thanks for you great work。

One thing please take note:
commit (arm64: dts: rockchip: rk3588: Add GPU nodes)  in [1] seems remove the "disabled" status 
of usb_host2_xhci, this may cause a boot issue on some boards that use combphy2_psu  phy for other functions.

https://gitlab.freedesktop.org/panfrost/linux/-/commit/bf5b542294894219c2d366f6dd0d32a9dcf17252


At 2024-01-23 00:30:31, "Boris Brezillon" <boris.brezillon@collabora.com> wrote:
>Hello,
>
>This is the 4th version of the kernel driver for Mali CSF-based GPUs.
>
>A branch based on drm-misc-next and containing all the dependencies
>that are not yet available in drm-misc-next here[1], and another [2]
>containing extra patches to have things working on rk3588. The CSF
>firmware binary can be found here[3], and should be placed under
>/lib/firmware/arm/mali/arch10.8/mali_csffw.bin.
>
>The mesa MR adding v10 support on top of panthor is available here [4].
>
>Steve, I intentionally dropped your R-b on "drm/panthor: Add the heap
>logical block" and "drm/panthor: Add the scheduler logical block"
>because the tiler-OOM handling changed enough to require a new review
>IMHO.
>
>Regarding the GPL2+MIT relicensing, I collected Clément's R-b for the
>devfreq code, but am still lacking Alexey Sheplyakov for some bits in
>panthor_gpu.c. The rest of the code is either new, or covered by the
>Linaro, Arm and Collabora acks.
>
>And here is a non-exhaustive changelog, check each commit for a detailed
>changelog.
>
>v4:
>- Fix various bugs in the VM logic
>- Address comments from Steven, Liviu, Ketil and Chris
>- Move tiler OOM handling out of the scheduler interrupt handling path
>  so we can properly recover when the system runs out of memory, and
>  panthor is blocked trying to allocate heap chunks
>- Rework the heap locking to support concurrent chunk allocation. Not
>  sure if this is supposed to happen, but we need to be robust against
>  userspace passing the same heap context to two scheduling groups.
>  Wasn't needed before the tiler_oom rework, because heap allocation
>  base serialized by the scheduler lock.
>- Make kernel BO destruction robust to NULL/ERR pointers
>
>v3;
>- Quite a few changes at the MMU/sched level to make the fix some
>  race conditions and deadlocks
>- Addition of the a sync-only VM_BIND operation (to support
>  vkQueueSparseBind with zero commands).
>- Addition of a VM_GET_STATE ioctl
>- Various cosmetic changes (see the commit changelogs for more details)
>- Various fixes (see the commit changelogs for more details)
>
>v2:
>- Rename the driver (pancsf -> panthor)
>- Split the commit adding the driver to ease review
>- Use drm_sched for dependency tracking/job submission
>- Add a VM_BIND ioctl
>- Add the concept of exclusive VM for BOs that are only ever mapped to a
>  single VM
>- Document the code and uAPI
>- Add a DT binding doc
>
>Regards,
>
>Boris
>
>[1]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4
>[2]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4+rk3588
>[3]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
>[4]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26358
>
>Boris Brezillon (13):
>  drm/panthor: Add uAPI
>  drm/panthor: Add GPU register definitions
>  drm/panthor: Add the device logical block
>  drm/panthor: Add the GPU logical block
>  drm/panthor: Add GEM logical block
>  drm/panthor: Add the devfreq logical block
>  drm/panthor: Add the MMU/VM logical block
>  drm/panthor: Add the FW logical block
>  drm/panthor: Add the heap logical block
>  drm/panthor: Add the scheduler logical block
>  drm/panthor: Add the driver frontend block
>  drm/panthor: Allow driver compilation
>  drm/panthor: Add an entry to MAINTAINERS
>
>Liviu Dudau (1):
>  dt-bindings: gpu: mali-valhall-csf: Add support for Arm Mali CSF GPUs
>
> .../bindings/gpu/arm,mali-valhall-csf.yaml    |  147 +
> Documentation/gpu/driver-uapi.rst             |    5 +
> MAINTAINERS                                   |   11 +
> drivers/gpu/drm/Kconfig                       |    2 +
> drivers/gpu/drm/Makefile                      |    1 +
> drivers/gpu/drm/panthor/Kconfig               |   23 +
> drivers/gpu/drm/panthor/Makefile              |   15 +
> drivers/gpu/drm/panthor/panthor_devfreq.c     |  283 ++
> drivers/gpu/drm/panthor/panthor_devfreq.h     |   25 +
> drivers/gpu/drm/panthor/panthor_device.c      |  544 +++
> drivers/gpu/drm/panthor/panthor_device.h      |  393 ++
> drivers/gpu/drm/panthor/panthor_drv.c         | 1470 +++++++
> drivers/gpu/drm/panthor/panthor_fw.c          | 1334 +++++++
> drivers/gpu/drm/panthor/panthor_fw.h          |  504 +++
> drivers/gpu/drm/panthor/panthor_gem.c         |  228 ++
> drivers/gpu/drm/panthor/panthor_gem.h         |  144 +
> drivers/gpu/drm/panthor/panthor_gpu.c         |  482 +++
> drivers/gpu/drm/panthor/panthor_gpu.h         |   52 +
> drivers/gpu/drm/panthor/panthor_heap.c        |  596 +++
> drivers/gpu/drm/panthor/panthor_heap.h        |   39 +
> drivers/gpu/drm/panthor/panthor_mmu.c         | 2760 +++++++++++++
> drivers/gpu/drm/panthor/panthor_mmu.h         |  102 +
> drivers/gpu/drm/panthor/panthor_regs.h        |  239 ++
> drivers/gpu/drm/panthor/panthor_sched.c       | 3500 +++++++++++++++++
> drivers/gpu/drm/panthor/panthor_sched.h       |   48 +
> include/uapi/drm/panthor_drm.h                |  945 +++++
> 26 files changed, 13892 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> create mode 100644 drivers/gpu/drm/panthor/Kconfig
> create mode 100644 drivers/gpu/drm/panthor/Makefile
> create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
> create mode 100644 include/uapi/drm/panthor_drm.h
>
>-- 
>2.43.0
Boris Brezillon Jan. 29, 2024, 10:41 a.m. UTC | #4
On Mon, 29 Jan 2024 17:20:47 +0800 (CST)
"Andy Yan" <andyshrk@163.com> wrote:

> Hi Boris:
> 
> Thanks for you great work。
> 
> One thing please take note:
> commit (arm64: dts: rockchip: rk3588: Add GPU nodes)  in [1] seems remove the "disabled" status 
> of usb_host2_xhci, this may cause a boot issue on some boards that use combphy2_psu  phy for other functions.

Oops, should be fixed in
https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
now.

Thanks,

Boris
Andy Yan Feb. 4, 2024, 1:14 a.m. UTC | #5
Hi Boris:
I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
[15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
[37743.040737] ------------[ cut here ]------------
[37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
[37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
[37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
[37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
[37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
[37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
[37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
[37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
[37743.041181] sp : ffff80008d37bcc0
[37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
[37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
[37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
[37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
[37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
[37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
[37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
[37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
[37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
[37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
[37743.041290] Call trace:
[37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
[37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
[37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
[37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
[37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
[37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]
[37743.041507]  process_one_work+0x15c/0x3a4
[37743.041526]  worker_thread+0x32c/0x438
[37743.041536]  kthread+0x108/0x10c
[37743.041546]  ret_from_fork+0x10/0x20
[37743.041557] ---[ end trace 0000000000000000 ]---
rk3588@rk3588-evb1:~$ 
rk3588@rk3588-evb1:~$ neofetch 
                                                                                                                              
                                 ------------------ 
      █ █ █ █ █ █ █ █ █ █ █      OS: Armbian (24.2.0-trunk) aarch64 
     ███████████████████████     Host: Rockchip RK3588 EVB1 V10 Board 
   ▄▄██                   ██▄▄   Kernel: 6.8.0-rc1-edge-rockchip-rk3588 
   ▄▄██    ███████████    ██▄▄   Uptime: 13 hours, 28 mins 
   ▄▄██   ██         ██   ██▄▄   Packages: 1455 (dpkg) 
   ▄▄██   ██         ██   ██▄▄   Shell: bash 5.2.15 
   ▄▄██   ██         ██   ██▄▄   Resolution: 3840x2160 
   ▄▄██   █████████████   ██▄▄   Terminal: /dev/pts/1 
   ▄▄██   ██         ██   ██▄▄   CPU: (8) @ 1.800GHz 
   ▄▄██   ██         ██   ██▄▄   Memory: 2062MiB / 7687MiB 
   ▄▄██   ██         ██   ██▄▄
   ▄▄██                   ██▄▄                           
     ███████████████████████                             
      █ █ █ █ █ █ █ █ █ █ █




在 2024-01-29 18:41:47,"Boris Brezillon" <boris.brezillon@collabora.com> 写道:
>On Mon, 29 Jan 2024 17:20:47 +0800 (CST)
>"Andy Yan" <andyshrk@163.com> wrote:
>
>> Hi Boris:
>> 
>> Thanks for you great work。
>> 
>> One thing please take note:
>> commit (arm64: dts: rockchip: rk3588: Add GPU nodes)  in [1] seems remove the "disabled" status 
>> of usb_host2_xhci, this may cause a boot issue on some boards that use combphy2_psu  phy for other functions.
>
>Oops, should be fixed in
>https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
>now.
>
>Thanks,
>
>Boris
Boris Brezillon Feb. 4, 2024, 10:07 a.m. UTC | #6
On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
"Andy Yan" <andyshrk@163.com> wrote:

> Hi Boris:
> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。

No it's not, and I didn't manage to reproduce locally. Looks like
you're using a 6.8 kernel, but my panthor-v4/next branches are still
based on drm-misc-next from 2 weeks ago, which was based on a 6.7
kernel. Can you share the kernel branch you're using?

> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
> [37743.040737] ------------[ cut here ]------------
> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041181] sp : ffff80008d37bcc0
> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
> [37743.041290] Call trace:
> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]

I checked the _pin/_unpin calls in panthor, and they seem to be
balanced (we take a ref when we allocate a gpuvm_bo and release it
when the gpuvm_bo is gone). I wonder if something else is calling
_pin_pages() or _get_pages() without holding a GEM ref...

While investigating I found a double-cleanup in the code (see below)
which explains why those memset(0) were required in
panthor_vm_cleanup_op_ctx()), but I doubt it fixes your issue.

--->8---
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index d3ce29cd0662..5606ab4d6289 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -1085,17 +1085,12 @@ static void panthor_vm_cleanup_op_ctx(struct panthor_vm_op_ctx *op_ctx,
        }
 
        kfree(op_ctx->rsvd_page_tables.pages);
-       memset(&op_ctx->rsvd_page_tables, 0, sizeof(op_ctx->rsvd_page_tables));
 
        if (op_ctx->map.vm_bo)
                panthor_vm_bo_put(op_ctx->map.vm_bo);
 
-       memset(&op_ctx->map, 0, sizeof(op_ctx->map));
-
-       for (u32 i = 0; i < ARRAY_SIZE(op_ctx->preallocated_vmas); i++) {
+       for (u32 i = 0; i < ARRAY_SIZE(op_ctx->preallocated_vmas); i++)
                kfree(op_ctx->preallocated_vmas[i]);
-               op_ctx->preallocated_vmas[i] = NULL;
-       }
 
        list_for_each_entry_safe(vma, tmp_vma, &op_ctx->returned_vmas, node) {
                list_del(&vma->node);
@@ -2382,7 +2377,6 @@ static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
        struct panthor_vm_bind_job *job =
                container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
 
-       panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
        panthor_vm_bind_job_put(&job->base);
 }
Andy Yan Feb. 4, 2024, 10:36 a.m. UTC | #7
Hi Boris:


在 2024-02-04 18:07:56,"Boris Brezillon" <boris.brezillon@collabora.com> 写道:
>On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
>"Andy Yan" <andyshrk@163.com> wrote:
>
>> Hi Boris:
>> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
>
>No it's not, and I didn't manage to reproduce locally. Looks like
>you're using a 6.8 kernel, but my panthor-v4/next branches are still
>based on drm-misc-next from 2 weeks ago, which was based on a 6.7
>kernel. Can you share the kernel branch you're using?
>

Here is my kernel branch:
https://github.com/andyshrk/linux/commits/linux-6.8-rc1-rk3588_panthor_v4/


>> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
>> [37743.040737] ------------[ cut here ]------------
>> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
>> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
>> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
>> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
>> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
>> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041181] sp : ffff80008d37bcc0
>> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
>> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
>> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
>> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
>> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
>> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
>> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
>> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
>> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
>> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
>> [37743.041290] Call trace:
>> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
>> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
>> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
>
>I checked the _pin/_unpin calls in panthor, and they seem to be
>balanced (we take a ref when we allocate a gpuvm_bo and release it
>when the gpuvm_bo is gone). I wonder if something else is calling
>_pin_pages() or _get_pages() without holding a GEM ref...
>
>While investigating I found a double-cleanup in the code (see below)
>which explains why those memset(0) were required in
>panthor_vm_cleanup_op_ctx()), but I doubt it fixes your issue.
>
>--->8---
>diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
>index d3ce29cd0662..5606ab4d6289 100644
>--- a/drivers/gpu/drm/panthor/panthor_mmu.c
>+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
>@@ -1085,17 +1085,12 @@ static void panthor_vm_cleanup_op_ctx(struct panthor_vm_op_ctx *op_ctx,
>        }
> 
>        kfree(op_ctx->rsvd_page_tables.pages);
>-       memset(&op_ctx->rsvd_page_tables, 0, sizeof(op_ctx->rsvd_page_tables));
> 
>        if (op_ctx->map.vm_bo)
>                panthor_vm_bo_put(op_ctx->map.vm_bo);
> 
>-       memset(&op_ctx->map, 0, sizeof(op_ctx->map));
>-
>-       for (u32 i = 0; i < ARRAY_SIZE(op_ctx->preallocated_vmas); i++) {
>+       for (u32 i = 0; i < ARRAY_SIZE(op_ctx->preallocated_vmas); i++)
>                kfree(op_ctx->preallocated_vmas[i]);
>-               op_ctx->preallocated_vmas[i] = NULL;
>-       }
> 
>        list_for_each_entry_safe(vma, tmp_vma, &op_ctx->returned_vmas, node) {
>                list_del(&vma->node);
>@@ -2382,7 +2377,6 @@ static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
>        struct panthor_vm_bind_job *job =
>                container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
> 
>-       panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
>        panthor_vm_bind_job_put(&job->base);
> }
Boris Brezillon Feb. 5, 2024, 9:03 a.m. UTC | #8
+Danilo for the panthor gpuvm-needs update.

On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
"Andy Yan" <andyshrk@163.com> wrote:

> Hi Boris:
> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
> [37743.040737] ------------[ cut here ]------------
> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041181] sp : ffff80008d37bcc0
> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
> [37743.041290] Call trace:
> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
> [37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
> [37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]

Ok, I think I found the culprit: there's a race between
the drm_gpuvm_bo_put() call in panthor_vm_bo_put() and the list
iteration done by drm_gpuvm_prepare_objects(). Because we're not
setting DRM_GPUVM_RESV_PROTECTED, the code goes through the 'lockless'
iteration loop, and takes/release a vm_bo ref at each iteration. This
means our 'were we the last vm_bo user?' test in panthor_vm_bo_put()
might return false even if we were actually the last user, and when
for_each_vm_bo_in_list() releases the ref it acquired, it not only leaks
the pin reference, thus leaving GEM pages pinned (which explains this
WARN_ON() splat), but it also calls drm_gpuvm_bo_destroy() in a path
where we don't hold the GPUVA list lock, which is bad.

Long story short, I'll have to use DRM_GPUVM_RESV_PROTECTED, which is
fine because I'm deferring vm_bo removal to a work where taking the VM
resv lock is allowed. Since I was the one asking for this lockless
iterator in the first place, I wonder if we should kill that and make
DRM_GPUVM_RESV_PROTECTED the default (this would greatly simplify
the code). AFAICT, The PowerVR driver shouldn't be impacted because it's
using drm_gpuvm in synchronous mode only, and Xe already uses the
resv-protected mode. That leaves Nouveau, but IIRC, it's also doing VM
updates in the ioctl path.

Danilo, any opinions?

Andy, I pushed a new version to the panthor-next [1] and
panthor-next+rk3588 [2] branches. The fix I'm talking about is [3], but
you probably want to consider taking all the fixups in your branch.

Regards,

Boris

[1]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next
[2]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
[3]https://gitlab.freedesktop.org/panfrost/linux/-/commit/df48c09662a403275e76e679ee004085badea7c1
Andy Yan Feb. 5, 2024, 9:41 a.m. UTC | #9
Hi Boris:
在 2024-02-05 17:03:21,"Boris Brezillon" <boris.brezillon@collabora.com> 写道:
>+Danilo for the panthor gpuvm-needs update.
>
>On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
>"Andy Yan" <andyshrk@163.com> wrote:
>
>> Hi Boris:
>> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
>> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
>> [37743.040737] ------------[ cut here ]------------
>> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
>> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
>> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
>> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
>> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
>> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041181] sp : ffff80008d37bcc0
>> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
>> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
>> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
>> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
>> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
>> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
>> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
>> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
>> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
>> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
>> [37743.041290] Call trace:
>> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
>> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
>> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
>> [37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
>> [37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]
>
>Ok, I think I found the culprit: there's a race between
>the drm_gpuvm_bo_put() call in panthor_vm_bo_put() and the list
>iteration done by drm_gpuvm_prepare_objects(). Because we're not
>setting DRM_GPUVM_RESV_PROTECTED, the code goes through the 'lockless'
>iteration loop, and takes/release a vm_bo ref at each iteration. This
>means our 'were we the last vm_bo user?' test in panthor_vm_bo_put()
>might return false even if we were actually the last user, and when
>for_each_vm_bo_in_list() releases the ref it acquired, it not only leaks
>the pin reference, thus leaving GEM pages pinned (which explains this
>WARN_ON() splat), but it also calls drm_gpuvm_bo_destroy() in a path
>where we don't hold the GPUVA list lock, which is bad.
>
>Long story short, I'll have to use DRM_GPUVM_RESV_PROTECTED, which is
>fine because I'm deferring vm_bo removal to a work where taking the VM
>resv lock is allowed. Since I was the one asking for this lockless
>iterator in the first place, I wonder if we should kill that and make
>DRM_GPUVM_RESV_PROTECTED the default (this would greatly simplify
>the code). AFAICT, The PowerVR driver shouldn't be impacted because it's
>using drm_gpuvm in synchronous mode only, and Xe already uses the
>resv-protected mode. That leaves Nouveau, but IIRC, it's also doing VM
>updates in the ioctl path.
>
>Danilo, any opinions?
>
>Andy, I pushed a new version to the panthor-next [1] and
>panthor-next+rk3588 [2] branches. The fix I'm talking about is [3], but
>you probably want to consider taking all the fixups in your branch.

Okay, I will have a try。
Thanks
>
>Regards,
>
>Boris
>
>[1]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next
>[2]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
>[3]https://gitlab.freedesktop.org/panfrost/linux/-/commit/df48c09662a403275e76e679ee004085badea7c1
Danilo Krummrich Feb. 5, 2024, 9:54 a.m. UTC | #10
On 2/5/24 10:03, Boris Brezillon wrote:
> +Danilo for the panthor gpuvm-needs update.
> 
> On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
> "Andy Yan" <andyshrk@163.com> wrote:
> 
>> Hi Boris:
>> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
>> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
>> [37743.040737] ------------[ cut here ]------------
>> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
>> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
>> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
>> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
>> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
>> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041181] sp : ffff80008d37bcc0
>> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
>> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
>> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
>> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
>> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
>> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
>> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
>> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
>> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
>> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
>> [37743.041290] Call trace:
>> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
>> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
>> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
>> [37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
>> [37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]
> 
> Ok, I think I found the culprit: there's a race between
> the drm_gpuvm_bo_put() call in panthor_vm_bo_put() and the list
> iteration done by drm_gpuvm_prepare_objects(). Because we're not
> setting DRM_GPUVM_RESV_PROTECTED, the code goes through the 'lockless'
> iteration loop, and takes/release a vm_bo ref at each iteration. This
> means our 'were we the last vm_bo user?' test in panthor_vm_bo_put()
> might return false even if we were actually the last user, and when
> for_each_vm_bo_in_list() releases the ref it acquired, it not only leaks
> the pin reference, thus leaving GEM pages pinned (which explains this
> WARN_ON() splat), but it also calls drm_gpuvm_bo_destroy() in a path
> where we don't hold the GPUVA list lock, which is bad.

IIUC, GPUVM generally behaves correctly. It's just that the return value
of drm_gpuvm_bo_put() needs to be treated with care.

Maybe we should extend c50a291d621a ("drm/gpuvm: Let drm_gpuvm_bo_put()
report when the vm_bo object is destroyed") by a note explaining this
unexpected case, or, if not required anymore, simply revert the patch.

> 
> Long story short, I'll have to use DRM_GPUVM_RESV_PROTECTED, which is
> fine because I'm deferring vm_bo removal to a work where taking the VM
> resv lock is allowed. Since I was the one asking for this lockless
> iterator in the first place, I wonder if we should kill that and make
> DRM_GPUVM_RESV_PROTECTED the default (this would greatly simplify
> the code). AFAICT, The PowerVR driver shouldn't be impacted because it's
> using drm_gpuvm in synchronous mode only, and Xe already uses the
> resv-protected mode. That leaves Nouveau, but IIRC, it's also doing VM
> updates in the ioctl path.
> 
> Danilo, any opinions?

I agree that we should remove it once we got enough evidence that updating
the VA space in the asynchronous path isn't going to be a thing. I'm not
entirely sure, whether we'll attempt to re-work Nouveau, but I'd like to try
this approach with Nova. Hence, I'd still like to leave it in for a while.

OOC, knowing that you went a little back and forth with one or the other
approach, last I heard was that using DRM_GPUVM_RESV_PROTECTED didn't work
out well with shmem buffers. How did you fix it?

> 
> Andy, I pushed a new version to the panthor-next [1] and
> panthor-next+rk3588 [2] branches. The fix I'm talking about is [3], but
> you probably want to consider taking all the fixups in your branch.
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next
> [2]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
> [3]https://gitlab.freedesktop.org/panfrost/linux/-/commit/df48c09662a403275e76e679ee004085badea7c1
>
Boris Brezillon Feb. 5, 2024, 10:31 a.m. UTC | #11
On Mon, 5 Feb 2024 10:54:05 +0100
Danilo Krummrich <dakr@redhat.com> wrote:

> On 2/5/24 10:03, Boris Brezillon wrote:
> > +Danilo for the panthor gpuvm-needs update.
> > 
> > On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
> > "Andy Yan" <andyshrk@163.com> wrote:
> >   
> >> Hi Boris:
> >> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
> >> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
> >> [37743.040737] ------------[ cut here ]------------
> >> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
> >> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> >> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
> >> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
> >> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
> >> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
> >> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> >> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> >> [37743.041181] sp : ffff80008d37bcc0
> >> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
> >> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
> >> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
> >> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
> >> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
> >> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
> >> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
> >> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
> >> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
> >> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
> >> [37743.041290] Call trace:
> >> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
> >> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
> >> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
> >> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
> >> [37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
> >> [37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]  
> > 
> > Ok, I think I found the culprit: there's a race between
> > the drm_gpuvm_bo_put() call in panthor_vm_bo_put() and the list
> > iteration done by drm_gpuvm_prepare_objects(). Because we're not
> > setting DRM_GPUVM_RESV_PROTECTED, the code goes through the 'lockless'
> > iteration loop, and takes/release a vm_bo ref at each iteration. This
> > means our 'were we the last vm_bo user?' test in panthor_vm_bo_put()
> > might return false even if we were actually the last user, and when
> > for_each_vm_bo_in_list() releases the ref it acquired, it not only leaks
> > the pin reference, thus leaving GEM pages pinned (which explains this
> > WARN_ON() splat), but it also calls drm_gpuvm_bo_destroy() in a path
> > where we don't hold the GPUVA list lock, which is bad.  
> 
> IIUC, GPUVM generally behaves correctly. It's just that the return value
> of drm_gpuvm_bo_put() needs to be treated with care.
> 
> Maybe we should extend c50a291d621a ("drm/gpuvm: Let drm_gpuvm_bo_put()
> report when the vm_bo object is destroyed") by a note explaining this
> unexpected case,

I can do that.

> or, if not required anymore, simply revert the patch.

Having drm_gpuvm_bo_put() reflect when it released the last ref is
definitely needed in panthor, it's just that it can't be trusted if
there's no lock protecting against concurrent gpuvm lockless list
iterations, because of the get/put dance that happens here.

> 
> > 
> > Long story short, I'll have to use DRM_GPUVM_RESV_PROTECTED, which is
> > fine because I'm deferring vm_bo removal to a work where taking the VM
> > resv lock is allowed. Since I was the one asking for this lockless
> > iterator in the first place, I wonder if we should kill that and make
> > DRM_GPUVM_RESV_PROTECTED the default (this would greatly simplify
> > the code). AFAICT, The PowerVR driver shouldn't be impacted because it's
> > using drm_gpuvm in synchronous mode only, and Xe already uses the
> > resv-protected mode. That leaves Nouveau, but IIRC, it's also doing VM
> > updates in the ioctl path.
> > 
> > Danilo, any opinions?  
> 
> I agree that we should remove it once we got enough evidence that updating
> the VA space in the asynchronous path isn't going to be a thing. I'm not
> entirely sure, whether we'll attempt to re-work Nouveau, but I'd like to try
> this approach with Nova. Hence, I'd still like to leave it in for a while.

Oh, that' has nothing to do with updating the VM RB tree in the async
path. I keep doing it this way in panthor if that's what you're worried
about. The only thing I defer to a workqueue is the gpuvm_bo_put() call,
because this one needs to do all sort of locking that are not allowed
in the dma-signalling path.

> 
> OOC, knowing that you went a little back and forth with one or the other
> approach, last I heard was that using DRM_GPUVM_RESV_PROTECTED didn't work
> out well with shmem buffers. How did you fix it?

The problem I had with the shmem GEM implementation was not related to
DRM_GPUVM_RESV_PROTECTED, it was related to the fact I wasn't able to
take an extra pages ref for the prev/next remap blocks in the async VM
operation path, because drm_gem_shmem_{get,pin}_pages take the BO resv
lock. The idea to solve that was to use atomic refcounts and avoid
taking the lock when refcnt > 1, which we did, but for a different
reason (not merged yet, but it's part of [1] which should land soon).

In the meantime, I realized the vm_bo concept could solve my problem.
If, instead of having each drm_gpuva own a pages ref, I make the
drm_gpuvm_bo owns this ref, I no longer need to call
drm_gem_shmem_{get,pin}_pages() in the async update path.
All I need to do is acquire the pages ref in the ioctl path, when
obtaining a vm_bo that didn't exist previously
(drm_gpuvm_bo_obtain_prealloc() allows us to figure out when this is a
new vm_bo vs when this is an pre-existing one). This is the very reason
I need to know when a vm_bo is destroyed, because when that happens, I
also need to release the pages ref owned by this vm_bo.

Regarding DRM_GPUVM_RESV_PROTECTED, I honestly don't remember where all
the churn around lockless list iteration came from, but I checked, and
DRM_GPUVM_RESV_PROTECTED only matters for:

- objects resv preparation, which we do with the VM resv lock held in
  the ioctl path anyway
- VM BO destruction (drm_gpuvm_bo_put() calls leading to a
  drm_gpuvm_bo_destroy()), which we now defer to a workqueue, where we
  can take the VM resv lock and the GPUVA list lock
- BO validation, which we don't use yet because we don't have
  mem-reclaim hooked up, but once we have it, it will happen in the
  ioctl path too

So, from panthor's perspective, we should be good without the
vm-resv-lockless stuff.

[1]https://lore.kernel.org/lkml/20240105184624.508603-1-dmitry.osipenko@collabora.com/T/
Andy Yan Feb. 6, 2024, 10:35 a.m. UTC | #12
HI Boris:

在 2024-02-05 17:03:21,"Boris Brezillon" <boris.brezillon@collabora.com> 写道:
>+Danilo for the panthor gpuvm-needs update.
>
>On Sun, 4 Feb 2024 09:14:44 +0800 (CST)
>"Andy Yan" <andyshrk@163.com> wrote:
>
>> Hi Boris:
>> I saw this warning sometimes(Run on a armbain based bookworm),not sure is a know issue or something else。
>> [15368.293031] systemd-journald[715]: Received client request to relinquish /var/log/journal/1bc4a340506142af9bd31a6a3d2170ba access.
>> [37743.040737] ------------[ cut here ]------------
>> [37743.040764] panthor fb000000.gpu: drm_WARN_ON(shmem->pages_use_count)
>> [37743.040890] WARNING: CPU: 2 PID: 5702 at drivers/gpu/drm/drm_gem_shmem_helper.c:158 drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.040929] Modules linked in: joydev rfkill sunrpc lz4hc lz4 zram binfmt_misc hantro_vpu crct10dif_ce v4l2_vp9 v4l2_h264 snd_soc_simple_amplifier v4l2_mem2mem videobuf2_dma_contig snd_soc_es8328_i2c videobuf2_memops rk_crypto2 snd_soc_es8328 videobuf2_v4l2 sm3_generic videodev crypto_engine sm3 rockchip_rng videobuf2_common nvmem_rockchip_otp snd_soc_rockchip_i2s_tdm snd_soc_hdmi_codec snd_soc_simple_card mc snd_soc_simple_card_utils snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore dm_mod ip_tables x_tables autofs4 dw_hdmi_qp_i2s_audio dw_hdmi_qp_cec rk808_regulator rockchipdrm dw_mipi_dsi dw_hdmi_qp dw_hdmi analogix_dp drm_dma_helper fusb302 display_connector rk8xx_spi drm_display_helper phy_rockchip_snps_pcie3 phy_rockchip_samsung_hdptx_hdmi panthor tcpm rk8xx_core cec drm_gpuvm gpu_sched drm_kms_helper drm_shmem_helper drm_exec r8169 drm pwm_bl adc_keys
>> [37743.041108] CPU: 2 PID: 5702 Comm: kworker/u16:8 Not tainted 6.8.0-rc1-edge-rockchip-rk3588 #2
>> [37743.041115] Hardware name: Rockchip RK3588 EVB1 V10 Board (DT)
>> [37743.041120] Workqueue: panthor-cleanup panthor_vm_bind_job_cleanup_op_ctx_work [panthor]
>> [37743.041151] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [37743.041157] pc : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041169] lr : drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041181] sp : ffff80008d37bcc0
>> [37743.041184] x29: ffff80008d37bcc0 x28: ffff800081d379c0 x27: ffff800081d37000
>> [37743.041196] x26: ffff00019909a280 x25: ffff00019909a2c0 x24: ffff0001017a4c05
>> [37743.041206] x23: dead000000000100 x22: dead000000000122 x21: ffff0001627ac1a0
>> [37743.041217] x20: 0000000000000000 x19: ffff0001627ac000 x18: 0000000000000000
>> [37743.041227] x17: 000000040044ffff x16: 005000f2b5503510 x15: fffffffffff91b77
>> [37743.041238] x14: 0000000000000001 x13: 00000000000003c5 x12: 00000000ffffffea
>> [37743.041248] x11: 00000000ffffdfff x10: 00000000ffffdfff x9 : ffff800081e0e818
>> [37743.041259] x8 : 000000000002ffe8 x7 : c0000000ffffdfff x6 : 00000000000affa8
>> [37743.041269] x5 : 0000000000001fff x4 : 0000000000000000 x3 : ffff8000819a6008
>> [37743.041279] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00018465e900
>> [37743.041290] Call trace:
>> [37743.041293]  drm_gem_shmem_free+0x144/0x14c [drm_shmem_helper]
>> [37743.041306]  panthor_gem_free_object+0x24/0xa0 [panthor]
>> [37743.041321]  drm_gem_object_free+0x1c/0x30 [drm]
>> [37743.041452]  panthor_vm_bo_put+0xc4/0x12c [panthor]
>> [37743.041475]  panthor_vm_cleanup_op_ctx.constprop.0+0xb0/0x104 [panthor]
>> [37743.041491]  panthor_vm_bind_job_cleanup_op_ctx_work+0x28/0xd0 [panthor]
>
>Ok, I think I found the culprit: there's a race between
>the drm_gpuvm_bo_put() call in panthor_vm_bo_put() and the list
>iteration done by drm_gpuvm_prepare_objects(). Because we're not
>setting DRM_GPUVM_RESV_PROTECTED, the code goes through the 'lockless'
>iteration loop, and takes/release a vm_bo ref at each iteration. This
>means our 'were we the last vm_bo user?' test in panthor_vm_bo_put()
>might return false even if we were actually the last user, and when
>for_each_vm_bo_in_list() releases the ref it acquired, it not only leaks
>the pin reference, thus leaving GEM pages pinned (which explains this
>WARN_ON() splat), but it also calls drm_gpuvm_bo_destroy() in a path
>where we don't hold the GPUVA list lock, which is bad.
>
>Long story short, I'll have to use DRM_GPUVM_RESV_PROTECTED, which is
>fine because I'm deferring vm_bo removal to a work where taking the VM
>resv lock is allowed. Since I was the one asking for this lockless
>iterator in the first place, I wonder if we should kill that and make
>DRM_GPUVM_RESV_PROTECTED the default (this would greatly simplify
>the code). AFAICT, The PowerVR driver shouldn't be impacted because it's
>using drm_gpuvm in synchronous mode only, and Xe already uses the
>resv-protected mode. That leaves Nouveau, but IIRC, it's also doing VM
>updates in the ioctl path.
>
>Danilo, any opinions?
>
>Andy, I pushed a new version to the panthor-next [1] and
>panthor-next+rk3588 [2] branches. The fix I'm talking about is [3], but
>you probably want to consider taking all the fixups in your branch.

>
Just for your info,
After taking all the fixups in my branch,It has been running for 24 hours, and so far, this warning hasn't shown up.
>Regards,
>
>Boris
>
>[1]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next
>[2]https://gitlab.freedesktop.org/panfrost/linux/-/commits/panthor-next+rk3588
>[3]https://gitlab.freedesktop.org/panfrost/linux/-/commit/df48c09662a403275e76e679ee004085badea7c1
Boris Brezillon Feb. 6, 2024, 11:11 a.m. UTC | #13
Hi Andy,

On Tue, 6 Feb 2024 18:35:36 +0800 (CST)
"Andy Yan" <andyshrk@163.com> wrote:

> >Andy, I pushed a new version to the panthor-next [1] and
> >panthor-next+rk3588 [2] branches. The fix I'm talking about is [3],
> >but you probably want to consider taking all the fixups in your
> >branch.  
> 
> >  
> Just for your info,
> After taking all the fixups in my branch,It has been running for 24
> hours, and so far, this warning hasn't shown up.

Thanks for the update, and thanks for reporting the problem in the
first place.

Regards,

Boris
Liviu Dudau Feb. 8, 2024, 4:02 p.m. UTC | #14
On Mon, Jan 22, 2024 at 05:30:31PM +0100, Boris Brezillon wrote:
> Hello,

Hi,

> 
> This is the 4th version of the kernel driver for Mali CSF-based GPUs.
> 
> A branch based on drm-misc-next and containing all the dependencies
> that are not yet available in drm-misc-next here[1], and another [2]
> containing extra patches to have things working on rk3588. The CSF
> firmware binary can be found here[3], and should be placed under
> /lib/firmware/arm/mali/arch10.8/mali_csffw.bin.

Since today the linux-firmware 'main' branch also contains an official
copy of the CSF firmware for the 10.8 architecture. Thanks to Mario for
the quick inclusion!

Best regards,
Liviu

> 
> The mesa MR adding v10 support on top of panthor is available here [4].
> 
> Steve, I intentionally dropped your R-b on "drm/panthor: Add the heap
> logical block" and "drm/panthor: Add the scheduler logical block"
> because the tiler-OOM handling changed enough to require a new review
> IMHO.
> 
> Regarding the GPL2+MIT relicensing, I collected Clément's R-b for the
> devfreq code, but am still lacking Alexey Sheplyakov for some bits in
> panthor_gpu.c. The rest of the code is either new, or covered by the
> Linaro, Arm and Collabora acks.
> 
> And here is a non-exhaustive changelog, check each commit for a detailed
> changelog.
> 
> v4:
> - Fix various bugs in the VM logic
> - Address comments from Steven, Liviu, Ketil and Chris
> - Move tiler OOM handling out of the scheduler interrupt handling path
>   so we can properly recover when the system runs out of memory, and
>   panthor is blocked trying to allocate heap chunks
> - Rework the heap locking to support concurrent chunk allocation. Not
>   sure if this is supposed to happen, but we need to be robust against
>   userspace passing the same heap context to two scheduling groups.
>   Wasn't needed before the tiler_oom rework, because heap allocation
>   base serialized by the scheduler lock.
> - Make kernel BO destruction robust to NULL/ERR pointers
> 
> v3;
> - Quite a few changes at the MMU/sched level to make the fix some
>   race conditions and deadlocks
> - Addition of the a sync-only VM_BIND operation (to support
>   vkQueueSparseBind with zero commands).
> - Addition of a VM_GET_STATE ioctl
> - Various cosmetic changes (see the commit changelogs for more details)
> - Various fixes (see the commit changelogs for more details)
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Split the commit adding the driver to ease review
> - Use drm_sched for dependency tracking/job submission
> - Add a VM_BIND ioctl
> - Add the concept of exclusive VM for BOs that are only ever mapped to a
>   single VM
> - Document the code and uAPI
> - Add a DT binding doc
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4
> [2]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v4+rk3588
> [3]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
> [4]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26358
> 
> Boris Brezillon (13):
>   drm/panthor: Add uAPI
>   drm/panthor: Add GPU register definitions
>   drm/panthor: Add the device logical block
>   drm/panthor: Add the GPU logical block
>   drm/panthor: Add GEM logical block
>   drm/panthor: Add the devfreq logical block
>   drm/panthor: Add the MMU/VM logical block
>   drm/panthor: Add the FW logical block
>   drm/panthor: Add the heap logical block
>   drm/panthor: Add the scheduler logical block
>   drm/panthor: Add the driver frontend block
>   drm/panthor: Allow driver compilation
>   drm/panthor: Add an entry to MAINTAINERS
> 
> Liviu Dudau (1):
>   dt-bindings: gpu: mali-valhall-csf: Add support for Arm Mali CSF GPUs
> 
>  .../bindings/gpu/arm,mali-valhall-csf.yaml    |  147 +
>  Documentation/gpu/driver-uapi.rst             |    5 +
>  MAINTAINERS                                   |   11 +
>  drivers/gpu/drm/Kconfig                       |    2 +
>  drivers/gpu/drm/Makefile                      |    1 +
>  drivers/gpu/drm/panthor/Kconfig               |   23 +
>  drivers/gpu/drm/panthor/Makefile              |   15 +
>  drivers/gpu/drm/panthor/panthor_devfreq.c     |  283 ++
>  drivers/gpu/drm/panthor/panthor_devfreq.h     |   25 +
>  drivers/gpu/drm/panthor/panthor_device.c      |  544 +++
>  drivers/gpu/drm/panthor/panthor_device.h      |  393 ++
>  drivers/gpu/drm/panthor/panthor_drv.c         | 1470 +++++++
>  drivers/gpu/drm/panthor/panthor_fw.c          | 1334 +++++++
>  drivers/gpu/drm/panthor/panthor_fw.h          |  504 +++
>  drivers/gpu/drm/panthor/panthor_gem.c         |  228 ++
>  drivers/gpu/drm/panthor/panthor_gem.h         |  144 +
>  drivers/gpu/drm/panthor/panthor_gpu.c         |  482 +++
>  drivers/gpu/drm/panthor/panthor_gpu.h         |   52 +
>  drivers/gpu/drm/panthor/panthor_heap.c        |  596 +++
>  drivers/gpu/drm/panthor/panthor_heap.h        |   39 +
>  drivers/gpu/drm/panthor/panthor_mmu.c         | 2760 +++++++++++++
>  drivers/gpu/drm/panthor/panthor_mmu.h         |  102 +
>  drivers/gpu/drm/panthor/panthor_regs.h        |  239 ++
>  drivers/gpu/drm/panthor/panthor_sched.c       | 3500 +++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_sched.h       |   48 +
>  include/uapi/drm/panthor_drm.h                |  945 +++++
>  26 files changed, 13892 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
>  create mode 100644 drivers/gpu/drm/panthor/Kconfig
>  create mode 100644 drivers/gpu/drm/panthor/Makefile
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
>  create mode 100644 include/uapi/drm/panthor_drm.h
> 
> -- 
> 2.43.0
>