
[0/2] Lima DRM driver

Message ID 20190206131457.1072-1-yuq825@gmail.com (mailing list archive)

Message

Qiang Yu Feb. 6, 2019, 1:14 p.m. UTC
Kernel DRM driver for ARM Mali 400/450 GPUs.

Since the last RFC, all feedback has been addressed. Most Mali DTS
changes have already been upstreamed by SoC maintainers. The kernel
driver and user-kernel interface have been quite stable for several
months, so I think it's ready to be upstreamed.

This implementation mainly takes the amdgpu DRM driver as a reference.

- Mali 4xx GPUs have two kinds of processors: GP and PP. GP is for
  OpenGL vertex shader processing and PP is for fragment shader
  processing. Each processor has its own MMU, so processors work in
  their own virtual address spaces.
- There's only one GP but multiple PPs (up to 4 for Mali 400 and 8
  for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
  together to handle a single fragment shader task, which is divided
  by tiles of the FB output. The Mali 400 user space driver is
  responsible for assigning target tiles to each PP, but Mali
  450 has a HW module called the DLBU to dynamically balance each
  PP's load.
- The user space driver allocates buffer objects and maps them into
  the GPU virtual address space, uploads the command stream and draw
  data through a CPU mmap of the buffer object, then submits the
  task to the GP/PP with a register frame indicating where the
  command stream is, plus misc settings.
- There's no command stream validation/relocation because each user
  process has its own GPU virtual address space. The GP/PP MMUs
  switch virtual address spaces before running two tasks from
  different user processes. Buggy or malicious user space code just
  gets an MMU fault or a GP/PP error IRQ, after which the HW/SW is
  recovered.
- Use TTM as MM. TTM_PL_TT type memory is used as the content of
  lima buffer objects, which are allocated from the TTM page pool.
  All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
  allocation time, so there is no buffer eviction or swap for now.
- Use drm_sched for GPU task scheduling. Each OpenGL context should
  have a lima context object in the kernel to distinguish tasks
  from different users. drm_sched picks tasks from each lima
  context in a fair way.

This patch series is based on 5.0-rc5 and squashes all the commits.
For the whole history of this driver's development, see:
https://gitlab.freedesktop.org/lima/linux/commits/lima-5.0-rc5
https://gitlab.freedesktop.org/lima/linux/commits/lima-4.17-rc4

The Mesa driver is still in development and not ready for daily usage,
but it can run some simple tests like kmscube and glmark2, and some
single full-screen applications like kodi-gbm, see:
https://gitlab.freedesktop.org/lima/mesa

[rfc]
https://lists.freedesktop.org/archives/dri-devel/2018-May/177314.html

Lima Project Developers (1):
  drm/lima: driver for ARM Mali4xx GPUs

Qiang Yu (1):
  drm/fourcc: add ARM tiled format modifier

 drivers/gpu/drm/Kconfig               |   2 +
 drivers/gpu/drm/Makefile              |   1 +
 drivers/gpu/drm/lima/Kconfig          |  10 +
 drivers/gpu/drm/lima/Makefile         |  22 ++
 drivers/gpu/drm/lima/lima_bcast.c     |  46 +++
 drivers/gpu/drm/lima/lima_bcast.h     |  14 +
 drivers/gpu/drm/lima/lima_ctx.c       | 124 +++++++
 drivers/gpu/drm/lima/lima_ctx.h       |  33 ++
 drivers/gpu/drm/lima/lima_device.c    | 384 ++++++++++++++++++++
 drivers/gpu/drm/lima/lima_device.h    | 116 ++++++
 drivers/gpu/drm/lima/lima_dlbu.c      |  56 +++
 drivers/gpu/drm/lima/lima_dlbu.h      |  18 +
 drivers/gpu/drm/lima/lima_drv.c       | 459 ++++++++++++++++++++++++
 drivers/gpu/drm/lima/lima_drv.h       |  59 ++++
 drivers/gpu/drm/lima/lima_gem.c       | 485 +++++++++++++++++++++++++
 drivers/gpu/drm/lima/lima_gem.h       |  25 ++
 drivers/gpu/drm/lima/lima_gem_prime.c | 144 ++++++++
 drivers/gpu/drm/lima/lima_gem_prime.h |  18 +
 drivers/gpu/drm/lima/lima_gp.c        | 280 +++++++++++++++
 drivers/gpu/drm/lima/lima_gp.h        |  16 +
 drivers/gpu/drm/lima/lima_l2_cache.c  |  79 +++++
 drivers/gpu/drm/lima/lima_l2_cache.h  |  14 +
 drivers/gpu/drm/lima/lima_mmu.c       | 135 +++++++
 drivers/gpu/drm/lima/lima_mmu.h       |  16 +
 drivers/gpu/drm/lima/lima_object.c    | 103 ++++++
 drivers/gpu/drm/lima/lima_object.h    |  72 ++++
 drivers/gpu/drm/lima/lima_pmu.c       |  61 ++++
 drivers/gpu/drm/lima/lima_pmu.h       |  12 +
 drivers/gpu/drm/lima/lima_pp.c        | 419 ++++++++++++++++++++++
 drivers/gpu/drm/lima/lima_pp.h        |  19 +
 drivers/gpu/drm/lima/lima_regs.h      | 298 ++++++++++++++++
 drivers/gpu/drm/lima/lima_sched.c     | 486 ++++++++++++++++++++++++++
 drivers/gpu/drm/lima/lima_sched.h     | 108 ++++++
 drivers/gpu/drm/lima/lima_ttm.c       | 319 +++++++++++++++++
 drivers/gpu/drm/lima/lima_ttm.h       |  24 ++
 drivers/gpu/drm/lima/lima_vm.c        | 354 +++++++++++++++++++
 drivers/gpu/drm/lima/lima_vm.h        |  59 ++++
 include/uapi/drm/drm_fourcc.h         |   9 +
 include/uapi/drm/lima_drm.h           | 193 ++++++++++
 39 files changed, 5092 insertions(+)
 create mode 100644 drivers/gpu/drm/lima/Kconfig
 create mode 100644 drivers/gpu/drm/lima/Makefile
 create mode 100644 drivers/gpu/drm/lima/lima_bcast.c
 create mode 100644 drivers/gpu/drm/lima/lima_bcast.h
 create mode 100644 drivers/gpu/drm/lima/lima_ctx.c
 create mode 100644 drivers/gpu/drm/lima/lima_ctx.h
 create mode 100644 drivers/gpu/drm/lima/lima_device.c
 create mode 100644 drivers/gpu/drm/lima/lima_device.h
 create mode 100644 drivers/gpu/drm/lima/lima_dlbu.c
 create mode 100644 drivers/gpu/drm/lima/lima_dlbu.h
 create mode 100644 drivers/gpu/drm/lima/lima_drv.c
 create mode 100644 drivers/gpu/drm/lima/lima_drv.h
 create mode 100644 drivers/gpu/drm/lima/lima_gem.c
 create mode 100644 drivers/gpu/drm/lima/lima_gem.h
 create mode 100644 drivers/gpu/drm/lima/lima_gem_prime.c
 create mode 100644 drivers/gpu/drm/lima/lima_gem_prime.h
 create mode 100644 drivers/gpu/drm/lima/lima_gp.c
 create mode 100644 drivers/gpu/drm/lima/lima_gp.h
 create mode 100644 drivers/gpu/drm/lima/lima_l2_cache.c
 create mode 100644 drivers/gpu/drm/lima/lima_l2_cache.h
 create mode 100644 drivers/gpu/drm/lima/lima_mmu.c
 create mode 100644 drivers/gpu/drm/lima/lima_mmu.h
 create mode 100644 drivers/gpu/drm/lima/lima_object.c
 create mode 100644 drivers/gpu/drm/lima/lima_object.h
 create mode 100644 drivers/gpu/drm/lima/lima_pmu.c
 create mode 100644 drivers/gpu/drm/lima/lima_pmu.h
 create mode 100644 drivers/gpu/drm/lima/lima_pp.c
 create mode 100644 drivers/gpu/drm/lima/lima_pp.h
 create mode 100644 drivers/gpu/drm/lima/lima_regs.h
 create mode 100644 drivers/gpu/drm/lima/lima_sched.c
 create mode 100644 drivers/gpu/drm/lima/lima_sched.h
 create mode 100644 drivers/gpu/drm/lima/lima_ttm.c
 create mode 100644 drivers/gpu/drm/lima/lima_ttm.h
 create mode 100644 drivers/gpu/drm/lima/lima_vm.c
 create mode 100644 drivers/gpu/drm/lima/lima_vm.h
 create mode 100644 include/uapi/drm/lima_drm.h

Comments

Daniel Vetter Feb. 7, 2019, 9:09 a.m. UTC | #1
On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> Kernel DRM driver for ARM Mali 400/450 GPUs.
> [snip]
> - Use TTM as MM. TTM_PL_TT type memory is used as the content of
>   lima buffer objects, which are allocated from the TTM page pool.
>   All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
>   allocation time, so there is no buffer eviction or swap for now.

All other render gpu drivers that have unified memory (aka is on the SoC)
use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
(and i915 is kinda the same too really). TTM makes sense if you have some
discrete memory to manage, but imo not in any other place really.

What's the design choice behind this?

From an upstream pov having all soc gpu drivers use similar approaches
should help with shared infrastructure and stuff like that.

Another one: What's the plan with extending this to panfrost? Or are the
architectures for command submission totally different, and we'll need
separate kernel drivers for utgard/midgard/bifrost?

Thanks, Daniel

> [rest of quoted cover letter and diffstat snipped]
Christian König Feb. 7, 2019, 9:39 a.m. UTC | #2
Am 07.02.19 um 10:09 schrieb Daniel Vetter:
> On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
>> Kernel DRM driver for ARM Mali 400/450 GPUs.
>> [snip]
>> - Use TTM as MM. TTM_PL_TT type memory is used as the content of
>>    lima buffer objects, which are allocated from the TTM page pool.
>>    All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
>>    allocation time, so there is no buffer eviction or swap for now.
> All other render gpu drivers that have unified memory (aka is on the SoC)
> use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> (and i915 is kinda the same too really). TTM makes sense if you have some
> discrete memory to manage, but imo not in any other place really.
>
> What's the design choice behind this?

Agree that this seems unnecessary complicated.

In addition to that, why do you use TTM_PL_FLAG_NO_EVICT? That is a
serious show stopper and, as far as I can see offhand, completely unnecessary.

Christian.

> [rest of quoted text and diffstat snipped]
Qiang Yu Feb. 7, 2019, 3:21 p.m. UTC | #3
On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > [snip]
> > - Use TTM as MM. TTM_PL_TT type memory is used as the content of
> >   lima buffer objects, which are allocated from the TTM page pool.
> >   All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> >   allocation time, so there is no buffer eviction or swap for now.
>
> All other render gpu drivers that have unified memory (aka is on the SoC)
> use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> (and i915 is kinda the same too really). TTM makes sense if you have some
> discrete memory to manage, but imo not in any other place really.
>
> What's the design choice behind this?
To be honest, it's just because TTM offers more helpers. I did implement
a GEM way with cma alloc at the beginning. But when implementing paged mem,
I found TTM has mem pool alloc, sync and mmap related helpers which cover
much of my existing code. It's totally possible with GEM, but not as easy as
TTM for me. And virtio-gpu seems to be an example of using TTM without discrete
mem. Shouldn't TTM be a superset of both unified mem and discrete mem?

>
> From an upstream pov having all soc gpu drivers use similar approaches
> should help with shared infrastructure and stuff like that.
Hope GEM gets more helpers now if lima has to use it.

>
> Another one: What's the plan with extending this to panfrost? Or are the
> architectures for command submission totally different, and we'll need
> separate kernel drivers for utgard/midgard/bifrost?
+ Alyssa & Rob
There is a gitlab issue about sharing kernel driver:
https://gitlab.freedesktop.org/panfrost/linux/issues/1

But it seems utgard (non-unified shader arch) really differs from midgard/bifrost
(unified shader arch), so ARM also has different official kernel drivers.

Thanks,
Qiang

> [rest of quoted text and diffstat snipped]
Qiang Yu Feb. 7, 2019, 3:33 p.m. UTC | #4
On Thu, Feb 7, 2019 at 5:39 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 07.02.19 um 10:09 schrieb Daniel Vetter:
> > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> >> Kernel DRM driver for ARM Mali 400/450 GPUs.
> >>
> >> Since the last RFC, all feedback has been addressed. Most Mali DTS
> >> changes are already upstreamed by SoC maintainers. The kernel
> >> driver and user-kernel interface have been quite stable for several
> >> months, so I think it's ready to be upstreamed.
> >>
> >> This implementation mainly takes the amdgpu DRM driver as a reference.
> >>
> >> - Mali 4xx GPUs have two kinds of processors, GP and PP. GP is for
> >>    OpenGL vertex shader processing and PP is for fragment shader
> >>    processing. Each processor has its own MMU, so processors work in
> >>    virtual address spaces.
> >> - There's only one GP but multiple PPs (max 4 for Mali 400 and 8
> >>    for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
> >>    together to handle a single fragment shader task divided by
> >>    FB output tiled pixels. The Mali 400 user space driver is
> >>    responsible for assigning target tiled pixels to each PP, but
> >>    Mali 450 has a HW module called the DLBU to dynamically balance
> >>    each PP's load.
> >> - The user space driver allocates buffer objects and maps them into
> >>    the GPU virtual address space, uploads the command stream and draw
> >>    data with a CPU mmap of the buffer object, then submits a task to
> >>    GP/PP with a register frame indicating where the command stream is
> >>    and misc settings.
> >> - There's no command stream validation/relocation because each user
> >>    process has its own GPU virtual address space. The GP/PP's MMU
> >>    switches virtual address spaces before running two tasks from
> >>    different user processes. Erroneous or malicious user space code
> >>    just gets an MMU fault or a GP/PP error IRQ, after which the HW/SW
> >>    is recovered.
> >> - Use TTM as the MM. TTM_PL_TT type memory is used as the content of
> >>    lima buffer objects, which are allocated from the TTM page pool.
> >>    All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> >>    allocation, so there's no buffer eviction and swap for now.
> > All other render gpu drivers that have unified memory (aka is on the SoC)
> > use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> > (and i915 is kinda the same too really). TTM makes sense if you have some
> > discrete memory to manage, but imo not in any other place really.
> >
> > What's the design choice behind this?
>
> Agree that this seems unnecessarily complicated.
>
> In addition to that, why do you use TTM_PL_FLAG_NO_EVICT? That is a
> serious show stopper and, as far as I can see offhand, completely unnecessary.
Just for simplicity. There's no eviction for unified memory, but swap is
introduced when this flag is not set, so I would have to do VM table
clear/restore and call BO validation, which I plan to implement in the future.
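For context, the pinning under discussion looks roughly like the following. This is a non-buildable, illustrative sketch against the TTM API of this era; the lima_* names are assumptions for illustration, not the actual patch code:

```c
/* Illustrative only: place every BO in TTM_PL_TT and mark it
 * TTM_PL_FLAG_NO_EVICT at creation, so TTM never evicts or swaps it. */
static void lima_bo_placement_pinned(struct lima_bo *bo)
{
	bo->place.fpfn = 0;
	bo->place.lpfn = 0;
	bo->place.flags = TTM_PL_FLAG_TT | TTM_PL_FLAG_CACHED |
			  TTM_PL_FLAG_NO_EVICT;

	bo->placement.placement = &bo->place;
	bo->placement.num_placement = 1;
	bo->placement.busy_placement = &bo->place;
	bo->placement.num_busy_placement = 1;
}
```

Dropping TTM_PL_FLAG_NO_EVICT from the placement is what would allow TTM to swap pages out, which is why the VM table clear/restore and BO validation work mentioned above becomes necessary.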

Thanks,
Qiang

>
> Christian.
>
> >
> >  From an upstream pov having all soc gpu drivers use similar approaches
> > should help with shared infrastructure and stuff like that.
> >
> > Another one: What's the plan with extending this to panfrost? Or are the
> > architectures for command submission totally different, and we'll need
> > separate kernel drivers for utgard/midgard/bifrost?
> >
> > Thanks, Daniel
> >
> >> - Use drm_sched for GPU task scheduling. Each OpenGL context should
> >>    have a lima context object in the kernel to distinguish tasks
> >>    from different users. drm_sched gets tasks from each lima context
> >>    in a fair way.
> >>
> >> This patch series is based on 5.0-rc5 and squashes all the commits.
> >> For the whole history of this driver's development, see:
> >> https://gitlab.freedesktop.org/lima/linux/commits/lima-5.0-rc5
> >> https://gitlab.freedesktop.org/lima/linux/commits/lima-4.17-rc4
> >>
> >> The Mesa driver is still in development and not ready for daily usage,
> >> but it can run some simple tests like kmscube and glmark2, and some
> >> single full screen applications like kodi-gbm, see:
> >> https://gitlab.freedesktop.org/lima/mesa
> >>
> >> [rfc]
> >> https://lists.freedesktop.org/archives/dri-devel/2018-May/177314.html
> >>
> >> Lima Project Developers (1):
> >>    drm/lima: driver for ARM Mali4xx GPUs
> >>
> >> Qiang Yu (1):
> >>    drm/fourcc: add ARM tiled format modifier
> >>
> >>   drivers/gpu/drm/Kconfig               |   2 +
> >>   drivers/gpu/drm/Makefile              |   1 +
> >>   drivers/gpu/drm/lima/Kconfig          |  10 +
> >>   drivers/gpu/drm/lima/Makefile         |  22 ++
> >>   drivers/gpu/drm/lima/lima_bcast.c     |  46 +++
> >>   drivers/gpu/drm/lima/lima_bcast.h     |  14 +
> >>   drivers/gpu/drm/lima/lima_ctx.c       | 124 +++++++
> >>   drivers/gpu/drm/lima/lima_ctx.h       |  33 ++
> >>   drivers/gpu/drm/lima/lima_device.c    | 384 ++++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_device.h    | 116 ++++++
> >>   drivers/gpu/drm/lima/lima_dlbu.c      |  56 +++
> >>   drivers/gpu/drm/lima/lima_dlbu.h      |  18 +
> >>   drivers/gpu/drm/lima/lima_drv.c       | 459 ++++++++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_drv.h       |  59 ++++
> >>   drivers/gpu/drm/lima/lima_gem.c       | 485 +++++++++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_gem.h       |  25 ++
> >>   drivers/gpu/drm/lima/lima_gem_prime.c | 144 ++++++++
> >>   drivers/gpu/drm/lima/lima_gem_prime.h |  18 +
> >>   drivers/gpu/drm/lima/lima_gp.c        | 280 +++++++++++++++
> >>   drivers/gpu/drm/lima/lima_gp.h        |  16 +
> >>   drivers/gpu/drm/lima/lima_l2_cache.c  |  79 +++++
> >>   drivers/gpu/drm/lima/lima_l2_cache.h  |  14 +
> >>   drivers/gpu/drm/lima/lima_mmu.c       | 135 +++++++
> >>   drivers/gpu/drm/lima/lima_mmu.h       |  16 +
> >>   drivers/gpu/drm/lima/lima_object.c    | 103 ++++++
> >>   drivers/gpu/drm/lima/lima_object.h    |  72 ++++
> >>   drivers/gpu/drm/lima/lima_pmu.c       |  61 ++++
> >>   drivers/gpu/drm/lima/lima_pmu.h       |  12 +
> >>   drivers/gpu/drm/lima/lima_pp.c        | 419 ++++++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_pp.h        |  19 +
> >>   drivers/gpu/drm/lima/lima_regs.h      | 298 ++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_sched.c     | 486 ++++++++++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_sched.h     | 108 ++++++
> >>   drivers/gpu/drm/lima/lima_ttm.c       | 319 +++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_ttm.h       |  24 ++
> >>   drivers/gpu/drm/lima/lima_vm.c        | 354 +++++++++++++++++++
> >>   drivers/gpu/drm/lima/lima_vm.h        |  59 ++++
> >>   include/uapi/drm/drm_fourcc.h         |   9 +
> >>   include/uapi/drm/lima_drm.h           | 193 ++++++++++
> >>   39 files changed, 5092 insertions(+)
> >> [...]
Alyssa Rosenzweig Feb. 7, 2019, 3:44 p.m. UTC | #5
> > Another one: What's the plan with extending this to panfrost? Or are the
> > architectures for command submission totally different, and we'll need
> > separate kernel drivers for utgard/midgard/bifrost?
> + Alyssa & Rob
> There is a gitlab issue about sharing kernel driver:
> https://gitlab.freedesktop.org/panfrost/linux/issues/1
> 
> But it seems Utgard (separate shader arch) really differs from Midgard/Bifrost
> (unified shader arch), which is why ARM also has different official kernel drivers.

We're operating on the premise that Utgard and Midgard/Bifrost are two
completely separate architectures. They share a name, but not much else in
practice. Separate shaders vs unified shaders is one distinction;
command streams vs state descriptors is another huge one (as far as
userspace is concerned); etc. As Qiang pointed out, Arm has two (kernel)
drivers, one for Utgard, one for Midgard.

Beyond a few isolated routines (e.g. texture tiling), IMO sharing between
lima and panfrost makes about as much sense as sharing between lima and
etnaviv/vc4/freedreno. Yeah, there are similarities, but that's what
common code is for, not for forcing code cohabitation.

-Alyssa
Daniel Vetter Feb. 7, 2019, 3:51 p.m. UTC | #6
On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
> On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > > [...]
> > > - Use TTM as the MM. TTM_PL_TT type memory is used as the content of
> > >   lima buffer objects, which are allocated from the TTM page pool.
> > >   All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> > >   allocation, so there's no buffer eviction and swap for now.
> >
> > All other render gpu drivers that have unified memory (aka is on the SoC)
> > use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> > (and i915 is kinda the same too really). TTM makes sense if you have some
> > discrete memory to manage, but imo not in any other place really.
> >
> > What's the design choice behind this?
> > To be honest, it's just because TTM offers more helpers. I did implement
> > a GEM path with CMA alloc at the beginning, but when implementing paged
> > memory I found TTM has mem pool alloc, sync and mmap related helpers
> > which cover much of my existing code. It's totally possible with GEM,
> > just not as easy as TTM for me. And virtio-gpu seems to be an example of
> > using TTM without discrete mem. Shouldn't TTM be a superset of both
> > unified mem and discrete mem?

virtio does have fake vram and migration afaiui. And sure, you can use TTM
without the vram migration; it's just that most of the complexity of TTM
is due to buffer placement and migration and all that stuff. If you never
need to move buffers, then you don't need any of that.

Wrt the lack of helpers, what exactly are you looking for? A big part of
the issue for TTM is that TTM is a bit of a midlayer, so it reinvents a
bunch of things provided by e.g. the dma-api. It's cleaner to use the
dma-api directly. Basing the lima kernel driver on vc4, freedreno or
etnaviv (the last one is probably closest, since it doesn't have a display
block either) would be better I think.
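As a rough illustration of the GEM-direct approach (a hand-written sketch, not code from msm/etnaviv/vc4; the soc_* names are assumptions): the object's backing store comes straight from shmem, and the dma-api does the device mapping and cache maintenance that TTM would otherwise reimplement.

```c
/* Sketch: back a GEM object with shmem pages and map them for the GPU
 * through the dma-api, with no TTM involved. */
static int soc_gem_pin_pages(struct drm_device *drm, struct soc_gem_object *obj)
{
	struct page **pages;
	struct sg_table *sgt;

	pages = drm_gem_get_pages(&obj->base);	/* shmem-backed pages */
	if (IS_ERR(pages))
		return PTR_ERR(pages);

	sgt = drm_prime_pages_to_sg(pages, obj->base.size >> PAGE_SHIFT);
	if (IS_ERR(sgt)) {
		drm_gem_put_pages(&obj->base, pages, false, false);
		return PTR_ERR(sgt);
	}

	/* The dma-api handles cache maintenance and any IOMMU mapping. */
	if (!dma_map_sg(drm->dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
		sg_free_table(sgt);
		kfree(sgt);
		drm_gem_put_pages(&obj->base, pages, false, false);
		return -ENOMEM;
	}

	obj->pages = pages;
	obj->sgt = sgt;
	return 0;
}
```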

> > From an upstream pov having all soc gpu drivers use similar approaches
> > should help with shared infrastructure and stuff like that.
> > Hope GEM gets more helpers now if lima has to use it.
> 
> >
> > Another one: What's the plan with extending this to panfrost? Or are the
> > architectures for command submission totally different, and we'll need
> > separate kernel drivers for utgard/midgard/bifrost?
> + Alyssa & Rob
> There is a gitlab issue about sharing kernel driver:
> https://gitlab.freedesktop.org/panfrost/linux/issues/1
> 
> > But it seems Utgard (separate shader arch) really differs from Midgard/Bifrost
> > (unified shader arch), which is why ARM also has different official kernel drivers.

Ok, makes sense, just wanted to check.
-Daniel

Christian König Feb. 7, 2019, 7:14 p.m. UTC | #7
Am 07.02.19 um 16:33 schrieb Qiang Yu:
> On Thu, Feb 7, 2019 at 5:39 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 07.02.19 um 10:09 schrieb Daniel Vetter:
>>> On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
>>>> Kernel DRM driver for ARM Mali 400/450 GPUs.
>>>> [...]
>>>> - Use TTM as the MM. TTM_PL_TT type memory is used as the content of
>>>>     lima buffer objects, which are allocated from the TTM page pool.
>>>>     All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
>>>>     allocation, so there's no buffer eviction and swap for now.
>>> All other render gpu drivers that have unified memory (aka is on the SoC)
>>> use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
>>> (and i915 is kinda the same too really). TTM makes sense if you have some
>>> discrete memory to manage, but imo not in any other place really.
>>>
>>> What's the design choice behind this?
>> Agree that this seems unnecessarily complicated.
>>
>> In addition to that, why do you use TTM_PL_FLAG_NO_EVICT? That is a
>> serious show stopper and, as far as I can see offhand, completely unnecessary.
> Just for simplicity. There's no eviction for unified memory, but swap is
> introduced when this flag is not set, so I would have to do VM table
> clear/restore and call BO validation, which I plan to implement in the future.

Ok, well whether you use GEM or GEM+TTM is up to you. I just think
using GEM directly like Daniel suggests would be simpler in the long term.

But support for eviction is a serious prerequisite for allowing this
upstream. So that is really something you need to clean up first.
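Roughly, supporting eviction here means letting TTM move or swap BOs and tearing down the GPU VM mappings when it does — sketched below against the TTM hooks of this era, with the lima_* names and the VM helper assumed for illustration (this is not code from the patch):

```c
/* Sketch: when TTM moves (or swaps out) a BO, invalidate every GPU VM
 * mapping pointing at it; the submit path must then re-validate the BO
 * (give it a valid placement and re-map it) before the next job runs. */
static void lima_bo_move_notify(struct ttm_buffer_object *tbo, bool evict,
				struct ttm_mem_reg *new_mem)
{
	struct lima_bo *bo = to_lima_bo(tbo);

	lima_vm_bo_unmap_all(bo);	/* assumed helper: clear PTEs, flush TLB */
}
```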

Regards,
Christian.

Rob Herring Feb. 11, 2019, 6:11 p.m. UTC | #8
On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
> > On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > > > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > > > [...]
> > > > - Use TTM as the MM. TTM_PL_TT type memory is used as the content of
> > > >   lima buffer objects, which are allocated from the TTM page pool.
> > > >   All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> > > >   allocation, so there's no buffer eviction and swap for now.
> > >
> > > All other render GPU drivers that have unified memory (i.e. are on the
> > > SoC) use GEM directly, with some of the helpers we have: msm, etnaviv,
> > > vc4 (and i915 is kinda the same too really). TTM makes sense if you
> > > have some discrete memory to manage, but imo not in any other place
> > > really.
> > >
> > > What's the design choice behind this?
> > To be honest, it's just because TTM offers more helpers. I did implement
> > a GEM path with CMA alloc at the beginning. But when implementing paged
> > mem, I found TTM has mem pool alloc, sync and mmap related helpers which
> > cover much of my existing code. It's totally possible with GEM, but not
> > as easy as TTM to me. And virtio-gpu seems an example of using TTM
> > without discrete mem. Shouldn't TTM be a superset of both unified mem
> > and discrete mem?
>
> virtio does have fake vram and migration afaiui. And sure, you can use TTM
> without the vram migration, it's just that most of the complexity of TTM
> is due to buffer placement and migration and all that stuff. If you never
> need to move buffers, then you don't need that ever.
>
> Wrt lack of helpers, what exactly are you looking for? A big part of
> these for TTM is that TTM is a bit of a midlayer, so it reinvents a bunch
> of things provided by e.g. the dma-api. It's cleaner to use the dma-api
> directly. Basing the lima kernel driver on vc4, freedreno or etnaviv (the
> last one is probably closest, since it doesn't have a display block
> either) would be better I think.

FWIW, I'm working on the panfrost driver and am using the shmem
helpers from Noralf. It's the early stages though. I started a patch
for etnaviv to use it too, but found I need to rework it to sub-class
the shmem GEM object.

Rob
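[Editor's note: sub-classing the shmem GEM object, as Rob describes, generally means embedding `struct drm_gem_shmem_object` as the first member of a driver-private BO struct and walking back to the driver object with `container_of()`. Below is a minimal user-space sketch of that pattern only; the struct layouts are deliberately simplified stand-ins for the real DRM types, and `panfrost_bo`/`gpu_va` are hypothetical names, not the driver's actual API.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real types in <drm/drm_gem.h> and
 * <drm/drm_gem_shmem_helper.h>; fields reduced to what the pattern needs. */
struct drm_gem_object { uint32_t handle; };
struct drm_gem_shmem_object { struct drm_gem_object base; void *pages; };

/* Kernel-style container_of() so the sketch is self-contained. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver BO sub-classing the shmem object: the helper
 * struct must be the first member so helper code can operate on it. */
struct panfrost_bo {
	struct drm_gem_shmem_object base;
	uint64_t gpu_va;                   /* driver-private field */
};

/* Walk back from the generic GEM object to the driver object. */
static struct panfrost_bo *to_panfrost_bo(struct drm_gem_object *obj)
{
	struct drm_gem_shmem_object *shmem =
		container_of(obj, struct drm_gem_shmem_object, base);
	return container_of(shmem, struct panfrost_bo, base);
}
```

This is why the shmem helpers had to be reworked to allow sub-classing: the helper must allocate (or accept) the larger driver struct rather than a bare `drm_gem_shmem_object`.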
Eric Anholt Feb. 13, 2019, 1 a.m. UTC | #9
Rob Herring <robh@kernel.org> writes:

> [...]
>
> FWIW, I'm working on the panfrost driver and am using the shmem
> helpers from Noralf. It's the early stages though. I started a patch
> for etnaviv to use it too, but found I need to rework it to sub-class
> the shmem GEM object.

Did you just convert the shmem helpers over to doing alloc_coherent?  If
so, I'd be interested in picking them up for v3d, and that might help
get another patch out of your stack.

I'm particularly interested in the shmem helpers because I should start
doing dynamic binding in and out of the GPU's page table, to avoid
pinning so much memory all the time.
kernel test robot via dri-devel Feb. 13, 2019, 1:44 a.m. UTC | #10
On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@anholt.net> wrote:
>
> Rob Herring <robh@kernel.org> writes:
>
> > [...]
> >
> > FWIW, I'm working on the panfrost driver and am using the shmem
> > helpers from Noralf. It's the early stages though. I started a patch
> > for etnaviv to use it too, but found I need to rework it to sub-class
> > the shmem GEM object.
>
> Did you just convert the shmem helpers over to doing alloc_coherent?  If
> so, I'd be interested in picking them up for v3d, and that might help
> get another patch out of your stack.

I haven't really fully addressed that yet, but yeah, my plan is just
to switch to WC alloc and mappings. I was going to try to make it
configurable, but there is a comment in the ARM dma mapping code which
makes me wonder if tinydrm using streaming DMA for SPI is
fundamentally broken (and maybe CMA is less broken?). If it's not
broken, it's at least not guaranteed to work.

/*
 * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
 * that the intention is to allow exporting memory allocated via the
 * coherent DMA APIs through the dma_buf API, which only accepts a
 * scattertable.  This presents a couple of problems:
 * 1. Not all memory allocated via the coherent DMA APIs is backed by
 *    a struct page
 * 2. Passing coherent DMA memory into the streaming APIs is not allowed
 *    as we will try to flush the memory through a different alias to that
 *    actually being used (and the flushes are redundant.)
 */

> I'm particularly interested in the shmem helpers because I should start
> doing dynamic binding in and out of the GPU's page table, to avoid
> pinning so much memory all the time.

I'll try to post something in the next couple of days.

Rob
Daniel Vetter Feb. 13, 2019, 7:59 a.m. UTC | #11
On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh@kernel.org> wrote:
>
> On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@anholt.net> wrote:
> >
> > Rob Herring <robh@kernel.org> writes:
> >
> > > On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >>
> > >> On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
> > >> > On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >> > >
> > >> > > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > >> > > > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > >> > > >
> > >> > > > Since last RFC, all feedback has been addressed. Most Mali DTS
> > >> > > > changes are already upstreamed by SoC maintainers. The kernel
> > >> > > > driver and user-kernel interface are quite stable for several
> > >> > > > months, so I think it's ready to be upstreamed.
> > >> > > >
> > >> > > > This implementation mainly take amdgpu DRM driver as reference.
> > >> > > >
> > >> > > > - Mali 4xx GPUs have two kinds of processors GP and PP. GP is for
> > >> > > >   OpenGL vertex shader processing and PP is for fragment shader
> > >> > > >   processing. Each processor has its own MMU so prcessors work in
> > >> > > >   virtual address space.
> > >> > > > - There's only one GP but multiple PP (max 4 for mali 400 and 8
> > >> > > >   for mali 450) in the same mali 4xx GPU. All PPs are grouped
> > >> > > >   togather to handle a single fragment shader task divided by
> > >> > > >   FB output tiled pixels. Mali 400 user space driver is
> > >> > > >   responsible for assign target tiled pixels to each PP, but mali
> > >> > > >   450 has a HW module called DLBU to dynamically balance each
> > >> > > >   PP's load.
> > >> > > > - User space driver allocate buffer object and map into GPU
> > >> > > >   virtual address space, upload command stream and draw data with
> > >> > > >   CPU mmap of the buffer object, then submit task to GP/PP with
> > >> > > >   a register frame indicating where is the command stream and misc
> > >> > > >   settings.
> > >> > > > - There's no command stream validation/relocation due to each user
> > >> > > >   process has its own GPU virtual address space. GP/PP's MMU switch
> > >> > > >   virtual address space before running two tasks from different
> > >> > > >   user process. Error or evil user space code just get MMU fault
> > >> > > >   or GP/PP error IRQ, then the HW/SW will be recovered.
> > >> > > > - Use TTM as MM. TTM_PL_TT type memory is used as the content of
> > >> > > >   lima buffer object which is allocated from TTM page pool. all
> > >> > > >   lima buffer object gets pinned with TTM_PL_FLAG_NO_EVICT when
> > >> > > >   allocation, so there's no buffer eviction and swap for now.
> > >> > >
> > >> > > All other render gpu drivers that have unified memory (aka is on the SoC)
> > >> > > use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> > >> > > (and i915 is kinda the same too really). TTM makes sense if you have some
> > >> > > discrete memory to manage, but imo not in any other place really.
> > >> > >
> > >> > > What's the design choice behind this?
> > >> > To be honest, it's just because TTM offers more helpers. I did implement
> > >> > a GEM way with cma alloc at the beginning. But when implement paged mem,
> > >> > I found TTM has mem pool alloc, sync and mmap related helpers which covers
> > >> > much of my existing code. It's totally possible with GEM, but not as easy as
> > >> > TTM to me. And virtio-gpu seems an example to use TTM without discrete
> > >> > mem. Shouldn't TTM a super set of both unified mem and discrete mem?
> > >>
> > >> virtio does have fake vram and migration afaiui. And sure, you can use TTM
> > >> without the vram migration, it's just that most of the complexity of TTM
> > >> is due to buffer placement and migration and all that stuff. If you never
> > >> need to move buffers, then you don't need that ever.
> > >>
> > >> Wrt lack of helpers, what exactly are you looking for? A big part of these
> > >> for TTM is that TTM is a bid a midlayer, so reinvents a bunch of things
> > >> provided by e.g. dma-api. It's cleaner to use the dma-api directly. Basing
> > >> the lima kernel driver on vc4, freedreno or etnaviv (last one is probably
> > >> closest, since it doesn't have a display block either) would be better I
> > >> think.
> > >
> > > FWIW, I'm working on the panfrost driver and am using the shmem
> > > helpers from Noralf. It's the early stages though. I started a patch
> > > for etnaviv to use it too, but found I need to rework it to sub-class
> > > the shmem GEM object.
> >
> > Did you just convert the shmem helpers over to doing alloc_coherent?  If
> > so, I'd be interested in picking them up for v3d, and that might help
> > get another patch out of your stack.
>
> I haven't really fully addressed that yet, but yeah, my plan is just
> to switch to WC alloc and mappings. I was going to try to make it
> configurable, but there is a comment in the ARM dma mapping code which
> makes me wonder if tinydrm using streaming DMA for SPI is
> fundamentally broken (and maybe CMA is less broken?). If not broken,
> not guaranteed to work.
>
> /*
>  * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
>  * that the intention is to allow exporting memory allocated via the
>  * coherent DMA APIs through the dma_buf API, which only accepts a
>  * scattertable.  This presents a couple of problems:
>  * 1. Not all memory allocated via the coherent DMA APIs is backed by
>  *    a struct page
>  * 2. Passing coherent DMA memory into the streaming APIs is not allowed
>  *    as we will try to flush the memory through a different alias to that
>  *    actually being used (and the flushes are redundant.)
>  */

The sg table is only for device access, which avoids both of these
issues. That's the idea at least, except TTM-based drivers don't
care: they expect a struct page and then use that to build a ttm_bo,
and then use all the TTM CPU-side access functions instead of the
dma-buf interfaces (which exist precisely to avoid the above issues).

So unless you want to fix ttm dma-buf import (which is going to be
a pile of work), add this to the list of why ttm is probably not the
best choice for something mostly running on arm SoCs. x86 gets away
with it because dma is easy on x86.
-Daniel

> > I'm particularly interested in the shmem helpers because I should start
> > doing dynamic binding in and out of the GPU's page table, to avoid
> > pinning so much memory all the time.
>
> I'll try to post something in the next couple of days.
>
> Rob
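[Editor's note: the import model Daniel is arguing for can be sketched in a few lines: a dma-buf importer consumes only the DMA addresses in the sg table and never dereferences struct page. The structs below are simplified user-space stand-ins for `struct scatterlist`, and `gpu_mapping`/`import_sg` are hypothetical names for illustration, not real kernel API.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct scatterlist (<linux/scatterlist.h>):
 * after dma_map_sg(), each entry carries a device-visible DMA address.
 * Deliberately no struct page pointer -- the importer must not need one. */
struct sg_entry {
	uint64_t dma_address;
	uint32_t dma_len;
};

/* Hypothetical record of one GPU-MMU page-table mapping. */
struct gpu_mapping {
	uint64_t va;    /* GPU virtual address */
	uint64_t dma;   /* bus/DMA address */
	uint32_t len;
};

/* Sketch of an importer building its GPU mapping purely from the DMA
 * addresses in the sg table, laying the entries out contiguously in the
 * GPU virtual address space. Returns the number of mappings produced. */
static size_t import_sg(const struct sg_entry *sg, size_t nents,
			uint64_t gpu_va, struct gpu_mapping *out)
{
	size_t n;

	for (n = 0; n < nents; n++) {
		out[n].va  = gpu_va;
		out[n].dma = sg[n].dma_address;
		out[n].len = sg[n].dma_len;
		gpu_va    += sg[n].dma_len;
	}
	return n;
}
```

The contrast with the TTM behavior Daniel describes is that nothing here ever needs a CPU-side view of the buffer; CPU access, when needed, would go through the dma-buf accessor interfaces instead.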
kernel test robot via dri-devel Feb. 13, 2019, 8:35 a.m. UTC | #12
Am 13.02.19 um 08:59 schrieb Daniel Vetter:
> On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh@kernel.org> wrote:
>> On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@anholt.net> wrote:
>>> Rob Herring <robh@kernel.org> writes:
>>>
>>>> [...]
>>>> FWIW, I'm working on the panfrost driver and am using the shmem
>>>> helpers from Noralf. It's the early stages though. I started a patch
>>>> for etnaviv to use it too, but found I need to rework it to sub-class
>>>> the shmem GEM object.
>>> Did you just convert the shmem helpers over to doing alloc_coherent?  If
>>> so, I'd be interested in picking them up for v3d, and that might help
>>> get another patch out of your stack.
>> I haven't really fully addressed that yet, but yeah, my plan is just
>> to switch to WC alloc and mappings. I was going to try to make it
>> configurable, but there is a comment in the ARM dma mapping code which
>> makes me wonder if tinydrm using streaming DMA for SPI is
>> fundamentally broken (and maybe CMA is less broken?). If not broken,
>> not guaranteed to work.
>>
>> /*
>>   * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
>>   * that the intention is to allow exporting memory allocated via the
>>   * coherent DMA APIs through the dma_buf API, which only accepts a
>>   * scattertable.  This presents a couple of problems:
>>   * 1. Not all memory allocated via the coherent DMA APIs is backed by
>>   *    a struct page
>>   * 2. Passing coherent DMA memory into the streaming APIs is not allowed
>>   *    as we will try to flush the memory through a different alias to that
>>   *    actually being used (and the flushes are redundant.)
>>   */
> The sg table is only for device access, which avoids both of these
> issues. That's the idea at least, except all ttm-based drivers don't
> care, instead they expect a struct page and then use that to build a
> ttm_bo. And then use all the ttm cpu side access functions, instead of
> using the dma-buf interfaces (which need to exist to avoid the above
> issues).

Actually that is no longer correct. I fixed this while working on 
directly sharing BOs between amdgpu devices.

TTM now uses the DMA addresses from the sg table, and I've actually 
wanted to remove the pages for imported DMA-buf BOs for a while now.

Regards,
Christian.

>
> So except if you want to fix ttm dma-buf import (which is going to be
> a pile of work), add this to the list of why ttm is probably not the
> best choice for something mostly running on arm soc. x86 gets away
> because dma is easy on x86.
> -Daniel
>
>>> I'm particularly interested in the shmem helpers because I should start
>>> doing dynamic binding in and out of the GPU's page table, to avoid
>>> pinning so much memory all the time.
>> I'll try to post something in the next couple of days.
>>
>> Rob
>
>
Daniel Vetter Feb. 13, 2019, 9:38 a.m. UTC | #13
On Wed, Feb 13, 2019 at 09:35:30AM +0100, Christian König wrote:
> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
> > [...]
> > The sg table is only for device access, which avoids both of these
> > issues. That's the idea at least, except all ttm-based drivers don't
> > care, instead they expect a struct page and then use that to build a
> > ttm_bo. And then use all the ttm cpu side access functions, instead of
> > using the dma-buf interfaces (which need to exist to avoid the above
> > issues).
> 
> Actually that is not correct any more. I've fixed this while working on
> directly sharing BOs between amdgpu devices.
> 
> TTM now uses the DMA addresses from the sg table and I actually wanted to
> remove the pages for imported DMA-buf BOs for a while now.

Nice! And yeah it's been a while since I looked at this ... So just a bit
of cleanup work left to do, fundamentals are in place. Shouldn't be too
hard to get rid of the pages, since the dma-buf cpu accessor functions
have been modelled after the ttm_bo interfaces.
-Daniel
kernel test robot via dri-devel Feb. 13, 2019, 10:09 a.m. UTC | #14
Am 13.02.19 um 10:38 schrieb Daniel Vetter:
> On Wed, Feb 13, 2019 at 09:35:30AM +0100, Christian König wrote:
>> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
>>> [snip]
>>> The sg table is only for device access, which avoids both of these
>>> issues. That's the idea at least, except all ttm-based drivers don't
>>> care, instead they expect a struct page and then use that to build a
>>> ttm_bo. And then use all the ttm cpu side access functions, instead of
>>> using the dma-buf interfaces (which need to exist to avoid the above
>>> issues).
>> Actually that is not correct any more. I've fixed this while working on
>> directly sharing BOs between amdgpu devices.
>>
>> TTM now uses the DMA addresses from the sg table and I actually wanted to
>> remove the pages for imported DMA-buf BOs for a while now.
> Nice! And yeah it's been a while since I looked at this ... So just a bit
> of cleanup work left to do, fundamentals are in place. Shouldn't be too
> hard to get rid of the pages, since the dma-buf cpu accessor functions
> have been modelled after the ttm_bo interfaces.

Well, at least in radeon and amdgpu, CPU mapping an imported BO is 
forbidden (userspace directly maps the DMA-buf fd instead).

The only case left is mapping a BO in the kernel, and that in turn is 
only used in very few places in radeon/amdgpu:
1. Command stream patching.
2. CPU based page table updates.
3. Debugging

And I think none of them makes sense on a DMA-buf imported BO.

Regards,
Christian.

> -Daniel
Noralf Trønnes Feb. 14, 2019, 9:15 p.m. UTC | #15
Den 13.02.2019 02.44, skrev Rob Herring:
> On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@anholt.net> wrote:
>>
>> Rob Herring <robh@kernel.org> writes:
>>

[snip]

>>> FWIW, I'm working on the panfrost driver and am using the shmem
>>> helpers from Noralf. It's the early stages though. I started a patch
>>> for etnaviv to use it too, but found I need to rework it to sub-class
>>> the shmem GEM object.
>>
>> Did you just convert the shmem helpers over to doing alloc_coherent?  If
>> so, I'd be interested in picking them up for v3d, and that might help
>> get another patch out of your stack.
> 
> I haven't really fully addressed that yet, but yeah, my plan is just
> to switch to WC alloc and mappings. I was going to try to make it
> configurable, but there is a comment in the ARM dma mapping code which
> makes me wonder if tinydrm using streaming DMA for SPI is
> fundamentally broken (and maybe CMA is less broken?). If not broken,
> not guaranteed to work.
> 
> /*
>  * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
>  * that the intention is to allow exporting memory allocated via the
>  * coherent DMA APIs through the dma_buf API, which only accepts a
>  * scattertable.  This presents a couple of problems:
>  * 1. Not all memory allocated via the coherent DMA APIs is backed by
>  *    a struct page
>  * 2. Passing coherent DMA memory into the streaming APIs is not allowed
>  *    as we will try to flush the memory through a different alias to that
>  *    actually being used (and the flushes are redundant.)
>  */
> 

Thanks for drawing my attention to this; I wasn't aware of it. Sadly the
SPI subsystem doesn't have a way to pass in DMA buffers; everything has
to go through the streaming API. Long term I guess I'll have to add
support for that.

Noralf.
Daniel Vetter Feb. 26, 2019, 3:58 p.m. UTC | #16
On Wed, Feb 13, 2019 at 9:35 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
> > [snip]
> > The sg table is only for device access, which avoids both of these
> > issues. That's the idea at least, except all ttm-based drivers don't
> > care, instead they expect a struct page and then use that to build a
> > ttm_bo. And then use all the ttm cpu side access functions, instead of
> > using the dma-buf interfaces (which need to exist to avoid the above
> > issues).
>
> Actually that is not correct any more. I've fixed this while working on
> directly sharing BOs between amdgpu devices.
>
> TTM now uses the DMA addresses from the sg table and I actually wanted
> to remove the pages for imported DMA-buf BOs for a while now.

Finally gotten around to reading ttm code to update my understanding,
and I think I realized why I never realized this changed:
TTM_PAGE_FLAG_SG and related code seems to be the fancy new code you
added to go sg table native in ttm, and from a quick look rolled out
everywhere. But drm_prime_sg_to_page_addr_arrays is still called. Is
that the missing cleanup you're referring to? Would be nice if we
could nuke it to stop the copypasta spread (and spread it seems to
do :-/). Maybe as a todo.rst entry?

Cheers, Daniel

>
> Regards,
> Christian.
>
> >
> > So except if you want to fix ttm dma-buf import (which is going to be
> > a pile of work), add this to the list of why ttm is probably not the
> > best choice for something mostly running on arm soc. x86 gets away
> > because dma is easy on x86.
> > -Daniel
> >
> >>> I'm particularly interested in the shmem helpers because I should start
> >>> doing dynamic binding in and out of the GPU's page table, to avoid
> >>> pinning so much memory all the time.
> >> I'll try to post something in the next couple of days.
> >>
> >> Rob
> >
> >
>
Christian König Feb. 26, 2019, 4:23 p.m. UTC | #17
Am 26.02.19 um 16:58 schrieb Daniel Vetter:
> On Wed, Feb 13, 2019 at 9:35 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
>>> [snip]
>>> The sg table is only for device access, which avoids both of these
>>> issues. That's the idea at least, except all ttm-based drivers don't
>>> care, instead they expect a struct page and then use that to build a
>>> ttm_bo. And then use all the ttm cpu side access functions, instead of
>>> using the dma-buf interfaces (which need to exist to avoid the above
>>> issues).
>> Actually that is not correct any more. I've fixed this while working on
>> directly sharing BOs between amdgpu devices.
>>
>> TTM now uses the DMA addresses from the sg table and I actually wanted
>> to remove the pages for imported DMA-buf BOs for a while now.
> Finally gotten around to reading ttm code to update my understanding,
> and I think I realized why I never realized this changed:
> TTM_PAGE_FLAG_SG and related code seems to be the fancy new code you
> added to go sg table native in ttm, and from a quick look rolled out
> everywhere. But drm_prime_sg_to_page_addr_arrays is still called. Is
> that the missing cleanup you're referring to?

Yes, exactly. The last thing I pushed upstream was making the pages 
optional in drm_prime_sg_to_page_addr_arrays.

I just never got around to actually not filling ttm->pages any more, but in
theory it should be possible to just comment that out and be happy about it.

Christian.

>   Would be nice if we
> could nuke it to stop the copypasta spread (and spread it seems to
> do :-/). Maybe as a todo.rst entry?
>
> Cheers, Daniel
>
>> Regards,
>> Christian.
>>
>>> So except if you want to fix ttm dma-buf import (which is going to be
>>> a pile of work), add this to the list of why ttm is probably not the
>>> best choice for something mostly running on arm soc. x86 gets away
>>> because dma is easy on x86.
>>> -Daniel
>>>
>>>>> I'm particularly interested in the shmem helpers because I should start
>>>>> doing dynamic binding in and out of the GPU's page table, to avoid
>>>>> pinning so much memory all the time.
>>>> I'll try to post something in the next couple of days.
>>>>
>>>> Rob
>>>
>