drm/doc: Start documenting aspects specific to tile-based renderers

Message ID 20250418122524.410448-1-boris.brezillon@collabora.com (mailing list archive)
State New
Series drm/doc: Start documenting aspects specific to tile-based renderers

Commit Message

Boris Brezillon April 18, 2025, 12:25 p.m. UTC
Tile-based GPUs come with a set of constraints that are not present
when immediate rendering is used. This new document tries to explain
the differences between tile/immediate rendering, the problems that
come with tilers, and how we plan to address them.

This is just a starting point; this document will be updated with new
material as we refine the libraries we add to help deal with tilers,
and as more drivers are converted to follow the rules listed here.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
 Documentation/gpu/index.rst                   |   1 +
 2 files changed, 202 insertions(+)
 create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst

Patch

diff --git a/Documentation/gpu/drm-tile-based-renderer.rst b/Documentation/gpu/drm-tile-based-renderer.rst
new file mode 100644
index 000000000000..19b56b9476fc
--- /dev/null
+++ b/Documentation/gpu/drm-tile-based-renderer.rst
@@ -0,0 +1,201 @@ 
+==================================================
+Infrastructure and tricks for tile-based renderers
+==================================================
+
+A lot of embedded GPUs use tile-based rendering instead of immediate
+rendering. This mode of rendering has various implications that we try to
+document here, along with some hints about how to deal with some of the
+problems that surface with tile-based renderers.
+
+The main idea behind tile-based rendering is to batch processing of nearby
+pixels during the fragment shading phase to limit the traffic on the memory
+bus by making optimal use of the various caches present in the GPU. Unlike
+immediate rendering, where primitives generated by the geometry stages of
+the pipeline are directly consumed by the fragment stage, tilers have to
+record primitives in bins that are somehow attached to tiles (the
+granularity of the tile being GPU-specific). This data is usually stored
+in memory, and pulled back when the fragment stage is executed.
+
+This approach has several issues that most drivers need to handle somehow,
+sometimes with a bit of help from the hardware.
+
+Issues at hand
+==============
+
+Tiler memory
+------------
+
+The amount of memory needed to store primitive data and metadata is hard
+to guess ahead of time, because it depends on various parameters that are
+not under the control of the UMD (UserMode Driver). Here is a non-exhaustive
+list of things that may complicate the calculation of the memory needed to
+store primitive information:
+
+- Primitives distribution across tiles is hard to guess: the binning process
+  is about assigning each primitive to the set of tiles it covers. The more
+  tiles are covered, the more memory is needed to record those primitives.
+  We can estimate the worst-case scenario by assuming all primitives will
+  cover all tiles, but this leads to over-allocation most of the time, which
+  is not good.
+- Indirect draws: the number of vertices comes from a GPU buffer that might
+  be filled by previous GPU compute jobs. This means we only know the number
+  of vertices when the GPU executes the draw, and thus can't guess upfront
+  how much memory will be needed to store the resulting primitives, or
+  allocate a GPU buffer that's big enough to hold them.
+- Complex geometry pipelines: if you throw geometry/tessellation/mesh shaders
+  into the mix, it gets even trickier to guess the number of primitives from
+  the number of vertices passed to the vertex shader.
+
+For all these reasons, the tiler usually allocates memory dynamically, but
+DRM has not been designed with this use case in mind. Drivers will address
+these problems differently based on the functionality provided by their
+hardware, but all of them almost certainly have to deal with this somehow.
+
+The easy solution is to statically allocate a huge buffer to pick from when
+tiler memory is needed, and fail the rendering when this buffer is depleted.
+Some drivers try to be smarter to avoid reserving a lot of memory upfront.
+Instead, they start with an almost empty buffer and progressively populate it
+when the GPU faults on an address sitting in the tiler buffer range. This
+works okay most of the time, but it falls short when the system is under
+memory pressure, because the memory request is not guaranteed to be
+satisfied. In that case, the driver either fails the rendering or, if the
+hardware allows it, flushes the primitives that have been processed so far
+and triggers a fragment job that consumes those primitives, freeing up
+memory that can be recycled to make further progress on the tiling step.
+This is usually referred to as partial/incremental rendering (it might have
+other names).
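+
+A minimal sketch of what this decision could look like in a driver's fault
+handler is shown below. All names are made up for illustration; the exact
+hooks and fallbacks are hardware- and driver-specific.
+
+.. code-block:: c
+
+    /*
+     * Hypothetical handler called when the tiler runs out of heap memory.
+     * None of these helpers are existing DRM APIs.
+     */
+    static int my_handle_tiler_oom(struct my_gpu_context *ctx)
+    {
+            /*
+             * Try to grow the tiler heap first (allocation-on-fault);
+             * the grow helper returns 0 on success.
+             */
+            if (!my_grow_tiler_heap(ctx))
+                    return 0;
+
+            /*
+             * Growing failed: if the hardware supports it, flush the
+             * primitives binned so far so their memory can be recycled.
+             */
+            if (ctx->supports_incremental_render)
+                    return my_trigger_incremental_render(ctx);
+
+            /* No fallback available, the job has to fail. */
+            return -ENOMEM;
+    }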
+
+Compute based emulation of geometry stages
+------------------------------------------
+
+More and more hardware vendors don't bother providing hardware support for
+geometry/tessellation/mesh stages, since those can be emulated with compute
+shaders. But the same problem we have with tiler memory exists for those
+intermediate compute-emulated stages, because the transient data shared
+between stages needs to be stored in memory for the next stage to consume,
+and this bubbles up until the tiling stage is reached, because ultimately
+what the tiling stage needs to process is a set of vertices it can turn into
+primitives, just as would happen if the application had emulated the
+geometry, tessellation or mesh stages with compute shaders itself.
+
+Unlike tiling, where the hardware can provide a fallback to recycle memory,
+there is no way the intermediate primitives can be flushed all the way to the
+framebuffer, because this is purely software emulation. This being said, the
+same "start small, grow on demand" scheme can be applied to avoid
+over-allocating memory upfront.
+
+On-demand memory allocation
+---------------------------
+
+As explained in the previous sections, on-demand allocation is a central
+piece of tile-based renderers if we don't want to over-allocate, which is
+particularly bad for integrated GPUs that share their memory with the rest
+of the system.
+
+The problem with on-demand allocation is that, suddenly, GPU accesses can
+fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
+were not designed for that. They assume that buffer memory is populated at
+job submission time and will stay around for the job's lifetime.
+If a GPU fault happens, it's the user's fault, and the context can be flagged
+unusable. On-demand allocation is usually implemented as allocation-on-fault,
+and the dma_fence contract prevents us from blocking on allocations in that
+path (GPU fault handlers are in the dma-fence signalling path). So now we
+have GPU allocations that will be satisfied most of the time, but can fail
+occasionally. And this is not great, because an allocation failure might
+kill the user's GPU context (VK_ERROR_DEVICE_LOST in Vulkan terms), without
+the application having done anything wrong. So we need something that makes
+those allocation failures rare enough that most users won't experience them,
+and we need a fallback for when they do happen, so they can be avoided on
+the user's next attempt to submit a graphics job.
+
+The plan
+========
+
+On-demand allocation rules
+--------------------------
+
+First of all, all allocations happening in the fault handler path must
+use GFP_NOWAIT. With this flag, an error is returned if no memory is
+readily available, instead of waiting on reclaim. GFP_NOWAIT still kicks
+background reclaim, so low-hanging fruit (clean FS caches, for instance)
+can be reclaimed to hopefully satisfy our future requests.
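+
+As a minimal illustration, and assuming a hypothetical per-page grow helper,
+the fault-path allocation could look like this (names are made up, this is
+not an existing API):
+
+.. code-block:: c
+
+    /* Called from the GPU fault handler, i.e. the dma-fence signalling path. */
+    static struct page *my_tiler_heap_grow_one_page(void)
+    {
+            /*
+             * GFP_NOWAIT never sleeps, so it is safe here, and it wakes up
+             * kswapd so that later requests have a better chance to succeed.
+             * __GFP_NOWARN avoids spamming the logs on expected failures.
+             */
+            return alloc_page(GFP_NOWAIT | __GFP_NOWARN);
+    }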
+
+How to deal with allocation failures
+------------------------------------
+
+The first trick here is to try to guess approximately how much memory
+will be needed, and force-populate on-demand buffers with that amount
+of memory when the job is started. It's not about guessing the worst-case
+scenario here, but rather the most likely case, probably with a reasonable
+margin, so that the job is likely to succeed when this amount of memory is
+provided by the KMD.
+
+The second trick to try to avoid over-allocation, even with this
+most-likely-case estimate, is to have a shared pool of memory that can be
+used by all GPU contexts when they need tiler/geometry memory. This
+implies returning chunks to this pool at some point, so other contexts
+can re-use them. Details about what this global memory pool implementation
+would look like are currently undefined, but it needs to be filled to
+guarantee that pre-allocation requests for on-demand buffers used by a
+GPU job can be satisfied in the fault handler path.
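+
+To make the split between submission-time pre-population and pool recycling
+more concrete, here is a rough sketch. The pool API and all names shown here
+are purely hypothetical, since the actual implementation is still undefined.
+
+.. code-block:: c
+
+    /* Job submission path: blocking is allowed here. */
+    static int my_job_prepare_tiler_heap(struct my_job *job)
+    {
+            /* Most-likely-case estimate plus a margin, not the worst case. */
+            size_t estimate = my_estimate_tiler_mem(job) + MY_TILER_MEM_MARGIN;
+
+            return my_mem_pool_reserve(job->mem_pool, estimate);
+    }
+
+    /* Job completion path: return the chunks so other contexts can use them. */
+    static void my_job_release_tiler_heap(struct my_job *job)
+    {
+            my_mem_pool_release(job->mem_pool, job->tiler_heap_size);
+    }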
+
+As a last resort, we can try to allocate with GFP_ATOMIC if everything
+else fails, but this is a dangerous game, because we would be stealing
+memory from the atomic reserve, so it's not entirely clear if this is
+better than failing the job at this point.
+
+Ideas on how to make allocation failures decrease over time
+-----------------------------------------------------------
+
+When an on-demand allocation fails and the hardware doesn't have a
+flush-primitives fallback, we usually can't do much apart from failing the
+whole job. But it's important to try to avoid future allocation failures
+when the application creates a new context. There's no clear path for
+how to guess the actual size to force-populate on the next attempt. One
+option is to use a simple heuristic, like doubling the current resident
+size, but this has the downside of potentially taking a few attempts before
+reaching a stable point. Another option is to repeatedly map a dummy page at
+the faulting addresses, so we can get a sense of how much memory was needed
+for this particular job.
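+
+For instance, the doubling heuristic could be as simple as the following
+(purely illustrative):
+
+.. code-block:: c
+
+    /*
+     * Pick the amount of memory to force-populate on the next attempt, based
+     * on how much was resident when the allocation failure happened.
+     */
+    static size_t my_next_prepopulate_size(size_t resident_size, size_t max_size)
+    {
+            return min(resident_size * 2, max_size);
+    }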
+
+Once userspace gets an idea of what the application needs, it should force
+this to be the minimum populated size on the next context creation. For GL
+drivers, the UMD is in control of the context recreation, so it can easily
+record the next buffer size to use. For Vulkan applications, something should
+be recorded to track that, maybe in the form of some implicit dri-conf
+database that can override the explicit dri-conf.
+
+Various implementation details have been discussed
+`here <https://lore.kernel.org/dri-devel/Z_kEjFjmsumfmbfM@phenom.ffwll.local/>`_
+but nothing has been decided yet.
+
+DRM infrastructure changes for tile-based renderers
+===================================================
+
+As seen in the previous sections, allocation for tile-based GPUs can be
+tricky, so we really want to provide as many helpers as we can, and document
+how these helpers must be used. This section tries to list the various
+components and how we expect them to work.
+
+GEM SHMEM sparse backing
+------------------------
+
+On-demand allocation is not something the GEM layer has been designed for.
+The idea is to extend the existing GEM and GEM SHMEM helpers to cover the
+concept of sparse backing.
+
+A solution has been proposed
+`here <https://lore.kernel.org/dri-devel/20250404092634.2968115-1-boris.brezillon@collabora.com/>`_.
+
+Fault injection mechanism
+-------------------------
+
+In order to easily test/validate the on-demand allocation logic, we need
+a way to fake GPU faults and trigger on-demand allocation. We also need
+a way to fake allocation failures at various points.
+
+This part is likely to be driver specific, and should probably involve
+new debugfs knobs.
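+
+As a purely illustrative example, a driver could expose a knob like the one
+below to force a failure every N on-demand allocations (the knob name and
+the surrounding code are made up, not an existing interface):
+
+.. code-block:: c
+
+    static u32 my_fake_alloc_fail_interval; /* 0 = never inject failures */
+    static atomic_t my_alloc_count;
+
+    /* drm_driver::debugfs_init() hook. */
+    static void my_debugfs_init(struct drm_minor *minor)
+    {
+            debugfs_create_u32("fake_tiler_alloc_fail_interval", 0600,
+                               minor->debugfs_root,
+                               &my_fake_alloc_fail_interval);
+    }
+
+    /* Called at the beginning of the on-demand allocation path. */
+    static bool my_should_fake_alloc_failure(void)
+    {
+            return my_fake_alloc_fail_interval &&
+                   !(atomic_inc_return(&my_alloc_count) %
+                     my_fake_alloc_fail_interval);
+    }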
+
+Global memory pool for on-demand allocation
+-------------------------------------------
+
+TBD.
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..186917524854 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -14,6 +14,7 @@  GPU Driver Developer's Guide
    driver-uapi
    drm-client
    drm-compute
+   drm-tile-based-renderer
    drivers
    backlight
    vga-switcheroo