drm/doc: Start documenting aspects specific to tile-based renderers

Message ID 20250418122524.410448-1-boris.brezillon@collabora.com (mailing list archive)
State New
Series drm/doc: Start documenting aspects specific to tile-based renderers

Commit Message

Boris Brezillon April 18, 2025, 12:25 p.m. UTC
Tile-based GPUs come with a set of constraints that are not present
when immediate rendering is used. This new document tries to explain
the differences between tile/immediate rendering, the problems that
come with tilers, and how we plan to address them.

This is just a starting point; this document will be updated with new
material as we refine the libraries we add to help deal with tilers,
and as more drivers are converted to follow the rules listed here.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
 Documentation/gpu/index.rst                   |   1 +
 2 files changed, 202 insertions(+)
 create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst

Patch

diff --git a/Documentation/gpu/drm-tile-based-renderer.rst b/Documentation/gpu/drm-tile-based-renderer.rst
new file mode 100644
index 000000000000..19b56b9476fc
--- /dev/null
+++ b/Documentation/gpu/drm-tile-based-renderer.rst
@@ -0,0 +1,201 @@ 
+==================================================
+Infrastructure and tricks for tile-based renderers
+==================================================
+
+A lot of embedded GPUs use tile-based rendering instead of immediate
+rendering. This mode of rendering has various implications that we try to
+document here, along with some hints about how to deal with some of the
+problems that surface with tile-based renderers.
+
+The main idea behind tile-based rendering is to batch processing of nearby
+pixels during the fragment shading phase to limit the traffic on the memory
+bus by making optimal use of the various caches present in the GPU. Unlike
+immediate rendering, where primitives generated by the geometry stages of
+the pipeline are directly consumed by the fragment stage, tilers have to
+record primitives in bins that are somehow attached to tiles (the
+granularity of the tile being GPU-specific). This data is usually stored
+in memory, and pulled back when the fragment stage is executed.
+
+This approach has several issues that most drivers need to handle somehow,
+sometimes with a bit of help from the hardware.
+
+Issues at hand
+==============
+
+Tiler memory
+------------
+
+The amount of memory needed to store primitive data and metadata is hard
+to guess ahead of time, because it depends on various parameters that are
+not under the control of the UMD (UserMode Driver). Here is a non-exhaustive
+list of things that may complicate the calculation of the memory needed to
+store primitive information:
+
+- Primitives distribution across tiles is hard to guess: the binning process
+  is about assigning each primitive to the set of tiles it covers. The more
+  tiles are covered, the more memory is needed to record those primitives.
+  We can estimate the worst-case scenario by assuming all primitives will
+  cover all tiles, but this leads to over-allocation most of the time, which
+  is not good.
+- Indirect draws: the number of vertices comes from a GPU buffer that might
+  be filled by previous GPU compute jobs. This means we only know the number
+  of vertices when the GPU executes the draw, and thus can't guess upfront
+  how much memory will be needed to store the resulting primitives, or
+  allocate a GPU buffer that's big enough to hold them.
+- Complex geometry pipelines: if you throw geometry/tessellation/mesh shaders
+  into the mix, it gets even trickier to guess the number of primitives from
+  the number of vertices passed to the vertex shader.
+
+For all these reasons, the tiler usually allocates memory dynamically, but
+DRM has not been designed with this use case in mind. Drivers will address
+these problems differently based on the functionality provided by their
+hardware, but all of them almost certainly have to deal with this somehow.
+
+The easy solution is to statically allocate a huge buffer to pick from when
+tiler memory is needed, and fail the rendering when this buffer is depleted.
+Some drivers try to be smarter to avoid reserving a lot of memory upfront.
+Instead, they start with an almost empty buffer and progressively populate it
+when the GPU faults on an address sitting in the tiler buffer range. This
+works okay most of the time, but it falls short when the system is under
+memory pressure, because the memory request is not guaranteed to be
+satisfied. In that case, the driver either fails the rendering or, if the
+hardware allows it, flushes the primitives that have been processed so far
+and triggers a fragment job that consumes those primitives, freeing up
+memory that can be recycled to make further progress on the tiling step.
+This is usually referred to as partial/incremental rendering (it might have
+other names).
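+
+A minimal sketch of what this decision could look like in a driver's fault
+handler is shown below. All names are made up for illustration; the exact
+hooks and fallbacks are hardware- and driver-specific.
+
+.. code-block:: c
+
+    /*
+     * Hypothetical handler called when the tiler runs out of heap memory.
+     * None of these helpers are existing DRM APIs.
+     */
+    static int my_handle_tiler_oom(struct my_gpu_context *ctx)
+    {
+            /*
+             * Try to grow the tiler heap first (allocation-on-fault);
+             * the grow helper returns 0 on success.
+             */
+            if (!my_grow_tiler_heap(ctx))
+                    return 0;
+
+            /*
+             * Growing failed: if the hardware supports it, flush the
+             * primitives binned so far so their memory can be recycled.
+             */
+            if (ctx->supports_incremental_render)
+                    return my_trigger_incremental_render(ctx);
+
+            /* No fallback available, the job has to fail. */
+            return -ENOMEM;
+    }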
+
+Compute based emulation of geometry stages
+------------------------------------------
+
+More and more hardware vendors don't bother providing hardware support for
+geometry/tessellation/mesh stages, since those can be emulated with compute
+shaders. But the same problem we have with tiler memory exists for those
+intermediate compute-emulated stages, because the transient data shared
+between stages needs to be stored in memory for the next stage to consume,
+and this bubbles up until the tiling stage is reached, because ultimately
+what the tiling stage needs to process is a set of vertices it can turn into
+primitives, just as would happen if the application had emulated the
+geometry, tessellation or mesh stages with compute shaders itself.
+
+Unlike tiling, where the hardware can provide a fallback to recycle memory,
+there is no way the intermediate primitives can be flushed all the way to the
+framebuffer, because this is purely software emulation. This being said, the
+same "start small, grow on demand" scheme can be applied to avoid
+over-allocating memory upfront.
+
+On-demand memory allocation
+---------------------------
+
+As explained in the previous sections, on-demand allocation is a central
+piece of tile-based renderers if we don't want to over-allocate, which is
+particularly bad for integrated GPUs that share their memory with the rest
+of the system.
+
+The problem with on-demand allocation is that, suddenly, GPU accesses can
+fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
+were not designed for that. They assume that buffer memory is populated at
+job submission time and will stay around for the job's lifetime.
+If a GPU fault happens, it's the user's fault, and the context can be flagged
+unusable. On-demand allocation is usually implemented as allocation-on-fault,
+and the dma_fence contract prevents us from blocking on allocations in that
+path (GPU fault handlers are in the dma-fence signalling path). So now we
+have GPU allocations that will be satisfied most of the time, but can fail
+occasionally. And this is not great, because an allocation failure might
+kill the user's GPU context (VK_ERROR_DEVICE_LOST in Vulkan terms), without
+the application having done anything wrong. So we need something that makes
+those allocation failures rare enough that most users won't experience them,
+and we need a fallback for when they do happen, so they can be avoided on
+the user's next attempt to submit a graphics job.
+
+The plan
+========
+
+On-demand allocation rules
+--------------------------
+
+First of all, all allocations happening in the fault handler path must
+use GFP_NOWAIT. With this flag, an error is returned if no memory is
+readily available, instead of waiting on reclaim. GFP_NOWAIT still kicks
+background reclaim, so low-hanging fruit (clean FS caches, for instance)
+can be reclaimed to hopefully satisfy our future requests.
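+
+As a minimal illustration, and assuming a hypothetical per-page grow helper,
+the fault-path allocation could look like this (names are made up, this is
+not an existing API):
+
+.. code-block:: c
+
+    /* Called from the GPU fault handler, i.e. the dma-fence signalling path. */
+    static struct page *my_tiler_heap_grow_one_page(void)
+    {
+            /*
+             * GFP_NOWAIT never sleeps, so it is safe here, and it wakes up
+             * kswapd so that later requests have a better chance to succeed.
+             * __GFP_NOWARN avoids spamming the logs on expected failures.
+             */
+            return alloc_page(GFP_NOWAIT | __GFP_NOWARN);
+    }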
+
+How to deal with allocation failures
+------------------------------------
+
+The first trick here is to try to guess approximately how much memory
+will be needed, and force-populate on-demand buffers with that amount
+of memory when the job is started. It's not about guessing the worst-case
+scenario here, but rather the most likely case, probably with a reasonable
+margin, so that the job is likely to succeed when this amount of memory is
+provided by the KMD.
+
+The second trick to try to avoid over-allocation, even with this
+most-likely-case estimate, is to have a shared pool of memory that can be
+used by all GPU contexts when they need tiler/geometry memory. This
+implies returning chunks to this pool at some point, so other contexts
+can re-use them. Details about what this global memory pool implementation
+would look like are currently undefined, but it needs to be filled to
+guarantee that pre-allocation requests for on-demand buffers used by a
+GPU job can be satisfied in the fault handler path.
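+
+To make the split between submission-time pre-population and pool recycling
+more concrete, here is a rough sketch. The pool API and all names shown here
+are purely hypothetical, since the actual implementation is still undefined.
+
+.. code-block:: c
+
+    /* Job submission path: blocking is allowed here. */
+    static int my_job_prepare_tiler_heap(struct my_job *job)
+    {
+            /* Most-likely-case estimate plus a margin, not the worst case. */
+            size_t estimate = my_estimate_tiler_mem(job) + MY_TILER_MEM_MARGIN;
+
+            return my_mem_pool_reserve(job->mem_pool, estimate);
+    }
+
+    /* Job completion path: return the chunks so other contexts can use them. */
+    static void my_job_release_tiler_heap(struct my_job *job)
+    {
+            my_mem_pool_release(job->mem_pool, job->tiler_heap_size);
+    }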
+
+As a last resort, we can try to allocate with GFP_ATOMIC if everything
+else fails, but this is a dangerous game, because we would be stealing
+memory from the atomic reserve, so it's not entirely clear if this is
+better than failing the job at this point.
+
+Ideas on how to make allocation failures decrease over time
+-----------------------------------------------------------
+
+When an on-demand allocation fails and the hardware doesn't have a
+flush-primitives fallback, we usually can't do much apart from failing the
+whole job. But it's important to try to avoid future allocation failures
+when the application creates a new context. There's no clear path for
+how to guess the actual size to force-populate on the next attempt. One
+option is to use a simple heuristic, like doubling the current resident
+size, but this has the downside of potentially taking a few attempts before
+reaching a stable point. Another option is to repeatedly map a dummy page at
+the faulting addresses, so we can get a sense of how much memory was needed
+for this particular job.
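+
+For instance, the doubling heuristic could be as simple as the following
+(purely illustrative):
+
+.. code-block:: c
+
+    /*
+     * Pick the amount of memory to force-populate on the next attempt, based
+     * on how much was resident when the allocation failure happened.
+     */
+    static size_t my_next_prepopulate_size(size_t resident_size, size_t max_size)
+    {
+            return min(resident_size * 2, max_size);
+    }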
+
+Once userspace gets an idea of what the application needs, it should force
+this to be the minimum populated size on the next context creation. For GL
+drivers, the UMD is in control of the context recreation, so it can easily
+record the next buffer size to use. For Vulkan applications, something should
+be recorded to track that, maybe in the form of some implicit dri-conf
+database that can override the explicit dri-conf.
+
+Various implementation details have been discussed
+`here <https://lore.kernel.org/dri-devel/Z_kEjFjmsumfmbfM@phenom.ffwll.local/>`_
+but nothing has been decided yet.
+
+DRM infrastructure changes for tile-based renderers
+===================================================
+
+As seen in the previous sections, allocation for tile-based GPUs can be
+tricky, so we really want to provide as many helpers as we can, and document
+how these helpers must be used. This section tries to list the various
+components and how we expect them to work.
+
+GEM SHMEM sparse backing
+------------------------
+
+On-demand allocation is not something the GEM layer has been designed for.
+The idea is to extend the existing GEM and GEM SHMEM helpers to cover the
+concept of sparse backing.
+
+A solution has been proposed
+`here <https://lore.kernel.org/dri-devel/20250404092634.2968115-1-boris.brezillon@collabora.com/>`_.
+
+Fault injection mechanism
+-------------------------
+
+In order to easily test/validate the on-demand allocation logic, we need
+a way to fake GPU faults and trigger on-demand allocation. We also need
+a way to fake allocation failures at various points.
+
+This part is likely to be driver specific, and should probably involve
+new debugfs knobs.
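+
+As a purely illustrative example, a driver could expose a knob like the one
+below to force a failure every N on-demand allocations (the knob name and
+the surrounding code are made up, not an existing interface):
+
+.. code-block:: c
+
+    static u32 my_fake_alloc_fail_interval; /* 0 = never inject failures */
+    static atomic_t my_alloc_count;
+
+    /* drm_driver::debugfs_init() hook. */
+    static void my_debugfs_init(struct drm_minor *minor)
+    {
+            debugfs_create_u32("fake_tiler_alloc_fail_interval", 0600,
+                               minor->debugfs_root,
+                               &my_fake_alloc_fail_interval);
+    }
+
+    /* Called at the beginning of the on-demand allocation path. */
+    static bool my_should_fake_alloc_failure(void)
+    {
+            return my_fake_alloc_fail_interval &&
+                   !(atomic_inc_return(&my_alloc_count) %
+                     my_fake_alloc_fail_interval);
+    }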
+
+Global memory pool for on-demand allocation
+-------------------------------------------
+
+TBD.
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..186917524854 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -14,6 +14,7 @@  GPU Driver Developer's Guide
    driver-uapi
    drm-client
    drm-compute
+   drm-tile-based-renderer
    drivers
    backlight
    vga-switcheroo