Message ID | 20190806195259.3531-1-alyssa.rosenzweig@collabora.com (mailing list archive)
---|---
State | New, archived
Series | drm/panfrost: Add "compute shader only" hint
On 06/08/2019 21:25, Alyssa Rosenzweig wrote:
>>> It's not obvious to me when it actually needs to be enabled. Besides
>>> the errata, it's only when... device_nr=1 for a compute-only job in
>>> kbase?
>>>
>>> I'm afraid I don't know nearly enough about how kbase plumbs CL to
>>> grok the significance...
>>
>> Figuring out the nr_core_groups was the complicated part of this as I
>> recall. Seems like we should at least figure out if we need (or will
>> need) PANFROST_JD_REQ_CORE_GRP_MASK added to the UAPI as well.
>
> I suspect this is something OpenCL/Vulkan specific. Hopefully Stephen
> can shine some light here :)

*switches torch on*...

Ok, this is actually a lot more complex than it first appears, so I'll
have to start with a bit of background:

Mali Midgard GPUs have 2 "thread creators" per core: one for fragment
threads and one for 'compute' threads (vertex work is considered
compute in this context). Each core has a set number of threads (e.g.
256 for the early GPUs) which effectively get divided between fragment
threads and compute threads - effectively round-robin between the two
thread creators, though I think there's some extra 'magic' in the
hardware. The idea is that for graphics you can run fragment and vertex
workloads on the core at the same time and make better use of the
hardware (i.e. fragment units are using the texturing hardware while
vertex is using the ALU).

However, two things stand in the way of this working nicely:

1. Core groups. This is a lovely design feature for hardware engineers,
but a pain for software. Basically you can have multiple sets of cores.
The cores in a set are coherent with each other, but they are not
coherent between sets, because each core group has its own L2 cache. To
complicate things even further, the tiler notionally exists within core
group 0, so it is only coherent with that core group. This means that a
vertex/tiler job chain has to be run entirely within core group 0 - or
you will need to insert appropriate cache flushes. For fragment work
you generally don't need coherency between threads, so this isn't a
problem and you can run over all the cores in all groups. For compute
(i.e. OpenCL) you probably care about coherency within a work group,
but you may have several independent jobs that can run in parallel. In
that case you can run some (coherent) work on core group 0, and some
other (independent but coherent) work on core group 1.

2. Starvation. For compute work it's common to insert barriers
requiring all threads to reach the same point in the shader before any
thread can progress. If your workgroup size (i.e. the number of threads
which synchronise on the barrier) is the same as the number of threads
in the core, then all of the core's threads have to be allocated to
compute before the barrier can complete. However, if the compute thread
creator is competing with the fragment thread creator, this can lead to
compute threads sitting idle waiting for fragment threads to complete.
This implies that running compute workloads with barriers at the same
time as fragment work on the same cores isn't very optimal.

</end of background>

kbase has several flags:

* BASE_JD_REQ_COHERENT_GROUP - the job chain must be run on a coherent
  set of cores, i.e. must be restricted to a single core group.

* BASE_JD_REQ_ONLY_COMPUTE - the job chain consists of compute jobs and
  may contain barriers.

* BASE_JD_REQ_SPECIFIC_COHERENT_GROUP - we care about being on a
  particular core group;
  device_nr is used to select which core group (otherwise device_nr is
  ignored).

In practice all this only really matters on the T62x GPU. All other
GPUs have only one core group[1]. So it only really makes sense to use
JS2 on the T62x, where you want to use both JS1 and JS2 to run two
independent jobs: one on each core group. Of course kbase makes all
this into a maze of twisty little passages, all alike! :)

Oh, and there is one hardware workaround (BASE_HW_ISSUE_8987) that uses
JS2. This is to avoid vertex and compute jobs landing on the same slot.
It affects T604 "dev15" only and is because some state was not properly
cleared between jobs.

Steve

[1] There might be multiple L2 caches in hardware, but they are
coherent and are logically a single L2 (only 1 bit set in L2_PRESENT).
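[Editor's note: to make the slot-selection policy described above
concrete, here is a minimal illustrative sketch. It is not kbase's or
panfrost's actual code; the flag values and helper names are invented
for illustration. The core-group count is derived from the footnote's
observation that each core group contributes one bit to L2_PRESENT.]

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical requirement flags, mirroring the kbase semantics above. */
#define JD_REQ_FS           (1u << 0)  /* fragment job chain */
#define JD_REQ_ONLY_COMPUTE (1u << 1)  /* compute jobs, may contain barriers */

/* One core group per L2 cache: count the bits set in L2_PRESENT. */
static unsigned int nr_core_groups(uint64_t l2_present)
{
	return (unsigned int)__builtin_popcountll(l2_present);
}

/*
 * Pick a job slot: fragment work goes to JS0; vertex/tiler (and, by
 * default, compute) to JS1. JS2 is only worth using for compute-only
 * chains when a second core group exists (T62x), so that independent
 * jobs can run on each group in parallel, or when the
 * BASE_HW_ISSUE_8987 workaround (T604 "dev15") requires keeping
 * compute jobs off the vertex slot.
 */
static int pick_job_slot(uint32_t requirements, uint64_t l2_present,
			 bool has_issue_8987)
{
	if (requirements & JD_REQ_FS)
		return 0;                       /* JS0: fragment */

	if (requirements & JD_REQ_ONLY_COMPUTE) {
		if (has_issue_8987)
			return 2;               /* JS2: errata workaround */
		if (nr_core_groups(l2_present) > 1)
			return 2;               /* JS2: one chain per group */
	}

	return 1;                               /* JS1: vertex/tiler/compute */
}

On anything after the T62x there is a single core group and no #8987,
so this collapses to the simple JS0/JS1 split that the patch below
actually implements.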
Ah-ha! Thank you for the detailed explanation; this makes a lot more
sense now :)

> In practice all this only really matters on the T62x GPU. All other
> GPUs have only one core group[1]. So it only really makes sense to use
> JS2 on the T62x where you want to use both JS1 and JS2 to run two
> independent jobs: one on each core group.
>
> Oh, and there is one hardware workaround (BASE_HW_ISSUE_8987) that
> uses JS2. This is to avoid vertex and compute jobs landing on the same
> slot. This affects T604 "dev15" only and is because some state was not
> properly cleared between jobs.

On my end at least, our work with compute has been focused on
RK3399/T860. In userspace, it's tempting to limit T6xx/T720 to e.g.
OpenGL ES 2.0 (between this and other errata, and of course fake-MRT),
at least for the interim. For completeness, we'll maybe need to deal
with this eventually, but it doesn't sound like it needs to be handled
right now? (It sounds like there are no changes needed for compute on
T700+, since #8987 doesn't apply and there's only one core group.)

In that case, maybe it's not worth it to start mucking with the UABI
for a feature that may or may not ever be used.
diff --git a/drivers/gpu/drm/panfrost/TODO b/drivers/gpu/drm/panfrost/TODO
index c2e44add3..8c367a5a6 100644
--- a/drivers/gpu/drm/panfrost/TODO
+++ b/drivers/gpu/drm/panfrost/TODO
@@ -20,8 +20,5 @@
 
 - Support for madvise and a shrinker.
 
-- Compute job support. So called 'compute only' jobs need to be plumbed up to
-  userspace.
-
 - Performance counter support. (Boris)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 9bb9260d9..3e1385a3b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -108,7 +108,9 @@ static int panfrost_job_get_slot(struct panfrost_job *job)
 	if (job->requirements & PANFROST_JD_REQ_FS)
 		return 0;
 
-/* Not exposed to userspace yet */
+	/* Compute jobs can run on JS1, so compute-only jobs can run with this
+	 * simple implementation (useful for backwards compatibility). As an
+	 * optimization, we will eventually want to schedule to JS2. */
 #if 0
 	if (job->requirements & PANFROST_JD_REQ_ONLY_COMPUTE) {
 		if ((job->requirements & PANFROST_JD_REQ_CORE_GRP_MASK) &&
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index b5d370638..25acde34b 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -38,6 +38,14 @@ extern "C" {
 #define DRM_IOCTL_PANFROST_PERFCNT_DUMP	DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_PERFCNT_DUMP, struct drm_panfrost_perfcnt_dump)
 
 #define PANFROST_JD_REQ_FS (1 << 0)
+
+/* Optional (mandatory for T600 r0p0 15dev0 due to errata #8987) hint to the
+ * kernel that the commands only contain JOB_TYPE_COMPUTE jobs, without
+ * vertex/tiler/fragment jobs. If present, the kernel may use this as an
+ * optimization, but it is not strictly necessary. */
+
+#define PANFROST_JD_REQ_ONLY_COMPUTE (1 << 1)
+
 /**
  * struct drm_panfrost_submit - ioctl argument for submitting commands to the 3D
  * engine.
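[Editor's note: for illustration, a minimal sketch of how userspace
might pass the new hint through the existing submit ioctl. The struct
fields and ioctl number are the ones in the panfrost_drm.h uapi header;
the helper name and the surrounding plumbing (how the job descriptor
chain and BO list are produced) are invented for illustration and out
of scope here. Per the commit message, current kernels simply ignore
the hint.]

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

#include "panfrost_drm.h"  /* the uapi header patched above */

/* Submit an already-built compute-only job chain, tagging it with the
 * compute-only hint. 'jobs_gpu_va' is the GPU address of the first job
 * descriptor; 'bo_handles'/'bo_count' name the BOs it references. */
static int submit_compute(int fd, __u64 jobs_gpu_va,
			  __u32 *bo_handles, __u32 bo_count)
{
	struct drm_panfrost_submit submit = {
		.jobs = jobs_gpu_va,
		.bo_handles = (__u64)(uintptr_t)bo_handles,
		.bo_handle_count = bo_count,
		/* Not PANFROST_JD_REQ_FS: this is not fragment work. A
		 * kernel without JS2 scheduling treats the chain as it
		 * would any non-fragment submission (JS1). */
		.requirements = PANFROST_JD_REQ_ONLY_COMPUTE,
	};

	if (ioctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &submit) < 0) {
		perror("DRM_IOCTL_PANFROST_SUBMIT");
		return -1;
	}
	return 0;
}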
Midgard contains two job slots capable of compute jobs, JS1 and JS2. As
an optimization, it is preferable to schedule compute-only workloads to
JS2, although compute jobs are functionally able to be scheduled to JS1
(barring an obscure erratum). Accordingly, we reserve a compute-only
hint in the UABI to allow future compute-equipped userspace and future
compute-optimized kernelspace to hint towards JS2.

At the moment, the hint is ignored, but this is harmless. I have
verified that compute jobs can be successfully submitted and executed
with an appropriate userspace against a 5.1 kernel without this hint.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
---
 drivers/gpu/drm/panfrost/TODO           | 3 ---
 drivers/gpu/drm/panfrost/panfrost_job.c | 4 +++-
 include/uapi/drm/panfrost_drm.h         | 8 ++++++++
 3 files changed, 11 insertions(+), 4 deletions(-)