
[RFC] cgroup: Document interface files and rationale for DRM controller

Message ID 20191104220847.23283-1-brian.welty@intel.com (mailing list archive)
State New, archived
Series [RFC] cgroup: Document interface files and rationale for DRM controller

Commit Message

Welty, Brian Nov. 4, 2019, 10:08 p.m. UTC
Here is one proposal for a set of interface files to be implemented in
a new DRM controller.  This is an alternative to the set of interface
files in the current v4 of the DRM controller [1].  As there was not a
clear consensus on the interface files and associated controls, it seems
worth considering again whether we should have a set of controls more
consistent in functionality with what the existing CPU and MEM
controllers provide.

This is in some respects a follow-up to my prior RFC series [2] where
I attempted to implement per-device controls within existing cgroup
subsystems.

The basic goal of that design was to reuse well-known controls.  We can
achieve that same goal here by simply having the proposed DRM controller
implement those same interface files with identical names, the intent
being that equivalent functionality is then provided by the underlying
GPU device driver.

Here is the rationale, repeated from the proposed DRM controller
documentation:
As an accelerator or GPU device is similar in many respects to a CPU
with (or without) attached system memory, the design principle here is
to clone existing controls from other controllers where those controls
serve the same fundamental purpose.  For example, rather than inventing
new but similar controls for managing memory allocation, workload
isolation, and scheduling, we clone (in name and, where possible, in
functionality) controls from the following controllers: CPU, CPUSET,
and MEM.  The controls found in those controllers are already well
understood by system administrators, and the intent here is to provide
these controls for use with accelerator and GPU devices via the DRM
controller.
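
To make the correspondence concrete, one possible mapping from existing
controls to the proposed DRM controller names is sketched below (this
pairing is only an assumption for illustration, not something the RFC
formally defines):

  cpuset.cpus,  cpuset.cpus.effective   ->  gpuset.units, gpuset.units.effective
  cpuset.mems,  cpuset.mems.effective   ->  gpuset.mems,  gpuset.mems.effective
  cpu.weight, cpu.weight.nice, cpu.max  ->  sched.weight, sched.weight.nice, sched.max
  memory.current, memory.max, memory.low -> memory.current, memory.max, memory.low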

RFC note:
In fact, to make the above point more strongly, one suggestion is to
rename this cgroup controller to XPU instead of the current DRM.
This makes it clear that it is not GPU-specific and that there is no
reason to restrict it to drivers under drivers/gpu/drm/.

[1] https://lists.freedesktop.org/archives/dri-devel/2019-August/233463.html
[2] https://lists.freedesktop.org/archives/dri-devel/2019-May/216373.html

Signed-off-by: Brian Welty <brian.welty@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 89 +++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

Comments

Tejun Heo Nov. 5, 2019, 12:15 a.m. UTC | #1
On Mon, Nov 04, 2019 at 05:08:47PM -0500, Brian Welty wrote:
> +  gpuset.units
> +  gpuset.units.effective
> +  gpuset.units.partition
> +
> +  gpuset.mems
> +  gpuset.mems.effective
> +  gpuset.mems.partition
> +
> +  sched.max
> +  sched.stats
> +  sched.weight
> +  sched.weight.nice
> +
> +  memory.current
> +  memory.events
> +  memory.high
> +  memory.low
> +  memory.max
> +  memory.min
> +  memory.stat
> +  memory.swap.current
> +  memory.swap.max

I don't understand why it needs to replicate essentially *all* the
interfaces that system resources are implementing from the get-go.
Some of the above have intersecting functionalities and exist more for
historical reasons and I fail to see how distinctions like min vs. low
and high vs. max would make sense for gpus.  Also, why would it have a
separate swap limit of its own?

Please start with something small and intuitive.  I'm gonna nack
anything which sprawls out like this.  Most likely, there's still a
ton you guys need to work through to reach the resource model which is
actually useful and trying to define a comprehensive interface upfront
like this is gonna look really silly and will become an ugly drag by
the time the problem space is actually understood.

It doesn't seem like this is coming through but can you please start
with a simple weight knob?

Thanks.
Welty, Brian Nov. 6, 2019, 12:08 a.m. UTC | #2
On 11/4/2019 4:15 PM, Tejun Heo wrote:
> On Mon, Nov 04, 2019 at 05:08:47PM -0500, Brian Welty wrote:
>> +  gpuset.units
>> +  gpuset.units.effective
>> +  gpuset.units.partition
>> +
>> +  gpuset.mems
>> +  gpuset.mems.effective
>> +  gpuset.mems.partition
>> +
>> +  sched.max
>> +  sched.stats
>> +  sched.weight
>> +  sched.weight.nice
>> +
>> +  memory.current
>> +  memory.events
>> +  memory.high
>> +  memory.low
>> +  memory.max
>> +  memory.min
>> +  memory.stat
>> +  memory.swap.current
>> +  memory.swap.max
> 
> I don't understand why it needs to replicate essentially *all* the
> interfaces that system resources are implementing from the get-go.
> Some of the above have intersecting functionalities and exist more for
> historical reasons and I fail to see how distinctions like min vs. low
> and high vs. max would make sense for gpus.  Also, why would it have a
> separate swap limit of its own?
> 
> Please start with something small and intuitive.  I'm gonna nack
> anything which sprawls out like this.  Most likely, there's still a
> ton you guys need to work through to reach the resource model which is
> actually useful and trying to define a comprehensive interface upfront
> like this is gonna look really silly and will become an ugly drag by
> the time the problem space is actually understood.
> 
> It doesn't seem like this is coming through but can you please start
> with a simple weight knob?
> 
> Thanks.
> 

Thanks Tejun for the feedback.
I was more interested in hearing your thoughts on whether you like
the approach to have a set of controls that are consistent with 
some subset of the existing CPU/MEM ones.  Any feedback on this?
Didn't really mean to suggest that all of these would be included
from the start.

Would you agree that this reduced set is a reasonable starting point?
+  sched.weight
+  memory.current
+  memory.max
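
To make that concrete, a rough usage sketch (file names, write syntax,
and device names below are placeholders only; nothing is finalized):

  # give this cgroup twice the default GPU scheduling weight
  echo 200 > sched.weight
  # read current device memory usage, keyed by device name
  cat memory.current
    card0 134217728
  # cap device memory usage for one device
  echo "card0 268435456" > memory.max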

Thoughts on whether this should be a very GPU-specific cgroup controller,
or should be more forward-looking so as to be useful for other
'accelerator' type devices as well?

Thanks,
-Brian
Tejun Heo Nov. 7, 2019, 3:29 p.m. UTC | #3
Hello,

On Tue, Nov 05, 2019 at 04:08:22PM -0800, Brian Welty wrote:
> I was more interested in hearing your thoughts on whether you like
> the approach to have a set of controls that are consistent with 
> some subset of the existing CPU/MEM ones.  Any feedback on this?
> Didn't really mean to suggest that all of these would be included
> from the start.

I don't see why they should be synchronized.  If it ends up being
about the same anyway, there's no reason to not sync them but that
doesn't seem very likely to me and trying to sync sounds like adding
an unnecessary constraint.  One side of the ballot is possibly missing
on aesthetics a bit while the other side is constraining interface
design even before understanding the design space, so...

> Would you agree that this reduced set is a reasonable starting point?
> +  sched.weight
> +  memory.current
> +  memory.max
> 
> Thoughts on whether this should be a very GPU-specific cgroup controller,
> or should be more forward-looking so as to be useful for other
> 'accelerator' type devices as well?

My preference is starting small and focused.  GPU by itself is already
a big enough problem and the progress up to this point evidently
indicates even that alone is poorly mapped out.  Please start with the
smallest and most common (tied to usage, not hardware) interface
that's viable.

Thanks.

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5361ebec3361..2a713059ccbd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2050,6 +2050,95 @@  RDMA Interface Files
 	  mlx4_0 hca_handle=1 hca_object=20
 	  ocrdma1 hca_handle=1 hca_object=23
 
+DRM
+----
+
+The "drm" controller regulates the distribution and accounting of
+resources within a GPU device (or other similar accelerator device), but
+only those controlled by device drivers under drivers/gpu/drm/ and thus
+which register with DRM.
+
+As an accelerator or GPU device is similar in many respects to a CPU
+with (or without) attached system memory, the design principle here is
+to clone existing controls from other controllers where those controls
+serve the same fundamental purpose.  For example, rather than inventing
+new but similar controls for managing memory allocation, workload
+isolation, and scheduling, we clone (in name and, where possible, in
+functionality) controls from the following controllers: CPU, CPUSET,
+and MEM.  The controls found in those controllers are already well
+understood by system administrators, and the intent here is to provide
+these controls for use with accelerator and GPU devices via the DRM
+controller.
+
+RFC note:
+In fact, to make the above point more strongly, one suggestion is to
+rename this cgroup controller to XPU instead of the current DRM.
+This makes it clear that it is not GPU-specific and that there is no
+reason to restrict it to drivers under drivers/gpu/drm/.
+
+DRM Interface Files
+~~~~~~~~~~~~~~~~~~~
+
+The DRM controller supports the use of multiple DRM devices within a
+cgroup.  As such, for all interface files, output is keyed by device
+name and is not ordered.
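+
+As a hypothetical example (device names and values are purely
+illustrative), reading memory.current in a cgroup using two devices
+might produce::
+
+	  card0 268435456
+	  card1 1073741824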
+
+The first set of control files is intended to clone functionality from
+the CPUSET controller and thus provides a mechanism for assigning a set
+of workload execution units and a set of attached memories to a cgroup
+in order to provide resource isolation.  The term 'workload execution
+unit' is deliberately abstract, as it is unrealistic to expect a common
+underlying hardware implementation across all devices.  The intent is
+to represent the set of hardware resources that provide scheduling
+and/or partitioning of workloads; depending on the device, this may
+map to 'GPU cores' or to 'hardware engines'.
+
+  gpuset.units
+  gpuset.units.effective
+  gpuset.units.partition
+
+  gpuset.mems
+  gpuset.mems.effective
+  gpuset.mems.partition
+
+[As this is an RFC, the controls above are not yet individually
+described.  For now, please refer to the CPUSET interface file
+documentation, as these are intended to provide equivalent
+functionality.]
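+
+As a purely hypothetical illustration (the exact format is an
+assumption here, not defined by this RFC), gpuset.units might list the
+execution units available to the cgroup per device, in a
+cpuset.cpus-style range format::
+
+	  card0 0-15
+	  card1 0-7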
+
+The next set of control files is intended to clone functionality from
+the CPU controller and thus provides a mechanism to influence workload
+scheduling.
+
+  sched.max
+  sched.stats
+  sched.weight
+  sched.weight.nice
+
+[As this is an RFC, the controls above are not yet individually
+described.  For now, please refer to the CPU interface file
+documentation, as these are intended to provide equivalent
+functionality.]
+
+The next set of control files is intended to clone functionality from
+the MEM controller and thus provides a mechanism for regulating the
+allocation and accounting of attached memory.  All memory amounts are
+in bytes.
+
+  memory.current
+  memory.events
+  memory.high
+  memory.low
+  memory.max
+  memory.min
+  memory.stat
+  memory.swap.current
+  memory.swap.max
+
+Potentially, with substantial justification, the above can be expanded
+in the future with new memory.xxx.yyy controls for additional memory
+'types' or 'placement regions' that are distinct from each other and
+warrant separate controls (such as memory.swap.yyy).
+
+[As this is an RFC, the controls above are not yet individually
+described.  For now, please refer to the MEM interface file
+documentation, as these are intended to provide equivalent
+functionality.]
+
 
 Misc
 ----