Message ID | 20190501140438.9506-1-brian.welty@intel.com (mailing list archive) |
---|---|
Series | cgroup support for GPU devices |
On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> In containerized or virtualized environments, there is desire to have
> controls in place for resources that can be consumed by users of a GPU
> device. This RFC patch series proposes a framework for integrating
> use of existing cgroup controllers into device drivers.
> The i915 driver is updated in this series as our primary use case to
> leverage this framework and to serve as an example for discussion.
>
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
>   * control allocation of device memory (reuse of memcg)

Count us (Mellanox) too: our RDMA devices expose special, limited-size
device memory to users, and we would like to provide an option to use
cgroup to control its exposure.

> and with future work, we could extend to:
>   * track and control share of GPU time (reuse of cpu/cpuacct)
>   * apply mask of allowed execution engines (reuse of cpusets)
>
> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup. This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.
>
> Per-device controls would be exposed in the cgroup filesystem as:
>   mount/<cgroup_name>/<subsys_name>.devices/<dev_name>/<subsys_files>
> such as (for example):
>   mount/<cgroup_name>/memory.devices/<dev_name>/memory.max
>   mount/<cgroup_name>/memory.devices/<dev_name>/memory.current
>   mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.stat
>   mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.weight
>
> The drm/i915 patch in this series is based on top of other RFC work [1]
> for i915 device memory support.
>
> AMD [2] and Intel [3] have proposed related work in this area within the
> last few years, listed below as reference. This new RFC reuses existing
> cgroup controllers and takes a different approach than prior work.
>
> Finally, some potential discussion points for this series:
>   * merge proposed <subsys_name>.devices into a single devices directory?
>   * allow devices to have multiple registrations for subsets of resources?
>   * document a 'common charging policy' for device drivers to follow?
>
> [1] https://patchwork.freedesktop.org/series/56683/
> [2] https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> [3] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
>
> Brian Welty (5):
>   cgroup: Add cgroup_subsys per-device registration framework
>   cgroup: Change kernfs_node for directories to store cgroup_subsys_state
>   memcg: Add per-device support to memory cgroup subsystem
>   drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
>   drm/i915: Use memory cgroup for enforcing device memory limit
>
>  drivers/gpu/drm/drm_drv.c                  |  12 +
>  drivers/gpu/drm/drm_gem.c                  |   7 +
>  drivers/gpu/drm/i915/i915_drv.c            |   2 +-
>  drivers/gpu/drm/i915/intel_memory_region.c |  24 +-
>  include/drm/drm_device.h                   |   3 +
>  include/drm/drm_drv.h                      |   8 +
>  include/drm/drm_gem.h                      |  11 +
>  include/linux/cgroup-defs.h                |  28 ++
>  include/linux/cgroup.h                     |   3 +
>  include/linux/memcontrol.h                 |  10 +
>  kernel/cgroup/cgroup-v1.c                  |  10 +-
>  kernel/cgroup/cgroup.c                     | 310 ++++++++++++++++++---
>  mm/memcontrol.c                            | 183 +++++++++++-
>  13 files changed, 552 insertions(+), 59 deletions(-)
>
> --
> 2.21.0
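To make the semantics the cover letter describes a little more concrete, here is a rough, self-contained C model of a per-device usage/limit pair living inside a cgroup and being charged on device-memory allocations. It is a sketch only: every struct and function name below is invented for illustration and is not a symbol from the patch series.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a per-device cgroup_subsys_state: one usage/limit pair per
 * device per cgroup, mirroring memory.devices/<dev>/memory.{max,current}. */
struct dev_css {
	const char *dev_name;
	uint64_t max;		/* limit for this device */
	uint64_t current_use;	/* current usage for this device */
};

/* Charge a device-memory allocation; fail if it would exceed the limit. */
static int dev_css_try_charge(struct dev_css *css, uint64_t bytes)
{
	if (css->current_use + bytes > css->max)
		return -1;
	css->current_use += bytes;
	return 0;
}

static void dev_css_uncharge(struct dev_css *css, uint64_t bytes)
{
	css->current_use -= bytes;
}

int main(void)
{
	/* A cgroup granted a 256 MiB limit on device "card0". */
	struct dev_css card0 = { "card0", 256ULL << 20, 0 };

	if (dev_css_try_charge(&card0, 64ULL << 20) == 0)
		printf("%s: charged 64 MiB, current=%llu bytes\n",
		       card0.dev_name, (unsigned long long)card0.current_use);

	dev_css_uncharge(&card0, 64ULL << 20);
	return 0;
}
```

The point of the proposal is that this charge/limit pair would be kept per device inside an existing controller's per-cgroup state, so a driver's allocation path charges against its own device limit rather than against host RAM.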
On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
> Count us (Mellanox) too: our RDMA devices expose special, limited-size
> device memory to users, and we would like to provide an option to use
> cgroup to control its exposure.

Doesn't RDMA already have a separate cgroup? Why not implement it there?
On 5/2/2019 3:48 PM, Kenny Ho wrote:
> On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
>> Count us (Mellanox) too: our RDMA devices expose special, limited-size
>> device memory to users, and we would like to provide an option to use
>> cgroup to control its exposure.

Hi Leon, great to hear, and happy to work with you and the RDMA community
to shape this framework for use by RDMA devices as well. The intent was
to support more than GPU devices.

Incidentally, I also wanted to ask about the rdma cgroup controller and
whether there is interest in updating the device registration implemented
in that controller. It could use the cgroup_device_register() that is
proposed here. But this is perhaps future work, so we can discuss it
separately.

> Doesn't RDMA already have a separate cgroup? Why not implement it there?

Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
rationale I gave in the cover letter: implementing this in the rdma
controller would mean duplicating the existing memcg controls there.

Is AMD interested in collaborating to help shape this framework? It is
intended to be device-neutral, so it could be leveraged by various types
of devices. If you have an alternative solution well underway, then maybe
we can work together to merge our efforts into one. In the end, the DRM
community is best served by a common solution.
On Fri, May 03, 2019 at 02:14:33PM -0700, Welty, Brian wrote:
> On 5/2/2019 3:48 PM, Kenny Ho wrote:
> > On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
> >> Count us (Mellanox) too: our RDMA devices expose special, limited-size
> >> device memory to users, and we would like to provide an option to use
> >> cgroup to control its exposure.
>
> Hi Leon, great to hear, and happy to work with you and the RDMA community
> to shape this framework for use by RDMA devices as well. The intent was
> to support more than GPU devices.
>
> Incidentally, I also wanted to ask about the rdma cgroup controller and
> whether there is interest in updating the device registration implemented
> in that controller. It could use the cgroup_device_register() that is
> proposed here. But this is perhaps future work, so we can discuss it
> separately.

I'll try to take a look later this week.

> > Doesn't RDMA already have a separate cgroup? Why not implement it there?
>
> Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> rationale I gave in the cover letter: implementing this in the rdma
> controller would mean duplicating the existing memcg controls there.

Exactly. I didn't feel comfortable adding the notion of "device memory"
to the RDMA cgroup and postponed that decision. RDMA operates on verbs
objects, and all of our user-space API is built around that concept. In
the end, a system administrator would have a hard time understanding the
difference between memcg memory and RDMA memory.
On Sun, May 5, 2019 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > Doesn't RDMA already have a separate cgroup? Why not implement it there?
> >
> > Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> > rationale I gave in the cover letter: implementing this in the rdma
> > controller would mean duplicating the existing memcg controls there.
>
> Exactly. I didn't feel comfortable adding the notion of "device memory"
> to the RDMA cgroup and postponed that decision. RDMA operates on verbs
> objects, and all of our user-space API is built around that concept. In
> the end, a system administrator would have a hard time understanding the
> difference between memcg memory and RDMA memory.

Interesting. I actually don't understand this part (I've worked on the
devops/sysadmin side of things, but never with RDMA.) Don't applications
that use RDMA require some awareness of RDMA? (I mean, you mentioned verbs
and objects... or do they just use regular malloc for buffer allocation
and then send it through some function?) As a user, I would have this
question: why do I need to configure some RDMA resources under the rdma
cgroup while other RDMA resources live in a different, seemingly unrelated
cgroup?

I think we need to be careful about drawing the line between duplication
and over-coupling between subsystems. I have other thoughts and concerns
and will try to organize them into a response in the next few days.

Regards,
Kenny
On Sun, May 05, 2019 at 10:21:30AM -0400, Kenny Ho wrote:
> Interesting. I actually don't understand this part (I've worked on the
> devops/sysadmin side of things, but never with RDMA.) Don't applications
> that use RDMA require some awareness of RDMA? (I mean, you mentioned verbs
> and objects... or do they just use regular malloc for buffer allocation
> and then send it through some function?) As a user, I would have this
> question: why do I need to configure some RDMA resources under the rdma
> cgroup while other RDMA resources live in a different, seemingly unrelated
> cgroup?

We are talking about two different access patterns for this device memory
(DM): one is to use the DM, the other is to configure/limit it. Usually
those actions are performed by different groups.

The first group (programmers) uses a special API [1] through libibverbs [2]
without any notion of cgroups or limitations. The second group (sysadmins)
is less interested in application specifics; for them "device memory" means
"memory", not "RDMA, NIC-specific, internal memory".

[1] ibv_alloc_dm()
    http://man7.org/linux/man-pages/man3/ibv_alloc_dm.3.html
    https://www.openfabrics.org/images/2018workshop/presentations/304_LLiss_OnDeviceMemory.pdf
[2] https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/
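For readers unfamiliar with the API Leon points to, the sketch below allocates a small region of NIC-internal device memory through libibverbs and copies data in and out. It is a minimal illustration only: error handling is trimmed, it assumes the first available RDMA device supports device memory (e.g. an mlx5-class NIC), and the copy length is kept 4-byte aligned.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	if (!devs || !devs[0])
		return 1;

	struct ibv_context *ctx = ibv_open_device(devs[0]);
	if (!ctx)
		return 1;

	/* Allocate 4 KiB of on-device (NIC-internal) memory. */
	struct ibv_alloc_dm_attr attr = { .length = 4096, .log_align_req = 0 };
	struct ibv_dm *dm = ibv_alloc_dm(ctx, &attr);
	if (!dm) {
		fprintf(stderr, "device memory allocation failed (or unsupported)\n");
		return 1;
	}

	/* Copy a small buffer into device memory and read it back. */
	char msg[] = "hello device memory";	/* 20 bytes, 4-byte aligned */
	ibv_memcpy_to_dm(dm, 0, msg, sizeof(msg));

	char out[sizeof(msg)] = { 0 };
	ibv_memcpy_from_dm(out, dm, 0, sizeof(out));
	printf("read back: %s\n", out);

	ibv_free_dm(dm);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	return 0;
}
```

Note that the application never sees cgroups here; a limit of the kind being discussed would be enforced by the kernel at allocation time, which is exactly Leon's split between programmers and sysadmins.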
(Sent again; not sure why my previous email was just a reply instead of a
reply-all.)

On Sun, May 5, 2019 at 12:05 PM Leon Romanovsky <leon@kernel.org> wrote:
> We are talking about two different access patterns for this device memory
> (DM): one is to use the DM, the other is to configure/limit it. Usually
> those actions are performed by different groups.
>
> The first group (programmers) uses a special API [1] through libibverbs [2]
> without any notion of cgroups or limitations. The second group (sysadmins)
> is less interested in application specifics; for them "device memory" means
> "memory", not "RDMA, NIC-specific, internal memory".

Um... I am not sure that answered it, especially in the context of cgroups
(this is just for my curiosity, by the way; I don't know much about RDMA.)
You said sysadmins are less interested in application specifics, but then
how would they make the judgement call on how much "device memory" is
provisioned to one application/container over another (say you have five
cgroups sharing an RDMA device)? What are the consequences of
under-provisioning "device memory" to an application? And if it is all
just memory, can a sysadmin provision more system memory in place of
device memory (that is, are they interchangeable)? I guess I am confused
because if device memory is just memory (not RDMA- or NIC-specific) to
sysadmins, how would they know to set the right amount?

Regards,
Kenny
Leon Romanovsky writes:
> The first group (programmers) uses a special API [1] through libibverbs [2]
> without any notion of cgroups or limitations. The second group (sysadmins)
> is less interested in application specifics; for them "device memory" means
> "memory", not "RDMA, NIC-specific, internal memory".

I'd suggest otherwise, based on historic precedent -- sysadmins are
typically very opinionated about the operation of the memory subsystem
(hence the endless discussions about swap, caching behaviour, etc).
Especially in this case, these types of memory operate fundamentally
differently and have significantly different performance and availability
characteristics. That's not something that can be trivially abstracted
over without non-trivial drawbacks.
On Sun, May 05, 2019 at 12:34:16PM -0400, Kenny Ho wrote:
> Um... I am not sure that answered it, especially in the context of cgroups
> (this is just for my curiosity, by the way; I don't know much about RDMA.)
> You said sysadmins are less interested in application specifics, but then
> how would they make the judgement call on how much "device memory" is
> provisioned to one application/container over another (say you have five
> cgroups sharing an RDMA device)? What are the consequences of
> under-provisioning "device memory" to an application? And if it is all
> just memory, can a sysadmin provision more system memory in place of
> device memory (that is, are they interchangeable)? I guess I am confused
> because if device memory is just memory (not RDMA- or NIC-specific) to
> sysadmins, how would they know to set the right amount?

One of the immediate usages of this DM that comes to my mind is very fast
spinlocks for MPI applications. In that case, the amount of DM is a
property of the network topology of the given MPI cluster, and provisioning
a precise amount of memory ensures that all jobs continue to give maximal
performance despite any programmer's error in DM allocation. In the
under-provisioning scenario, if the application is written correctly, users
will see more latency and less performance due to PCI accesses. Slide 3 in
Liran's presentation gives a brief overview of the motivation.

Thanks
On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> In containerized or virtualized environments, there is desire to have
> controls in place for resources that can be consumed by users of a GPU
> device. This RFC patch series proposes a framework for integrating
> use of existing cgroup controllers into device drivers.
> The i915 driver is updated in this series as our primary use case to
> leverage this framework and to serve as an example for discussion.
>
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
>   * control allocation of device memory (reuse of memcg)
> and with future work, we could extend to:
>   * track and control share of GPU time (reuse of cpu/cpuacct)
>   * apply mask of allowed execution engines (reuse of cpusets)

Please create a separate controller for your purposes. The memory
controller is for traditional RAM. I don't see it having much in common
with what you're trying to do, and it's barely reusing any of the memcg
code. You can use the page_counter API directly.

> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup. This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.
>
> Per-device controls would be exposed in the cgroup filesystem as:
>   mount/<cgroup_name>/<subsys_name>.devices/<dev_name>/<subsys_files>
> such as (for example):
>   mount/<cgroup_name>/memory.devices/<dev_name>/memory.max
>   mount/<cgroup_name>/memory.devices/<dev_name>/memory.current
>   mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.stat
>   mount/<cgroup_name>/cpu.devices/<dev_name>/cpu.weight

Subdirectories for anything other than actual cgroups are a no-go. If you
need a hierarchy, use dotted filenames:

  gpu.memory.max
  gpu.cycles.max

etc., and look at the 'Format' and 'Conventions' sections of
Documentation/admin-guide/cgroup-v2.rst, as well as how the io controller
works, to see how multi-key / multi-device control files are implemented
in cgroup2.
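As a rough sketch of what Johannes suggests, the snippet below shows a hypothetical dedicated controller charging device memory with the in-kernel page_counter API directly, rather than reusing memcg. Only the page_counter_* calls are real kernel API; the struct gpucg and the surrounding function names are invented here for illustration.

```c
#include <linux/cgroup.h>
#include <linux/errno.h>
#include <linux/page_counter.h>

struct gpucg {				/* hypothetical per-cgroup state */
	struct cgroup_subsys_state css;
	struct page_counter devmem;	/* device memory usage/limit, in pages */
};

/* Called from css_alloc: link to the parent counter so charges propagate
 * up the hierarchy, the same way memcg and the io controller behave. */
static void gpucg_counter_init(struct gpucg *gc, struct gpucg *parent)
{
	page_counter_init(&gc->devmem, parent ? &parent->devmem : NULL);
}

/* Driver-side charge on allocation; fails cleanly when over the limit. */
static int gpucg_charge_devmem(struct gpucg *gc, unsigned long nr_pages)
{
	struct page_counter *fail;

	if (!page_counter_try_charge(&gc->devmem, nr_pages, &fail))
		return -ENOMEM;
	return 0;
}

static void gpucg_uncharge_devmem(struct gpucg *gc, unsigned long nr_pages)
{
	page_counter_uncharge(&gc->devmem, nr_pages);
}
```

Following the interface conventions mentioned above, the limit itself would then be exposed through a single control file keyed per device (for example a gpu.memory.max with one line per device, in the style of io.max), rather than through per-device subdirectories.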
Hello,

On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
>   * control allocation of device memory (reuse of memcg)
> and with future work, we could extend to:
>   * track and control share of GPU time (reuse of cpu/cpuacct)
>   * apply mask of allowed execution engines (reuse of cpusets)
>
> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup. This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.

I'm really skeptical about this approach. When creating resource
controllers, I think what's most important and challenging is establishing
the resource model: what the resources are and how they can be
distributed. This patchset is going the other way around, building out
core infrastructure for boilerplate at a significant risk of mixing up
resource models across different types of resources.

IO controllers already implement per-device controls. I'd suggest
following the same interface conventions and implementing a dedicated
controller for the subsystem.

Thanks.
On 5/6/2019 8:26 AM, Tejun Heo wrote:
> I'm really skeptical about this approach. When creating resource
> controllers, I think what's most important and challenging is establishing
> the resource model: what the resources are and how they can be
> distributed. This patchset is going the other way around, building out
> core infrastructure for boilerplate at a significant risk of mixing up
> resource models across different types of resources.
>
> IO controllers already implement per-device controls. I'd suggest
> following the same interface conventions and implementing a dedicated
> controller for the subsystem.

Okay, thanks for the feedback. The preference is clearly to see this done
as a dedicated cgroup controller.

Part of my proposal was an attempt to have devices with "mem like" and
"cpu like" attributes managed by a common controller. We can ignore this
idea for the cpu attributes, as those can just go in a GPU controller.

There might still be merit in having a 'device mem' cgroup controller;
the resource model is then at least no longer mixed up with host memory.
The RDMA community seemed to have some interest in a common controller,
at least for the device memory aspects. Thoughts on this? I believe we
could still reuse the 'struct mem_cgroup' data structure, and there
should be some opportunity to reuse the charging APIs and to integrate
nicely with HMM for charging to device memory, depending on the backing
store.

-Brian
Hello,

On Tue, May 07, 2019 at 12:50:50PM -0700, Welty, Brian wrote:
> There might still be merit in having a 'device mem' cgroup controller;
> the resource model is then at least no longer mixed up with host memory.
> The RDMA community seemed to have some interest in a common controller,
> at least for the device memory aspects. Thoughts on this? I believe we
> could still reuse the 'struct mem_cgroup' data structure, and there
> should be some opportunity to reuse the charging APIs and to integrate
> nicely with HMM for charging to device memory, depending on the backing
> store.

Library-ish sharing is fine, but in terms of interface I think it'd be
better to keep them separate, at least for now. Down the line these
resources may interact with each other in a more integrated way, but I
don't think it's a good idea to try to design and implement resource
models for that preemptively.

Thanks.