Message ID | 20210126214626.16260-1-brian.welty@intel.com (mailing list archive) |
---|---|
Series | cgroup support for GPU devices |
On 2021/1/27 5:46 AM, Brian Welty wrote:
> We'd like to revisit the proposal of a GPU cgroup controller for managing
> GPU devices but with just a basic set of controls. This series is based on
> the prior patch series from Kenny Ho [1]. We take Kenny's base patches
> which implement the basic framework for the controller, but we propose an
> alternate set of control files. Here we've taken a subset of the controls
> proposed in earlier discussion on ML here [2].
>
> This series proposes a set of device memory controls (gpu.memory.current,
> gpu.memory.max, and gpu.memory.total) and accounting of GPU time usage
> (gpu.sched.runtime). GPU time sharing controls are left as future work.
> These are implemented within the GPU controller along with integration/usage
> of the device memory controls by the i915 device driver.
>
> As an accelerator or GPU device is similar in many respects to a CPU with
> (or without) attached system memory, the basic principle here is try to
> copy the semantics of existing controls from other controllers when possible
> and where these controls serve the same underlying purpose.
> For example, the memory.max and memory.current controls are based on
> same controls from MEMCG controller.

It seems not to be DRM specific, or even GPU specific. Would we have a universal
control group for any accelerator, GPGPU device etc., that holds sharable resources
like device memory, compute utility, bandwidth, with an extra control file to select
between devices (or vendors)?

e.g. /cgname.device that stores the PCI BDF, or enum(intel, amdgpu, nvidia, ...),
defaults to none, meaning not enabled.
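
For illustration only, here is a minimal userspace sketch of how the proposed memory controls might be read if they follow the MEMCG semantics the cover letter refers to. It assumes, hypothetically, that gpu.memory.current holds a single byte count and that gpu.memory.max holds either a byte count or the literal "max", as memory.max does in cgroup v2. The actual file format in this series may differ (for example, per-device keyed entries), and the helper names below are invented for the example.

```python
from pathlib import Path
from typing import Optional


def parse_limit(text: str) -> Optional[int]:
    """Parse a memcg-style limit; the literal "max" means no limit (None)."""
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def read_gpu_memory(cgroup_dir: str) -> dict:
    """Read the proposed gpu.memory.* files from one cgroup directory.

    Assumption: each file holds a single value, as memory.current and
    memory.max do in the cgroup v2 memory controller.
    """
    base = Path(cgroup_dir)
    return {
        "current": int((base / "gpu.memory.current").read_text().strip()),
        "max": parse_limit((base / "gpu.memory.max").read_text()),
    }


def headroom(usage: dict) -> Optional[int]:
    """Bytes left before the limit is reached, or None when unlimited."""
    if usage["max"] is None:
        return None
    return usage["max"] - usage["current"]
```

Called with the path of a cgroup directory (hypothetically something like a subdirectory of /sys/fs/cgroup), `read_gpu_memory` returns the current usage and the limit, and `headroom` reports how much device memory the group could still allocate under that limit.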
On 1/28/2021 7:00 PM, Xingyou Chen wrote:
> On 2021/1/27 5:46 AM, Brian Welty wrote:
>
>> We'd like to revisit the proposal of a GPU cgroup controller for managing
>> GPU devices but with just a basic set of controls. This series is based on
>> the prior patch series from Kenny Ho [1]. We take Kenny's base patches
>> which implement the basic framework for the controller, but we propose an
>> alternate set of control files. Here we've taken a subset of the controls
>> proposed in earlier discussion on ML here [2].
>>
>> This series proposes a set of device memory controls (gpu.memory.current,
>> gpu.memory.max, and gpu.memory.total) and accounting of GPU time usage
>> (gpu.sched.runtime). GPU time sharing controls are left as future work.
>> These are implemented within the GPU controller along with integration/usage
>> of the device memory controls by the i915 device driver.
>>
>> As an accelerator or GPU device is similar in many respects to a CPU with
>> (or without) attached system memory, the basic principle here is try to
>> copy the semantics of existing controls from other controllers when possible
>> and where these controls serve the same underlying purpose.
>> For example, the memory.max and memory.current controls are based on
>> same controls from MEMCG controller.
>
> It seems not to be DRM specific, or even GPU specific. Would we have an universal
> control group for any accelerator, GPGPU device etc, that hold sharable resources
> like device memory, compute utility, bandwidth, with extra control file to select
> between devices(or vendors)?
>
> e.g. /cgname.device that stores PCI BDF, or enum(intel, amdgpu, nvidia, ...),
> defaults to none, means not enabled.
>

Hi, thanks for the feedback. Yes, I tend to agree. I've asked about this in
earlier work; my suggestion is to name the controller something like 'XPU' to
be clear that these controls could apply to more than GPUs.

But at least for now, based on Tejun's reply [1], the feedback is to try to keep
this controller as small and focused as possible on just GPU, at least until
we get some consensus on a set of controls for GPU. But for this we need more
active input from the community.

-Brian

[1] https://lists.freedesktop.org/archives/dri-devel/2019-November/243167.html
On Mon, Feb 01, 2021 at 03:21:35PM -0800, Brian Welty wrote:
>
> On 1/28/2021 7:00 PM, Xingyou Chen wrote:
> > On 2021/1/27 5:46 AM, Brian Welty wrote:
> >
> >> We'd like to revisit the proposal of a GPU cgroup controller for managing
> >> GPU devices but with just a basic set of controls. This series is based on
> >> the prior patch series from Kenny Ho [1]. We take Kenny's base patches
> >> which implement the basic framework for the controller, but we propose an
> >> alternate set of control files. Here we've taken a subset of the controls
> >> proposed in earlier discussion on ML here [2].
> >>
> >> This series proposes a set of device memory controls (gpu.memory.current,
> >> gpu.memory.max, and gpu.memory.total) and accounting of GPU time usage
> >> (gpu.sched.runtime). GPU time sharing controls are left as future work.
> >> These are implemented within the GPU controller along with integration/usage
> >> of the device memory controls by the i915 device driver.
> >>
> >> As an accelerator or GPU device is similar in many respects to a CPU with
> >> (or without) attached system memory, the basic principle here is try to
> >> copy the semantics of existing controls from other controllers when possible
> >> and where these controls serve the same underlying purpose.
> >> For example, the memory.max and memory.current controls are based on
> >> same controls from MEMCG controller.
> >
> > It seems not to be DRM specific, or even GPU specific. Would we have an universal
> > control group for any accelerator, GPGPU device etc, that hold sharable resources
> > like device memory, compute utility, bandwidth, with extra control file to select
> > between devices(or vendors)?
> >
> > e.g. /cgname.device that stores PCI BDF, or enum(intel, amdgpu, nvidia, ...),
> > defaults to none, means not enabled.
> >
>
> Hi, thanks for the feedback. Yes, I tend to agree. I've asked about this in
> earlier work; my suggestion is to name the controller something like 'XPU' to
> be clear that these controls could apply to more than GPU.
>
> But at least for now, based on Tejun's reply [1], the feedback is to try and keep
> this controller as small and focused as possible on just GPU. At least until
> we get some consensus on set of controls for GPU..... but for this we need more
> active input from community......

There's also nothing stopping anyone from exposing any kind of XPU as a
drivers/gpu device, aside from the "full stack must be open" requirement we
have in drm. And frankly, with drm being a very confusing acronym, we could
also rename GPU to be the "general processing unit" subsystem :-)
-Daniel

>
> -Brian
>
> [1] https://lists.freedesktop.org/archives/dri-devel/2019-November/243167.html