Message ID | 20230712114605.519432-17-tvrtko.ursulin@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | DRM cgroup controller with scheduling control and memory stats
On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
> $ cat drm.memory.stat
> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> Data is generated on demand for simplicity of implementation, i.e. no running
> totals are kept or accounted during migrations and such. Various
> optimisations such as cheaper collection of data are possible but
> deliberately left out for now.
>
> Overall, the feature is deemed to be useful to container orchestration
> software (and manual management).
>
> Limits, either soft or hard, are not envisaged to be implemented on top of
> this approach due to the on-demand nature of collecting the stats.

So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.

Thanks.
Hey, On 2023-07-22 00:21, Tejun Heo wrote: > On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote: >> $ cat drm.memory.stat >> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936 >> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0 >> >> Data is generated on demand for simplicty of implementation ie. no running >> totals are kept or accounted during migrations and such. Various >> optimisations such as cheaper collection of data are possible but >> deliberately left out for now. >> >> Overall, the feature is deemed to be useful to container orchestration >> software (and manual management). >> >> Limits, either soft or hard, are not envisaged to be implemented on top of >> this approach due on demand nature of collecting the stats. > > So, yeah, if you want to add memory controls, we better think through how > the fd ownership migration should work. I've taken a look at the series, since I have been working on cgroup memory eviction. The scheduling stuff will work for i915, since it has a purely software execlist scheduler, but I don't think it will work for GuC (firmware) scheduling or other drivers that use the generic drm scheduler. For something like this, you would probably want it to work inside the drm scheduler first. Presumably, this can be done by setting a weight on each runqueue, and perhaps adding a callback to update one for a running queue. Calculating the weights hierarchically might be fun.. I have taken a look at how the rest of cgroup controllers change ownership when moved to a different cgroup, and the answer was: not at all. If we attempt to create the scheduler controls only on the first time the fd is used, you could probably get rid of all the tracking. This can be done very easily with the drm scheduler. WRT memory, I think the consensus is to track system memory like normal memory. Stolen memory doesn't need to be tracked. It's kernel only memory, used for internal bookkeeping only. The only time userspace can directly manipulate stolen memory, is by mapping the pinned initial framebuffer to its own address space. The only allocation it can do is when a framebuffer is displayed, and framebuffer compression creates some stolen memory. Userspace is not aware of this though, and has no way to manipulate those contents. Cheers, ~Maarten
On 26/07/2023 11:14, Maarten Lankhorst wrote:
> Hey,
>
> On 2023-07-22 00:21, Tejun Heo wrote:
>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>> $ cat drm.memory.stat
>>> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>>
>>> Data is generated on demand for simplicity of implementation, i.e. no running
>>> totals are kept or accounted during migrations and such. Various
>>> optimisations such as cheaper collection of data are possible but
>>> deliberately left out for now.
>>>
>>> Overall, the feature is deemed to be useful to container orchestration
>>> software (and manual management).
>>>
>>> Limits, either soft or hard, are not envisaged to be implemented on top of
>>> this approach due to the on-demand nature of collecting the stats.
>>
>> So, yeah, if you want to add memory controls, we better think through how
>> the fd ownership migration should work.
>
> I've taken a look at the series, since I have been working on cgroup
> memory eviction.
>
> The scheduling stuff will work for i915, since it has a purely software
> execlist scheduler, but I don't think it will work for GuC (firmware)
> scheduling or other drivers that use the generic drm scheduler.

It actually works - I used to have a blurb in the cover letter about it
but apparently I dropped it. Just a bit less well with many clients,
since there are fewer priority levels.

All that the design requires from the individual drivers is some way to
react to the "you are over budget by this much" signal. The rest is
driver and backend specific.

> For something like this, you would probably want it to work inside the
> drm scheduler first. Presumably, this can be done by setting a weight on
> each runqueue, and perhaps adding a callback to update one for a running
> queue. Calculating the weights hierarchically might be fun..

It is not needed to work in the drm scheduler first. In fact drm
scheduler based drivers can plug into what I have since it already has
the notion of scheduling priorities.

They would only need to implement a hook which allows the cgroup
controller to query client GPU utilisation and another to receive the
over-budget signal.

Amdgpu and msm AFAIK could be easy candidates because they both support
per client utilisation and priorities.

Looks like I need to put all this info back into the cover letter.

Also, hierarchic weights and time budgets are all already there. What
could be done later is make this all smarter and respect the time budget
with more precision. That would however, in many cases including Intel,
require co-operation with the firmware. In any case it is only work in
the implementation, while the cgroup control interface remains the same.

> I have taken a look at how the rest of cgroup controllers change
> ownership when moved to a different cgroup, and the answer was: not at
> all. If we attempt to create the scheduler controls only on the first
> time the fd is used, you could probably get rid of all the tracking.

Can you send a CPU file descriptor from process A to process B and have
CPU usage belonging to process B show up in process A's cgroup, or
vice-versa? Nope, I am not making any sense, am I? My point being it is
not like-to-like, the model is different.

No ownership transfer would mean that in wide deployments all GPU
utilisation would be assigned to Xorg and so there would be no point to
any of this. No way to throttle a cgroup with unimportant GPU clients,
for instance.

> This can be done very easily with the drm scheduler.
>
> WRT memory, I think the consensus is to track system memory like normal
> memory. Stolen memory doesn't need to be tracked. It's kernel only
> memory, used for internal bookkeeping only.
>
> The only time userspace can directly manipulate stolen memory is by
> mapping the pinned initial framebuffer to its own address space. The
> only allocation it can do is when a framebuffer is displayed, and
> framebuffer compression creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

Stolen memory is irrelevant and not something the cgroup controller
knows about. The point is that drivers report which memory regions they
have and their utilisation.

Imagine instead of stolen it said vram0, or on Intel multi-tile it shows
local0 and local1. People working with containers are interested to see
this breakdown. I guess the parallel and use case here is closer to
memory.numa_stat.

Regards,

Tvrtko
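For illustration, reacting to the over-budget signal on the driver side could look roughly like the sketch below. The signal_budget() prototype is the drm_cgroup_ops callback proposed later in this series; all foo_* names are hypothetical, and a real driver would map the demotion onto whatever priority or timeslicing mechanism its backend (or firmware) exposes. As Tvrtko notes, backends with fewer priority levels would simply respond more coarsely.

```c
/*
 * Hypothetical driver glue, a sketch only: the signal_budget() prototype
 * matches the drm_cgroup_ops callback proposed in this series, while all
 * foo_* identifiers are made up for illustration.
 */
#include <drm/drm_drv.h>
#include <drm/drm_file.h>

struct foo_file_priv {
	int cg_prio;	/* -1 = demoted by the cgroup controller, 0 = normal */
};

/* Backend specific: re-prioritise all contexts owned by this client. */
static void foo_set_client_priority(struct drm_file *file, int prio)
{
	/* e.g. adjust scheduler entity priorities or a firmware knob */
}

static int foo_cg_signal_budget(struct drm_file *file, u64 used, u64 budget)
{
	struct foo_file_priv *priv = file->driver_priv;
	int prio = used > budget ? -1 : 0;

	if (prio != priv->cg_prio) {
		priv->cg_prio = prio;
		foo_set_client_priority(file, prio);
	}

	return 0;
}

static const struct drm_cgroup_ops foo_cg_ops = {
	/* The GPU utilisation query hook from earlier patches goes here too. */
	.signal_budget = foo_cg_signal_budget,
};
```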
On 21/07/2023 23:21, Tejun Heo wrote: > On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote: >> $ cat drm.memory.stat >> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936 >> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0 >> >> Data is generated on demand for simplicty of implementation ie. no running >> totals are kept or accounted during migrations and such. Various >> optimisations such as cheaper collection of data are possible but >> deliberately left out for now. >> >> Overall, the feature is deemed to be useful to container orchestration >> software (and manual management). >> >> Limits, either soft or hard, are not envisaged to be implemented on top of >> this approach due on demand nature of collecting the stats. > > So, yeah, if you want to add memory controls, we better think through how > the fd ownership migration should work. It would be quite easy to make the implicit migration fail - just the matter of failing the first ioctl, which is what triggers the migration, after the file descriptor access from a new owner. But I don't think I can really add that in the RFC given I have no hard controls or anything like that. With GPU usage throttling it doesn't really apply, at least I don't think it does, since even when migrated to a lower budget group it would just get immediately de-prioritized. I don't think hard GPU time limits are feasible in general, and while soft might be, again I don't see that any limiting would necessarily have to run immediately on implicit migration. Second part of the story are hypothetical/future memory controls. I think first thing to say is that implicit migration is important, but it is not really established to use the file descriptor from two places or to migrate more than once. It is simply fresh fd which gets sent to clients from Xorg, which is one of the legacy ways of doing things. So we probably can just ignore that given no significant amount of memory ownership would be getting migrated. And for drm.memory.stat I think what I have is good enough - both private and shared data get accounted, for any clients that have handles to particular buffers. Maarten was working on memory controls so maybe he would have more thoughts on memory ownership and implicit migration. But I don't think there is anything incompatible with that and drm.memory.stats as proposed here, given how the categories reported are the established ones from the DRM fdinfo spec, and it is fact of the matter that we can have multiple memory regions per driver. The main thing that would change between this RFC and future memory controls in the area of drm.memory.stat is the implementation - it would have to get changed under the hood from "collect on query" to "account at allocation/free/etc". But that is just implementation details. Regards, Tvrtko
Hello, On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote: > > So, yeah, if you want to add memory controls, we better think through how > > the fd ownership migration should work. > > I've taken a look at the series, since I have been working on cgroup memory > eviction. > > The scheduling stuff will work for i915, since it has a purely software > execlist scheduler, but I don't think it will work for GuC (firmware) > scheduling or other drivers that use the generic drm scheduler. > > For something like this, you would probably want it to work inside the drm > scheduler first. Presumably, this can be done by setting a weight on each > runqueue, and perhaps adding a callback to update one for a running queue. > Calculating the weights hierarchically might be fun.. I don't have any idea on this front. The basic idea of making high level distribution decisions in core code and letting individual drivers enforce that in a way which fits them the best makes sense to me but I don't know enough to have an opinion here. > I have taken a look at how the rest of cgroup controllers change ownership > when moved to a different cgroup, and the answer was: not at all. If we For persistent resources, that's the general rule. Whoever instantiates a resource gets to own it until the resource gets freed. There is an exception with the pid controller and there are discussions around whether we want some sort of migration behavior with memcg but yes by and large instantiator being the owner is the general model cgroup follows. > attempt to create the scheduler controls only on the first time the fd is > used, you could probably get rid of all the tracking. > This can be done very easily with the drm scheduler. > > WRT memory, I think the consensus is to track system memory like normal > memory. Stolen memory doesn't need to be tracked. It's kernel only memory, > used for internal bookkeeping only. > > The only time userspace can directly manipulate stolen memory, is by mapping > the pinned initial framebuffer to its own address space. The only allocation > it can do is when a framebuffer is displayed, and framebuffer compression > creates some stolen memory. Userspace is not > aware of this though, and has no way to manipulate those contents. So, my dumb understanding: * Ownership of an fd can be established on the first ioctl call and doesn't need to be migrated afterwards. There are no persistent resources to migration on the first call. * Memory then can be tracked in a similar way to memcg. Memory gets charged to the initial instantiator and doesn't need to be moved around afterwards. There may be some discrepancies around stolen memory but the magnitude of inaccuracy introduced that way is limited and bound and can be safely ignored. Is that correct? Thanks.
Hello,

On Wed, Jul 26, 2023 at 05:44:28PM +0100, Tvrtko Ursulin wrote:
...
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
>
> It would be quite easy to make the implicit migration fail - just the matter
> of failing the first ioctl, which is what triggers the migration, after the
> file descriptor access from a new owner.

So, it'd be best if there's no migration involved at all as per the
discussion with Maarten.

> But I don't think I can really add that in the RFC given I have no hard
> controls or anything like that.
>
> With GPU usage throttling it doesn't really apply, at least I don't think it
> does, since even when migrated to a lower budget group it would just get
> immediately de-prioritized.
>
> I don't think hard GPU time limits are feasible in general, and while soft
> might be, again I don't see that any limiting would necessarily have to run
> immediately on implicit migration.

Yeah, I wouldn't worry about hard allocation of GPU time. CPU RT control
does that but it's barely used.

> Second part of the story are hypothetical/future memory controls.
>
> I think first thing to say is that implicit migration is important, but it
> is not really established to use the file descriptor from two places or to
> migrate more than once. It is simply fresh fd which gets sent to clients
> from Xorg, which is one of the legacy ways of doing things.
>
> So we probably can just ignore that given no significant amount of memory
> ownership would be getting migrated.

So, if this is the case, it'd be better to clarify this. I.e. if the
summary is:

  The fd gets assigned to the user with a certain action, at which point
  the fd doesn't have significant resources attached to it, and the fd
  can't be moved to some other cgroup afterwards.

then everything is pretty simple. No need to worry about migration at
all. The fd just gets assigned once at the beginning and everything gets
accounted towards that afterwards.

> And for drm.memory.stat I think what I have is good enough - both private
> and shared data get accounted, for any clients that have handles to
> particular buffers.
>
> Maarten was working on memory controls so maybe he would have more thoughts
> on memory ownership and implicit migration.
>
> But I don't think there is anything incompatible with that and
> drm.memory.stats as proposed here, given how the categories reported are the
> established ones from the DRM fdinfo spec, and it is fact of the matter that
> we can have multiple memory regions per driver.
>
> The main thing that would change between this RFC and future memory controls
> in the area of drm.memory.stat is the implementation - it would have to get
> changed under the hood from "collect on query" to "account at
> allocation/free/etc". But that is just implementation details.

I'd much prefer to straighten this out before adding a preliminary
stat-only thing. If the previously described ownership model is
sufficient, none of this is complicated, right? We can just add counters
to track the resources and print them out.

Thanks.
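For reference, the "account at allocation/free" model Tejun describes here, as opposed to the collect-on-query approach in this RFC, reduces to something like the following minimal sketch. All names are hypothetical, and hierarchical propagation up the cgroup tree (as memcg does with its page counters) is left out.

```c
/*
 * Minimal sketch of charging memory to the owning cgroup at buffer
 * object creation and uncharging at free.  struct drmcs_memory and the
 * helpers are hypothetical, not part of this series.
 */
#include <linux/atomic.h>
#include <linux/types.h>

#define DRMCS_MAX_REGIONS 8	/* illustrative bound on memory regions */

struct drmcs_memory {
	atomic64_t bytes[DRMCS_MAX_REGIONS];
};

/* Charge the instantiating cgroup when a buffer object is created. */
static void drmcs_charge(struct drmcs_memory *m, unsigned int region, u64 size)
{
	atomic64_add(size, &m->bytes[region]);
}

/* Uncharge on free - the owner never changes after instantiation. */
static void drmcs_uncharge(struct drmcs_memory *m, unsigned int region, u64 size)
{
	atomic64_sub(size, &m->bytes[region]);
}
```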
Hey, On 2023-07-26 13:41, Tvrtko Ursulin wrote: > > On 26/07/2023 11:14, Maarten Lankhorst wrote: >> Hey, >> >> On 2023-07-22 00:21, Tejun Heo wrote: >>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote: >>>> $ cat drm.memory.stat >>>> card0 region=system total=12898304 shared=0 active=0 >>>> resident=12111872 purgeable=167936 >>>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 >>>> purgeable=0 >>>> >>>> Data is generated on demand for simplicty of implementation ie. no >>>> running >>>> totals are kept or accounted during migrations and such. Various >>>> optimisations such as cheaper collection of data are possible but >>>> deliberately left out for now. >>>> >>>> Overall, the feature is deemed to be useful to container orchestration >>>> software (and manual management). >>>> >>>> Limits, either soft or hard, are not envisaged to be implemented on >>>> top of >>>> this approach due on demand nature of collecting the stats. >>> >>> So, yeah, if you want to add memory controls, we better think through >>> how >>> the fd ownership migration should work. >> I've taken a look at the series, since I have been working on cgroup >> memory eviction. >> >> The scheduling stuff will work for i915, since it has a purely >> software execlist scheduler, but I don't think it will work for GuC >> (firmware) scheduling or other drivers that use the generic drm >> scheduler. > > It actually works - I used to have a blurb in the cover letter about it > but apparently I dropped it. Just a bit less well with many clients, > since there are fewer priority levels. > > All that the design requires from the invididual drivers is some way to > react to the "you are over budget by this much" signal. The rest is > driver and backend specific. What I mean is that this signal may not be applicable since the drm scheduler just schedules jobs that run. Adding a weight might be done in hardware, since it's responsible for scheduling which context gets to run. The over budget signal is useless in that case, and you just need to set a scheduling priority for the hardware instead. >> For something like this, you would probably want it to work inside >> the drm scheduler first. Presumably, this can be done by setting a >> weight on each runqueue, and perhaps adding a callback to update one >> for a running queue. Calculating the weights hierarchically might be >> fun.. > > It is not needed to work in drm scheduler first. In fact drm scheduler > based drivers can plug into what I have since it already has the notion > of scheduling priorities. > > They would only need to implement a hook which allow the cgroup > controller to query client GPU utilisation and another to received the > over budget signal. > > Amdgpu and msm AFAIK could be easy candidates because they both support > per client utilisation and priorities. > > Looks like I need to put all this info back into the cover letter. > > Also, hierarchic weights and time budgets are all already there. What > could be done later is make this all smarter and respect the time budget > with more precision. That would however, in many cases including Intel, > require co-operation with the firmware. In any case it is only work in > the implementation, while the cgroup control interface remains the same. > >> I have taken a look at how the rest of cgroup controllers change >> ownership when moved to a different cgroup, and the answer was: not at >> all. 
If we attempt to create the scheduler controls only on the first >> time the fd is used, you could probably get rid of all the tracking. > > Can you send a CPU file descriptor from process A to process B and have > CPU usage belonging to process B show up in process' A cgroup, or > vice-versa? Nope, I am not making any sense, am I? My point being it is > not like-to-like, model is different. > > No ownership transfer would mean in wide deployments all GPU utilisation > would be assigned to Xorg and so there is no point to any of this. No > way to throttle a cgroup with un-important GPU clients for instance. If you just grab the current process' cgroup when a drm_sched_entity is created, you don't have everything charged to X.org. No need for complicated ownership tracking in drm_file. The same equivalent should be done in i915 as well when a context is created as it's not using the drm scheduler. >> This can be done very easily with the drm scheduler. >> >> WRT memory, I think the consensus is to track system memory like >> normal memory. Stolen memory doesn't need to be tracked. It's kernel >> only memory, used for internal bookkeeping only. >> >> The only time userspace can directly manipulate stolen memory, is by >> mapping the pinned initial framebuffer to its own address space. The >> only allocation it can do is when a framebuffer is displayed, and >> framebuffer compression creates some stolen memory. Userspace is not >> aware of this though, and has no way to manipulate those contents. > > Stolen memory is irrelevant and not something cgroup controller knows > about. Point is drivers say which memory regions they have and their > utilisation. > > Imagine instead of stolen it said vram0, or on Intel multi-tile it shows > local0 and local1. People working with containers are interested to see > this breakdown. I guess the parallel and use case here is closer to > memory.numa_stat. Correct, but for the same reason, I think it might be more useful to split up the weight too. A single scheduling weight for the global GPU might be less useful than per engine, or per tile perhaps.. Cheers, ~Maarten
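Maarten's suggestion of binding the charging target at drm_sched_entity creation, rather than tracking drm_file migration, could look roughly like the sketch below. It assumes the controller from this series registers as "drm" (hence a drm_cgrp_id subsys id exists), and it does not show where the driver would store the pinned reference.

```c
/*
 * Sketch only: pin the submitting task's DRM cgroup when a scheduler
 * entity (or an i915 context) is created, and drop it on teardown.
 * drm_cgrp_id is assumed to be provided by the controller in this
 * series; it is not an upstream identifier.
 */
#include <linux/cgroup.h>
#include <linux/sched.h>
#include <drm/gpu_scheduler.h>

static void example_entity_bind_cgroup(struct drm_sched_entity *entity,
				       struct cgroup_subsys_state **css_out)
{
	/* Take a reference on the creating task's DRM cgroup. */
	*css_out = task_get_css(current, drm_cgrp_id);
}

static void example_entity_unbind_cgroup(struct cgroup_subsys_state *css)
{
	/* Release it when the entity is destroyed. */
	css_put(css);
}
```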
Hey, On 2023-07-26 21:44, Tejun Heo wrote: > Hello, > > On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote: >>> So, yeah, if you want to add memory controls, we better think through how >>> the fd ownership migration should work. >> >> I've taken a look at the series, since I have been working on cgroup memory >> eviction. >> >> The scheduling stuff will work for i915, since it has a purely software >> execlist scheduler, but I don't think it will work for GuC (firmware) >> scheduling or other drivers that use the generic drm scheduler. >> >> For something like this, you would probably want it to work inside the drm >> scheduler first. Presumably, this can be done by setting a weight on each >> runqueue, and perhaps adding a callback to update one for a running queue. >> Calculating the weights hierarchically might be fun.. > > I don't have any idea on this front. The basic idea of making high level > distribution decisions in core code and letting individual drivers enforce > that in a way which fits them the best makes sense to me but I don't know > enough to have an opinion here. > >> I have taken a look at how the rest of cgroup controllers change ownership >> when moved to a different cgroup, and the answer was: not at all. If we > > For persistent resources, that's the general rule. Whoever instantiates a > resource gets to own it until the resource gets freed. There is an exception > with the pid controller and there are discussions around whether we want > some sort of migration behavior with memcg but yes by and large instantiator > being the owner is the general model cgroup follows. > >> attempt to create the scheduler controls only on the first time the fd is >> used, you could probably get rid of all the tracking. >> This can be done very easily with the drm scheduler. >> >> WRT memory, I think the consensus is to track system memory like normal >> memory. Stolen memory doesn't need to be tracked. It's kernel only memory, >> used for internal bookkeeping only. >> >> The only time userspace can directly manipulate stolen memory, is by mapping >> the pinned initial framebuffer to its own address space. The only allocation >> it can do is when a framebuffer is displayed, and framebuffer compression >> creates some stolen memory. Userspace is not >> aware of this though, and has no way to manipulate those contents. > > So, my dumb understanding: > > * Ownership of an fd can be established on the first ioctl call and doesn't > need to be migrated afterwards. There are no persistent resources to > migration on the first call. > > * Memory then can be tracked in a similar way to memcg. Memory gets charged > to the initial instantiator and doesn't need to be moved around > afterwards. There may be some discrepancies around stolen memory but the > magnitude of inaccuracy introduced that way is limited and bound and can > be safely ignored. > > Is that correct? Hey, Yeah mostly, I think we can stop tracking stolen memory. I stopped doing that for Xe, there is literally nothing to control for userspace in there. Cheers, Maarten
On 27/07/2023 14:42, Maarten Lankhorst wrote: > On 2023-07-26 21:44, Tejun Heo wrote: >> Hello, >> >> On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote: >>>> So, yeah, if you want to add memory controls, we better think >>>> through how >>>> the fd ownership migration should work. >>> >>> I've taken a look at the series, since I have been working on cgroup >>> memory >>> eviction. >>> >>> The scheduling stuff will work for i915, since it has a purely software >>> execlist scheduler, but I don't think it will work for GuC (firmware) >>> scheduling or other drivers that use the generic drm scheduler. >>> >>> For something like this, you would probably want it to work inside >>> the drm >>> scheduler first. Presumably, this can be done by setting a weight on >>> each >>> runqueue, and perhaps adding a callback to update one for a running >>> queue. >>> Calculating the weights hierarchically might be fun.. >> >> I don't have any idea on this front. The basic idea of making high level >> distribution decisions in core code and letting individual drivers >> enforce >> that in a way which fits them the best makes sense to me but I don't know >> enough to have an opinion here. >> >>> I have taken a look at how the rest of cgroup controllers change >>> ownership >>> when moved to a different cgroup, and the answer was: not at all. If we >> >> For persistent resources, that's the general rule. Whoever instantiates a >> resource gets to own it until the resource gets freed. There is an >> exception >> with the pid controller and there are discussions around whether we want >> some sort of migration behavior with memcg but yes by and large >> instantiator >> being the owner is the general model cgroup follows. >> >>> attempt to create the scheduler controls only on the first time the >>> fd is >>> used, you could probably get rid of all the tracking. >>> This can be done very easily with the drm scheduler. >>> >>> WRT memory, I think the consensus is to track system memory like normal >>> memory. Stolen memory doesn't need to be tracked. It's kernel only >>> memory, >>> used for internal bookkeeping only. >>> >>> The only time userspace can directly manipulate stolen memory, is by >>> mapping >>> the pinned initial framebuffer to its own address space. The only >>> allocation >>> it can do is when a framebuffer is displayed, and framebuffer >>> compression >>> creates some stolen memory. Userspace is not >>> aware of this though, and has no way to manipulate those contents. >> >> So, my dumb understanding: >> >> * Ownership of an fd can be established on the first ioctl call and >> doesn't >> need to be migrated afterwards. There are no persistent resources to >> migration on the first call. Yes, keyword is "can". Trouble is migration may or may not happen. One may choose "Plasma X.org" session type in your login manager and all DRM fds would be under Xorg if not migrated. Or one may choose "Plasma Wayland" and migration wouldn't matter. But former is I think has a huge deployed base so that not supporting implicit migration would be a significant asterisk next to the controller. >> * Memory then can be tracked in a similar way to memcg. Memory gets >> charged >> to the initial instantiator and doesn't need to be moved around >> afterwards. There may be some discrepancies around stolen memory >> but the >> magnitude of inaccuracy introduced that way is limited and bound >> and can >> be safely ignored. >> >> Is that correct? 
> > Hey, > > Yeah mostly, I think we can stop tracking stolen memory. I stopped doing > that for Xe, there is literally nothing to control for userspace in there. Right, but for reporting stolen is a red-herring. In this RFC I simply report on all memory regions known by the driver. As I said in the other reply, imagine the keys are 'system' and 'vram0'. Point was just to illustrate multiplicity of regions. Regards, Tvrtko
On 27/07/2023 12:54, Maarten Lankhorst wrote: > Hey, > > On 2023-07-26 13:41, Tvrtko Ursulin wrote: >> >> On 26/07/2023 11:14, Maarten Lankhorst wrote: >>> Hey, >>> >>> On 2023-07-22 00:21, Tejun Heo wrote: >>>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote: >>>>> $ cat drm.memory.stat >>>>> card0 region=system total=12898304 shared=0 active=0 >>>>> resident=12111872 purgeable=167936 >>>>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 >>>>> purgeable=0 >>>>> >>>>> Data is generated on demand for simplicty of implementation ie. no >>>>> running >>>>> totals are kept or accounted during migrations and such. Various >>>>> optimisations such as cheaper collection of data are possible but >>>>> deliberately left out for now. >>>>> >>>>> Overall, the feature is deemed to be useful to container orchestration >>>>> software (and manual management). >>>>> >>>>> Limits, either soft or hard, are not envisaged to be implemented on >>>>> top of >>>>> this approach due on demand nature of collecting the stats. >>>> >>>> So, yeah, if you want to add memory controls, we better think >>>> through how >>>> the fd ownership migration should work. >>> I've taken a look at the series, since I have been working on cgroup >>> memory eviction. >>> >>> The scheduling stuff will work for i915, since it has a purely >>> software execlist scheduler, but I don't think it will work for GuC >>> (firmware) scheduling or other drivers that use the generic drm >>> scheduler. >> >> It actually works - I used to have a blurb in the cover letter about >> it but apparently I dropped it. Just a bit less well with many >> clients, since there are fewer priority levels. >> >> All that the design requires from the invididual drivers is some way >> to react to the "you are over budget by this much" signal. The rest is >> driver and backend specific. > > What I mean is that this signal may not be applicable since the drm > scheduler just schedules jobs that run. Adding a weight might be done in > hardware, since it's responsible for scheduling which context gets to > run. The over budget signal is useless in that case, and you just need > to set a scheduling priority for the hardware instead. The over budget callback lets the driver know its assigned budget and its current utilisation. Already with that data drivers could implement something smarter than what I did in my RFC. So I don't think callback is completely useless even for some smarter implementation which potentially ties into firmware scheduling. Anyway, I maintain this is implementation details. >>> For something like this, you would probably want it to work inside >>> the drm scheduler first. Presumably, this can be done by setting a >>> weight on each runqueue, and perhaps adding a callback to update one >>> for a running queue. Calculating the weights hierarchically might be >>> fun.. >> >> It is not needed to work in drm scheduler first. In fact drm scheduler >> based drivers can plug into what I have since it already has the >> notion of scheduling priorities. >> >> They would only need to implement a hook which allow the cgroup >> controller to query client GPU utilisation and another to received the >> over budget signal. >> >> Amdgpu and msm AFAIK could be easy candidates because they both >> support per client utilisation and priorities. >> >> Looks like I need to put all this info back into the cover letter. >> >> Also, hierarchic weights and time budgets are all already there. 
What >> could be done later is make this all smarter and respect the time >> budget with more precision. That would however, in many cases >> including Intel, require co-operation with the firmware. In any case >> it is only work in the implementation, while the cgroup control >> interface remains the same. >> >>> I have taken a look at how the rest of cgroup controllers change >>> ownership when moved to a different cgroup, and the answer was: not >>> at all. If we attempt to create the scheduler controls only on the >>> first time the fd is used, you could probably get rid of all the >>> tracking. >> >> Can you send a CPU file descriptor from process A to process B and >> have CPU usage belonging to process B show up in process' A cgroup, or >> vice-versa? Nope, I am not making any sense, am I? My point being it >> is not like-to-like, model is different. >> >> No ownership transfer would mean in wide deployments all GPU >> utilisation would be assigned to Xorg and so there is no point to any >> of this. No way to throttle a cgroup with un-important GPU clients for >> instance. > If you just grab the current process' cgroup when a drm_sched_entity is > created, you don't have everything charged to X.org. No need for > complicated ownership tracking in drm_file. The same equivalent should > be done in i915 as well when a context is created as it's not using the > drm scheduler. Okay so essentially nuking the concept of DRM clients belongs to one cgroup and instead tracking at the context level. That is an interesting idea. I suspect implementation could require somewhat generalizing the concept of an "execution context", or at least expressing it via the DRM cgroup controller. I can give this a spin, or at least some more detailed thought, once we close on a few more details regarding charging in general. >>> This can be done very easily with the drm scheduler. >>> >>> WRT memory, I think the consensus is to track system memory like >>> normal memory. Stolen memory doesn't need to be tracked. It's kernel >>> only memory, used for internal bookkeeping only. >>> >>> The only time userspace can directly manipulate stolen memory, is by >>> mapping the pinned initial framebuffer to its own address space. The >>> only allocation it can do is when a framebuffer is displayed, and >>> framebuffer compression creates some stolen memory. Userspace is not >>> aware of this though, and has no way to manipulate those contents. >> >> Stolen memory is irrelevant and not something cgroup controller knows >> about. Point is drivers say which memory regions they have and their >> utilisation. >> >> Imagine instead of stolen it said vram0, or on Intel multi-tile it >> shows local0 and local1. People working with containers are interested >> to see this breakdown. I guess the parallel and use case here is >> closer to memory.numa_stat. > Correct, but for the same reason, I think it might be more useful to > split up the weight too. > > A single scheduling weight for the global GPU might be less useful than > per engine, or per tile perhaps.. Yeah, there is some complexity there for sure and could be a larger write up. In short per engine stuff tends to work out in practice as is given how each driver can decide upon receiving the signal what to do. In the i915 RFC for instance if it gets "over budget" signal from the group, but it sees that the physical engines belonging to this specific GPU are not over-subscribed, it simply omits any throttling. 
Which in practice works out fine for two clients competing for different engines. Same would be for multiple GPUs (or tiles with our stuff) in the same cgroup. Going back to the single scheduling weight or more fine grained. We could choose to follow for instance io.weight format? Start with drm.weight being "default 1000" and later extend to per card (or more): """ default 100 card0 20 card1 50 """ In this case I would convert drm.weight to this format straight away for the next respin, just wouldn't support per card just yet. Regards, Tvrtko
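If drm.weight does grow the io.weight-like nested-key syntax floated above, the write side boils down to per-line parsing along these lines. This is illustrative only and not from the series; the [1, 10000] range comes from the existing drm.weight documentation, and matching the key against "default" or a card name is left to the caller.

```c
/*
 * Illustrative parser for an io.weight-style drm.weight syntax, e.g.
 * "default 100\ncard0 20\ncard1 50".  Sketch only, not from the series.
 */
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/string.h>

#define DRMCS_WEIGHT_MIN 1
#define DRMCS_WEIGHT_MAX 10000

static int example_parse_weight_line(char *line, char *key, size_t key_len,
				     u64 *weight)
{
	char *p = strim(line);
	char *val = strpbrk(p, " \t");

	if (!val)
		return -EINVAL;

	*val++ = '\0';			/* split "<key> <weight>" */
	strscpy(key, p, key_len);

	if (kstrtou64(strim(val), 10, weight))
		return -EINVAL;
	if (*weight < DRMCS_WEIGHT_MIN || *weight > DRMCS_WEIGHT_MAX)
		return -ERANGE;

	return 0;	/* caller matches key against "default"/"cardN" */
}
```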
One additional thought on one sub-topic: On 27/07/2023 18:08, Tvrtko Ursulin wrote: [snip] >>>> For something like this, you would probably want it to work inside >>>> the drm scheduler first. Presumably, this can be done by setting a >>>> weight on each runqueue, and perhaps adding a callback to update one >>>> for a running queue. Calculating the weights hierarchically might be >>>> fun.. >>> >>> It is not needed to work in drm scheduler first. In fact drm >>> scheduler based drivers can plug into what I have since it already >>> has the notion of scheduling priorities. >>> >>> They would only need to implement a hook which allow the cgroup >>> controller to query client GPU utilisation and another to received >>> the over budget signal. >>> >>> Amdgpu and msm AFAIK could be easy candidates because they both >>> support per client utilisation and priorities. >>> >>> Looks like I need to put all this info back into the cover letter. >>> >>> Also, hierarchic weights and time budgets are all already there. What >>> could be done later is make this all smarter and respect the time >>> budget with more precision. That would however, in many cases >>> including Intel, require co-operation with the firmware. In any case >>> it is only work in the implementation, while the cgroup control >>> interface remains the same. >>> >>>> I have taken a look at how the rest of cgroup controllers change >>>> ownership when moved to a different cgroup, and the answer was: not >>>> at all. If we attempt to create the scheduler controls only on the >>>> first time the fd is used, you could probably get rid of all the >>>> tracking. >>> >>> Can you send a CPU file descriptor from process A to process B and >>> have CPU usage belonging to process B show up in process' A cgroup, >>> or vice-versa? Nope, I am not making any sense, am I? My point being >>> it is not like-to-like, model is different. >>> >>> No ownership transfer would mean in wide deployments all GPU >>> utilisation would be assigned to Xorg and so there is no point to any >>> of this. No way to throttle a cgroup with un-important GPU clients >>> for instance. >> If you just grab the current process' cgroup when a drm_sched_entity >> is created, you don't have everything charged to X.org. No need for >> complicated ownership tracking in drm_file. The same equivalent should >> be done in i915 as well when a context is created as it's not using >> the drm scheduler. > > Okay so essentially nuking the concept of DRM clients belongs to one > cgroup and instead tracking at the context level. That is an interesting > idea. I suspect implementation could require somewhat generalizing the > concept of an "execution context", or at least expressing it via the DRM > cgroup controller. > > I can give this a spin, or at least some more detailed thought, once we > close on a few more details regarding charging in general. I didn't get much time to brainstorm this just yet, only one downside randomly came to mind later - with this approach for i915 we wouldn't correctly attribute any GPU activity done in the receiving process against our default contexts. Those would still be accounted to the sending process. How much problem in practice that would be remains to be investigated, including if it applies to other drivers too. If there is a good amount of deployed userspace which use the default context, then it would be a bit messy. 
Regards, Tvrtko *) For non DRM and non i915 people, default context is a GPU submission context implicitly created during the device node open. It always remains valid, including in the receiving process if SCM_RIGHTS is used.
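For readers outside DRM, the implicit migration being discussed comes from the classic X11 path where the display server opens /dev/dri/cardN and hands the file descriptor to a client over a Unix socket with SCM_RIGHTS; the receiving process then issues ioctls on an fd it never opened. A minimal userspace illustration, with error handling trimmed:

```c
/*
 * Userspace illustration of the SCM_RIGHTS hand-over: the process that
 * opened the DRM device is not the process that ends up using the fd.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_drm_fd(int sock, int drm_fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u = { 0 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &drm_fd, sizeof(int));

	/* The receiver's GPU work runs on an fd "owned" by the sender. */
	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```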
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index bbe986366f4a..1891c7d98206 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -2452,6 +2452,28 @@ DRM scheduling soft limits interface files Standard cgroup weight based control [1, 10000] used to configure the relative distributing of GPU time between the sibling groups. + drm.memory.stat + A nested file containing cumulative memory statistics for the whole + sub-hierarchy, broken down into separate GPUs and separate memory + regions supported by the latter. + + For example:: + + $ cat drm.memory.stat + card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936 + card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0 + + Card designation corresponds to the DRM device names and multiple line + entries can be present per card. + + Memory region names should be expected to be driver specific with the + exception of 'system' which is standardised and applicable for GPUs + which can operate on system memory buffers. + + Sub-keys 'resident' and 'purgeable' are optional. + + Per category region usage is reported in bytes. + Misc ---- diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h index 29e11a87bf75..2ea9a46b5031 100644 --- a/include/drm/drm_drv.h +++ b/include/drm/drm_drv.h @@ -41,6 +41,7 @@ struct drm_minor; struct dma_buf; struct dma_buf_attachment; struct drm_display_mode; +struct drm_memory_stats; struct drm_mode_create_dumb; struct drm_printer; struct sg_table; @@ -175,6 +176,66 @@ struct drm_cgroup_ops { * messages sent by the DRM cgroup controller. */ int (*signal_budget) (struct drm_file *, u64 used, u64 budget); + + + /** + * @num_memory_regions: + * + * Optional callback reporting the number of memory regions driver + * supports. + * + * Callback is allowed to report a larger number than present memory + * regions, but @memory_region_name is then supposed to return NULL for + * those indices. + * + * Used by the DRM core when queried by the DRM cgroup controller. + * + * All three callbacks of @num_memory_regions, @memory_region_name and + * @memory_stats need to be implemented for DRM cgroup memory stats + * support. + */ + unsigned int (*num_memory_regions) (const struct drm_device *); + + /** + * @memory_region_name: + * + * Optional callback reporting the name of the queried memory region. + * + * Can be NULL if the memory region index is not supported by the + * passed in device. + * + * Used by the DRM core when queried by the DRM cgroup controller. + * + * All three callbacks of @num_memory_regions, @memory_region_name and + * @memory_stats need to be implemented for DRM cgroup memory stats + * support. + */ + const char * (*memory_region_name) (const struct drm_device *, + unsigned int index); + + /** + * memory_stats: + * + * Optional callback adding to the passed in array of struct + * drm_memory_stats objects. + * + * Number of objects in the array is passed in the @num argument. + * + * Returns a bitmask of supported enum drm_gem_object_status by the + * driver instance. + * + * Callback is only allow to add to the existing fields and should + * never clear them. + * + * Used by the DRM core when queried by the DRM cgroup controller. + * + * All three callbacks of @num_memory_regions, @memory_region_name and + * @memory_stats need to be implemented for DRM cgroup memory stats + * support. 
+ */ + unsigned int (*memory_stats) (struct drm_file *, + struct drm_memory_stats *, + unsigned int num); }; /** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 7c20d4ebc634..22fc180dd659 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -5,6 +5,7 @@ #include <linux/cgroup.h> #include <linux/cgroup_drm.h> +#include <linux/device.h> #include <linux/list.h> #include <linux/moduleparam.h> #include <linux/mutex.h> @@ -12,6 +13,8 @@ #include <linux/slab.h> #include <drm/drm_drv.h> +#include <drm/drm_file.h> +#include <drm/drm_gem.h> struct drm_cgroup_state { struct cgroup_subsys_state css; @@ -133,6 +136,147 @@ drmcs_read_total_us(struct cgroup_subsys_state *css, struct cftype *cft) return val; } +struct drmcs_stat { + const struct drm_device *dev; + const struct drm_cgroup_ops *cg_ops; + const char *device_name; + unsigned int regions; + enum drm_gem_object_status flags; + struct drm_memory_stats *stats; +}; + +static int drmcs_seq_show_memory(struct seq_file *sf, void *v) +{ + struct cgroup_subsys_state *node; + struct drmcs_stat *stats = NULL; + unsigned int num_devices, i; + int ret; + + /* + * We could avoid taking the cgroup_lock and just walk the tree under + * RCU but then allocating temporary storage becomes a problem. So for + * now keep it simple and take the lock. + */ + cgroup_lock(); + + /* Protect against client migrations and clients disappearing. */ + ret = mutex_lock_interruptible(&drmcg_mutex); + if (ret) { + cgroup_unlock(); + return ret; + } + + num_devices = 0; + css_for_each_descendant_pre(node, seq_css(sf)) { + struct drm_cgroup_state *drmcs = css_to_drmcs(node); + struct drm_file *fpriv; + + list_for_each_entry(fpriv, &drmcs->clients, clink) { + const struct drm_cgroup_ops *cg_ops = + fpriv->minor->dev->driver->cg_ops; + const char *device_name = dev_name(fpriv->minor->kdev); + struct drmcs_stat *stat; + unsigned int regions; + + /* Does this driver supports memory stats? */ + if (cg_ops && + cg_ops->num_memory_regions && + cg_ops->memory_region_name && + cg_ops->memory_stats) + regions = + cg_ops->num_memory_regions(fpriv->minor->dev); + else + regions = 0; + + if (!regions) + continue; + + /* Have we seen this device before? */ + stat = NULL; + for (i = 0; i < num_devices; i++) { + if (!strcmp(stats[i].device_name, + device_name)) { + stat = &stats[i]; + break; + } + } + + /* If not allocate space for it. */ + if (!stat) { + stats = krealloc_array(stats, num_devices + 1, + sizeof(*stats), + GFP_USER); + if (!stats) { + ret = -ENOMEM; + goto out; + } + + stat = &stats[num_devices++]; + stat->dev = fpriv->minor->dev; + stat->cg_ops = cg_ops; + stat->device_name = device_name; + stat->flags = 0; + stat->regions = regions; + stat->stats = + kcalloc(regions, + sizeof(struct drm_memory_stats), + GFP_USER); + if (!stat->stats) { + ret = -ENOMEM; + goto out; + } + } + + /* Accumulate the stats for this device+client. 
*/ + stat->flags |= cg_ops->memory_stats(fpriv, + stat->stats, + stat->regions); + } + } + + for (i = 0; i < num_devices; i++) { + struct drmcs_stat *stat = &stats[i]; + unsigned int j; + + for (j = 0; j < stat->regions; j++) { + const char *name = + stat->cg_ops->memory_region_name(stat->dev, j); + + if (!name) + continue; + + seq_printf(sf, + "%s region=%s total=%llu shared=%llu active=%llu", + stat->device_name, + name, + stat->stats[j].private + + stat->stats[j].shared, + stat->stats[j].shared, + stat->stats[j].active); + + if (stat->flags & DRM_GEM_OBJECT_RESIDENT) + seq_printf(sf, " resident=%llu", + stat->stats[j].resident); + + if (stat->flags & DRM_GEM_OBJECT_PURGEABLE) + seq_printf(sf, " purgeable=%llu", + stat->stats[j].purgeable); + + seq_puts(sf, "\n"); + } + } + +out: + mutex_unlock(&drmcg_mutex); + cgroup_unlock(); + + for (i = 0; i < num_devices; i++) + kfree(stats[i].stats); + kfree(stats); + + return ret; +} + static bool __start_scanning(unsigned int period_us) { struct drm_cgroup_state *root = &root_drmcs.drmcs; @@ -575,6 +719,11 @@ struct cftype files[] = { .flags = CFTYPE_NOT_ON_ROOT, .read_u64 = drmcs_read_total_us, }, + { + .name = "memory.stat", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = drmcs_seq_show_memory, + }, { } /* Zero entry terminates. */ };
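To make the new hooks concrete, a hypothetical driver-side implementation of the three memory-stat callbacks might look like the sketch below. The prototypes are the drm_cgroup_ops ones added by this patch and struct drm_memory_stats is the existing fdinfo structure; the region list and all foo_* names are invented, the object walk mirrors what the common fdinfo helper already does, and a driver that can report residency or purgeability would additionally fill those fields and return the corresponding DRM_GEM_OBJECT_* bits.

```c
/*
 * Hypothetical driver implementation of the callbacks added above.
 * Region names and foo_* identifiers are made up; prototypes follow
 * the drm_cgroup_ops definitions from this patch.
 */
#include <linux/idr.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/drm_gem.h>

static const char * const foo_region_names[] = { "system", "vram0" };

static unsigned int foo_cg_num_memory_regions(const struct drm_device *dev)
{
	return ARRAY_SIZE(foo_region_names);
}

static const char *foo_cg_memory_region_name(const struct drm_device *dev,
					     unsigned int index)
{
	/* NULL for indices this device does not support, per the kerneldoc. */
	return index < ARRAY_SIZE(foo_region_names) ?
	       foo_region_names[index] : NULL;
}

static unsigned int foo_cg_memory_stats(struct drm_file *file,
					struct drm_memory_stats *stats,
					unsigned int num)
{
	struct drm_gem_object *obj;
	int id;

	if (!num)
		return 0;

	/* Add to (never clear) the accumulated stats, like fdinfo does. */
	spin_lock(&file->table_lock);
	idr_for_each_entry(&file->object_idr, obj, id) {
		/* This sketch places every object in region 0 ("system"). */
		if (obj->handle_count > 1)
			stats[0].shared += obj->size;
		else
			stats[0].private += obj->size;
	}
	spin_unlock(&file->table_lock);

	/* Return DRM_GEM_OBJECT_* bits only for fields actually tracked. */
	return 0;
}

static const struct drm_cgroup_ops foo_cg_ops = {
	.num_memory_regions	= foo_cg_num_memory_regions,
	.memory_region_name	= foo_cg_memory_region_name,
	.memory_stats		= foo_cg_memory_stats,
};
```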